The Data Stack Show - 75: How To Become a Data Engineer with Parham Parvizi of the Data Stack Academy
Episode Date: February 16, 2022
Highlights from this week’s conversation include:
Par’s background and current role (2:48)
About Talend (6:46)
Nonlinear pathways to data engineering roles (11:08)
What a data engineer needs to be suc...cessful (17:37)
Before “data engineer” was a title (27:59)
Signs you should be a data engineer (32:39)
Curiosity and data engineering (38:31)
Defining the modern data stack (45:07)
How to get a feel for data engineering (52:52)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome to the Data Stack Show. Today, we are going to talk with Par, and he has a long history
working with data. In fact, he was one of the first couple of people at Talend, which has been around
for a really long time. And then he actually brought the Hadoop and Spark infrastructure into Talend and really influenced
the shape of the product, which is fascinating.
And today he runs a consultancy and also a school that teaches data engineering.
It's called the Data Stack Academy.
Maybe we can find out if he got the name from us.
But I'm really excited to talk with him.
Costas, I have a background in education.
So I think one of the
things that really interests me in terms of Par's school for data engineering is what are sort of
the key foundational principles or tools that he thinks that a data engineer needs to have
to build a foundation for a career? Because it can be hard to kind of distill that down.
So that's what I'm going to ask about. How about you?
Yeah. First of all, I want to ask him about like the evolution of data engineering. He's been
around for a long time and he has been around like since, like back then we didn't have like
the term, we didn't use like the term data engineer, right? So it would be great to hear from him what's the evolution
and how from whatever was happening in the early 2000s to today has changed.
And also, of course, hearing his opinion about what's going to happen in the future, right?
So that's definitely something that I would like to discuss with him.
And then, yeah, it would be great to discuss a little bit more about the technologies
and in general what it takes to be a data engineer and how it feels to be a data engineer.
Is it fun or is it not?
We'll find out.
Let's find out and go talk with Par.
Par, welcome to the Data Stack Show.
We're super excited to talk with you.
Thanks.
Thanks for having me.
Thanks for having me, Eric.
Super excited to be here as well.
Cool. Well, we've talked with lots of your friends as guests on the show, so it's only appropriate that
we can have you on as well. Can you give us your background and tell us what you do today?
Oh, thanks. Thanks. Yeah. I feel like I'm already a cousin of the show. You guys have had
almost every one of my friends in the industry, and I'm always like, oh, there you go.
Yeah. Oh man, my background, it's been a while. I think I just got lucky
and continue to get lucky throughout my career and my life. I was very lucky to be
born in a house where my father was a civil engineer, but he really had a knack for computers.
So we had like an IBM 8080 computer, you know, when I was like four years old, and I would load up my floppy disks
and play my games. And one time I actually formatted our entire C drive, and that was
not good. I don't know if you remember DOS, but if you do a format and don't specify the drive, it defaults to C, which is horrible design, right?
It's like, that's exactly what it is.
And then, yeah, I went to actually school for computer engineering.
So I have a background in engineering and computer science.
Worked actually as a chip designer for a little bit.
But I would say I really got started lucky when I left hardware
engineering, came to software engineering. I got connected with a company that you all might know
now, a very large company now, Talend, but I got very lucky when I got connected with them. There
were just three or four people in the US. So again, day one, like my laptop was on my lap.
I was their only technical employee at that point and pretty much grew with them, you know, as they grew in the US. I trained a lot of folks and built out, you know, a technical team around Talend and myself, you know, around North America. Then we moved to Asia and other markets. Lucky again, at some point in the middle there,
one of our account representative was like, hey, there's this thing Hadoop I keep hearing about,
like, can you just figure out what this is over the weekend? So I went and downloaded it and
finally got it to compile and work without errors, you know, fairly early on. And then I saw the
value of it immediately. I was like, wow, this is, this, this is the next thing. Like I either, I got to learn this or I'm going to be
obsolete. So I, I started learning and started contributing a little bit. And then later on,
it got actually connected with some of the founders of Hadoop, you know, Doug Cutting,
Tom White, and all, all those guys in meetings. And that was, that was just incredible. And being
in that market, I moved to a company that at the time
was called Greenplum. They evolved into Pivotal. I was one of their big data Hadoop solution
architects. So I was part of the elite team they would send out to fix problems where there were
large, like thousand-node clusters. And all right, here you have two days, go figure out
why this is not working or optimize this,
this ETL process that they have, see what, you know, what's wrong with it. Did that with some guys, you know, like Don Miner that you guys might know, we were part of that team. And then in 2017 I
decided to kind of go and build things on my own, the dream that I've had. So I started a consulting
company, that consulting company is still something that I run. But just about a year, year and a half ago, it dawned on us that
I worked with a lot of people who've been in this industry for a while. And, you know, we've kind of
been around the block and we're like, wow, there's really not too much resources around learning to
become a data engineer. So we decided to develop a curriculum around that.
And, you know, it was just us working at nights after hours for a while, and then
became more serious. And we just launched this program, which is actually by coincidence,
it's called Data Stack Academy. So it's a bootcamp to teach people how to become a data
engineer and the skills necessary for that.
Well, that is super exciting.
And I want to dig into what I see as a nonlinear career path into data engineering, maybe as you compare it with software engineering and you getting, you know, a computer engineering degree. But really quickly, if we can: Talend has been around for a long time. You know, it sort of predated a lot
of the big trends that we've seen even over the last decade, say. So could you give us just a
little bit of history, especially for the listeners who may not be as familiar with Talend.
But when you started with them, what were they building? And when was that? That was back in
the mid-2000s? Yeah, 2006. I think that was back in 2006 that I got connected with them. Yeah,
I think it was still V1 back then. Yeah, wow. Yeah, so originally, I mean, so I mean, the field comes from data warehousing,
right? And you take data warehousing and break it into the pieces that it has. And some of the
things that we don't hear too much about anymore, like databases and MPPs. And that also has kind
of gone away those talks a little bit, right? But then the other piece of that, of course,
was the BI, the reporting and that that's still there, of course.
Everybody has to do BI.
And one of the other spokes of that wheel was ETL and data engineering.
And back then, I think it was mostly known as extract, transform, load, or ETL: the process of getting the data to the point where it is ready to be analyzed or viewed by the BI tools.
At Talend, we were an ETL tool.
So in ETL, there are a lot of things that you do that I would say are mundane, right?
It's everyday stuff that could be templated, like how you open a connection to a database,
how you write to the database, how you read a file.
Those things can easily be templated in a data engineering job.
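To make the extract, transform, load steps described above concrete, here is a minimal sketch in Python. It is not what Talend generated (Par says that was Java); the file contents, the `orders` table, and the function names are all hypothetical, chosen only to show the three "mundane" steps: reading a file, cleaning the rows, and writing to a database.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would be a file or an API response.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,5.00
3,alice,
"""


def extract(text):
    """Extract: read rows from a CSV source (here an in-memory string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and cast fields to proper types."""
    return [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in rows
        if r["amount"]
    ]


def load(conn, records):
    """Load: open the target table and write the cleaned records into it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(conn, text=RAW_CSV):
    """Run the three steps in order and return how many rows were loaded."""
    load(conn, transform(extract(text)))
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_etl(connection))
```

The connection handling and file reading are the parts a visual ETL tool would template; the `transform` step is where the custom logic usually lives.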
So at Talend, we had made a very nice visual tool that let you use those connectors
to different data sources and data sinks with drag and drop and a couple of clicks.
And it was a code generator.
I think that was really what set Talend apart early on, because a
lot of the ETL tools like Informatica and so forth that were on the market,
they had an engine behind the scenes.
So you had to know their own language and they had their own engine,
but the Talend engine was just Java and it generated Java code.
So it also allowed you to do a lot of things that are not common. You know,
all the ETL tools would get you like 70% of the way there, right? But that 30%, you still need
to code and you need to do something custom. So it allowed developers to do that, but do the
mundane things, like just opening a database connection, all of that stuff, very fast and visually. And then there was that shift again, I think, where
the engine was quickly becoming MapReduce and Hadoop and Spark. And I was actually,
again, to be very humble, the person who started that kind of
development at Talend. So I made the very first Spark and Hadoop connectors.
And I was like, no, this,
like even Java is not going to suffice anymore.
Like we need to go to something
that's highly distributed, highly parallel,
like Spark and MapReduce.
And so then we changed the connectors
to produce that code versus just pure,
you know, a POJO or a JAR file. Yeah. Yeah. Very cool.
No, it's, I just, it's really fun to get those little anecdotes of history sort of in the world
of, you know, the world of data that we live in, especially, you know, I enjoy it. And I hope our
listeners do too, comparing the things that, you know, are just wildly different with, you know,
the things that are like, well, I mean, some of that sounds kind of functionally pretty similar, you
know, some of the modern tooling, which is really fun. So thanks for sharing that. Okay. Let's,
let's talk about the role of a data engineer, because that's something that, I mean, you know,
I would guess that a ton of our listeners have some sort of data engineering or data engineering
related role. So it's certainly not a new term to anyone on the show. I don't know if we've
actually ever stepped back to define the term. And that can be a little tricky sometimes, you know,
in terms like super familiar, but we don't actually put a sharp point on it. But let's start
with sort of the nonlinear pathways that people travel into data engineering roles.
And would you say that happened for you? I mean, you started out as a hardware engineer.
Absolutely. And you hit that too, like almost every data engineer that I know probably has
a different background. And we've all had that. Yeah, some people come from a data
science or software engineering background,
you know, a traditional four-year degree
followed by a master's degree.
Some people come from a business intelligence background
where they transition from being a business analyst.
And some folks are just sort of self-taught.
They come from like actually complete different backgrounds
that I've seen, you know,
like even from like being a server or a bartender,
you know, and they learn everything,
all the tricks to do so.
So it is very interesting.
And it's kind of, I think,
it's part of being a data engineer
because data engineer, as you said,
that role is not very well defined.
And in tech, we are kind of jack of all trades and master of none.
Like there's like we know a little bit about a lot of different technologies, but we're not really mastering anything.
And that kind of defines what a data engineer does.
Data engineer is like a glue within a company, right? Is the person who brings data
from all the different sources
that data exists within a company
and meshes those together.
So you do have to be connected
with all the different arms of a company.
You have to understand what those do,
what those data mean,
and you have to be able to bring those together.
And that makes you kind of very essential to the company. It makes you very essential as a way that you have to actually
know even the business of the company, what the company does. So, you know, like what those data
are and how to treat them. But you also have to technically then know how to grab the data from
the different applications where they're stored and so forth. But to give you an example, I like to kind of maybe start with an example
of what a data engineer does,
with something that we might all know or use,
like Lyft, the Lyft app, you know, a ride share app.
Yeah.
What does a Lyft data engineer do?
And I promise you that all of us
have interacted with data engineers, right?
They're always in the background.
They're always there.
You just might not hear exactly what they do
because we're background people.
We're not developing apps, right?
We're not developing like web
or everything everybody uses.
So in the Lyft example, you know, a company generates data.
So Lyft has people who go download the app, use the app, and that generates a lot of data.
There's a lot of data as far as you getting the ride.
There's a lot of data as far as the actual ride, like your GPS updates, where you're
going, all of that stuff.
And the app itself is usually developed by
your full stack developers, app developers. But it's the data engineer's job to work with those folks
to grab that data and then store it in a manner that can be analyzed and move it probably to a
cloud, move it to the servers where it needs to be.
So the data engineer works with them very directly to do that data acquisition, right?
But it's also the data engineer's job to hand it off to other folks like data scientists and
business analysts who are the business side of that, you know, who, let's say, do something
useful with that data.
Like, for example, in data science,
you know, example could be in the Lyft example, like a prediction algorithm where you want to
tell when I go to grab a ride, maybe the app tells me, hey, based on the patterns that we've seen,
you might want to wait 15 minutes, it's going to be cheaper for you to get this ride. Or maybe
it's like a notification. Maybe it's anomaly
detection where, as you're going through the ride, the app pops up and says, hey,
your driver just seems to take too many wrong turns, right? It's like, based on GPS data,
we've seen you've taken too many wrong turns. You might just want to know
that. And those are like the data
science or machine learning algorithms.
But it's, again, the data engineer who provides the data scientists with all that data
and, again, grabs the results from the data science algorithms and provides them back to
the user.
So it's that glue role again.
And even in the other sense is
the data scientists typically work
with a smaller data set, right?
They work in a prototyping fashion
where they only might have like
developed a data science model
looking at a thousand or 10,000 population.
It's data engineering tasks again
to really operationalize those
and in a company take those to the masses.
Okay, now we take this data science model
in this like prototyping phase,
but now we're going to scale it
to like the million or billion users
that this company has.
Make sure that data continues
to come through the pipeline,
come through the system
and all the users get the results, right?
So that's the data engineer role.
Also, as data engineers,
we're also the glue between
all the different applications, so between your CRM application, sales application, all of that stuff.
We play that role as well. Okay, so super fascinating. And I love, and actually would love to hear
from our listeners if you're listening to the show write in and tell us if you have a similar
opinion you know because we haven't defined it before. But that was super helpful. And I love using the
example. One question I have actually for you, Par and Costas, because you both come from,
actually, I think you both come from sort of, you have like hardware engineering and software
engineering backgrounds. In software engineering, and I, you know, this is something
I learned from being in an education business that focused on software engineering for a while.
And our instructors were always very insistent on sort of teaching core principles because
that was way more important than, you know, the specific syntax of a particular language, right? Like if you understand these core concepts of building software,
you can apply them to different syntax within the context of a different language.
Would you say that's true for data engineering as well?
Are there some sort of foundational things, concepts, skills that you need to sort of master to have a really good foundation to sort of
build a long-term successful career as a data engineer? Yeah, that's a very interesting question
from Eric. And it's a question, to be honest, that comes up not only for data engineering,
but also for software engineering. What you're describing, Eric, is also the difference between computer science and computer engineering, right?
For example, which are completely different disciplines, to be honest.
Computer, completely different.
I mean, of course, they have like overlaps, right?
And if you are like a computer scientist, you can also become a computer engineer. But there is a reason that one thing is called a science
and the other thing is called an engineering discipline.
And one of the things that happens a lot with the curriculums for software engineering
is that they are heavily dominated by
computer science topics. That's my experience also, right?
My degree is in electrical and computer engineering,
but anything that had to do with like the computer side,
it was like more of computer science, right?
Like I had to prove the complexity of algorithms, right?
It wasn't just to know the complexity of an algorithm.
And I had to figure out like what tools you can use to do that
and why we
have, for example, these different complexity classes and all these things. Now, there's a
debate there, how much you need from that to become a software engineer, right? I'm biased,
obviously. I like the science side of things. So I think that it's important to know that also when you move into engineering.
But the engineering track also has some very,
very important elements that you need to know
if you are going to build products and put technology into,
like, let's say, make it useful for people, right?
Everyday people.
And that's, I think, what we are seeing happening also right now with data engineering.
Like, I think the biggest change that has happened, like, probably in the past, like,
one to two years is that all the principles that we have in software engineering or computer
engineering, you see them, see them being applied in data engineering
and how we deal with data, right? Quality, QA, tests, version control, all these tools that the
software engineer is using every day to go and deliver what they have to deliver, we start
introducing them also into data.
Now, do you need to have, let's say,
there are like some kind of principles that are, let's say, fundamental?
I would say it's probably the same thing
with software engineering.
It's the same principles.
I don't see like some huge difference there
in terms of like how someone should shape the way they think in order to solve the problems there.
At the end, it's again like engineering problems.
And there are problems that you are solving with software.
And either you build software or you operate software.
And I think that's the main, the very interesting characteristic about data engineering.
And I've said that like many times in the past
is that it's a hybrid between ops
and software engineering,
which maybe might change in the future.
I don't know.
We might have like data ops and data engineering,
and they would be, like, two separate roles.
I don't know.
But for now, I think that the data engineer
has to do both.
So yeah,
that's, I know it wasn't like the most straight answer, but in my mind, at least I don't see any
difference in the fundamentals between the two, between software engineering and data engineering
at the end. I want to bring it back a second and say how big data engineering actually is.
And it's not just us.
There's a lot of numbers behind this.
I know there was a Dice report from the pandemic year where data engineering was the fastest
growing field in tech.
And they analyzed 6 million job postings in the US for that.
And it rose by 50%.
It had double digit lead over the second on the list,
which was 32%, which was data science.
So that's, I mean, that's huge.
That tells you everything that this is a field
that is the fastest growing field again in tech.
And that shows in the salaries.
So when you look at the salaries on Indeed right now,
the average data engineer salary is
119K. That's higher than data scientists, higher than a full stack developer, higher than a software
engineer. And I want to say that in some sense, we are very privileged to be data engineers,
right? We have very comfortable jobs. We do have salaries like that.
One of the other stats that Indeed posts is unlimited time off for data engineering, which
I don't even know what that means, but I kind of do, right? Like we can, we're very flexible. We
work very flexible hours. We have great 401ks. We have great healthcare, and some of those are just not available to a lot of us
Americans these days. And I want to kind of take that conversation
to the folks that are trying to get into this market, right? What path would they
choose, and what do they need to learn to be a data engineer? And it is true, there's a lot of different paths
that you could take. And even if you're in software engineering and you want to level up
to be a data engineer, especially nowadays to be a cloud data engineer, I think that's very key.
What do you have to do? I do feel like from my experience, there are some things that I can say would work.
Yes, there might not be a general path.
And if you look at a lot of colleges out there, they're just beginning to have like
a master of data engineering program.
There's some boot camps now that do like a specific data engineering career.
But here's what I think would work really as the path. I cannot emphasize
enough the importance of being a cloud data engineer these days anymore. And if you want to,
if you're on this self-learning path, start by actually looking at one of the cloud certifications.
All the major cloud vendors, AWS, Azure, GCP have now a data engineering
certification. On their website, they actually list a lot of good free resources that you can go
to learn the skills to be that person. I would also say it's very important, and it's something that gets
missed, that as you're doing things and as you're doing projects, you do those
on GitHub, to get your GitHub profile right. It's like the most important thing out there.
And a lot of companies, when they hire, they go on GitHub and that's how they vet the resources. I do
that myself, because code doesn't lie, right? At the end of the day, you can look at someone's GitHub profile and see what they've done.
The other thing I would say is to get your hands on a lot of real-world projects.
So there are sites like Kaggle and DataHub.io, where you can go get a lot of these real-world
data sets, real-world projects, and then build those and build those on GitHub and be able to show that
to employers later on. I would also say, I mean, there's tons of resources out there as far as
learning, you know, DataCamp and Udemy, Coursera, all those courses that have
very budget-friendly, I would say, programs around that. But then there's also this other kind of learning path,
which is your bootcamps.
And there are now starting to be some bootcamps
around data and specifically around data engineering.
And they can provide some things
that the other programs can't, right?
I think one, they provide this like declared intention
that's by far the most effective.
When you say you want to be a
data engineer and you go through these bootcamps, you sign yourself up for this like three, four
month experience where you submerge yourself in data engineering, submerge yourself in learning.
And there's just something magical that happens when people have that declared intention, right?
They're like, okay, I'm going to do this. And second, I think the benefit they have
is around that shared learning, right?
When you're around other people,
they're just at your own level.
And you can maybe replicate that a lot
out there on your own too.
There's, you know, Twitch is a great platform now.
A lot of people like kind of learn together on Twitch.
You know, somebody says, hey, on Wednesday night,
I'm going to go learn this tool.
And if you want to join, join me.
We're all going to do it together.
And that's a really great resource.
I highly recommend that.
But I think lastly, some of these boot camps that have career services, that's something
that you probably won't be able to get anywhere else.
And that would be very important, especially if you're new to tech, to have somebody actually
advocate for you. Somebody go and show you the ropes of where to get a job and how to get a job.
That's going to be really important. That's sort of my spiel on the path to data engineering.
And again, I do want to really emphasize the importance of learning a lot of these skills on the cloud.
And I see that's where the industry is going.
I do have a very opinionated opinion on the sort of five tools
that you have to learn as a data engineer.
Because I know if you kind of go out there and kind of try to research your own,
you would immediately become very confused with a lot of different opinions
that people have in this stack that you have to learn.
Because we all know as data engineers, we're also very opinionated.
That's for sure.
A hundred percent.
I have a question.
What was there before the title of data engineer?
Like, companies had problems with data, like, since forever, right?
Like, they had, like, to manage data in a different way, obviously.
Like, we didn't have cloud back then.
Data warehouses were, like, bundles of hardware together with software.
But what was there before that?
Like, when you were at Talend, right?
Yeah.
The mid, like, 2000s.
How did you call these people?
That's a great, great question.
I think it's kind of evolved
with the different technologies, right?
So again, if you look at it,
it was software engineering
and that kind of evolved to data warehouses, right?
And that kind of evolved to data lakes
and now it's going to data clouds.
And if you backtrack that, I think some of the
skills that we've had to pick up along the way, right? We went from data warehouses where it was
very much an afterthought. Business intelligence, in a sense, is an afterthought. It's like,
after the events have happened, after the data has happened, you come and collect things and
then make sense of it and say, okay, let's get the numbers on an aggregate level and see what happened, how much we sold, how much inventory we
have. And it was very purely, again, an aggregation in a very mathematical sense, right? Whereas now,
in kind of like the 10s or the teens, the 2010s, 2020, now, that has become much more real time.
And then we saw ML and machine learning and data science, right?
So the need went to, okay, not so much an afterthought, but what do we do now?
Like this data is coming in.
How do we interact with the user in an intelligent way and how to use machine learning and data science
to give them some perspective of what they're doing and give them some pointers.
And I think that's where data engineering kind of grew alongside this data science, right?
When in the modern world, a lot of these are real time. If I backtrack that just a little bit before that,
we could say someone was a big data engineer.
And again, even big data was to a point was again, an afterthought.
MapReduce and Spark was an afterthought.
We just did it at a larger scale.
Yes, other things came like Spark and Kafka
and other of these technologies that made that again
more real time. And that was kind of the shift. And now if we even step one level back from big
data engineer, I think you hit it right on with the Talend days, where it was like
ETL developer, right? Now we're really in the data warehouse realm, where you were an ETL,
extract, transform, load, developer.
You worked alongside a business analyst very closely.
And your job was just to provide pretty much the aggregate level tables to get from raw
level data to aggregate tables.
And then the business analyst put a visualization, a dashboard around that.
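As a rough illustration of that "raw level data to aggregate tables" step, here is a small Python sketch using SQLite. The `raw_sales` and `daily_sales` tables and their columns are made up for the example; the point is only that the ETL developer rolls detailed rows up into a summary table that a dashboard can read directly.

```python
import sqlite3


def build_daily_sales(conn):
    """Rebuild a hypothetical daily_sales aggregate table from raw_sales rows."""
    conn.execute("DROP TABLE IF EXISTS daily_sales")
    conn.execute(
        """
        CREATE TABLE daily_sales AS
        SELECT sale_date,
               SUM(amount) AS total_amount,
               COUNT(*)    AS num_sales
        FROM raw_sales
        GROUP BY sale_date
        """
    )
    return conn.execute(
        "SELECT sale_date, total_amount, num_sales "
        "FROM daily_sales ORDER BY sale_date"
    ).fetchall()


def demo():
    """Load a few illustrative raw rows, then build and return the aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (sale_date TEXT, item TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_sales VALUES (?, ?, ?)",
        [
            ("2006-01-01", "widget", 10.0),
            ("2006-01-01", "gadget", 5.0),
            ("2006-01-02", "widget", 10.0),
        ],
    )
    return build_daily_sales(conn)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

A BI tool would then query `daily_sales` instead of scanning every raw row, which is roughly the handoff between the ETL developer and the business analyst described here.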
And that went to the executives. And a step beyond that,
like when we were initially hiring at Talend, there was,
yes,
I would say you would be a software developer or a DBA,
a database administrator, you know,
someone who was just purely in charge of storing data.
And that, I think, was the evolution.
So I would say, like, kind of software developer, right?
Went into, like, ETL.
And from there, you went to, like, a big data engineer.
And then big data engineers somewhere along the way meshed to data engineer.
Now, like, cloud engineer, per se.
Yeah.
Makes a lot of sense. Yeah, that actually, I think, is a very, very good
timeline of what was there before and how it, let's say, grew into the role of the data
engineer today. I totally agree with you. So I have a little bit of a provocative question for you,
based also on the personal experience that you have from moving through different roles, like from hardware to software and then getting
into data engineering. Let's say I'm a software engineer, right? I write backend code, I don't know,
something like that, something a little bit more common. How do I know, or what indications
might I have, that I'll be happy as a data engineer and it's worth it for me to invest and transition from being a backend engineer into a data engineer?
Another great, great question, Costas. You nailed it on the head. So
there are some characteristics of a data engineer, I think, personality-wise, that I've seen. There might not be good.
But I would tell them.
I mean, financially, we can all agree that there are evidence that this is the fastest-growing field in tech.
It's not going away because data is not going away.
It's not a hype.
And we've all been through a lot of hypes in our career.
But at the end of it, data is not going away. That's for sure. There's evidence, again, like the Dice report that came out that ranked data engineering the number one growing field. There's the evidence of the average salary for data engineers, which is higher than in any other field in tech, again. So of course, there are financial benefits of being a
data engineer. And you and I can both agree that we can both pick up the phone and tomorrow we have
a job, right? Just saying our skills. So we're very privileged in that, again, in that sense.
Now, the personal
characteristics of a data engineer, which it kind of hurts me to say: first of all,
we're background people, right? Nobody hears our name. They hear our name mostly when things go
wrong. You know, if all the data gets to where it needs to go, and everything's fine, nobody would
come knocking on your door. But as soon as,
you know, people don't get their notifications, people don't see their emails, people don't,
you know, like all these apps fail, or there's a data breach or something like that, then
everybody's going to know your name. People are going to knock on your door. So in a sense,
I like to say that us data engineers are kind of silent heroes.
You know, we're in the background, but nobody hears our name.
We're the Q, right, in the Bond sense, right?
I hate to use that analogy, but that's somewhat true.
From working with data engineers, one thing that I've seen is that in our field, the devil is in the details, right? As a data engineer, you have to
make sure that you've accounted for every place that the software could go wrong, that you've accounted
for all the corner cases. And in that sense, a lot of data engineers are very particular,
like in the attention to detail. I want to say even, like, we're almost OCD, right? And we are. Like if you see my own apartment, it's very neat.
I am very OCD.
But that's what makes a good data engineer, the attention to detail, right?
So if you have those characteristics, I think data engineering is something good for you.
If you are someone who thinks in steps, if you played with Legos too much as a child, you know, you build pieces together.
I did. I played with Legos till I was 13. But that helps because again, in data
engineering, you have all these pieces and you have to figure out how you put them together
and you build something bigger from that, right? So those are some good facts. I would say the one
thing that I want to debunk: if you're out there on the internet, a lot of people say that data engineering
is too complex to get into, it's too hard to learn. I honestly, absolutely disagree with that.
And I see where that is coming from, because people say, oh, because you have to learn
Spark, and Spark is this hugely massive, you know, distributed processing engine, or you have to learn these things like Kafka. Again,
these very complex software engineering concepts like distributed processing, right? But those
things have been made so simple now. Spark is Spark because a lot of smart people worked for years to abstract away all of that complexity and make it something very simple to understand and very simple to use.
And I would say absolutely, like, anybody can go spend two weeks and be a very solid Spark developer, right?
They can understand the concept of data frames.
They can use it to aggregate data.
They can use it to process data.
And so that's the one thing that I want to completely debunk here.
If you've heard that data engineering is too complex, that is untrue.
And it's mostly like a little bit of gatekeeping talk that there is in a lot of tech, I would say.
Okay.
Costas and Par, question for you here,
and this might be provocative as well.
And, you know, maybe some, you know,
software engineers who are listening may not like this,
but one thing that's interesting
when you think about software engineering is
you're building something for an end user, right?
So you want to develop empathy for the end user.
So if you think about a software engineer going back to Lyft,
as opposed to a data engineer,
the software engineer, you know, in an ideal world,
is trying to build empathy with someone who's trying to get a ride,
to book a ride, right?
And, you know, what's happening
that creates friction there?
And I want to have empathy as I build this, right?
What's interesting about data engineering
is your end user is someone in the business.
So would you say that if you think about
a software engineer, to your question, Costas,
and I would love your take on this as well, Costas,
if you have kind of an interest in the mechanics of the business itself,
maybe even beyond sort of the experience of the end user,
and that's kind of a false dichotomy, right? Because they're, you know,
inherently related, but you said earlier, Par,
you have to understand the business, right? And
sort of the way that it works and generates revenue and all that sort of stuff. Would you
say that a predisposition to being curious about the business is a good sort of prerequisite for
being a data engineer or does that matter? I wouldn't say it's a prereq. It's something that I picked up along
the way that I think that you can kind of pick up and by necessity, you kind of pick up along the
way a little bit. But yes, it is essential that you have those ears open, right? You have your eyes open and your ears listening
for the ways that your process is making other people's work easier. Like, right, just like you
said, yes, a software developer goes to a company to build, you know, empathy around the
experience. Our job is to build empathy for that software engineer,
to make all the pieces that they need so that they can do their job, right? The platform that
they need. And then, again, to be able to grab the data and hand it to the
data scientist, and then get back to the software engineer and tell them, hey, these are
the ways that your software is being used. Maybe these are some of the things that you haven't seen. And again, make that loop possible, right? So yes,
you have to know the business because you touch, again, you are the centerpiece. You are the piece
that moves data around a company. So you're going to touch all sides of business.
And it's very important when you are in those meetings with the different business stakeholders that you really listen.
I think a big part of a data engineer's job, our job, is to listen, to be very honest. And then taking those things that you've heard back with you
and turning them into requirements for what you have to do in your job. That tells you
how you should sort the data in a way that matches those business needs.
Right. And to kind of close that up, as a data engineer, I feel like we live for the process, right?
Like, yes, maybe it's not that glorified app that we designed and made so much simpler for the person to click, you know, the buttons and get that ride quicker.
But we made that process possible.
We made all those pieces work.
Even though we weren't on the front-facing side of that,
we're in the back, again, connecting the dots, right?
Yeah, I agree with Par.
I would say, let's say we were interviewing
for a data engineer role, right?
I don't think that
I would pay that much attention to how interested the person is in the business
itself at that point. I think that's relevant for everyone, in the end, who works in the business.
If you ask me, for example, the first thing that came to my mind when I was listening to you, Eric,
is data analysis.
Like, yeah, if we are talking about the data analysis, being curious about the business
is important because you have to understand the business to go and do data analysis, right?
And understand, does this make sense or it doesn't make sense?
So you can go and figure out if something went wrong, like on working with the data.
But for a data engineer, I don't know.
As Par said, we are talking about people that are in the background.
And that's part of how it is.
And it's a good thing.
I mean, it's not bad, right?
Like it's not, there's nothing wrong about that.
But you learn about the business, let's say,
anyway. You cannot not learn, let's say, about that. And that's relevant for every engineering role, right? Go and speak with someone in operations, for example. They know a lot about
the business because, depending on the business, they know if they have to be on call or not, right? So I wouldn't say that there's some difference between
the data engineer and anyone else. The main difference that I would see there is that
the customer of the data engineer is internal, right? It's the marketing
team, it's the sales team, the RevOps team, or whatever.
And it's not necessarily, let's say, only the person who uses the Lyft
app to call a Lyft and go somewhere.
So that's the main difference. But let's put it in a different way. Let's say I have a
person who is more curious about the business
than about the data-related problems of, for example,
how much data you have to work with
and what kind of systems you have.
If someone was more interested in the first thing
and didn't ask any questions about the second,
I would be worried.
Probably there's something wrong with the person's
choice of career path.
Yeah, probably.
If I have someone who's really interested in the data-related problems and also makes the connection with the business, that's amazing.
But if it doesn't happen, that's again fine.
Yep.
Yeah, I think that's a super helpful perspective.
Okay, we're closing in on the end here.
So we have time for one more topic.
Brooks, just let us know. One thing we'd like to chat about is a term you mentioned as we were
prepping for the show, Par: the modern data stack, which is something that we've discussed.
I'd love your take on two things. And I'm sure Costas will have some questions as well.
You mentioned that you have pretty strong opinions about, you know, what are the five
tools that a data engineer needs to know?
So I'd love to get that list from you because we always love, you know, an opinionated take
on the tooling.
And then the other thing I would like to know is if we think back to 2005 and Talend and then Hadoop Spark, you've seen the modernization of the data stack.
And there are lots of definitions for that. It's hard to nail down. But I'd just love your take on
what is the modern data stack and what does that mean to you in the context of what you've seen
over the past decade and a half? Yeah. Yeah, great question. So if I had to really summarize it,
I would say the modern data stack now, again, is the cloud. It's on the cloud. And
I'll explain that in a second. Let me get into the five tools that you need to learn.
And these are, I would say, backed again by a lot of data, by that Dice report that I
mentioned to you. These are some of the skills that were top listed in it. I would say as a
data engineer, you've got to learn the basics, right? This is, I would say, number one:
the basics. And I would categorize that as just your basic Bash terminal programming, Python, and SQL.
And I chose Python as my language.
I know there's a lot of languages out there for data engineers,
but Python by far is the most dominant one.
And it's, again, that bridge between data scientists and data engineers.
So it's a great language, again, built for data engineering.
The second, number two, I would say Docker and Kubernetes. That's becoming important
especially for designing cloud-agnostic data pipelines, pipelines that work across clouds,
and for containerization. Now everybody is, you know, designing serverless, microservice, these
sorts of architectures, which are again part of the modern data architecture.
And Kubernetes and Docker are the heart of that.
And this was, again, the number,
I think Kubernetes actually was the number one skill
for data engineering in that Dice report.
Third is
something traditional that's been around,
your number one big data tool, Spark.
It's still, you know, for your batch processing of billions of records, Spark is the go-to tool.
And again, it's very easy to learn Spark nowadays because there's great documentation, a lot of good resources out there.
Number four, now I'm going to move to a little bit of stream processing
and real-time processing. And this is a tool that is just a foundation for connecting almost
all the data pipelines in a real-time sense, which is Kafka, right? Your Kafka, Apache Kafka. And number five, I would come a little bit on the orchestration side.
And there's a tool now that's becoming heavily dominant on the orchestration side because, again, as data engineers, part of our job is to connect the pieces together, orchestrate data flows for data pipelines.
And that's Apache Airflow.
Apache Airflow now is almost becoming the number one orchestration tool.
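To make "connecting the pieces together" concrete, here is a minimal sketch of the core idea behind an orchestrator: tasks plus a dependency graph, run in dependency order. It uses only Python's standard-library graphlib (Python 3.9+) and is not Airflow's API. The task names and the run_pipeline helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task after the tasks it depends on; return the run order."""
    order = []
    # static_order() yields every task only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        order.append(name)
    return order


log = []
tasks = {
    "extract": lambda: log.append("pulled raw files"),
    "transform": lambda: log.append("cleaned and aggregated"),
    "load": lambda: log.append("wrote to warehouse"),
}
# Each key maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

order = run_pipeline(tasks, dependencies)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this, but the dependency graph is the part a data engineer defines.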
But I want to now come back and say, as a modern data engineer,
what I've seen in the industry in almost every project that I do nowadays is that
it starts on the cloud, and everything's moving to the cloud.
So you need to learn these skills on the cloud. The first two are cloud-agnostic, right?
Bash, Python, SQL, Docker, Kubernetes. Cloud vendors don't even have a different name
for a Kubernetes service. They call it Kubernetes. But for the last three, each cloud vendor has kind of
their own version, and they call it something else, right?
Spark on Amazon is Amazon EMR,
on Google it's called Dataproc,
and on Azure it's Databricks, the company that is behind Spark.
Kafka on Google is called Pub/Sub,
on Azure it's Event Hubs, on Amazon it's Kinesis. Airflow has different names, etc.
And I say that again, that even in my own job, I would say we wouldn't even take jobs that are not on the cloud anymore because they take so much longer to set up and maintain.
And that means we deliver a lot later, even as consultants.
And that's a bad thing in consulting, right?
You want to go as fast as you can.
And the cloud vendors just have made that so easy.
They've removed all the complexity of managing these systems,
all the complexity of the scalability.
There is a huge move towards this serverless event-driven architecture,
like the use of cloud functions.
That is huge on the cloud.
The cloud functions, they're event-triggered.
They're just a small piece of code that you write that could do a data
transform that could do nearly everything.
And the cloud itself completely takes care of the scalability, the fault tolerance,
all of those topics that we have to worry about. Immediately, your function can scale to millions
of data points and can be triggered in real time to act on data. And they're so easy to deploy.
With one bash command line, you can deploy your code running from your machine to the cloud that is scalable to millions and millions of instances.
That is so powerful these days.
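As a sketch of what such an event-triggered function might look like, here is a plain Python function that takes one incoming event and returns a transformed record. The function name handle_event, the event shape, and the field names are all hypothetical. Each cloud vendor defines its own function signature and deploy command, and none of that is shown here.

```python
import json

METERS_PER_MILE = 1609.344


def handle_event(event):
    """Transform one incoming message; the `event` shape is hypothetical."""
    # Assume the payload arrives as a JSON string under the "data" key.
    record = json.loads(event["data"])
    return {
        "user_id": record["user_id"],
        "ride_miles": round(record["ride_meters"] / METERS_PER_MILE, 2),
    }


# Locally, the function can be exercised with a hand-built event.
sample_event = {"data": json.dumps({"user_id": "u1", "ride_meters": 3218.688})}
result = handle_event(sample_event)
print(result)
```

Because the function holds no state between calls, the platform is free to run many copies of it in parallel, which is where the scaling comes from.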
And if I take a step back again, I think in a modern data stack there is a very important distinction between solutions and products. And we have to be very
careful with that. There are a lot of different products that are out there, but there are very
few solutions. And I want to say, if you think of it as a building-a-house sort of analogy, a product is a nail gun. A solution is a framed house, right?
And so don't get bogged down so much in the products,
I would say. And clouds are kind of a solution now,
because they give you all of those products.
They give you that framed house, almost. And there are some other, I mean,
of course there are other solutions out there,
you know, something like Snowflake, Databricks. Those are now solutions.
They meet, you know, a need. They're not a product. There are a lot of products
out there you know like products around like different ml different machine learning libraries
the different like audio processing library, different video processing library,
code security, scanning code, you know, for faults.
Those are again, products.
And at the end, those products,
I think would move to cloud at some point
in the near future, right?
All of those products, all of those technologies,
if you talk to them, if you're on their board,
they're kind of like, okay, what do we have to do to maybe sell to one of these cloud vendors
at the end of the day?
That's the exit strategy, right?
So coming back to the solution, I think it's very important in a modern data stack.
What are the companies that are providing that framed house, right?
And that's very important for us to look at.
And again, as far as being a data engineer,
just going back to those Python, SQL, Kubernetes, Docker,
Spark, Kafka, Airflow, I think if you learn,
I know that's more than five.
I kind of do Python and SQL as one.
But if you learn those and learn those on the cloud,
I can guarantee you
that you would have a job
in this industry.
That gives you the base.
You can learn everything else from there.
Yeah.
This is great, Par.
One last question.
And I think we can conclude
this conversation today.
I mean, although we have to have at least another one,
I think there are like many things that we can discuss
and there's value in doing that.
So let's say I want to get a taste of data engineering, right?
Like I'm not a data engineer.
Can you give me a sample, like a small project,
that would be a good way for me to get a taste of what it feels like to
be a data engineer? Something that potentially I could also put, let's say, on my GitHub so I can
demonstrate it. Yeah, yeah. Well, I would highly recommend going on Kaggle, right? Kaggle is
a great data science and data engineering tool. They have
a lot of challenge projects, where you can actually get into these live challenges and
compete with other data engineers and data scientists. But you can also look at the
historical projects that they've had and pick up on some of those. I think that's a great, great
source. There are tons and tons of examples and
projects there, and projects that have data with them. And I know you can kind of agree with this,
because it's very hard to get your hands on big real-world data sets. A hundred percent. Yeah.
And there are some sites for that too. DataHub.io, I think is a good site for that. There's a lot of governmental agencies.
Like in our course, we actually took all the flight data.
So the FAA actually publishes all domestic flights data,
like where these flights took off,
where they were going, what airline.
A lot of that is public data.
And we use that and it's a great volume of data.
We use that to build our course.
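A first starter project on flight data could be as small as the sketch below: read a CSV of flights and count flights per airline. The column names (airline, origin, dest) and the inline sample rows are invented for illustration. A real public data set will have its own schema and far more rows.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample standing in for a real flights file.
SAMPLE_CSV = """airline,origin,dest
AA,PDX,SFO
AA,PDX,LAX
DL,SEA,ATL
AA,SFO,PDX
"""


def flights_per_airline(csv_text):
    """Count rows per airline code in CSV text with an 'airline' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["airline"] for row in reader)


counts = flights_per_airline(SAMPLE_CSV)
print(counts.most_common())
```

Swapping the inline sample for a downloaded file, and then for a larger volume of data, is a natural way to grow the project step by step.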
And not to take this opportunity to promote our program, Data Stack Academy, but we do
have a lot of these projects.
Actually, the first two chapters, we have a 10-chapter course.
The first two chapters are free.
If you go on our site, datastack.academy, there's a site to get started for free.
We actually send you two chapters
that has a lot of these projects,
has again, that data set,
this FAA data set that we're talking about.
And you can get started.
There's a lot of good projects there.
And we force you actually to go on GitHub.
So you have to develop those on GitHub
kind of by design.
That's amazing.
That's amazing.
Eric, anything else that you would like to add?
This has been a great show, Par.
We really appreciate you taking the time.
And it's been fun to just hammer on
the definition of data engineer
and talk about some of the specifics of the role.
I think that's helpful both for people
looking to get into it and people who have been doing it for a long time and even running teams. So appreciate
the perspective. Thanks again. Thanks. Thanks so much for having us. I'm a huge fan of the show,
longtime listener. You guys are doing amazing stuff. Please, please continue to do what you're
doing. And we really appreciate you as listeners. Well, thank you so much.
We love that feedback.
And if anyone's listening, please give us feedback.
You can do that on the form on our website at datastackshow.com.
So thanks again, Par.
Take care.
All right, Costas.
Why don't you answer the question?
Is being a data engineer fun or is it not fun?
I don't know.
I would say that it's fun.
But I think that's the outcome of this conversation
that we had with Par. And what is really, really interesting
is not so much that,
okay, we discussed the technologies
and all these things, but it was very interesting to hear
from him also the personality
traits that someone
has as a data engineer.
And I found this super, super, super interesting.
So it's not for everyone, obviously, but there's huge demand out there.
So if anyone's like thinking about it, give it a try.
Yeah, for sure.
I think one of the things that I, this is the very beginning of the episode,
but, well, Par mentioned this in the middle of the episode: data engineering is so big.
And I thought, I mean, that's a simple statement, but it's really true.
And one thing that I recalled from the early part of the conversation when he said that
was when he was talking about the early days at Talend and how it was so cool that they output Java code that you could use to sort of customize that last 30% of your pipeline that you're building.
I thought, I bet we have listeners who might not be familiar with Talend because they're young and early in their career and they're just using a completely different set of tools.
And we probably have other listeners who remember when Talend implemented the Hadoop Spark componentry and that was game changing. And I just thought, man, data engineering is big and it now spans multiple decades and it's just fascinating. It's really fun to just be able to talk about things that,
that hit on both the history and the modern, the modern stuff that we use.
So that was my takeaway. I appreciated it. The history lesson, if you will.
Yeah, absolutely. And hopefully we will have him back like in the future.
We have many more topics to cover with him.
I know we need to actually have Brooks start bringing some people back on
because we always say we're going to do that. And then we get busy and we don't do that. So that's our New Year's
resolution for the Data Stack Show. All right. Well, of course, subscribe if you haven't. You
can get notified of new episodes and we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric
at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you
by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at
rudderstack.com.