The Data Stack Show - 75: How To Become a Data Engineer with Parham Parvizi of the Data Stack Academy

Episode Date: February 16, 2022

Highlights from this week’s conversation include:
Par’s background and current role (2:48)
About Talend (6:46)
Nonlinear pathways to data engineering roles (11:08)
What a data engineer needs to be successful (17:37)
Before “data engineer” was a title (27:59)
Signs you should be a data engineer (32:39)
Curiosity and data engineering (38:31)
Defining the modern data stack (45:07)
How to get a feel for data engineering (52:52)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week, we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. Today, we are going to talk with Par, and he has a long history working with data. In fact, he was one of the first couple of people at Talend, which has been around for a really long time. And then he actually brought the Hadoop and Spark infrastructure into Talend and really influenced
Starting point is 00:00:46 the shape of the product, which is fascinating. And today he runs a consultancy and also a school that teaches data engineering. It's called the Data Stack Academy. Maybe we can find out if he got the name from us. But I'm really excited to talk with him. Costas, I have a background in education. So I think one of the things that really interests me in terms of Par's school for data engineering is what are sort of
Starting point is 00:01:12 the key foundational principles or tools that he thinks that a data engineer needs to have to build a foundation for a career? Because it can be hard to kind of distill that down. So that's what I'm going to ask about. How about you? Yeah. First of all, I want to ask him about like the evolution of data engineering. He's been around for a long time and he has been around like since, like back then we didn't have like the term, we didn't use like the term data engineer, right? So it would be great to hear from him what's the evolution and how from whatever was happening in the early 2000s to today has changed. And also, of course, hearing his opinion about what's going to happen in the future, right?
Starting point is 00:01:57 So that's definitely something that I would like to discuss with him. And then, yeah, it would be great to discuss a little bit more about the technologies and in general what it takes to be a data engineer and how it feels to be a data engineer. Is it fun or is it not? We'll find out. Let's find out and go talk with Par. Par, welcome to the Data Stack Show. We're super excited to talk with you.
Starting point is 00:02:20 Thanks. Thanks for having me, Eric. Super excited to be here as well. Cool. Well, we've talked with lots of your friends as guests on the show, so it's only appropriate that we have you on as well. Can you give us your background and tell us what you do today? Oh, thanks. Yeah. I feel like I'm already a cousin of the show. You guys have had almost every one of my friends in the industry, and I'm always like, oh,
Starting point is 00:02:45 there you go. Yeah. Oh man, my background. It's been a while. I think I just got lucky and continue to get lucky throughout my career and my life. I was very lucky to be born in a house where my father was a civil engineer, but he really had a knack for computers. So we had, like, an IBM 8080 computer, you know, when I was like four years old, and I would load up my floppy disks and play my games. And one time I actually formatted our entire C drive, and that was not good. I don't know if you remember DOS, but if you run format and don't specify the drive, it defaults to C, which is horrible design, right? It's like, that's exactly what it is. And then, yeah, I actually went to school for computer engineering.
Starting point is 00:03:34 So I have a background in engineering and computer science. I actually worked as a chip designer for a little bit. But I would say I really got lucky when I left hardware engineering and came to software engineering. I got connected with a company that you all might know, a very large company now, Talend. But I got very lucky when I got connected with them. There were just three or four people in the US. So again, day one, my laptop was on my lap. I was their only technical employee at that point, and I pretty much grew with them, you know, as they grew in the US. I trained a lot of folks and built out, you know, a technical team around Talend and myself, you know, around North America. Then we moved to, like, Asia and other markets. Lucky again, at some point in the middle there, one of our account representatives was like, hey, there's this thing Hadoop I keep hearing about,
Starting point is 00:04:31 like, can you just figure out what this is over the weekend? So I went and downloaded it and finally got it to compile and work without errors, you know, fairly early on. And then I saw the value of it immediately. I was like, wow, this is the next thing. Like, either I've got to learn this or I'm going to be obsolete. So I started learning and started contributing a little bit. And then later on, I actually got connected with some of the founders of Hadoop, you know, Doug Cutting, Tom White, and all those guys in meetings. And that was just incredible. And being in that market, I moved to a company that at the time was called Greenplum. They evolved into Pivotal. I was one of their big data Hadoop solution
Starting point is 00:05:11 architects. So I was part of the elite team they would send out to fix problems where there were large, like thousand-node, clusters. And all right, here you have two days, go figure out why this is not working, or optimize this ETL process that they have, see, you know, what's wrong with it. I did that with some guys, you know, like Don Minor, that you guys might know; we were part of that team. And then in 2017 I decided to kind of go and build things on my own, the dream that I'd had. So I started a consulting company, and that consulting company is still something that I run. But just about a year, year and a half ago, it dawned on us, and I work with a lot of people who've been in this industry for a while, and, you know, we've kind of been around the block, and we're like, wow, there really aren't too many resources around learning to
Starting point is 00:06:01 become a data engineer. So we decided to develop a curriculum around that. And, you know, it was just us working at nights after hours for a while, and then it became more serious. And we just launched this program, which, actually by coincidence, is called Data Stack Academy. So it's a bootcamp to teach people how to become a data engineer and the skills necessary for that. Well, that is super exciting. And I want to dig into sort of what I see as a nonlinear career path into data engineering, maybe as you compare it with sort of software engineering and you getting, you know, a computer engineering degree. But really quickly, if we can, you know, Talend has been around for a long time. You know, it sort of predated a lot of the big trends that we've seen even over the last decade, say. So could you give us just a
Starting point is 00:07:01 little bit of history, especially for the listeners who may not be as familiar with Talend? When you started with them, what were they building? And when was that? That was back in the mid-2000s? Yeah, 2006. I think that was back in 2006 that I got connected with them. Yeah, I think it was still V1 back then. Yeah, wow. Yeah, so originally, I mean, the field comes from data warehousing, right? And you take data warehousing and break it into the pieces that it has. And some of those are things that we don't hear too much about anymore, like databases and MPPs. Those talks have kind of gone away a little bit, right? But then the other piece of that, of course, was the BI, the reporting, and that's still there, of course.
Starting point is 00:07:46 Everybody has to do BI. And one of the other spokes of that wheel was ETL and data engineering. And back then, I think it was mostly known as extract, transform, load, or ETL: the process of getting the data to the point where it is ready to be analyzed or viewed by the BI tools. At Talend, we were an ETL tool. So in ETL, there are a lot of things that you do that I would say are mundane, right? It's everyday stuff that could be templated, like how you open a connection to a database, how you write to the database, how you read a file. Those things can easily be templated in a data engineering job.
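To make the extract, transform, load steps described here concrete, the following is a minimal sketch in Python using only the standard library. It is not Talend's generated code and is not from the episode; the sample CSV, the `rides` table, and the function names are hypothetical and chosen only for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample input standing in for a source file.
RAW_CSV = """ride_id,city,fare
1,Portland,12.50
2,Seattle,8.00
3,Portland,
"""


def extract(text):
    """Extract: read rows from CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing fare and convert column types."""
    cleaned = []
    for row in rows:
        if not row["fare"]:
            continue  # skip incomplete records
        cleaned.append((int(row["ride_id"]), row["city"], float(row["fare"])))
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rides (ride_id INTEGER, city TEXT, fare REAL)"
    )
    conn.executemany("INSERT INTO rides VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(text):
    """Run the three steps against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(text)))
    return conn
```

Calling `run_etl(RAW_CSV)` returns a connection whose `rides` table holds only the complete rows. The open-connection, read-file, and write-table steps are the kind of repetitive code that a template or a visual tool could plausibly generate.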
Starting point is 00:08:28 So at Talend, we had made a very nice visual tool that let you use those connectors to different data sources and data sinks with drag and drop and a couple of clicks. And it was a code generator. I think that was really what set Talend apart early on, because a lot of the ETL tools that were on the market, like Informatica and so forth, had an engine behind the scenes. So you had to know their own language and they had their own engine, but the Talend engine was just Java, and it generated Java code.
Starting point is 00:09:01 So it also allowed you to do a lot of things that are not common. You know, all the ETL tools would get you like 70% of the way there, right? But that 30%, you still need to code, and you need to do something custom. So it allowed developers to do that, but do the mundane things, like just opening a database connection, all of that stuff, very fast, visually. And then there was that shift again, I think, where the engine was quickly becoming MapReduce and Hadoop and Spark. And I was actually, again, to be very humble, the person who started that kind of development at Talend. So I made the very first Spark and Hadoop connectors. And I was like, no, this,
Starting point is 00:09:47 like even Java is not going to suffice anymore. Like we need to go to something that's highly distributed, highly parallel, like Spark and MapReduce. So then we changed the connectors to produce that code versus just pure, you know, a POJO or a JAR file. Yeah. Yeah. Very cool. No, it's really fun to get those little anecdotes of history sort of in the world
Starting point is 00:10:14 of, you know, the world of data that we live in, especially, you know, I enjoy it. And I hope our listeners do too, comparing the things that, you know, are just wildly different with, you know, the things that are like, well, I mean, some of that sounds kind of functionally pretty similar, you know, some of the modern tooling, which is really fun. So thanks for sharing that. Okay. Let's, let's talk about the role of a data engineer, because that's something that, I mean, you know, I would guess that a ton of our listeners have some sort of data engineering or data engineering related role. So it's certainly not a new term to anyone on the show. I don't know if we've actually ever stepped back to define the term. And that can be a little tricky sometimes, you know,
Starting point is 00:10:56 when a term is super familiar but we don't actually put a sharp point on it. But let's start with sort of the nonlinear pathways that people travel into data engineering roles. And would you say that happened for you? I mean, you started out as a hardware engineer. Absolutely. And you hit on that too: almost every data engineer that I know probably has a different background. And we've all had that. Yeah, some people come from a sort of data science or software engineering background, you know, a traditional four-year degree followed by a master's degree.
Starting point is 00:11:32 Some people come from a business intelligence background, where they transition from being a business analyst. And some folks are just sort of self-taught. They come from actually completely different backgrounds that I've seen, you know, like even from being a server or a bartender, you know, and they learn everything, all the tricks to do so.
Starting point is 00:11:55 So it is very interesting. And I think it's kind of part of being a data engineer, because, as you said, that role is not very well defined. And in tech, we are kind of jack of all trades and master of none. Like, we know a little bit about a lot of different technologies, but we're not really mastering anything. And that kind of defines what a data engineer does.
Starting point is 00:12:19 Data engineer is like a glue within a company, right? Is the person who brings data from all the different sources that data exists within a company and meshes those together. So you do have to be connected with all the different arms of a company. You have to understand what those do, what those data mean,
Starting point is 00:12:40 and you have to be able to bring those together. And that makes you kind of very essential to the company. It makes you very essential in that you have to actually know even the business of the company, what the company does. So, you know, what those data are and how to treat them. But you also have to technically then know how to grab the data from the different applications where it's stored and so forth. But to give you an example, I like to kind of maybe start with an example of what a data engineer does, with something that we might all know or use, like Lyft, or the Lyft app, you know, a ride share app.
Starting point is 00:13:18 Yeah. What does a Lyft data engineer do? And I promise you that all of us have interacted with data engineers, right? They're always in the background. They're always there. You just might not hear exactly what they do because we're background people.
Starting point is 00:13:37 We're not developing apps, right? We're not developing, like, web or everything everybody uses. So in the Lyft example, you know, a company generates data. So Lyft has people who go download the app, use the app, and that generates a lot of data. There's a lot of data as far as you getting the ride. There's a lot of data as far as actually the ride, like your GPS updates, where you're going, all of that stuff.
Starting point is 00:14:03 And the app itself is usually developed by your full stack developers, app developers. But it's the data engineer's job to work with those folks to grab that data and then store it in a manner that can be analyzed, and move it probably to a cloud, move it to the servers where it needs to be. So the data engineer works with them very directly to do that data acquisition, right? But it's also the data engineer's job to hand it off to other folks, like data scientists and business analysts, who are the business side of that, you know, who, let's say, do something useful with that data.
Starting point is 00:14:43 Like, for example, in data science, you know, an example could be, in the Lyft example, a prediction algorithm where, when I go to grab a ride, maybe the app tells me, hey, based on the patterns that we've seen, you might want to wait 15 minutes, it's going to be cheaper for you to get this ride. Or maybe it's like a notification. Maybe it's an anomaly detection where, as you're going through the ride, the app pops up and says, hey, your driver just seems to take too many wrong turns, right? It's like, based on GPS data, you've taken too many wrong turns. You might just want to know
Starting point is 00:15:22 that. And those are like the data science or machine learning algorithms. But it's, again, the data engineer who provides the data scientists with all that data. And again, grabs the results from the data science algorithms and provides that back to the user. So it's that glue role again. And even in the other sense, the data scientists typically work
Starting point is 00:15:47 with a smaller data set, right? They work in a prototyping fashion, where they might have only developed a data science model looking at a population of a thousand or 10,000. It's a data engineering task again to really operationalize those and, in a company, take those to the masses.
Starting point is 00:16:04 Okay, now we take this data science model in this like prototyping phase, but now we're going to scale it to like the million or billion users that this company has. Make sure that data continues to come through the pipeline, come through the system
Starting point is 00:16:18 and all the users get the results, right? So that's the data engineer role. Also, data engineers, we're also the glue between all the different applications. So between your CRM application, sales application, all of that stuff, we play that role as well. Okay, so super fascinating. And I love, and actually would love to hear from our listeners, if you're listening to the show, write in and tell us if you have a similar opinion, you know, because we haven't defined it before. But that was super helpful. And I love using the
Starting point is 00:16:48 example. One question I have actually for you, Par and Costas, because you both come from, actually, I think you both come from sort of, you have like hardware engineering and software engineering backgrounds. In software engineering, and I, you know, this is something I learned from being in an education business that focused on software engineering for a while. And our instructors were always very insistent on sort of teaching core principles because that was way more important than, you know, the specific syntax of a particular language, right? Like if you understand these core concepts of building software, you can apply them to different syntax within the context of a different language. Would you say that's true for data engineering as well?
Starting point is 00:17:37 Are there some sort of foundational things, concepts, skills that you need to sort of master to have a really good foundation to sort of build a long-term successful career as a data engineer? Yeah, that's a very interesting question, Eric. And it's a question, to be honest, that comes up not only for data engineering, but also for software engineering. What you're describing, Eric, is also the difference between computer science and computer engineering, right? For example, which are completely different disciplines, to be honest. Completely different. I mean, of course, they have overlaps, right? And if you are a computer scientist, you can also become a computer engineer. But there is a reason that one thing is called a science
Starting point is 00:18:28 and the other thing is called an engineering discipline. And one of the things that happens a lot with the curriculums for software engineering is that they are heavily dominated by computer science topics. That's my experience also, right? My degree is in electrical and computer engineering, but anything that had to do with, like, the computer side was more computer science, right? Like, I had to prove the complexity of algorithms, right?
Starting point is 00:18:58 It wasn't just to know the complexity of an algorithm. And I had to figure out like what tools you can use to do that and why we have, for example, these different complexity classes and all these things. Now, there's a debate there, how much you need from that to become a software engineer, right? I'm biased, obviously. I like the science side of things. So I think that it's important to know that also when you move into engineering. But the engineering track also has some very, very important elements that you need to know
Starting point is 00:19:35 if you are going to build products and put technology into, like, let's say, make it useful for people, right? Everyday people. And that's, I think, what we are seeing happening also right now with data engineering. Like, I think the biggest change that has happened, like, probably in the past, like, one to two years is that all the principles that we have in software engineering or computer engineering, you see them being applied in data engineering and how we deal with data, right? Quality, QA, tests, version control, all these tools that the
Starting point is 00:20:15 software engineer is using every day to go and deliver what they have to deliver, we start introducing them also into data. Now, do you need to have, let's say, there are like some kind of principles that are, let's say, fundamental? I would say it's probably the same thing with software engineering. It's the same principles. I don't see like some huge difference there
Starting point is 00:20:43 in terms of like how someone should shape the way they think in order to solve the problems there. At the end, it's again like engineering problems. And there are problems that you are solving with software. And either you build software or you operate software. And I think that's the main, the very interesting characteristic about data engineering. And I've said that like many times in the past is that it's a hybrid between ops and software engineering,
Starting point is 00:21:11 which might change in the future. I don't know. Maybe we'll have, like, data ops and data engineering, and they would be, like, separate. I don't know. But for now, I think that the data engineer has to do both. So yeah,
Starting point is 00:21:26 that's, I know it wasn't like the most straight answer, but in my mind, at least, I don't see any difference in the fundamentals between the two, between software engineering and data engineering, at the end. I want to bring it back a second and say how big data engineering actually is. And it's not just us. There are a lot of numbers behind this. I know there was a Dice report from the pandemic year where data engineering was the fastest growing field in tech. And they analyzed 6 million job postings in the US for that.
Starting point is 00:22:00 And it rose by 50%. It had double digit lead over the second on the list, which was 32%, which was data science. So that's, I mean, that's huge. That tells you everything that this is a field that is the fastest growing field again in tech. And that shows in the salaries. So when you look at the salaries on Indeed right now,
Starting point is 00:22:24 the average data engineer salary is 119K. That's higher than a data scientist, higher than a full stack developer, higher than a software engineer. And I want to say that in some sense, we are very privileged to be data engineers, right? We have very comfortable jobs. We do have salaries like that. One of the other stats that Indeed posts is unlimited time off for data engineering, which I don't even know what that means, but I kind of do, right? Like, we're very flexible. We work very flexible hours. We have great 401(k)s. We have great healthcare, and some of those are just not available to a lot of us Americans these days. And I want to kind of take that conversation to the folks in,
Starting point is 00:23:12 that sense, to the folks who are trying to get into this market, right? And what path would they choose? What do they need to learn to be a data engineer? And it is true, there are a lot of different paths that you could take. And even if you're in software engineering and you want to level up to be a data engineer, especially nowadays to be a cloud data engineer, I think that's very key, what do you have to do? I do feel like, from my experience, there are some things that I can say would work. Yes, there might not be a general path. And if you look at a lot of colleges out there, they're just beginning to have, like, a master's in data engineering program.
Starting point is 00:23:57 There's some boot camps now that do like a specific data engineering career. But here's what I think would work really as the path. I cannot emphasize enough the importance of being a cloud data engineer these days anymore. And if you want to, if you're on this self-learning path, start by actually looking at one of the cloud certifications. All the major cloud vendors, AWS, Azure, GCP have now a data engineering certification. On their website, they actually list a lot of good free resources that you can go to learn the skills to be that person. I would also say it's very important, something that gets missed, that as we're doing things and as we're doing projects you do those
Starting point is 00:24:46 on GitHub. Now, your GitHub profile, right, it's like the most important thing out there. A lot of companies, when they hire, go on GitHub, and that's how they vet the resources. I do that myself, because code doesn't lie, right? At the end of the day, you can look at someone's GitHub profile and see what they've done. The other thing I would say is to get your hands on a lot of real-world projects. So there are sites like Kaggle and DataHub.io where you can go get a lot of these real-world data sets, real-world projects, and then build those, and build those on GitHub, and be able to show that to employers later on. I would also say, I mean, there are tons of resources out there as far as learning, you know, DataCamp and Udemy, Coursera, all those courses that have
Starting point is 00:25:39 very budget-friendly, I would say, programs around that. But then there's also this other kind of learning path, which is your bootcamps. And there are now just starting to be some bootcamps around data, and specifically around data engineering. And they can provide some things that the other programs can't, right? I think, one, they provide this, like, declared intention, and that's by far the most effective.
Starting point is 00:26:04 When you say you want to be a data engineer and you go through these bootcamps, you sign yourself up for this, like, three- or four-month experience where you submerge yourself in data engineering, submerge yourself in learning. And there's just something magical that happens when people have that declared intention, right? They're like, okay, I'm going to do this. And second, I think the benefit they have is around that shared learning, right? When you're around other people who are just at your own level.
Starting point is 00:26:32 And you can maybe replicate that a lot out there on your own too. There's, you know, Twitch is a great platform now. A lot of people like kind of learn together on Twitch. You know, somebody says, hey, on Wednesday night, I'm going to go learn this tool. And if you want to join, join me. We're all going to do it together.
Starting point is 00:26:48 And that's a really great resource. I highly recommend that. But I think, lastly, some of these bootcamps have career services, and that's something that you probably won't be able to get anywhere else. And that would be very important, especially if you're new to tech, to have somebody actually advocate for you. Somebody to go and show you the ropes of where to get a job and how to get a job. That's going to be really important. That's sort of my spiel on the path to data engineering. And again, I do want to really emphasize the importance of learning a lot of these skills on the cloud.
Starting point is 00:27:26 And I see that's where the industry going. I do have a very opinionated opinion on the sort of five tools that you have to learn as a data engineer. Because I know if you kind of go out there and kind of try to research your own, you would immediately become very confused with a lot of different opinions that people have in this stack that you have to learn. Because we all know as data engineers, we're also very opinionated. That's for sure.
Starting point is 00:27:56 A hundred percent. I have a question. What was there before the title of data engineer? Like, companies had problems with data, like, since forever, right? Like, they had to manage data in a different way, obviously. Like, we didn't have cloud back then. Data warehouses were, like, bundles of hardware together with software. But what was there before that?
Starting point is 00:28:18 Like, when you were at Talend, right? Yeah. Mid-2000s. What did you call these people? That's a great, great question. I think it's kind of evolved with the different technologies, right? So again, if you look at it,
Starting point is 00:28:35 it was software engineering and that kind of evolved to data warehouses, right? And that kind of evolved to data lakes and now it's going to data clouds. And if you backtrack that, I think some of the skills that we've had to pick up along the way, right? We went from data warehouses where it was very much an afterthought. Business intelligence, in a sense, is an afterthought. It's like, after the events have happened, after the data has happened, you come and collect things and
Starting point is 00:29:02 then make sense of it and say, okay, let's get the numbers on an aggregate level and see what happened, how much we sold, how much inventory we have. And it was very purely, again, an aggregation in a very mathematical sense, right? Whereas now, in kind of the 10s or the teens, the 2010s, 2020, the present time, that became much more real time. And then we saw ML and machine learning and data science, right? So the need went to, okay, not so much an afterthought, but what do we do now? Like, this data is coming in. How do we interact with the user in an intelligent way, and how do we use machine learning and data science
Starting point is 00:29:46 to give them some perspective of what they're doing and give them some pointers. And I think that's where data engineering kind of grew alongside this data science, right? Where, in the modern world, a lot of these are real time. If I backtrack just a little bit before that, we could say someone was a big data engineer. And again, even big data was, to a point, again an afterthought. MapReduce and Spark were an afterthought. We just did it at a larger scale. Yes, other things came, like Spark and Kafka
Starting point is 00:30:24 and others of these technologies, that made that again more real time. And that was kind of the shift. And now if we step even one level back from big data engineer, I think you hit it very right on with the Talend days, where it was, like, ETL developer, right? Now we're really in the data warehouse realm, where you were an ETL, extract, transform, load, developer. You worked alongside a business analyst very closely. And your job was just to provide pretty much the aggregate level tables, to get from raw level data to aggregate tables.
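The step from raw level data to aggregate tables described here can be sketched as a single GROUP BY query. The example below is a minimal illustration in Python with the standard-library `sqlite3` module; the `raw_sales` and `daily_sales` tables, their columns, and the sample rows are hypothetical and not taken from the episode.

```python
import sqlite3


def build_daily_sales(conn):
    """Rebuild a hypothetical aggregate table from raw sales rows."""
    conn.execute("DROP TABLE IF EXISTS daily_sales")
    conn.execute(
        """
        CREATE TABLE daily_sales AS
        SELECT sale_date,
               COUNT(*)    AS orders,
               SUM(amount) AS revenue
        FROM raw_sales
        GROUP BY sale_date
        """
    )
    conn.commit()


def demo():
    """Load a few made-up raw rows and return the aggregated result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (sale_date TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_sales VALUES (?, ?)",
        [("2022-02-14", 10.0), ("2022-02-14", 5.0), ("2022-02-15", 7.5)],
    )
    build_daily_sales(conn)
    return conn.execute(
        "SELECT sale_date, orders, revenue FROM daily_sales ORDER BY sale_date"
    ).fetchall()
```

A dashboard would then read from the small `daily_sales` table rather than scanning every raw row, which is roughly the handoff between the ETL developer and the business analyst that the speaker describes.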
Starting point is 00:30:57 And then the business analyst put a visualization, a dashboard, around that. And that went to the executives. And a step beyond that, like when we were initially hiring at Talend, there was, yes, I would say you would be a software developer or a DBA, a database administrator, you know, someone who was just purely in charge of, like, storing data. And that, I think, that was, like, the evolution.
Starting point is 00:31:27 So I would say, like, kind of software developer, right? Went into, like, ETL. And from there, you went to, like, a big data engineer. And then big data engineer somewhere along the way meshed into data engineer. Now, like, cloud engineer, per se. Yeah. Makes a lot of sense. Yeah, that actually, I think, is a very, very good timeline of what was there before and how it, let's say, grew into the role of the data
Starting point is 00:31:54 engineer today. I totally agree with you. So I have a little bit of a provocative question for you, based also on the personal experience that you have from moving through different roles, from hardware to software and then getting into data engineering. Let's say I'm a software engineer, right? I write backend code, I don't know, something like that, something a little bit more common. How do I know, or what indications might I have, that I'll be happy as a data engineer and that it's worth it for me to invest and transition from being a backend engineer into a data engineer? Another great, great question, Costas. You nailed it on the head. So there are some characteristics of a data engineer, I think, personality-wise, that I've seen. They might not be good. But I would tell them.
Starting point is 00:32:50 I mean, financially, we can all agree that there is evidence that this is the fastest-growing field in tech. It's not going away because data is not going away. It's not hype. And we've all been through a lot of hype in our careers. But at the end of it, data is not going away. That's for sure. There's evidence that says so, again, this Dice report that came out that ranked data engineering the number one growing field. There's the evidence of the average salaries for data engineers, which are higher than any other field in tech, again. So of course, there are financial benefits of being a data engineer. And you and I can both agree that we can both pick up the phone and tomorrow we have a job, right? Just saying our skills. So we're very privileged, again, in that sense. Now, the personal
Starting point is 00:33:45 characteristics of a data engineer, which it kind of hurts me to say: first of all, we're background people, right? Nobody hears our name. They hear our name mostly when things go wrong. You know, if all the data gets to where it needs to go and everything's fine, nobody will come knocking on your door. But as soon as, you know, people don't get their notifications, people don't see their emails, all these apps fail, or there's a data breach or something like that, then everybody's going to know your name. People are going to knock on your door. So in a sense, I like to say that us data engineers are kind of silent heroes.
Starting point is 00:34:26 You know, we're in the background, and nobody hears our name. We're the Q, right, in the Bond sense, right? I hate to use that analogy, but that's somewhat true. From working with data engineers, one thing that we do, because in our field the devil is in the detail, right? As a data engineer, you have to make sure that you've accounted for every way that a piece of software could go wrong, that you've accounted for all the corner cases. And in that sense, a lot of data engineers are very particular about attention to detail. I want to say we're even almost OCD, right? And we are. Like if you see my own apartment, it's very neat. I am very OCD.
Starting point is 00:35:10 But that's what makes a good data engineer, the attention to detail, right? So if you have those characteristics, I think data engineering is something good for you. If you are someone who thinks in steps, if you played with too many Legos as a child, you know, you build pieces together. I did. I played with Legos till I was 13. But that helps because, again, in data engineering you have all these pieces and you have to figure out how you put them together and build something bigger from that, right? So those are some good facts. I would say the one thing that I want to debunk: if you're out there on the internet, a lot of people say that data engineering
Starting point is 00:35:50 is too complex to get into, it's too hard to learn. I honestly, absolutely disagree with that. And I see where that is coming from, because people say, oh, because you have to learn Spark, and Spark is this hugely massive, you know, distributed processing engine, or you have to learn these things like Kafka. Again, these very complex software engineering concepts like distributed processing, right? But those things have been made so simple now. Like, it is Spark because a lot of smart people worked for years to abstract away all of that complexity and make it something very simple to understand and very simple to use. And I would say absolutely, anybody can go spend two weeks and be a very solid Spark developer, right? They can understand the concept of data frames.
Starting point is 00:36:43 They can use it to aggregate data. They can use it to process data. And so that's the one thing that I want to completely debunk here. If you've learned that data engineering is complex, that is untrue. And it's mostly a little bit of the gatekeeping talk that there is in a lot of tech, I would say. Okay. Costas and Par, question for you here, and this might be provocative as well.
Starting point is 00:37:10 And, you know, maybe some, you know, software engineers who are listening may not like this, but one thing that's interesting when you think about software engineering is you're building something for an end user, right? So you want to develop empathy for the end user. So if you think about a software engineer going back to Lyft, as opposed to a data engineer,
Starting point is 00:37:34 the software engineer, you know, in an ideal world, is trying to build empathy with someone who's trying to get a ride, to book a ride, right? And, you know, what's happening that creates friction there? And I want to have empathy as I build this, right? What's interesting about data engineering is your end user is someone in the business.
Starting point is 00:37:57 So would you say that if you think about a software engineer, to your question, Costas, and would love your take on this as well, Costas, if you have kind of an interest in the mechanics of the business itself, maybe even beyond sort of the experience of the end user, that's kind of a false dichotomy, right? Because they're, you know, inherently related. But you said earlier, Par, you have to understand the business, right? And
Starting point is 00:38:27 sort of the way that it works and generates revenue and all that sort of stuff. Would you say that a predisposition to being curious about the business is a good sort of prerequisite for being a data engineer or does that matter? I wouldn't say it's a prereq. It's something that I picked up along the way, and I think that, by necessity, you kind of pick it up along the way a little bit. But yes, it is essential that you have those ears open, right? You have your eyes open and you have your ears listening for the ways that your process is making other people's work easier. Like, right, just like you said, yes, a software developer goes to a company to build, you know, empathy around the experience. Our job is to build empathy for that software engineer
Starting point is 00:39:25 to make all the pieces that they need so that they can do their job, right? The platform that they need. And then, again, get the data, and also be able to grab data and then hand it to the, again, data scientist, and then get back to the software engineer: hey, these are the ways that your software is being used. Maybe these are some of the things that you haven't seen. And again, make that loop possible, right? So yes, you have to know the business because, again, you are the centerpiece. You are the piece that moves data around a company. So you're going to touch all sides of the business. And it's very important when you are in those meetings with the different business stakeholders that you really listen. I think a big part of data engineers' job, our job, is to listen, to be very honest. And then taking those things that you've heard, taking them back with you
Starting point is 00:40:27 and turning them into requirements for what you have to do in your job. Yeah. Now that tells you how you should store the data in a way that kind of matches those needs, those business needs. Right. And to kind of close that up, the empathy: as a data engineer, I feel like we live for the process, right? Like, yes, maybe it's not that glorified app that we designed and made so much simpler for the person to click, you know, the buttons and get that ride quicker. But we made that process possible. We made all those pieces work. Even though we weren't in the front-facing part of that, we're in the back, again, connecting the dots, right?
Starting point is 00:41:13 Yeah, I agree with Par. I would say, let's say we were interviewing for a data engineer role, right? I don't think that I would pay that much attention to how interested the person is in the business itself. At that point, I think that's relevant for everyone, in the end, who works in the business. If you ask me, for example, the first thing that came to my mind when I was listening to you, Eric, is data analysis.
Starting point is 00:41:46 Like, yeah, if we are talking about the data analysis, being curious about the business is important because you have to understand the business to go and do data analysis, right? And understand, does this make sense or it doesn't make sense? So you can go and figure out if something went wrong, like on working with the data. But for a data engineer, I don't know. As Par said, we are talking about people that are in the background. And that's part of how it is. And it's a good thing.
Starting point is 00:42:20 I mean, it's not bad, right? There's nothing wrong about that. But you learn about the business, let's say, anyway. You cannot not learn, let's say, about that. And that's relevant for every engineering role, right? Go and speak with someone in operations, for example. They know a lot about the business because, depending on the business, they know if they have to be on call or not, right? So I wouldn't say that there's some difference between the data engineer and anyone else. The main difference that I would see there is that the customer of the data engineer is internal, right? It's like the marketing team, it's the sales team, like the RevOps team or whatever.
Starting point is 00:43:06 And it's not necessarily, let's say, only the person who uses the Lyft app to call a Lyft and go somewhere. So that's the main difference. But let's put it in a different way. Let's say I have a person who is more curious about the business than about the data-related problems of, for example, how much data you have to work with and what kind of systems you have.
Starting point is 00:43:33 If someone was more interested in the first thing and didn't ask any questions about the second, I would be worried. Probably there's something wrong with the career path of the person's choice. Yeah, probably. If I have someone who's really interested in the data-related problems and also makes the connection with the business, that's amazing. But if it doesn't happen, that's again fine.
Starting point is 00:43:57 Yep. Yeah, I think that's a super helpful perspective. Okay, we're closing in on the end here. So we have time for one more topic. Brooks, just let us know. One thing we'd like to chat about is a term you mentioned as we were prepping for the show, Par. The modern data stack is something that we've discussed. I'd love your take on two things. And I'm sure Costas will have some questions as well. You mentioned that you have pretty strong opinions about, you know, what are the five
Starting point is 00:44:27 tools that a data engineer needs to know? So I'd love to get that list from you because we always love, you know, an opinionated take on the tooling. And then the other thing I would like to know is if we think back to 2005 and Talend and then Hadoop Spark, you've seen the modernization of the data stack. And there are lots of definitions for that. It's hard to nail down. But I'd just love your take on what is the modern data stack and what does that mean to you in the context of what you've seen over the past decade and a half? Yeah. Yeah, great question. So if I had to really summarize it, I would say the modern data stack now, again, is the cloud. It's on the cloud. And I,
Starting point is 00:45:14 and I'll explain that in a second. Let me get into the five tools that you need to learn. And these are, I would say, backed again by a lot of data, by that Dice report that I mentioned to you; these are some of the skills that were top listed in it. I would say as a data engineer, you've got to learn the basics, right? This is, I would say, number one: basics. And I would categorize that as just your basic Bash terminal programming, Python, SQL. And I chose Python as my language. I know there's a lot of languages out there for data engineers, but Python by far is the most dominant one.
Starting point is 00:45:52 And it's, again, that bridge between data scientists and data engineers. So it's a great language, again, built for data engineering. The second, number two, I would say Docker and Kubernetes. That's becoming important especially for designing cloud-agnostic data pipelines, pipelines that work across clouds, and containerization. Now everybody is, you know, designing serverless, microservices, these sorts of architectures, which is again part of the modern data architecture. And Kubernetes and Docker are at the heart of that. And this was, again, the number,
Starting point is 00:46:34 I think Kubernetes actually was the number one skill for data engineering in that Dice report. If you go to the third, something traditional that's been around, your number one big data tool: Spark. It's still, you know, for your batch processing, billions of records, Spark is the tool to go to. And again, it's very easy to learn Spark nowadays because there's great documentation, a lot of good resources out there. Number four, now I'm going to move to a little bit of stream processing
Starting point is 00:47:07 and real-time processing. And this is a tool that is just a foundation for connecting almost all the data pipelines in a real-time sense, which is Kafka, right? Your Kafka, Apache Kafka. And number five, I would come a little bit on the orchestration side. And there's a tool now that's becoming heavily dominant on the orchestration side because, again, as data engineers, part of our job is to connect the pieces together, orchestrate data flows for data pipelines. And that's Apache Airflow. Apache Airflow now is almost becoming the number one orchestration tool. But I want to now come back and say, as a modern data engineer, what I've seen in the industry in almost every project that I do nowadays, it starts on the cloud and they're on the cloud,
Starting point is 00:48:04 and everything's moving to the cloud. So you need to learn these skills on the cloud. The first two are agnostic across clouds, right? Bash, Python, SQL, Docker, Kubernetes. Cloud vendors don't even have a different name for a Kubernetes service. They call it Kubernetes. But the last three, each cloud vendor has kind of their own version, and they call it something else, right? Spark on Amazon is Amazon EMR, and on Google it's called Dataproc,
Starting point is 00:48:34 and on Azure it's Databricks, the company that is behind Spark. Kafka on Google is called Pub/Sub, on Azure it's Event Hubs, on Amazon it's Kinesis. Airflow has different names, etc. And I say that, again, because even in my own job, I would say we wouldn't even take jobs that are not on the cloud anymore, because they take so much longer to set up and maintain. And that makes us deliver a lot later, even as consultants. And that's a bad thing in consulting, right? You want to go as fast as you can. And the cloud vendors just have made that so easy.
Starting point is 00:49:13 They've removed all the complexity of managing these systems, all the complexity of the scalability. There is a huge move towards this serverless event-driven architecture, like the use of cloud functions. That is huge on the cloud. The cloud functions, they're event-triggered. They're just a small piece of code that you write that could do a data transform that could do nearly everything.
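Editor's sketch, not from the episode: an event-triggered function of the kind Par describes is, at its core, a small handler that receives one event and returns or forwards a transformed record. The event shape used here (a dict with a JSON string under `"data"`) and the field names are assumptions made for illustration; each cloud provider defines its own handler signature and event format.

```python
import json


def handle_event(event):
    """Transform a single incoming event.

    In a managed service, the platform would call a function like this
    once per event and handle scaling; here it is plain Python.
    """
    record = json.loads(event["data"])
    amount = float(record["amount"])
    return {
        # Normalize the email so downstream joins are consistent.
        "user_email": record["email"].strip().lower(),
        "amount_usd": round(amount, 2),
        # Hypothetical business rule: flag unusually large amounts.
        "needs_review": amount > 1000,
    }


if __name__ == "__main__":
    sample = {"data": json.dumps({"email": " Rider@Example.com ", "amount": "12.5"})}
    print(handle_event(sample))
```

Because the handler keeps no state between calls, many copies of it can run in parallel, which is what lets a platform scale it out without extra work from the author.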
Starting point is 00:49:41 And the cloud itself completely takes care of the scalability, the fault tolerance, all of those topics that we have to worry about. Immediately, your function can scale to millions of data points and can be triggered in real time to act on data. And they're so easy to deploy. With one Bash command line, you can deploy your code running from your machine to the cloud that is scalable to millions and millions of instances. That is so powerful these days. And if I take a step back again, I think in a modern data stack, there is a distinction between solutions and products. And we have to be very careful with that. There are a lot of different products that are out there, but there are very few solutions. And I want to say, if you think of it with a house-building sort of analogy, a product is a nail gun. A solution is a framed house, right?
Starting point is 00:50:49 And so don't get bogged down so much in the products, I would say. And clouds are kind of a solution now, because they give you all of those products. They give you that framed house, almost. And, I mean, of course there are other solutions out there, you know, something like Snowflake, Databricks. Those are now solutions. They meet a need; they're not a product. There are a lot of products out there, you know, like products around different ML, different machine learning libraries,
Starting point is 00:51:22 the different audio processing libraries, different video processing libraries, code security, scanning code, you know, for faults. Those are, again, products. And in the end, those products, I think, will move to the cloud at some point in the near future, right? All of those products, all of those technologies, if you talk to them, if you're on their board,
Starting point is 00:51:44 they're kind of like, okay, what do we have to do to maybe sell to one of these cloud vendors at the end of the day? That's the exit strategy, right? So coming back to the solution, I think it's very important in a modern data stack. What are the companies that are providing that framed house, right? And that's very important for us to look at. And again, as far as being a data engineer, just going back to those Python, SQL, Kubernetes, Docker,
Starting point is 00:52:14 Spark, Kafka, Airflow, I think if you learn, I know that's more than five. I kind of do Python and SQL as one. But if you learn those and learn those on the cloud, I can guarantee you that you would have a job in this industry. That gives you the base.
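Editor's sketch, not from the episode: of the tools in that list, orchestration may be the least self-explanatory. The core idea is that a pipeline is a set of tasks with dependencies, and the orchestrator runs each task only after the tasks it depends on. The example below uses Python's standard-library `graphlib` (Python 3.9+) to show that idea; it is not Airflow's API, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_pipeline(pipeline, actions):
    """Run every task after its dependencies; return the execution order."""
    order = list(TopologicalSorter(pipeline).static_order())
    for task in order:
        actions[task]()  # call the function registered for this task
    return order


if __name__ == "__main__":
    # Each action just announces itself; real tasks would move or transform data.
    demo_actions = {name: (lambda n=name: print("running", n)) for name in PIPELINE}
    run_pipeline(PIPELINE, demo_actions)
```

A production orchestrator adds scheduling, retries, and monitoring on top of this ordering step.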
Starting point is 00:52:30 You can learn everything else from there. Yeah. This is great, Par. One last question. And I think we can conclude this conversation today. I mean, although we have to have at least another one, I think there are like many things that we can discuss
Starting point is 00:52:49 and there's value in doing that. So let's say I want to get a taste of data engineering, right? Like, I'm not a data engineer. Can you give me a sample, a small project that would be a good way for me to get a taste of what it feels like to be a data engineer? Something that potentially I could also put, let's say, on my GitHub, so I can demonstrate it. Yeah. Yeah. Well, I would highly recommend going on Kaggle, right? Kaggle is a great data science and data engineering tool. They have
Starting point is 00:53:25 a lot of challenge projects, where you can actually get into these live challenges and compete with other data engineers and data scientists. But you can also look at the historical projects that they've had and pick up on some of those. I think that's a great, great source. There are tons and tons of examples and projects there, and projects that have data with them. And I know you can kind of agree with this, because it's very hard to get your hands on big real-world data sets. A hundred percent. Yeah. And there are some sites for that too. DataHub.io, I think, is a good site for that. There are a lot of governmental agencies. Like in our course, we actually took all the flight data.
Starting point is 00:54:09 So the FAA actually publishes all domestic flight data, like where these flights took off, where they were going, what airline. A lot of that is public data. And we use that and it's a great volume of data. We use that to build our course. And not to take this opportunity to promote our program, Data Stack Academy, but we do have a lot of these projects.
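Editor's sketch, not from the episode: a first portfolio project on public flight data could start as small as counting flights per airline from a CSV file. The column names (`airline`, `origin`, `destination`) and the sample rows below are made up for illustration; a real public data set will have its own schema and far more rows.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for a downloaded flights file.
SAMPLE_CSV = """airline,origin,destination
AA,PDX,SFO
AA,PDX,SEA
UA,SFO,PDX
AA,SEA,PDX
"""


def flights_per_airline(csv_file):
    """Count rows per airline from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row["airline"] for row in reader)


if __name__ == "__main__":
    counts = flights_per_airline(io.StringIO(SAMPLE_CSV))
    for airline, total in counts.most_common():
        print(airline, total)
```

To run it on a real download, the `io.StringIO(...)` argument would be replaced with an opened file, for example `open(path, newline="")`, and the column name adjusted to match the file.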
Starting point is 00:54:36 Actually, the first two chapters, we have a 10-chapter course. The first two chapters are free. If you go on our site, datastack.academy, there's a site to get started for free. We actually send you two chapters that have a lot of these projects and, again, that data set, this FAA data set that we're talking about. And you can get started.
Starting point is 00:54:56 There's a lot of good projects there. And we force you actually to go on GitHub. So you have to develop those on GitHub kind of by design. That's amazing. That's amazing. Eric, anything else that you would like to add? This has been a great show, Par.
Starting point is 00:55:12 We really appreciate you taking the time. And it's been fun to just hammer on the definition of data engineer and talk about some of the specifics of the role. I think that's helpful both for people looking to get into it and people who have been doing it for a long time and even running teams. So appreciate the perspective. Thanks again. Thanks. Thanks so much for having us. I'm a huge fan of the show, longtime listener. You guys are doing amazing stuff. Please, please continue to do what you're
Starting point is 00:55:40 doing. And we really appreciate you as listeners. Well, thank you so much. We love that feedback. And if anyone's listening, please give us feedback. You can do that on the form on our website at datastackshow.com. So thanks again, Par. Take care. All right, Costas. Why don't you answer the question?
Starting point is 00:55:56 Is being a data engineer fun or is it not fun? I don't know. I would say that it's fun. But I think that's the outcome of this conversation that we had with Par. And what is really, really interesting is not so much, okay, that we discussed the technologies and all these things, but it was very interesting to hear
Starting point is 00:56:17 from him also about the personality traits that someone has as a data engineer. And I found this super, super, super interesting. So it's not for everyone, obviously, but there's huge demand out there. So if anyone's thinking about it, give it a try. Yeah, for sure. I think one of the things that I, this is the very beginning of the episode,
Starting point is 00:56:44 but, well, Par mentioned this in the middle of the episode, data engineering is so big. And I thought, I mean, that's a simple statement, but it's really true. And one thing that I recalled from the early part of the conversation when he said that was when he was talking about the early days at Talend and how it was so cool that they output Java code that you could use to sort of customize that last 30% of your pipeline that you're building. I thought, I bet we have listeners who might not be familiar with Talend because they're young and early in their career and they're just using a completely different set of tools. And we probably have other listeners who remember when Talend implemented the Hadoop Spark componentry and that was game changing. And I just thought, man, data engineering is big and it now spans multiple decades and it's just fascinating. It's really fun to just be able to talk about things that hit on both the history and the modern stuff that we use.
Starting point is 00:57:50 So that was my takeaway. I appreciated it. The history lesson, if you will. Yeah, absolutely. And hopefully we will have him back like in the future. We have many more topics to cover with him. I know we need to actually have Brooks start bringing some people back on because we always say we're going to do that. And then we get busy and we don't do that. So that's our New Year's resolution for the Data Stack Show. All right. Well, of course, subscribe if you haven't. You can get notified of new episodes and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
Starting point is 00:58:22 podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
