Software Huddle - The Real Work of Data Engineering with Joe Reis

Episode Date: February 27, 2024

Today, we have Joe Reis on the show. Joe is the co-author of the book Fundamentals of Data Engineering, probably the best and most comprehensive book on data engineering you could think to read. We talk about the culture of data engineering, its relationship with data science, the downside of chasing bleeding-edge technology, and approaches to data modeling. Joe's got lots to say, lots of opinions, and is super knowledgeable. So even if data engineering or data science isn't your thing, we think you're still going to really enjoy listening to the interview.

Transcript
Starting point is 00:00:00 How do you define a data engineer? And I guess, like, how is that different than a software engineer for those that are maybe less intimately familiar with this world? We define a data engineer as somebody who manages the, you know, the data lifecycle as it pertains to the role as a data engineer, right? While, you know, embracing the undercurrents of security, orchestration, data management, architecture, software engineering, and so forth. Do you think that data engineers are not always given as much credit or respect within an organization as maybe other forms of engineering? I think that's definitely been historically the case for sure. But I would actually extend that to data in general. Data is typically misunderstood.
Starting point is 00:00:41 Yeah. Do you think the data lake, as we historically know it, is just going to go away? Hey folks, it's Sean from Software Huddle. And boy, do I have a great one today because Joe Reis is on the show. Joe is the co-author of Fundamentals of Data Engineering, probably the best and most comprehensive book on data engineering you could think to read. One of the things I really liked about the book is that it's really focused on fundamentals, which, you know, hence the name. It's not about specific technologies, but more the core principles that are widely applicable to any data stack. And beyond the book, we talk about the culture of data engineering, its relationship with data science, the downside of chasing bleeding-edge technology,
Starting point is 00:01:21 and approaches to data modeling. Joe's got lots to say, lots of opinions, and is super knowledgeable. So even if data engineering, data science isn't your thing, I think you're still going to really enjoy listening to the interview. All right, last thing before I get you over to the show, Alex and I will be in Miami in April at the Shift Developer Conference, speaking and also doing interviews. If you're in the area, you should come on by, see some great talks, say hi to us. You can learn more about the conference at shift.infobip.com slash US. And as always, feel free to reach out to Alex or myself on LinkedIn or X if you want to connect or share feedback about the show. All right, let's head over to my interview with Joe Reis.
Starting point is 00:01:58 Joe, welcome to Software Huddle. Hey, what's going on? Not too much. As I was mentioning, I'm here recording, not my usual spot today. I'm recording from my day job's office, so don't have my normal backdrop. Got this rather boring backdrop. But hopefully the sound, the office crowd doesn't come in until much later in the day and it doesn't get too loud in here.
Starting point is 00:02:20 Okay, perfect. Sounds good for now. Yeah. Anyway, I'm excited to talk to you. I've listened to a lot of hours of your podcast. I know you're not going to hold back. You're going to bring the heat, which I like. So I thought to start off, we could talk a bit about your journey to where you are today. I know you studied math in university. And how did you kind of go from there to machine learning, data science, data engineering to where you are now? I mean, back then, I would say it's a pretty natural progression career-wise. Back when I was studying math, there weren't a lot of career options available.
Starting point is 00:02:51 Data wasn't a cool thing either. So yeah, I think that the path, this is in the late 90s, early 2000s, your path with a math degree was teach, go work for the government, become an actuary. You could probably go be a bartender. I was actually a DJ at clubs, so that's how I made my money. Yeah, then I got a job actually as an analyst, but more doing data science-y type work. So, you know, a lot of predictive work, a lot of optimization type work as well. So that
Starting point is 00:03:27 was, I guess, a good application of my degree. But fast forward to, you know, the 2010s, and that's when machine learning, I think, was taking off. I think I started getting interested for real in it back in 2009, I think it was. It was like, I've got to probably transition to this, because GPUs are starting to become available to the public for machine learning purposes. I think the cloud was also sort of facilitating, you know, machine learning. I remember computers back then were pretty,
Starting point is 00:03:55 pretty crappy. So it's like you needed all the horsepower you could get. So, yeah, I think we were playing and using like an Xbox back then for machine learning purposes. But yeah, anyway, that's how I got into it.
Starting point is 00:04:10 But yeah, I haven't looked back since. And that was a self-taught journey at that point in terms of learning? Yeah, self-taught. I mean, there weren't a lot of resources out there at the time. I mean, I think popular packages might have been like Weka or something like that back in the day. Remember that one? Yeah. But it was just, you know,
Starting point is 00:04:27 that kind of got me into data engineering because I was working at a machine learning startup. We were doing automated machine learning. This was in the early 2010s. And, you know, back then there wasn't a playbook for, I guess, whatever you call MLOps now, right? You'd have to kind of roll your own model hosting and all this stuff. And I came up
Starting point is 00:04:43 with an automated feature engineering system way back in the day. And there wasn't really a rule book for any of that. So you just kind of had to make it up as you went along. But you got to do what you got to do. Right. Yeah, so I read your book recently, The Fundamentals of Data Engineering.
Starting point is 00:04:58 Condolences and thank you. Well, I had a couple observations. So first thing, I've interviewed a number of people who've written books. And I'm not going to name names or anything, but let's just say not all business books are created equal. There's a large number of business books that are essentially like a 1,500-word blog post stretched into 140 pages of filler. And I would say your book is absolutely not this. It's basically a pretty meaty textbook. And I was actually reading it a few weeks ago on a Saturday night after my wife and I got the kids down. And my wife came into the living room and she said, saw me reading
Starting point is 00:05:30 a textbook on the couch. And she's like, are you reading a textbook? Are you some kind of sociopath? And the other thing, there's, I think, a lot of like tactic-based books in the space of data engineering, you know, Spark for data engineers, Hadoop, or Databricks for dummies or something like that. But your book is essentially, for the most part, technology agnostic. So why was it? Or why did you think it was important to sort of focus on fundamentals and start there? I mean, there really wasn't any coverage of the fundamentals of data engineering back when Matt Housley and I decided to write the book. As you point out, there's no shortage of blog posts. There's no shortage of books, really, on, you
Starting point is 00:06:11 know, data engineering on, you know, technology X, Y, or Z, or, you know, various cloud platforms. And those are great books. But up to that point, I don't think anyone had really taken the time to describe data engineering from first principles, you know. And so that was actually the hard part of this book: if you kind of peel back the field, like, what is it exactly? And so that was the task. Interestingly enough, you know, our acquisitions editor at O'Reilly, she told us not to do this book, because she said it was going to be really hard for two first-time authors to try and define an entire field from, you know, the ground up. But, I don't know, we told her we're kind of dumb, kind of crazy, so why not? Then she eventually came around to our view of the world. But I can't say it was easy, you know. When
Starting point is 00:06:56 you try and define something from first principles and you don't want to have the crutch of a technology, I would say it's not easy. Yeah, I mean, you have to kind of take a step back each time that you're describing these things. It's like, if you, you know, think about a particular technology, it's like, well, what does that technology actually enable people to do? And then how do you sort of define that as a larger thing, and where does it fit into this, you know, world that we call data engineering? Exactly. And the crux is definitely understanding what's not going to change as quickly in a field that changes very quickly, right? And we distilled that down to a few things. It's
Starting point is 00:07:31 like, you know, I don't see any situation where you're not going to obtain data from some sort of a source system. There's no situation where you're probably not going to have some storage mechanism for that data, ingestion patterns and queries, transformation, modeling, and then serving it. I think all those are fairly rugged. They'll be around for a bit. Then all the undercurrents, I think, are what supports it all.
Starting point is 00:07:56 It was an interesting thought experiment. I don't know that I would change anything on it. So I'll probably add a couple things, but that's about it. You can come up with a new version in a few years. Matt and I are thinking already about what we'd add in the new version. I think orchestration probably is going to be a standalone chapter, but I'm not going to give away too much yet because I haven't formulated it. But, you know, you start nitpicking things about your book after it's written.
Starting point is 00:08:18 And the idea was for it to be relevant at least five years from now. And I think it will be. Yeah. I mean, if you look at like sort of classic first principles books, you know, like Knuth's books, for example, The Art of Computer Programming, you know, he made up a programming language to describe these things. That's so that you're not even locked into the technology of the programming language, and, you know, languages fall in and out of popularity. He basically said, well, I'm just going to ignore that and essentially make up my own language to kind of describe these different things.
Starting point is 00:08:53 And those started, I don't know, like, you know, 40 years ago, and they're still as fundamental today. So I think that with the sort of first principles approach, you create these kind of classic works that people can go back to 10, 20 years down the road. Oh, for sure. Yeah, all these books were great. Then also Martin Kleppmann's Designing Data-Intensive Applications. I thought that was definitely the de facto data engineering book at the time, even though it was actually written for software engineers and not data people, which I thought was interesting. But, you know, that was the one that everyone
Starting point is 00:09:27 gravitated towards. It's one of the greatest books of all time, I think. At the same time, you know, if you look at how data engineering changed from the time Martin published that book in, what, 2016, 2017 to today, or when we started writing our book in 2020, I think a lot changed, right? The tools became a lot more abstract. A lot of the underlying gory details that he outlines in the book, I would argue most data engineers at some point should be familiar with, but you don't need to operate at that level in your day-to-day job typically anymore, right? Things have just gotten a lot simpler. So that allows you to kind of step back.
Starting point is 00:10:04 I would say our book is a prequel to his book. You know, I'm currently reviewing the next version of his book, and it's good. It's coming along. Okay. Yeah. In your book, you mentioned, at least at the time of writing, there were 91,000 unique definitions of the data engineer. So how do you define a data engineer? And I guess, like, how is that different than a software engineer for those that are maybe less intimately familiar with this world? Yeah, yeah. The number came from just "what is a data engineer?" I did a unique query search in Google on that, and that came up with 91,000 results. And so that was interesting. But we define a data engineer as somebody who manages the data lifecycle as it pertains to
Starting point is 00:10:48 the role as a data engineer, right? While embracing the undercurrents of security, orchestration, data management, architecture, software engineering, and so forth. And the TLDR is definitely that a data engineer gets data from somewhere, does something useful to it, and
Starting point is 00:11:04 serves it for downstream stakeholders like analysts and machine learning and AI use cases and maybe reverse ETL and similar use cases. So they're sort of the bridge between software engineers and downstream data use cases, so to speak. Yeah, so then if you think about software engineering as potentially building, I don't know, an application that's going to be used by some end user
Starting point is 00:11:29 in a B2B, B2C sense, the sort of end user for the data engineer is the analyst, the data scientist, the AI engineer. Yeah, exactly. And I think that's as it stands today. I mean, there's definitely an argument that these roles will be kind of melding together at some point and perhaps evolving, especially as data use cases, which traditionally are kind of maybe more internal facing, become more external facing and application based.
Starting point is 00:11:55 And then at that point, I think there's more of a feedback loop between data and whatever software engineers are doing. Right. And so I think that's sort of the next progression. We should talk about the last chapter of the book. We call it the, quote, live data stack, kind of a tongue-in-cheek play on the modern data stack. But the notion really is that feedback loops become tighter,
Starting point is 00:12:15 streaming becomes a first-class citizen, and then event-based architectures mean that there's really no definitive line between applications and, quote, data use cases at the end of the day. It's all sort of the same thing. But we'll see if that happens. I think it will. Do you think that data engineers
Starting point is 00:12:35 are not always given as much credit or respect within an organization as maybe other forms of engineering? I think that's definitely been historically the case for sure. But I would actually extend that to data in general. Data is typically misunderstood, probably misapplied and underutilized. I think part of that is there's a sense of FOMO around data
Starting point is 00:12:58 where if I'm not doing data, then I'm obviously doing something wrong. So you'll hire a bunch of data people but have really no idea how to properly utilize them. Whereas with software engineers, for example, right?
Starting point is 00:13:08 Like an application is, I guess the impact of what a software engineer does is very immediate, right? I make a feature and the feature's out there, I make tweaks to the feature and those tweaks are, you know, available for use and so forth.
Starting point is 00:13:24 So, whereas data is a bit more silent in some ways, right? It's sort of, it's like air. It's everywhere, but you don't see it. But it impacts a lot of things and it powers a lot of things. But I would say a lot of it's just due to the immaturity of the data field. I always like to say that data feels like it's about, you know, 5, 10, 15 years behind software. So, yeah, where would you say,
Starting point is 00:13:46 so if you have software engineering, maybe that's sort of the most developed discipline in the engineering space. And then you have data engineering. And then I guess like now you have sort of AI, ML engineering. Like where would you put data engineering between those on, like, a spectrum of maturity? I think you kind of nailed it, really.
Starting point is 00:14:07 The data engineering is sort of in the middle. ML engineering, because it's just newer, that's going to be the least mature. But ultimately, we do have a guidebook, though, and a good rubric, and that's just paying attention to what software engineers have already done for a while, and I think modeling what works and tweaking it for, you know, our specific use cases. Software's done a lot of great things. You know, they've done a lot of things correctly. I think they've kind of paved the way for the fields. It's funny, because if you look, there's DataOps, there's MLOps, and these are all just borrowing practices from software engineering.
Starting point is 00:14:51 Yeah, I mean, it's a sign of maturity of any discipline where you get more specialization. If you look at being a medical doctor from 100 plus years ago, one doctor was delivering a baby, pulling a splinter and performing surgery. But now you have people who just specialize in surgery of one specific organ or one specific type and they're specialists in that thing. And I think we're getting, we're not quite, maybe quite there in engineering, but there's more and more specialization because the scale of the systems are much more complicated, bigger. It's hard for anybody, one person essentially, to be able to be an expert in everything and keep all that stuff in their head.
Starting point is 00:15:30 I mean, yeah, software engineering, it definitely has progressed. I think the medical analogy is a really interesting one, right? Because you used to be kind of the general doctor that would go deliver babies in a barn and, you know... Apply the leeches. Yeah, leeching and bloodletting and all that stuff back in the day. And you realize things mature, and, you know, there's a lot to borrow. But when I say, you know, that data feels about 5, 10, 15 years
Starting point is 00:15:56 behind software, it's not always going to be a linear kind of look back, if you know what I mean. Like, there's definitely things to borrow, so hopefully that means that the data field can catch up a lot more quickly. You already have a good rubric to go off of. Stuff does work. I guess the more that these fields kind of blend together with the use cases of data powering more applications and so forth and AI models becoming more front and center, I think you're going to see
Starting point is 00:16:24 a pretty interesting intersection over the next few years that is going to change stuff. And it's already changing workflows, too, of software engineers. I mean, if you look at what's going on, a lot of people are using Copilot now and things like that to generate code. And, you know, that wasn't a thing a few years ago. Now it's kind of the default. Yeah, and I would say the other thing too that helps mature an area, as well as bring in new talent where they might have gone to more traditional software engineering,
Starting point is 00:16:56 is how important or what's the hotness of it. And data is really the love language of AI. And of course, there's a lot of stuff going on in the AI world right now. So people are trying to figure out like, you know, how do I fit into here? I want to go, you know, to the companies that are at the forefront of technology, work with the best people. So you're going to, I think, see a natural gravitation towards people seeking more of a data role than probably before as well. Oh yeah, yeah. And it's interesting seeing, you know, the popularity of data has changed, like you said.
Starting point is 00:17:31 When I got into it, it was probably the least cool job you could think of. I mean, you know, at that time everyone was getting their MBA, right? I wanted to be a quant, I wanted to go to finance. Maybe the path I was in was, you know, I'm going to be an actuary because that's super exciting too. Joking. But, you know, it got popular, right? Data science became the cool job and everyone became a data scientist. And then they realized that it's kind of hard to do data science without data, so that's when data engineering came into being. And now it's all about AI. And so who knows? It'll be pretty interesting. With respect to data engineering, how important do you think it is
Starting point is 00:18:12 sort of understanding the fundamentals of computer science, like algorithms, data structures? You know, is understanding that a DAG is used for, you know, query planning and optimization or, you know, maybe a B-tree for some forms of indexing.
Starting point is 00:18:26 Like, does that matter to a data engineer's day-to-day? I think it does. I wouldn't say that you, as a data engineer, a good junior data engineer needs to know all that. I would say what we cover in the book is probably what I would say from a beginner to intermediate level data engineer. But yeah, as you become more senior,
Starting point is 00:18:42 I would expect that you're going to know a lot of this stuff. How would you read an explain plan in a database, for example, right? For a query, if you don't understand, you know, various things like B-trees and so forth, right? That an explain plan will give you. And so I think various indexes, yeah, definitely algorithmic complexity and big O notation and that kind of stuff is super necessary, especially as you're starting to custom build. I think early on when you're starting out,
Starting point is 00:19:16 if you're just using a data ingestion tool like a Fivetran or an Airbyte or Estuary or Portable or something like that and just moving data to a Snowflake, like, you probably don't need to know a ton of stuff, right? You would want to know how to write performant queries and hopefully understand ingestion patterns. But, I mean, a lot of the tooling has abstracted away, you know, a lot of that stuff. That kind of goes back to the whole notion of our book versus, say, Martin Kleppmann's book, which is more about the internals of how all this stuff works, right? But I think at some point, for you to graduate towards being very competent professionally, obviously you need to start knowing this stuff. And having a computer science background, I would say, gives
Starting point is 00:19:54 you a huge advantage over this stuff because you already learned this stuff. But, I mean, you'll have to learn it one way or the other. But there's great resources out there. You know, Database Internals, terrific book, I think everyone should read. It's a bit dense, but that's kind of what you got to do. Yeah, you mentioned this earlier as well, but essentially the tooling's gotten much better. Things are easier in many ways. So where are sort of the hard problems in data engineering today?
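[Editor's note: a minimal sketch of the explain-plan point made in the answer above, not something shown in the interview. It uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN statement; the orders table, the idx_orders_customer index, and the query_plan helper are hypothetical names made up for illustration. SQLite stores its indexes as B-trees, so the change in the plan after the index is created is the kind of thing an explain plan lets you see. Other databases word their plans differently.]

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' text of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail string is the last column of every plan row.
    return [row[-1] for row in rows]


# Hypothetical table, invented for this illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

sql = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index on customer_id, the planner has to scan the whole table.
plan_before = query_plan(conn, sql, (42,))

# A secondary index in SQLite is a B-tree keyed on the indexed column.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index in place, the planner can search the B-tree instead.
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

[The exact wording of the plan text varies between SQLite versions, so the sketch prints it rather than stating it. The thing to look for is a full-table scan before the index exists and a search that names the index afterward.]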
Starting point is 00:20:25 I think the hard, you know, thinking a lot about this, I actually don't think that there's much of a tooling gap at this point for solving classical data problems. And by that, I mean classical analytical problems, where a data warehouse or data lakehouse is needed, right? What I do think is happening is that there's actually a skills gap and a knowledge gap and a competency gap between the people using the tools and the potential that the tools provide. I actually think this is the biggest gap in our industry right now. In the data industry, it's actually not really understanding the fundamental practices, things like data modeling, for example,
Starting point is 00:20:59 right? Things like how do databases work. I think it's because a lot of the tooling abstracts away, you know, a lot of functionality, which is great. That's what technology should do. But at the same time, it helps you to understand what's going on under the hood and understand, again, for data modeling, for example, correct ways of how to think about your data, you know, at a conceptual level, for example, and how it relates back to the business or the organization you're in, and then translating that data down to, you know, something that's performant, you know, from a storage and query perspective, the physical layer of data modeling, right?
Starting point is 00:21:37 And knowing all the techniques. You know, if I talk to data professionals these days, actually a lot of them don't understand or haven't heard of the classical data modeling techniques. If you mention relational modeling to people, they kind of look at you with a blank stare. They may have heard of it once but couldn't really explain it. But, I mean, understanding why it's important to use relational modeling and when you'd want to use it, I think, is, you know, that's just data people. I would say software engineers too. You know, I think a lot of practices are just kind of thrown by the wayside or forgotten
Starting point is 00:22:05 about, and this is... But then consultancies like mine or companies like mine, when we come into situations, we're certainly glad that there's a lack of knowledge and best practices. Because you wouldn't have a business otherwise. I mean, at the end of the day, consulting is just knowledge arbitrage. It's literally all it is, right? But at the same time, you know, I'm writing books on this kind of stuff. I don't see myself moving away from consulting, but
Starting point is 00:22:33 definitely I think my biggest focus right now is just education. I feel like that is single-handedly the biggest gap that we have as an industry. Like I said, there's no shortage of technologies at this point. I mean, new technologies always come about to solve new problems. That's the nature of technology. But I think there's just a huge gap between what we're capable of doing and where we need to go. Where do you think that skills gap is coming from? Are we moving away from perhaps teaching some of these fundamentals at the university level? Or is it people are taking different paths to a career in data where they might be skipping over the fundamentals and be focused more on tooling? It's kind of like, you know, if you go to a boot camp,
Starting point is 00:23:07 80% of graduates from boot camp are usually focused on front end, they're learning React, but then you, you know, you might not be learning sort of the fundamentals of actual, like, computer science and how a programming language works. And you're kind of just focused purely on the technology. I do agree with this observation 100%. If you look at, you know, data boot camps, for example, right? It's like, okay, so what's the first thing we're going to learn? Probably Python and SQL, right? Because those are widely used tools. I mean, the whole intent is to get somebody a job, right? So you can check the boxes on a resume, say, okay, this person gets Python and SQL and so forth. But if I were to ask you, okay, so given
Starting point is 00:23:46 this setup, you know, let's say a company, for example, right? And this is what the company looks like, this is what they do. How would you think about their data needs, right? Like, what are we trying to do in the first place? I think giving people the ability to see and observe a situation is lacking, and the techniques to assess that. And then obviously, as you point out with computer science, understanding big O notation, for example, right? Like, oh, am I going to write, like, three or four for loops nested together? Is that a good idea? You know, or did I just create, like, cubic complexity in, you know, what I
Starting point is 00:24:26 just did there? But if all you know are for loops, for example, you don't understand the impact of nesting these things. Like, I see this all the time, right? Or the difference between stateless and stateful programming, right? So again, you know, if you're using a distributed system, you want to write stateless code, right? And you don't want to write things that are stateful, for very obvious reasons. But this, again, isn't really taught as far as I can tell. So, you know, I see a lot of things. And I think there's a few reasons for this, right? Obviously, people come into it from different angles and are trying to, you know, I think as quickly as possible get the skills that they need to check the boxes on a job
Starting point is 00:25:04 description so they can get a job, and I can't blame them. That's what you probably want to do if you want to get a job. And it just takes a lot of time to master the fundamentals, you know, and it's just not one of the things I think people are incentivized to do, especially at their jobs. It's like, you know, and I'm going to blame a couple of things. I'm going to blame maybe the cult of agile for people that are working at jobs, because, you know, the Agile Manifesto started out as being basically a manifesto that describes how we would continuously deliver software, right? In an iterative fashion. But what this also means is that people took
Starting point is 00:25:39 that and started thinking that, okay, two-week sprints is the same as being agile. But it's really hard to, I think, to sit down and master the fundamentals and really think if you're only operating in two-week sprints, for example, right? And I think this is one of the dichotomies in data is that we're trying to apply a software engineering framework to data, which is fundamentally different in a lot of ways, right? With data, you're trying to compile data, in a large case, to get context to the entire enterprise. This is not the same thing as delivering features
Starting point is 00:26:13 as you would in software engineering. But now we have to, you know, data is very much a thinking person's sport, as I say in a lot of my talks. And this is a fundamentally different thinking exercise than plowing away at two-week sprints. So I think that's also part of it, where you just don't have the time to sit back and really assess, you know, what are we trying to do from first principles, and like what should I learn from a best practice standpoint.
Starting point is 00:26:35 And so, yeah, I'm sure I could blame a lot of other things, but I'll blame those for now. Well, I think when you sort of, you know, blindly potentially apply like a methodology, you might not realize there's also consequences to choosing that path. Like, it's just like, you know, if you think about, you know, certain KPIs, you know, we tend to optimize the metrics that we measure. So if you measure the wrong thing, what does that end up doing? It might actually steer you in the wrong direction. Like, I know during my time at Google, everybody's focused on performance reviews that happen
Starting point is 00:27:08 twice a year. So if something is a project that takes longer than six months, people are more reluctant to do it because even if it potentially has more impact and it's the right thing to do, but what do I show on my next performance review? And that impacts the bonus I get at the end of the year. It impacts my ability to get promoted and all this sort of stuff. So you become sort of hyper-focused on these short-term wins and that leads to short-sighted, I think, choices
Starting point is 00:27:35 when it comes to building a product. That's really interesting, Sean. Yeah, you're absolutely right. I think it was Charlie Munger. He's Warren Buffett's business partner. He said it best: if you show me the incentive, I'll show you the outcome. So, you know, what you described very much fits
Starting point is 00:27:50 that. I mean, you know, if you have a performance review, if that's what you're measured towards, that's what you're going to improve upon. Simple as that, right? So, yeah, it definitely does come down to exactly what you described. And so that's for good and for bad, right? So I want to talk a little bit about, you know, what's kind of going on in the world of data. So you started with like data warehouses. How has that area sort of changed over the past decade or so? And where do you see it going? I mean, I think the biggest shift is really the
Starting point is 00:28:25 movement to the cloud and the modern data stack. So if you go back 10 years ago, or even further than that, I would say 15, for argument's sake, right? I mean, your options at the time, if you're a company and you want data warehousing capability, you could obviously roll your own relational database and that'll get you some
Starting point is 00:28:41 of the way there. It's fine. A data warehouse is meant to be an architectural paradigm. It's not meant to be, um, a specific technology, as, you know, Bill Inmon would point out. Um, but with respect to the data warehousing technologies, I think what's changed is, you know, the modern data stack, um, you know, I think really democratized, um, you know, the use of data warehousing. And by that I mean November of 2012 is when I sort of put the, at least in my opinion, is when the modern data stack started. And that was with the release of AWS Redshift.
Starting point is 00:29:14 So before you'd have to get these expensive contracts for data warehouses, usually on-prem. If you're getting Teradata, for example, you might be charged by the amount of cores that you're using and that can be very expensive. There's a lot of onerous details in that.
Starting point is 00:29:31 Redshift came out and they're like, fine, it's 25 cents per hour for a core. Digital processing unit or data processing unit at the time, DPU.
Starting point is 00:29:39 That was pretty cool. So for pretty cheap, you get a data warehouse and it runs in the cloud, and that's pretty cool. And that ushered in a lot of new technologies, right? Cloud-based data ingestion tools, ELT became a thing for better or for worse. Snowflake, I think they started working on that around 2012 as well.
Starting point is 00:30:00 And so you kind of couple this with the rise of data science too, and you started seeing, I think, at the beginning of the 2010s, the data science and data warehousing workloads are quite different. And I think there was actually a lot of animosity between the data warehousing crowd and the data science crowd, where data science was sort of this hot new thing and data warehousing was, you know, this old, stodgy, kind of blue-shirt-and-khakis type of thing. And, uh, you know, and then I remember data scientists were claiming, oh yeah, like, you know, data warehousing is going to die, SQL is going to die. I've heard these pronouncements countless times, right? But that was, like, for years people were saying this kind of stuff, you know, like, oh, we're just
Starting point is 00:30:38 going to all be writing in notebooks at some point, right? And so, you know, I think Spark, uh, blew the lid off of a lot of things too when that came out, right? Um, you know, because for the longest time you had to write MapReduce on Hadoop, and that's painful for everybody. Um, you know, but Spark, I think, made it a lot simpler and a lot faster. And so, you know, that's probably around 2014, I think, Spark open source came out. And then what was really interesting is, you know, Databricks, for example, they sort of, um, you know, they were a data-science-first company back in the day, and I remember using a lot of their stuff and thought it was pretty awesome. Then you start seeing a convergence, though, right? I would say kind of the late 2010s is when you started seeing, you know, lakehouse come onto the scene. And, um, I forgot to mention data lakes.
Starting point is 00:31:28 That was a big thing as well, right? But, so, I mean, the notion was, oh, we'll just collect all this data and maybe we'll use it later, right? But I think what happened was, do you ever watch those, like, hoarder shows on cable? I know what you're talking about. Yeah. Yeah. Pretty awesome. I love watching these shows for some reason. I'm a pretty sick person. But that's what
Starting point is 00:31:49 a lot of people's, a lot of companies' data systems ended up looking like. Yeah, we don't know what we're going to do with this, but we know there's some value in there somewhere, so we're going to hold on to it. Right, yeah. It's like having value in some smelly pizza box that's like five years old that you just want to keep around for some reason. And, you know, so that happened. I think
Starting point is 00:32:09 people quickly realized with the data lake that maybe there are probably other ways to do this, right? Because, like, curating data sets and discovering data sets, I mean, it becomes almost impossible at some point, just because it's like, how the hell do you find anything? You can't, right? And so then GDPR comes along in 2018. And, well, now you have to be able to find your data, if you want to delete it, for example, because you can be fined pretty heavily if you can't. So I think that's when people started taking
Starting point is 00:32:34 at least governance a bit more seriously. Because before that, it's like it was a free-for-all. You know, it's... Well, even now there's a ton of companies that are just sitting on, like, a mountain of unstructured data that's, like, you know, encrypted in a bucket somewhere, and they're like, we can't touch it because it's got, you know, PII or PHI or, you know, something in it, but someday we'll be able to do something with this. Someday. So it's interesting. And then, so, I mean, that's what's changed, I think, is you just saw the combination and
Starting point is 00:33:06 sort of the convergence of data science and traditional data warehousing analytical workloads to the point now where these systems are very much converging. I mean, you have, you know, there's data fabric in Azure, you know, there's Databricks, Snowflake. I mean, they are basically on track to be the same product, in my opinion, feature for feature, at least. Yeah. I mean, they basically want to own all the data at this point. Yeah, BigQuery is dope and Redshift is, I think, catching up. So, yeah, I mean, the lakehouse paradigm is sort of where everything is going towards.
Starting point is 00:33:39 Right. And that was a big evolution. Yeah. Do you think the data lake, as we know it, is just going to go away? I think so. I don't see much of a need for it. I mean, because you can get the best of both worlds with a lakehouse, and you can combine your structured or semi-structured data sets along with your unstructured data. With a management and a governance layer on top of that, I think that's a key distinction, because otherwise you would literally just have a data lake as we called it in the past. And nothing wrong with data lakes. I just think that, you know,
Starting point is 00:34:07 the world's kind of outgrown that chaos that it provides. I mean, you'd have to be a very disciplined individual to do a data lake, you know, or a disciplined organization. I mean, it certainly is done. A lot of companies have done it successfully, but that's, I don't think it's everyone's cup of tea. So this convergence is going on. Do you think that we now also have the emergence of all these vector database companies? Do you see that also moving into essentially a single platform?
Starting point is 00:34:36 Yeah, I think inevitably it will. Yeah. I mean, you're starting to see this kind of workload already, right? Yeah, I mean, I think like MongoDB now is supporting vectors, for example. Yeah, it makes a lot of sense. And so, yeah, I think you're going to see a convergence of all this stuff, especially with the rise of unstructured data sets. Because I think for the longest time, there were definitely people,
Starting point is 00:35:01 definitely companies with a kind of great use case for, uh, unstructured data, but it was sort of this tale of two worlds, right? You had the structured data people over here and the unstructured over here, and these worlds didn't really, um, you know, talk. But now, you know, with the rise of, um, you know, uh, generative AI everywhere, right, I mean, these worlds will collide. They have to. But it's going to pose some very interesting dilemmas. Um, but, I mean, because you can use generative AI with, you know, text, images, all the above in a multimodal setting, then it's like, why the hell not? So you're absolutely right, vector databases will, I think, become first-class citizens in pretty much
Starting point is 00:35:42 every infrastructure, because, I mean, you're going to need that similarity search capability, right? Yeah, I mean, it's a way to essentially take action on the unstructured data. Do you think now that we have more technologies that allow us to actually leverage the unstructured data, that's changing in any way the type of work that data engineers are doing and responsible for? It's a good question, and it's something that I was thinking about last week quite heavily. I did a podcast episode on this and, I think it was titled "Things I Didn't Expect to See" or something like that.
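As a rough illustration of the "similarity search capability" mentioned here, the sketch below ranks stored vectors by cosine similarity to a query vector. It is a minimal, brute-force version written for this transcript, not how any particular vector database works; the names `cosine_similarity`, `top_k`, the document IDs, and the three-dimensional embeddings are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Brute-force search: score every stored vector, return the k best (id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; a real system would get these from an embedding model.
index = {
    "review_positive": [0.9, 0.1, 0.0],
    "review_negative": [0.1, 0.9, 0.0],
    "product_photo": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

results = top_k(query, index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Production systems typically replace the full scan with an approximate nearest-neighbor index, but the ranking idea is the same.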
Starting point is 00:36:13 I don't know. I never remember what I record. But, so, I was at Matillion's demo last week, or their event, hosting it with, with Mark Baccanetti. And we had a bunch of AI announcements.
Starting point is 00:36:28 Every company is announcing some sort of AI feature. What I thought was really interesting with this is okay, so they're letting you do prompts now to create data pipelines. I thought that was interesting. I'm still on the fence of whether that's good or not. Because if you're dealing with a
Starting point is 00:36:44 stochastic system where, if I give you a prompt, I don't know what the output's actually going to be 100% of the time. Yeah, reproducibility is a problem. Yeah, it's like, so am I going to do, like, prompt reviews with my team now? Uh, you know, what would that look like if I were in a CI/CD pipeline, for example, right? So I think that's an interesting one. But hey, it's happening. Yeah, I mean, ideally, like, when I think about something like CI/CD or, uh, you know, I don't know, orchestration or something like that, like, I know reliability and reproducibility are really important. Um, and those are two areas where it is not necessarily the core value of, you know, Gen AI right now. No, it's exactly the opposite.
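One possible way to think about the reproducibility concern, sketched under heavy assumptions: if the pipeline itself may differ from run to run, a CI/CD check can assert properties of the output (row counts, required columns, value ranges) rather than an exact result. Nothing below comes from the episode or from any vendor's product; `generated_pipeline` is a hypothetical stand-in that only simulates run-to-run variation by shuffling row order, and `check_contract` and `REQUIRED_COLUMNS` are illustrative names.

```python
import random

REQUIRED_COLUMNS = {"customer_id", "order_total"}


def generated_pipeline(rows, seed=None):
    """Hypothetical stand-in for a prompt-generated pipeline.

    The row order varies from run to run, which simulates output that is
    not byte-for-byte reproducible.
    """
    rng = random.Random(seed)
    out = [
        {"customer_id": r["customer_id"], "order_total": round(r["qty"] * r["price"], 2)}
        for r in rows
    ]
    rng.shuffle(out)
    return out


def check_contract(input_rows, output_rows):
    """Return a list of contract violations; an empty list means the output is acceptable."""
    problems = []
    if len(output_rows) != len(input_rows):
        problems.append("row count changed")
    for row in output_rows:
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            problems.append(f"missing columns: {sorted(missing)}")
        elif row["order_total"] < 0:
            problems.append(f"negative total for customer {row['customer_id']}")
    return problems


orders = [
    {"customer_id": 1, "qty": 2, "price": 9.99},
    {"customer_id": 2, "qty": 1, "price": 4.50},
    {"customer_id": 3, "qty": 5, "price": 1.00},
]

# Run the "same" pipeline several times; the contract must hold on every run.
for seed in range(5):
    assert check_contract(orders, generated_pipeline(orders, seed=seed)) == []
```

This does not answer the question of how to review or unit test the prompt itself; it only shows that order-insensitive, property-style checks remain possible when exact outputs are not stable.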
Starting point is 00:37:31 It's just like, so, you know, how are you going to unit test these things, for example? Right, I have a prompt. I don't think anybody knows the answer to that. A pipeline, right, it's exactly, it's like, so is this a good idea to do? I don't know, but we're doing it anyway. So then, you know, the other thing they demoed,
Starting point is 00:37:46 which I thought was pretty cool, was the ability to use large language models to comb through a bunch of text data in your data pipeline. I think that was pretty cool in the way that it's just super easy to do. Because this is traditionally a tool that was more for structured data sets
Starting point is 00:38:03 and just querying databases. But now I can come up with sentiment analysis or some sort of analysis on my, say, customer review text data. And so these worlds will collide. And it can do the same thing with images and stuff. So I think it's going to unlock a lot of capabilities that so far probably data engineers
Starting point is 00:38:20 weren't even thinking of, right? Because the workflow is typically okay. So I'll just get data, put it somewhere, serve it for downstream use cases. But those downstream use cases are, well, they're growing, right? So I think that's pretty exciting. And it will change the workflow
Starting point is 00:38:35 of data engineers for sure. Vector databases, again, that's another big one, right? So, you know, I think for data engineers now, you're going to be intersecting the worlds of ML engineering and MLOps in ways that, at least this time last year, you probably weren't even thinking about. Yeah, this time last year most people didn't know what a vector was. So, no, I mean, the hype back in the, you know, through 2021 to now was the feature stores, right? You don't hear much about those right now. Yeah, because you don't need features, and, oh, well, Gen AI will find the features for you, I guess.
Starting point is 00:39:08 Sure. So, you know, you were talking a little bit about sort of the MapReduce Hadoop era. It's probably hard for people maybe entering the world of data systems now to really understand the impact that those systems had on large-scale data problems back in the day. But a lot of those modeling approaches, even things like, I don't know, using one large table in order to overcome limitations of the technology we had at our disposal at a given time. Eventually those things got simpler and the underlying framework was taken
Starting point is 00:39:48 care of in terms of the complexity. So I guess looking at today, what sort of hacks or workarounds are we doing today that you think will go away and eventually just be something that kind of magically happens behind the scenes for us because we have a proper abstraction layer? Good question. I've been thinking about this a lot from the context of generative AI, actually, and what it can do for data modeling. I think the hacks we have right now, and I talk about data modeling a lot because that's what I'm writing a book on. That's pretty much the only thing on my mind at this point. But it has huge impacts, right? Hacky data models mean that you can spend more time than you need to on getting really bad answers, for one.
Starting point is 00:40:31 So double whammy, right? What do I mean by this? Well, if you have an inefficiently created data set, say a giant wide table, right? The impacts can be quite severe. For example, you might have lots of duplicates. You might have lots of redundancies. You probably don't even know what the hell is in there, right? Query patterns are chained together, right? So if you're using, um, you know, certain transformation tools and you don't understand modeling practices well, you're going to basically create
Starting point is 00:40:57 probably just a bunch of tables or one big table or something in between, or just a bunch of queries. I think those are the hacks right now. It's like we're super reactive, trying to answer various questions, and what that means is you have just an enormous amount of sprawl, right? So if you thought the data lake was bad, you know, I would challenge people to think about that, or to at least observe the workings of their own data practices, for example, and ask, okay, so how many queries do I have that are sitting out there right now? How consistent are they in arriving at some sort of base level of truth in the questions I'm being asked, right? Can I consistently answer questions from the
Starting point is 00:41:36 data sets I have? Often, probably not, right? You have very much a divergence of, um, you know, truth. So I think a couple of things I've been thinking of is, okay, so generative AI could certainly help this in some ways, I think especially when coupled with knowledge graphs. Um, I think you're going to see the rise of graphs to provide more context in terms of data that people have. And generative AI, I think, will just be a, um, I think obviously being able to search through data sets is great, and they'll get better. Um, I think also there's a capability of it actually being able to go back and reformat, uh, data sets into a better form, right? I've been experimenting with this myself, and I think there's actually
Starting point is 00:42:15 quite a bit of promise in this. Um, but yeah, it's still super early days. We'll see. But those are the hacks I see right now. It's just, I think, back to the nature of the type of work that people do. You're reacting to questions and needs of the business and constantly firefighting. But it's ironic because we're supposed to be data professionals and knowledge workers and stuff. But we're constantly just always in firefighting mode. Do you see that's because there's, you know, due to like a lack of resourcing traditionally in the space? So you don't have time to sort of take a step back and do the proper planning.
Starting point is 00:42:54 You're more just like you're taking orders at a restaurant essentially and reacting to those in the moment. Oh yeah. I think that's a very good analogy, Sean. Yeah, absolutely correct. Yeah. You're understaffed, under-resourced, especially now. You know, data teams have been kind of cut to the bone, especially, you know, like along with other teams. It's like you're going to do what you can to get by in the day and that's about it, or, you know, what you can to get by in the sprint. Yeah, I feel like all, like, sort of, uh, non-directly-revenue-generating teams, uh, have suffered over the last year in terms of where the cuts
Starting point is 00:43:26 are being made. Yeah, I mean, that's the reality of a business, right? That's going to happen, and that's just, you know, so if you're, um, on those teams that are still around, then yeah, you're going to be expected to do a lot more with a lot less, and that's just how it is. And so, yeah, you're not going to have time. You're just going to do what you can to get your job done, right? Because again, it's about hitting whatever KPI you're supposed to hit, whether that's real or imagined, right? And most companies don't have KPIs. That's the crazy thing.
Starting point is 00:43:52 Most teams I've seen don't even have a sense of a KPI. So you're just going to try to strive towards whatever you think is the right thing to do or whatever you think your boss thinks is the right thing to do. It's all about self-preservation. So of course you're going to, you know. And a restaurant's a good example. I mean, I've worked in restaurants before, maybe you have as well. It's a very stressful environment, right? It's like, I can't think of more of a pressure cooker than a place like that. But that's a lot of teams these days too. So you're absolutely right. Yeah, one of the things you mentioned there was this challenge around data duplication.
Starting point is 00:44:28 We were talking about the lake and the challenges with the data lake. It's become a management nightmare because you have all this data just sitting there. You have to be very disciplined about actually cataloging it to be able to do anything with it. But I think even in the structured world, one of the major problems that companies have is just duplication of data, especially when we're talking about PII, and that leads to this huge sprawl problem. And you create the same problems where you don't know what you're storing or where it's stored. And in the world of GDPR, as well as the 100-plus other regulations that exist in the world, it becomes very, very difficult to be compliant as well as secure the data. So I guess, what are your thoughts on that?
Starting point is 00:45:08 Do you see this as a big challenge for companies that you work with as well? Huge problem. Right? And especially with the popularity of SaaS applications, right? Because now you have the ability to have no control of your data model. And then you have every ability to duplicate data to your heart's content. And all kinds of systems that don't talk to each other typically. Right?
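To make the duplication problem concrete, here is a small sketch of how the same customer can be spotted across systems that do not talk to each other, by grouping records on a normalized email address. It assumes email is a usable matching key, which is often not true in practice; the system names, records, and the helpers `normalize_email` and `find_duplicates` are all hypothetical.

```python
from collections import defaultdict


def normalize_email(email):
    """Trim whitespace and lowercase so trivially different spellings match."""
    return email.strip().lower()


def find_duplicates(systems):
    """Map each normalized email to the (system, name) pairs where it appears,
    keeping only emails that show up more than once."""
    seen = defaultdict(list)
    for system_name, records in systems.items():
        for record in records:
            seen[normalize_email(record["email"])].append((system_name, record["name"]))
    return {email: hits for email, hits in seen.items() if len(hits) > 1}


# Hypothetical exports from three SaaS tools.
systems = {
    "crm": [
        {"name": "Ada Lovelace", "email": "Ada@Example.com"},
        {"name": "Grace Hopper", "email": "grace@example.com"},
    ],
    "billing": [{"name": "A. Lovelace", "email": "ada@example.com "}],
    "support": [{"name": "Ada L.", "email": "ada@example.com"}],
}

duplicates = find_duplicates(systems)
for email, hits in duplicates.items():
    print(email, "->", hits)
```

A report like this only finds the copies; deciding which version of the customer is correct still takes a reconciliation rule that the business has to define.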
Starting point is 00:45:33 Yeah, I mean, the temptation's there. And in a lot of cases, data duplication occurs, or triplication or quadruplication, whatever. How many replications do you want to talk about? I mean, you can do this to your heart's content in all these different systems. No way of reconciling it. So you might have different versions of a customer
Starting point is 00:45:54 or different versions of products. Cool, awesome. That's the world we're in right now, though. So there's this quote from your book where you said, it's easy to get caught up in chasing bleeding-edge technology while losing sight of the core purpose of data engineering: designing robust and reliable systems to carry data through the full life cycle
Starting point is 00:46:16 and serve it according to the needs of the end users. So there's this ever-changing landscape of tools, some of the things that we touched on in this that data engineers can use and learn. Do you think that we get too fixated and in love with tools and these beautifully complex modern data stacks when perhaps something simple could just do the job? Oh yeah, all the time. Well, I mean, you've got to consider it from the angle of an engineer too.
Starting point is 00:46:40 I mean, you get paid to engineer stuff and you're always trying to find the cool new thing, right? Because that's, you just want to tinker with stuff, right? And I think there's an element of resume-driven development too in this where, you know, constantly looking at what's the hot new technology, what's going to help me get my next job, right? I'm not going to be a COBOL programmer. It's not that cool. I could probably make a ton of money doing that.
Starting point is 00:47:02 You know what I mean? But it's like, that's just always a temptation. It's, uh, yeah, I think it's just human nature, though. The grass is always greener on the other side. So, you know, there's always a new open source technology you should try out. There's always a new vendor with a cool product that they tell you you should go try out. And so there's a lot of noise, right? But, you know, you gotta spend your time also focusing on what the business needs, and, um, you know, that's important. And the fundamentals are hard to do too, right? I mean, because again,
Starting point is 00:47:34 I always tell people, if you master the fundamentals, it makes it a lot easier to assess all these new technologies. Because it's like, rarely is there anything that's, like, completely novel and new out there. I would say very, very rarely does that happen. Often what happens is that there are just variations and permutations and, you know, some sort of combinatorial stuff on existing ideas and technologies, and that's how things morph. Very rarely do you see something that comes out of left field that's completely new, right? Doesn't really happen in our field. Yeah. Do you think that a lot of this, I don't know, like getting better at sort
Starting point is 00:48:05 of making these decisions in terms of technology choices, or knowing when, hey, like, I don't need to, like, apply this whole stack when I could do this pretty simply with, like, I don't know, a spreadsheet or something like that, comes from, you know, just, you know, sort of maturity in the space and experience? Yeah, I think you got to spend your cycles sort of getting your butt kicked around a bit. You know, and I think that it brings you back to reality. You know, like I do a lot of stuff in spreadsheets. Why?
Starting point is 00:48:37 Because it's really easy to do. And it costs me nothing, right? And they're super efficient. They're unreasonably effective for a lot of stuff. You know, but I think what you realize is, you know, really it comes down to you as the individual and how effective you are at solving problems. The tools are just there to be tools, right? But I think when you start out, right, you want to compensate for your lack of knowledge and lack of skills with tools. That's a temptation, because it's like, well, I know how to use these tools, I probably don't know what I'm doing, right? And
Starting point is 00:49:08 that's kind of the fresher mentality that I've seen, at least. Uh, but that changes. I think ultimately you end up, I mean, my favorite tool is really just a pad and paper these days, and drawing out what I think the solution should be, um, and going on long walks and thinking about the problem. That's my secret weapon. I don't need technologies to do that up front. I certainly need them to help implement things, but then, you know, you know what you need to do, right? But, you know, if you're coming out of college, you're not going to have that ability. Why would you? You don't have the experience. You wouldn't know how to solve it from, you know, but that just comes with time and comes with getting a lot of bruises. Yeah.
Starting point is 00:49:48 You got to over-engineer a few systems before you realize that maybe you should spend a little bit more time planning before executing. Oh yeah. And I think the big question you need to ask is why, why are we doing this at all? Like what, what is the objective? Like, you know, I think if you can, if you can treat
Starting point is 00:50:05 things as a journalist and approach it from that perspective, you can just ask really good questions. You'll have better context. Sometimes the answer is you don't need to design this, you don't need to build this at all, actually. So, um, that is an answer as well. Not everything has to be built, right? Yeah, I mean, you could, uh, you could potentially even, you know, pay for a service. Like, exactly, yes. Like, I was reading an article on Hacker News, um, a few weeks ago. Uh, somebody had, um, I think it was at Uber, they wrote their own spreadsheet. Like, they created a spreadsheet because, like, Excel or whatever didn't do what they needed to do. And so they, uh, over-engineered the spreadsheet. And then I think something happened and it was never really used at all.
Starting point is 00:50:48 Yeah. Right. Yeah. I feel like, so like, you know, Google had built up a culture of sort of engineering everything. And I think in the early days it made sense because one of their core assets was engineering and they were doing things at scale well beyond essentially like a lot of services existed. But now that's not really the case but there's still sort of this historic culture around like hey we we're
Starting point is 00:51:09 not going to use Salesforce, we're going to, you know, write our own CRM, or we're not going to use HubSpot, we'll use our own marketing automation. But then you have these kind of, like, internal tools that are subpar to really what the industry standards are, uh, in some fashions, because, you know, internal tools are never going to get the same resourcing that, uh, you know, Google Search does, or, you know, Ads. Yeah, that's exactly it, right? I think last time I checked, as a Google Cloud partner, they use Salesforce for the Google Cloud CRM now, right? So it's like, you know, you can't escape, uh, the inevitability that there are better tools out there sometimes, and maybe you don't have the best tool. But you're absolutely right, that's a temptation. I mean, I know people who have written
Starting point is 00:51:47 their own databases when, like, MySQL or Postgres would have done just great. They're like, oh, I have to build this. And, like, I guess if that's what your boss lets you do, I don't know why you would do this. But yeah, I mean, even a lot of database companies, um, you know, start with Postgres and then use the extensions to, you know, do whatever they need. That's why there's so many, like, Postgres, uh, like, sort of core Postgres. I mean, it's kind of like a, you know, an operating system. Like, let's start with the Unix kernel and then go from there. We don't need to reinvent that piece. Oh, exactly. And Postgres is awesome. Like, you know, Linux is awesome. I mean, use these. As we say in the book, you know, and we write, I think, in chapter four about choosing technologies, build versus buy, it's like, you know, you should build when it's a competitive advantage to you and it's uniquely yours.
Starting point is 00:52:28 Like what you mentioned with Google, like they were operating at a scale, solving problems in a way that nobody on Earth is doing. Of course, you're going to have to build this on your own. Like, you know, I mean, I don't think it's for lack of trying, you know, off-the-shelf stuff and breaking it, right? I mean, they did that. And so, you know, this is what they had to do. Like, you know, I talked to Jordan Tigani about, you know, the work they did with BigQuery, right? He was a founding engineer for that.
Starting point is 00:52:53 And it's like, yeah, you're building that because it's a system that you need and doesn't exist right now. You have to run analytics on tons of data. It's like, you can't really do that. You're going to have to build it, right? And so that's, but I think you gotta understand, like, where you are as a company and as a team, right? Like, most companies aren't Google and you don't need to do this. And so, you know, I think
Starting point is 00:53:16 there's a temptation for engineers, software, data, or whatever, to read, like, Google's blog, Uber's blog, Netflix's blog, and say, okay, I'm going to go do that at my company. And it's like, maybe it'll work, but do you have the same problem in that same way? Also, do you have 100 engineers to throw at a problem that's non-core to whatever it is your business model is? Right. Exactly. What do you think are the big unsolved problems in data engineering? Yeah, it's a good question.
Starting point is 00:53:49 I think it's really about integrating data into, you know, like I said, more application workflows. And that whole feedback loop, I think, is like one of the big sort of unsolved problems. I would say, again, like the capability of solving the classical data problems. I'm talking about analytics, for example, I think that we have the technology to do that right now. We've had it for a long time. So I think it's a combination, again, of skills and practices. I think that that's one of the big problems for data engineering
Starting point is 00:54:17 is just, I think, leveling up on the concepts, I think, to be most effective at your job. But we already touched on that, and I think there's a lot of reasons for this. But the big unsolved problem, I would say, is that feedback loop between just in the data lifecycle and bringing it full
Starting point is 00:54:32 circle. I think we're going to continue solving. I'll throw out a trigger word for the audience and people will have an aversion to this or like it, but data mesh, I think has a capability of helping solve this problem. But, you know, we're not there yet.
Starting point is 00:54:52 Have you seen anybody actually implement some version of data mesh? I've seen people, I've talked to people who have said that they've implemented some version of it, right? But if you were to talk to people like Zhamak Dehghani, who's a really good friend of mine, I mean, I think that she maybe has a different opinion of that,
Starting point is 00:55:08 right? So I think it's sort of in the eye of the beholder. But I think the notion of data sharing in a decentralized way like that, I think, has been done to some degree. But I think, according to her perspective, maybe there's some work to do on it. But we'll get there. I think. I hope so. But what it means, though, and one of the conclusions that she draws, which I don't think gets discussed a lot, is that it actually changes the shape of the roles people have. So if you're a data product developer, as she calls it, you're bringing together software engineering and data practices all into one. So the notion of a data engineer, software engineer, ML engineer, this all kind of goes away. And it's just now you're delivering data products. And I think that is kind of the fundamental shift, which, if you were to take what she proposes to a logical conclusion, is exactly what would happen. So whether we get there or not, I don't know.
Starting point is 00:56:05 That's a debate for another podcast. Yeah, that's a whole topic. So as we start to wrap up, is there anything else you'd like to share? And how can people reach out to you, get in contact with you? Yeah, LinkedIn's good. Send me a message.
Starting point is 00:56:19 I usually respond. If you send me a sales pitch, I will not respond. So you'll actually be moved to the other box where that's purgatory for messages. Yeah, LinkedIn's good. Yeah, I'm taking a break from speaking. I think that we're recording this in kind of late November. But I'm taking a break from international travel
Starting point is 00:56:40 for several months. I'm working on a... I've got to finish my book, so that's coming out first half of next year. Then got a course I'm working on with DeepLearning.AI on data engineering. So that's going to be pretty dope. Really looking forward to that. So that's a specialization, so you can keep an eye out for that too.
Starting point is 00:56:57 Can't commit to a date on when that'll be out, but I'm heads down on those two projects right now, as well as starting a new company. So it's a new content publishing company that will be announced early next year. Yeah, got a lot going on. Yeah, it sounds like a lot. Yeah. Awesome. Yeah, and CrossFit. CrossFit, we forgot to talk about that. So, gonna get back into shape doing that stuff. I think we actually have a mutual friend, so we're talking about on the show. So, yeah, she's gonna be
Starting point is 00:57:25 doing some programming for me. That's Colleen Fox, if she's listening. Shout out. So yeah, Colleen's an amazing athlete, far superior. I can't comment on... I don't know what your athletic ability is, Joe, but I'm going to just venture to guess that Colleen's is above yours and certainly above mine. Oh, it is, it is. Yeah, even though she's, quote, retired from CrossFit, she still will completely mutilate anybody she competes with. So yeah. But it's cool, I think, hanging out with people like that, because I like to unplug from the data stuff as well. It's fun, but it's nice to, you know, hang out and do other stuff. But she's a data person too, so it's kind of funny. So anyway. But yeah. Yeah, do you CrossFit much? Yeah,
Starting point is 00:58:05 five times a week. Jesus. Okay, it's a lot. Yeah, that's how I reset my brain. You know, a little bit tired. If you do something physically hard, there's no way I can be, you know, sort of thinking too deeply about, you know, work and other things, which I spend most of my time kind of thinking about. So. Oh man, yeah, just go do Fran every day or something. Yeah, exactly. There you go. It's gross. Well, thanks so much, Joe, for coming. You know, for those listening, I highly recommend the Fundamentals of Data Engineering, a fantastic book. And hopefully, you know, once some of these other projects land, if you want to come back and chat about them, I'm happy to have you back.
Starting point is 00:58:46 Yeah, I'd love to. Love to. Or we can do it in person. We can do a CrossFit workout and do a podcast. Yeah, there we go. All out of breath and sweaty. Probably before, because we'll be really winded after. Yeah.
Starting point is 00:58:56 Awesome. Thank you so much. Cheers. Yeah, thanks, dude. All right, take care.
