The Data Stack Show - 29: The Present and Future of Data Engineering with Joe Reis and Matthew Housley from Ternary Data

Starting point is 00:00:00 Welcome to the Data Stack Show, where we talk with data engineers, data teams, data scientists, and the teams and people consuming data products. I'm Eric Dodds. And I'm Kostas Pardalis. Join us each week as we explore the world of data and meet the people shaping it. We have guests on the show today who I think have a pretty broad view of the data space. It is Joe and Matthew from Ternary Data, and they do services and training for data engineering and all sorts of different data work. Really interesting guys. My question: I'll be interested to see what common things they see among the companies that they work with.

Starting point is 00:00:54 I know that's pretty simple, but just having done consulting myself before, you sort of notice patterns, you know, getting to look at the way that lots of different companies are trying to solve the same problem. So I want to see what types of things they see in their work that are common across companies trying to do data engineering stuff. How about you, Kostas? Actually, I think this time I mainly want to hear the same things as you do. What I would add to this is, by the way, I think it's the first time that we have people that are coming from a consulting company. So I think we should exploit the fact that they have this kind of exposure to many different companies. And they have seen many different ways of implementing data engineering and analytics and machine learning. And also, I'm very interested to see from their point of view the evolution of this space and this industry.

Starting point is 00:01:44 Because we are still at the beginning of defining this data-related industry, but it's not like we didn't work with data in the past. And I think they will have a very unique perspective on that, on how things have progressed in the past 10 to 20 years. And that's super interesting for me to hear about. Totally agree. All right. Well, let's go talk with Joe and Matthew from Ternary Data. Let's do it. All right, we have Joe Reis and Matthew Housley

Starting point is 00:02:13 from Ternary Data on the show. Gentlemen, thank you so much for joining us. Hey, what's up, guys? Thanks for having us. All right. Well, so many interesting things to talk about. And I can go ahead and tell you that, you know, having done consulting in the data space myself and knowing all the different types of things you see, we're just going to have so many great things for our audience to hear. But why don't you tell us a little bit about yourselves, just a quick background, and then tell us about Ternary Data and what you do. Yeah, I'll go first. So this is Joe.

Starting point is 00:02:47 And my background's always been in data of some sort. I've been in the data space since the early 2000s. And, you know, I guess the work I was doing back then would now be called data science. And, you know, I always had a fascination with machine learning and, you know, got into that probably around like 2009, 2010, I would think. I started, you know, delving into that, especially, you know, with the availability of cloud and, you know, those sorts of resources. I started at an

Starting point is 00:03:17 AutoML company in 2012 and then quickly realized that a lot of the problems facing machine learning had nothing to do with algorithms. Even early on then, I realized a lot of the problems to make machine learning successful in production had to do with proper data architectures and data engineering. And so, you know, over the years, I've been on a crusade to help companies get more value from their data by helping them build solid data foundations and data architectures. I'm Matthew Housley. My long-term background is actually in academia. So I have a PhD in math, more the pure side of math. I suspect that actually resonates with a lot of people. I find a lot of data people come out of the math world one way or another, including Joe. You have a bachelor's degree in math, I believe, Joe. I do. Yep. Yeah. So eventually I had a friend who was more on the statistical side of math. We had

Starting point is 00:04:10 worked on some papers together and he recruited me to work as a data scientist. So really appreciate him doing that, taking a chance on me. And I started the job and as a junior data scientist, learned about a lot of the core tools, really enhanced my Python skills, was doing things with Pandas, working with data on a laptop. And at some point, I realized that the laptop-based workflow was extremely limiting. And so it just didn't scale. And then the other data tools we had available in our organization were extremely slow and also, in a sense, didn't scale. They were just too overloaded. And then my focus gradually began to shift toward data engineering. So how can we build large, efficient data systems? How can we

Starting point is 00:04:56 basically provide systems that will be a force multiplier for data scientists so they can get off of their laptops and these kind of very circumscribed tools into tools that can handle terabytes and even potentially petabytes of data. This kind of intersected with the fact that the company I was working at was looking at a cloud migration. And I ran some early projects on GCP and then worked on a project on AWS and realized that there was this huge skills deficit around cloud data technologies. So you could do amazing things even with EMR on AWS, but you had to have the right skills to make that possible. And so around this time, I think it was 2017, Joe and I met.

Starting point is 00:05:38 And so we started mapping out the possibility of a company that would focus mostly on data engineering. The end goal would kind of be the same. In other words, our goal is to enable machine learning and data science, but from more of a foundational level with a very heavy cloud focus. Very cool. And it's actually interesting, your comment about a background in mathematics. That has proven true, at least across our guests on the show. We've had many guests who have a background in mathematics, so that's at least a little bit of anecdotal data to reinforce your... Yeah, actually, in mathematics and physics, like

Starting point is 00:06:17 these two types of people, they are thriving in data science, I think. Hey, Matt, didn't you do your master's in physics? I actually did. I started out in physics before. So there you go. And again, anecdotally, my observation is that you have your, like, applied math statistical people, and they tend to go really deep into data science and the machine learning. And again, anecdotally, I've just seen a number of my friends do this, but people who are more on the pure math side, so like proving theorems and working in areas like algebraic geometry, number theory, representation theory, tend to drift into engineering. Just because engineering, I don't know, I think it's the problem-solving aspect. Maybe there's a lot in common between debugging a distributed system and trying to prove some theorem about linear algebra.

Starting point is 00:07:02 I don't really have a good theory. It's all anecdotal. Well, I mean, anecdotally, I think that's correct. I mean, that was kind of my path. I mean, I was an applied mathematician and went into analytics and more kind of real world situations. So yeah, that's interesting. Tons of data stuff to get to, but I'm going to indulge myself, which I do occasionally on the show, by, Matthew, just leveraging some of your expertise to help me answer my own children's questions. So last week, my four-year-old who,

Starting point is 00:07:32 you know, is counting and, you know, trying to count higher and higher, et cetera, we're driving in the car the other day and he just said, Hey dad, do numbers go on forever? So I want the quick PhD in pure math answer to how do I respond to my four-year-old son? That is a good question. I might be able to find some good YouTube videos for you actually. That'd be great. So numbers do go on forever.

Starting point is 00:07:59 And Joe is probably familiar with this too. There are different types of infinity that you have to worry about as well. And so the type that we deal with as children is the countable type of infinity, where I can get to any number eventually. In other words, pick a number out of the bag, I count long enough and I'll get there. But there are other types of infinity that are not countable. So for example, the real numbers don't have that property. They have a larger cardinality. And yeah, I'm not doing a good job of explaining this,

Starting point is 00:08:28 but I'll try to point you to some resources that might be helpful. I am super appreciative. And I just learned something really cool that will surely take me down a Google rabbit hole later this evening. Give yourself a few hours on that one. It's a fun one though.
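
A short aside on the math, added for readers and not part of the conversation: the distinction Matthew describes between countable and uncountable infinity can be sketched as follows. This is the standard diagonal argument in outline, not a full proof, and the digit choice of 5 and 6 is just one convenient convention.

```latex
% Countable: the elements can be listed one after another.
S \text{ is countably infinite} \iff \text{there exists a bijection } f : \mathbb{N} \to S.

% Cantor's diagonal argument: given any list x_1, x_2, \dots of reals in [0,1),
% write x_n = 0.d_{n1} d_{n2} d_{n3} \dots and build y = 0.e_1 e_2 e_3 \dots with
e_n = \begin{cases} 5 & \text{if } d_{nn} \neq 5 \\ 6 & \text{if } d_{nn} = 5 \end{cases}
\quad\Longrightarrow\quad y \neq x_n \text{ for every } n.

% So no list covers all the reals:
|\mathbb{N}| < |\mathbb{R}|.
```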

Starting point is 00:08:42 Multiple infinities, light evening reading with your four-year-old. Cool. Well, I know Kostas has a ton of questions. I'll kick us off though. So I mentioned this in the intro. So you run a services business where you do all sorts of things, helping companies make better use of their data, get their data cleaned up, you know, just the variety of things that go with data engineering, Matthew, as you said, to help data become a force multiplier. You have a wide purview of different types of companies doing different things with data. And so I'm really interested in some of the common threads you see, you know, a lot of the

Starting point is 00:09:20 times we'll talk about really specific deep subjects with someone working on, you know, say data science inside of a particular company and dealing with a particular type of data. But I'm really interested in your sort of broad range of view working with a lot of different types of companies. So what are you seeing out there? I think it's a good question. Okay, so let me caveat this with some anecdotes where I, you know, we obviously work with a lot of companies. And I also talk to a lot of data professionals around the world on a weekly basis. And the things that I'm finding or the threads that we're seeing are actually pretty common everywhere. And so what are those threads? Mainly, machine learning is really damn hard for the reasons that I kind of alluded to earlier in our intro, where it's easy to get to maybe the 70% of machine learning where you've spun up a Jupyter notebook on your laptop and you can check that box, and that might be a success. But then when it comes to, I think, rolling out machine learning, and I would also add analytics into this, to the

Starting point is 00:10:28 broader organization, this is where a lot of the challenges start. So part of it is a technology issue. I would say a large part of it's also an organizational issue. So if you don't have the company on board with these digital or data transformation projects, it's going to be incredibly difficult to make progress. And the types of organizations where we see this, they typically tend to be, well, not data first, i.e. you haven't incorporated data into your processes or product from the get-go. Retrofitting data science and more modern data techniques is actually, I would say, it ranges from fairly difficult to, like, not going to happen.

Starting point is 00:11:15 So what do you have to say on that, Matt? Am I off base there? No, I completely agree with that part. Yeah. And I think another big theme from my perspective, and Joe, you should weigh in on this as well, but a lot of our clients that are still dealing with on-prem systems are hitting a wall there for various reasons. So one possible type of wall is that they're using an older legacy

Starting point is 00:11:37 database that's a high quality database, but just they run into scalability limits at some point. So no matter how big your Oracle license is, at some point, you're probably going to hit a wall in your Oracle license. No matter how many servers you have in some kind of a cluster, you're probably going to hit a wall on that at some point. You might have a legacy Teradata system, you re-up to a bigger system and you still hit a wall on that. I think also we've seen that Hadoop has become a disappointment for many organizations on-prem because again, you run into those same fundamental issues.

Starting point is 00:12:09 Plus you need really heavy duty, expensive engineering resources to run that cluster. So Hadoop turned out to be fantastic if you were the scale of like a Yahoo or a Facebook and you could just build a massive cluster and you could have these highly, highly proficient engineers and you could scale to 5,000 nodes and serve the data needs of the whole company.

Starting point is 00:12:29 But if you're a lot smaller and you're not specialized in tech, then that is going to become a real problem for you at some point. And I think that is the big driving force behind cloud migrations, moving away from some of those limits and having limitless scaling as a

Starting point is 00:12:46 possibility in the future. I think the other big thing for me is that the data issues that organizations run into beyond their hardware and technology that hinder data science come down to really basic foundational problems like data quality and really complex ETL. It takes a long, long time to deploy fixes to data quality issues. And the interesting thing is your data scientists are smart. They'll often find these data quality issues very quickly as they're working on a new project, but then it might take six months to a year to actually get those fixes deployed with the somewhat non-agile legacy approach to data pipelines that many organizations have. Another big theme is just getting data in and out. So tools like

Starting point is 00:13:31 Fivetran, for example, that connect from A to B are just blowing up right now because that turns out to be a very hard problem, even though it looks simple. Yeah. And add RudderStack to that mix too. The other thing I would add too is, you know, what we are noticing, and this is actually influencing our business model to a pretty big degree now, is there's a big skills gap, right? There's a skills gap internally with companies with respect to data engineering and best practices with the cloud. And there's also a talent gap. So if you want to hire data engineers, you know, unless you have, you know, the cachet of one of the bigger tech companies or you're doing something really innovative, it's really hard to find data engineers. And so what we also find is, you know, there's a trend towards, you know, easy-to-implement solutions and coupling that with training.

Starting point is 00:14:20 Right. So actually our business model, we've actually gotten out of the sort of butts-in-chairs hours, typical services engagements, because what we realized is with a lot of our clients, what they need is actually, if they have a data team, the data team needs skills. They don't need somebody to come and implement it for them. They actually need somebody to help them implement and coach them with specific technologies and best practices and paradigms, because that in the end makes the implementation a lot stickier with best practices and so forth. So that's what we found, you know, sort of our secret sauce is, you know, as much as we can, we actually don't touch a keyboard, we teach other people how to, you know, level up

Starting point is 00:15:01 their skills and become, you know, awesome data engineers. And so we found that that's a big differentiator with what Ternary does versus other companies. And a lot of our partners like this approach because it makes, you know, obviously their products a lot stickier in a client. Sure. Yeah. I mean, that, that makes total sense because it's easy to think about data engineering as something that kind of has like a defined start and end point, you know, along the lines of implementing a new technology, right? We're going to migrate to a new warehouse, right? And so you get the new warehouse and you do the migration and then great, you're running on the new warehouse. But in reality, data engineering

Starting point is 00:15:41 is something, I mean, we see this, you know, just talking with people on the show, it's something that's a constant pursuit. And so I think describing that as the need is more around skills, I think is a really good point. And we see that all the time, because it's not something you're ever really done with, right? I mean, you can complete projects and you can build infrastructure, but as organizations grow and change, the needs around data grow and change, right? And the types of data and formats and, you know, delivery and all those sorts of things are dynamic as an organization grows and changes. So that's really interesting to hear. For sure. And, you know, having the skills is paramount too, because when you're evaluating a new tech stack, right, the number of data tools keeps growing every year. So, you know, in fact, Matt and I were looking at these charts of data tools in 2012 versus today. And, you know, you could count the number of data tools in 2012.

Starting point is 00:16:42 I don't know that you could actually feasibly count the number of data tools today. And I think it also goes to, you know, just the ability of a person to keep up on best practices and modern tooling, on, you know, the best tools out there. That has just become exponentially more complicated. And so again, like even to evaluate a new tech stack means you constantly have to keep staying on top of stuff because with the number of tools coming out, there's going to be new approaches, you know, to data and out. Yeah. Like I think to your point, nothing's ever static in this industry. If you're static, you know, you might do that to your own detriment. So.

Starting point is 00:17:16 Sure. Sure. Absolutely. We were actually just talking to someone earlier today and they made a really interesting point. They said that there's a big gap, you know, a lot of the, you know, even data tools will sort of paint a picture with their marketing messaging that doesn't necessarily reflect how much work it actually takes to get to the end destination, right? It's like install this and then all of a sudden you'll have XYZ result. And in reality, even if you get the tooling, like making it all work is just a gigantic effort. Well, one more question, then I want to hand it over to Kostas. So one more question from me, because I've been monopolizing here. So we talked about the common threads. I'm interested to hear about any differences you

Starting point is 00:18:01 see across companies. And maybe, maybe you don't, but I think about things like, are there sort of challenges that are different maybe among different types of business models or particular challenges you see at companies of a certain type of scale, you know, maybe startup versus enterprise. Are there any sort of differences you see across the work you do with, because you work with such a wide variety of companies? It's funny. A lot of the differences we see are technology differences that are reflections of underlying cultural issues. So I think at this point, we do see a lot of companies that have struggled with data, but at least understand that data is really valuable.

Starting point is 00:18:45 And so are willing to make the investment. They maybe need a little bit of direction about where the wind is blowing and where to try to make those investments. Whereas others see data as just this really expensive hole that they dump money into and are very stingy about what they put into it in terms of money, obviously, but the technology and the people as well. And they kind of wonder, okay, why is my data terrible? They just, culturally, there's a disconnect between their core business maybe and the fact that data can help them. They just don't quite understand at the top level sometimes.

Starting point is 00:19:18 Yeah, it's more reflective of this. Are you guys familiar with Conway's law? Yes. You are, okay, cool. Yeah, so, I mean, that comes into play. It may be good to just do a brief overview, just a brief run through. Yeah. So for the listeners out there who don't know what Conway's law is, so Conway's law basically says that an organization will develop modes of communication based upon how the organization communicates, right? So

Starting point is 00:19:43 they'll develop systems around how the organization communicates. So if you have a very siloed organization, your systems will represent silos, right? And if you have a very open communication format, then you'll develop systems that work accordingly. And so what we find with respect to data, especially, is data is different than technology in a few areas, right? So, you know, application development, for example, that tends to be focused on particular use cases. But, you know, quite a few departments in a company use data, right? Whether it's reports that come out of an ERP system

Starting point is 00:20:19 or, you know, any number of things. And as well, when technology fails, you tend to notice this because you're, you know, maybe your application stops working, right? And that's, it's pretty obvious where you're like, if you're maybe developing an application, the tests break, right? And so you have a pretty good understanding of that.

Starting point is 00:20:40 Data is a much different story, where data issues may persist for months or years and nobody knows the difference, right? And so that's a big issue. And I would say that when you start hearing things like, oh, well, as long as the data is directionally correct, that's a pretty big red flag that, you know, data should be addressed. Not to say it will be, but it should be. And so with that, it tends to be the companies that I think are investing heavily into technology, if they need to transform, digital transformation, because inevitably data transformation and data value happens from

Starting point is 00:21:18 those sorts of endeavors. What we tend to find is when, you know, companies are not trying to transform at all, that tends to be where data kind of goes to die. So interesting. Yeah. Directional is, in the data world, kind of like the term seasonal in marketing, you know, it's like, well, this is weird or this doesn't look right. And it's like, well, it's seasonality. You know, it's just seasonal, right? And, you know, it's like kind of a catch-all for like why things aren't right. And like, it's funny, I'm thinking about the word directional. Like you said, oh, I mean, the data is directionally correct, right? And you're like, that usually means

Starting point is 00:21:59 that there's bigger problems under the hood. All right, Kostas, I've been monopolizing. You did well, Eric. Thank you. Thank you. It was a very interesting conversation so far. So guys, a quick question. I noticed that you have your journeys, like you started from an academic background, like science, mathematics, physics, then you went to data science and from there to data engineering. Can you take us through this journey and what you saw out there as data scientists that made you realize that you want to focus more on data engineering and give us some examples of that?

Starting point is 00:22:40 Yeah, I could start out. So I think I was raised like a lot of data scientists on tools that run on a laptop, very heavy focus on Python, on R, and developed some pandas skills, some R skills in terms of being able to analyze data frames. I would run a Teradata query that would take a while to run because the system was quite overloaded. I would download the data. I would load it into pandas. And then I would discover that I needed really a different sample of data. So I would go back and run another Teradata query. And then I would try to transfer some of my workflow directly into SQL just to run on Teradata. So I didn't have to jump through quite so many hoops to get from A

Starting point is 00:23:25 to B. And that was super slow as my queries began to scale up. And we also had a Hadoop system that had more event-oriented data. I would try to run a query on there and it would take like three to five minutes at best. And so the turn time, the workflow was just extremely slow, like the iteration time to try to run Hadoop queries. And then if I needed to do something beyond a SQL query in Hive, then I further had to download that data to my laptop. Oh, it's too large. Okay, go sample it. Pull it into Pandas again and try to do some analytics that way. And so I think a lot of my experience comes back to this cliche that data scientists spend something like 75% of their time just trying to pull the

Starting point is 00:24:12 data, trying to acquire it, trying to do some basic filtering in order to attempt to do data science, like the first steps. And at some point, I realized that you had these tools in the cloud, like Elastic MapReduce, like Redshift and Snowflake, and at some point, BigQuery. And you could dynamically scale up to a huge number of nodes. Yes, it would cost some money, but you were only paying when these tools were actually turned on. And so I think that's what really turned me on to the idea of doing data engineering in the cloud. Suddenly we just had the scalability and resources of a much larger company at our disposal. And at some point I started using Databricks as well. And now you could kind of transpose those laptop-oriented workflows into a data frame environment that was much, much more scalable and much faster.
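
An illustrative aside, not from the conversation: the slow loop Matthew describes (query, download everything, then analyze on the laptop) versus letting the engine do the work can be sketched roughly as below. This is a minimal sketch under stated assumptions: Python's built-in `sqlite3` stands in for a warehouse such as Teradata or BigQuery, and the `events` table, its columns, and the function names are hypothetical.

```python
import sqlite3


def build_demo_db(n_rows: int = 10_000) -> sqlite3.Connection:
    """Create an in-memory table standing in for a warehouse table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    rows = [(i % 100, float(i % 7)) for i in range(n_rows)]
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn


def totals_on_laptop(conn):
    """Laptop-style workflow: download every row, then aggregate client-side."""
    rows = conn.execute("SELECT user_id, amount FROM events").fetchall()
    totals = {}
    for user_id, amount in rows:
        totals[user_id] = totals.get(user_id, 0.0) + amount
    return totals, len(rows)


def totals_in_engine(conn):
    """Pushdown workflow: the engine aggregates and only the summary is transferred."""
    rows = conn.execute(
        "SELECT user_id, SUM(amount) FROM events GROUP BY user_id"
    ).fetchall()
    return dict(rows), len(rows)


conn = build_demo_db()
laptop_totals, laptop_rows = totals_on_laptop(conn)
engine_totals, engine_rows = totals_in_engine(conn)
print(laptop_rows, engine_rows)  # rows transferred by each approach
```

Both functions compute the same per-user totals; the difference is how many rows cross the wire to the client, which is the cost that grows painful as the data scales.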

Starting point is 00:25:06 And so given that so much of my time was spent on just trying to address these core issues, I started to have this realization that deploying these cloud tools could speed up that workflow dramatically. And then at some point I started to become kind of the point person for the teams I was on to deal with these kinds of issues, deploying Databricks and training people how to use it and enhancing people's SQL skills as well. Yeah. And I think, you know, on my end, you know, when I was getting into the ML space, there wasn't, I mean, there was a handful of libraries that, you know, made ML simple, but there was nothing in the way of proper tooling, right? I mean, DevOps was still sort of being figured out in real time by a lot of companies. And so we, you know, in my case, there's, you know, I think it was more just

Starting point is 00:25:57 having to figure things out from the ground up, because there wasn't a playbook on how to do whatever you call data engineering now and, to some extent, ML engineering as well, right? So a lot of the stuff was, you're just kind of having to make it up as you go along. And so with that in mind, I think it had always been a mission to, I guess, make things better, really, or at least, yeah, you know, trying to make the world simpler, just because I felt it was horribly complicated. And then, you know, it was interesting because around that time of, like, you know, deep learning becoming the hot new thing, this must've been like 2014, 2015. And then, you know, a lot of my data science friends, you know, and acquaintances, they were all asking, so why, why don't you want to get,

Starting point is 00:26:38 you know, why are you calling yourself a recovering data scientist right now? Like, surely you must be crazy. I was like, well, it'll make sense when you're older. Because, I mean, I'd seen a lot. I'd gotten a sneak peek into a lot of the problems, you know. And so it just made a lot of sense why, you know, as Matt indicated, you know, developing in Jupyter notebooks, I mean, it's great, you know, knock yourself out, but, you know, to make the stuff work in production, there's just a lot more work that needs to be done. So it felt like, at least when I was getting into data engineering, kind of around the early to mid 2010s,

Starting point is 00:27:17 it wasn't really a field then either, right? I mean, I think my titles at the time were like software engineer because there wasn't even a title for data engineer, even though we were doing data engineering things. So, you know, I think, yeah, those experiences afforded it. Yeah. So guys, I mean, we're talking a lot about data engineers, but what makes an engineer or a software developer a data engineer? And the question has actually two parts. One, in terms of the skill set, what kind of skills does someone have? And also, what is the role inside the organization? What does the data engineer do? I think the role of a data engineer is to help take the raw ingredients of data and ingest those, process them, and then make them useful for analytics and machine learning.

Starting point is 00:28:06 So I think if I were to say what the role is in a nutshell, I would think that that's it. What do you think, Matt? Yeah, I'll comment as well. I agree with Joe on that part. And I think in the last five years, there's been a huge shift in expectations of what a data engineer's role should be. Five years ago, 2016, you would see a lot of articles talking about how if you wanted to make a lot of money, you should go learn Hadoop, like low-level Hadoop, learn how to manage a cluster, learn about installing the software, learn about creating data pipelines, raw MapReduce jobs, maybe jump into Spark. That was the way to be a very competent data engineer.

Starting point is 00:28:46 I think in 2021, the emphasis has shifted much more towards stitching together a lot of pre-built pieces. So if you are using Spark, it might be something like Databricks or EMR or Dataproc on Google Cloud. And yes, you'll need to do maybe some low-level tweaking, but you're not going to spend time creating a cluster and managing hardware and these kinds of details that used to be a big part of your job. You'll also probably use a lot of completely off-the-shelf tools that

Starting point is 00:29:15 are turnkey. You might use Snowflake or BigQuery, and you might orchestrate those tools using something like Apache Airflow to make them work inside of a larger pipeline, get data into cloud storage, pull it back out, do interesting things with it. That to me kind of distills the skill set and the role. But you asked as well about the organizational role, I think, Kostas. Yeah, that makes sense. And I think from my point of view, like the way that I see the role, I think there is an interesting combination of tasks that in classic software engineering, you have, you know, like you have the SRE, you have DevOps, and then you have the software engineer, right?
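
An illustrative aside, not from the conversation: orchestrators such as Apache Airflow model a pipeline as a directed acyclic graph of tasks, each running only after its upstream tasks finish. The sketch below shows that idea using only Python's standard library (`graphlib`, Python 3.9+). It is not Airflow's API, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task after all of its upstream tasks; return the execution order."""
    order = []
    # static_order() yields every node with its predecessors first.
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        order.append(name)
    return order


log = []

# Hypothetical pipeline steps; each one just records what it would do.
tasks = {
    "extract": lambda: log.append("pulled rows from source"),
    "stage": lambda: log.append("wrote files to cloud storage"),
    "load": lambda: log.append("loaded files into warehouse"),
    "transform": lambda: log.append("built analytics tables"),
}

# Maps each task to the set of tasks that must finish before it starts.
dependencies = {
    "stage": {"extract"},
    "load": {"stage"},
    "transform": {"load"},
}

order = run_pipeline(tasks, dependencies)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering, which is much of why teams adopt one instead of hand-rolling scripts.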

Starting point is 00:29:55 I think that a data engineer, okay, it might depend also on the size of the company and like the scale of the problems that they are solving. They pretty much have to do a little bit of everything. As you said, one thing is like stitching things together, maintaining and making sure that this infrastructure is working, rewiring the infrastructure because it requires a lot of changes, not just like maintaining something, and writing code. I mean, I don't think that you can, even like with tools

Starting point is 00:30:23 that are not supposed to need, let's say, a lot of coding, using something like Fivetran. I mean, still someone needs at some point to create a dbt model, right? And interact with the data. It might be SQL, it might be Python, or it might be all of these things together. So for me, it's like a very interesting role. And it's a very interesting evolution of the role, to be honest, because I don't know, in my mind, at least, I don't know if you agree. I mean, we started with a DB admin back in the late 90s, beginning of the 2000s. And we ended up today talking about data engineers.

Starting point is 00:30:59 I think the data engineer, at the end of the day, your job is to really, as tools become simple, it's still, I think you need to have a really good idea of the data life cycle, right? So ingestion, storage, processing, transformations, et cetera, I think to your point. So that doesn't go away. I can actually see a day though, this might be a bit heretical. I think the term data engineer may morph into something different. I mean, you're seeing new buzzwords like analytics engineer. I'm not saying buzzwords. These are practical titles, right? So ML engineer and so forth. And so it's going to be a lot more fine-grained, just like data scientist, right?

Starting point is 00:31:36 That was kind of a catch-all term where you had to be kind of good at everything. You're kind of the crossfitter of data. You're not really good at anything in particular, but you're, you know, amazing at everything. So, you know, I see data engineering sort of morphing into that because it is true. I mean, you point out the word DBA. I mean, I see data engineering job postings that are basically a DBA job, right? Or an ETL developer.

Starting point is 00:31:57 So, yeah. By the way, now that you said that, like, how do you see the role of data engineering changing inside the organization based on the size of the organization? Do you see data engineers working in startups compared to big established companies? Do you see a difference there? Does the role evolve or change depending on the environment where you work? Yeah, definitely.

Starting point is 00:32:19 Like startups definitely tend to be, you know, more of the quote unquote full stack data person, right? So, I mean, I don't know if it ever disappears entirely just because you're resource constrained. And so whoever you hire is going to have to basically figure a lot of stuff out. But yeah, as you get into, you know, I think more established companies, the role of a data engineer is, it's a lot more defined. And I think that the nuance is it's more defined for that particular company.

Starting point is 00:32:42 Because again, a data engineer or data scientist, depending if you go to any of the FAANG companies, it's all different. And let alone like all the millions of other companies across the country. Right. So. Yeah. How big are the teams of data engineers usually, based on your experience? That's a good question. I think we've seen a lot of core data engineering teams of 10 to 20.

Starting point is 00:33:03 I don't know if we can expect that to fragment in the future, but I'm thinking of like a couple of billion dollar a year revenue companies that had teams in that kind of size range. And they were just responsible for building out and maintaining a lot of pipelines and interfacing with parts of the company across the organization to provide resources to them, basically. What are your thoughts, Joe? Yeah, I think that's about right. And again, it just depends on the size of the company, right? And I guess although other roles are split out, I mean, sometimes you'll see a software engineer doing a lot of data engineering work or, you know, a data scientist doing the data

Starting point is 00:33:38 engineering work. And so that just usually means like the titles have yet to settle. So, yeah. But again, it's, it's, there's kind of, I don't know. It's a weird thing where there's like cargo culting of like job titles. Right. So, you know what I'm saying? So it's like you just pick like a job description from some other company. Like, oh, that looks good. We'll just take that one. They might read them sometimes, they might not. I don't know. Makes sense.

Starting point is 00:34:05 All right. So, okay, I think we covered a lot around people and organizations. I think it's time to talk a little bit more about the technology. I think you mentioned earlier that how much the technology landscape has changed from 2012, where everything was around Hadoop and Spark was just starting. Until today, I mean, I think there's like a huge exponential growth in terms of the tools that are available out there. And I think you are the best people to talk about this evolution. Can you give us a little bit of your experience with how these tools have evolved

Starting point is 00:34:39 and some tools that you find as really, really important for the job of the data engineer? I'll go, Matt. So I think the one thing I noticed is a lot of the things that were like popular back then, so you're talking like Hadoops, Sparks and whatnot. It's interesting because the evolution with a lot of those tools is, you know, depending on the type of company you're at and depending on your skill set, but the data warehouse has come back into vogue, right?

Starting point is 00:35:04 So a lot of things that you could do in Spark, I mean, you can also do that in SQL using Snowflake or BigQuery or Redshift. And so what I think, what, you know, I recall conversations back in the day like, oh, SQL is dead, data warehouses are dead, like you may as well just learn, you know, Python and Scala and call it a day. And I still think there's a time and place for that discussion, but increasingly they, you know, the new generation of cloud data warehouses is extremely competitive with these, you know, these older quote, big data tools. Additionally, when you're talking about the streaming end of things,

Starting point is 00:35:37 I mean, that's, I think still a work in progress, but, you know, I would say streaming and data warehouses are two things that I've seen that, you know, are kind of taking a lot of attention from people. I would say data warehouses are a lot more understood than the streaming part, which we can get into in a bit. But I don't know. What are your thoughts, Matt? Yeah, I think it's funny. A lot of these tools that started out being targeted at developers. So for example, Hadoop, back in the day when Hadoop started, if you wanted to write a data processing job, you were going to be writing

Starting point is 00:36:11 Java code and writing MapReduce steps. Well, what happened? Eventually Hive came out and now you could do all that in SQL. And it turned out a lot of the mindshare started shifting toward Hive because even really sophisticated data engineers didn't want to be spending all their time writing MapReduce jobs. We've kind of also seen the evolution of a lot of more traditional big data tools into the data warehousing space. Maybe I shouldn't say traditional. These aren't that old, but it seems like nothing we use these days is particularly old. But for example, I was using Databricks a couple of years ago. And at the time, it was very clear that Databricks was shifting toward being more of almost like a data warehousing hybrid with data lake model. Initially, the idea was I can take this raw unstructured data and do just about anything

Starting point is 00:36:56 with it. But over time, they shifted their focus towards schema management, toward Delta Lake, toward management of table changes and things that data warehousing does more traditionally. Another tool that's shifted in that direction is Imply. They started out being very, very just focused on real time. And now they're also advertising themselves as being able to serve this data warehousing need. And so it does seem like data warehousing and SQL both have made a huge comeback. The other really big technology shift is just in terms of how you purchase and deploy these technologies. I think back in 2015, go back further to 2012, the cloud was perceived as a toy for companies that weren't big enough to have their own data centers. And I think in 2021, there's this realization that it makes more sense to deploy your capital

Starting point is 00:37:50 to the cloud and let someone else take care of a managed service for you, be that Google BigQuery or Databricks, managed open source or managed proprietary, and let your data engineers focus at a higher level and let someone else take care of a lot of the behind the scenes details and tuning. Yeah, that's a good point. Yeah, I would say the last five years especially has seen like sort of the rise of trying to eliminate as much undifferentiated heavy lifting as possible in the data stack. Whereas before it was almost like, how complicated can I make my stack? Then some companies wised up and found that maybe taking a more simplistic approach had value. And sure enough, those companies are now, you know, in a lot of cases, unicorns or soon to be.
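To make Matt's earlier point about MapReduce versus Hive concrete, here is a minimal sketch rather than anything shown on the episode. It counts words twice: once with hand-written map and reduce steps, and once with a single SQL GROUP BY. The input lines and function names are hypothetical, and sqlite3 only stands in for a SQL engine such as Hive; the shape of the query is the point.

```python
import sqlite3
from collections import defaultdict

# Hypothetical input: one short text record per line.
LINES = ["spark hive sql", "hive sql", "sql"]


def map_step(line):
    # Emit (word, 1) pairs, the way a MapReduce mapper would.
    return [(word, 1) for word in line.split()]


def reduce_step(pairs):
    # Sum the counts per key, the way a reducer would after the shuffle.
    totals = defaultdict(int)
    for word, one in pairs:
        totals[word] += one
    return dict(totals)


def count_with_mapreduce(lines):
    pairs = [pair for line in lines for pair in map_step(line)]
    return reduce_step(pairs)


def count_with_sql(lines):
    # sqlite3 is an assumption standing in for Hive or a cloud warehouse.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany(
        "INSERT INTO words (word) VALUES (?)",
        [(word,) for line in lines for word in line.split()],
    )
    rows = conn.execute(
        "SELECT word, COUNT(*) FROM words GROUP BY word"
    ).fetchall()
    conn.close()
    return dict(rows)
```

Both functions return the same counts; the SQL version leaves the mapping, shuffling, and reducing to the engine, which is roughly the shift in mindshare described above.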

Starting point is 00:38:35 And so that's kind of cool. Yeah. Yeah. I think that's also a big part of the success of Snowflake, to be honest. I mean, they managed to start as a data warehouse. Actually, it's very interesting because if you see the evolution of how they position the product and the company, I mean, it looks like, if you see their diagrams, it looks like the data warehouse was their MVP in a way, which is very interesting because, I mean, it's a pretty complex thing to build, right?

Starting point is 00:39:02 But today, if you go to their main website, they don't even call it a data warehouse. They call it the data cloud. And of course, their bet was on cloud. And I think that was cloud and self-serve, right? Because back then when they started, if you think about Redshift, Redshift was still, I mean, it was on the cloud,

Starting point is 00:39:22 but it was still a bit of a pain to manage. You still had like many knobs there that you had to play with in order to optimize it. Or then at some point you had to rescale your cluster and that was a major pain and you had to go through downtime. So I think Snowflake really reflects like the evolution in this space. And I think it's very interesting. So what kind of stack are you excited about? I mean, if you have to build an infrastructure today, what are the tools that

Starting point is 00:39:52 you would choose? And also what tools you really enjoy working with? I mean, we both do a lot of work in Snowflake and BigQuery from a data warehousing angle. I would say that those are two we're excited about just because I think they're both pleasant to work in. The things I would say, I don't know, before I keep blabbing, Matt, what are you excited about? Oh, no, I completely agree. It's funny. I think a lot of data engineers still perceive data warehousing as very unsexy. It's like, well, it's just a data warehouse that runs on SQL.

Starting point is 00:40:21 But I think the exciting things about Snowflake and BigQuery are that you can just drop a couple petabytes of data in there if you want to. And you can be running these extraordinarily huge queries in short order. And so that means that if I am a large company and I have petabytes of data on-prem, the hard part is just shipping that data to the cloud. But once it's up in the cloud, I can do these amazing things with that data. And then I can start hooking in other tools as it makes sense to do. So if I really need the power of Spark, I can plug that into Snowflake or BigQuery very quickly. So yeah, I find those are a pleasure to work with.

Starting point is 00:40:57 And I find them both exciting because of the degree to which they can scale so easily. Yeah, I would say the things I'm excited about are actually the MLOps tooling space. That's fascinating to watch unfold in real time right now. I have no idea where it goes, honestly, but I don't think anyone in the industry knows either. But that to me is fascinating because the practices of data engineering, I think have pushed the maturity

Starting point is 00:41:26 of analytics for a lot of organizations. And simultaneously, there's been this undercurrent of people working in the ML tooling space. And so I'm very excited about that. I would almost say more so than even the stuff happening in data engineering. Stuff like Snowflake and BigQuery are great. I consider those to be sort of the cool stuff in the present. The things I'm personally interested in are, you know, continuous learning, real-time systems and how that impacts business. Matt and I were just having a chat about that the other day, actually, just like the cool stuff that, you know, could possibly happen when you have

Starting point is 00:41:59 genuinely real-time systems that are, you know, taking automated actions and just what that means, you know, for businesses and how it impacts people. So, yeah. Yeah. That's very interesting. Actually, we had like our previous episode was with someone from Tecton. Oh, well.

Starting point is 00:42:16 And yeah, yeah, yeah. And we were discussing about feature stores. And I mean, it was very useful for me because finally I figured out what a feature store is, or at least what we believe today that a feature store is. But it was amazing that you have like two or three years now,

Starting point is 00:42:34 so many talks out there about feature stores. And yeah, I mean, like the industry is still trying to figure out what this thing is. We all feel that we need it, but okay, how do you define it? How do you describe it to someone? Yeah, well, it's interesting because, like, you know, in January, you know, Josh Tobin, who's, you know, he teaches Full Stack Deep Learning, this course at Berkeley, but he did a talk at one

Starting point is 00:42:55 of my meetups. He talked about the evaluation store, which is this brand new concept that nobody had really heard of until he unveiled it there. And, you know, then who knows what kind of stores people come up with next, I don't know, or other technologies and just maybe entirely new ways of thinking about things, right? Because what I see right now in the ML space is like, you know, people are taking the best practices they've seen from DevOps and data engineering and DataOps, whatever that is, and trying to make sense of the landscape. But I'm almost certain, I'd be willing to bet my head, somebody comes along with a completely different approach to things. Because the DevOps space, for example,

Starting point is 00:43:31 people are still trying to make progress with that as well. It's not like the DevOps space is said and done. I just think it's just 10 years ahead of where data is, basically. Yeah, absolutely. Yeah, I think we're just in the beginning of shaping this space. So it's going to be a couple of very exciting years, I think. So guys, one last question from me, and then I'll hand the microphone to Eric. So you mentioned at the beginning of the discussion that we had that ML is hard. And at the same time, I think we talk a lot about ML. But most cases,

Starting point is 00:44:03 if you ask the people that are excited about it, what are some specific use cases of ML, it's not that easy to come up with them. Can you share with us the most common use cases that you have seen of using data in the ML, let's say, context, but in general, outside the typical BI, which is extremely well-defined, we all know what BI is about and how it is used. How is ML today implemented? What are the most common use cases that you have seen out there? I would say there's certain tertiary problems with the business, right? So when you look at, I always sort of evaluate this from, again, this is from a business that isn't including

Starting point is 00:44:40 ML in its product, where you're doing maybe image recognition for an app or something like that, right? But if you're a business, the things that you really care about are likely revenue related. So if you have enough history, forecasting, that's a really big thing, especially if you have a supply chain, you're going to need a forecast, period. You know, you operate without a forecast in a supply chain at your own risk. So there's that. And then there's also customer retention and churn and those sorts of things. So those tend to be like the most immediate things that pop up where it's, you know, if I have customers and I have some data, which customers are going to churn? And then, you know, how can I take an action to maybe, you know, prevent that from happening?

Starting point is 00:45:22 That tends to be the first order things we notice. And then obviously if you're e-comm, recommendation engines are a really big one. And I don't know, do you have any other ideas, Matt? Yeah, yeah. Honestly, I think one of the most common applications I've seen, and this will probably resonate with you, Eric, is just very basic ad tech. And I don't mean like building some kind of advertising system. I just mean understanding who your customers are, who's likely to buy from you and feeding that data

Starting point is 00:45:48 into Facebook or Google ads. And you would be surprised how often that's not happening at all, where there are just hundreds of millions of dollars being burned without a lot of clarity on what's going on, or certainly not a lot of feedback to improve that loop and the efficiency of that spend. Now, one thing that's going to be interesting to watch now is this tightening of data privacy practices and how effective these advertising practices are going to be in the future. I suspect some companies may have just missed the opportunity and the window on some ad tech may be closing in the near future. I would also add to that. I think that any operation,

Starting point is 00:46:27 so ML is a really good fit when you have operations that are happening at such a high volume or at such a fast rate that it's really difficult for humans to keep up. Right. So any, anytime you have that, that's a classic use case for implementing ML. I would say some anti-patterns that we often see is using machine learning on BI data. And you might ask yourself, huh, that's interesting. I make models from BI data all the time. And I would ask you, okay, so in your model, using your BI data, how much are your features

Starting point is 00:47:04 correlated with your label? I'd be willing to guess that in a lot of cases, they're pretty highly correlated because you can answer a lot of questions using just the data model that you already have, assuming that you've modeled it correctly. And so I noticed this because I was at an AutoML startup where we dealt a lot with ingesting BI data. And over and over, you know, when I started, I sat back and thought about the problem they're trying to solve. I was like, that's a SQL statement, actually.
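A minimal sketch of the anti-pattern Joe describes, under made-up assumptions: the table, its columns, and the churn rule below are hypothetical, and sqlite3 stands in for whatever warehouse holds the BI data. When the label was derived from the same modeled data as the feature, one SQL statement reproduces it exactly and there is nothing left for a model to learn.

```python
import sqlite3

# Hypothetical BI rows: (customer_id, orders_last_90d, churned).
# Assumption: "churned" was derived from the same order history as the feature.
ROWS = [(1, 0, 1), (2, 5, 0), (3, 0, 1), (4, 2, 0)]


def labels(rows):
    # The label column a model would be trained to predict.
    return {customer_id: churned for customer_id, _, churned in rows}


def churn_from_sql(rows):
    # Recompute the label with a plain SQL statement instead of a model.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(customer_id INTEGER, orders_last_90d INTEGER, churned INTEGER)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT customer_id, "
        "CASE WHEN orders_last_90d = 0 THEN 1 ELSE 0 END "
        "FROM customers ORDER BY customer_id"
    ).fetchall()
    conn.close()
    return dict(result)
```

If `churn_from_sql(ROWS)` matches `labels(ROWS)` for every customer, the "model" is really a report, which is the situation described above.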

Starting point is 00:47:32 So because it wasn't automated in such a fashion, right, where you would get this feedback loop with your ML, which, you know, and a model, which in turn would help improve processes. It was very much, you know, what I see often is people will make these models and they'll just be these static models. But when you look at it, when you step back and look at it, that's actually, they just made a report. Interesting. Yeah. Yeah. And going back really quickly on the advertising use case,

Starting point is 00:47:58 Matthew, I agree. It is amazing how challenging it is to get, I mean, reporting, to your point, Joe, like the reporting around the sort of like full funnel attribution and marketing involves so much more data engineering and pipeline work than you would guess, right? I mean, it's sort of like this horrible, gnarly, especially when you sort of are crossing different platforms, right? Going from, you know, web to mobile or, you know, marketing web to product web, and you're trying to tie all that stuff together. It's just, it's so, so hard to, I mean, it's not like the technology doesn't exist, but

Starting point is 00:48:48 I think to the point that you've made multiple times on the episode, the crossing the organization in order to do that and be, I mean, a lot of BI is the same way, right? It's just really, there's so much involved in getting all of it together and getting all of it right. We are close to time here, but I have one more question. Joe, you mentioned real time. So I would just love your perspective really quickly on the state of real time. It's one of those marketing type terms where it's used very liberally, and anyone who works in data engineering knows that, you know, we're still in early innings, right? Like it's pretty, it's certainly feasible. And a lot of

Starting point is 00:49:36 companies do interesting things with real time, but I think we're still in early innings. I'd just love to hear from your perspective, when you see companies trying to do real time, what do you see on the ground? What are, what are sort of the, some of the current technologies that they're using? And then what types of things do you see coming in the future that'll be sort of the game changers as far as real time goes? Real time is most effective when you're able to take automated actions against that real time data, right? So an example would be, you know, like IoT. That's a classic example. Data is flowing in and, you know, you're going to use that data to do something, right?

Starting point is 00:50:17 You can certainly store that data into a data warehouse or data lake for, you know, kind of after the fact analysis. I would say the state of like real-time analytics is a really interesting one. And Matt, like your thoughts on this too. We see a lot of companies wanting to do what they call real-time analytics, but, you know, if you take an extreme example, say data shows up every millisecond and it's updating a chart, I guess our question is what, what action are you going to take with that chart? Right. Right. Who's, who's just sitting there looking at the chart all day long.

Starting point is 00:50:53 Right. Yeah. So the, so the, the natural, like the actual next question as well. Okay. If you, if this is, this kind of goes back to the machine learning discussion where we're talking about high volume, you know, high velocity data where a human can't react in that, you know, that, that speed, right? That's where automation comes in. And so to me, I think that's where real time is heading is there's sort of a fascination, I think, like a gee whiz, like, oh, I can binge watch my business and watch everything in real time. I'm like, if you have to binge watch your business at that extent, you have a really shitty business actually. So you shouldn't have to pay that much attention to minutiae,

Starting point is 00:51:30 right? It's crazy. It's like, it's like watching your hair grow on your arms or something. So, you know, with that said, I think the future of it is you're just going to see it. Machine learning is going to be a lot more tightly coupled to real-time systems. I think whenever continuous learning is figured out and working at scale, I see that as the next natural evolution. What are your thoughts, Matt? I would say two things. So the way real-time is marketed right now tends to be pretty problematic. I think it's pitched as this universal solution, boil the ocean, replace all of your batch systems with real-time. And companies get into it and they discover that it's very expensive and it brings a whole host of

Starting point is 00:52:12 new problems. In many cases, these companies were already struggling with their batch systems and now the struggles just explode and doing basic things like joins suddenly becomes very hard. Having said this, I think it's a very promising domain. I think most companies of any size have some problem that would be enhanced where they could solve that problem better by using a real-time system. And so my recommendation generally when people are talking about real-time is like, okay, what problem are we interested in solving here? And this goes back to what Joe was saying. Like you want to couple it with some kind of automation. So let's find a place where real-time

Starting point is 00:52:48 can actually have an impact and deploy it there. And then we can look for the next use case. In terms of the technology and how that's changed, I think the huge difference now today from say five, six years ago, is that I have these off the shelf, really nice real-time solutions that manage all the layers for me.

Starting point is 00:53:10 Because in the past, deploying like a Lambda architecture would require a huge team of very expensive, insanely good engineers. And now a lot of these data warehouse products actually have off the shelf, real-time Lambda architecture built in that's just managed for you and taken care of. And so it's... the Dataflow paper, I think it had a really good distinction between real-time and batch.

Starting point is 00:53:47 And what their distinction was, instead of thinking about it in terms of real-time or batch or streaming, think about things as bounded or unbounded by time. I mean, we only do batch right now because it's an artificial distinction that we have to do because of technical limitations, right? So you time-bound your data,

Starting point is 00:54:04 but in essence, all data is actually unbounded. And so the closer you can get to sort of this organic deal with data and events just sort of happening as they happen, like the rest of the world operates, like the natural world and universe is real time. It's all event driven. Humans are the only ones that seem to batch things up. And it's more just because it's convenient, not because it's actually how things work. Sure. You know, so, you know, where does, what does the future of batch look like in a real time world? I think that they're actually synonymous because batch is actually a subset

Starting point is 00:54:39 of real time. And when you, when you take away the time bounded constraint, you know, it's, it is the same thing. So it's a sort of transitive property of time-boundedness of data and events. Very, very elegant explanation there, Joe. That was wonderful. Thank you. Great. Well, we are at time here, and this is a great show. Really good conversation.
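Joe's bounded-versus-unbounded framing can be sketched in a few lines. This is a simplified illustration, not the Dataflow model itself: it assumes events are plain timestamps in seconds that arrive in order, and the function and variable names are hypothetical. The same windowing function handles a finite list (batch) and an endless generator (streaming); the only difference is whether the input ever ends.

```python
from itertools import count, islice


def tumbling_counts(events, window_seconds):
    """Yield (window_start, event_count) for each window that has events.

    Assumes timestamps arrive in non-decreasing order.
    """
    current_start = None
    current_count = 0
    for timestamp in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # A later window has begun, so the previous one is complete.
            yield current_start, current_count
            current_start, current_count = window_start, 0
        current_count += 1
    # Only a bounded input gets here: flush the last open window.
    if current_start is not None:
        yield current_start, current_count


# Bounded input ("batch"): a finite list of event timestamps.
batch = [0, 1, 2, 10, 11, 25]
batch_result = list(tumbling_counts(batch, window_seconds=10))

# Unbounded input ("streaming"): one event every 4 seconds, forever.
stream = count(start=0, step=4)
stream_result = list(islice(tumbling_counts(stream, window_seconds=10), 3))
```

Here `batch_result` is `[(0, 3), (10, 2), (20, 1)]` and `stream_result` is `[(0, 3), (10, 2), (20, 3)]`. The batch case is the streaming case with an end of input, which is the sense in which batch is a subset of real time.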

Starting point is 00:55:11 I learned a ton and we'd love to check back in with you at some point in the future and have you back on to hear about the new things you're learning as the space unfolds. So gentlemen, thank you so much for joining us. Thanks. Thanks, Eric. Thanks, Kostas. That's great talking to you guys. Thanks for hosting us. Well, that was a really interesting chat. I think beyond learning that there are multiple types of infinities, which is still bending my brain from a mathematics perspective, I thought that the way that Joe described real-time data and the distinction between batch data and real-time data as really just sort of a distinction that we use because it's an easy way for us to sort of digest the concept. But he said, in reality, all data is real-time. And as the technology catches up, we'll see that concept play out more and more in companies. And I just, I really appreciated that. I think he, in a really clear, concise and elegant way,

Starting point is 00:56:10 made that distinction for us. Yeah, absolutely. I mean, that was a very, I think he had a very elegant way of explaining and describing this fact that at the end, batching is just a convenient approximation to reality that we humans do because we're constrained by the technologies that we have. But I think that this is going to change, and I think it's changing already. We see more and more, let's say, event-driven and streaming-based approaches to problems that traditionally were tackled by batching mechanisms. Outside of this, which of course, like I think it was amazing, like conclusion to our conversation with them,

Starting point is 00:56:48 I found extremely interesting this whole journey of starting like from science, going to data science, which it's a pattern that I think, Eric, we have seen a lot in this show. But as the next step for them, going to data engineering, because they

Starting point is 00:57:05 figured out that the data engineering is like the real problem that has to be solved before we figure out the most, let's say, sexy in a way and complicated cases of machine learning. Like at the end, if you don't solve the problem of the quality of your data, for example, the availability of your data, your model is completely useless. And yeah, that was, I mean, I know that we are both like aware of this fact, but it was great to hear that from these two gentlemen, especially because of like the experience that they have and all the different companies that they have helped so far, like building their data infrastructure. Totally agree. Well, thank you again for joining us on the Data Stack Show. Be sure to subscribe on your favorite podcast network if you haven't already. And that way you'll get notified of new episodes every week.

Starting point is 00:57:55 And we will catch you next time.
