The Data Stack Show - 76: Why a Data Team Should Limit Its Own Superpowers with Sean Halliburton of CNN
Episode Date: February 23, 2022

Highlights from this week’s conversation include:
Sean’s career journey (3:27)
Optimization and localized testing results (7:49)
Denying potential access to more data (13:46)
Other dimensions data has... (18:32)
The other side of capturing events (20:55)
Data equivalent of API contracts (25:03)
SDK restrictiveness for developers (27:40)
How to know if you’re still sending the right data (30:38)
Debugging that starts in a client of a mobile app (36:08)
Communicating about data (38:36)
The next phase of tooling (41:49)
Advice for aspiring managers (45:21)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, one platform for all your customer data pipelines.
Learn more at Rudderstack.com.
And don't forget, we're hiring for all sorts of roles.
Exciting news! We are going to do another Data Stack Show livestream. That's where we record the episode live, and you can join us and ask
your questions. Kostas and I have been talking a ton about reverse ETL and getting data out of
the warehouse. So we had Brooks go round up the brightest minds
building reverse ETL products in the industry. We have people from Census, High Touch, Workato,
and we're super excited to talk to them. You should go to datastackshow.com
slash live. That's datastackshow.com slash live and register. We'll send you a link.
You can join us and make sure to bring your questions. Welcome to the Data Stack Show. We have Sean Halliburton on the show today,
and he has a fascinating career. Number one, he started as a front-end engineer,
which I think is an interesting career path into data engineering, and he is doing some really cool stuff.
Costas, there are two big topics on my mind, and I'm just going to go ahead and warn you.
I'm going to monopolize the first part of the conversation because I'm so interested in these
two topics. So Sean did a ton of work at a huge retailer on testing. So testing and optimization.
And I just know from experience, there's data pain
all over testing because testing tools create silos, et cetera. And so he ran programs at a
massive scale. So I want to hear how he dealt with that because my guess is that he did.
The second one, I guess we're only supposed to pick one, but I'll break the rules,
is on clickstream data. So he also managed infrastructure around clickstream data and sort of made this migration to real
time, which is also fascinating and something that we don't talk a whole lot about on the
show.
And so I just, I can't wait to hear what he did and how he did it.
Yeah, 100%.
For me, I mean, he's a person that has worked for a long time in this space, and he has experienced engineering from many different sides.
So he hasn't been just a data engineer. He has been, as you said, a front-end engineer, and at some point, after doing different things, ended up becoming a data engineer. So I want to understand and learn from him: how do you deal with the everyday
reality of maintaining an infrastructure around data?
How do you figure out when things go wrong?
What does that mean?
How do you build, let's say, the right intuition to figure out when we should act immediately and when not?
And most importantly, how do you communicate that among different stakeholders,
not just the engineering team, but everyone else who is involved in the company?
Because at the end, the customers of data engineering are always internal, right?
You deliver something, which is data that someone else needs to do their job, right?
So I'd love to hear from him, especially because he has been in such big organizations, what
kind of issues he has experienced and get some advice from him.
Absolutely.
Well, let's dig in and talk with Sean.
Yeah, let's do it.
Sean, welcome to the Data Stack Show. We're excited to chat about all sorts of things,
in particular, sort of clickstream data and real-time stuff. So welcome.
Thank you. I'm super stoked to be here.
Okay. So give us, you have a super interesting history, background as an actual engineer,
software engineer. Could you just give us the abbreviated history of where you
started and what you're doing today? Yeah. So I come at this gig from a lot of different
directions. I was actually an English major in college. Before that, I was a music major.
That's one of my favorite things about what we do: it takes a lot of different disciplines,
and those disciplines come in handy at a lot of different times. I've also been an individual contributor and I've been an engineering
manager. And along with that, I've worn different hats doing program management, doing product
management at different times as the need has been there. So I started out in the front end. And so there's another
angle that I think is unusual in this field, but I'm also self-taught. And when I first started
15 plus years ago, it was really easy to just dig into the front end of building your own site,
spit out some static HTML, and then slowly enhance
it with progressive JavaScript. And then some of the site templating engines and WordPress
started to come in. And more and more people started tweaking their MySpace profiles and
things like that. So I learned how to build data-driven websites and started specializing professionally in
data-driven lead generation and optimizing landing page flows. I worked with University
of Phoenix's optimization team for several years and really learned a lot about form flows and not only optimizing those pages to try to best reach the user and keep them
engaged to convert and get more information, but also to optimize the data that came out of them
that would go into the backend and power so many things behind the scenes. I went from there to,
I served about six to seven years at Nordstrom as both an IC and engineering manager
and really built out a program around optimization and then expanded into clickstream data
engineering, and over time got addicted to replacing expensive enterprise third-party
SaaS solutions with open source-based solutions deployed to the cloud,
which was still relatively new in the space at the time. And that's kind of where I'm at today
with CNN as a staff data engineer. And we've worked with a number of tools, some we love,
some that we thought could be better. And where we see opportunities to improve using
open source tools, we have a highly capable team to do that. But interestingly enough,
over the last two to three years, I think the pace of the greater community has been such,
and some of the key tools like commercial databases have improved so much that I've come back around a little bit
and embraced SaaS tools where it makes sense to for things like reverse ETL, analytics,
and data quality, basically post data warehouse.
Interesting. Okay, Kostas, I know you have a ton of questions, but Sean, you were at the tip of the spear in terms of testing and optimization, and getting it right, you know, moving something a point of a percentage, can mean huge amounts of revenue.
But you come at it from the data side as well. And at least in my experience with testing, there's this
challenge of sort of the localized testing results in whatever testing tool you're using, right? So
you get a statistically significant result that this landing page is better, or this button is
better, this layout, or, you know, all that sort of stuff, which is great, because like,
math on multivariate testing is, you know, pretty complex, but it's hard to
tie that data to the rest of the business. Did you experience that?
Yeah. So I have this saying that people drive software, software drives people, and the tools you use have to meet the state of your program at the time and, conversely, are influenced by them.
When it comes to optimization, you know, everyone starts with the basics,
testing different headlines, different banners, maybe different fonts,
and you kind of mature into, you might be running a handful of tests per month.
You get a little bit more experienced, more savvy, more strategic.
Maybe you level up to a better testing platform and hire more analysts that can handle the output.
And now you're running maybe a couple of dozen tests per month and testing custom flows and
things like that. But there's still a limit as long as you're using a dedicated
optimization platform. That certainly was the case for us at Nordstrom.
We would generate analyses out of, we were using Maximizer at the time,
but those analyses were reporting things like sales conversions using potentially different methods from our enterprise
clickstream solution, which was IBM CoreMetrics at the time, based off of two completely different
databases, both of which were black boxes. Of course, a vendor can only convey so much about
the logic that they're running in their own ETL on the back end. And as the technical team
around these practices itself matures, it becomes more and more difficult to explain some of those
results. At the same time, the more testing you do, the more data you naturally want to capture
around those tests. So your analysts want to know more, and their questions increasingly overlap with those being asked by your product owners that are
analyzing your wider clickstream data. So I don't think it was any coincidence that we began to look
for alternatives to both of these solutions for us. And we landed on a couple of open source
options. One was Planout, which was a static Facebook library at the time.
And we developed that into a service designed to be hosted internally from AWS
and scale up to meet hundreds of thousands of requests at a time.
And on the Clickstream side, we planned and designed to scale up to handle more experiment results directly into
the clickstream pipeline. And we migrated from core metrics to Snowplow. We leveraged the open
source versions of each one and put a lot of work into making them more robust and scalable. And
over a couple of years, those two practices, I would say,
really did become one. So what I'm hearing is you essentially sort of eschewed the third-party
SaaS testing infrastructure and clickstream data infrastructure and said, we want it all
on our own infrastructure. So then you had access to all the data, right?
So for analysts and results, it's like,
we can do whatever we want.
Yeah.
So this was in the early teens
and AWS itself was really still
kind of in that early explosion phase
where more highly capable
and agile engineering teams were clamoring to
get into the cloud. I mean, just the difference between working with our legacy Teradata
databases on-prem and spinning up a Redshift cluster, I didn't need to ask anyone to spin
up that Redshift cluster. I didn't need to ask anyone to resize it or anything.
My Clickstream team was able to tag our own events in the front end.
Ironically, we tagged some Clickstream events using our Maximizer Optimization JavaScript injection engine.
And we could land the results into our own data warehouse, into our own real-time pipeline within hours.
We hacked away at this over a weekend and came back the next weekend and were so energized and really relieved because the right tools can have that kind of impact on your quality of life.
It became equally important over time to engineer the limits around those capabilities, though, as well.
So that was one of the more interesting learnings that we had.
The more power you find you have, suddenly the challenge becomes when to say no to things and when to put artificial limits on those powers.
Yes, we have access to real-time data now,
but here's the thing. If we copy that data out in 10 parallel streams, we could have 10 copies
of the data. If we produce a real-time dashboard of this metric or that metric, we have to make
sure that that metric aligns with the definition of other metrics that
a product owner might have access to that we don't even know about going in. Could you give just
maybe one, and you did a little bit, but just like a specific example of, well, and stepping back just
a little bit, access to data creates more appetite for data, right? You know, it's kind of like you
get a report and then it creates additional questions, right? Which, you know, sort of,
you know, creates additional reports. But could you give an example of like maybe a specific
example, if you can, of a situation where it was like, oh, this is cool, we should do this.
But then the conclusion was, well, just because we can doesn't mean we should.
Yes, absolutely. So again, to try to put a time frame around this, I would say this was when we had a pretty large scale Lambda architecture between our batch data ETL side, which was our
primary reporting engine, and the real-time side, as I briefly described, and that pivoted exclusively on Kinesis.
Well, Kinesis is, it's an outstanding service.
It really is.
I love it.
It's similarly easy to provision a stream.
It's like managed Kafka.
I'm sure it's not exactly Kafka under the hood,
but the principles are the same.
And it's almost too easy to get up and running with. It's also easy to begin overpowering yourself with. We started landing
so much data that scaling up a stream became, I would say, to put it nicely, excessive overhead. It could take hours. It was an operation that should not be
done in an emergency scaling situation. And it kind of relates back to one of the fundamental
principles of data that I don't think we talk enough about really, and that's data has a definite shape.
You could describe it in 2D terms or even 3D terms.
For the purposes of this example, I would describe it in 2D terms and just say, it's easy to consider the volume of events that you're taking in.
It's easy to describe those in terms of requests per seconds
or events per second flowing through your pipe.
It's easy to forget that those events
have their own depth to them or their own width,
however you want to describe it.
The more you try to shove into that payload,
you can create exponential effects for yourself downstream
that are easy to overlook.
And in our case at Nordstrom, we made a fundamental shift at one point to basically go from a base unit of page views down to a base unit of content impressions.
So think of it as like going from molecular to atomic.
And that's essentially what we did. And we took in a flood of new data into the pipe that we
didn't have before in a short amount of time. And also remember Kinesis only recently developed auto scaling capabilities.
So solutions to that scaling problem were really homegrown until very recently.
So I think that's an already classic example of be careful of what you wish for and know that you have some very powerful weapons at your disposal.
Just stop to think about, as you said,
okay, we can do it, but should we? What is the value of all that additional data? I would suggest
to not only engineering managers, but product managers, be very deliberate about the value
you anticipate getting out of that additional data, because it costs money, whether it's
in storage, in compute, or in transit in between.
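To put rough numbers on that idea of payload "width" multiplying everything downstream, here is an illustrative sketch. The event rates and payload sizes are invented, not Nordstrom's actual figures; the only real constants assumed are the standard Kinesis provisioned-mode write limits of roughly 1 MiB per second and 1,000 records per second per shard.

```python
# Back-of-envelope math for the "shape" of event data: the same traffic gets
# much more expensive as the base unit gets finer-grained and the payload wider.
# Numbers below are illustrative, not real figures from the conversation.
import math

SHARD_BYTES_PER_SEC = 1_048_576   # Kinesis write limit: ~1 MiB/s per shard
SHARD_RECORDS_PER_SEC = 1_000     # Kinesis write limit: 1,000 records/s per shard

def shards_needed(events_per_sec: float, payload_bytes: int) -> int:
    """Minimum shards to absorb a steady stream of events of a given size."""
    by_bytes = events_per_sec * payload_bytes / SHARD_BYTES_PER_SEC
    by_records = events_per_sec / SHARD_RECORDS_PER_SEC
    return math.ceil(max(by_bytes, by_records))

def daily_gb(events_per_sec: float, payload_bytes: int) -> float:
    """Raw volume landed per day, before any compression or enrichment."""
    return events_per_sec * payload_bytes * 86_400 / 1e9

scenarios = [
    ("page views, 2 KB payload", 5_000, 2_048),
    ("content impressions, 2 KB payload", 50_000, 2_048),
    ("content impressions, 10 KB payload", 50_000, 10_240),
]
for label, eps, size in scenarios:
    print(f"{label}: {shards_needed(eps, size)} shards, ~{daily_gb(eps, size):,.0f} GB/day")
```

Going from page views to content impressions, and then widening each payload, multiplies both the shard count and the daily landed volume, which is exactly the cost Sean suggests weighing before adding more to the event.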
Yeah, that's a very, very interesting point, Sean.
And I think one of the more difficult things to do, and that many people don't do at all, is to sit down and consider when more data doesn't mean more signal but actually adds noise.
And that's something that I don't think we discuss enough yet.
Maybe we will, now that people are more into things like quality and metrics repositories and are also trying to optimize these parts.
But you talked about dimensions, and from what I understood, the increase of dimensions has an impact on the volume, right? And you talked about scalability issues and how to auto scale and all these things. What other dimensions does data have, and what other, let's say, parameters of the
data platform do they affect, outside of volume and scalability?
Sure, sure. So I think
a good example is kind of where we started with this discussion of layering experiment data onto clickstream data.
Or it may be a case where a product manager wants a custom context around,
you know, say you have a mobile app that loads a web view and suddenly you're crossing in between platforms,
but the product manager wants to know what's happening in each context.
And so you may have a variant type.
You may have a JSON blob embedded into your larger event payload
that quickly bloats in size.
Or here's another example.
In an attempt to simplify instrumentation at Nordstrom,
we attempted to capture the entire React state, shared through the clickstream pipeline, so we could have as much data as we could possibly use, which was super powerful but, again, could be too much.
When I'm debugging in the front end, I tend not to use an extension, even though there are a couple of different Snowplow debuggers that give you sort of a clean text representation. I try to watch the payloads as they normally flow through the browser and to the collector, and to keep my fingers
on the raw data so that I don't forget what's being sent through, and to ask from time to time,
you know, as part of your debugging routine, what is the value of this data?
Okay. You want to capture hover events. How much intent do you expect to get
out of a hover event? How will you be able to tell what's coincidental versus what is purely
intentional?
That's a great point. And how do you, I mean, from your experience, because as you describe this, I cannot stop thinking of how it feels to be on the other side, where you have to consume this data, do analytics, and maintain the schema in your data warehouse, all these things. How much is this part of the data stack affected by the decisions that happen, let's say, on the instrumentation side?
Because at the end, okay, adding another line of code there to capture another event, it's not that bad, right? It's not something that's going to hurt that much. But what's the impact that it has on the other side? How much more difficult does it make working with the data?
That's definitely a
big piece of the puzzle and a big challenge.
And that's kind of where you
verge into API design,
right? And you spend enough time in software engineering and you realize the challenges
of API design. It's tricky. It's tricky to get the contract right in such a way that you can
adapt it later without forcing constant breaking changes. And because those
breaking changes will not only break your partners upstream, but they'll break your pipeline. And
if they break your pipeline, you've broken your consumers downstream.
And I've always worked at places that were wonderfully democratic. But by the same token,
you end up being your own evangelist
because you are constantly pitching your product internally
for consumers to use.
They don't have to use your product.
Any VP, any director can go out
and purchase their own solution
if they really want to, generally speaking.
There are always exceptions, of course.
And none of that is to denigrate any way of doing it or, you know, any leader that I've
learned from in any way.
That's just the nature of our business.
So, and things move so quickly, you know, especially over the last two to three years.
So I apologize, this is a tangent, but I wanted to highlight one of the things that I think has really accelerated tool development on the fringes in the data landscape.
You know, we've all seen the huge poster with the sprawling options for every which way you could come at data. But I think data warehouses
in general were a big blocker for a number of years. And initially Redshift was the big lead,
right? And then BigQuery right on the heels of that. And then I think you hit a wall with
Redshift and it stagnated for a few years until Snowflake came along. Now we are a Snowflake shop, so I can
praise it directly. And we've been very happy with it as a third party solution. And we've also
touched on, you know, Lambda architectures and some of the difficulties of those. And I think
a lot of the talk of Kappa versus lambda has been put on the back burner because it's kind of been obfuscated away with advances in piping data into your data warehouse.
We're a heavy Snowpipe user.
And if you had come to me a couple of years ago and said, well, can we have both? I would have said not necessarily, but now we kind of
hand wave the problem away because, essentially, it's sort of like using Firehose
in the AWS landscape, but we can pipe our data from the front end into our data warehouse in under a minute now. So why keep a Lambda architecture around,
but also I don't feel like we need to obsess
about a Kappa architecture either.
You said something that I found very, very interesting.
You talked about APIs and contracts,
and I want to ask you,
what's the equivalent of an API contract in the data world? What can we as data engineers use to communicate, let's say, the same things that we do with a contract between APIs? If there is something, I don't know, maybe there isn't, and if there isn't, why don't we have something? Sounds like...
Yeah.
So like the optimization analogy,
I think it depends on the maturity of your data engineering team.
And it's probably more typical for a data engineering team
in its younger years to handle all the instrumentation responsibilities.
But at some point, product owners and executives are going to want some options for self-service.
And when that happens, you have a couple of different, I think you have two primary approaches.
And one is a service-oriented architecture, which was my initial approach
and answer to that question, where we provided an endpoint and a contract for logging data,
just like so many other logging solutions and other APIs. And that worked well, I would say, for not quite a year before we started hitting walls on that.
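One concrete way to picture that "endpoint plus contract" approach is a logging service that publishes a schema and rejects events that do not match it. This is a rough sketch only: the field names and schema are hypothetical, and the Python jsonschema package is used here just to show the shape of the idea, not because it is what Sean's team used.

```python
# A minimal sketch of a published contract for a logging endpoint.
# The schema and field names are hypothetical examples.
from jsonschema import Draft7Validator

EVENT_CONTRACT_V1 = {
    "type": "object",
    "required": ["event_name", "timestamp", "session_id"],
    "properties": {
        "event_name": {"type": "string", "maxLength": 128},
        "timestamp": {"type": "string"},
        "session_id": {"type": "string"},
        # One bounded escape hatch for custom data instead of arbitrary keys.
        "custom": {"type": "object", "maxProperties": 20},
    },
    "additionalProperties": False,
}

validator = Draft7Validator(EVENT_CONTRACT_V1)

def violations(event: dict) -> list:
    """Return contract violations; an empty list means the event is accepted."""
    return [err.message for err in validator.iter_errors(event)]

bad_event = {"event_name": "page_view", "surprise_field": 42}
print(violations(bad_event))  # missing fields plus an unexpected property
```

Evolving the contract then means publishing a v2 alongside v1 rather than silently changing v1, which is the breaking-change problem Sean warns about.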
I think longer term, the better solution, which we have now at CNN, and I think is a major asset, is we offer SDKs to front our data pipeline.
And our primary client is web,
but we're increasingly expanding
into the mobile SDK space.
So that alone is a challenge
because the more languages
you want to offer an SDK in,
you need developers
that are proficient
in those languages, of course.
But for where we're at right now,
between CNN web and mobile and increasingly CNN+, our JavaScript and Swift SDKs meet our needs.
And I think that is a good compromise.
It's a more flexible one, especially if you're able to serve your SDK via CDN, then you can publish updates and fixes and patches
and new features whenever you need
in a much more healthy manner.
And force fewer upgrades on those downstream teams and, by extension, their end users.
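Sean's SDKs are JavaScript and Swift; purely for consistency with the other sketches here, this is a hypothetical Python rendering of the pattern, in which the SDK owns payload construction and versioning so producing teams call a stable function instead of hand-building events. Transport details like batching and retries are omitted.

```python
# Hypothetical sketch of an SDK-style "stock function": callers pass a few
# arguments and the SDK owns the payload shape, defaults, and version stamp.
import time
import uuid
from typing import Optional

SDK_VERSION = "1.4.2"  # bumped by the data team and shipped centrally, e.g. via CDN

def track(event_name: str, session_id: str, custom: Optional[dict] = None) -> dict:
    """Construct a well-formed event so producers never hand-build payloads."""
    custom = custom or {}
    if len(custom) > 20:
        raise ValueError("custom properties are capped at 20 keys by the contract")
    return {
        "event_name": event_name,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "session_id": session_id,
        "event_id": str(uuid.uuid4()),
        "sdk_version": SDK_VERSION,
        "custom": custom,  # the bounded escape hatch discussed below
    }

# Example call from an application team:
event = track("video_start", session_id="abc-123", custom={"player": "web"})
```

Because the data team ships the function itself, fixes and new fields roll out centrally without forcing every producing team to upgrade in lockstep.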
How restrictive are these SDKs for the developer?
Do they enforce, for example, specific types?
Do you reach the point where there are specific types that they have to follow?
Or can they do whatever they want and send whatever data they want at the end, right?
Because if that is the case, again, the contract can be broken.
So what kind of balance have you figured out there?
Yeah, now we're into kind of the fine tuning knobs
of self-service, right?
And verging into data governance now.
So we've provided an SDK.
We've provided these stock functions
that construct the event payloads.
But yeah, there's always some edge case. There's always some custom
request where we want to be able to pass data this way under this name that the SDK does not
allow for. Or maybe there's some quirk of your legacy CMS where it outputs data already in some way. If only we could shove it in and shim it into that payload.
So, yeah, we absolutely, there's a line we walk, there's a balance we try to strike of self-service
where we can offer this one custom carve-out space where you can pass a JSON blob,
ideally with some limits. It's probably an
arbitrary honor system arrangement, but we'll take your data into the data warehouse, but it'll still
be in JSON or, okay, we can offer custom enrichment of that data. Once in the data warehouse, we'll
model it for you for a set period of time. And then past that point, either the instrumentation
has to change, or we just have to figure something else out that works for both sides. Yeah, that's
a great question. It's always a challenge between where does the labor fall? Whose responsibility
is that? Whose ownership is that? And governance is a challenge in so many aspects of life these days, and data engineering and end users and analytics are no exception to that.
100%, I totally agree with you. And I think it's one of those problems where, as the space, the industry, is maturing, we'll see more and more solutions around it, and probably also some kind of standardization, like we've seen with things like dbt. But
from what I understand, and that's also my experience, issues are inevitable, right? Like, something will break at some point. And the problem with data in general is that it can break in a way where it's not obvious that something is going wrong. You can have, let's say, duplicate events, for example, or you might have data reliability issues, right? Your pipeline is still there. It still runs, and outside of seeing something out of the ordinary in the volume of the data or something like that, you can't really know if you are still sending the right data, right? So how do you deal with that? What kind of mechanisms have you figured out? Because you have a very long career in this space. What do you do?
Yeah, I'm smiling because this always reminds me of the line from Shakespeare in Love where the producer is seeing this madness going on in the theater and asking how we are ever going to pull this off, and I believe it's Geoffrey Rush who says, nobody knows, but it always works, we'll work it out, we'll figure it out. And that was definitely the case in data engineering until very recently. We were flying blind. We were. There was little to no observability into the data itself. You could see whether your servers were running and whether they were maxing out on CPU, and that could take you deep down the rabbit hole of JVM optimization. But really describing the data behind your data was surprisingly hard, especially when it came to describing how ETL was performing. That was really hard for me, both as an engineer and as a manager
responsible for representing my program and my team and the culture and engineers that I cared
very deeply about continuing to grow. And just in terms of maintaining their quality of life, there were some
downright stressful times, there was definitely burnout on the team. And so again, people drive
software drives people. And I knew we could do better at the time, I very much wanted tools to
do that. And I'm happy to say, just in the last six months, as an IC again, back
at CNN, I've been focusing on data quality and observability quite a bit.
I've been testing different solutions.
Recently, I've been working with VData and Monte Carlo as observability solutions.
Again, I think having a more dynamic data warehouse like Snowflake
helps unlock a lot. And I've been working on, simply put, data quality algorithms that can not
only tell us how we're doing, but better define, illustrate and advertise our SLAs and SLOs to our partners
and tell them how we're doing with some real numbers.
Oh, that's super interesting.
Can you share a little bit more around that?
Sure.
So I believe Eric mentioned dbt back a while ago.
I'm a very proud and happy dbt user. We've worked with them extensively to harden our data stack, and I'm using it to capture things like presence or absence of critical fields in our enriched tables, and to capture latency of records as measured from when they land in our raw data tables versus when they reach enrichment and our data marts.
And I'm beginning, as I said, to develop an algorithm: starting from a certain baseline, if, say, a record is missing a critical user ID, I might subtract a tenth of a point, or two tenths of a point, depending on how critical that ID is. Maybe it's a second-tier ID and it's not as important, and the record is otherwise usable. Maybe it's not usable. I still want to send the record on downstream, but with that metric attached to it. And then you calculate that on a record and then a table basis, and you can begin to calculate a daily average, a monthly average, and start to build a scorecard.
One of the biggest assets that I think our analytics board at Nordstrom had
was a fitness function. I think that term, maybe it's a little Amazon or
Microsoft centric, and maybe it's fallen out of favor a little bit, but it's sort of an assessment
of your program's technical capabilities and the impact you have on the business.
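Stepping outside the conversation for a moment, here is a rough sketch of the kind of record-level scoring Sean describes above. The field weights, latency threshold, and sample records are invented for illustration; in practice the underlying checks he mentions live in dbt against the raw and enriched tables.

```python
# Illustrative record-level quality scoring: start at 1.0, subtract a weighted
# penalty per missing or late field, then roll the scores up to a daily average
# that can feed a scorecard and SLO reporting. All weights are made up.
from statistics import mean

FIELD_PENALTIES = {
    "user_id": 0.2,      # first-tier identifier
    "session_id": 0.1,   # second-tier identifier
    "page_url": 0.05,
}
LATENCY_SLO_SECONDS = 300   # raw -> enriched within 5 minutes
LATENCY_PENALTY = 0.1

def score_record(record: dict) -> float:
    score = 1.0
    for field, penalty in FIELD_PENALTIES.items():
        if not record.get(field):
            score -= penalty
    if record.get("enriched_at", 0) - record.get("landed_at", 0) > LATENCY_SLO_SECONDS:
        score -= LATENCY_PENALTY
    return max(score, 0.0)

def daily_average(records: list) -> float:
    """Table-level daily average; trend this over time for the scorecard."""
    return mean(score_record(r) for r in records) if records else 1.0

sample = [
    {"user_id": "u1", "session_id": "s1", "page_url": "/", "landed_at": 0, "enriched_at": 60},
    {"user_id": None, "session_id": "s2", "page_url": "/a", "landed_at": 0, "enriched_at": 900},
]
print(daily_average(sample))  # 0.85: one clean record, one missing an ID and arriving late
```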
And when you work in analytics, that can actually be hard to do. But we were able to extrapolate a lot of performance metrics out of the test campaigns that we would run, out of the clickstream features that we would ship. I think that's actually more critical to you as the IC or the manager than it even is to your team's executives, because it gives you one more measurable to assess your performance against your OKRs.
That's super interesting. And okay, let's say we have in place amazing algorithms that help us monitor and figure out if something goes wrong.
Okay. And let's say something goes wrong. How do we debug data?
I mean, as software engineers, we know that we have tools to debug our software, right?
Like we have debuggers, we have different tools that we can use for that we have testing we have many different things that
they are both like let's say tools and also like engineering processes and best practices that we
have learned that like they help like reduce the risk of something breaking how do you how do you
debug like something that starts from the client of a mobile app and reaches at some point your data warehouse and anything can go wrong
like between their line between like these two points right so how do you do that yeah that's
another balance you have to strike between how much work do you want to put into making your
synthetic tests appear as organic as possible, right?
We've tested our pipeline using tools like serverless artillery to the point that we can accept hundreds of thousands of requests per second.
I mean, think about the news industry in general.
Yeah, there are planned events like elections, but there are also unplanned events throughout the year as you go that can drive everyone to their phones and their laptops
and can be extremely unexpected. And we need to prepare for that. So we've used the output of
those tests to beef up things like cross-region failover pipelines,
things like that. But even then you're operating on an assumption where you're probably using a
fixed set of dummy events. So then you have to decide, okay, is it worth dedicating time maybe
to pull in some developers from the mobile team to more accurately simulate
how a user uses the app as they understand it. Keep in mind, their assumptions may be based on
your own assumptions that are coming through your pipeline based on the data that you're
serving to their analysts. But yeah, you could certainly go down a rabbit hole and put a lot
of work into automating tests from different platforms, build a device farm even. It's just a matter of how far down that hole you want to go and how much you want to invest.
Makes total sense. And my last question, and then I'll give the microphone back to Eric: you mentioned at some point the more senior people that are involved in this, like VPs, the leadership of the company, probably people that maybe don't even have a technical background, who cannot understand what delivery semantics are or the limitations of the technology that we have and all these things.
And at the same time, you said that like it's very important to make sure that you can communicate these things to them.
Do you have some advice to give around that?
Like how you can communicate effectively to your leadership team the limits of the data, how much they can trust it, and the limits of the technology and the people that we have?
Sure.
So that is a primary function of the engineering manager, naturally, but no engineering manager can do that alone. So my advice to those considering management is, before you accept such an opportunity, and it may be a fantastic opportunity, do all you can to ensure that you have backup in place. Insist on program management help, because you as an EM are busy managing not only the careers, but the lives and even the mental health of the technical talent that you worked hard to get in the door. And you can't be in every meeting.
You can't be in every scenario. You can't cover every hour. You'll burn yourself out if you try.
Just like I would also recommend insisting on a product manager, because there are already 100 different technical ways you could take a product, and 99 of those might not meet the actual demand of your internal users downstream.
And, you know, in data engineering, we talk mostly about internal users as our customers, but I do believe that extends to the external end user as well.
But similarly, there's only so much you can do as an engineering manager to evangelize for the system you're building and to canvas your users on how it could be better and how you could
better serve them. So those are really what I call the minimum three legs to the stool
that you need to build an effective data
engineering team and to meet those requests that are only going to mature as
your consumer teams mature and as the business matures around you.
Thank you. We got, I think,
some amazing ideas and advice from you. Eric,
what are your questions?
I have so many actually that we don't have time to get to because Brooks is telling us we're close to the buzzer. The SDK conversation is absolutely fascinating. So I'd love to have
you back on to talk more about that. I have two more questions for you, both kind of quick,
I think. Maybe not. Maybe I should stop saying that because they usually are. You have a really wide purview as sort of a buyer, user, researcher of tools.
You have a bias towards open source tooling, but you also use sort of third-party SaaS
that isn't open source. What are some of the tools, you know, whether you use them or not,
that you've seen in the data and data engineering space that are really exciting to you that you
think sort of represents like, okay, this is kind of the next phase of tooling that's going to be
added to the stack? Yeah, again, first and foremost, I would say the data observability and data quality tools, just because, again, they have such a direct impact on the quality of life for your engineers, a need that went unmet for so long. And I'm not even exactly sure why that is,
but I'm very happy to see that we've brought principles of site reliability engineering
into the data engineering space. It's like everything went into the cloud and all of
your sysadmins became SREs. And now a lot of those SREs
are starting to look toward the data space
or the data space is starting to look to those SREs
to say, hey, can you help us out
and make sure that this thing stays up?
Because if the data is gone, it's gone.
There's no way to get it back.
Yep.
So that's one thing I'm excited about.
I'm also, I would say,
carefully excited about machine learning. I think ML, like blockchain, is one of those things where it's easy to say, oh, we should be doing this, without thinking carefully about the value you want to capture. I would again suggest thinking carefully before you apply it, and about ethics, because I think that is critical to the process.
And that may be, you know, once data observability improves, I hope ethics improvements are right on
the heels of that. But executing ML models is still fairly complex, but I think that will
improve over the next couple of years and become more closely integrated with the data stack. And I think in terms of applications of all
of this, of ML and personalization, I think I'm most excited about the health space, which I have
not worked in personally, but I think it has the biggest impact simply because you have the greatest diversity of end users there.
And it's one of the most complex problem spaces, obviously.
And with our health system as it is, it's tempting to try to make an end around to try to deliver some of those solutions that kind of break down the silos.
So I hope we can continue to do that in a responsible way as well.
Very cool.
I am 100% aligned with you on all of that.
And actually, I've been writing a post that hits on some of those direct points.
Okay, last question before we're at the finish line.
You've been an individual contributor, an engineering
manager, and you've worked in a variety of roles. Maybe give your top two pieces of advice: maybe
you could give a piece of advice to an individual contributor who sort of aspires to
be a manager and then maybe give a piece of advice to a manager who's early in their career, you know, working with an engineering team,
data engineering team? Okay. So I would say for the IC that is considering management,
I would say it is absolutely a very rewarding career change, but it is a big change. And the role, it's not as simple as just
speaking as the most technically mature member. You know, frequently, the most senior
IC becomes the manager out of necessity, and it looks on paper like it's a very natural
transition. But that's not necessarily the case.
It's a very different skill set.
You do need to be able to speak to what's being built and delivered.
But you are coming at a complex technical system
that's hosted somewhere and rendered somewhere else. Come with humility, be prepared to listen,
ask more questions than you make pronouncements. And I think that's a good transition to the
person that is already a new manager is, again, expect to do much more listening,
ask tough questions when the time warrants it. But you're coming to learn from the people that you've retained or recruited to work for you,
leverage them to be your experts.
Don't try to be the smartest person in the room.
You are there to hire the smartest people in the room and to be able to send them
into rooms that you can't reach because you're overbooked and you will be overbooked.
That my friend is some of the best practical advice for management I've ever heard. And is so true. Well, Sean, this has been such a fun episode. I've learned a ton and we just thank
you so much for your time and sharing all your thoughts and wisdom. Absolutely. I'm happy to help anytime.
I feel like we could have talked for hours and that's such a cliche saying now, because we say
that every time and really it's true. I'm going to pick something really specific as my takeaway.
Whenever you hear about a data engineering team building their own SDKs, to me, that's an eye-opener because,
you know, I don't come from a software engineering background, but I know enough to know
that's a pretty heavy duty project to take on at the scale that they're running at, you know,
a company like CNN with, you know, traffic volumes that they have. I mean, building a robust SDK is
no joke. But the more I thought about that after Sean said it, I just kind of reviewed my, you know,
mental Rolodex of hearing that. And I realized, you know, it's really not the first time that
I've heard of a large enterprise organization building their own SDK infrastructure in large part because the
needs that they have to serve for downstream consumers, to Sean's point, is so complex.
And so even if you take something off the shelf and modify it, you end up with something that's
pretty different than the original SDK that you had anyway.
So that's just fascinating to me.
And it's pretty fascinating also, I think, to just consider a situation where building
your own SDK is the right solution.
Yeah, I totally agree with you.
I would say that I keep two things from the conversation we had with him.
One is the concept of contract that comes from
building APIs. I think it's a very interesting way of thinking and building also data contracts
and what data contract would look like or how it can be implemented and what we can learn
from building these services all these years and use this knowledge like also in the data space.
That's one thing and the other
thing is i think that by the end he gave some amazing advice on how to be a manager which is
i think it was super super valuable for anyone who is interested in both becoming a manager but
also interacting with managers, which is pretty much everyone, right? So that was also amazing.
Yeah, I agree.
And I think, you know, I mean, he said that in the context of data teams specifically,
but really just great advice in general.
So really appreciate that.
Yeah.
All right.
Well, thank you for joining us on the Data Stack Show.
Tune in for the next one.
Lots of great shows coming up.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe
on your favorite podcast app
to get notified about new episodes
every week.
We'd also love your feedback.
You can email me, Eric Dodds,
at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you
by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.