The Data Stack Show - 76: Why a Data Team Should Limit Its Own Superpowers with Sean Halliburton of CNN
Episode Date: February 23, 2022

Highlights from this week’s conversation include:
Sean’s career journey (3:27)
Optimization and localized testing results (7:49)
Denying potential access to more data (13:46)
Other dimensions data has... (18:32)
The other side of capturing events (20:55)
Data equivalent of API contracts (25:03)
SDK restrictiveness for developers (27:40)
How to know if you’re still sending the right data (30:38)
Debugging that starts in a client of a mobile app (36:08)
Communicating about data (38:36)
The next phase of tooling (41:49)
Advice for aspiring managers (45:21)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, one platform for all your customer data pipelines.
Learn more at Rudderstack.com.
And don't forget, we're hiring for all sorts of roles.
Exciting news! We are going to do another Data Stack Show livestream. That's where we record the episode live, and you can join us and ask
your questions. Kostas and I have been talking a ton about reverse ETL and getting data out of
the warehouse. So we had Brooks go round up the brightest minds
building reverse ETL products in the industry. We have people from Census, High Touch, Workato,
and we're super excited to talk to them. You should go to datastackshow.com
slash live. That's datastackshow.com slash live and register. We'll send you a link.
You can join us and make sure to bring your questions. Welcome to the Data Stack Show. We have Sean Halliburton on the show today,
and he has a fascinating career. Number one, he started as a front-end engineer,
which I think is an interesting career path into data engineering, and he is doing some really cool stuff.
Costas, there are two big topics on my mind, and I'm just going to go ahead and warn you.
I'm going to monopolize the first part of the conversation because I'm so interested in these
two topics. So Sean did a ton of work at a huge retailer on testing. So testing and optimization.
And I just know from experience, there's data pain
all over testing because testing tools create silos, et cetera. And so he ran programs at a
massive scale. So I want to hear how he dealt with that because my guess is that he did.
The second one, I guess we're only supposed to pick one, but I'll break the rules,
is on clickstream data. So he also managed infrastructure around clickstream data and sort of made this migration to real
time, which is also fascinating and something that we don't talk a whole lot about on the
show.
And so I just, I can't wait to hear what he did and how he did it.
Yeah, 100%.
For me, I mean, he's a person that has worked for a long time in this space, and he has experienced engineering from many different sides.
So he hasn't been just a data engineer. He has been, as you said, a front-end engineer, and at some point, after doing different things, ended up becoming a data engineer. So I want to understand and learn from him: how do you deal with the everyday
reality of maintaining an infrastructure around data?
How do you figure out when things go wrong?
What does that mean?
How do you build, let's say, the right intuition to figure out when we should act immediately and when not?
And most importantly, how do you communicate that among different stakeholders,
not just the engineering team, but everyone else who is involved in the company?
Because at the end, the customers of data engineering are always internal, right?
You deliver something, which is data that someone else needs to do their job, right?
So I'd love to hear from him, especially because he has been in such big organizations, what
kind of issues he has experienced and get some advice from him.
Absolutely.
Well, let's dig in and talk with Sean.
Yeah, let's do it.
Sean, welcome to the Data Stack Show. We're excited to chat about all sorts of things,
in particular, sort of clickstream data and real-time stuff. So welcome.
Thank you. I'm super stoked to be here.
Okay. So give us, you have a super interesting history, background as an actual engineer,
software engineer. Could you just give us the abbreviated history of where you
started and what you're doing today? Yeah. So I come at this gig from a lot of different
directions. I was actually an English major in college. Before that, I was a music major.
That's one of my favorite things about what we do: it takes a lot of different disciplines,
and those disciplines come in handy at a lot of different times. I've also been an individual contributor and I've been an engineering
manager. And along with that, I've worn different hats doing program management, doing product
management at different times as the need has been there. So I started out in the front end. And so there's another
angle that I think is unusual in this field, but I'm also self-taught. And when I first started
15 plus years ago, it was really easy to just dig into the front end of building your own site,
spit out some static HTML, and then slowly enhance
it with progressive JavaScript. And then some of the site templating engines and WordPress
started to come in. And more and more people started tweaking their MySpace profiles and
things like that. So I learned how to build data-driven websites and started specializing professionally in
data-driven lead generation and optimizing landing page flows. I worked with University
of Phoenix's optimization team for several years and really learned a lot about form flows and not only optimizing those pages to try to best reach the user and keep them
engaged to convert and get more information, but also to optimize the data that came out of them
that would go into the backend and power so many things behind the scenes. I went from there to,
I served about six to seven years at Nordstrom as both an IC and engineering manager
and really built out a program around optimization and then expanded into clickstream data
engineering, and over time got addicted to replacing expensive enterprise third-party
SaaS solutions with open source-based solutions deployed to the cloud,
which was still relatively new in the space at the time. And that's kind of where I'm at today
with CNN as a staff data engineer. And we've worked with a number of tools, some we love,
some that we thought could be better. And where we see opportunities to improve using
open source tools, we have a highly capable team to do that. But interestingly enough,
over the last two to three years, I think the pace of the greater community has been such,
and some of the key tools like commercial databases have improved so much that I've come back around a little bit
and embraced SaaS tools where it makes sense to for things like reverse ETL, analytics,
and data quality, basically post data warehouse.
Interesting. Okay, Kostas, I know you have a ton of questions, but Sean, you were at the tip of the spear in terms of testing and optimization, and getting it right, you know, moving something a point of a percentage, can mean huge amounts of revenue.
But you come at it from the data side as well. And at least in my experience with testing, there's this
challenge of sort of the localized testing results in whatever testing tool you're using, right? So
you get a statistically significant result that this landing page is better, or this button is
better, this layout, or, you know, all that sort of stuff, which is great, because like,
math on multivariate testing is, you know, pretty complex, but it's hard to
tie that data to the rest of the business. Did you experience that?
Yeah. So I have this saying that people drive software, software drives people, and the tools you use have to meet the state of your program at the time and, conversely, are influenced by them.
When it comes to optimization, you know, everyone starts with the basics,
testing different headlines, different banners, maybe different fonts,
and you kind of mature into, you might be running a handful of tests per month.
You get a little bit more experienced, more savvy, more strategic.
Maybe you level up to a better testing platform and hire more analysts that can handle the output.
And now you're running maybe a couple of dozen tests per month and testing custom flows and
things like that. But there's still a limit as long as you're using a dedicated
optimization platform. That certainly was the case for us at Nordstrom.
We would generate analyses out of, we were using Maximizer at the time,
but those analyses were reporting things like sales conversions using potentially different methods from our enterprise
clickstream solution, which was IBM CoreMetrics at the time, based off of two completely different
databases, both of which were black boxes. Of course, a vendor can only convey so much about
the logic that they're running in their own ETL on the back end. And as the technical team
around these practices itself matures, it becomes more and more difficult to explain some of those
results. At the same time, the more testing you do, the more data you naturally want to capture
around those tests. So your analysts want to know more, and their questions increasingly overlap with those being asked by your product owners that are
analyzing your wider clickstream data. So I don't think it was any coincidence that we began to look
for alternatives to both of these solutions for us. And we landed on a couple of open source
options. One was Planout, which was a static Facebook library at the time.
And we developed that into a service designed to be hosted internally from AWS
and scale up to meet hundreds of thousands of requests at a time.
And on the Clickstream side, we planned and designed to scale up to handle more experiment results directly into
the clickstream pipeline. And we migrated from core metrics to Snowplow. We leveraged the open
source versions of each one and put a lot of work into making them more robust and scalable. And
over a couple of years, those two practices, I would say,
really did become one. So what I'm hearing is you essentially sort of eschewed the third-party
SaaS testing infrastructure and clickstream data infrastructure and said, we want it all
on our own infrastructure. So then you had access to all the data, right?
So for analysts and results, it's like,
we can do whatever we want.
Yeah.
So this was in the early teens
and AWS itself was really still
kind of in that early explosion phase
where more highly capable
and agile engineering teams were clamoring to
get into the cloud. I mean, just the difference between working with our legacy Teradata
databases on-prem and spinning up a Redshift cluster, I didn't need to ask anyone to spin
up that Redshift cluster. I didn't need to ask anyone to resize it or anything.
My Clickstream team was able to tag our own events in the front end.
Ironically, we tagged some Clickstream events using our Maximizer Optimization JavaScript injection engine.
And we could land the results into our own data warehouse, into our own real-time pipeline within hours.
We hacked away at this over a weekend and came back the next weekend and were so energized and really relieved because the right tools can have that kind of impact on your quality of life.
It became equally important over time to engineer the limits around those capabilities, though, as well.
So that was one of the more interesting learnings that we had.
The more power you find you have, suddenly the challenge becomes when to say no to things and when to put artificial limits on those powers.
Yes, we have access to real-time data now,
but here's the thing. If we copy that data out in 10 parallel streams, we could have 10 copies
of the data. If we produce a real-time dashboard of this metric or that metric, we have to make
sure that that metric aligns with the definition of other metrics that
a product owner might have access to that we don't even know about going in. Could you give just
maybe one, and you did a little bit, but just like a specific example of, well, and stepping back just
a little bit, access to data creates more appetite for data, right? You know, it's kind of like you
get a report and then it creates additional questions, right? Which, you know, sort of,
you know, creates additional reports. But could you give an example of like maybe a specific
example, if you can, of a situation where it was like, oh, this is cool, we should do this.
But then the conclusion was, well, just because we can doesn't mean we should.
Yes, absolutely. So again, to try to put a time frame around this, I would say this was when we had a pretty large scale Lambda architecture between our batch data ETL side, which was our
primary reporting engine, and the real-time side, as I briefly described, and that pivoted exclusively on Kinesis.
Well, Kinesis is, it's an outstanding service.
It really is.
I love it.
It's similarly easy to provision a stream.
It's like managed Kafka.
I'm sure it's not exactly Kafka under the hood,
but the principles are the same.
And it's almost too easy to get up and running with. It's also easy to begin overpowering yourself with. We started landing
so much data that scaling up a stream became, I would say, to put it nicely, excessive overhead. It could take hours. It was an operation that should not be
done in an emergency scaling situation. And it kind of relates back to one of the fundamental
principles of data that I don't think we talk enough about really, and that's data has a definite shape.
You could describe it in 2D terms or even 3D terms.
For the purposes of this example, I would describe it in 2D terms and just say, it's easy to consider the volume of events that you're taking in.
It's easy to describe those in terms of requests per seconds
or events per second flowing through your pipe.
It's easy to forget that those events
have their own depth to them or their own width,
however you want to describe it.
The more you try to shove into that payload,
you can create exponential effects for yourself downstream
that are easy to overlook.
And in our case at Nordstrom, we made a fundamental shift at one point to basically go from a base unit of page views down to a base unit of content impressions.
So think of it as like going from molecular to atomic.
And that's essentially what we did. And we took in a flood of new data into the pipe that we
didn't have before in a short amount of time. And also remember Kinesis only recently developed auto scaling capabilities.
So solutions to that scaling problem were really homegrown until very recently.
So I think that's an already classic example of be careful of what you wish for and know that you have some very powerful weapons at your disposal.
Just stop to think about, as you said,
okay, we can do it, but should we? What is the value of all that additional data? I would suggest
to not only engineering managers, but product managers, be very deliberate about the value
you anticipate getting out of that additional data, because it costs money, whether it's
in storage, in compute, or in transit in between.
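To put rough numbers on that idea of payload "width" multiplying everything downstream, here is an illustrative sketch. The event rates and payload sizes are invented, not Nordstrom's actual figures; the only real constants assumed are the standard Kinesis provisioned-mode write limits of roughly 1 MiB per second and 1,000 records per second per shard.

```python
# Back-of-envelope math for the "shape" of event data: the same traffic gets
# much more expensive as the base unit gets finer-grained and the payload wider.
# Numbers below are illustrative, not real figures from the conversation.
import math

SHARD_BYTES_PER_SEC = 1_048_576   # Kinesis write limit: ~1 MiB/s per shard
SHARD_RECORDS_PER_SEC = 1_000     # Kinesis write limit: 1,000 records/s per shard

def shards_needed(events_per_sec: float, payload_bytes: int) -> int:
    """Minimum shards to absorb a steady stream of events of a given size."""
    by_bytes = events_per_sec * payload_bytes / SHARD_BYTES_PER_SEC
    by_records = events_per_sec / SHARD_RECORDS_PER_SEC
    return math.ceil(max(by_bytes, by_records))

def daily_gb(events_per_sec: float, payload_bytes: int) -> float:
    """Raw volume landed per day, before any compression or enrichment."""
    return events_per_sec * payload_bytes * 86_400 / 1e9

scenarios = [
    ("page views, 2 KB payload", 5_000, 2_048),
    ("content impressions, 2 KB payload", 50_000, 2_048),
    ("content impressions, 10 KB payload", 50_000, 10_240),
]
for label, eps, size in scenarios:
    print(f"{label}: {shards_needed(eps, size)} shards, ~{daily_gb(eps, size):,.0f} GB/day")
```

Going from page views to content impressions, and then widening each payload, multiplies both the shard count and the daily landed volume, which is exactly the cost Sean suggests weighing before adding more to the event.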
Yeah, that's a very, very interesting point, Sean.
And I think one of the more difficult things to do, and that many people don't do at all, is to sit down and consider when more data doesn't mean more signal but actually adds noise.
And that's something that I don't think we discuss enough yet.
Maybe we will, now that people are more into things like quality and metrics repositories and are also trying to optimize these parts.
But you talked about dimensions, and from what I understood, the increase of dimensions has an impact on the volume, right? And you talked about scalability issues and how to auto scale and all these things. What other dimensions does data have, and what other, let's say, parameters of the
data platform do they affect, outside of volume and scalability?
Sure, sure. So I think
a good example is kind of where we started with this discussion of layering experiment data onto clickstream data.
Or it may be a case where a product manager wants a custom context around,
you know, say you have a mobile app that loads a web view and suddenly you're crossing in between platforms,
but the product manager wants to know what's happening in each context.
And so you may have a variant type.
You may have a JSON blob embedded into your larger event payload
that quickly bloats in size.
Or here's another example.
In an attempt to simplify instrumentation at Nordstrom,
we attempted to capture the entire React state, shared through the clickstream pipeline, so we could have as much data as we could possibly use, which was super powerful but, again, could be too much.
When I'm debugging in the front end, I tend not to use an extension, even though there are a couple of different Snowplow debuggers that give you sort of a clean text representation. I try to watch the payloads as they normally flow through the browser and to the collector, and to keep my fingers
on the raw data so that I don't forget what's being sent through, and to ask from time to time,
you know, as part of your debugging routine, what is the value of this data?
Okay. You want to capture hover events. How much intent do you expect to get
out of a hover event? How will you be able to tell what's coincidental versus what is purely
intentional?
That's a great point. And how do you, I mean, from your experience, because as you describe this, I cannot stop thinking of how it feels to be on the other side, where you have to consume this data, do analytics, and maintain the schema in your data warehouse, all these things. How much is this part of the data stack affected by the decisions that happen, let's say, on the instrumentation side?
Because at the end, okay, adding another line of code there to capture another event, it's not that bad, right? It's not something that's going to hurt that much. But what's the impact that it has on the other side? How much more difficult does it make working with the data?
That's definitely a
big piece of the puzzle and a big challenge.
And that's kind of where you
verge into API design,
right? And you spend enough time in software engineering and you realize the challenges
of API design. It's tricky. It's tricky to get the contract right in such a way that you can
adapt it later without forcing constant breaking changes. And because those
breaking changes will not only break your partners upstream, but they'll break your pipeline. And
if they break your pipeline, you've broken your consumers downstream.
And I've always worked at places that were wonderfully democratic. But by the same token,
you end up being your own evangelist
because you are constantly pitching your product internally
for consumers to use.
They don't have to use your product.
Any VP, any director can go out
and purchase their own solution
if they really want to, generally speaking.
There are always exceptions, of course.
And none of that is to denigrate any way of doing it or, you know, any leader that I've
learned from in any way.
That's just the nature of our business.
So, and things move so quickly, you know, especially over the last two to three years.
So I apologize, this is a tangent, but I wanted to highlight one of the things that I think has really accelerated tool development on the fringes in the data landscape.
You know, we've all seen the huge poster with the sprawling options for every which way you could come at data. But I think data warehouses
in general were a big blocker for a number of years. And initially Redshift was the big lead,
right? And then BigQuery right on the heels of that. And then I think you hit a wall with
Redshift and it stagnated for a few years until Snowflake came along. Now we are a Snowflake shop, so I can
praise it directly. And we've been very happy with it as a third party solution. And we've also
touched on, you know, Lambda architectures and some of the difficulties of those. And I think
a lot of the talk of Kappa versus lambda has been put on the back burner because it's kind of been obfuscated away with advances in piping data into your data warehouse.
We're a heavy Snowpipe user.
And if you had come to me a couple of years ago and said, well, can we have both? I would have said not necessarily, but now we kind of
hand wave the problem away because, essentially, it's sort of like using Firehose
in the AWS landscape, but we can pipe our data from the front end into our data warehouse in under a minute now. So why keep a Lambda architecture around,
but also I don't feel like we need to obsess
about a Kappa architecture either.
You said something that I found very, very interesting.
You talked about APIs and contracts,
and I want to ask you,
what's the equivalent of an API contract in the data world? What can we as data engineers use to communicate, let's say, the same things that we do with a contract between APIs? If there is something, I don't know, maybe there isn't, and if there isn't, why don't we have something? Sounds like...
Yeah.
So like the optimization analogy,
I think it depends on the maturity of your data engineering team.
And it's probably more typical for a data engineering team
in its younger years to handle all the instrumentation responsibilities.
But at some point, product owners and executives are going to want some options for self-service.
And when that happens, you have a couple of different, I think you have two primary approaches.
And one is a service-oriented architecture, which was my initial approach
and answer to that question, where we provided an endpoint and a contract for logging data,
just like so many other logging solutions and other APIs. And that worked well, I would say, for not quite a year before we started hitting walls on that.
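One concrete way to picture that "endpoint plus contract" approach is a logging service that publishes a schema and rejects events that do not match it. This is a rough sketch only: the field names and schema are hypothetical, and the Python jsonschema package is used here just to show the shape of the idea, not because it is what Sean's team used.

```python
# A minimal sketch of a published contract for a logging endpoint.
# The schema and field names are hypothetical examples.
from jsonschema import Draft7Validator

EVENT_CONTRACT_V1 = {
    "type": "object",
    "required": ["event_name", "timestamp", "session_id"],
    "properties": {
        "event_name": {"type": "string", "maxLength": 128},
        "timestamp": {"type": "string"},
        "session_id": {"type": "string"},
        # One bounded escape hatch for custom data instead of arbitrary keys.
        "custom": {"type": "object", "maxProperties": 20},
    },
    "additionalProperties": False,
}

validator = Draft7Validator(EVENT_CONTRACT_V1)

def violations(event: dict) -> list:
    """Return contract violations; an empty list means the event is accepted."""
    return [err.message for err in validator.iter_errors(event)]

bad_event = {"event_name": "page_view", "surprise_field": 42}
print(violations(bad_event))  # missing fields plus an unexpected property
```

Evolving the contract then means publishing a v2 alongside v1 rather than silently changing v1, which is the breaking-change problem Sean warns about.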
I think longer term, the better solution, which we have now at CNN, and I think is a major asset, is we offer SDKs to front our data pipeline.
And our primary client is web,
but we're increasingly expanding
into the mobile SDK space.
So that alone is a challenge
because the more languages
you want to offer an SDK in,
you need developers
that are proficient
in those languages, of course.
But for where we're at right now,
between CNN web and mobile and increasingly CNN+, our JavaScript and Swift SDKs meet our needs.
And I think that is a good compromise.
It's a more flexible one, especially if you're able to serve your SDK via CDN, then you can publish updates and fixes and patches
and new features whenever you need
in a much more healthy manner.
And force fewer upgrades on those downstream teams and, by extension, their end users.
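Sean's SDKs are JavaScript and Swift; purely for consistency with the other sketches here, this is a hypothetical Python rendering of the pattern, in which the SDK owns payload construction and versioning so producing teams call a stable function instead of hand-building events. Transport details like batching and retries are omitted.

```python
# Hypothetical sketch of an SDK-style "stock function": callers pass a few
# arguments and the SDK owns the payload shape, defaults, and version stamp.
import time
import uuid
from typing import Optional

SDK_VERSION = "1.4.2"  # bumped by the data team and shipped centrally, e.g. via CDN

def track(event_name: str, session_id: str, custom: Optional[dict] = None) -> dict:
    """Construct a well-formed event so producers never hand-build payloads."""
    custom = custom or {}
    if len(custom) > 20:
        raise ValueError("custom properties are capped at 20 keys by the contract")
    return {
        "event_name": event_name,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "session_id": session_id,
        "event_id": str(uuid.uuid4()),
        "sdk_version": SDK_VERSION,
        "custom": custom,  # the bounded escape hatch discussed below
    }

# Example call from an application team:
event = track("video_start", session_id="abc-123", custom={"player": "web"})
```

Because the data team ships the function itself, fixes and new fields roll out centrally without forcing every producing team to upgrade in lockstep.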
How restrictive are these SDKs for the developer?
Do they enforce, for example, specific types?
Do you reach the point where there are specific types that they have to follow?
Or can they do whatever they want and send whatever data they want at the end, right?
Because if that is the case, again, the contract can be broken.
So what kind of balance have you figured out there?
Yeah, now we're into kind of the fine tuning knobs
of self-service, right?
And verging into data governance now.
So we've provided an SDK.
We've provided these stock functions
that construct the event payloads.
But yeah, there's always some edge case. There's always some custom
request where we want to be able to pass data this way under this name that the SDK does not
allow for. Or maybe there's some quirk of your legacy CMS where it outputs data already in some way. If only we could shove it in and shim it into that payload.
So, yeah, we absolutely, there's a line we walk, there's a balance we try to strike of self-service
where we can offer this one custom carve-out space where you can pass a JSON blob,
ideally with some limits. It's probably an
arbitrary honor system arrangement, but we'll take your data into the data warehouse, but it'll still
be in JSON or, okay, we can offer custom enrichment of that data. Once in the data warehouse, we'll
model it for you for a set period of time. And then past that point, either the instrumentation
has to change, or we just have to figure something else out that works for both sides. Yeah, that's
a great question. It's always a challenge between where does the labor fall? Whose responsibility
is that? Whose ownership is that? And governance is a challenge in so many aspects of life these days, and data engineering and end users and analytics are no exception to that.
100%, I totally agree with you. And I think it's one of those problems where, as the space, the industry, is maturing, we'll see more and more solutions around it, and probably also some kind of standardization, like we've seen with things like dbt. But
from what I understand, and that's also my experience, issues are inevitable, right? Like, something will break at some point. And the problem with data in general is that it can break in a way where it's not obvious that something is going wrong. You can have, let's say, duplicate events, for example, or you might have data reliability issues, right? Your pipeline is still there. It still runs, and outside of seeing something out of the ordinary in the volume of the data or something like that, you can't really know if you are still sending the right data, right? So how do you deal with that? What kind of mechanisms have you figured out? Because you have a very long career in this space. What do you do?
Yeah, I'm smiling because this always reminds me of the line from Shakespeare in Love where the producer is seeing this madness going on in the theater and asking how we are ever going to pull this off, and I believe it's Geoffrey Rush who says, nobody knows, but it always works, we'll work it out, we'll figure it out. And that was definitely the case in data engineering until very recently. We were flying blind. We were. There was little to no observability into the data itself. You could see whether your servers were running and whether they were maxing out on CPU, and that could take you deep down the rabbit hole of JVM optimization. But really describing the data behind your data was surprisingly hard, especially when it came to describing how ETL was performing. That was really hard for me, both as an engineer and as a manager
responsible for representing my program and my team and the culture and engineers that I cared
very deeply about continuing to grow. And just in terms of maintaining their quality of life, there were some
downright stressful times, there was definitely burnout on the team. And so again, people drive
software drives people. And I knew we could do better at the time, I very much wanted tools to
do that. And I'm happy to say, just in the last six months, as an IC again, back
at CNN, I've been focusing on data quality and observability quite a bit.
I've been testing different solutions.
Recently, I've been working with VData and Monte Carlo as observability solutions.
Again, I think having a more dynamic data warehouse like Snowflake
helps unlock a lot. And I've been working on, simply put, data quality algorithms that can not
only tell us how we're doing, but better define, illustrate and advertise our SLAs and SLOs to our partners
and tell them how we're doing with some real numbers.
Oh, that's super interesting.
Can you share a little bit more around that?
Sure.
So I believe Eric mentioned dbt back a while ago.
I'm a very proud and happy dbt user. We've worked with them extensively to harden our data stack, and I'm using it to capture things like presence or absence of critical fields in our enriched tables, and to capture latency of records as measured from when they land in our raw data tables versus when they reach enrichment and our data marts.
And I'm beginning, as I said, to develop an algorithm: starting from a certain baseline, if, say, a record is missing a critical user ID, I might subtract a tenth of a point, or two tenths of a point, depending on how critical that ID is. Maybe it's a second-tier ID and it's not as important, and the record is otherwise usable. Maybe it's not usable. I still want to send the record on downstream, but with that metric attached to it. And then you calculate that on a record and then a table basis, and you can begin to calculate a daily average, a monthly average, and start to build a scorecard.
One of the biggest assets that I think our analytics board at Nordstrom had
was a fitness function. I think that term, maybe it's a little Amazon or
Microsoft centric, and maybe it's fallen out of favor a little bit, but it's sort of an assessment
of your program's technical capabilities and the impact you have on the business.
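Stepping outside the conversation for a moment, here is a rough sketch of the kind of record-level scoring Sean describes above. The field weights, latency threshold, and sample records are invented for illustration; in practice the underlying checks he mentions live in dbt against the raw and enriched tables.

```python
# Illustrative record-level quality scoring: start at 1.0, subtract a weighted
# penalty per missing or late field, then roll the scores up to a daily average
# that can feed a scorecard and SLO reporting. All weights are made up.
from statistics import mean

FIELD_PENALTIES = {
    "user_id": 0.2,      # first-tier identifier
    "session_id": 0.1,   # second-tier identifier
    "page_url": 0.05,
}
LATENCY_SLO_SECONDS = 300   # raw -> enriched within 5 minutes
LATENCY_PENALTY = 0.1

def score_record(record: dict) -> float:
    score = 1.0
    for field, penalty in FIELD_PENALTIES.items():
        if not record.get(field):
            score -= penalty
    if record.get("enriched_at", 0) - record.get("landed_at", 0) > LATENCY_SLO_SECONDS:
        score -= LATENCY_PENALTY
    return max(score, 0.0)

def daily_average(records: list) -> float:
    """Table-level daily average; trend this over time for the scorecard."""
    return mean(score_record(r) for r in records) if records else 1.0

sample = [
    {"user_id": "u1", "session_id": "s1", "page_url": "/", "landed_at": 0, "enriched_at": 60},
    {"user_id": None, "session_id": "s2", "page_url": "/a", "landed_at": 0, "enriched_at": 900},
]
print(daily_average(sample))  # 0.85: one clean record, one missing an ID and arriving late
```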
And when you work in analytics, that can actually be hard to do. But we were able to extrapolate a lot of performance metrics out of the test campaigns that we would run, out of the clickstream features that we would ship. I think that's actually more critical to you as the IC or the manager than it even is to your team's executives, because it gives you one more measurable to assess your performance against your OKRs.
That's super interesting. And okay, let's say we have in place amazing algorithms that help us monitor and figure out if something goes wrong.
Okay. And let's say something goes wrong. How do we debug data?
I mean, as software engineers, we know that we have tools to debug our software, right?
Like we have debuggers, we have different tools that we can use for that we have testing we have many different things that
they are both like let's say tools and also like engineering processes and best practices that we
have learned that like they help like reduce the risk of something breaking how do you how do you
debug like something that starts from the client of a mobile app and reaches at some point your data warehouse and anything can go wrong
like between their line between like these two points right so how do you do that yeah that's
another balance you have to strike between how much work do you want to put into making your
synthetic tests appear as organic as possible, right?
We've tested our pipeline using tools like serverless artillery to the point that we can accept hundreds of thousands of requests per second.
I mean, think about the news industry in general.
Yeah, there are planned events like elections, but there are also unplanned events throughout the year as you go that can drive everyone to their phones and their laptops
and can be extremely unexpected. And we need to prepare for that. So we've used the output of
those tests to beef up things like cross-region failover pipelines,
things like that. But even then you're operating on an assumption where you're probably using a
fixed set of dummy events. So then you have to decide, okay, is it worth dedicating time maybe
to pull in some developers from the mobile team to more accurately simulate
how a user uses the app as they understand it. Keep in mind, their assumptions may be based on
your own assumptions that are coming through your pipeline based on the data that you're
serving to their analysts. But yeah, you could certainly go down a rabbit hole and put a lot
of work into automating tests from different platforms, build a device farm even. It's just a matter of how far down that hole you want to go and how much you want to invest.
Makes total sense. And my last question, and then I'll give the microphone back to Eric: you mentioned at some point the more senior people that are involved in this, like VPs, the leadership of the company, probably people that maybe don't even have a technical background, who cannot understand what delivery semantics are or the limitations of the technology that we have and all these things.
And at the same time, you said that like it's very important to make sure that you can communicate these things to them.
Do you have some advice to give around that?
Like how you can communicate effectively to your leadership team the limits of the data, how much they can trust it, and the limits of the technology and the people that we have?
Sure.
So that is a primary function of the engineering manager, naturally, but no engineering manager can do that alone. So my advice to those considering management is, before you accept such an opportunity, and it may be a fantastic opportunity, do all you can to ensure that you have backup in place. Insist on program management help, because you as an EM are busy managing not only the careers, but the lives and even the mental health of the technical talent that you worked hard to get in the door. And you can't be in every meeting.
You can't be in every scenario. You can't cover every hour. You'll burn yourself out if you try.
Just like I would also recommend insisting on a product manager, because there are already 100 different technical ways you could take a product, and 99 of those might not meet the actual demand of your internal users downstream.
And, you know, in data engineering, we talk mostly about internal users as our customers, but I do believe that extends to the external end user as well.
But similarly, there's only so much you can do as an engineering manager to evangelize for the system you're building and to canvas your users on how it could be better and how you could
better serve them. So those are really what I call the minimum three legs to the stool
that you need to build an effective data
engineering team and to meet those requests that are only going to mature as
your consumer teams mature and as the business matures around you.
Thank you. We got, I think,
some amazing ideas and advice from you. Eric,
what are your questions?
I have so many actually that we don't have time to get to because Brooks is telling us we're close to the buzzer. The SDK conversation is absolutely fascinating. So I'd love to have
you back on to talk more about that. I have two more questions for you, both kind of quick,
I think. Maybe not. Maybe I should stop saying that because they usually are. You have a really wide purview as sort of a buyer, user, researcher of tools.
You have a bias towards open source tooling, but you also use sort of third-party SaaS
that isn't open source. What are some of the tools, you know, whether you use them or not,
that you've seen in the data and data engineering space that are really exciting to you that you
think sort of represents like, okay, this is kind of the next phase of tooling that's going to be
added to the stack? Yeah, again, first and foremost, I would say the data observability and data quality tools, just because, again, they have such a direct impact on the quality of life for your engineers, a need that went unmet for so long. And I'm not even exactly sure why that is,
but I'm very happy to see that we've brought principles of site reliability engineering
into the data engineering space. It's like everything went into the cloud and all of
your sysadmins became SREs. And now a lot of those SREs
are starting to look toward the data space
or the data space is starting to look to those SREs
to say, hey, can you help us out
and make sure that this thing stays up?
Because if the data is gone, it's gone.
There's no way to get it back.
Yep.
So that's one thing I'm excited about.
I'm also, I would say,
carefully excited about machine learning. I think ML, like blockchain, is one of those things where it's easy to say, oh, we should be doing this, without thinking carefully about the value you want to capture. I would again suggest thinking carefully before you apply it, and about ethics, because I think that is critical to the process.
And that may be, you know, once data observability improves, I hope ethics improvements are right on
the heels of that. But executing ML models is still fairly complex, but I think that will
improve over the next couple of years and become more closely integrated with the data stack. And I think in terms of applications of all
of this, of ML and personalization, I think I'm most excited about the health space, which I have
not worked in personally, but I think it has the biggest impact simply because you have the greatest diversity of end users there.
And it's one of the most complex problem spaces, obviously.
And with our health system as it is, it's tempting to try to make an end around to try to deliver some of those solutions that kind of break down the silos.
So I hope we can continue to do that in a responsible way as well.
Very cool.
I am 100% aligned with you on all of that.
And actually, I've been writing a post that hits on some of those direct points.
Okay, last question before we're at the finish line.
You've been an individual contributor, an engineering
manager, and you've worked in a variety of roles. Maybe give your top two pieces of advice: maybe
you could give a piece of advice to an individual contributor who sort of aspires to
be a manager and then maybe give a piece of advice to a manager who's early in their career, you know, working with an engineering team,
data engineering team? Okay. So I would say for the IC that is considering management,
I would say it is absolutely a very rewarding career change, but it is a big change. And the role, it's not as simple as just
speaking as the most technically mature member. You know, frequently, the most senior
IC becomes the manager out of necessity, and it looks on paper like it's a very natural
transition. But that's not necessarily the case.
It's a very different skill set.
You do need to be able to speak to what's being built and delivered.
But you are coming at a complex technical system
that's hosted somewhere and rendered somewhere else. Come with humility, be prepared to listen,
ask more questions than you make pronouncements. And I think that's a good transition to the
person that is already a new manager is, again, expect to do much more listening,
ask tough questions when the time warrants it. But you're coming to learn from the people that you've retained or recruited to work for you,
leverage them to be your experts.
Don't try to be the smartest person in the room.
You are there to hire the smartest people in the room and to be able to send them
into rooms that you can't reach because you're overbooked and you will be overbooked.
That my friend is some of the best practical advice for management I've ever heard. And is so true. Well, Sean, this has been such a fun episode. I've learned a ton and we just thank
you so much for your time and sharing all your thoughts and wisdom. Absolutely. I'm happy to help anytime.
I feel like we could have talked for hours and that's such a cliche saying now, because we say
that every time and really it's true. I'm going to pick something really specific as my takeaway.
Whenever you hear about a data engineering team building their own SDKs, to me, that's an eye-opener because,
you know, I don't come from a software engineering background, but I know enough to know
that's a pretty heavy duty project to take on at the scale that they're running at, you know,
a company like CNN with, you know, traffic volumes that they have. I mean, building a robust SDK is
no joke. But the more I thought about that after Sean said it, I just kind of reviewed my, you know,
mental Rolodex of hearing that. And I realized, you know, it's really not the first time that
I've heard of a large enterprise organization building their own SDK infrastructure in large part because the
needs that they have to serve for downstream consumers, to Sean's point, is so complex.
And so even if you take something off the shelf and modify it, you end up with something that's
pretty different than the original SDK that you had anyway.
So that's just fascinating to me.
And it's pretty fascinating also, I think, to just consider a situation where building
your own SDK is the right solution.
Yeah, I totally agree with you.
I would say that I keep two things from the conversation we had with him.
One is the concept of contract that comes from
building APIs. I think it's a very interesting way of thinking and building also data contracts
and what data contract would look like or how it can be implemented and what we can learn
from building these services all these years and use this knowledge like also in the data space.
That's one thing and the other
thing is i think that by the end he gave some amazing advice on how to be a manager which is
i think it was super super valuable for anyone who is interested in both becoming a manager but
also interacting with managers, which is pretty much everyone, right? So that was also amazing.
Yeah, I agree.
And I think, you know, I mean, he said that in the context of data teams specifically,
but really just great advice in general.
So really appreciate that.
Yeah.
All right.
Well, thank you for joining us on the Data Stack Show.
Tune in for the next one.
Lots of great shows coming up.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe
on your favorite podcast app
to get notified about new episodes
every week.
We'd also love your feedback.
You can email me, Eric Dodds,
at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you
by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.