The Data Stack Show - 23: Migrating from On-Premises to the Cloud with Alex Lancaster from Intuit
Episode Date: February 3, 2021

On this week’s episode of The Data Stack Show, Kostas and Eric are joined by the risk data engineering manager at Intuit, Alex Lancaster. Alex has been with Intuit, known for products like QuickBooks, TurboTax, Mint, and more, for 15 years and was part of a recent massive and successful re-architecting from on-prem to the cloud.

Highlights from this week’s episode include:

Alex and his role at Intuit (1:51)
Data marts at Intuit (2:57)
Revolutionary changes in the data engineering space in the past 15 years (6:46)
Security in the cloud vs. on-prem (12:46)
Data architecture at Intuit (15:42)
Doing ETLs inside or outside of the database (19:11)
How to transition successfully from on-prem to cloud: forklifting vs. re-stacking (23:22)
Alex’s application of software engineering skills to data engineering (28:44)
Dealing with data engineering challenges related to security and regulation (31:48)
Pipelines managed and challenges in data types (36:45)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show, where we talk with data engineers, data teams, data scientists,
and the teams and people consuming data products.
I'm Eric Dodds.
And I'm Kostas Pardalis.
Join us each week as we explore the world of data and meet the people shaping it.
Welcome to the Data Stack Show. We have Alex from Intuit on the show as a guest today. And my burning question that I want to ask Alex is, he's been at Intuit
for a really long time, you know, and it's really common, I think, among our guests, you know, they'll
have different roles in different companies, which is really cool. It's just unique to see someone who's
been at a company for well over a decade. And so one of the main questions I want to ask Alex is
what he's seen in that time within an organization. That just gives you a really unique perspective.
Kostas, what's the main question you want to ask Alex? I really, really want to ask him about the migration from on-prem to the cloud,
especially for a company of the complexity and the size of Intuit.
So I'm very, very excited to talk with him today and learn more about this.
Great. Well, let's go and ask our questions.
Let's do it. Welcome back to the
Data Stack Show. We have Alex Lancaster from Intuit. Alex, thank you so much for joining us
on the show today. Sure. Thank you for having me. Now, I'm really excited to chat with you because
I think you're going to bring, I think, a unique perspective. You've spent well over a decade coming up on 15 years at the same
company working in software and data. A lot of the guests we have have been at multiple
different companies over that period of time. And so I'm just really excited to hear about
your perspective having been at the same place over such a period of change too with technology.
So why don't you
start out by giving us just a little bit of background on yourself and talk about what you do
at Intuit. Okay. So my name is Alex Lancaster. I'm the risk data engineering manager at Intuit.
In February, it'll be 15 years there for me. And before that, I worked at United Title Escrow as a software engineer for
four years. And before that, I worked for an MLS company in Simi Valley for almost six years.
And my current work for Intuit is mostly in the data and engineering, data warehousing,
data pipeline space for risk and fraud management for money movement. And we also do some stuff for the compliance folks
and pricing and accounting and finance teams. And we help design all kinds of data marts,
data warehouses, reporting dashboards, things like that. And then our product internally
is known as the risk data mart. And could you explain just for the sake of our listeners,
could you explain the concept of a data mart within Intuit and sort of how that seems like
a product that your team is producing for other people within the company? Could you dig into a
little bit about what a data mart is? Sure. So it's usually a large collection of tables that
have been brought in from different sources,
many different sources.
We probably have 20, 30 different sources that we bring data in from.
Usually these are front-end source systems, and they'll do one little piece of the pie
or piece of the business.
And then when it comes time to understanding the big picture and
people want to do reporting for long periods of time and they want to aggregate data and
roll up data across lots of different functions, you know, you've got to have that all in one place.
So that's mostly what a data mart is about. And also, you know, the data is often transformed or pivoted or flattened, whatever you want to call it, into schemas. And, you know, this is where, like, the Kimball conformed dimensional schema came from, you know, many years ago.
Oh, interesting.
You know, so people, you know, you want to transform the data in a way that makes it work really well for reporting and analytics. And because the source systems that are upstream,
they're usually designed to be fast
at a transactional level.
So you can select, insert, update, delete,
be really fast for one record,
but in a data warehouse,
you're running queries across millions or billions
of records and long periods of time.
And that's a totally different kind of workload than the upstream source
systems do. So that's sort of a summary of what a data mart is.
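To make that workload difference concrete, here is a toy sketch of the kind of cross-source rollup a data mart serves, as opposed to the single-record lookups a transactional system is tuned for. It is only an illustration in pandas; the source systems, columns, and figures are invented, not Intuit's.

```python
import pandas as pd

# Hypothetical transaction-level extracts from two upstream source systems.
payments = pd.DataFrame({
    "txn_date": ["2021-01-03", "2021-01-17", "2021-02-02"],
    "product":  ["payments", "payments", "payments"],
    "amount":   [120.00, 75.50, 300.00],
})
payroll = pd.DataFrame({
    "txn_date": ["2021-01-09", "2021-02-14"],
    "product":  ["payroll", "payroll"],
    "amount":   [2500.00, 2600.00],
})

# A data mart rolls many sources into one conformed, report-friendly shape:
# here, monthly totals by product across both systems.
facts = pd.concat([payments, payroll])
facts["month"] = pd.to_datetime(facts["txn_date"]).dt.to_period("M")
summary = facts.groupby(["month", "product"], as_index=False)["amount"].sum()
print(summary)
```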
Yeah. I mean, it's, I mean, I,
I have a ton of questions and I have one more before I hand it over to Kostas
because I know he's probably chomping at the bit with all the interesting stuff
there. But one observation is that, you know,
we talked to a lot of different people from a lot of different companies working in, in data engineering. And
it's really cool to hear about, I guess I would call it the productization
of delivering data to the company. I mean, even with a name like Datamart, you know,
in much smaller organizations, you know, you basically have the software engineering team
also doing the data engineering work, and then you get a little bit bigger and you have,
you know, maybe a data engineering team and perhaps a data analyst, and then those teams grow,
but it's sort of an individual delivering those things. But it's really neat to hear about how
you've really productized that in a pretty significant and widespread way at Intuit.
Yeah. Yeah. I think it has a lot to do with
the size, right? So when you start out small and you're just supporting a few people or teams,
you can approach it one way. But I have 11 engineers on my team now, and then we're supporting
400 plus people across the enterprise. So when it gets big like that, then the story changes and the way that
you approach things has to change. And also when you're talking about money movement and compliance
and SOX audits and things like that, you have to get more serious about how things are architected
and that sort of thing. Sure. Okay. Well, I'm going to ask my burning question
based on your time at Intuit, and then I'll hand it over to Kostas because
I'm monopolizing the conversation. So almost 15 years at Intuit, congratulations.
You really, I think, have seen sort of what we would call like the data engineering revolution
firsthand with just massive change in technologies, data infrastructure,
the coming of age of the cloud, just all sorts of different, I mean, major, major milestones
in terms of the way that sort of software is delivered and consumed today. So I'd love to
know what are, when you look back over 15 years at Intuit, what are some of the big revolutionary changes that you've seen in
the data engineering space? I think for me, at least the biggest is the move from the on-prem
world into the cloud world. So when you have an on-prem data center, you know, you're maybe you're
using, you know, storage area networks and, you know, you have to worry about your own infrastructure and worry about
storage space and how many nodes do I have and how much space left do I have on my SAN. And maybe if
you have an active, passive or active-active data center situation, you have to worry about
replicating your data across to the other data center. So this world, while it was okay,
and some companies were better at it than others,
it had a lot of problems and drawbacks.
There were always people messing around
with the network infrastructure
and doing patching or updates at weird times
and they may or may not tell you about it.
And as good as companies could get at that,
I don't think they're anywhere near what the big cloud companies are today.
So when you,
when you move to a public cloud and you're in an Azure or AWS situation,
those guys are investing billions every year into their cloud architecture and
infrastructure.
And I'm pretty sure no company, not even a government, can compete with that kind of investment.
And so they're really good at it. And they have designed their cloud environment from the ground up to be very scalable across
the world.
And it lets you get out of the business
of worrying about your hardware and your storage, and, you know, do I have hard drives that are
popping or network cards that are popping, that kind of thing. You don't have to worry about that
anymore, so you can scale in a way that's just impossible to do on-prem. So to me, that's the biggest kind of change I've seen, you know,
in the last, I don't know, 10 years or something like that. So I can talk a little bit about,
you know, what we were doing on-prem versus what we're doing in the cloud today. Just give a quick
summary. Sure. Yeah, that'd be great. And I do think, I mean, I've never heard the comparison of, you know, not even a government can invest that much into the technology. And I think that's just a fascinating comparison. So thanks for that. That got my mind going. But yeah, I would love to hear about your migration from on-prem, you know, we actually had a pretty nice setup. We were using SQL Server Enterprise Edition.
We had a nice, you know, Dell Fiber Channel SAN dedicated to our environment. It was 185 terabytes, which is a pretty good size. And we had two data centers. So we had an identical setup,
about a thousand miles away from each other, with data replication running between the two.
And, you know, that worked well for a while and 185 terabytes
is nothing to sneeze at.
It's mostly row store data though.
SQL Server does have a column store index.
So there were some, you know, columnar tables, which we'll, we'll get into later.
But mostly row store stuff.
And then in September 2017, you know, we started work on the AWS public
cloud migration, and we decided to do a full tech restack at that point.
Not a forklift.
So the difference is when you do a tech restack, you're basically
re-architecting everything you have, changing the products for everything you have.
So we moved away from SQL server to Redshift, for example.
And that took us like 18 months to do that.
So by summer 2019, we were pretty much all in AWS.
And then we were able to turn off our on-prem infrastructure at that point.
And now, you know, we're all in the cloud using the native services there.
So we use things like EMR and Spark clusters and Parquet files and S3 and Redshift, Aurora, Kinesis, you know, MSK,
which is the managed Kafka service, Glue, you know, CloudWatch, things like that. So it's,
that was a huge change for us. And it took us 18 months to do that. So, you know, it was painful,
but it was worth it. And now we're able to support an
environment where we've got around 600 terabytes of columnar compressed storage. So that's, you
know, 10 to one compression ratio right there. So if you tried to take that 600 terabytes and put
it in a row store, you'd end up with like, you know, 6,000 terabytes. So that'd be really hard
to manage on-prem, you know, in some kind of SQL Server or Oracle environment.
But in the cloud, I'm not too concerned about managing 600 terabytes.
And then plus in the cloud, Amazon is managing a lot of data replication for you.
They're doing patching and management stuff for you. So a lot of burden is on them.
And that allows my team to focus just on building application logic and serving our customers. And I don't have to worry nearly as much about what's going on in the data center anymore.
Alex, it's very, very interesting and very exciting for me to have you here today because
you are one of these rare cases of people who have experienced both the on-prem and the cloud solutions.
And it sounds so far that you're pretty excited about the cloud.
And correct me if I'm wrong, but probably you prefer it and you find a lot of benefit in being deployed on the cloud instead of using an on-prem infrastructure.
Many people say that one of the benefits of having on-prem deployment has to do with security and compliance and the control that you have.
What's your opinion about that?
Do you think that this is actually a real concern?
Do you think it is addressed right now by the cloud providers?
Do you think that there's still work to be done there? What's your feeling about it?
I think the security is fine in the cloud. And at Intuit, we have a central security
team. We have a data handling team and they help the various PD teams set up their account
in a certain way. They have Intuitized AMIs. So when you're restacking your
AMIs, they come bundled with all of the security that they want. And we have all the KMS keys and
things like that. That's locking down S3 buckets and encrypting data at rest the way we want.
So I don't see a problem with that. But at the same time, we have a central team
of very smart people that have looked into the details of all this, and they've carefully
architected things to a corporate standard, and we follow that standard. So, you know, to me,
everything works awesome in the cloud, and I would never want to go back to the on-prem way
of doing things. That's great. Is there any kind of advantage that you still think that on-prem has compared to cloud?
If you're small, maybe. I really don't think so. I mean, honestly, I think that the age or the time
of the on-prem data center is quickly evaporating and going away. I don't think if you're a new
company and you're thinking about building infrastructure, to me, it makes no sense to do it on-prem, just build your stuff in the
cloud from the get-go. Maybe there are certain industries or certain weird use cases that I
haven't heard of that you really need some kind of on-prem supercomputer, or maybe you're like a
weather modeling place or something and you need some
crazy supercomputer. But I mean, these days there's so much variety and option in the cloud
to do huge machine learning, huge modeling of data and handling of many, many petabytes of data,
very straightforward. So I just don't see any advantage really to on-prem anymore.
Yeah, makes sense. That's very interesting to hear from you. Going back to the things that
you mentioned a little bit earlier during your conversation with Eric, where you mentioned about
data marts. I mean, data marts in data infrastructure are one of the last steps before
the customer, where the user
of this data is going to consume it through a BI tool or whatever other tools they have.
Can you give us an overview of the architecture that you have today? I mean, the architecture,
the data infrastructure architecture that you have and what kind of paradigm you're
following? Is it something like a data lake,
or is it more built around the data warehouse?
And let's chat a little bit about this, because I think you're going to have a very interesting case.
And you've made a lot of, let's say, very thoughtful decisions around that stuff.
So I think it's going to be very useful for both me and Eric, and also like the people
that are going to listen to the show.
Sure. So we do have a central corporate data lake that is there, and we do pull data from that.
And we also register our transformed files with the central Hive metastore so that it's visible
to other people that use the data lake. But we also have to pull from upstream transactional systems and also streams to get
data in our environment. So, you know, we use EMR clusters to do query-based ingestion from
some places. We use Kinesis streams and MSK to pull data from queues. There's different
latency requirements that we have. So the lake, for example, could be
like a 24-hour kind of latency situation. And then if you try to pull from upstream
transactional databases, maybe you're running many batches and you're pulling every two,
three hours or something from them. And then if you have very low latency situations, you're talking about streams.
So like Kinesis Stream, MSK,
you can get data into your warehouse every 15 minutes,
every 30 minutes, something like that.
So we have all those use cases in play today.
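As a rough sketch of what that query-based batch ingestion can look like in Spark, assuming hypothetical connection details, table, and bucket names rather than Intuit's actual jobs:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("risk-batch-ingest").getOrCreate()

# Pull only the last few hours of changes from an upstream transactional DB.
incremental = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://upstream-host:5432/payments")   # hypothetical source
    .option("dbtable",
            "(SELECT * FROM transactions "
            " WHERE updated_at >= now() - interval '3 hours') AS recent")
    .option("user", "reader")
    .option("password", "***")            # in practice, fetched from a secrets manager
    .option("partitionColumn", "id")      # parallelize the read across executors
    .option("lowerBound", 0)
    .option("upperBound", 100_000_000)
    .option("numPartitions", 8)
    .load()
)

# Land the raw pull in S3 as Parquet, partitioned by ingest date,
# ready for downstream Spark transforms.
(incremental
    .withColumn("ingest_date", F.current_date())
    .write.mode("append")
    .partitionBy("ingest_date")
    .parquet("s3://example-risk-raw/transactions/"))
```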
And I think one of the most important architecture things
for me is do your ETLs outside of the database, right?
So when we were in SQL Server on-prem, you know, we were using SQL Server to do the ETLs and we had lots of stored procedures
and using SSIS and all that. So all that's gone away now. So we use EMR and Spark clusters and
we have several of them and we can scale out our Spark clusters as needed. You know, we can use persistent Spark clusters.
We can use transient Spark clusters as needed.
And also Lambda functions.
When you talk about streaming, you know, Amazon manages the infrastructure for Lambda functions.
And we can handle, you know, hundreds of thousands of messages a minute in that scenario.
And then you, you know, you do your transformations in Spark and so on, and then you
write it back out to S3 for your final summary tables, right? So use Parquet in S3, and you can
partition Parquet files, huge Parquet files, right, that can be dozens of terabytes large
in S3 with no problem. And then you just use this copy command
to load that into Redshift very quickly.
So Redshift has a way to do parallel loads
with Parquet files in S3 very fast.
And your Redshift loads mostly take seconds
or a minute or so.
And then Redshift just becomes like your serving layer
at that point.
So that's sort of the main architecture overview.
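A minimal sketch of that transform-then-COPY pattern, with made-up bucket names, table names, and IAM role; the real pipeline would differ in its details:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("risk-summary-build").getOrCreate()

# Read the raw landed data and aggregate it into a warehouse-friendly summary.
txns = spark.read.parquet("s3://example-risk-raw/transactions/")

daily_summary = (
    txns.groupBy("txn_date", "product")
        .agg(F.sum("amount").alias("total_amount"),
             F.count(F.lit(1)).alias("txn_count"))
)

# Write many moderately sized Parquet files so Redshift can load them in parallel.
daily_summary.repartition(32).write.mode("overwrite").parquet(
    "s3://example-risk-curated/daily_summary/"
)

# Redshift then loads the Parquet files in parallel with a COPY statement,
# typically issued by the orchestration job (table and IAM role are illustrative):
copy_sql = """
COPY risk.daily_summary
FROM 's3://example-risk-curated/daily_summary/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
FORMAT AS PARQUET;
"""
```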
That's very interesting.
Actually, you said something that I'd love to learn more about.
You mentioned that it's better to have your ETL logic
outside of the database, let's say, or the data warehouse,
which is quite interesting because I don't know if you have heard
of all this movement in the market
from going from ETL to ELT, which is more of the paradigm of let's extract the data,
load the data into the data warehouse, and then run any kind of transformations
that we want inside the data warehouse instead of doing it on the fly.
So why do you believe that it's better to have the ETL outside?
And what's the difference?
Like, what were the problems that you had when you were doing the opposite with MS SQL Server?
Okay.
So when you do the ETLs in the database, you are sort of boxed in or limited by that machine,
right?
So if you need to handle some giant ETL job with billions of records, you're running that on your database.
And when the data gets really big, you start to have problems with this approach. So when you
take the ETLs out of the database and you're doing it in EMR with Spark, now if you need a 50 or 100
node Spark cluster for 30 minutes, whatever, to process some 50 plus billion row
ETL, you can do it and it's not going to touch or hurt your database or affect the resources there.
And then you use the, you know, the big data Parquet format in S3 to store your transformation
output, and you can partition the Parquet file, you know, however you want, which is very useful.
And then the COPY command works very well with Redshift to load the data in there.
But at the same time, you can use your Parquet file in S3 to share your data back with
the lake, right?
So what you do is you use a Hive cluster to register that table with a Hive
metastore.
And then the lake becomes aware of your Parquet file sitting in your account.
You don't even have to move the data anywhere.
It's just a metadata entry in there.
And then people can query the lake and see your Parquet file and query it right away.
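A rough sketch of that registration step, assuming a Spark cluster wired to the shared Hive metastore (on EMR this can be the Glue Data Catalog) and invented database, table, and S3 names:

```python
from pyspark.sql import SparkSession

# Registering an external table over Parquet files already sitting in S3 makes
# them queryable by other lake users without moving any data.
spark = (SparkSession.builder
         .appName("register-curated-table")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS risk_shared.risk_events (
        event_id     STRING,
        product      STRING,
        total_amount DOUBLE
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION 's3://example-risk-shared/risk_events/'
""")

# Pick up any partitions already sitting under the S3 prefix.
spark.sql("MSCK REPAIR TABLE risk_shared.risk_events")
```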
So you solve two problems, right?
You're solving the problem of sharing big data sets with data science folks
that want to use your output with SageMaker and their own Spark clusters. And what they really
want to see is Parquet files in S3. And then you solve the data warehouse use case with Redshift,
where people just want to use SQL to query it. And they have things like Tableau and Business Objects and Qlik Sense and so on connected to Redshift, and that works well for them.
You know, it's just the quickest way to handle this scenario.
This is great. I have
another interesting question at least for me. You are describing a very like
modern data architecture that you have deployed on AWS.
That's from what I understand, a pretty recent development, right? I think you said that you
ended the transition to the cloud in 2019 or something. Is this correct?
Yeah. We were finished in summer 2019 and it took us about 18 months to do all of that.
Yeah. So this lake architecture that you have,
did you have any part of this architecture also when you were on-prem, or was the architecture
that you had there for your data infrastructure completely different? So on-prem,
they had the Intuit analytics cloud, the IAC. It was a big Hive cluster, a Hadoop cluster. It was not very good.
It was nowhere near what we have in AWS with the S3 data lake now.
And it was always having like space problems and, you know, throughput
problems and stuff like that.
It just, we just couldn't operate it on the scale that we wanted to.
And the lake really, in my opinion,
the central data lake wasn't truly realized until we got into AWS and we got everything in S3 and everybody put Parquet files in there
and it became like this real usable, powerful thing at that point.
It's very fascinating.
How do you do that?
How do you design this transfer
from this on-prem solution that you already have and you're running and it's operational and it
drives your business and in 18 months you have completely substituted this environment with
something completely new, right? Because it's not just that you are changing your infrastructure.
It's not that you did just that. You re-architected the whole data infrastructure that you have. So
what does it take from an organizational point of view and from the engineering perspective
to do that, how do you do that? I'd love to hear more about how you did it successfully.
So first, you have to make a decision about forklifting versus tech restack. That's a key decision. Personally, I wouldn't recommend
people to forklift what they're doing on-premises into any cloud and then try to duplicate what
you're doing on-prem using virtual machines in the cloud. That's really not what the
cloud is designed for. And you can do it, it's true, but you're not going to get the result and
the value and the benefit from the cloud that you could if you use the native services there.
So we decided to do a full tech restack. We wanted to use all the native services in the cloud and really use the cloud for how it wanted to be used.
And we wanted to get into Spark.
We wanted to use the Redshift MPP, you know, which is a managed service.
And then instead of SQL Server, we use Aurora.
We have a small Aurora database that's also a managed service.
So that's like the start of it. It's that decision forklift
versus tech restack. And then, you know, there's a lot of learning. That 18 months was
painful, you know, and we had a lot of learning and a lot of trial and error on things.
But, you know, we had some architect people to guide us with decisions. We had technical account
managers from Amazon to help guide us with certain decisions. So that helped a lot.
And then, you know, you have to make sure that your manager and his manager and so on
is on board with that.
And, you know, your executive sponsorship is on board with that about what you're doing
and why you're doing it.
So you have to, you know, politically and, you know, program management wise, you have
to communicate a lot about what you're doing and why you're doing it and timelines and so on.
And then you've got to get your customers to come along for that ride at the end and convince them that, you know, you're doing the right thing for the right reasons.
So it's a complicated thing, but at the end, I'm glad we did it this way.
And for us, that tech restack decision was the right one.
I see other teams who did not make that decision. They decided to forklift, and
I see they struggle. They have all kinds of issues from doing that. And I'm just so glad I'm not on
those teams. Yeah, that's, that's amazing. I mean, congrats for successfully doing this project. I mean,
it's for you and the whole team that was involved in this. It's really amazing because it's not just,
I mean, and it's also amazing like from an organizational standpoint, because there's
always resistance to change, and you decided not just to change, but to radically change your
infrastructure and the way that you operate. And that's amazing. And it says something also about the culture in the company.
One last question before I let Eric continue with his questions.
Is there a particular technology that became available to you after you migrated
into the cloud that you are really excited that you are using and it's something that
you consider as like a game changer in your, in your work?
Yeah, I would say the streaming. So being able to use Kinesis streams with Lambda or using
MSK managed services for Kafka, that's pretty huge because now you can get data in your warehouse,
like 15 minute latency, 30 minute latency and handle huge throughput, right? So we can
handle, you know, hundreds of thousands of messages a minute with no problem. And Amazon
is scaling out on the backend, handling all this, all this crazy message infrastructure.
That's something that we just could not do on-prem. And it's exciting because people can,
your customers can see what's happening, you know, in production, you know, 15 minutes after it happens.
And that just wasn't really possible before in a big data, you know, data warehouse situation.
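For illustration, a minimal Lambda handler wired to a Kinesis stream might look roughly like this; the downstream staging step is left as a placeholder since it depends on the pipeline:

```python
import base64
import json

# AWS invokes the handler with batches of Kinesis records; each record's data
# payload arrives base64-encoded. Here we just decode and parse the messages.
def handler(event, context):
    parsed = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        parsed.append(json.loads(payload))

    # ... transform and stage `parsed` for the warehouse (e.g. micro-batch to S3) ...
    return {"records_processed": len(parsed)}
```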
That's great.
Eric, it's your turn now.
Awesome.
Yeah, I was going to say, thinking back, you know, you said 18 months. And, you know, that's, that is a non-trivial amount of time. But my gut reaction to hearing, and especially now after hearing more of the details of the migration, that actually sounds really fast for how fundamental of a shift it was technologically. So again,
I'll reiterate Kostas's congratulations on that because that's a monumental effort in a relatively
short amount of time for how much changed. Thank you. Yeah, I think it was worth it. So
we're happy with where we are now.
I guess what I'm discovering about my line of questioning this episode is that it's
about understanding sort of the course of your career. But I noticed that you were a software
engineer before working in the data engineering space, and I'm just interested to know, you know,
since you did software engineering at Intuit, how has that changed your perspective
on data engineering? And specifically, you know, do you think that there are things that you
experienced as a software developer that make you more valuable as a data engineer,
especially with sort of the range of, or the scope of work you're doing across the organization with
all sorts of different types of data? So, so yes, I was a, you know, software application developer for most of my career.
And then right around October, 2010 timeframe, you know, I, I left that, that software engineering
team and became part of the risk, you know, data warehouse BI team, and I've pretty much been working in that space ever since.
So as a software engineer, I was working with highly transactional systems and data sets were mostly small.
So you build like a business application or a website or something like that.
And you're just dealing with small, lots of small transactions and
usually working with relational databases like SQL server or Oracle or so on. So I did that for
a long time. And you know, I'm happy with what I learned in that space. And, you know, I learned
kind of what the limitations are, although at the time, you know, I didn't think about the fact that
there were limitations. I just learned how that world worked and, you know, dealing with transactional DBs and getting good at writing SQL and stored procedures and learning how to, you know, tier your applications and those kinds of things.
But I think just around October 2010, you know, I just got more interested in the data warehousing space and started working on that. It's more on the back end, of course, but it's just dealing with different kinds of problems. The data is much bigger and
the problems and scenarios are different.
So it kind of felt like a new job, you know, in many ways, and it keeps my interest in this space.
But I think it, you know, really helped that I know the front end as
well as the back end and kind of what the pain points are on the front end and understanding
what they're about. And I think that helps me deal with the backend stuff and, you know, be sympathetic to those
things.
Yeah, absolutely.
It gives you, having been in sort of the shoes of someone who's doing a certain job that
has an output that you deal with, gives you, and I'm just thinking back on experiences
I've had where it just gives you a lot more empathy, you know, in terms of dealing with
some of the issues that come with data, which is always messy, you know, in some form or fashion,
and always requires some level of cleansing. So question, so Intuit's in the financial space,
so you deal with sensitive information. Could you talk through how that impacts your work in the data engineering
space? I mean, you talked a little bit about the security in the cloud, but finance is one of the
most highly regulated industries there is. And dealing with that data, I'm sure presents pretty
particular challenges. I'd just love to hear about what some of those challenges are and then how you deal with them as, you know, as a data engineering manager.
Sure. So the group I work in is mainly in the money movement space. So this is things like
payments, payroll, QuickBooks capital, you know, moving money around, dealing with, you know,
card entities like Visa, Mastercard, Discover, Amex, PIN debit, ACH,
those kinds of things. And, you know, it's a lot of parallels with being a bank. So I always kind
of remind people that Intuit is almost like a bank, but not quite a bank. So there's a lot of
things that we need to do, you know, that are very
common with a big bank.
So you have all kinds of compliance issues that come into place.
So like, you know, PCI compliance, SOX compliance, you know, for tax,
they have 7216 compliance, and you have to deal with entities like the Office
of Foreign Assets Control, NACHA, FinCEN. If there's big fraud events, we have contacts with the FBI and so on
to help us deal with fraud attacks.
And so there's a lot of regulations and stuff that you have to deal with,
and that's not fun, but it's necessary.
And also, you know, encrypting data in transit, encrypting data at rest, and
dealing with keys, how you're handling keys, how you're handling sensitive fields.
All of these things are important, and there are central teams that help
the PD teams, you know, deal with the stuff and make the right decisions and
make sure their account is set up the right way, and that they're using the keys
properly, and, you know, that they understand two-way encryption or hashing and stuff properly.
So, you know, there's a lot of guidance and help in that space. But yeah, it is very similar to banking.
And when you move a lot of money around, there's a lot of risk that comes.
So fraudsters are always trying to, you know, attack the system and create fake
accounts and launder money. And, you know, so that's a, it's a big kind of soup of issues that
you need to deal with on a daily basis, but it's a fun, fun space to work in.
Yeah. Yeah. I mean, I'm sure it's, you have to solve all sorts of interesting problems, you know, as the entire world has gone digital.
And one question. We had a guest on recently who worked in data science
in the healthcare space.
And he talked a little bit about sort of the, you know,
some of the challenges he faced in a very highly regulated industry building models with,
you know, PII or sensitive data. I know you're not on the data science team, but it sounds like
you deliver data products to them, you know, or collaborate closely with them. Is there anything
on the data science side in terms of dealing with financial data or sensitive data that presents particular challenges?
Yes. So I think that they can use hashed fields for a lot of things. So instead of having a full tax ID in the clear, they can use a hashed value of that, for example. So yeah, I'm sure there's
issues when they do their featurization and they're coming up with,
you know, which features are going to be, you know, more powerful than others and more influential than others.
They have to use, you know, what's available to them.
So some of the things they can do is have like a real-time model, for example, in line
with a transaction or an onboarding event.
And in that case, they have access to data as it's coming in for a
transaction and they can see things that you couldn't otherwise see like in the data lake,
for example. So for those kinds of real-time models, they're able to do some fancy stuff there
and have access to data that wouldn't be normal to have access to in the lake. And then for a
batch model, for example, they can run huge batch models for portfolio
analysis or whatever on lake data or data that we have in our S3 bucket.
And then those models might use hashed values for sensitive fields, for example.
So I think they get around it.
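A toy illustration of that hashed-field idea; the key and tax ID here are made up, and a real setup would keep the key in a KMS-backed secrets store rather than in code:

```python
import hashlib
import hmac

# Instead of exposing a raw tax ID to batch models or the lake, store a keyed
# hash of it. The token is stable (so it still works as a join key or feature)
# but is not reversible back to the clear-text value.
SECRET_KEY = b"example-only-not-a-real-key"

def hash_sensitive(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

print(hash_sensitive("123-45-6789"))   # same input -> same token, no clear-text tax ID
```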
But there is definitely a big difference
between batch machine learning and
real-time machine learning. Very interesting. Okay, one more question because we're getting
close to time here and, you know, we talked to all sorts of different people, but it's really
fun to talk to data engineers because we get to ask all sorts of specific questions about data. So
you've talked about multiple different types of data, just in answering some of the other questions.
So Parquet files, et cetera.
But I'd love to know the breadth of the types of data that you and your team deal with.
And then if there are any particular types of data that sort of present unique challenges for you as you're managing.
I mean, it seems like how many pipelines do you manage?
It seems like a huge amount.
So we have, you know, well over a thousand jobs
in our environment.
And then for streams, you know,
we have a couple dozen streams going.
So yeah, it gets complicated.
And then there's dependencies, right?
So if you have, you know, 1,500, 2,000 jobs, whatever it is, certain jobs need to execute before others. So there's a complex dependency web that needs to be managed there, and we have to take care of that too.
In terms of the types of data, the various types of data that are flowing through those pipelines,
I'd love to know just some of the major ones to understand the breadth of different types you're dealing with.
So the standard in the data lake is Parquet, right?
And this Parquet is nice because it includes the schema in the header of the Parquet file.
So you can look at a Parquet file and natively understand the schema and the data types in there. And then
it's partitioned into many files in S3. So it's easy to read that way. And you can read specific
partitions if you want. So that's the data lake standard. But if you're doing messaging or
streaming, usually JSON format messages are common there. And some of those can be, you know, pretty simple and trivial. Others can have deep nesting and be kind of complex.
So you have to be, you know, adaptive in parsing the JSON out to do whatever
processing or flattening you're trying to do for the data warehouse.
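A small sketch of that kind of flattening in Spark, using an invented nested message:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

# A hypothetical nested message, as it might arrive off a stream.
sample = ['{"txn": {"id": "t1", "amount": 42.5, '
          '"customer": {"id": "c9", "region": "US"}}}']

# Let Spark infer the nested schema, then pull the fields up into flat,
# warehouse-friendly columns.
nested = spark.read.json(spark.sparkContext.parallelize(sample))

flat = nested.select(
    F.col("txn.id").alias("txn_id"),
    F.col("txn.amount").alias("amount"),
    F.col("txn.customer.id").alias("customer_id"),
    F.col("txn.customer.region").alias("customer_region"),
)
flat.show()
```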
And then we don't have to deal with like fixed field formats,
you know, much anymore.
Usually the upstream teams are dealing
with that. So, like, Experian ARF format, for example, can be fixed field. So some of the,
some of the older mainframe systems that data vendors use, they might have, you know, fixed
fields or CSVs and things like that, but that's much more rare. And then we don't have to deal
with like any images or audio video data as of yet. So I haven't had to deal with that part.
Yeah, we had someone from Netflix on as a guest, and it was pretty fascinating to hear about them dealing with, you know, audio video data, because it's, I mean, it's pretty heavy duty, you know, when it comes to file sizes, etc.
I know I said only one more question,
but here's a quick follow-up.
What challenges do you face in data types?
Like, is there something that, you know,
is there something that you find
you constantly have to deal with
or sort of has required you sort of making changes
in the pipeline or addressing?
Well, I think you have to be good at
detecting problems upstream.
So sometimes the upstream systems, they're not really aware or nice to the downstream systems,
and they can make breaking schema changes. They can change data types in the middle of a table.
Sometimes they change the meaning of fields, and they don't really think too hard sometimes about the downstream implications of that. So that's a challenge.
Also, the upstream systems may not be aware. The data can still be there, like, 20,
30% of the time, but then they did a code change and now the other 70% of the time,
the field is not populated, and they may not notice that right away.
But when you do, you know, aggregations and pivots and stuff like that, that kind of problem pops out very prominently and you can see big drop-offs in field populations.
So sometimes we have to, you know, tell the upstream system, hey, you know, what happened with this field?
And, you know, on Tuesday it was populating 99% in here and Wednesday we only see 70%. What happened? And, you know, sometimes
it's news to them, but so you just have to be sort of prepared. You have to be good at
detecting problems with the upstream systems and problems in the lake too. So there's different
techniques for that. Sure. All righty.
Well, we are at time here, Alex.
It has been really fascinating to hear about all the incredible work that you've done at Intuit.
And I know that our listeners will really appreciate the insights that you've provided,
especially around handling major migrations.
So thank you again for your
time and for teaching us so many great things. Thank you very much for having me on today.
Appreciate it. Well, that was absolutely fascinating. I mean, I think one of my big
takeaways is that Alex manages a thousand pipelines, which is kind of mind boggling to me.
That's just, that sounds, I'm getting a little bit stressed just thinking about that. What stuck out to you, Kostas?
Well, I think managing a thousand pipelines is nothing compared to re-architecting and
redeploying everything from on-prem to the cloud in 18 months successfully. That was
insane. I mean, he was very modest and very cool about
it, but for the team and the company, I think it's also a big success for Intuit and
the culture that they have.
This kind of radical restructuring of such an important thing and complex thing as the
data infrastructure in 18 months, like it's, it's insane.
I found it extremely interesting.
Me too.
Yeah.
Alex is so calm.
He's the, he seems like the type of guy you would want behind the wheel of a huge project
like that because it doesn't seem like a lot ruffles his feathers.
All right.
Well, thanks for joining us on the Data Stack Show.
Subscribe to get notified of new episodes on your favorite podcast service,
and we will catch you on the next one.