The Data Stack Show - 23: Migrating from On-Premises to the Cloud with Alex Lancaster from Intuit
Episode Date: February 3, 2021

On this week’s episode of The Data Stack Show, Kostas and Eric are joined by the risk data engineering manager at Intuit, Alex Lancaster. Alex has been with Intuit, known for products like QuickBooks, TurboTax, Mint, and more, for 15 years and was part of a recent massive and successful re-architecting from on-prem to the cloud.

Highlights from this week’s episode include:

Alex and his role at Intuit (1:51)
Data marts at Intuit (2:57)
Revolutionary changes in the data engineering space in the past 15 years (6:46)
Security in the cloud vs. on-prem (12:46)
Data architecture at Intuit (15:42)
Doing ETLs inside or outside of the database (19:11)
How to transition successfully from on-prem to cloud: forklifting vs. re-stacking (23:22)
Alex’s application of software engineering skills to data engineering (28:44)
Dealing with data engineering challenges related to security and regulation (31:48)
Pipelines managed and challenges in data types (36:45)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show, where we talk with data engineers, data teams, data scientists,
and the teams and people consuming data products.
I'm Eric Dodds.
And I'm Kostas Pardalis.
Join us each week as we explore the world of data and meet the people shaping it.
Welcome to the Data Stack Show. We have Alex from Intuit on the show as a guest today. And my burning question that I want to ask Alex is, he's been at Intuit
for a really long time, you know, and it's really common, I think, among our guests, you know, they'll
have different roles in different companies, which is really cool. It's just unique to see someone who's
been at a company for well over a decade. And so one of the main questions I want to ask Alex is
what he's seen in that time within an organization. That just gives you a really unique perspective.
Kostas, what's the main question you want to ask Alex? I really, really want to ask him about the migration from on-prem to the cloud,
especially for a company of the complexity and the size of Intuit.
So I'm very, very excited to talk with him today and learn more about this.
Great. Well, let's go and ask our questions.
Let's do it. Welcome back to the
Data Stack Show. We have Alex Lancaster from Intuit. Alex, thank you so much for joining us
on the show today. Sure. Thank you for having me. Now, I'm really excited to chat with you because
I think you're going to bring, I think, a unique perspective. You've spent well over a decade coming up on 15 years at the same
company working in software and data. A lot of the guests we have have been at multiple
different companies over that period of time. And so I'm just really excited to hear about
your perspective having been at the same place over such a period of change too with technology.
So why don't you
start out by giving us just a little bit of background on yourself and talk about what you do
at Intuit. Okay. So my name is Alex Lancaster. I'm the risk data engineering manager at Intuit.
In February, it'll be 15 years there for me. And before that, I worked at United Title Escrow as a software engineer for
four years. And before that, I worked for an MLS company in Simi Valley for almost six years.
And my current work for Intuit is mostly in the data and engineering, data warehousing,
data pipeline space for risk and fraud management for money movement. And we also do some stuff for the compliance folks
and pricing and accounting and finance teams. And we help design all kinds of data marts,
data warehouses, reporting dashboards, things like that. And then our product internally
is known as the risk data mart. And could you explain just for the sake of our listeners,
could you explain the concept of a data mart within Intuit and sort of how that seems like
a product that your team is producing for other people within the company? Could you dig into a
little bit about what a data mart is? Sure. So it's usually a large collection of tables that
have been brought in from different sources,
many different sources.
We probably have 20, 30 different sources that we bring data in from.
Usually these are front-end source systems, and they'll do one little piece of the pie
or piece of the business.
And then when it comes time to understanding the big picture and
people want to do reporting for long periods of time and they want to aggregate data and
roll up data across lots of different functions, you know, you've got to have that all in one place.
So that's mostly what a data mart is about. And also, you know, the data is often transformed or pivoted or flattened, whatever you want to call it, into schemas. And, you know, this is where, like, the Kimball conformed dimensional schema came from, you know, many years ago.
Oh, interesting.
You know, so people, you know, you want to transform the data in a way that makes it work really well for reporting and analytics. And because the source systems that are upstream,
they're usually designed to be fast
at a transactional level.
So you can select, insert, update, delete,
be really fast for one record,
but in a data warehouse,
you're running queries across millions or billions
of records and long periods of time.
And that's a totally different kind of workload than the upstream source
systems do. So that's sort of a summary of what a data mart is.
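To make that workload difference concrete, here is a toy sketch of the kind of cross-source rollup a data mart serves, as opposed to the single-record lookups a transactional system is tuned for. It is only an illustration in pandas; the source systems, columns, and figures are invented, not Intuit's.

```python
import pandas as pd

# Hypothetical transaction-level extracts from two upstream source systems.
payments = pd.DataFrame({
    "txn_date": ["2021-01-03", "2021-01-17", "2021-02-02"],
    "product":  ["payments", "payments", "payments"],
    "amount":   [120.00, 75.50, 300.00],
})
payroll = pd.DataFrame({
    "txn_date": ["2021-01-09", "2021-02-14"],
    "product":  ["payroll", "payroll"],
    "amount":   [2500.00, 2600.00],
})

# A data mart rolls many sources into one conformed, report-friendly shape:
# here, monthly totals by product across both systems.
facts = pd.concat([payments, payroll])
facts["month"] = pd.to_datetime(facts["txn_date"]).dt.to_period("M")
summary = facts.groupby(["month", "product"], as_index=False)["amount"].sum()
print(summary)
```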
Yeah. I mean, it's, I mean, I,
I have a ton of questions and I have one more before I hand it over to Kostas
because I know he's probably chomping at the bit with all the interesting stuff
there. But one observation is that, you know,
we talked to a lot of different people from a lot of different companies working in, in data engineering. And
it's really cool to hear about, I guess I would call it the productization
of delivering data to the company. I mean, even with a name like Datamart, you know,
in much smaller organizations, you know, you basically have the software engineering team
also doing the data engineering work, and then you get a little bit bigger and you have,
you know, maybe a data engineering team and perhaps a data analyst, and then those teams grow,
but it's sort of an individual delivering those things. But it's really neat to hear about how
you've really productized that in a pretty significant and widespread way at Intuit.
Yeah. Yeah. I think it has a lot to do with
the size, right? So when you start out small and you're just supporting a few people or teams,
you can approach it one way. But I have 11 engineers on my team now, and then we're supporting
400 plus people across the enterprise. So when it gets big like that, then the story changes and the way that
you approach things has to change. And also when you're talking about money movement and compliance
and SOX audits and things like that, you have to get more serious about how things are architected
and that sort of thing. Sure. Okay. Well, I'm going to ask my burning question
based on your time at Intuit, and then I'll hand it over to Kostas because
I'm monopolizing the conversation. So almost 15 years at Intuit, congratulations.
You really, I think, have seen sort of what we would call like the data engineering revolution
firsthand with just massive change in technologies, data infrastructure,
the coming of age of the cloud, just all sorts of different, I mean, major, major milestones
in terms of the way that sort of software is delivered and consumed today. So I'd love to
know what are, when you look back over 15 years at Intuit, what are some of the big revolutionary changes that you've seen in
the data engineering space? I think for me, at least the biggest is the move from the on-prem
world into the cloud world. So when you have an on-prem data center, you know, you're maybe you're
using, you know, storage area networks and, you know, you have to worry about your own infrastructure and worry about
storage space and how many nodes do I have and how much space left do I have on my SAN. And maybe if
you have an active, passive or active-active data center situation, you have to worry about
replicating your data across to the other data center. So this world, while it was okay,
and some companies were better at it than others,
it had a lot of problems and drawbacks.
There were always people messing around
with the network infrastructure
and doing patching or updates at weird times
and they may or may not tell you about it.
And as good as companies could get at that,
I don't think they're anywhere near what the big cloud companies are today.
So when you,
when you move to a public cloud and you're in an Azure or AWS situation,
those guys are investing billions every year into their cloud architecture and
infrastructure.
And I'm pretty sure no company, not even a government, can compete with that kind of investment.
And so they're really good at it. And they have designed their cloud environment from the ground up to be very scalable across
the world.
And it lets you get out of the business
of worrying about your hardware and your storage, and, you know, do I have hard drives that are
popping or network cards that are popping, that kind of thing. You don't have to worry about that
anymore, so you can scale in a way that's just impossible to do on-prem. So to me, that's the biggest kind of change I've seen, you know,
in the last, I don't know, 10 years or something like that. So I can talk a little bit about,
you know, what we were doing on-prem versus what we're doing in the cloud today. Just give a quick
summary. Sure. Yeah, that'd be great. And I do think, I mean, I've never heard the comparison of, you know, not even a government can invest that much into the technology. And I think that's just a fascinating comparison. So thanks for that. That got my mind going. But yeah, I would love to hear about your migration from on-prem, you know, we actually had a pretty nice setup. We were using SQL Server Enterprise Edition.
We had a nice, you know, Dell Fiber Channel SAN dedicated to our environment. It was 185 terabytes, which is a pretty good size. And we had two data centers. So we had an identical setup,
about a thousand miles away from each other, with data replication running between the two.
And, you know, that worked well for a while and 185 terabytes
is nothing to sneeze at.
It's mostly row store data though.
SQL Server does have a column store index.
So there were some, you know, columnar tables, which we'll, we'll get into later.
But mostly row store stuff.
And then in September 2017, you know, we started work on the AWS public
cloud migration, and we decided to do a full tech restack at that point.
Not a forklift.
So the difference is when you do a tech restack, you're basically
re-architecting everything you have, changing the products for everything you have.
So we moved away from SQL server to Redshift, for example.
And that took us like 18 months to do that.
So by summer 2019, we were pretty much all in AWS.
And then we were able to turn off our on-prem infrastructure at that point.
And now, you know, we're all in the cloud using the native services there.
So we use things like EMR and Spark clusters and Parquet files and S3 and Redshift, Aurora, Kinesis, you know, MSK,
which is the managed Kafka service, Glue, you know, CloudWatch, things like that. So it's,
that was a huge change for us. And it took us 18 months to do that. So, you know, it was painful,
but it was worth it. And now we're able to support an
environment where we've got around 600 terabytes of columnar compressed storage. So that's, you
know, 10 to one compression ratio right there. So if you tried to take that 600 terabytes and put
it in a row store, you'd end up with like, you know, 6,000 terabytes. So that'd be really hard
to manage on-prem, you know, in some kind of SQL Server or Oracle environment.
But in the cloud, I'm not too concerned about managing 600 terabytes.
And then plus in the cloud, Amazon is managing a lot of data replication for you.
They're doing patching and management stuff for you. So a lot of burden is on them.
And that allows my team to focus just on building application logic and serving our customers. And I don't have to worry nearly as much about what's going on in the data center anymore.
Alex, it's very, very interesting and very exciting for me to have you here today because
you are one of these rare cases of people who have experienced both the on-prem and the cloud solutions.
And it sounds so far that you're pretty excited about the cloud.
And correct me if I'm wrong, but probably you prefer it and you find a lot of benefit in being deployed on the cloud instead of using an on-prem infrastructure.
Many people say that one of the benefits of having on-prem deployment has to do with security and compliance and the control that you have.
What's your opinion about that?
Do you think that this is actually a real concern?
Do you think it is addressed right now by the cloud providers?
Do you think that there's still work to be done there? What's your feeling about it?
I think the security is fine in the cloud. And at Intuit, we have a central security
team. We have a data handling team and they help the various PD teams set up their account
in a certain way. They have Intuitized AMIs. So when you're restacking your
AMIs, they come bundled with all of the security that they want. And we have all the KMS keys and
things like that. That's locking down S3 buckets and encrypting data at rest the way we want.
So I don't see a problem with that. But at the same time, we have a central team
of very smart people that have looked into the details of all this, and they've carefully
architected things to a corporate standard, and we follow that standard. So, you know, to me,
everything works awesome in the cloud, and I would never want to go back to the on-prem way
of doing things. That's great. Is there any kind of advantage that you still think that on-prem has compared to cloud?
If you're small, maybe. I really don't think so. I mean, honestly, I think that the age or the time
of the on-prem data center is quickly evaporating and going away. I don't think if you're a new
company and you're thinking about building infrastructure, to me, it makes no sense to do it on-prem, just build your stuff in the
cloud from the get-go. Maybe there are certain industries or certain weird use cases that I
haven't heard of that you really need some kind of on-prem supercomputer, or maybe you're like a
weather modeling place or something and you need some
crazy supercomputer. But I mean, these days there's so much variety and option in the cloud
to do huge machine learning, huge modeling of data and handling of many, many petabytes of data,
very straightforward. So I just don't see any advantage really to on-prem anymore.
Yeah, makes sense. That's very interesting to hear from you. Going back to the things that
you mentioned a little bit earlier during your conversation with Eric, where you mentioned about
data marts. I mean, data marts in data infrastructure are one of the last steps before
the customer, where the user
of this data is going to consume it through a BI tool or whatever other tools they have.
Can you give us an overview of the architecture that you have today? I mean, the architecture,
the data infrastructure architecture that you have and what kind of paradigm you're
following? Is it something like a data lake,
or is it more built around the data warehouse?
And let's chat a little bit about this, because I think you're going to have a very interesting case.
And you've made a lot of, let's say, very thoughtful decisions around that stuff.
So I think it's going to be very useful for both me and Eric, and also like the people
that are going to listen to the show.
Sure. So we do have a central corporate data lake that is there, and we do pull data from that.
And we also register our transformed files with the central Hive metastore so that it's visible
to other people that use the data lake. But we also have to pull from upstream transactional systems and also streams to get
data in our environment. So, you know, we use EMR clusters to do query-based ingestion from
some places. We use Kinesis streams and MSK to pull data from queues. There's different
latency requirements that we have. So the lake, for example, could be
like a 24-hour kind of latency situation. And then if you try to pull from upstream
transactional databases, maybe you're running many batches and you're pulling every two,
three hours or something from them. And then if you have very low latency situations, you're talking about streams.
So like Kinesis Stream, MSK,
you can get data into your warehouse every 15 minutes,
every 30 minutes, something like that.
So we have all those use cases in play today.
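As a rough sketch of what that query-based batch ingestion can look like in Spark, assuming hypothetical connection details, table, and bucket names rather than Intuit's actual jobs:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("risk-batch-ingest").getOrCreate()

# Pull only the last few hours of changes from an upstream transactional DB.
incremental = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://upstream-host:5432/payments")   # hypothetical source
    .option("dbtable",
            "(SELECT * FROM transactions "
            " WHERE updated_at >= now() - interval '3 hours') AS recent")
    .option("user", "reader")
    .option("password", "***")            # in practice, fetched from a secrets manager
    .option("partitionColumn", "id")      # parallelize the read across executors
    .option("lowerBound", 0)
    .option("upperBound", 100_000_000)
    .option("numPartitions", 8)
    .load()
)

# Land the raw pull in S3 as Parquet, partitioned by ingest date,
# ready for downstream Spark transforms.
(incremental
    .withColumn("ingest_date", F.current_date())
    .write.mode("append")
    .partitionBy("ingest_date")
    .parquet("s3://example-risk-raw/transactions/"))
```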
And I think one of the most important architecture things
for me is do your ETLs outside of the database, right?
So when we were in SQL Server on-prem, you know, we were using SQL Server to do the ETLs and we had lots of stored procedures
and using SSIS and all that. So all that's gone away now. So we use EMR and Spark clusters and
we have several of them and we can scale out our Spark clusters as needed. You know, we can use persistent Spark clusters.
We can use transient Spark clusters as needed.
And also Lambda functions.
When you talk about streaming, you know, Amazon manages the infrastructure for Lambda functions.
And we can handle, you know, hundreds of thousands of messages a minute in that scenario.
And then you, you know, you do your transformations in Spark and so on, and then you
write it back out to S3 for your final summary tables, right? So use Parquet in S3, and you can
partition Parquet files, huge Parquet files, right, that can be dozens of terabytes large
in S3 with no problem. And then you just use this copy command
to load that into Redshift very quickly.
So Redshift has a way to do parallel loads
with Parquet files in S3 very fast.
And your Redshift loads mostly take seconds
or a minute or so.
And then Redshift just becomes like your serving layer
at that point.
So that's sort of the main architecture overview.
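A minimal sketch of that transform-then-COPY pattern, with made-up bucket names, table names, and IAM role; the real pipeline would differ in its details:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("risk-summary-build").getOrCreate()

# Read the raw landed data and aggregate it into a warehouse-friendly summary.
txns = spark.read.parquet("s3://example-risk-raw/transactions/")

daily_summary = (
    txns.groupBy("txn_date", "product")
        .agg(F.sum("amount").alias("total_amount"),
             F.count(F.lit(1)).alias("txn_count"))
)

# Write many moderately sized Parquet files so Redshift can load them in parallel.
daily_summary.repartition(32).write.mode("overwrite").parquet(
    "s3://example-risk-curated/daily_summary/"
)

# Redshift then loads the Parquet files in parallel with a COPY statement,
# typically issued by the orchestration job (table and IAM role are illustrative):
copy_sql = """
COPY risk.daily_summary
FROM 's3://example-risk-curated/daily_summary/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
FORMAT AS PARQUET;
"""
```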
That's very interesting.
Actually, you said something that I'd love to learn more about.
You mentioned that it's better to have your ETL logic
outside of the database, let's say, or the data warehouse,
which is quite interesting because I don't know if you have heard
of all this movement in the market
from going from ETL to ELT, which is more of the paradigm of let's extract the data,
load the data into the data warehouse, and then run any kind of transformations
that we want inside the data warehouse instead of doing it on the fly.
So why do you believe that it's better to have the ETL outside?
And what's the difference?
Like, what were the problems that you had when you were doing the opposite with MS SQL Server?
Okay.
So when you do the ETLs in the database, you are sort of boxed in or limited by that machine,
right?
So if you need to handle some giant ETL job with billions of records, you're running that on your database.
And when the data gets really big, you start to have problems with this approach. So when you
take the ETLs out of the database and you're doing it in EMR with Spark, now if you need a 50 or 100
node Spark cluster for 30 minutes, whatever, to process some 50 plus billion row
ETL, you can do it and it's not going to touch or hurt your database or affect the resources there.
And then you use the, you know, the big data Parquet format in S3 to store your transformation
output, and you can partition the Parquet file, you know, however you want, which is very useful.
And then the COPY command works very well with Redshift to load the data in there.
But at the same time, you can use your Parquet file in S3 to share your data back with
the lake, right?
So what you do is you use a Hive cluster to register that table with a Hive
metastore.
And then the lake becomes aware of your Parquet file sitting in your account.
You don't even have to move the data anywhere.
It's just a metadata entry in there.
And then people can query the lake and see your Parquet file and query it right away.
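A rough sketch of that registration step, assuming a Spark cluster wired to the shared Hive metastore (on EMR this can be the Glue Data Catalog) and invented database, table, and S3 names:

```python
from pyspark.sql import SparkSession

# Registering an external table over Parquet files already sitting in S3 makes
# them queryable by other lake users without moving any data.
spark = (SparkSession.builder
         .appName("register-curated-table")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS risk_shared.risk_events (
        event_id     STRING,
        product      STRING,
        total_amount DOUBLE
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION 's3://example-risk-shared/risk_events/'
""")

# Pick up any partitions already sitting under the S3 prefix.
spark.sql("MSCK REPAIR TABLE risk_shared.risk_events")
```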
So you solve two problems, right?
You're solving the problem of sharing big data sets with data science folks
that want to use your output with SageMaker and their own Spark clusters. And what they really
want to see is Parquet files in S3. And then you solve the data warehouse use case with Redshift,
where people just want to use SQL to query it. And they have things like Tableau and Business Objects and Qlik Sense and so on connected to Redshift, and that works well for them.
You know, it's just the quickest way to handle this scenario.
This is great. I have
another interesting question at least for me. You are describing a very like
modern data architecture that you have deployed on AWS.
That's from what I understand, a pretty recent development, right? I think you said that you
ended the transition to the cloud in 2019 or something. Is this correct?
Yeah. We were finished in summer 2019 and it took us about 18 months to do all of that.
Yeah. So this lake architecture that you have,
did you have any part of this architecture also when you were on-prem, or was the architecture
that you had there for your data infrastructure completely different? So on-prem,
they had the Intuit analytics cloud, the IAC. It was a big Hive cluster, a Hadoop cluster. It was not very good.
It was nowhere near what we have in AWS with the S3 data lake now.
And it was always having like space problems and, you know, throughput
problems and stuff like that.
It just, we just couldn't operate it on the scale that we wanted to.
And the lake really, in my opinion,
the central data lake wasn't truly realized until we got into AWS and we got everything in S3 and everybody put Parquet files in there
and it became like this real usable, powerful thing at that point.
It's very fascinating.
How do you do that?
How do you design this transfer
from this on-prem solution that you already have and you're running and it's operational and it
drives your business and in 18 months you have completely substituted this environment with
something completely new, right? Because it's not just that you are changing your infrastructure.
It's not that you did just that. You re-architected the whole data infrastructure that you have. So
what does it take from an organizational point of view and from the engineering perspective
to do that, how do you do that? I'd love to hear more about how you did it successfully.
So first, you have to make a decision about forklifting versus tech restack. That's a key decision. Personally, I wouldn't recommend
people to forklift what they're doing on-premises into any cloud and then try to duplicate what
you're doing on-prem using virtual machines in the cloud. That's really not what the
cloud is designed for. And you can do it, it's true, but you're not going to get the result and
the value and the benefit from the cloud that you could if you use the native services there.
So we decided to do a full tech restack. We wanted to use all the native services in the cloud and really use the cloud for how it wanted to be used.
And we wanted to get into Spark.
We wanted to use the Redshift MPP, you know, which is a managed service.
And then instead of SQL Server, we use Aurora.
We have a small Aurora database that's also a managed service.
So that's like the start of it. It's that decision forklift
versus tech restack. And then, you know, there's a lot of learning. That 18 months was
painful, you know, and we had a lot of learning and a lot of trial and error on things.
But, you know, we had some architect people to guide us with decisions. We had technical account
managers from Amazon to help guide us with certain decisions. So that helped a lot.
And then, you know, you have to make sure that your manager and his manager and so on
is on board with that.
And, you know, your executive sponsorship is on board with that about what you're doing
and why you're doing it.
So you have to, you know, politically and, you know, program management wise, you have
to communicate a lot about what you're doing and why you're doing it and timelines and so on.
And then you've got to get your customers to come along for that ride at the end and convince them that, you know, you're doing the right thing for the right reasons.
So it's a complicated thing, but at the end, I'm glad we did it this way.
And for us, that tech restack decision was the right one.
I see other teams who did not make that decision. They decided to forklift, and
I see they struggle. They have all kinds of issues from doing that. And I'm just so glad I'm not on
those teams. Yeah, that's, that's amazing. I mean, congrats for successfully doing this project. I mean,
it's for you and the whole team that was involved in this. It's really amazing because it's not just,
I mean, and it's also amazing like from an organizational standpoint, because there's
always resistance to change, and you decided not just to change, but to radically change your
infrastructure and the way that you operate. And that's amazing. And it says something also about the culture in the company.
One last question before I let Eric continue with his questions.
Is there a particular technology that became available to you after you migrated
into the cloud that you are really excited that you are using and it's something that
you consider as like a game changer in your, in your work?
Yeah, I would say the streaming. So being able to use Kinesis streams with Lambda or using
MSK managed services for Kafka, that's pretty huge because now you can get data in your warehouse,
like 15 minute latency, 30 minute latency and handle huge throughput, right? So we can
handle, you know, hundreds of thousands of messages a minute with no problem. And Amazon
is scaling out on the backend, handling all this, all this crazy message infrastructure.
That's something that we just could not do on-prem. And it's exciting because people can,
your customers can see what's happening, you know, in production, you know, 15 minutes after it happens.
And that just wasn't really possible before in a big data, you know, data warehouse situation.
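For illustration, a minimal Lambda handler wired to a Kinesis stream might look roughly like this; the downstream staging step is left as a placeholder since it depends on the pipeline:

```python
import base64
import json

# AWS invokes the handler with batches of Kinesis records; each record's data
# payload arrives base64-encoded. Here we just decode and parse the messages.
def handler(event, context):
    parsed = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        parsed.append(json.loads(payload))

    # ... transform and stage `parsed` for the warehouse (e.g. micro-batch to S3) ...
    return {"records_processed": len(parsed)}
```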
That's great.
Eric, it's your turn now.
Awesome.
Yeah, I was going to say, thinking back, you know, you said 18 months. And, you know, that's, that is a non-trivial amount of time. But my gut reaction to hearing, and especially now after hearing more of the details of the migration, that actually sounds really fast for how fundamental of a shift it was technologically. So again,
I'll reiterate Kostas's congratulations on that because that's a monumental effort in a relatively
short amount of time for how much changed. Thank you. Yeah, I think it was worth it. So
we're happy with where we are now.
I guess what I'm discovering about my line of questioning this episode is that it's
about understanding sort of the course of your career. But I noticed that you were a software
engineer before working in the data engineering space, and I'm just interested to know, you know,
since you did software engineering at Intuit, how has that changed your perspective
on data engineering? And specifically, you know, do you think that there are things that you
experienced as a software developer that make you more valuable as a data engineer,
especially with sort of the range of, or the scope of work you're doing across the organization with
all sorts of different types of data? So, so yes, I was a, you know, software application developer for most of my career.
And then right around October, 2010 timeframe, you know, I, I left that, that software engineering
team and became part of the risk, you know, data warehouse BI team, and I've pretty much been working in that space ever since.
So as a software engineer, I was working with highly transactional systems and data sets were mostly small.
So you build like a business application or a website or something like that.
And you're just dealing with small, lots of small transactions and
usually working with relational databases like SQL server or Oracle or so on. So I did that for
a long time. And you know, I'm happy with what I learned in that space. And, you know, I learned
kind of what the limitations are, although at the time, you know, I didn't think about the fact that
there were limitations. I just learned how that world worked and, you know, dealing with transactional DBs and getting good at writing SQL and stored procedures and learning how to, you know, tier your applications and those kinds of things.
But I think just around October 2010, you know, I just got more interested in the data warehousing space and started working on that. It's more on the back end, of course, but it's just dealing with different kinds of problems. The data is much bigger and
the problems and scenarios are different.
So it kind of felt like a new job, you know, in many ways, and it keeps my interest in this space.
But I think it, you know, really helped that I know the front end as
well as the back end and kind of what the pain points are on the front end and understanding
what they're about. And I think that helps me deal with the backend stuff and, you know, be sympathetic to those
things.
Yeah, absolutely.
It gives you, having been in sort of the shoes of someone who's doing a certain job that
has an output that you deal with, gives you, and I'm just thinking back on experiences
I've had where it just gives you a lot more empathy, you know, in terms of dealing with
some of the issues that come with data, which is always messy, you know, in some form or fashion,
and always requires some level of cleansing. So question, so Intuit's in the financial space,
so you deal with sensitive information. Could you talk through how that impacts your work in the data engineering
space? I mean, you talked a little bit about the security in the cloud, but finance is one of the
most highly regulated industries there is. And dealing with that data, I'm sure presents pretty
particular challenges. I'd just love to hear about what some of those challenges are and then how you deal with them as, you know, as a data engineering manager.
Sure. So the group I work in is mainly in the money movement space. So this is things like
payments, payroll, QuickBooks capital, you know, moving money around, dealing with, you know,
card entities like Visa, Mastercard, Discover, Amex, PIN debit, ACH,
those kinds of things. And, you know, it's a lot of parallels with being a bank. So I always kind
of remind people that Intuit is almost like a bank, but not quite a bank. So there's a lot of
things that we need to do, you know, that are very
common with a big bank.
So you have all kinds of compliance issues that come into place.
So like, you know, PCI compliance, SOX compliance, you know, for tax,
they have 7216 compliance, and you have to deal with entities like the Office
of Foreign Assets Control, NACHA, FinCEN. If there's big fraud events, we have contacts with the FBI and so on
to help us deal with fraud attacks.
And so there's a lot of regulations and stuff that you have to deal with,
and that's not fun, but it's necessary.
And also, you know, encrypting data in transit, encrypting data at rest, and
dealing with keys, how you're handling keys, how you're handling sensitive fields.
All of these things are important, and there are central teams that help
the PD teams, you know, deal with the stuff and make the right decisions and
make sure their account is set up the right way, and that they're using the keys
properly, and, you know, that they understand two-way encryption or hashing and stuff properly.
So, you know, there's a lot of guidance and help in that space. But yeah, it is very similar to banking.
And when you move a lot of money around, there's a lot of risk that comes.
So fraudsters are always trying to, you know, attack the system and create fake
accounts and launder money. And, you know, so that's a, it's a big kind of soup of issues that
you need to deal with on a daily basis, but it's a fun, fun space to work in.
Yeah. Yeah. I mean, I'm sure it's, you have to solve all sorts of interesting problems, you know, as the entire world has gone digital.
And one question. We had a guest on recently who worked in data science
in the healthcare space.
And he talked a little bit about sort of the, you know,
some of the challenges he faced in a very highly regulated industry building models with,
you know, PII or sensitive data. I know you're not on the data science team, but it sounds like
you deliver data products to them, you know, or collaborate closely with them. Is there anything
on the data science side in terms of dealing with financial data or sensitive data that presents particular challenges?
Yes. So I think that they can use hashed fields for a lot of things. So instead of having a full tax ID in the clear, they can use a hashed value of that, for example. So yeah, I'm sure there's
issues when they do their featurization and they're coming up with,
you know, which features are going to be, you know, more powerful than others and more influential than others.
They have to use, you know, what's available to them.
So some of the things they can do is have like a real-time model, for example, in line
with a transaction or an onboarding event.
And in that case, they have access to data as it's coming in for a
transaction and they can see things that you couldn't otherwise see like in the data lake,
for example. So for those kinds of real-time models, they're able to do some fancy stuff there
and have access to data that wouldn't be normal to have access to in the lake. And then for a
batch model, for example, they can run huge batch models for portfolio
analysis or whatever on lake data or data that we have in our S3 bucket.
And then those models might use hashed values for sensitive fields, for example.
So I think they get around it.
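A toy illustration of that hashed-field idea; the key and tax ID here are made up, and a real setup would keep the key in a KMS-backed secrets store rather than in code:

```python
import hashlib
import hmac

# Instead of exposing a raw tax ID to batch models or the lake, store a keyed
# hash of it. The token is stable (so it still works as a join key or feature)
# but is not reversible back to the clear-text value.
SECRET_KEY = b"example-only-not-a-real-key"

def hash_sensitive(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

print(hash_sensitive("123-45-6789"))   # same input -> same token, no clear-text tax ID
```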
But there is definitely a big difference
between batch machine learning and
real-time machine learning. Very interesting. Okay, one more question because we're getting
close to time here and, you know, we talked to all sorts of different people, but it's really
fun to talk to data engineers because we get to ask all sorts of specific questions about data. So
you've talked about multiple different types of data, just in answering some of the other questions.
So Parquet files, et cetera.
But I'd love to know the breadth of the types of data that you and your team deal with.
And then if there are any particular types of data that sort of present unique challenges for you as you're managing.
I mean, it seems like how many pipelines do you manage?
It seems like a huge amount.
So we have, you know, well over a thousand jobs
in our environment.
And then for streams, you know,
we have a couple dozen streams going.
So yeah, it gets complicated.
And then there's dependencies, right?
So if you have, you know, 1,500, 2,000 jobs, whatever it is, certain jobs need to execute before others. So there's a complex dependency web that needs to be managed there, and we have to take care of that too.
In terms of the types of data, the various types of data that are flowing through those pipelines,
I'd love to know just some of the major ones to understand the breadth of different types you're dealing with.
So the standard in the data lake is Parquet, right?
And this Parquet is nice because it includes the schema in the header of the Parquet file.
So you can look at a Parquet file and natively understand the schema and the data types in there. And then
it's partitioned into many files in S3. So it's easy to read that way. And you can read specific
partitions if you want. So that's the data lake standard. But if you're doing messaging or
streaming, usually JSON format messages are common there. And some of those can be, you know, pretty simple and trivial. Others can have deep nesting and be kind of complex.
So you have to be, you know, adaptive in parsing the JSON out to do whatever
processing or flattening you're trying to do for the data warehouse.
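A small sketch of that kind of flattening in Spark, using an invented nested message:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

# A hypothetical nested message, as it might arrive off a stream.
sample = ['{"txn": {"id": "t1", "amount": 42.5, '
          '"customer": {"id": "c9", "region": "US"}}}']

# Let Spark infer the nested schema, then pull the fields up into flat,
# warehouse-friendly columns.
nested = spark.read.json(spark.sparkContext.parallelize(sample))

flat = nested.select(
    F.col("txn.id").alias("txn_id"),
    F.col("txn.amount").alias("amount"),
    F.col("txn.customer.id").alias("customer_id"),
    F.col("txn.customer.region").alias("customer_region"),
)
flat.show()
```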
And then we don't have to deal with like fixed field formats,
you know, much anymore.
Usually the upstream teams are dealing
with that. So, like, Experian ARF format, for example, can be fixed field. So some of the,
some of the older mainframe systems that data vendors use, they might have, you know, fixed
fields or CSVs and things like that, but that's much more rare. And then we don't have to deal
with like any images or audio video data as of yet. So I haven't had to deal with that part.
Yeah, we had someone from Netflix on as a guest, and it was pretty fascinating to hear about them dealing with, you know, audio video data, because it's, I mean, it's pretty heavy duty, you know, when it comes to file sizes, etc.
I know I said only one more question,
but here's a quick follow-up.
What challenges do you face in data types?
Like, is there something that, you know,
is there something that you find
you constantly have to deal with
or sort of has required you sort of making changes
in the pipeline or addressing?
Well, I think you have to be good at
detecting problems upstream.
So sometimes the upstream systems, they're not really aware or nice to the downstream systems,
and they can make breaking schema changes. They can change data types in the middle of a table.
Sometimes they change the meaning of fields, and they don't really think too hard sometimes about the downstream implications of that. So that's a challenge.
Also, the upstream systems may not be aware. The data can still be there, like, 20,
30% of the time, but then they did a code change and now the other 70% of the time,
the field is not populated, and they may not notice that right away.
But when you do, you know, aggregations and pivots and stuff like that, that kind of problem pops out very prominently and you can see big drop-offs in field populations.
So sometimes we have to, you know, tell the upstream system, hey, you know, what happened with this field?
And, you know, on Tuesday it was populating 99% in here and Wednesday we only see 70%. What happened? And, you know, sometimes
it's news to them, but so you just have to be sort of prepared. You have to be good at
detecting problems with the upstream systems and problems in the lake too. So there's different
techniques for that. Sure. All righty.
Well, we are at time here, Alex.
It has been really fascinating to hear about all the incredible work that you've done at Intuit.
And I know that our listeners will really appreciate the insights that you've provided,
especially around handling major migrations.
So thank you again for your
time and for teaching us so many great things. Thank you very much for having me on today.
Appreciate it. Well, that was absolutely fascinating. I mean, I think one of my big
takeaways is that Alex manages a thousand pipelines, which is kind of mind boggling to me.
That's just, that sounds, I'm getting a little bit stressed just thinking about that. What stuck out to you, Kostas?
Well, I think managing a thousand pipelines is nothing compared to re-architecting and
redeploying everything from on-prem to the cloud in 18 months successfully. That was
insane. I mean, he was very modest and very cool about
it, but for the team and the company, I think it's also a big success for Intuit and
the culture that they have.
This kind of radical restructuring of such an important thing and complex thing as the
data infrastructure in 18 months, like it's, it's insane.
I found it extremely interesting.
Me too.
Yeah.
Alex is so calm.
He's the, he seems like the type of guy you would want behind the wheel of a huge project
like that because it doesn't seem like a lot ruffles his feathers.
All right.
Well, thanks for joining us on the Data Stack Show.
Subscribe to get notified of new episodes on your favorite podcast service,
and we will catch you on the next one.