Software at Scale 13 - Emma Tang: ex Data Infrastructure Lead, Stripe

Episode Date: March 21, 2021

Emma Tang was the Engineering Manager of the Data Infrastructure team at Stripe. She was also a Lead Software Engineer at Aggregate Knowledge, where she worked on the data platform. We explore the technological and organizational challenges of maintaining big data platforms. We discuss when a company would require a "Big Data" system, what the properties of a good system look like, what some of these systems look like today, some of the tools/frameworks that work well, hiring the right engineers, and unsolved problems in the field.
Apple Podcasts | Spotify | Google Podcasts
Highlights
0:30 - "Big Data" for software engineers - when does a company need a big data solution
2:30 - The transition from when a company uses a regular database to a big data solution, with a motivating example of Stripe
4:20 - Verification of processed output. Some of the tools involved: Amazon S3, Parquet, Kafka, and MongoDB.
9:00 - The cost of ensuring correctness in the data processing. Using tools like Protobuf to enforce schemas
13:30 - Data Governance as a trend
16:30 - Why should a company have a data platform organization?
21:30 - Hiring for data infrastructure/platform engineers
24:00 - How does a data organization maintain quality? What metrics do they look at?
28:30 - Trends of some problem areas in this space.
33:30 - Emma's interest in data infrastructure, and advice for those looking to get into the field.
Transcript
Utsav Shah: [00:13] Hey, everyone, welcome to another episode of the Software at Scale podcast. Joining me here today is Emma Tang, who is an ex-Engineering Manager at Stripe on the Data Infrastructure team of the Data Infrastructure group. And she used to be a Lead Software Engineer at Neustar, which is an AdTech company. Thanks for being a guest today.
Emma Tang: [00:30] Thanks for having me, Utsav.
Utsav Shah: [00:32] Yeah. Could you tell the listeners and me, because a lot of us are mostly just experienced with full-stack software development, we do front end, we do back end, where does data infrastructure come in, or data in general? When would I use something like Spark? And when does that become business-critical?
Emma Tang: [00:51] That's a really good question and I think that's a question on a lot of folks' minds. It's like, "When does my data become too big that I need to use big data?" Or "Should I consider it right now, given the set of circumstances that I have?" I personally don't think there's a hard and fast rule in terms of volume or size that defines big data. You can't say, "Oh, I have one petabyte of data, therefore, I need to start using Spark." I think it really depends on your situation, like your architecture and your business needs. So, some symptoms of when you need to consider big data include: you're actually hitting your production database for analytics and you're really worried about the load on your production application. That's a time where you need to consider maybe something that's more specialized for dealing with big data. Another example is, let's say you integrate with a bunch of third-party vendors like Salesforce, and Zendesk, and Clearbit, and you want to integrate the data into your system and yield insights from that, you should also consider big data when you're at that point. And then obviously, another case is, you just literally have a business need.
For example, you need to produce data-heavy reports for customers or regulators, or other internal purposes, and you should consider spinning up big data for that as well.
Utsav Shah: [02:10] Okay, so maybe, walking through this as somebody who has no experience in this field at all so I'm just guessing here, maybe V1 of how Stripe would be implemented is just a relational database that has transactions, it's like a ledger. But at some point, you want to do a lot of processing on those transactions, like what the tax rate is for a certain transaction. And there's also really complicated transactions like you're paying maybe an Instacart shopper, and then you're also paying Instacart something and that money is coming from some random account somewhere. At that point, you would start considering managing those transactions using something like Spark? Is that a good way to think about it?
Emma Tang: [02:51] Yeah, you're definitely going at it from the right angle here. So, at Stripe, for example, some of our earliest use cases for big data were actually on the reporting and reconciliation side. So basically, because we're a FinTech company, we deal with a lot of money. And when money is involved, accuracy is super important. So, we can't say, "Here's a rough estimate of what happened yesterday." No, we have to know exactly what happened yesterday, and not only yesterday, for all the days before that as well. If a transaction was somehow reversed a few days later, we need to know about that, and we need to reconcile the systems. So big data is great for that. You can very easily take in the whole universe of all the data you've ever had in your company, and just crunch it in a few minutes. So that's the power that you get. At Stripe, some of our earliest use cases include our internal billing system: at the end of the month, we need to charge our customers a fee and say, "Oh, you made these many transactions on Stripe, we need to charge you this much money." That process is actually a big data process at Stripe, it's literally how we make money. And big data is at the epicenter of that. Some other examples are just product examples. So, Stripe has things like Radar, our fraud detection product, or Capital, which is our lending product, and Sigma, our analytics product. All these products are powered by big data. Without machine learning, without big data, none of these are possible.
Utsav Shah: [04:13] Okay, that makes a lot of sense. So, you basically have something like a cron job or something, just bear with me here, which will go through one day's worth of transactions and figure out, "Okay, these are the final fees that I need to charge people, or charge our customers based on whatever transactions they made," and that will finally go through.
Emma Tang: [04:33] Yep, sort of like that. And we have to combine different data sources together as well. It's not just one transaction source, you also have to combine it with their custom fee plan, because they might have a very different fee plan from other merchants. And you have to combine maybe currency codes, maybe they operate in different countries, and there's different currency codes and different currency exchanges and all that. So it's a lot more complicated than I'm making it out to be, but that's where big data shines. You can really do very complex stuff on really big data, and the system is built for that kind of processing.
Utsav Shah: [05:05] Okay. How does one verify that what the number-crunching did is actually correct when you have such a large pipeline?
Emma Tang: [05:15] That's such a good question, and one of the biggest pain points in big data honestly. Again, I think it really varies by business need. If it's a B2C company like Facebook or social media, where you want to know how many clicks happened on this post or whatever, it's not super mission-critical that it's super, super accurate. And therefore, the way you implement that system might be a little bit different. You don't need at-least-once or exactly-once semantics. Whereas at Stripe, we really do care about accuracy. So, for us, we actually build a lot of custom validation frameworks on top of the jobs themselves. So once a job finishes running, we actually have a bunch of separate post hoc processes that we run to validate the data: that it is the right format, that it follows some rules, for example, it has to be always growing, or "the max may not exceed this", or unique key validations. So, we have all of these that run after each job to automatically validate this. And if it doesn't satisfy these, we immediately invalidate the data and stop the entire pipeline from running. So, this is something that we built bespoke at Stripe, but I honestly feel, and we'll talk more about future trends and stuff like that, but I honestly feel that this is one of the things that the industry needs to do better, which is easy ways to build validations on top of big data pipelines.
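(A minimal sketch of the kind of post-job validation framework Emma describes: row counts that must only grow, value bounds, and unique keys. The dataset, paths, and thresholds are hypothetical, and this is generic PySpark, not Stripe's bespoke tooling.)

```python
# Hypothetical post-job checks: fail loudly and stop downstream jobs if any rule breaks.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("charges-validation").getOrCreate()

class ValidationError(Exception):
    pass

def validate_charges(today_path: str, yesterday_path: str) -> None:
    today = spark.read.parquet(today_path)
    yesterday = spark.read.parquet(yesterday_path)

    # Rule 1: the dataset must only grow day over day.
    if today.count() < yesterday.count():
        raise ValidationError("row count shrank versus yesterday's snapshot")

    # Rule 2: a value bound, e.g. no single fee above a sanity threshold.
    max_fee = today.agg(F.max("fee_amount")).first()[0]
    if max_fee is not None and max_fee > 10_000_000:
        raise ValidationError(f"fee_amount exceeds sanity bound: {max_fee}")

    # Rule 3: unique key validation on the primary identifier.
    if today.count() != today.select("charge_id").distinct().count():
        raise ValidationError("duplicate charge_id values found")

# An unhandled ValidationError marks the output invalid; in practice the scheduler
# would fail this task and block every downstream job.
validate_charges("s3://bucket/charges/ds=2021-03-20/", "s3://bucket/charges/ds=2021-03-19/")
```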
Utsav Shah: [06:40] That's really interesting. It sounds like it's the inverse of machine learning, where you're trying to make sure your training data is clean before it goes into the pipeline. Here, you need to figure out that your output data is clean and so you run a bunch of verifiers, and you need to--
Emma Tang: [06:56] Yeah, exactly.
Utsav Shah: [06:56] Okay, so that makes sense. So maybe a system design for something like Stripe would be, you have an API, which pretty much writes information to some kind of big data warehouse like Hive or something like that, and then you have Spark jobs that crunch all of that information that goes into Hive and produce this final output, which is like reports and billing, and then you have a bunch of verifiers that check--
Emma Tang: [07:21] Yeah.
Utsav Shah: [07:22] Okay, that's really interesting.
Emma Tang: [07:23] I can describe the system at Stripe for you. It's a little bit different from other companies because, again, big data is so core to our business at Stripe, therefore, it does look a little bit different for us. So, there are a couple of different sources of data for us to be used in big data. One is our online databases. So, at Stripe, we use a combination of different types of databases, we use some relational, but also some NoSQL databases. So, at Stripe, we have a really huge Mongo presence. And there are actually ETL pipelines that take the data from Mongo, and basically apply yesterday's changes to yesterday's snapshot to create today's new snapshot. That snapshot will land in S3 for downstreams to use. In that process, we also apply a schema, and we do validations on top of that Mongo data, so the data is actually in Parquet format when it lands in S3. Another source of data for us is our Kafka streams. These are more immutable, event-based streams, but they also end up being archived in S3, and we also transform those to Parquet format. And then there are other smaller kinds of data streams that flow in mostly from third-party integrations, like I mentioned, Salesforce, etc.; those also land in S3. So at Stripe, our data lake is S3, the source of truth is S3. However, we do have data warehouses on top of that, but they are not considered the source of truth, even though they have almost all the stuff that S3 contains. So at Stripe, we use a combination of Redshift and Presto, but we're in the process of deprecating Redshift. So, I think we're very close. So that basically sits on top of the data, and it gets ingested every day through another pipeline as well. In terms of batch distributed computation, we use Spark. We use mostly DataFrames, and we also have quite a bit of Spark SQL. We used to use something called Scalding, which is based on MapReduce, which is an older technology, but we've mostly moved off that at this point.
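(The daily snapshot ETL described above, applying yesterday's changes to yesterday's snapshot to produce today's snapshot as Parquet in S3, might look roughly like this in PySpark. The paths, fields, and change-event format are assumptions for illustration, not Stripe's actual pipeline.)

```python
# Hypothetical daily snapshot job: yesterday's snapshot + yesterday's changes -> today's snapshot.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("mongo-snapshot-etl").getOrCreate()

snapshot = spark.read.parquet("s3://lake/payments/snapshot/ds=2021-03-19/")
# Assume each change row carries the full updated document plus an updated_at column.
changes = spark.read.parquet("s3://lake/payments/changes/ds=2021-03-20/")

# Keep only the latest change per document id for the day.
latest = (changes
          .withColumn("rank", F.row_number().over(
              Window.partitionBy("doc_id").orderBy(F.col("updated_at").desc())))
          .filter("rank = 1")
          .drop("rank"))

# Upsert: changed documents replace their old versions; untouched documents carry over.
carried_over = snapshot.join(latest.select("doc_id"), on="doc_id", how="left_anti")
today = carried_over.unionByName(latest.select(*snapshot.columns))

# Write the new immutable snapshot partition; a declared schema and validations
# would be applied here as well before downstream jobs read it.
today.write.mode("overwrite").parquet("s3://lake/payments/snapshot/ds=2021-03-20/")
```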
Utsav Shah: [09:29] Okay. And the benefit of Spark is you get basically a streaming infrastructure rather than a map job, like a map phase and a reduce phase, with each phase taking a long time… especially the reduce phase, which could potentially take a lot of time to complete.
Emma Tang: [09:42] Yeah, so we actually don't use Spark for streaming purposes. Spark actually still has analogous phases: there are mapper phases, and there are shuffles, which are kind of like a reduce phase. So, I think the basic concept is very similar, but I think the main advantage over MapReduce is that it does a lot of in-memory processing, which is a lot faster than having to [Inaudible 10:06] onto disk over and over again. So that's why it's a lot faster. It's also that with so many people working on it, there's so much attention on it, the optimization work they're doing is really, really great; they have a really strong optimization engine. And if you write DataFrames or SQL, they can actually optimize your query very well, to the point where it will outperform raw Spark without DataFrames by like 10x, 20x. It's quite amazing. So, there's a lot of interesting things happening on the Spark SQL side.
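(To make the optimizer point concrete: the same per-merchant aggregation written against the DataFrame API, which Spark's Catalyst optimizer can rewrite and push down, versus hand-rolled RDD code that it cannot see into. The column names here are made up.)

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()
charges = spark.read.parquet("s3://lake/charges/ds=2021-03-20/")

# DataFrame version: declarative, so Catalyst can prune columns, push the filter
# into the Parquet scan, and pick an efficient aggregation plan.
fees_df = (charges
           .filter(F.col("status") == "succeeded")
           .groupBy("merchant_id")
           .agg(F.sum("fee_amount").alias("total_fees")))

# Equivalent raw RDD version: opaque Python functions, so Spark must materialize
# full rows and cannot apply the same optimizations.
fees_rdd = (charges.rdd
            .filter(lambda row: row["status"] == "succeeded")
            .map(lambda row: (row["merchant_id"], row["fee_amount"]))
            .reduceByKey(lambda a, b: a + b))
```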
Utsav Shah: [10:39] Yeah, that definitely sounds really interesting. And it seems like you have a lot of different data sources that put stuff into S3, and then that gets processed by the Spark jobs, and you have this business-critical output. And it's a very different shape of a company compared to your regular tech companies that don't have to deal with… or a lot of tech companies don't have to deal with large-scale data consistency. As you said, a lot of it is analytics so it's okay if it's slightly wrong. That's fascinating.
Emma Tang: [11:10] Yeah, it's fascinating but it's also very challenging in different ways, I would say. We care a lot more about accuracy, so our pipelines tend to be really robust, but also slightly more computationally intensive. For the same job, we're probably spending three times the computation power that a normal company would to make sure everything's great and accurate.
Utsav Shah: [11:31] Okay. So just in the additional verification, and all of that, is where that additional time goes.
Emma Tang: [11:36] Yeah, there's a lot of that. And also, sometimes we take the more robust route. For example, you can get an estimate by just crunching three days of numbers, but we would go back in history and crunch all 10 years of data.
Utsav Shah: [11:48] Oh, interesting. And it's only going to grow with time.
Emma Tang: [11:51] It's only going to grow.
Utsav Shah: [11:51] And even transaction volume is only going to grow.
Emma Tang: [11:53] Exactly. So, there's actually good ways to work around that issue as well but we can go into that another time. There's a lot of stuff here. And there's some things I didn't even touch upon that are also in the sphere of big data. For example, scheduling is also a major issue. You have pipelines, and you want to make sure that this job runs after job A, and job B runs after job D. That kind of what's called a directed acyclic graph is something that is very important in making data pipelines more robust. At Stripe, we use Airflow, but there are also other schedulers out there as well.
Utsav Shah: [12:32] Okay. Yeah, I'm familiar with Airflow, where, as you said, it seems like it allows you to make these DAGs of different jobs so you make sure that the entire pipeline goes through. But what happens then when somebody adds an intermediate job that starts failing? Does the entire company get paged? How do you maintain reliability?
Emma Tang: [12:53] You ask all the best questions.
Utsav Shah: [12:55] Thank you.
Emma Tang: [12:56] Unfortunately, sometimes that does happen. So different people deal with it differently. At Stripe, we have basically different types of rules for each job to specify what the failure behavior is. So, there are rules like "Stop the entire pipeline", that's usually the default. But there are also rules like, "If this job fails, and the next job needs to run, the next job is allowed to use yesterday's data." Or "If this job fails, there's some other default job that you can fall back to. That job will run instead, and then the downstreams can run." So, we have a lot of complicated logic around this, and unfortunately, it just really depends on the situation.
Utsav Shah: [13:37] Yeah, that makes sense. So, I guess one example I can think of is if you're trying to detect whether there was a lot of fraud or not, you don't have to use the previous day's data. I don't know, I could be wrong here on what the business criticality of that is in case that job is failing temporarily. But definitely for calculating how much you need to bill your customers, the entire pipeline has to stop.
Emma Tang: [13:58] Yeah. So, unfortunately, when that happens, people do get paged. All of us get paged.
Utsav Shah: [14:04] Yeah. So how do you maintain data quality? This is a very persistent problem that I've seen where sometimes data might just not be-- For example, country codes: there might be an invalid country code in there, and then some assumptions that a downstream job makes on valid country codes could just break. How do you make sure that data isn't getting dirty because of some intermediate job? I don't know if that makes sense.
Emma Tang: [14:32] Yeah, no, that does make sense. So, like we talked about a little before, I think validations are a really important part of it. So, we do spend a lot of, again, computational resources and ultimately money on making sure that the data is really correct. Some architectural decisions that we made to make sure things are good: for the most part, at Stripe, in terms of Spark jobs, it's almost all purely Scala, so it's type safe. We also make sure that all data that's important also has a schema associated with it. So, all data is written in Parquet but we also have a schema representation for every data set. So, we use a combination of Thrift and Proto.
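(A generic illustration of "the schema must match the data one-to-one, and we'd rather fail than be wrong": checking an explicitly declared schema against what is actually on disk before a job uses it. This is plain PySpark with hypothetical fields, not Stripe's Thrift/Proto code generation.)

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

spark = SparkSession.builder.appName("schema-check").getOrCreate()

# The declared schema, analogous to a checked-in Thrift/Proto definition.
CHARGE_SCHEMA = StructType([
    StructField("charge_id", StringType()),
    StructField("merchant_id", StringType()),
    StructField("amount", LongType()),
    StructField("currency", StringType()),
    StructField("created_at", TimestampType()),
])

df = spark.read.parquet("s3://lake/charges/ds=2021-03-20/")

# Fail the job outright on any drift between the declared schema and the data on disk
# (column names, order, and types must match exactly; nullability is ignored here).
expected = [(f.name, f.dataType) for f in CHARGE_SCHEMA.fields]
actual = [(f.name, f.dataType) for f in df.schema.fields]
if expected != actual:
    raise RuntimeError(f"schema mismatch: expected {expected}, got {actual}")
```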
We basically, used to only use Thrift; we’re moving on to Proto as a schema representation. And they need to match one to one with the data you're reading. So, if you have a mismatched data schema anywhere along the way, it won't work. So, we rather things fail than be false is our approach. It's an extreme way to do things, again, but that's just the world that we operate in.Utsav Shah: [15:39] I think that's fascinating. So, every single schema is basically specified via a Protobuf message. Emma Tang: [15:45] Yeah.Utsav Shah: [15:46] Or at least the really critical ones. Emma Tang: [15:48] Yes, I would say almost everything. So, everything that I mentioned that comes through Mongo, or comes through Kafka, all datasets coming in that way will automatically have a schema representation, either in Thrift or Proto. And we have some libraries that autogen Scala libraries for each model, basically, which users can then use in their type safe Spark jobs. So that's how we ensure there's consistency there. Utsav Shah: [16:15] Okay, that makes sense. So being the Engineering Manager for pretty much one of the most, it seems like business-critical teams, the company must be pretty stressful.Emma Tang: [16:27] It can get stressful, that's true. Like you mentioned, when jobs fail, it can fail any time of the day, which can be stressful when you're firefighting. I would also say, increasingly, and this touches upon another subject that we might talk about later, which is future and stuff like that, a trend that I'm seeing a lot is increased data governance, data lineage, data security type work. There's just more regulation these days about how data is stored, how data is moved around, what kind of data you can keep around. So that also adds a lot of pressure to data teams as well. Not only do you have to do your job of making sure the infrastructures are running, that you're actually fulfilling the business needs, but you also have these security and privacy regulatory things that you have to start implementing solutions around as well. So, I think there is, in the last few years, a little bit of a double whammy of pressure on data teams. So, I definitely think it’s a ripe solution for some startup to come in and help solve.Utsav Shah: [17:34] So maybe to dig into that a bit, is an example of that something like European data has to be processed in a European data center? Is that an example of that?Emma Tang: [17:44] Yeah, an example is, if a country comes out and says, “Oh, we're going to implement data locality rules. All payment data needs to live in a data center in my country.” And then you might have to spend a lot of time actually working with a regulator to figure out what does that actually mean? Does it mean just storage? Does it mean storage on online databases but also S3? Does it mean compute, like data in fly? And if it is compute, is there a time limit? Can you keep it around for 24 hours, and then ship it over by then? There's a lot of different kinds of possible specifications here. So that's an example of a rule. Another thing is GDPR, you have to encrypt different things in ways. So that's another example.Utsav Shah: [18:33] Is there a particular country that has the most annoying rules? I don't know if you want to share. Or one particular rule that you had to deal with?Emma Tang: [18:47] I think honestly, these are great things for the consumer and for businesses too. These are things that we should have in the world. 
They do add technical challenges to the teams that have to sell for them. An example of one that was a little tricky to solve was the India data locality rule that’s coming out, specifically for payment data. And yeah, basically, if you already have a company and architecture that has scaled to a certain point, then migrating away from that architecture to make data locality work can be a bit challenging. But we still figured out a really great solution, so I'm very happy about that. Utsav Shah: [19:28] And you can't even fall back to the US when something like India because it’s literally halfway around the world, so the latency must be just really high.Emma Tang: [19:38] Yeah, so luckily, we're all AWS so latency is not as bad as you think it is, even though, yes, there is some bit of that latency.Utsav Shah: [19:50] Okay. Yeah, I just remember trying to SSH into my work production servers when I was visiting back. It was like, “I can't work here at all, there’s no point. [20:00] It's just too slow.”Emma Tang: [20:03] Yeah, it could be like that. We did a lot of studies before we even attempted with any solution, to compare the latency characteristics. And in the end, it was actually much better than we assumed so, it was great.Utsav Shah: [20:13] Yeah, that is interesting. So how did the work of your team evolve over time? Because I'm guessing in the beginning, there were teams just writing Spark jobs, and at some point, somebody decided, “We need to really invest in this space and make sure there's a [platform-ish 20:33] layer.’ So, I'm guessing that's where your team is. How did that work evolve over time?Emma Tang: [20:37] Yeah, no, that's a good question. So, I'm going to step back a bit and talk a little bit about why would someone want to have a data org. Is it okay to not have a data org at all, and just figure out whatever solution that works for the business need? And I think the analogy for me is having planes around when most of the transportation is our trains. You can usually get to places on trains, and it's relatively okay, but there are some places that you cannot get to, like, from North America to Europe, you can't get there just on a train, you have to have a plane. But if it's a really short distance, if you're going from San Jose to San Francisco, maybe a plane is a little bit too extra, probably a train is okay. That's an analogy, I think. Big data, it's going to solve problems that your current architecture just cannot solve. It's just not possible. So, you're just losing out on capabilities and possible business products and scalability if you don't even consider it. You just will be kind of left out in the wind, in the dust if that happens. So the question is, when do you want to consider doing it? And like I mentioned before, when you've noticed that there are make or break situations in terms of business needs, the products that you want to make, you want to run machine learning, for example, you definitely to set up big data in order to even do that. Let's say you have a real-time business need for processing some large volume data, you probably need to set up Kafka. If your production database is falling over because you're running queries on it, you should probably set up a data warehouse like a Redshift, or a Snowflake. So, these are the times when you need to consider it. But I would really recommend not waiting until the breaking point. 
Investigate earlier, see if you can integrate it smoothly so that your architecture will just work well with whatever big data systems that you introduce. Okay, now, sorry, I'm going to go back to your original question, which is how has work evolved on big data at Stripe? So, I think I'll just start out by saying there are a couple of different philosophies about how data infrastructure groups should be organized, especially on the point of where do you put data infrastructure engineers and data engineers? And to clarify, in terms of terminology, I think the phrase data engineer is actually interpreted quite differently at different companies. My personal definition is: folks who are setting up the Hadoop clusters or Spark, or basically writing the autogen for schema data models, I call data infrastructure engineers. And the folks who are using the infrastructure to write the jobs and pipelines, I call data engineers. So, I think that's how I differentiate the two. And a lot of the organizational principles revolve around how do you organize the relationship between these two groups?
One philosophy is basically have all of them in the same org. So, you have data infrastructure engineers, and then you have data engineers. And data engineers will help production engineers write their pipelines, but they report up to the same folks as the data infrastructure engineers do. Another philosophy is, let's say the data engineers are actually inside the product orgs, right? They report up to the product orgs, but they just know data engineering, and they will write the pipelines for the product orgs. And the third is there's no data engineer. There's data infrastructure engineers, and there's product engineers, but the product engineers have data engineering capabilities, they know how to do these things, they are more generalists. So, I think those are the main three philosophies.
And Stripe definitely started out with data engineers and data infrastructure as one and the same. When it was small, the people who spun up the infrastructure were also the ones who knew how to write this, so they were doing both jobs. Eventually, when we started growing, it was just very inefficient to keep doing that because they're fundamentally very different skill sets. Understanding production data, understanding the business logic, and how to combine data sources to yield the kind of insights that people want from this is not the same job as making sure the Hadoop cluster doesn't fall over from load. So eventually we started separating the two and that's where we were met with the hard decision of should data engineering live within our org, which is kind of how it was in the past, versus trying to split it out. And we've experimented with different kinds of ways in the past, but I think we ended up in a hybrid model in the end, where we have an organization under data science, which takes care of the most complicated business models and the most complicated analytics, and they will write the pipelines for the other products, basically. And they report up to data science. And then on the other side, most other pipelines, not the top-level most complicated data models, like most other pipelines on the product side, are written by product engineers. And as data infra folks like myself and my team, we make sure the tooling at Stripe is some of the best tooling around. If you want to write a data job, you want to schedule a job, it's super easy.
Observability into your job should be super easy, testing should be super easy. That's our job to make sure the barrier is really low so that we don't need to hire specialized data engineers to kind of tackle this problem. I can talk on and on about this topic but what I think is important to call out is there's a couple of reasons why having a separate data engineering org may not be the best solution for a lot of different companies. One is data engineers are actually really hard to hire. If you think about it, what do data engineers do, in my definition of data engineers? They're the ones translating production logic into Scala or Python or whatever you write your data pipelines in, so they're literally translating things. So, it doesn't really bring a really big sense of satisfaction when you're not producing a lot of things. You're literally a translation layer. And these folks are also highly technical as well. There's no reason why they need to just write pipeline jobs, they can move to other roles as well. So traditionally, they're really hard to hire. And we've observed this at Stripe as well, they're just really hard to hire. And the other thing is when a problem has a software solution, versus a hire people solution, you should always try to bias towards the software solution a little bit, because, at some point, your company is going to grow big enough that hiring a million data engineers is just not very tenable. First up, they are hard to hire. And second, it's expensive to hire them. So, if you take the resources that you would, and put them into developing the best software solution possible, I think that's a more long-term sustainable path forward.Utsav Shah: [27:31] That's so interesting. And I think that certainly makes sense. I don't think I had a good understanding of what a data engineer does before I heard you talk about it. And yeah, how did you even go about evaluating whether somebody is a good data engineer or not, or whether somebody even fits that archetype of what you're looking for?Emma Tang: [27:50] So I personally don't hire for data engineers, I hire for data infrastructure engineers, which is, again, a very different skill set. I think for data engineers, the folks who write these pipelines, they have to have a curiosity into the business, and the product, to really want to understand why the data should be combined the way they do, be creative about coming out with solutions. I would definitely say they need to be a lot more invested in the business side of things.Utsav Shah: [28:19] That makes sense. It's like somebody who's interested in how the product works and why people are buying the product. And they want to--Emma Tang: [28:24] Yeah, working closely with product teams. Yeah, exactly.Utsav Shah: [28:28] And yeah, you hire for data infra engineers, these seem like more your standard infrastructure engineers interested in building big systems and seeing them work and maintaining them. Is that correct?Emma Tang: [28:38] Yeah. So, we definitely want to see folks who not only can dig very deep and can successfully operate in very large codebases, because all of the technology we use - Hadoop, Spark, Kafka - they're all very, very complicated codebases, and you need to be very familiar with digging around in that kind of code. I think the other side that we really like seeing is some DevOps experience or at least the potential affinity for the kind of DevOps work that we sometimes have to do. 
If someone just really hates DevOps, doesn't want to go onto the server to see what the logs are, that’s not the greatest thing. So, it's at once technical depth, but also operational excellence.Utsav Shah: [29:21] That also seems hard to hire really good people. Emma Tang: [29:24] Yeah. Utsav Shah: [29:25] But at least the role is clear, I would say, in the industry what this person does. Interesting.Emma Tang: [29:31] Yeah. Honestly, all the companies are going after the same pool of people. It’s really funny. If you’ve been in this industry long enough, you basically know everybody else who’s in the industry. You just check in with them once in a while, “Hi, are you ready yet?” Check it every six months or so.Utsav Shah: [29:46] Yeah. So, you're definitely looking for people who commit to Hadoop and all that, for example? Or that would be like an ideal.Emma Tang: [29:51] Not necessarily Hadoop. Utsav Shah: [29:52] Okay.Emma Tang: [29:53] I would say people with some distributed computation experience is usually helpful. It doesn't have to be Hadoop. You can be [30:00] a Kafka expert, or you can be a Scalding expert or whatever, and still be very successful at a company that does other types of distributed computation.Utsav Shah: [30:10] Okay. And how do you evaluate whether you're doing a good job and you're helping the business enough? Clearly, when things aren't burning down, that's great. But what kind of quality metrics or like, “Are we shipping enough?” How do you measure yourself?Emma Tang: [30:25] That's a really good question. So, we have a couple of mechanisms to kind of make sure that we're keeping ourselves honest. We definitely take a look at the runtimes of jobs themselves. We want to make sure that even though data is always growing, we still want to maintain a neutral, if not, lower runtime, over time. That means we keep on making optimizations on the frameworks that we're providing. We keep on providing really useful helper methods so that you can avoid processing more data than you need to. We introduce things like Iceberg data format so there's more optimized reading and writing. So, we're trying really hard. Ultimately, we're trying to make sure that we get to the results as fast as possible, as cheaply as possible. And then the other thing is, are we addressing business needs? We have check-ins with all of the user teams at Stripe, usually, at least every quarter, if not more frequently, where we really listen to what their business needs are and see if there's any ways that we can introduce new systems that help address their really painful, burning need. So, for example, Stripe Capital, they were looking at potentially, Spark ML, as well as data science was interesting. So, this will be a framework that we potentially will introduce to solve the business need. And we will check in periodically to see if what we're doing is fast enough and solves their problems for them. I think a lot of other metrics include false failure rates. We want to make sure that those get reduced. We want to count how many data validation errors that we hit. We want to reduce the false positives there, and also just reduce the number of validation errors in general. And, Stripe, in general, is very focused on metrics. So, we have a lot of KPIs. So as an org, we actually have a dashboard of like 10 or 20 different metrics that we track internally to make sure we're doing our job right. So those are kind of the things that we keep ourselves honest with. 
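(A toy example of the kind of pipeline KPIs mentioned here, landing-time SLA misses, validation failures, and runtime percentiles, computed from hypothetical job-run records; any real data org would pull these from its scheduler and dashboards instead.)

```python
# Hypothetical job-run records for one pipeline.
from datetime import datetime
from statistics import quantiles

runs = [
    {"job": "daily_billing", "runtime_min": 42, "landed_at": "2021-03-20T06:10", "validation_failed": False},
    {"job": "daily_billing", "runtime_min": 55, "landed_at": "2021-03-21T08:40", "validation_failed": True},
    {"job": "daily_billing", "runtime_min": 47, "landed_at": "2021-03-22T06:30", "validation_failed": False},
]

SLA_HOUR = 7  # output should land before 07:00

landing_hours = [datetime.fromisoformat(r["landed_at"]).hour for r in runs]
sla_misses = sum(1 for h in landing_hours if h >= SLA_HOUR)
validation_failures = sum(1 for r in runs if r["validation_failed"])
p95_runtime = quantiles([r["runtime_min"] for r in runs], n=20)[-1]  # 95th percentile cut point

print(f"SLA misses: {sla_misses}/{len(runs)}, "
      f"validation failures: {validation_failures}, "
      f"p95 runtime: {p95_runtime:.0f} min")
```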
But again, like you mentioned, there's a lot of different things that are coming in that are not really well captured by metrics, for example, security or data locality. So, it’s definitely a complicated business.Utsav Shah: [32:38] Yeah, it sounds like it's one of the hardest technical challenges, trying to deal with all of this data and all of these different constraints and making sure you're doing your job right. It sounds like a pretty hard technical challenge. Emma Tang: [32:49] Yeah, but that's the fun part. Utsav Shah: [32:52] Yeah. The best part about Stripe is it enables so much stuff. I really like the mission of the community which is, I think improving the GDP of the internet, like enabling or increasing the GDP of the internet. I think that's such a great mission statement. Yeah, and there's so many technical details behind that to get all that working. Emma Tang: [33:13] Yeah, definitely. Definitely. Utsav Shah: [13:15] So then maybe you can explain to me as a layman, why Snowflake is so popular? And what did it solve in the industry that other people were not doing?Emma Tang: [33:25] Yeah. So, this is something that I personally feel pretty strongly about, which is SQL-ification of everything. If you think about initially what big data looked like, it's like MapReduce jobs. So, you write jobs in Java, they're really verbose, you have to write sometimes your own serialization, deserialization code. It's just extremely complex and you need to know a lot before you can even be productive and write anything that does any transformation. So then comes other kinds of possibilities, like Spark, for example. Spark initially, also, you can only write in JVM languages, so Scala or Java, and you also needed to write a lot, but they had a lot more helper functions. And then they introduced Spark DataFrames, which is a little bit more robust, a little bit more rails on it. They kind of modeled the Python DataFrames kind of thing. And then there's Spark SQL. And out of all of these, guess which one has the most excitement and increased adoption trends? It's the one that's easiest for folks to use. So, lowering the barrier to entry to big data has been a long-term trend. Hive is very popular for this reason as well. You can write SQL on it. And then Snowflake, Redshift, and the technologies like this are just one and the same. You can do massive data computation with SQL using familiar kind of interfaces and familiar tools. So that's why it's super powerful.I personally have never worked with Snowflake myself so I can't go too deep in here, but my understanding is that [35:00] there's a lot of advantages to the technology itself. It's also compute and storage is also separated, which is a really great thing. I think initially, Redshift, at least now, they also separate, but in the beginning, it was not separate. So, if the hardware is only like that big, and you can't separate compute and storage, there's only so much data you can put on it, and there's only so much data you can use. And that's not the case with Snowflake. There's also locking associated with Redshift. It took us forever, you can't really do your own thing, and it's pretty expensive, whereas Snowflake, you don’t have that issue. So, I am bullish on any technology that comes out with a SQL solution that's really easy to use, very difficult to mess up, and that can crunch large amounts of data. I'm bullish on any type of technology there. So, Snowflake is just an example of that.Utsav Shah: [35:51] Okay, that makes sense. 
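(To illustrate the "SQL-ification" trend Emma describes: a large aggregation expressed as plain SQL over a data lake, here via Spark SQL; a warehouse like Snowflake or Redshift exposes essentially the same interface. Table and column names are invented.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-ification").getOrCreate()

# Register the Parquet dataset as a temporary view so it can be queried with familiar SQL.
spark.read.parquet("s3://lake/charges/").createOrReplaceTempView("charges")

monthly_volume = spark.sql("""
    SELECT merchant_id,
           date_trunc('month', created_at) AS month,
           SUM(amount)                     AS gross_volume,
           COUNT(*)                        AS n_charges
    FROM charges
    WHERE status = 'succeeded'
    GROUP BY merchant_id, date_trunc('month', created_at)
""")

monthly_volume.write.mode("overwrite").parquet("s3://lake/reports/monthly_volume/")
```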
And it's more about just the ease of use, and maybe the speed or something like that might be [Inaudible 35:56]Emma Tang: [35:56] Yeah, ease of use, and the possibility of crunching large amounts of data. You can't do the same on a Postgres database. Postgres is also easy to use, but it's like, you don't have the same capacity.Utsav Shah: [36:07] That makes sense. And so, you mentioned this earlier, but what do you think are some startups that are missing in this space? Something that still needs improvement. You said verification is one thing.Emma Tang: [36:17] Yeah. So, I'll just start from the top in terms of just general trends, I think not necessarily all of them are startups. Some of them are just literally further improvements of the open-source repos that we have right now. So, in terms of batch distributed computation, I kind of hinted at this, but we're seeing a lot of database optimizations being borrowed, and then transformed for distributed computation. So, a lot of the stuff that we learned a few decades ago that we applied to databases, we're trying to reapply them again to this field. So, for example, columnar format, column stats, table metadata, SQL query optimization, all of these have arrived for big data. Spark SQL and Spark Data Frame have really great optimization engines that's moving in that direction. There's another thing that is kind of taking the big data world by storm, it's table formats. And honestly, what that is, is, before, data is just data, it can be anything really. But table formats kind of bring a little bit more structure to this. So, for example, the one that we use that Stripe is called Apache Iceberg. Basically, you can think of it as a wrapper around Parquet data. It basically has some extra metadata around to describe the data that it actually is referring to. And that is very powerful, that allows for things like schema evolution, or partition evolution, or table versioning, and also optimized performance, because you have data like oh, [Inaudible 37:56] and stuff like that. So, it's super powerful. I think that's where the industry is going. I think some other trends that I see are, again, privacy and security. And over here, I think there's just been a lot of demand on data teams too. At every company, we’re facing similar problems, which is how do we make sure that data is tracked, that lineage is clear? How do we make sure that PII data is kept away, stored securely? How do we tokenize different data if we don't need them to be everywhere? A lot of similar problems at all these companies and data catalogs, data lineage products, and data governance products are becoming much more interesting and necessary. LinkedIn had this Datahub data catalog product that were released, and Lyft also released Amundsen. So definitely a lot of companies have released their open-source thing, but there's no one company that's stood out and said, “Oh, we're going to put a stake in this field, and we're going to take over the marketplace.” So, I still think it's a ripe place to explore. I personally would really like to see kind of a GUI pipeline scheduler. Airflow is definitely a step in the right direction. It has a very feature-full UI, you can do a lot of stuff on it, and that's one of the reasons it's so popular. But I think it could be one step further. The definitions of these DAGs and the jobs themselves, potentially, at least a simple version of it could be a drag and drop. 
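(For context, the kind of Python DAG definition being discussed, the part Emma argues could eventually be drag-and-drop, looks roughly like this in Airflow 2.x. The jobs and the failure-handling choice are illustrative, and they echo the earlier point that "stop the entire pipeline" is the usual default.)

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical daily pipeline: ingest -> validate -> bill, expressed as a DAG.
with DAG(
    dag_id="daily_billing",
    start_date=datetime(2021, 3, 1),
    schedule_interval="0 5 * * *",  # run once a day at 05:00
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    catchup=False,
) as dag:

    ingest = BashOperator(
        task_id="ingest_snapshots",
        bash_command="spark-submit jobs/ingest_snapshots.py --ds {{ ds }}",
    )

    validate = BashOperator(
        task_id="validate_outputs",
        bash_command="spark-submit jobs/validate_outputs.py --ds {{ ds }}",
    )

    bill = BashOperator(
        task_id="compute_billing",
        bash_command="spark-submit jobs/compute_billing.py --ds {{ ds }}",
    )

    # Default failure behavior: if validation fails, compute_billing never runs,
    # i.e. the rest of the pipeline stops.
    ingest >> validate >> bill
```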
There's no reason why you definitely need to write Python to write as an operator. So, I would like to see something in that direction. And kind of adjacent to that is data pipeline observability and operability. I think, also Airflow offers a little bit of this, but it's really hard. Like you mentioned, what if a job fails in the middle of the night, who gets paged? Can they be smarter about who's getting paged? If this pipeline fails, can there be an estimate about the final job and [40:00] what the delay and landing time is? Which validation failed? How did the schema change from yesterday to today? All of these things could be better captured with better tooling. Right now, it's very ad hoc. You go in and look at the S3 bucket, and you say, “Oh, the data yesterday was like this.” So, I think there's a lot of potential there as well.Utsav Shah: [40:18] The way you’re describing it, it's like the 2000s, or the at least 2010s, when there wasn't as many options like Prometheus and Datadog, and all of these things, which is now available in production, but seems like it's still not available for data infrastructure. Emma Tang: [40:34] Exactly. That’s exactly the pain point. Utsav Shah: [40:37] And just the idea of using drag and drop UIs instead of Python described schema, that makes so much sense. But then you get the advantage of version control, and all of that like with Python. But still, I can totally see, it should be really easy for someone maybe who doesn't understand how to write code at all, they only know how to use SQL, to be able to configure and write a data job. Emma Tang: [40:59] Yeah. Utsav Shah: [40:59] Is that possible today? Is there someone who only knows how to run SQL or write SQL? Can they write a data job at Stripe today? Or do you think the tools just aren't there yet?Emma Tang: [41:12] Currently as Stripe, in order to write a proper pipeline, it's not really possible. You can write kind of pseudo pipelines based on Presto, or Redshift, but in terms of all the robust capabilities that we offer, most of it is only through code. We are moving in that direction, though. So, we wrote a bespoke kind of framework within Stripe that allows folks to write pure SQL and that SQL will get translated into a Spark job. So, you don't have to write Scala at all but you need to write a little bit of Python in order to schedule it. So that still is there. So really, the next step is if we can even get rid of that little bit of Python, and have a GUI-based solution, then that could be much more powerful. So a lot of companies do have solutions like this, though. Honestly, I don't know how other companies do it but these other companies that use Hive primarily, I think they also write pipelines with Hive, and that's a pure SQL solution. So, I'm pretty sure there isn't a UI solution at other companies. I think Facebook has a solution that's UI-based. Uber also has a solution that is based on Airflow, but they put a custom kind of interface on top that is much more UI-focused. So, everybody's kind of moving in the same direction at these companies. It's just that there is no unified solution yet.Utsav Shah: [42:40] Basically, make it easier to operate on really large-scale data, and remove as many barriers of entry as possible. Emma Tang: [42:47] Mm-hmm.Utsav Shah: [42:48] Okay. Yeah, taking a step back a bit, what got you interested in this space? 
Traditionally, at least I hadn't heard of data munging, and all of that even in University and I hadn't heard it until I'd been working for like, a year or two years and I had to mess with Hive and try to understand how our user was doing something. And this was like, internal users. And that's when I got interested in making pretty dashboards and all that at max. What got you interested in working on this space?Emma Tang: [43:13] That's a really good question. So, I think I just got lucky in my previous job. Before Stripe, I was at a smaller company. We were acquired by a larger company, but we kind of stayed nuclear within the big company. And it was small enough where I got a chance to work on basically everything. So, I started on Salt Stack and then I moved towards the backend, and then towards infrastructure. And our infrastructure team comprises both of just normal infrastructure, like our edge servers, because it's AdTech, and then we also have the data infrastructure segment as well. And I've just found myself really gravitating towards the data infrastructure side of things. I think the fact that it's so powerful was really just really astounding. Once you've seen the magic of a big data job work, there is something magical about it. I think the other thing is distributed computation is fundamentally like-- I don't think. I think it's a second-order kind of software problem. Like you're not dealing with one server and how it's breaking, you have to deal with like 1000 servers and how they work in concert, and how they talk to each other, and how it works. So, I think that it was interesting in terms of the technical challenge side of things, so I just definitely found myself gravitating towards it. I think also the timing was a little bit lucky. I started putting a lot more time into Spark around 2015, 2016. That was when Spark became more stable, I think, from 1.3, 1.6, and then 2.0. It started getting massively more adoption as well. So, I think there was a wave of interest and I was riding that wave as well. So, I went to Spark Summits, and you meet all the people that are super excited about this field as well and a lot of like-minded [45:00] folks. So, I kind of got swept up in the storm that happened at that time, and just kept on working in this field. And it's definitely just, I still think intellectually, challenge-wise, it's one of the best places to be. It's just so interesting. It's endlessly interesting. It's moving very fast.Utsav Shah: [45:19] That's really cool. And what's your suggestion on somebody who's trying to break into the field right now? Like, if they're just in university, let's say. Or maybe they have a few years of experience, but they find these problems really fascinating. How would somebody transition into this? Or how would somebody start working on data infra or become a data engineer?Emma Tang: [45:36] Yeah. So, I would advise folks like this. Most companies these days do have big data capabilities. Try to get yourself opportunities to work using the data infrastructure, even if you're not working on the data infrastructure itself. Basically, if there's a chance to write a pipeline for your team, and usually there is, your team probably needs metrics, right? Your team probably needs to know some business insights and how they need to operate. 
So, if an opportunity arises, take that opportunity, and run with it, and talk with the data infrastructure folks at your company, and kind of learn what their day-to-day is, and see if there's anything you can help and contribute with. If you see that, oh, the way of doing this is kind of repetitive and there's a helper function that you can write on the infrastructure side that could help a lot of other people, try to do that. Honestly, writing libraries is a huge part of what data infrastructure does. We just abstract away things so that our users have an easier time writing code. So, I think just being exposed to these areas is going to be super important. What I often see instead is when the opportunity arises to jump on these data challenges that are kind of separate from your work, a lot of folks just shrug it away. They're like, "Oh, this is not related to what I do. What I do is Ruby, I do not want to learn Scala. This is totally different." And I think that's kind of doing yourself a disservice. If you don't even try it, how do you know? And also, you want to bet on your future. You don't know what's going to be hot in the future, you just know Ruby. It might be good to expose yourself to a few more types of technology. So, I would say just take the opportunities as they come and just go forward with it and try different things.
Utsav Shah: [47:27] Very good. Yeah, you've answered a lot of questions on big data and everything and I don't think I have any more questions for you.
Emma Tang: [47:34] Awesome. Maybe I'll leave with a couple of common pitfalls that I've seen when people spin up their data infrastructure. So, I would say the number one thing is people trying to do super bespoke solutions, like they implement something on their own. They're like, "Oh, I need a scheduler, let me implement something on my own. It's going to be super cool." It's usually going to be cool, and it will serve your needs in the beginning, but maintaining it and adapting it to your growing business need is going to be a lot harder than if you had an army of open-source people working on the same problem. So, I would say choose popular, widely used frameworks, see how many stars and how many Stack Overflow questions people have answered, just go with the most boring, common solution. That's usually a good call. The other thing is probably, if you can afford it, maybe consider using out-of-the-box solutions on the cloud providers, like use EMR on AWS, instead of spinning up your own Hadoop cluster; that probably isn't worth it. At some point, you can probably just move it in-house, it's not that hard. You do want to make sure that you're using a technology that is going to be replaceable. I think GCP Dataflow is much harder to bring in-house than EMR on AWS. But yeah, so definitely do your own research. I think the other thing I would call out, and this is much harder to do, so only do it if you absolutely can: try to make your data immutable, especially the data written to big data sources that you're going to use downstream. If necessary, just provide a separate path for your offline data from your online data. They don't need to be exactly the same. I mean, the data needs to be the same, but the format does not need to be exactly the same. So, try to make it immutable if possible. And last but not least, I would say try to have a single source of truth for your data.
You don't want to have a situation where the source of truth is separated in different places, like, “Oh, there are some tables that are most accurate on Redshift. There are some tables that are most accurate in S3. There are some tables that are most accurate in these different archives but not these other archives.” So definitely try to standardize and make sure that there is a very easy-to-use multipurpose data lake instead of trying to federate different sources together in the future. That's just such a pain.Utsav Shah: [49:49] Yeah, so S3 should be the source of truth and has the most accurate stuff always and everybody in the org should basically be aware of something like that. Is that a good way to summarize that?Emma Tang: [49:57] Yeah.Utsav Shah: [49:58] Okay, and [50:00] that brings me to another question which you mentioned was, use AWS EMR or use these tools instead of trying to host it open source, at least until the cost becomes prohibitive. Do you know what the markup is, and at what point you should think about, you know, “I shouldn't use Redshift and I should try hosting my own Presto,” for example? At what point do you think it makes sense to start thinking about that? And how much is the markup generally? I'm sure you all have done cost evaluations and these things. Emma Tang: [50:25] Yeah. I don't recall exactly what the markup was. Utsav Shah: [50:30] Just like an [Inaudible 50:31], like how much it makes--Emma Tang: [50:33] It’s definitely less than 10% of your entire spend...Utsav Shah: [50:35] Oh, interesting.Emma Tang: [50:36] Yeah, on the cluster. So probably much less than that, but don't quote me on this. We haven't looked at this in a while. But I think it's mostly a matter of, honestly, when people move on to their own clusters, usually, it's more of a business need like they need to install some custom software, that just they cannot do with AWS. Or they need to sync the data into some bizarre place, EMR just does not accommodate that. It's usually those reasons that people move off. And the savings is another justification for moving it off. So, it really does depend on your business need. And on the flip side, it does depend on how much resources you have. If you have the engineers, and you could possibly do this, then by all means evaluate. But a lot of companies, they might not even ever get to that point where they can afford to hire specialized data infrastructure engineers to try and spin this up. So, it really depends on the situation.Utsav Shah: [51:29] So it's more like a feature gap, rather than a cost-saving measure is why you would consider moving in-house. That's pretty interesting. Emma Tang: [51:36] That's what I observed. Yeah. Utsav Shah: [51:38] Okay. Like, there's some very weird place, you need to put data that AWS doesn't support. I mean, I'm just trying to think because it sounds so abstract to me, what kind of things would they not support? What kind of integrations would they not support? Emma Tang: [51:51] So one thing they don't support is if you want to do Datadog.Utsav Shah: [51:56] Okay, interesting.Emma Tang:[Cross talk 51:57]] At least last time we evaluated it. They run on specific versions of Linux, and it just doesn't accommodate our cost of installation of Datadog. So that was one reason we couldn’t do it. There's also security concerns as well. You know, these boxes, they're spinning up and down, they’re processing all of this data, and they're ephemeral. 
So, there are potential security concerns with that as well.
Utsav Shah: [52:23] Interesting. Yeah, it sounds a little less abstract. So now you were saying something like you can add alerts on whether you're getting enough data on time, and that's why you would put some of your data in Datadog. Okay. That makes sense. Cool.
Emma Tang: [52:35] Yeah.
Utsav Shah: [52:37] Thank you. I learned a lot of stuff about big data and managing big data in general.
Emma Tang: [52:43] Yeah, I'm just spilling all of it out.
Utsav Shah: [52:47] Yeah, and I think it's going to be great to have this recording. And people are going to be familiar with how much work there is in managing all this infrastructure.
Emma Tang: [52:57] Yeah, cool. Thanks for having me again.
Utsav Shah: [52:59] Yeah, thank you.
Emma Tang: [53:00] All right, bye.
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Hey everyone, welcome to another episode of the Software at Scale podcast. Joining me here today is Emma Tang, who is an ex-engineering manager at Stripe on the data infrastructure team or the data infrastructure group. And she used to be a lead software engineer at Newstar which is an ad tech company. Thanks for being a guest today. Thanks for having me, Utsav. Yeah, could you tell the listeners and me because a lot of us are mostly just experienced with you know full stack software development like we do front end, we do back end. Where does like data infrastructure come in or like data in general?
Starting point is 00:00:45 Like when would I use something like Spark? And when does that become like business critical? I think that's a question on a lot of folks' minds. When does my data become too big that I need to use big data? Or should I consider it right now given the set of circumstances that I have? I personally don't think there's a hard and fast rule
Starting point is 00:01:03 in terms of volume or size that defines big data. You can't say I have one petabyte of data, therefore I need to start using Spark. You're actually depending on your situation, like your architecture and your business needs. Some symptoms of when you need to consider big data include you're actually hitting your production database for analytics. And you're really worried about the load on your production application. That's a time where you need to consider maybe something that's more specialized for dealing with big data. Another example is you integrate it with a bunch of third-party vendors like Salesforce
Starting point is 00:01:34 and Zendesk and Clearbit, and you want to integrate the data into your system and yield insights from that. You should also consider big data. Another case is you just have a business need. For example, you need to produce data heavy reports for customers or regulators for other internal purposes. You also should consider spinning up big data. Okay. So maybe like walking through this as somebody who has no experience in this field at all. So I'm just guessing here, like maybe like V1 of how Stripe would be implemented is like just a relational database that has like transactions
Starting point is 00:02:05 it's like a ledger but at some point you want to do a lot of processing on those transactions like you know what the tax rate is for a certain transaction and there's also like really complicated transactions like you're paying like maybe like an instacart shopper and then you're also paying instacart something and that money is coming from some random account somewhere. At that point, you would start considering managing those transactions using something like Spark. Is that like a good way to think about it? You're definitely going at it at the right angle here.
Starting point is 00:02:36 At Stripe, some of our earliest use cases for big data was in the reporting and risk reconciliation. We're a fintech company, we deal with a lot of money. And when money is involved, accuracy is super important. So we can't be like a rough estimate of what happened yesterday. No, we have to know exactly what happened yesterday and not only yesterday,
Starting point is 00:02:52 for all the days before that as well. If a transaction was somehow reversed a few days later, we need to know about that and we need to reconcile the systems. So big data is great for that. You can very easily take in the whole universe of all the data you've ever had in your company and just crunch it in a few minutes. So that's the power that you get.
Some of our earliest use cases included our internal billing system: at the end of the month, we need to charge our customers a fee and say, oh, you made these many transactions on Stripe, we need to charge you this much money. That process is a big data process. It's how we make money, and big data is at the center of that. Some other examples are product examples. Stripe has Radar, our fraud detection product, or Capital, which is our lending product, and Sigma, our analytics product. All these products are powered by big data. Without machine learning, without big data, none of these are possible.
Starting point is 00:03:37 Okay. That makes a lot of sense. So you basically have something like a cron job or something, maybe like, just bear with me here, which we'll go through like one day's worth of transactions and figure out, okay, these are the final fees that I need to charge people or like charge our customers based on whatever transaction they made. And that will finally go through. Yep. Sort of like that. And we have to combine different data sources together as well. It's not just one transaction source. You also have to combine it with their custom fee plan because they might have a very different fee
Starting point is 00:04:07 plan from other merchants. Maybe they operate in different countries and there's different currency. So it's a lot more complicated than making it out to be, but that's where big data shines. You can really do very complex stuff and the system is built for that kind of processing. Okay. How does like one verify that, you know, what the number crunching did is actually correct when you have like such a large pipeline? I think it really varies by business need. If it's a B2C company, that's like social media where you want to know how many clicks happen on this post or whatever. It's not mission critical that it's super, super accurate. And therefore the way you implement that system might be a little bit different.
You don't need exactly-once semantics, whereas at Stripe, we really do care about accuracy. So for us, we actually build in a lot of custom validation frameworks on top of the jobs themselves. Once a job finishes running, we have a bunch of separate post hoc processes that run after the job to validate the data: that it is in the right format, that it follows certain rules. For example, it has to be always growing, or the max may not exceed this, or unique key validations. We have these that run after each job to automatically validate the output. And if it doesn't satisfy these, we immediately invalidate the data and stop the entire pipeline. This is something that we built at Stripe, but I feel this is one of the things that the industry needs to do better: easy ways to build validations on top of big data pipelines.
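To make the idea concrete, here is a minimal sketch of what such post-run validations might look like in PySpark. The paths, column names, and thresholds are hypothetical, not Stripe's actual framework; the point is only the shape: read the finished job's output, apply a handful of rules, and fail loudly so downstream jobs stop.

```python
# Hypothetical post-job validation sketch: check the output of a finished job
# against a few rules and stop the pipeline if any of them fail.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("charge-output-validation").getOrCreate()

today = spark.read.parquet("s3://example-lake/charges/date=2021-03-20/")
yesterday = spark.read.parquet("s3://example-lake/charges/date=2021-03-19/")

errors = []
total_rows = today.count()

# Rule 1: an append-only dataset should always be growing.
if total_rows < yesterday.count():
    errors.append("row count decreased versus the previous snapshot")

# Rule 2: the primary key must be unique.
distinct_keys = today.select("charge_id").distinct().count()
if distinct_keys != total_rows:
    errors.append(f"{total_rows - distinct_keys} duplicate charge_id values")

# Rule 3: the max may not exceed a sanity bound.
max_amount = today.agg(F.max("amount_cents")).first()[0]
if max_amount is not None and max_amount > 10_000_000_000:
    errors.append(f"max amount_cents {max_amount} exceeds the sanity bound")

# Failing here marks the output invalid and keeps downstream jobs from running.
if errors:
    raise RuntimeError("output validation failed: " + "; ".join(errors))
```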
That's really interesting. And it sounds like it's the inverse of machine learning, where you're trying to make sure your training data is clean before it goes into the pipeline; here you need to figure out that your output data is clean, and so you run a bunch of verifiers. So that makes sense. So maybe a system designed for something like Stripe would be: you have an API which pretty much writes information to some kind of big data warehouse like Hive or something like that, and then you have Spark jobs that crunch all of that information that goes into Hive and produce this final output,
Starting point is 00:06:07 which is like reports and billing. And then you have a bunch of verifiers that check. Yeah. Okay, that's really... I can describe the system at Stripe for you. It's a little bit different
Starting point is 00:06:16 from other companies because again, big data is so core to our business at Stripe. Therefore, it does look a little bit different for us. There are a couple of sources of data for us. One is our online database. We use a combination of different type databases, some relational, but also some NoSQL databases. We have a Mongo presence
Starting point is 00:06:34 and there are actually ETL pipelines that take the data from Mongo and apply changes to yesterday's snapshot to create today's new snapshot. That snapshot will land in S3 for downstreams to use. In that process, we also apply a schema and we do validations on top of that Mongo data. The data is actually in parquet format when it lands in S3. Another source of data for us is from our Kafka streams. These are immutable event-based streams. They also end up being archived in S3 and we transform that to a parquet format.
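As a rough illustration of that snapshot-plus-changes approach (not Stripe's actual pipeline; the paths and column names are made up, and it assumes each change event carries the full latest state of the document), the daily job could look something like this in PySpark:

```python
# Hypothetical sketch: build today's snapshot by applying the day's change
# events to yesterday's snapshot, then write the result back to S3 as Parquet.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("daily-snapshot").getOrCreate()

yesterday = spark.read.parquet("s3://example-lake/customers/date=2021-03-19/")
changes = spark.read.parquet("s3://example-lake/customers_changelog/date=2021-03-20/")

# Keep only the latest change per document id for the day.
latest = (
    changes
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("doc_id").orderBy(F.col("updated_at").desc())))
    .filter("rn = 1")
    .drop("rn")
)

# Today's snapshot = yesterday's rows that did not change + the changed rows.
today = (
    yesterday.join(latest.select("doc_id"), on="doc_id", how="left_anti")
    .unionByName(latest)
)

today.write.mode("overwrite").parquet("s3://example-lake/customers/date=2021-03-20/")
```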
Starting point is 00:07:04 And then there's other smaller data streams that flow in mostly from third party integrations, like I mentioned, Salesforce, et cetera, those also land in S3. Our data lake is S3. The source of truth is S3. We do have data warehouses, but they are not considered the source of truth, even though they have almost like all the stuff that S3 contains. We use a combination of Redshift and Presto, but we're in the process of deprecating Redshift.
Starting point is 00:07:27 I think we're very close. That basically sits on top of the data and it gets ingested in every day through another pipeline. In terms of batch distributed computation, we use Spark. We use mostly data frames and we also have quite a bit of Spark SQL.
We used to use something called Scalding, which is based on MapReduce, which is an older technology, but we've mostly ramped off that at this point. Okay. And the benefits of Spark are that you get basically a streaming infrastructure rather than a map job, a map phase and a reduce phase, with each phase taking a long time, especially the reduce phase, which could potentially take a lot of time. We actually don't use Spark for streaming purposes. So Spark has analogous phases. There are mapper phases, but there are shuffles, which are kind of like a reduce phase. So I think the basic concept is very similar, but I think the main advantage
over MapReduce is that it does a lot of in-memory processing, which is a lot faster than having to write to disk over and over again. So that's why it's a lot faster. There's so much attention on it. The optimization they're doing is really great. They have a strong optimization engine. And if you write DataFrames or SQL, they can actually optimize your query very well, to the point where it will outperform raw Spark without DataFrames by 10x, 20x. It's quite amazing. There are a lot of interesting things happening.
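For a sense of what such a batch job looks like, here is a small sketch of a daily fee aggregation over a Parquet data lake, written once with the DataFrame API and once as Spark SQL; both go through the same Catalyst optimizer. The bucket, columns, and fee logic are invented for illustration.

```python
# Hypothetical daily fee-aggregation job over Parquet data in S3, expressed
# both with the DataFrame API and as Spark SQL.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-fees").getOrCreate()

charges = spark.read.parquet("s3://example-lake/charges/date=2021-03-20/")
fee_plans = spark.read.parquet("s3://example-lake/fee_plans/")

# DataFrame API version.
fees = (
    charges.join(fee_plans, on="merchant_id")
    .withColumn("fee_cents", (F.col("amount_cents") * F.col("fee_rate")).cast("long"))
    .groupBy("merchant_id")
    .agg(F.sum("fee_cents").alias("total_fee_cents"))
)

# Equivalent Spark SQL version; the optimizer produces a comparable plan.
charges.createOrReplaceTempView("charges")
fee_plans.createOrReplaceTempView("fee_plans")
fees_sql = spark.sql("""
    SELECT c.merchant_id,
           CAST(SUM(c.amount_cents * p.fee_rate) AS BIGINT) AS total_fee_cents
    FROM charges c
    JOIN fee_plans p ON c.merchant_id = p.merchant_id
    GROUP BY c.merchant_id
""")

fees.write.mode("overwrite").parquet("s3://example-lake/daily_fees/date=2021-03-20/")
```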
Starting point is 00:08:40 Yeah, that definitely sounds really interesting. So you, and it seems like you have a lot of different data sources that put stuff into S3. And then that gets processed by the Spark jobs. And you have like this business critical output. And it's a very different shape of a company compared to like your regular tech companies that don't have to deal with or a lot of tech companies don't have to deal with like, large scale data consistency, right? A lot of as you said, right? A lot of it is like analytics. So it's okay if it's like slightly wrong.
Starting point is 00:09:09 That's like fascinating. Yeah, it's fascinating, but it's also very challenging in different ways. I would say we care a lot more about accuracy. Our pipelines tend to be really robust, but also slightly more computationally intensive. For the same job, we've probably spending three times the computation power that a normal company would
Starting point is 00:09:27 to make sure everything's accurate. Okay. So just in the additional verification and all of that is where that additional time goes through. Yeah. There's a lot of that. Sometimes we take the more robust route. For example, you can get an estimate by just crunching three days of numbers, but we would go back in history and crunch all the data. Oh, interesting. And it's only going to grow with time and even transaction volume. Exactly. There's actually good ways to work around that issue as well, but we can go into that another time as well. There's a lot of stuff here. There's some things I didn't even touch upon that's also in a sphere of big data. For example, scheduling is also a major issue. You have pipelines. You you wanna make sure that
Starting point is 00:10:05 this job runs after job A and job B runs after job D. That what's called a directed acyclic graph is very important. At Stripe we use Airflow, but there's also other schedulers out there as well. Okay. Yeah, I'm familiar with Airflow, like where, yeah, as you said,
Starting point is 00:10:20 it seems like it allows you to make these DAGs of different jobs so you make sure that the entire pipeline goes through. But what happens then when somebody adds an intermediate jobs that starts failing? Does like the entire company get paged? How do you maintain like reliability? Unfortunately, sometimes that does happen. Different people deal with it differently. We have different types of rules for each job to
Starting point is 00:10:45 specify what the failure behavior is. There are rules like stop the entire pipeline. That's usually the default. But there's also rules if this job fails and the next job needs to run, the next job is allowed to use yesterday's data. Or if this job fails, there's some other job that you can default to. That job will run instead. And then the downstreams can run. We have a lot of complicated logic around this. And it just really depends on the situation. Yeah, that makes sense. So like, I guess one example I can think of is if you're trying to detect whether there's
Starting point is 00:11:15 like a lot of fraud or not, you don't have to use the previous day's data. I don't know. I could be wrong here on what the business criticality of that is. In case that job is like failing temporarily. But definitely calculating how much you need to bill your customers, the entire job has to, the entire pipeline has to stop. Yeah, yeah. Unfortunately, when that happens, people do get paged.
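A rough Airflow sketch of those failure behaviors (all task names, commands, and paths are hypothetical, not Stripe's setup): by default a failure stops the pipeline, but a fallback task can republish yesterday's data so that downstream jobs are still allowed to run.

```python
# Hypothetical DAG: if the primary job fails, a fallback republishes yesterday's
# output; the downstream job runs if either the primary or the fallback succeeded.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="daily_settlement",
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    compute_charges = BashOperator(
        task_id="compute_charges",
        bash_command="spark-submit compute_charges.py {{ ds }}",
    )

    # Fallback: only runs if compute_charges fails; reuses yesterday's data.
    reuse_yesterday = BashOperator(
        task_id="reuse_yesterday_charges",
        bash_command="aws s3 sync s3://example-lake/charges/{{ yesterday_ds }}/ "
                     "s3://example-lake/charges/{{ ds }}/",
        trigger_rule=TriggerRule.ALL_FAILED,
    )

    # Downstream billing runs if at least one of its upstreams succeeded.
    bill_customers = BashOperator(
        task_id="bill_customers",
        bash_command="spark-submit bill_customers.py {{ ds }}",
        trigger_rule=TriggerRule.ONE_SUCCESS,
    )

    compute_charges >> bill_customers
    compute_charges >> reuse_yesterday >> bill_customers
```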
Starting point is 00:11:38 So how do you maintain like data quality, right? This is like a very persistent problem that I've seen where sometimes like data might just not be, like for example, country codes, but like there might be like an invalid country code over there in that. And then some assumptions that a downstream job makes on valid country codes could just break. How do you make sure that data isn't getting dirty because of some like intermediate job? I don't know if that makes sense. Yeah, that does make sense. We talked a little bit before. I think validations are a really important part of it. So we do spend a lot of computational resources and ultimately money on making sure that the data is correct. Some
architectural decisions that we made to make sure things are good are, for the most part: in terms of Spark jobs, it's almost purely Scala, so it's type safe. We also make sure that all data that's important has a schema associated with it. So all data is written in Parquet, but we also have a schema representation for every data set. We use a combination of Thrift and Proto. We used to only use Thrift and are moving onto Proto as the schema representation. And they need to match one-to-one with the data you're reading. If you have a mismatched data schema anywhere along the way, it just won't work. So we'd rather things fail than be false is our approach. It's an extreme way to do things,
but that's just the world that we operate in. I think that's fascinating. So every single schema is basically specified via a Protobuf message, or at least the really critical ones. I would say almost everything. Everything that I mentioned that comes through Mongo or comes through Kafka automatically has a schema representation, either in Thrift or Proto. We have some libraries that auto-generate Scala libraries for each model, which users can then use in their type-safe Spark jobs. So that's how we ensure there's consistency there. Okay.
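A simplified illustration of the "fail rather than be wrong" idea: in practice the expected schema would be generated from the Thrift/Proto definition and enforced in type-safe Scala, but this PySpark sketch with made-up fields shows the principle of refusing to read data whose schema does not match one-to-one.

```python
# Hypothetical schema check: declare the expected fields (in a real setup these
# would be generated from a Proto/Thrift definition) and fail the job if the
# data on disk does not match them exactly.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-check").getOrCreate()

EXPECTED = StructType([
    StructField("charge_id", StringType()),
    StructField("merchant_id", StringType()),
    StructField("amount_cents", LongType()),
    StructField("currency", StringType()),
])

df = spark.read.parquet("s3://example-lake/charges/date=2021-03-20/")

expected_fields = [(f.name, f.dataType) for f in EXPECTED.fields]
actual_fields = [(f.name, f.dataType) for f in df.schema.fields]

if actual_fields != expected_fields:
    raise RuntimeError(
        "schema mismatch: expected "
        f"{EXPECTED.simpleString()}, found {df.schema.simpleString()}"
    )
```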
That makes sense. So being the engineering manager for pretty much one of the most business-critical teams at the company must be pretty stressful. It can get stressful, that's true. Like you mentioned, when jobs fail, they can fail at any time of day, which can be stressful when you're firefighting. I would also say, increasingly, a trend that I'm seeing a lot is more data governance, data lineage,
Starting point is 00:13:51 data security type work. There's just more regulation these days about how data is stored, how data is moved around, what kind of data you can keep around. That also adds a lot of pressure to data teams as well. Not only do you have to do your job of making sure the infrastructure is running, that you're actually fulfilling the business needs, but you also have these security and privacy regulatory things that you have to start implementing solutions around as well. So I think there is, in the last few years, a little bit of a double whammy of pressure on data teams. So maybe to dig into that a bit, like, is an example of that something like, you know, European data has to be processed in like a European data center? Is that like an example of that? Yeah, an example is if a country comes out and says, oh, we're going to implement
Starting point is 00:14:35 data locality rules, all payment data needs to live in a data center in my country. And you might have to spend a lot of time actually working with regulators to figure out what does that actually mean? Does it mean just storage? Does it mean storage on online databases, but also S3? Does it mean compute data in fly? And if it is compute, is there a time limit? Can you keep it around for 24 hours and then ship it over by then? There's a lot of different possible specifications here. So that's an example of a rule. Is there a particular country that has the most annoying rules? I don't know if you want to share. Or one particular rule that you had to deal with.
I think these are great things for the consumer and for businesses too. These are things that we should have in the world. They do add technical challenges to the teams that have to solve for them. An example of one that was a little tricky to solve was the India data locality rule that's coming out specifically for payment data. And if you already have an architecture that has scaled to a certain point, then migrating away from that architecture to make data locality work can be a bit challenging. But we still figured out a really great solution, so I'm very happy about that. And you can't even fall back to the US with somewhere like India, because it's literally halfway around the world. So the latency must be really high.
Yeah. So luckily we're on AWS. Latency is not as bad as you think it is, even though, yes, there is some bit of added latency. Okay. Yeah, I just remember trying to SSH into my work production servers when I was visiting back home. It was like, I can't work here at all. There's no point.
It's just too slow. Yeah, it could be like that. We did a lot of studies before we even attempted any solution, to compare the latency characteristics, and it was actually much better than we assumed. Yeah, that is interesting. So how did the work of data infra, right?
Starting point is 00:16:29 How did the work of your team evolve over time? Because I'm guessing in the beginning, there were teams just like managing or writing Spark jobs. And at some point, somebody decided, you know, we need to really invest in the space and make sure there's like a platform-ish layer. So I'm guessing that's what your team is. So how did that work evolve over time? So I'm going to step back a bit and talk a little bit about why would someone want to have a data org? Is it okay to not have a data
Starting point is 00:16:54 org at all and just figure out whatever solution that works? I think the analogy for me is having planes around when most of the transportation are trains. You can usually get to places on trains, but there's some places that you cannot get to, like from North America to Europe. You can't get there on a train. You have to have a plane. If it's a really short distance, if you're going from San Jose to San Francisco,
Starting point is 00:17:15 maybe a plane is a little bit too extra. Probably a train is okay. That's an analogy, I think. Big data, it's going to solve problems that your current architecture just cannot solve. It's just not possible. So you're just losing out on capabilities and possible business products and scalability if you don't even consider it.
Starting point is 00:17:33 You just will be left in the dust if that happens. So the question is, when do you want to consider doing it? And like I mentioned before, when you notice that there are make or break situations in terms of business needs, products that you want to make, you want to run machine learning, for example. Let's say you have a real-time business need for some large volume data. If your production database is falling over because you're running queries on it, you should probably set up a data warehouse like a Redshift or Snowflake. So these are the times when you need to consider it, but I would recommend not waiting until
Starting point is 00:17:59 the breaking point. See if you can integrate it smoothly so that your architecture will just work well with whatever big data systems that you introduce. I'm going to go back to your original question, which is how is work involved on big data at Stripe? I'll just start out by saying there's a couple of different philosophies about how data infrastructure groups should be organized, especially on the point, where do you put data infrastructure engineers and data engineers? To clarify in terms of terminology, I think the phrase data engineer is interpreted quite differently at different companies. My personal definition is folks who are setting up the Hadoop
Starting point is 00:18:35 clusters or Spark, these folks, I call them data infrastructure engineers. And the folks who are using the infrastructure to write the jobs and the pipelines, I call them data engineers. And a lot of the organizational principles revolve around how do you organize the relationship with these two groups? One philosophy is have all of them in the same org. You have data infrastructure engineers and you have data engineers and data engineers will help production engineers write their pipelines, but they report up to the same folks as the data infrastructure engineers do.
Another philosophy is the data engineers are actually inside the product orgs. They report up to the product orgs, but they just know data engineering and they will write the pipelines for these product orgs. And the third is there's no data engineering. There's data infrastructure engineers and there's product engineers, but the product engineers have data engineering capabilities.
Starting point is 00:19:22 They know how to do these things. They're more generalists. Those are the main three philosophies. And Stripe definitely started out as data engineers and data infrastructure are one and the same. When it was small, the people who spun up the infrastructure
Starting point is 00:19:32 are also the ones who knew how to write this. So they're doing both jobs. When we started growing, it was very inefficient to keep doing that because they're fundamentally very different skill sets. Understanding production data, understanding the business logic
Starting point is 00:19:45 and how to combine data sources to yield insights that people want from this is not the same job as making sure the Hadoop cluster doesn't fall over from load. Eventually we started separating the two and that's where we were met with the hard decision of should data engineering live within our org, which is how it was in the past
Starting point is 00:20:01 versus trying to split it out. And we've experimented with different kinds of ways in the past. We ended up in a hybrid model where we have an organization under data science, which takes care of the most complicated business models and the most complicated analytics, and they will write the pipelines for other products. They report up the data science. On the other side, most other pipelines on the product side are run by product engineers. And as data
infra folks like myself and my team, we make sure that tooling at Stripe is some of the best tooling around. If you want to write a data job, you want to schedule a job, it's super easy. Observability into your job should be super easy. Testing should be super easy. That's our job, to make sure the barrier is really low so that we don't need to hire specialized data engineers to tackle this problem. I can talk on and on about this topic. What is important to call out is there are a couple of reasons why having a separate data engineering org may not be the best solution for a lot of companies. One is data engineers are actually really hard to hire. If you think about it,
Starting point is 00:20:59 what do data engineers do? They're the ones translating production logic into Scala or Python or whatever you write your data pipelines. So they're translating things. It doesn't bring a big sense of satisfaction when you're not producing a lot of things. You're a translation layer. And these folks are highly technical as well. There's no reason why they just write pipeline jobs. They can move to other roles as well. So traditionally, they're really hard to hire. And we've observed this, right? But they're just really hard to hire. And the other thing is when a problem has a software solution versus a hire people solution, you should always try to bias towards a software solution. Because at some point your company
is going to grow big enough that hiring a million data engineers is just not very tenable. First, they're hard to hire, and second, it's expensive to hire them. If you take the resources you would have spent and put them into developing the best software solution possible, that's a more long-term sustainable path forward. That's so interesting. And I think that certainly makes sense. I don't think I had a good understanding of what a data engineer does before I heard you talk about it. And yeah, how do you even go about evaluating whether somebody is a good data engineer or not? Or whether somebody even fits that archetype of what you're looking for? I personally don't hire for data engineers, I hire for data infrastructure engineers. I think for data
engineers, they have to have a curiosity about the business and the product, to really want to understand why the data should be combined the way it is, and to be creative in coming up with solutions. I would definitely say they need to be a lot more invested in the business side of things. That makes sense. It's like somebody who's interested in how the product works and why people are buying the product, and they want to learn that. Working closely with product teams. Yeah, exactly. And yeah, you hire for data infrastructure engineers. These seem like more your standard infrastructure engineers, interested in building big systems and seeing them work and maintaining
Starting point is 00:22:42 them. Is that correct? Yeah. We want to see folks who not only can dig very deep and can successfully operate in very large code bases because all of the technology we use, Hadoop, Spark, Kafka, they're all very complicated code bases and you need to be very familiar with digging around in that kind of code. I think the other side that we really like seeing is some DevOps experience, at least the potential affinity for the kind of DevOps works that we sometimes have to do. If someone just really hates DevOps, doesn't want to go onto the server to see what the logs are,
that's not the greatest thing. So we want technical depth, but also operational excellence. That also seems like it's hard to hire really good people. Yeah. But at least the role is clearer, I would say, in the industry, like what
this person does. Yeah. Yeah, all the companies are going after the same pool of people. This is really funny: if you've been in this industry long enough, you basically know everybody else in the industry. You just check in with them once in a while, are you ready yet, you know, check in every six months or so. Yeah, so you're definitely looking for people who commit to Hadoop and all of that, for example? Or that would be like an ideal? I would say people with some distributed computation experience is usually helpful. It doesn't have to be Hadoop. You can be a Kafka expert or you can be a Scalding expert and still be very successful at a company that does other types of distributed computation. Okay, and how do you evaluate whether you're doing a good job and
Starting point is 00:24:04 you're helping the business enough? Clearly when things aren't burning down, that's great. But what kind of quality metrics are we shipping enough? How do you measure yourself? We have a couple of mechanisms to make sure that we're keeping ourselves honest. We take a look at the runtimes of jobs themselves. We want to make sure that even though data is always growing, we still want to maintain a neutral, if not lower runtime over time. That means we keep on making optimizations on the frameworks that we're providing. We keep on providing useful helper methods so that you can avoid processing more data than you need to. We introduce things like iceberg data format. So
Starting point is 00:24:37 there's more optimized reading and writing. Ultimately, we're trying to make sure that we get to the results as fast as possible, as cheaply as possible. The other thing is, are we addressing business needs? We have check-ins with all of the user teams at least every quarter, if not more frequently, where we really listen to what their business needs are and see if there's any ways that we can introduce new systems that address their need. For example, Stripe Capital, they were looking at potentially Spark ML, as well as data science was interested in. This will be a framework that we potentially would introduce to
serve that need. We will check in periodically to see if what we're doing is fast enough and it solves their problems for them. Other metrics include false failure rates. We want to make sure that those get reduced. We want to count how many validation errors we hit. We want to reduce the false positives there and also just reduce the number of validation errors in general.
Starting point is 00:25:26 Stripe is very focused on metrics. We have a lot of KPIs. As an org, we actually have a dashboard of 10 or 20 different metrics that we track internally to make sure we're doing our job. Those are kind of the things that we keep self-honest with. But again, there's a lot of different things that are coming in that are not really well captured by metrics, for example, security or data locality. Yeah, it sounds like it's one of the hardest technical challenges trying to deal with all of this data and like all these different constraints and making sure you're doing your job right. It's not like a pretty hard technical challenge.
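As a small sketch of the bookkeeping behind KPIs like job runtime and failure rate (the metrics client, names, and backend here are placeholders, not Stripe's setup):

```python
# Hypothetical per-job bookkeeping: record runtime and success/failure so that
# runtimes and failure rates can be tracked on a dashboard over time.
import time
from statsd import StatsClient  # assumes a statsd-compatible metrics backend

metrics = StatsClient(host="localhost", port=8125, prefix="data_platform")

def run_tracked(job_name, job_fn):
    start = time.time()
    try:
        result = job_fn()
        metrics.incr(f"{job_name}.success")
        return result
    except Exception:
        metrics.incr(f"{job_name}.failure")
        raise
    finally:
        metrics.timing(f"{job_name}.runtime_ms", int((time.time() - start) * 1000))
```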
Starting point is 00:25:53 Yeah, that's the fun part. That's the part about Stripe is it enables so much stuff, right? Like I really like the mission of the company, which is like, I think improving the future of the GDP of the internet, like enabling or increasing the GDP of the internet. I think that's such a great mission statement. Yeah, and there's so much of, so many technical challenges behind that to get all of that working.
Starting point is 00:26:14 Yeah, definitely. So then maybe you can explain to me as like a layman, like why is Snowflake so popular? And like, what did it solve in the industry that other people were not doing? This is something that I feel strongly about, which is SQLfication of everything. If you think about initially what big data looks like, it's a MapReduce job. So you write jobs in Java.
Starting point is 00:26:35 They're really verbose. You have to write your own serialization, deserialization code. It's just extremely complex. You need to know a lot before you can even be productive and write anything that does any transformation. Then comes Spark, for example. Spark initially will also, you can only write in JVM languages, so Scala or Java, and you also need to write a lot, but they have a lot more helper functions. And then they introduced Spark DataFrames, which is more robust, a little bit more Rails on it. They kind of modeled the Python DataFrames. And then there's Spark SQL. And out
Starting point is 00:27:04 of all of these, guess which one has the most excitement and increased adoption trends? It's the one that's easiest for folks to use. Lowering the barrier to entry to big data has been a long-term trend. Hive is very popular for this reason. And then Snowflake, Redshift, and technologies like this are just one and the same, right? You can do massive data computation with SQL using familiar interfaces and familiar tools. That's why it's super powerful. I personally have never worked with
Starting point is 00:27:31 Snowflake myself, so I can't go too deep here, but my understanding is there's a lot of advantages to the technology itself. Compute and storage is also separated. I think initially Redshift, now they also separate, but in the beginning it was not separate. If the hardware is only that big and you can't separate compute and storage, there's only so. But in the beginning, it was not separate. If the hardware is only that big and you can't separate compute and storage, there's only so much data you can put on it. And that's not the case with Snowflake. There's also locking associated with Redshift.
Starting point is 00:27:52 You're on AWS forever and it's pretty expensive, whereas you don't have that issue. I am bullish on any technology that comes out with a SQL solution that's really easy to use, very difficult to mess up and can crunch large amounts of data. I'm bullish on any technology there. So Snowflake is just an example of that. Okay, that makes sense. And it's more about just the ease of use and maybe like the speed or
Starting point is 00:28:14 something like that might be fine. Yeah, ease of use and the possibility of your crunching large amounts of data. You can't do the same on a Postgres database. Postgres is also easy to use, but you don't have the same capacity. That makes sense. And so you mentioned this earlier, but what do you think are some startups that are missing in this space? Something that still needs to be done. You said verification is one thing. I'll just start from the top in terms of general trends. Not necessarily all of them are startups. Some of them are just further improvement of the open source repos that we have right now. In terms of batch distributed computation, I hinted at this, but we're seeing a lot of database
Starting point is 00:28:49 optimizations being borrowed and then transformed for distributed computation. A lot of the stuff that we learned a few decades ago that we applied to databases, we're trying to reapply them again to this field. For example, columnar format, column stats, table metadata, SQL query optimization, all of these have arrived for big data. Spark SQL and Spark DataFrame have really great optimization engines. There's another thing that is kind of taking the big data world by storm. It's table formats. Before data is just data.
It can be anything, really. But table formats bring a little bit more structure to this. For example, the one that we use at Stripe is called Apache Iceberg. You can think of it as a wrapper around Parquet data. It has some extra metadata around it to describe the data that it is referring to. And that is very powerful. That allows for things like schema evolution or partition evolution or table versioning, and also optimized performance, because you have data like min-max fields. I think that's where the industry is going.
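To make that concrete, here is a small Iceberg sketch via Spark SQL. It assumes an Iceberg runtime on the classpath, a catalog named `lake`, and Iceberg's SQL extensions; the table and columns are invented for illustration.

```python
# Hypothetical Iceberg usage through Spark SQL: create a table with a partition
# transform, then evolve its schema and partitioning without rewriting data.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-lake/warehouse")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.payments.charges (
        charge_id    STRING,
        merchant_id  STRING,
        amount_cents BIGINT,
        created_at   TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(created_at))
""")

# Schema evolution: adding a column is a metadata-only change.
spark.sql("ALTER TABLE lake.payments.charges ADD COLUMNS (currency STRING)")

# Partition evolution: new data uses the new spec, old files stay valid.
spark.sql("ALTER TABLE lake.payments.charges ADD PARTITION FIELD merchant_id")
```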
Some other trends that I see are privacy and security. I think there's just been a lot of demand on data teams. At every company, we're facing similar problems: how do we make sure that data is tracked and the lineage is clear? How do we make sure that PII data is kept away and stored securely? How do we tokenize different data? There are a lot of similar problems at all these companies, and data catalogs, data lineage products, and data governance products are becoming much more interesting and necessary. LinkedIn released DataHub, a data catalog product, and Lyft also released Amundsen. A lot of companies have released their own open source thing, but there's no one company that's stood out and said, we're going to put a stake in this field and we're going to take over
Starting point is 00:30:32 the marketplace. I still think it's a right place to explore. I personally would really like to see kind of a GUI pipeline scheduler. Airflow is definitely a step in the right direction. It has a very featureful UI. You can do a lot of stuff on it. And that's one of the reasons it's so popular. But I think it could be one step further. The definitions of these DAGs and the jobs themselves potentially could be a drag and drop. There's no reason why you definitely need to write Python.
So I would like to see something in that direction. Adjacent to that is data pipeline observability and operability. I think Airflow offers a little bit of this, but it's really hard. Like you mentioned, what if a job fails in the middle of the night? Who gets paged? Can it be smarter about who's getting paged? If this pipeline fails, can there be an estimate about the final job and what the delay in landing time is? Which validation failed? How did the schema change from yesterday to today? All of these things could be better captured with better tooling. Right now, it's very ad hoc.
You go in and look at the S3 bucket and you see all the data that was like this. So I think there's a lot of potential there as well. The way you're describing it, it's like the 2000s or at least the 2010s, when there weren't as many options like Prometheus and Datadog and all of these things which are now available for production but seem like they're still not available for data infrastructure. Exactly, that's exactly a pain point. And just the idea of using drag-and-drop UIs instead of Python-described schemas makes so much sense. But then you get the
advantage of version control and all of that with Python. But still, I can totally see it should be really easy for someone who maybe doesn't understand how to write code at all, who only knows how to use SQL, to be able to configure and write a data job. Is that possible today? Is there someone who only knows how to run SQL or write SQL, can they write a data job at Stripe today? Or do you think the tools just aren't there yet? Currently at Stripe, in order to write a proper pipeline, it's not really possible.
You can write kind of pseudo pipelines based on Presto or Redshift. But in terms of the robust capabilities that we offer, most of it is only through code. We are moving in that direction though. We offer a framework within Stripe that allows folks to write pure SQL, and that SQL gets translated into a Spark job. So you don't have to write Scala at all, but you need to write a little bit of Python in order to schedule it. So that still is there.
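Hypothetically, a "pure SQL plus a little Python to schedule it" pipeline might look like the sketch below; the file layout, wrapper command, and names are illustrative, not Stripe's internal framework.

```python
# Hypothetical setup: an analyst-authored SQL file plus a small Airflow DAG
# that schedules it. A thin wrapper (here just the spark-sql CLI) runs the SQL
# as a Spark job; nothing has to be written in Scala.
#
# /pipelines/daily_fees.sql:
#   INSERT OVERWRITE TABLE reporting.daily_fees
#   SELECT c.merchant_id,
#          CAST(SUM(c.amount_cents * p.fee_rate) AS BIGINT) AS total_fee_cents
#   FROM charges c
#   JOIN fee_plans p ON c.merchant_id = p.merchant_id
#   GROUP BY c.merchant_id;

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_fees_sql",
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_daily_fees_sql",
        bash_command="spark-sql -f /pipelines/daily_fees.sql",
    )
```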
The next step is if we can even get rid of that little bit of Python and have a GUI-based solution; that could be much more powerful. A lot of companies do have solutions like this, though. The other companies that have Hive primarily, they also write pipelines with Hive, and that's the pure SQL solution. So I'm pretty sure there is a UI solution at other companies. I think Facebook has a solution that's UI-based. Uber also has a solution that is based on Airflow, but they put a custom interface on top that is much more UI-focused. Everybody's kind of moving in the same direction at these companies. It's just that there is no unified solution yet. Yeah. Basically make it easier to operate on really large-scale data and remove as many barriers to entry as possible. Yeah. Taking a step back a bit, what got you interested
Starting point is 00:33:42 in this space? Like traditionally, like at least I hadn't heard of data munging and all of that at even university. And I hadn't heard it until like I'd been working for like a year or two years. And I had to mess with Hive and try to understand how our user was doing something. And this was like internal users. And that's when I got interested in making like pretty dashboards and all that at max. What got you interested in, you know, working on this space? I think I just got lucky in my previous job before Stripe. I was at a smaller company. We were acquired by a larger company, but we stayed nuclear within the big company. And it was small enough where I got a chance to work on basically everything. I started on full stack and then I moved towards the backend and then towards infrastructure. And our infrastructure team comprises both of just normal infrastructure, like our edge servers, because it's ad tech. And then we also have the data infrastructure side as well. I've just found myself really gravitating towards the data infrastructure side of things. The fact that it's so powerful was really astounding. Once you've seen the magic of
Starting point is 00:34:41 a big data job work, there is something magical about it. The other thing is distributed computation is fundamentally the second order kind of software problem. You're not dealing with one server and how it's breaking. You need to deal with like a thousand servers and like how they work in concert and how they talk to each other. It was interesting in terms of a technical challenge. So I found myself gravitating towards it. The timing was a little bit lucky.
Starting point is 00:35:04 I started putting a lot more time into Spark around 2015, 2016. That was when Spark became more stable from 1.3, 1.6, and then 2.0. I started getting a massive adoption as well. There was a wave of interest and I was riding that wave. I went to Spark summits and you meet all these people that are super excited about this field as well. So I kind of got swept up in the storm that happened in that time and just kept on working in this field. And I still think intellectually, you know, challenge wise, it's one of the best places to be. It's endlessly interesting. It's moving very fast. That's really cool. And like, what's your suggestion on somebody who's trying
Starting point is 00:35:39 to break into the field right now? Like if they're just in university, let's say, or maybe they have a few years of experience, but they find these problems really fascinating. How would somebody transition into this or how would somebody start working on data infra or become a data engineer? Most companies these days do have big data capabilities. Try to get yourself opportunities to work
Starting point is 00:36:01 using the data infrastructure, even if you're not working on the data infrastructure itself. If there's a chance to write a pipeline for your team, and usually there is, your team probably needs metrics. So if an opportunity arises, take that opportunity and run with it. Talk with the data infrastructure folks at your company and learn what their day to day is. See if there's anything you can help with. If you see that the way of doing this is kind of repetitive and there's a helper function that you can write on the infrastructure side, that could help a lot of other people try to do that. Writing libraries is a huge part of what data infrastructure does. We just abstract away things so that our users have an easier time in writing code.
Starting point is 00:36:34 Just being exposed to the areas is going to be important. What I often see instead is when the opportunity arises to jump on these data challenges that are kind of separate from your work, a lot of folks just shrug away. This is not related to what I do. What I do is Ruby. I do not want to learn Scala. This is totally different. I think that's kind of doing yourself a disservice. If you don't even try it, how do you know?
Starting point is 00:36:54 And also, you want to bet on your future. You don't know what's going to be hot in the future. You just know Ruby. I don't know. It might be good to expose yourself to a few more types of technology. Just take the opportunities as they come and just go forward with it and try different things. Cool. Very cool. Yeah. You know, you've answered a lot of questions on like big data and everything, and I don't think I have any more questions for you. Awesome. Maybe I'll leave with a couple of common pitfalls that I've seen when people spin
Starting point is 00:37:21 up their data infrastructure. I would say the number one thing is people trying to do bespoke solutions. They implement something on their own. I need a scheduler. Let me implement something on my own. It's going to be super cool. It's going to be cool and it will serve your need in the beginning, but maintaining it and adapting it to your growing business need is going to be a lot harder than if you had an army of open source people working on the same problem. I would say choose popular, widely used frameworks. See how many stars and how many stack overflow questions people have answered. Just go with the most boring, common solution.
Starting point is 00:37:53 That's usually a good call. The other thing is probably if you can afford it, maybe consider using out-of-the-box solutions on the cloud provider. Use EMR on AWS instead of spinning up your own Hadoop cluster. That probably is worth it. At some point, you can probably move it in-house. It's not that hard. You do want to make sure that you're using a technology that is going to be replaceable.
Starting point is 00:38:12 GCP data flow is much harder to bring in-house than an EMR on AWS. So definitely do your own research. The other thing I would call out, and this is much harder to do, so only do it if you can, but try to make your data immutable, especially the data writing to big data sources that you're going to use downstream, provide a separate path for your offline data than from your online data. They don't need to be exactly the same.
Starting point is 00:38:34 The data needs to be the same, but the format does not need to be exactly the same. So try to make it immutable as possible. Last but not least, try to have a single source of truth for your data. You don't want to have a situation where the source of truth is separated in different places. There's some tables that are most accurate on Redshift.
Starting point is 00:38:51 There's some tables that are most accurate in S3. There's some tables that are most accurate in these archives, but not these other archives. Try to standardize and make sure that there is a very easy to use a multipurpose data lake instead of trying to federate different sources together in the future. That's just such a pain. Yeah, so like S3 should be like the source of truth and has the most accurate stuff always.
Starting point is 00:39:12 And everybody in the org should basically be aware of something like that. Is that a good way to summarize that? Okay. And that brings me to like another question, which you mentioned was like, you know, use AWS EMR or use these tools instead of trying to like host it open source at least until the cost becomes prohibitive do you know what the markup is and
at what point you should think about, you know, I shouldn't use Redshift and I should try hosting my own Presto, for example? At what point do you think it makes sense to start thinking about that, and how much is the markup generally? I'm sure you all have done cost evaluations and these things. I don't recall exactly what the markup was. Just like an order of magnitude, like how much? It's less than 10% of your entire spend. Yeah. On the cluster, probably less than that, but don't quote me on this. We haven't looked at this in a while. When people move onto their own clusters, usually it's more of a business need. Like they need to install some custom
software that they cannot do with AWS, or they need to sync the data into some bizarre place that EMR just does not accommodate, and the savings is another justification for moving it off. On the flip side, it does depend on how much resources you have. If you have the engineers, you could possibly do this, and by all means evaluate. But a lot of companies might not even get to that point where they can afford to hire specialized data infrastructure engineers to spin this up. So it's more like a feature gap, rather than a cost-saving measure, that's why you would consider moving in-house.
That's pretty interesting. Okay, that makes sense. Yeah, thank you. I learned a lot of stuff about big data and managing big data in general. I'm just spilling all of it out. Yeah, and I think it's going to be great to have this recording, and these people are going to be familiar with how much work there is in managing all of this infrastructure. Yeah, cool. Thanks for having me again.
Starting point is 00:40:55 Yeah, thank you. All right. Bye.
