Software at Scale 60 - Data Platforms with Aravind Suresh

Episode Date: August 5, 2024

Aravind was a Staff Software Engineer at Uber, and currently works at OpenAI.

Edited Transcript

Can you tell us about the scale of data Uber was dealing with when you joined in 2018, and how it evolved?

When I joined Uber in mid-2018, we were handling a few petabytes of data. The company was going through a significant scaling journey, both in terms of launching in new cities and the corresponding increase in data volume. By the time I left, our data had grown to over an exabyte. To put it in perspective, the amount of data grew by a factor of about 20 in just a three-to-four-year period.

Currently, Uber ingests roughly a petabyte of data daily. This includes some replication, but it's still an enormous amount. About 60-70% of this is raw data, coming directly from online systems or message buses. The rest is derived data sets and model data sets built on top of the raw data.

That's an incredible amount of data. What kinds of insights and decisions does this enable for Uber?

This scale of data enables a wide range of complex analytics and data-driven decisions. For instance, we can analyze how many concurrent trips we're handling throughout the year globally. This is crucial for determining how many workers and CPUs we need running at any given time to serve trips worldwide.

We can also identify trends like the fastest-growing cities or seasonal patterns in traffic. The vast amount of historical data allows us to make more accurate predictions and spot long-term trends that might not be visible in shorter time frames.

Another key use is identifying anomalous user patterns. For example, we can detect potentially fraudulent activities like a single user account logging in from multiple locations across the globe. We can also analyze user behavior patterns, such as which cities have higher rates of trip cancellations compared to completed trips.

These insights don't just inform day-to-day operations; they can lead to key product decisions. For instance, by plotting heat maps of trip coordinates over a year, we could see overlapping patterns that eventually led to the concept of Uber Pool.

How does Uber manage real-time versus batch data processing, and what are the trade-offs?

We use both offline (batch) and online (real-time) data processing systems, each optimized for different use cases. For real-time analytics, we use tools like Apache Pinot. These systems are optimized for low latency and quick response times, which is crucial for certain applications.

For example, our restaurant manager system uses Pinot to provide near-real-time insights. Data flows from the serving stack to Kafka, then to Pinot, where it can be queried quickly. This allows for rapid decision-making based on very recent data.

On the other hand, our offline flow uses the Hadoop stack for batch processing. This is where we store and process the bulk of our historical data. It's optimized for throughput: processing large amounts of data over time.

The trade-off is that real-time systems are generally 10 to 100 times more expensive than batch systems. They require careful tuning of indexes and partitioning to work efficiently. However, they enable us to answer queries in milliseconds or seconds, whereas batch jobs might take minutes or hours.

The choice between batch and real-time depends on the specific use case. We always ask ourselves: Does this really need to be real-time, or can it be done in batch?
The answer to this question goes a long way in deciding which approach to use and in building maintainable systems.

What challenges come with maintaining such large-scale data systems, especially as they mature?

As data systems mature, we face a range of challenges beyond just handling the growing volume of data. One major challenge is the need for additional tools and systems to manage the complexity.

For instance, we needed to build tools for data discovery. When you have thousands of tables and hundreds of users, you need a way for people to find the right data for their needs. We built a tool called DataBook at Uber to solve this problem.

Governance and compliance are also huge challenges. When you're dealing with sensitive customer data, you need robust systems to enforce data retention policies and handle data deletion requests. This is particularly challenging in a distributed system where data might be replicated across multiple tables and derived data sets.

We built an in-house lineage system to track which workloads derive from what data. This is crucial for tasks like deleting specific data across the entire system. It's not just about deleting from one table; you need to track down and update all derived data sets as well.

Data deletion itself is a complex process. Because most files in the batch world are kept immutable for efficiency, deleting data often means rewriting entire files. We have to batch these operations and perform them carefully to maintain system performance.

Cost optimization is an ongoing challenge. We're constantly looking for ways to make our systems more efficient, whether that's by optimizing our storage formats, improving our query performance, or finding better ways to manage our compute resources.

How do you see the future of data infrastructure evolving, especially with recent AI advancements?

The rise of AI and particularly generative AI is opening up new dimensions in data infrastructure. One area we're seeing a lot of activity in is vector databases and semantic search capabilities. Traditional keyword-based search is being supplemented or replaced by embedding-based semantic search, which requires new types of databases and indexing strategies.

We're also seeing increased demand for real-time processing. As AI models become more integrated into production systems, there's a need to handle more GPUs in the serving flow, which presents its own set of challenges.

Another interesting trend is the convergence of traditional data analytics with AI workloads. We're starting to see use cases where people want to perform complex queries that involve both structured data analytics and AI model inference.

Overall, I think we're moving towards more integrated, real-time, and AI-aware data infrastructure. The challenge will be balancing the need for advanced capabilities with concerns around cost, efficiency, and maintainability.
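To make the near-real-time flow described above (serving stack to Kafka to Pinot, as in the restaurant manager example) concrete, here is a minimal sketch of how an order event might travel through that path. The topic name, table name, columns, and broker address are hypothetical; it assumes the kafka-python client and Pinot's SQL query REST endpoint, and is an illustration of the flow rather than Uber's actual implementation.

```python
import json
import time

import requests  # for Pinot's HTTP query endpoint
from kafka import KafkaProducer  # assumes the kafka-python package

# 1. The serving stack emits an event to a message bus (hypothetical topic).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("restaurant_orders", {
    "restaurant_id": 42,
    "order_total_usd": 23.50,
    "event_time_ms": int(time.time() * 1000),
})
producer.flush()

# 2. A Pinot REALTIME table consumes the topic; moments later the event is
#    queryable with low latency (table name and broker address hypothetical).
resp = requests.post(
    "http://pinot-broker:8099/query/sql",
    json={"sql": """
        SELECT restaurant_id, COUNT(*) AS orders, SUM(order_total_usd) AS revenue
        FROM restaurant_orders
        WHERE event_time_ms > ago('PT1H')
        GROUP BY restaurant_id
        ORDER BY revenue DESC
        LIMIT 10
    """},
)
print(resp.json())
```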

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Joining us today is Aravind Suresh, ex-staff software engineer on data infra at Uber. Now you're working on OpenAI stuff. Again, thank you so much for being on the show. Yeah, thanks for hosting me, Utsav. It's fun to be here. Yeah. So I'd love to start off with your story. How did you get into all things computer science? How did you get into data infrastructure stuff? I'd just love to learn more. Sure, yeah. So as you already mentioned, I'm currently working at OpenAI, but most of my
Starting point is 00:00:43 background and experience is in data infra systems. I graduated from the Indian Institute of Technology, Madras, and I joined Uber right after that. I got an offer from Uber to join the data platform team, and this team ended up building the infrastructure that enabled exabyte-scale analytics at Uber. I spent quite some time, around five years, in this team,
Starting point is 00:01:06 and I got an opportunity to work on like various verticals of the stack, be it like batch, real-time, something related to failover. And there was a lot of focus on cost efficiency as well. So I got an opportunity to work on all of these things.
Starting point is 00:01:20 And after this, I joined OpenAI. Amazing. And you joined Uber in 2018, so just describe a little bit about where the data stack was and what the company was like at that time. Sure, yeah. I joined mid-2018. Uber was going through a scaling journey at that time, both in terms of launching in new cities, which indirectly means the amount of data that comes into the system, the users, the number of users we have,
Starting point is 00:01:50 all of that was going through a lot of growth. So at that point, Uber had a lot of on-prem footprint, so most of the infra systems were managed in-house, and we had a pretty big infra team to handle all of this. So there was some amount of solid infra tooling and footprint there. More or less, the different parts of the stack, like batch, all of those pieces were already there at that point. There was more focus and more work on the real-time stuff; that kind of happened after that. But more or less, the bare minimum batch analytics stack was already there.
Starting point is 00:02:28 So that's how I would put it. Yeah. And if you could explain, even at a high level, what was the scale of the data being managed or processed at that time? Sure. Yeah. So I think I vaguely remember this number. The amount of data grew by a factor of 20 or something
Starting point is 00:02:48 in that three-to-four-year period from 2019. And when I moved out, it was more than an exabyte. So that's kind of the scale. It was starting at around a few petabytes somewhere around 2018, I would say, but then it ended up going to 100 petabytes soon enough. Okay, amazing. And this sounds like it was analytics data at rest. How about, you know, the amount of data being read or written daily? I'd love to get some of that context. Sure. Yeah, it's a bit of both. Those numbers are at rest, but the rate at which the
Starting point is 00:03:31 incoming data was growing also mattered. In addition to that, there were also a bunch of use cases which wanted a higher retention, which means you ended up storing more data than what you would traditionally expect, so that would also increase your at-rest size. But the rate at which your incoming data was growing was also higher. I don't remember the exact number from maybe five years back, but right now I would put it somewhere close to a petabyte, or maybe even more, on a daily basis that gets ingested. Of course, this includes some kind of replication and stuff,
Starting point is 00:04:07 but I would put it somewhere over there on a daily basis. Incredible. So a petabyte of data being ingested every day, even if the number is like half of that or like a tenth of that, that's just enormous. I mean, what goes into that kind of data? Why is there so much data being ingested every day? So Uber calls the core demand-supply
Starting point is 00:04:26 matching system the marketplace. There is a lot of complex logic that goes into that. One way or the other, you find use cases to log data, you have use cases to process the data on top of it, and maybe flush it down to tables and things like that. So all of this kind of forms a critical part of that one petabyte. Roughly 60-70% of that would be raw. One important thing to note: the data infrastructure itself doesn't create any data. This data comes from somewhere outside, which could be online systems or message buses or something like that. And all of those things which are, you know, directly ingested from these external systems are called raw, right?
Starting point is 00:05:11 On top of that, you have some derived data sets, model data sets, and whatnot. Around 60-70% is raw, which means we do log a lot. Makes sense. And if you just had to estimate, how much of this data is actually used, ultimately, by, like, an analyst or someone in the stack? Yeah, so the answer could vary depending on which year we target, because there were times where we found out that some amount of data was maybe not even queried. There were some datasets which we ended up logging more than what we required, and there were some efficiency opportunities on top of it.
Starting point is 00:05:49 But I think most of those problems have more or less been solved recently. So what we log today, more or less the entire thing gets used in one form or the other. You might not have the entire raw data sets being used on a daily basis, but one way or the other, the insights from those raw data sets help you make data-driven decisions on a daily basis. Yeah, that makes sense. So it sounds like there was some kind of cost optimization effort, which would have culled logs that were not used.
Starting point is 00:06:19 And even after that, probably, like you're saying, 60 to 70% of a petabyte a day. And that's just enormous scale. Yes. Right. So why does Uber need to log so much data? That's the main question, right? If you could tell us some interesting insights or some data-driven decisions that would have been made through this, whether it's a one-time thing or a system based
Starting point is 00:06:42 on this data. Sure. Yeah, that makes sense. Ultimately, like every company, every product, when it starts, it would start with just maybe a single Postgres or a NoSQL or a NewSQL, some kind of an online database, right?
Starting point is 00:06:55 Where you store your core entities. For Uber, the core entities are mainly trips, users, earner accounts, and things like that, and maybe city information. But you will soon reach a point where that particular database won't fit your use case. You'll end up having a single giant Postgres that you won't be able to run analytics queries on, and things like that.
Starting point is 00:07:17 So this is when all of this data stack stuff got started; it helps you answer a lot of these kinds of questions. Let's say you end up having some kind of question like: how many concurrent trips are we handling throughout the year? Concurrent trips is one of the important metrics for Uber, because that kind of determines how many workers, how many CPUs you actually need up and running at this particular point in time
Starting point is 00:07:46 so that you can actually serve trips throughout the globe, right? Now, it would be hard to serve such a query if you want to analyze maybe the last two years' worth of data; you won't be able to do that with an online database. This is where the whole use of the data stack comes in.
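As an illustration of the kind of query an online database would struggle with, here is a hedged sketch of computing peak concurrent trips from a historical trips table with Spark SQL. The table and column names (trips, begin_ts, end_ts) are hypothetical; the idea is to turn each trip into a +1/-1 event and take a running sum over two years of data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("concurrent-trips").getOrCreate()

# Hypothetical raw table: one row per trip, with start/end timestamps.
spark.sql("""
    WITH events AS (
        SELECT begin_ts AS ts, +1 AS delta FROM trips
        WHERE begin_ts >= current_timestamp() - INTERVAL 2 YEARS
        UNION ALL
        SELECT end_ts AS ts, -1 AS delta FROM trips
        WHERE begin_ts >= current_timestamp() - INTERVAL 2 YEARS
    )
    SELECT max(running) AS peak_concurrent_trips
    FROM (
        -- A global ordered window is expensive, which is exactly why this
        -- belongs in a throughput-oriented batch job, not an online database.
        SELECT sum(delta) OVER (ORDER BY ts
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
               AS running
        FROM events
    ) AS t
""").show()
```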
Starting point is 00:08:09 Another example is, let's say: which is the fastest-growing city in the world? One important thing to note: as with a lot of other companies and products out there, Uber's traffic is also seasonal, which means you will end up getting more traffic during the holidays and things like that. This means that just to answer these kinds of questions, looking at the last one or two months won't be enough. You'll have to look at maybe the last three years.
Starting point is 00:08:26 The more data you have, the more accurate your statistics will be. Another example of where this data would help us is identifying anomalous user patterns. Let's say you were looking for some insight on a single user account: do you see some fraudulent or anomalous activity where that particular account is logged in from across the globe? Or, let's say, which city has users who end up canceling a lot more trips than they actually take, right? These might not just be one-off decisions; they can even drive key product decisions. Let's say you end up looking at the last one year's worth of data and you plot a heat map
Starting point is 00:09:05 of all the coordinates in which trips were taken. You will obviously see some kind of overlap in the plots, right? Which is exactly how, I guess, Uber Pool was maybe started, right? Because you end up seeing some kind of patterns in the data after you crunch all these numbers together. I'm not sure if that answered your question, but these are some of the key decisions.
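For the anomaly example, here is a hedged sketch of the kind of batch query that might flag accounts logging in from implausibly many cities in a day. The login_events table, its columns, and the threshold are all hypothetical and only illustrate the shape of such a query.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("anomalous-logins").getOrCreate()

# Hypothetical raw login events; flag accounts seen in many cities per day.
spark.sql("""
    SELECT user_id,
           to_date(login_ts)       AS day,
           COUNT(DISTINCT city_id) AS distinct_cities
    FROM login_events
    GROUP BY user_id, to_date(login_ts)
    HAVING COUNT(DISTINCT city_id) >= 5  -- arbitrary threshold for review
""").show()
```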
Starting point is 00:09:25 No, no, it certainly helps. It helps you understand that a lot of the product can be shaped just by the data that you ingest. Exactly. And that naturally leads to a follow-up, which is: how many services are querying the data stack in production? Are they reading from the live OLTP databases,
Starting point is 00:09:44 or are they actually querying from the data stack? Yeah, so that's a very good question. The right answer is: it depends, but I can go into details. One is, anything where you need some kind of correctness, where you want consistent behavior across multiple entities, you would obviously use some kind of an OLTP system, which is what most of the trips serving stack will end up using one way or the other. But there are some cases where it's okay for some amount of delay to be there. And it also depends on what kind of system you're building. Let's say you're building some kind of a real-time system: you want the data that's being produced,
Starting point is 00:10:26 let's say, within like 10 seconds or 20 seconds. If you have a lot of numbers to crunch and you are okay with an eventually consistent query shape, then you will end up using some kind of an OLAP system. An example is, let's say, restaurant manager, where we end up using something called Pinot, which is an OLAP system. And the idea here is that data eventually makes it into Pinot
Starting point is 00:10:52 through multiple hops. One is, your serving stack will maybe push it into a message bus like Kafka, and data from Kafka will end up going into Pinot, where you'll end up querying it. So this is one flow where the data stack is used in the production flow. There is another flow where, let's say, the data from Kafka ends up going into your data lake one way or the other, where you end up ingesting this data. So Uber adopted the Hadoop stack; of course, different companies that adopt more cloud offerings would have a different choice.
Starting point is 00:11:24 But ultimately, you will have some kind of a storage system, some kind of a distributed file system, to store all of this data and stuff. So data from Kafka ends up going into that file system. You will maybe create tables out of it, and 60% or 70% of that storage will be raw. And you'll have some derived tables out of it, which people would analyze and make decisions on going forward, right? So these are basically two different flows.
Starting point is 00:11:49 One is the offline flow, the other is the online flow. The online flow is what I mentioned about Pinot. The offline flow is basically what we just spoke about: you can make intelligent decisions by just crunching as much data as you have. And so it sounds like, for certain non-critical use cases where it's okay to have eventual consistency, you end up having data that goes through multiple systems or multiple hops.
Starting point is 00:12:29 And for use cases that are real-time, where you need some kind of analytics to be done, Pinot is something that's a good fit. But in some cases where you don't need that kind of freshness or those guarantees, you'll end up maybe using the offline flow.
Starting point is 00:12:57 The choice between batch and real-time, or let's say offline versus online, is something I've learned a lot about through the process, by asking very simple questions like: do you need the system to be real-time? Can you not do it in a batch way, right? Asking such questions of yourself goes a long way when deciding which is better, and systems built the right way would obviously be more maintainable over time. Yeah, and that brings me to, you know, the cost to serve, right? There's definitely a trade-off. I'm guessing the real-time systems are probably significantly more expensive. What does that difference look like? Like, if I had to run the same set of queries on the real-time system versus what's running on the batch system today, how much more expensive would it be? Like 10x or 100x? How do I know? Yeah. So the thing is,
Starting point is 00:13:31 you're right, but it's not really a question of doing everything that's done in the batch world in real time; these are completely different systems, right? Batch systems tend to be optimized more for throughput, the amount of data you can process per unit time, right? Whereas real-time systems are more or less optimized for latency, or response times, right? So they tend to make critical design trade-offs based on that. A batch system would end up reading most of this data, when it requires it, into some kind of a compute layer, crunching numbers and things like that. Whereas real-time systems end up co-locating some of your storage with compute, and there are only specific types of query shapes you can make along with that. If you are familiar with OLTP systems, think of this: let's say you end up querying a table that
Starting point is 00:14:16 has a billion rows without having any indexes on it. You have the same problem: you cannot crunch billions of rows in real time, right? So with most of these OLAP systems, you'll need to carefully tune them: carefully tune what kind of indexes are there, carefully tune partitioning, so that they work in real time, right? Of course, if you want me to put a number on it, I would say it's somewhere between 10 to 100x, depending on the type of data you have and the amount of data you get on a daily basis as well.
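To give a flavor of the tuning involved, here is a hedged sketch of a Pinot table config declaring an inverted index and a partition column. The table, columns, and Kafka details are hypothetical, and the field names follow my recollection of Pinot's table spec, so treat them as approximate rather than authoritative.

```python
import json

# Hypothetical Pinot REALTIME table config: the inverted index makes
# restaurant_id lookups cheap, and partitioning by restaurant_id lets the
# broker prune segments instead of scanning everything.
table_config = {
    "tableName": "restaurant_orders",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "timeColumnName": "event_time_ms",
        "replication": "2",
    },
    "tableIndexConfig": {
        "invertedIndexColumns": ["restaurant_id"],
        "sortedColumn": ["event_time_ms"],
        "segmentPartitionConfig": {
            "columnPartitionMap": {
                "restaurant_id": {"functionName": "Murmur", "numPartitions": 16}
            }
        },
    },
    "tenants": {},
}

print(json.dumps(table_config, indent=2))
```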
Starting point is 00:14:39 Yeah, so it sounds like the cost to serve is one thing, but some kinds of queries might just not be possible at all. Exactly. Yeah. I'm immediately thinking about sharded data: if you have some kind of sharding key, trying to run a query without that sharding key is just not possible in so many systems.
Starting point is 00:15:01 Yeah. Exactly. So if you think about it the other way, with offline systems you can basically log anything you want to the event bus and store it and stuff, right? And even your query systems in the offline world, let's say you use something like Trino or Spark, end up reading more or less the entire data. There are some knobs here and there; let's say at the file format layer, you can skip reading a few blocks, a few chunks of data, and things like that.
Starting point is 00:15:29 But for those systems, it's fine to do all of that, right? Because they are optimized for throughput, you can end up doing those things; Spark is going to do all of the compute for you in a distributed fashion across a bunch of workers. But of course, that takes on the order of minutes to hours, right? Whereas the OLAP system is the other end of the spectrum, where you want everything to run within milliseconds, or worst case, seconds.
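As a rough sketch of the offline flow just described: a batch job reads a raw, partitioned table, lets the file format skip irrelevant blocks, and writes out a derived table. Paths, table layout, and column names are hypothetical; this assumes PySpark over Parquet, where the partition filter prunes whole directories and the status predicate is pushed down to skip row groups.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-to-derived").getOrCreate()

# Raw events landed from Kafka, partitioned by date (hypothetical path/layout).
raw = spark.read.parquet("hdfs:///data/raw/trip_events")

cutoff = F.date_format(F.date_sub(F.current_date(), 30), "yyyy-MM-dd")

daily_city_stats = (
    raw
    # Partition pruning: only the last 30 days of directories are read.
    .where(F.col("datestr") >= cutoff)
    # Predicate pushdown: Parquet row groups that can't match are skipped.
    .where(F.col("status") == "completed")
    .groupBy("datestr", "city_id")
    .agg(
        F.count("*").alias("completed_trips"),
        F.avg("trip_duration_s").alias("avg_duration_s"),
    )
)

# Derived table, itself partitioned for downstream consumers.
daily_city_stats.write.mode("overwrite").partitionBy("datestr") \
    .parquet("hdfs:///data/derived/daily_city_stats")
```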
Starting point is 00:15:49 Yeah, and I keep hearing about a lot of these different open-source systems: like, you know, Spark, uh, Pinot; Trino I think is the other one you mentioned. Yeah. Who builds and maintains these systems? So Spark, definitely Databricks, and that's a little more famous. But who's working on these systems? And how is their development being advanced? If you look at it, most of these systems would have been started by one company or the other, which had some kind of a large scale that they wanted to solve for. And over time, a lot of these projects end up getting open-sourced. For instance, Trino started off as Presto, which originated at Facebook. It was started, I think, by a bunch of
Starting point is 00:16:32 researchers, but it ended up being open-sourced. And then Databricks kind of maintains Spark; they frequently commit code and improve Spark and stuff. Then you have Pinot, which started off at LinkedIn, and it was open-sourced. And now there's a company called StarTree, which actually offers Pinot as a cloud offering and stuff. So more or less, a lot of these systems end up getting open-sourced at different points in time.
Starting point is 00:17:04 But most of these end up having one large-scale use case which a company tried to solve. For instance, Pinot started off at LinkedIn with "who viewed your profile," right? They want to make sure you get an answer to that, that you get notified of it, as soon as possible. And that's the real-time nature of the problem which Pinot solved. Interesting. Yeah, it's interesting to think about this workflow, which we probably deal with every day, and they had to build a brand new database to support it.
Starting point is 00:17:32 That's really cool. Yes. So how much work is it to maintain these systems within Uber? Even roughly, so they're open-source projects. I'm assuming they're being self-hosted
Starting point is 00:17:44 in many cases. Just because of the scale that you're source projects. I'm assuming they're being self-hosted in many cases, just because of the scale that y'all are dealing with. I'm sure that there's like cost considerations on like using the cloud system directly. Is Uber like contributing code and helping? Is it work to maintain? Is it like, I just have to learn, like what is that like working model?
Starting point is 00:18:00 It depends. It varies from system to system, but more or less, there are a lot of things that had to be done in-house to make these systems work. For instance, we have more or less a team for every kind of offering we use. It's also not just about making it work at scale; there will be new feature requests, new kinds of use cases that need to be adapted onto something existing, plus the organic growth which you always have to solve for.
Starting point is 00:18:26 So there is a lot of work that ends up happening along those lines. For projects where open source is active, Uber also actively tries to open-source things back. For instance, there have been a lot of contributions in the Pinot space by Uber recently as well. So it all depends on the timeline, but Uber still tries to make that happen. Yeah. No, I think it's a very exciting space. I keep seeing new projects like Apache
Starting point is 00:18:50 Iceberg and stuff come out, so it looks like there's a lot of advancement. So we talked a lot about the first phase of introducing a new data tool and using it. What ends up being the second phase, like scaling the actual platform, right? Like, we introduce a new tool, customers start using it, or services start using it. What happens next as the data volume increases?
Starting point is 00:19:13 Yeah. Yeah. So I would put it along three different buckets. One is, you have some kind of organic growth that you're trying to tackle. The second is, you kind of have some kind of feature
Starting point is 00:19:25 requests, or let's say gaps, in your system. Trust me, you'll always find some gaps one way or the other. So there's always a moving target, and that would be a way for you to evolve your data stack. So I can go into specific examples; let's leave the scaling part aside, that's something you need to solve one way or the other. So let's assume you have some version of your data stack where you have something like, let's say, Spark, and some kind of a distributed file system like HDFS, and you're able to run queries and stuff, right?
Starting point is 00:19:57 Soon you will see there are some things that you'll need in addition to that, right? Let's say you're able to run Hive queries or Spark jobs and stuff. Is that enough? What if we could have some kind of a fancy tool, some kind of a UI, where people can author these queries,
Starting point is 00:20:14 share these reports, because it's not just about running queries. There would be a lot of reports or dashboards that need to be periodically run. So you will need some kind of a BI tool for that, and it should not just allow you to run queries, but maybe share these reports with others and things like that. And of course, there are a lot of intelligent, interesting applications with LLMs and Gen AI that you can also integrate
Starting point is 00:20:35 into your stack, where you give it text and you get SQL out of it, and things like that. Secondly, you also need something similar for Spark jobs. Spark solves the problem where you can write PySpark code instead of actual SQL: you come across some use case where it is hard to represent that logic as SQL, but it's easier to represent it as code. Now, having some kind of a notebook environment where you can quickly iterate on Spark apps, that is something that will get added to your stack. Third thing: let's say you have a bunch of tables. It's only a matter of time before you end up having hundreds of users and, let's say, thousands and thousands of tables. Now you'll need a way to know what each of these tables stores.
Starting point is 00:21:22 Let's say I want to look at users' data per city. Which table should I choose? So, some kind of a discovery tool, some kind of a way for you to search for what data you actually want. We had a tool called DataBook at Uber, which ended up solving all of these needs, right? Last thing, you'll also have some kind of security
Starting point is 00:21:40 or compliance use cases, which will end up warranting something like a governance stack. I can go into an example. Let's say you end up logging a lot of customer PII data, right? Now, there will be different kinds of policies you would want to enforce on it. Let's say you cannot store this data for more than 30 days, or this data needs to be deleted across all of the tables. Now, the problem here is, it's easy to say "delete the data," but it's actually very hard to delete, because it's not just one table you're talking about. Let's say you log some sensitive data, some PII data, into a raw data table, right? Now you have
Starting point is 00:22:16 thousands of tables deriving from it. Now, how will you end up deleting it from all of them? Setting retention is simpler, but deleting it across the stack is hard. It's hard for two reasons. One, because actual deletion itself is costly and you have to go through all the layers in the stack. But the second is also around tracking this: you need a way to track all of this, right? You need a way to say which table derives from which, so you will need something like lineage. So Uber ended up building an in-house lineage system just to track which workflow derives from what. And you also need some ways to audit every action in the system, so that you know who did what and things like that.
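As a hedged sketch of why lineage matters for deletion: if you model lineage as a graph from each table to the tables derived from it, finding everything that must be rewritten for a purge is a simple breadth-first traversal. The graph shape and table names here are invented for illustration, not Uber's actual lineage model.

```python
from collections import deque

# Hypothetical lineage: table -> tables directly derived from it.
lineage = {
    "raw.trip_events": ["derived.daily_city_stats", "derived.user_sessions"],
    "derived.user_sessions": ["ml.churn_features"],
    "derived.daily_city_stats": [],
    "ml.churn_features": [],
}

def tables_to_purge(root: str) -> list[str]:
    """BFS over the lineage graph: every downstream table that may still
    hold copies of the data and therefore needs a delete/rewrite."""
    seen, queue, order = {root}, deque([root]), []
    while queue:
        table = queue.popleft()
        order.append(table)
        for child in lineage.get(table, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

print(tables_to_purge("raw.trip_events"))
# ['raw.trip_events', 'derived.daily_city_stats',
#  'derived.user_sessions', 'ml.churn_features']
```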
Starting point is 00:22:51 You will also have Kafka in the picture, but it's not just about Kafka. Your source of data can come from anything, maybe other online stores, OLTP systems as well. So you will soon realize your ingestion framework actually does a lot more things. For instance, it needs to pull in data from any source to any sink. So you will end up generifying your stack to be built that way, and Uber ended up doing that. We even have flows where we ingest from, like, Google Sheets or CSVs into the data lake, so that people can join some data with some Google Sheet which gets periodically ingested. So these are some of the ways in which I have seen the data stack evolve at Uber.
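That "any source to any sink" generification can be pictured as a small connector interface. Here is a hedged sketch with a CSV-style source and a data lake sink; the class names and methods are invented for illustration and are not Uber's ingestion framework.

```python
import csv
from typing import Iterable, Protocol


class Source(Protocol):
    def read(self) -> Iterable[dict]: ...

class Sink(Protocol):
    def write(self, records: Iterable[dict]) -> None: ...


class CsvSource:
    """Reads records from a CSV file (a stand-in for Sheets, Kafka, OLTP...)."""
    def __init__(self, path: str):
        self.path = path

    def read(self) -> Iterable[dict]:
        with open(self.path, newline="") as f:
            yield from csv.DictReader(f)


class DataLakeSink:
    """Appends records to a (hypothetical) data lake table."""
    def __init__(self, table: str):
        self.table = table

    def write(self, records: Iterable[dict]) -> None:
        for record in records:  # a real sink would buffer into file-sized batches
            print(f"writing to {self.table}: {record}")


def ingest(source: Source, sink: Sink) -> None:
    """One generic pipeline: any source to any sink."""
    sink.write(source.read())


ingest(CsvSource("ops_budget.csv"), DataLakeSink("raw.ops_budget"))
```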
Starting point is 00:23:31 Yeah, that's just incredible. You can kind of see the data maturity curve evolve: you have a tool, and then you have to worry about all these hundred things that come with the data tooling. And I'm very curious about the data deletion stuff, because we learn about how deletes are really expensive in databases, and I'm sure this is not easy. And managing lineage and stuff feels like it solves the problem one layer up, which is where you know how to do the delete. But then how do you actually perform the delete reliably? I don't know if you have any insight into that. Yeah, I can talk briefly about that. So first of all, with lineage, you know exactly what tables to touch, you're right about that. But there will be some
Starting point is 00:24:14 kind of an investment into what file format to choose, and things like that, way before your data stack reaches this point, right? So Uber uses Hudi; Uber has open-sourced Hudi as well. The TLDR version of Hudi is that it enables you to write into your data in a commit-by-commit fashion, so that you can somehow enforce reader-writer isolation, right? Because the way the system scales, you cannot have some kind of a lock which every writer acquires; that's just not practically possible, though you can afford to do that in an OLTP system. So what happens is, let's say I write to a table and you are consuming from the table. What if, when your job is running, I end up overwriting all of the data, right? So Hudi gives you some ways to handle all of that. So we ended up closely coupling a lot of these things,
Starting point is 00:25:03 and we ended up modeling some of these things as a kind of incremental write on top of Hudi. Most of this varies from use case to use case. If you are talking about deleting entire chunks of data, sure, dropping them is fine. But for deleting specific rows, there is literally no other way other than building some kind of a reverse index, which tells you, oh, this ID is stored across these files, and then rewriting those files one way or the other. So depending on how good your stack is, you will end up using different kinds of solutions to do this, but you have to do this.
Starting point is 00:25:40 Okay. And then do you perform an actual deletion itself, or do you do some kind of tombstone record, like some masking? Oh, yeah. So you can obviously do some kind of tombstone records if you just want to hide it from some other system consuming it. But again, most of these are compliance requirements, so the data has to get deleted beyond some point, so you have to rewrite those files. So let's say your initial file has, like, 100 records and you want to remove 10 of them.
Starting point is 00:26:09 There's literally no other way other than to rewrite the other 90 of them. The efficient way to do this is, you maybe maintain a record of all the IDs that need to be deleted, and you do this rewrite in a batch. That way you at least amortize it, because rewriting files isn't cheap, right? So to make
Starting point is 00:26:25 it efficient, you can maybe batch your things and do the rewrite in a batched fashion, but you have to end up rewriting them. It all comes from the fact that files are more or less kept immutable in this batch world. Even if you have new data coming in, you end up writing a new bunch of files, and you somehow make sure all the queries that hit this data end up consuming the new files instead of the old files, right? So that's kind of the model in which this whole batch world operates, which means you have to rewrite files.
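Here is a hedged sketch of that batched, copy-on-write deletion: collect the IDs queued for deletion, use a (hypothetical) reverse index to find the affected files, and rewrite each file without the deleted rows. It uses pyarrow's Parquet APIs; the file layout and index structure are invented for illustration.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# IDs queued for deletion; batching means each file is rewritten at most
# once per purge cycle rather than once per delete request.
ids_to_delete = ["user-123", "user-456"]

# Hypothetical reverse index: which immutable files contain which IDs.
reverse_index = {
    "user-123": ["part-0001.parquet", "part-0042.parquet"],
    "user-456": ["part-0042.parquet"],
}

# Union of affected files: one rewrite per file for the whole batch.
affected_files = {f for uid in ids_to_delete for f in reverse_index[uid]}

for path in affected_files:
    table = pq.read_table(path)
    # Keep only the rows whose user_id is NOT in the deletion batch.
    mask = pc.invert(pc.is_in(table["user_id"], value_set=pa.array(ids_to_delete)))
    kept = table.filter(mask)
    # Files are immutable in this world: write a new file and swap pointers
    # (in Hudi terms, commit a new file slice) instead of editing in place.
    pq.write_table(kept, path + ".rewritten")
```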
Starting point is 00:26:55 Makes a lot of sense. Yeah, that all tracks. And on the other side of the world, where I am now, there are all these startups; like, vector databases and stuff are also a very popular thing. How does that track to data infrastructure and data stacks like this? Like, are people thinking about introducing vector databases as part of this? Is that helpful at all? Curious to get your take. Sure, yeah. I think with Gen AI and AI advancements, it kind of opened up a new dimension of search, right? Like, initially it was all keyword searches and things like that. Now embedding search, semantic search, is becoming popular.
Starting point is 00:27:30 So the requirement for vector databases is growing, but I feel you would have to introduce that into your stack only when you actually have a use case that warrants it. And the details are twofold. One is: where does the data that goes into your vector database come from? If it comes from the data stack, you will obviously have to enhance your dispersal framework, or some kind of ingestion framework, to move data from one place to the other. Maybe your source will be some Kafka data or something in your data lake, and your sink will be the vector database.
Starting point is 00:28:06 That way it naturally gets added to your stack. The other is, you also need to set up some kind of a vector database offering and do some kind of benchmarking. And vector databases also, I would say, require the same kind of tuning that you would actually need to do in an OLTP or an OLAP system, right? You need to tie your query shape to your data and stuff, so that also requires some kind of tuning. So the deal here is, if you have a use case, it will
Starting point is 00:28:36 naturally get added to your stack one way or the other; the touch points where it gets added would differ. On that note, though, is there demand for using that kind of vector search in data stacks? Maybe that's the starting point, right? Maybe data analysts are asking, "I want to do this kind of embedding search." Do you see use cases for that come up? I see two different kinds of use cases coming up. One is, again, I'm back to the same offline versus online split, right? Let's say you have some kind of a large chunk of data which you would just want to analyze, to try to see what is similar and what is not, right? Let's say you have a set of text data and you have some set of image data. Now assume they are all embedded with some kind of a model that gives embeddings in the same space. So the text embeddings here
Starting point is 00:28:58 again, I'm back to the same offline versus online, right? If, let's say, you have some kind of a large chunk of data which you would just want to analyze and try to see what is similar and what is not right let's say you you have a set of text data and you have some set of image data right now assume they all are embedded with some kind of a model that that gives embeddings that kind of are in the same space so let's say you can the text embeddings here and the image embeddings there are of are in the same space. So let's say you can, the text embeddings here
Starting point is 00:29:26 and the image embeddings there are more or less going to be closer if that makes sense, right? Now, if you want to do some kind of a nearest neighbor search where you want to find the closest 10 images in this for every text, and you want to find the closest 10 text for every image, something along those lines
Starting point is 00:29:44 where you want to join a huge chunk of data. Of course, there are ways and there are good amount of Spark functions as well that enable you to do this. That is one way vector search, the abstract concept will get introduced into your data stack. Of course, there will also be the other online variant where let's say you have a lot of data that comes in where you have some kind of a query text let's say you end up having a large corpus of text data where you want to search through some kind of query text that will require more of an online system where you end up using again there are a lot of different implementations of vector search algorithms out there so depending on your scale
Starting point is 00:30:19 you could choose whichever one fits you but all you have to do is you have to pull the data maybe from the data lake and put it into that vector database and make sure you index it right and put it into your serving stack. That way, let's say you get some input query text and you want to find nearest neighbors on that, you will hit the online system and not the offline one. Makes sense.
Starting point is 00:30:38 Yeah, and on a slightly different note, right? Like you talked about data tools and a lot of different things you need to do as you move up the data maturity stack who run like an efficient like data infrastructure like if you're advising a company today how would and let's say you don't know anything about the company so just have to ask them a bunch of questions like what questions would you ask them if they asked you know should i use data break should i use snowflake should i build something in-house like there's so many different changes happening and how do i know what my data stack should be
Starting point is 00:31:09 i see that's a good question so of course i'm gonna ask a lot of questions i'll try to maybe think a minute so you'll make first things first is i would try to understand the team size is going to manage this infrastructure. If it's a small team, then of course, it's better to try some kind of hosted offering instead of hosting it yourself, at least to start with. Then naturally, there are some things that you would obviously require. You will need some way to emit events. You will need some way to ingest them and things like that. So there are different, for instance, let's say you want to outsource the ingestion, right? Ingesting data from one source to one sink.
Starting point is 00:31:53 There are a bunch of offerings as well. For instance, Fivetron is one. So I would say the main question I would ask is what team size? And I think depending on your use case, you just need to evaluate which parts of the stack you're going to need. To start with, I'm assuming you need some kind of, you'll obviously have OLTP systems before. So you'll need some kind of an ingestion
Starting point is 00:32:14 or a dispersion framework. You'll need some kind of message buses and you'll also need some way to store and query all of this data. So you need some kind of a storage layer and a compute layer. So I would say these are like the bare minimum to start with.
Starting point is 00:32:27 And maybe going forward, you can invest in other things like governance, some kind of a discovery tool, some kind of a nice, well-integrated BI tool to query and things like that. Yeah. And how would I learn, you know, what requirements I'll need to have in the future? Like, is there like one place where I find out about these things? What's the best way for me to learn this on my own? So there are a few blogs which I've seen. A company kind of puts all of the different components they have in the data stack in a diagram and stuff. If you union them across, let's say,
Starting point is 00:32:59 I did see some blogs from Uber, some from Netflix and stuff. So I can share them with you. But if you do a union across like four or five blogs, you will pretty much know all of the different concepts involved. Of course, the actual offerings that they end up using, for instance, you will see, you will definitely see a dispersion framework out there. Some would be in-house built, some might be a cloud offering and things like that. But you would definitely see all of these concepts if you just union three, four blogs out. Makes a lot of sense. And you just union three four blocks out makes a lot of sense and you move
Starting point is 00:33:25 on to open ai which is definitely a very hyped company especially around yeah and all these advances we've been seeing what are you excited about in the world of data now yeah i think the thing is that is i i got to know that there's no shortage of you know real-time problems because as i said uh it's it's better to do... If there was a way for me to do everything in real-time, yeah, if there's an efficient way to do it, I'll end up doing it, right? So a lot of products are kind of trying to be built
Starting point is 00:33:57 around the real-time domain. So I see a lot of challenges in making things real-time. So that is something. And with a lot of AI advancements, you end up putting a lot more GPUs in like the serving flow. So handling them is also another challenging problem in itself.
Starting point is 00:34:14 Of course, there's this new dimension of like vector search and semantics that's becoming popular. I also see some interesting challenges coming up in that area. Makes a lot of sense. Well, Arvind, thank you so much for the time. I've learned a lot of sense well arvin thank you so much for the time i've learned a lot from this and hope you had a good time too oh yeah
Starting point is 00:34:30 yeah definitely it's always nice talking about talking about all of this yeah perfect thanks Thank you.
