Grey Beards on Systems - 138: GreyBeards talk big data orchestration with Adit Madan, Dir. of Product, Alluxio
Episode Date: October 13, 2022. We have never talked with Alluxio before, but after coming back last week from Cloud Field Day 15 (CFD15) it seemed a good time to talk with other solution providers attempting to make hybrid cloud easier to use. Adit Madan (@madanadit), Director of Product Management, Alluxio, which is a data orchestration solution that's available …
Transcript
Hey everybody, Ray Lucchesi here with Keith Townsend.
Welcome to another sponsored episode of the Greybeards on Storage podcast,
a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
And now it's my pleasure to introduce Adit Madan, Director of Product for Alluxio.
I just got back from Cloud Field Day 15 last week and thought Alluxio would make an interesting continuation to those discussions.
So Adit, why don't you tell us a little bit about yourself and what Alexio does for cloud workloads?
Hi, everyone. This is Adit Madan. I'm Director of Product Management
at Alluxio, where I've been for several years. I would say I've spent the better part of my
working career at Alluxio, which is six years at this point. I have been working in the company
right since the beginning, started off in engineering, working across different roles before I ended up in product management a few years back.
It's been an exciting journey at Alluxio,
seeing the company evolve, go through different stages of usage with customers,
how the use cases have evolved. And with that, maybe I'll talk a little
bit about Alluxio itself. We are a company which started off as part of the AMPLab at UC Berkeley,
as you guys all know, that's the same lab which gave birth to the Spark project,
which is now Databricks.
So we were in the same research lab,
but we went off in a different direction.
So instead of trying to be yet another compute engine,
what we did was we decided to address a different part of the problem.
So we are trying to address the access of data across different compute engines,
across different environments, and we can talk more about that as we go.
So it's a data gravity issue, as we would like to say in the storage business.
Data is tough to move around and tough to access from lots of different places, considering
it's only located in one location, right?
Yeah, exactly.
So I think the problem of data access itself surfaces in different ways for companies in
different segments.
For large enterprises, it is exactly what you said.
They might have, for example,
for business intelligence applications,
they might have a variety of different data sources.
And since it's a large organization,
it's not uncommon to have these spread across
different silos in different regions of the country,
different parts of the country, different parts of the world,
or even between some data sources on-premises and some in the cloud.
So where we come in is providing whatever applications need access to data,
sometimes a federation of data across these sources.
We are providing a unified way of accessing it,
regardless of what the application on top is.
So your solution is open source software, is that correct?
Yes, we are open source software. The company itself follows the open core model: there's the open source project and the corresponding community edition, which is free to download, free to inspect the source code, and free to contribute to. And then we have the enterprise edition of the product, which is closed source and builds on top of the open source.
That's a rather large problem. Is there a target market or use case that the team is focusing on?
Yeah. So when I said, I mean, the larger vision, like I said,
I started off by saying any kind of application. So the first way in which we focus is that, when we say application, there are different kinds of data-driven applications. The market that we're focusing in on, since we talked about the likes of Spark and other engines like that, is initially large-scale analytics, BI, and SQL OLAP applications, not general applications, not like a general-purpose file system.
What we've built is specifically for the needs of initially analytics and then going down
into machine learning and deep learning.
That's how the company evolved.
So analytics historically has been, I'm guessing, Hadoop kinds of things.
And I'm thinking sequential access and things of that nature, maybe object storage kinds of stuff. So in your solution, you deploy software that runs in various locations of a
company's environment. Is that how it would work? Yeah. So I think a lot of times what happens is
the software that we provide is deployed in one region of the company's infrastructure,
and then it kind of evolves as we expand within the organization to having multiple instances
of the same software.
And the software provides sort of a protocol stack at both locations.
So, you know, let's say I'm sitting in AWS and I want to access S3 objects sitting in my enterprise on-prem.
So I'd have software at both locations, presumably.
And one would be, you know, a target.
One would be a client.
So the situation that you described, actually, we would need only one instance of our software in this case.
And that's a typical scenario which comes in for, let's say, large organizations who are looking for some agility, to be able to utilize the cloud but not completely move away from their on-prem infrastructure.
So it would be like a cloud bursting kind of solution or something like that?
Exactly. So for a cloud bursting kind of solution, we would deploy Alluxio.
Alluxio is always deployed close to the application which needs access to data.
So in this case, that's the application, the compute, which is running in the cloud.
So we would have Alluxio in the cloud providing access to data which
may reside on-premises. So that's really interesting from an architecture perspective.
I'm really interested to see how this kind of mesh or fabric works. And I guess we should start
there. There's the concept of a data mesh or a data fabric.
Where does Alluxio sit in the definition of those two different types of approaches?
You think there are two different types, Keith?
I mean, that's a whole different discussion.
I think you have a data mesh running on a data fabric.
Yeah, yeah, exactly.
I suppose.
I mean, you don't need a data fabric to have a data mesh.
What is a data mesh? Let's start there, Keith. So I
view a data mesh as kind of what we're talking about
now, this access layer, this ability to take data
from unique sources and provide a
consistent API or experience.
So a mesh of data.
A data fabric is specifically focused on how do I get the bits from point A to point B.
So one's logical and one's physical kind of view.
One takes care of the logical and one takes care of the physical movement of data.
Now you need to answer the question.
That's a great question and I think data mesh itself has been
a hot topic of discussion these days. So I like
the high level categorization of calling
a data mesh
the logical layer, or even like sometimes people say,
it's more of how the organization works,
who is the owner of data,
which team is responsible for operationalizing the data.
I feel like that's a lot of the conversation around data mesh itself.
Whereas, like you said,
the fabric itself,
it's encompassing kind of like,
almost like it's defining a layer
in the data stack,
which serves a particular purpose.
Whereas data mesh is more of a concept which can be realized using different tools.
Different vendors kind of have different tools for making up the solution, whereas the fabric approach is kind of prescribing an approach to that.
Connectivity and all that stuff. Yeah, yeah, yeah. No, I agree. I agree completely.
So back to this AWS cloud bursting compute solution here. So the S3 data is sitting in my
on-prem and you have a layer sitting in the AWS compute, I assume, doing some stuff to provide access to the EC2 instances sitting there to the data that's
sitting on-prem? Are you caching data? Are you cataloging the data? I mean, there's lots of
stuff that I could consider doing as, let's say, a data mesh solution for S3 data or HDFS data, or even NFS data, things of that nature.
I mean, so where does Alluxio fit into that sort of thing?
I mean, are you caching?
That's the first question. Are you going out and gathering all the metadata for the data sitting at the target location?
Or the source location, rather.
I'm sorry, my mistake.
No, that's a great question.
So what Alluxio is doing with this layer,
to your first question, yes, Alluxio is providing caching abilities as well.
Once you have to provide access to data across these different regions without any dependency on the compute running on-prem in this scenario, caching clearly is a necessity for this kind of a solution.
And I mentioned that it is a requirement for this kind of a solution because this kind of federation
could be constructed in different ways
which do not depend on caching itself.
So you could, when you're running compute in the cloud,
you could send your compute job on-premises,
get the results back, and then just transmit the results.
This kind of mechanism doesn't really depend on caching, but it has the downside that you're still dependent on your compute on-premises. So if you were actually in the situation in which the reason you wanted to burst to the cloud was that you didn't have enough compute resources on-prem, it's not really solving the problem. So yeah, Alluxio is caching, and Alluxio is providing a view of all of the data.
So it is collecting the metadata for whatever data is present across different sources.
But at the same time, once we get into what the entire solution looks like, we are not providing governance, for example.
We hook into systems which provide governance.
So Alluxio is not coming in and saying that Alluxio is the cataloging and governance solution.
Alluxio is plugging into a few different components in the stack to provide the solution itself.
When you say governance, you're talking about security, protection, access rights,
those sorts of things, or access logging kinds of things?
Exactly.
So if you think about governance, and just even limited to two of the things that you mentioned, logging and also access control, which teams or which individuals can access which data sets, Alluxio allows you to hook into different components to provide that solution.
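To make that hookup concrete, here's a minimal sketch of how a data access layer can delegate an authorization decision to an external policy engine such as Styra's OPA, which exposes a REST API for policy queries. The endpoint, policy path, and input shape here are illustrative assumptions, not Alluxio's actual integration:

```python
import requests

# Hypothetical OPA endpoint and policy path -- illustration only, not
# Alluxio's actual Ranger/Privacera/OPA integration.
OPA_URL = "http://localhost:8181/v1/data/datamesh/allow"

def is_access_allowed(user: str, dataset: str, action: str) -> bool:
    """Ask the policy engine whether `user` may perform `action` on `dataset`.

    The access layer stays a pass-through: the data-owner team writes the
    policy; the consumer side only asks for a yes/no decision.
    """
    resp = requests.post(
        OPA_URL,
        json={"input": {"user": user, "dataset": dataset, "action": action}},
        timeout=5,
    )
    resp.raise_for_status()
    # OPA wraps the policy's decision under a top-level "result" key.
    return resp.json().get("result", False)

if is_access_allowed("analyst@team-b", "s3://team-a-sales/2022/", "read"):
    print("serving data")
else:
    print("access denied")
```

The point of the pattern is that the consumer-side layer never owns the policy; it only enforces whatever decision the owner's policy engine returns.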
So this is, you know, kind of next level conversation that we'll get right into, which I'm not complaining.
One of the challenges when we're talking about stretching access of data from on-premises systems to the public cloud is that ETL of the metadata, of that who has access to what systems
so that when I'm making, not necessarily,
I'm assuming we're not making copies of the data,
but we're extending access of the data.
So when we inject some type of proxy for that data
to speed up access, et cetera, et cetera,
the question is how do we maintain access control, especially as we
begin to extend the capability? We're talking about OLAP analytics. I can never say that.
We're talking about analytics data. So we're not transacting on the data, but access control is
still critical because this data could still be
sensitive. So I guess the question is, how is access control extended to this new cloud environment?
If I'm building a net new app on EC2 instances that's hitting the Alluxio appliance, how do I
ensure my security policy is enforced across this new app?
Yeah, so maybe let's use the data mesh concepts and terminology to talk through that example.
Maybe let's imagine an enterprise which has two different teams. One is the owner of data, like who is responsible for deciding who can access
the data. And then there's another team which needs access to the data itself.
Now, if you just break it up on, say, that who has the responsibility of deciding
what are the policies of who should be able to access what,
it would be the data owner,
the team which is the owner of the data itself.
So that doesn't really change.
But if you look at where Alluxio comes in, Alluxio is on the consumer side. So team B, which needed access to data from team A, that is the team using Alluxio, and Alluxio is kind of hooking into the data governance tools like Apache Ranger, Privacera, or other mechanisms like Styra's OPA. There are many different modules which the data owner could use for enforcing these data policies.
So you become sort of like a pass-through for whatever the credentials that are required
on-prem.
You know, the application is presenting those credentials to the Alluxio solution at the consumer side where the compute is, and you're passing those credentials across, I guess. Is that what you're trying to say?
Yeah. So Alluxio does become a pass-through. In more technical terminology, it's impersonation. We impersonate the user, and whoever is enforcing the policies enforces them as if the user were accessing the data, not Alluxio.
So I was reading your website, and you talk about multiple cloud support. I mentioned HDFS and S3, but there must be like four or five other access protocols as well.
Could you talk about some of that?
Oh, yeah, definitely. So, a little bit of history of the project. When we began, as you guys know, we talked about Hadoop,
and we talked about generally the kinds of interfaces which were prevalent in the big data ecosystem, and that's where our HDFS interface came into play. The purpose of the HDFS interface, and the rest of the interfaces as well, and I'll get to those in a second, is simply to make it a lot easier to introduce Alluxio into the mix. So on the data API front, I would say HDFS used to be the most popular way of accessing Alluxio, but these days it's the S3 interface and the POSIX interface, the latter specifically for machine learning and deep learning applications. Those are the three main interfaces for accessing Alluxio across the variety of applications that a data platform may be onboarding.
And then different cloud support. Yeah, that's an interesting one.
Maybe we can spend a little bit of time on that as well.
On the support for multiple clouds,
I mean, we talked about one situation
in which you may want to access data
across these different regions.
We started talking about cloud bursting,
which is kind of on-prem to a cloud.
But increasingly what we're hearing from our customers is,
I mean, we all know that no one likes to be vendor locked, right? And vendor locked here even means no one likes to be tied to one specific cloud. I'm sure you've heard the same thing we have: most of the customers, if not all of the customers that we're dealing with, if they're using the cloud, they start off with one cloud, but they will definitely migrate to another cloud. Not migrate, I would say; they would also adopt, add in a second cloud, if not a third cloud, at some point in the future.
And some of our customers already have achieved that,
and others are kind of headed in that direction.
So for just keeping these kind of enterprises in mind,
one of the things that we also promote as something that Alluxio is solving is the fact that it's making your applications portable. Alluxio is not the only thing making them portable, but it's contributing in a significant way on the data API side,
such that you can just lift your applications
and run them in whichever environment is most suitable.
And most suitable can be for a couple of different reasons.
Sometimes most suitable
could just mean that you have access
to a particular service
from a particular cloud vendor,
which is more suitable
for the job at hand.
So it's kind of application semantics dependent.
And other times it's like a cost reason.
You may negotiate a better price from
a different cloud vendor. So this ability to just move your application without moving the data itself, this separation of application and storage, is critical here. Just the ability to move wherever without being constrained by the data gravity problem, which we started off with, that's really critical. And that's kind of what you might have seen behind our multi-cloud messaging.
BI kinds of things, or even AI/ML, we're talking about lots and lots of data.
Even though you're sitting there and caching things and stuff like that, when you're talking
about accessing, I don't know, terabytes, petabytes of data, we're still talking considerable amounts. You know, the latency becomes an issue, the bandwidth becomes an issue as well. How do you deal with some
of those sorts of things? Obviously, caching can deal with some of the latency things, but
at some point, you actually have to go and grab the data
from wherever it is, right?
Definitely.
So, and this is one of the first questions that we always get from prospects.
It's almost like we hear that it sounds good.
And if it worked, we would definitely use it.
But there's always this doubt that it's too good to be true in some ways.
So for this, caching plays a huge role. And in addition to caching, the first thing I would say on the network side, with latency and bandwidth, you have to keep in mind that there is a selection of data which moves. So we are not blindly moving, or even in the context of caching, we are not blindly caching
everything under the hood.
Just taking the example of like a BI application,
which might be operating on years of data,
let's say three plus years of data,
petabytes of data,
which is residing in one region
and you're accessing it from another region.
So we are able to select what needs to be moved.
And the second thing is that
there's a lot of capabilities
around preloading, prefetching, these policies in the layer that we provide, which are able to
eliminate the latency effects. And just taking this a little further, you only take the hit the first time. And the access patterns of these applications, based on what we've observed, and this is something we've validated across a lot of our community and enterprise users, are such that caching is effective. Those are a few things. I mean, just taking an extreme example, we have a lot of people who are
splitting their machine learning or deep learning pipeline by, let's say, pre-processing in the
cloud, but they want to own the GPUs on-premises
and run the application on-premises, while the data itself resides in a cloud object store.
And if you just look at the access pattern of these training jobs, they fetch the data once, which takes the latency and bandwidth hit. But then once the data is available, you keep reading the same data, with a little bit of a difference, and you keep doing this in a loop. So that's where it really makes the solution more effective.
Yeah, yeah, yeah, exactly. So I mean, A, they're batched, and B, they do a number of epochs across
the same data and things of that nature, randomized, of course.
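A quick back-of-the-envelope model shows why this access pattern favors caching. All the numbers below are illustrative assumptions, not measurements from Alluxio deployments:

```python
# Rough model of a multi-epoch training job reading a remote dataset
# with and without a co-located cache. All numbers are illustrative.

dataset_gb = 10_000     # ~10 TB of training data
wan_gbps = 10           # on-prem <-> cloud link bandwidth
local_gbps = 100        # read bandwidth once data is cached locally
epochs = 50             # passes over the same data

wan_pass = dataset_gb * 8 / wan_gbps      # seconds per full pass over the WAN
local_pass = dataset_gb * 8 / local_gbps  # seconds per full cached pass

no_cache = epochs * wan_pass                       # every epoch pays the WAN cost
with_cache = wan_pass + (epochs - 1) * local_pass  # only the first epoch does

print(f"no cache:   {no_cache / 3600:5.1f} hours")
print(f"with cache: {with_cache / 3600:5.1f} hours")
print(f"speedup:    {no_cache / with_cache:.1f}x")
```

With these made-up numbers the cached run finishes roughly 8x faster, because only the first epoch crosses the slow link.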
So that brings up the question of how big your cache can be.
So if I'm front-ending a petabyte of data and allowing customers to access pretty much all of that data in, let's say, a sequential pattern, stuff like that, we're still talking, you probably need a significant amount of cache, right?
Yeah. I mean, it really depends on the situation, but in a few of our larger examples, it's not uncommon to have half a petabyte of cache, for example.
Oh, we're talking real stuff. Okay, now I understand.
In the larger scenarios.
I mean, obviously it really depends
on what the working set of your data is.
And half a petabyte of cache
doesn't mean you're spending
the same amount of money
on your analytics platform.
If you just look at the amount of spend that you have
on storage versus GPUs or compute,
the storage spend is kind of a very small percentage
of the entire spend.
So with that in mind,
one of the things that comes up as a question is observability and improvement, and what knobs we can turn to improve latency and throughput, etc. So I would imagine the target audience for a lot of this is not necessarily IT infrastructure people. They're application developers,
people who are born in the cloud
and extending capabilities.
How do you help those operators
identify the knobs they can turn on the network
or the cloud provider side,
whether it's increasing the cache size from a storage perspective,
resizing that virtual machine that's doing the caching, versus simply doing a direct connect, and the speed of that direct connect.
So it's really the visibility of the performance of the data mesh, I guess I'd call it.
And what sort of knobs,
or how do you tell the users in this environment,
you know, what knobs they can play with
and what knobs they can't, I guess?
That's again a great question.
And I wouldn't claim that we've solved the problem entirely,
but we have made significant progress,
which I can definitely share
because as you can imagine,
and you pointed out a few things,
how big should my network pipe be?
Figuring out the answers to these kinds of questions is not trivial by any means.
Just to take one specific example of a collaboration that we've done
with Meta, actually. Meta is one of the users of our community edition, and that's how our open source also plays into our company strategy, in that a lot of the innovation for these kinds of problems happens with the internet giants. And when we were describing the problem, we mentioned two things: how big should the cache be, and how big should my network pipe be? So for these kinds of things, what we've done on the community side, and some of these things we are productionizing as well these days, is what we call cache insights.
And the workloads themselves are not static; this is not a one-time exercise. You need these kinds of insights as the workload keeps changing over time, and you might keep adding more teams to your platform. So this needs to be a continuous exercise.
And on the observability front, we actually baked in functionality into our client itself, which is providing insights based
on the access pattern itself. So it has kind of a decision tree which spits out answers to questions like: if my cache size is increased from 1x to 2x, what would be the impact on the workload? It is able to answer that kind of question because
it is seeing a lot of the access pattern. It's seeing what is hitting the cache, what is not
hitting the cache. And that's kind of some of the things that we are doing in that direction.
I wouldn't say it's a completely solved problem yet, but that's a step in the right direction.
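As a toy illustration of that kind of what-if analysis, you can replay an access trace through simulated LRU caches of different sizes and compare hit ratios. Alluxio's actual cache insights feature is richer than this sketch, and the trace below is made up:

```python
from collections import OrderedDict

def lru_hit_ratio(trace, capacity):
    """Replay an access trace through a simulated LRU cache holding
    `capacity` items and return the observed hit ratio."""
    cache, hits = OrderedDict(), 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict least recently used
    return hits / len(trace)

# Hypothetical trace: a hot working set re-read in a loop, plus cold scans.
trace = [f"hot-{i}" for i in range(100)] * 20 + [f"cold-{i}" for i in range(2000)]

for cap in (100, 200, 400):  # "what happens if I double the cache?"
    print(f"capacity {cap:4d}: hit ratio {lru_hit_ratio(trace, cap):.1%}")
```

Because the simulator sees every hit and miss, it can answer "what would 2x the cache buy me?" for this trace without actually provisioning the larger cache, which is the same basic idea behind the insights Adit describes.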
So you're giving sort of like a predictive view of what the performance of the application
would be if I were to double the network pipe or double a cache size or something, whatever
the parameters are.
Are those the major ones that affect the performance of the data mesh?
I would say sizing in the context of Alluxio especially, I mean, sizing the cache and sizing the network are definitely the major factors. The only other thing that we haven't mentioned is how many cores you need, which is kind of proportional to the concurrency of the workload itself. But yeah, I feel like we've captured all kinds of resources, right? We talked about CPU for concurrency, storage for cache, and then network, which are the three major factors.
So, I mean, you know, we've been talking a lot about Kubernetes in this world here, greybeards though we are.
So, is it a Kubernetes solution?
Does it support multiple nodes for its client support?
Can you scale up the number of nodes?
Or is it just a single virtual machine or dual virtual machine with high availability?
Well, the high availability question is a different one.
But, you know, so I guess is it multi-node solution?
That's the first question.
Definitely.
So Alluxio is a scale-out distributed system. It can be deployed on Kubernetes, which is increasingly becoming the de facto way of deploying Alluxio. It wasn't always the case, and I think there's still a migration happening, but increasingly, every new user we come across uses Kubernetes as the way they deploy, manage, and operate Alluxio.
So I guess
going the other direction, can you consume
this as a SaaS?
It could be. It's not there yet. I mean, when I said it could be, I meant, would you see value if there was a SaaS service? Yes. But Alluxio is not a SaaS service yet.
That's not something that we provide as of now.
Okay.
Well, that's good.
So it supports Kubernetes clusters.
So your client software would be deployed as containers in the Kubernetes cluster?
Is that how it would work?
Or would it be a separate Kubernetes cluster with your client software, somehow connected to other Kubernetes clusters?
So we have both ways actually, and for different enterprises, both kinds make sense.
We do have the situation in which, just as an example, let's say I'm using Spark on Kubernetes, running ephemeral clusters in the cloud. There you would deploy Alluxio as a separate Kubernetes cluster, which has a different lifecycle from the ephemeral Spark clusters themselves, with Alluxio also on Kubernetes. The client itself is embedded inside Spark. Our client for Spark is not a separate process; it's not running anywhere else, but is a library embedded in Spark itself. And like I said, once you're using something like an S3 API, you don't even need any custom client. So out of the box, Spark or any of your applications which can talk the S3 interface can interface with Alluxio without any changes.
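To sketch what "without any changes" can look like in practice, a Spark job might simply point the stock S3A connector at an S3-compatible Alluxio endpoint. The endpoint address, port, and bucket name here are assumptions for illustration:

```python
from pyspark.sql import SparkSession

# Hypothetical endpoint for an S3-compatible Alluxio proxy; the address,
# port, and bucket layout are illustrative assumptions.
spark = (
    SparkSession.builder.appName("s3-via-alluxio")
    # Point the stock S3A connector at the proxy instead of AWS S3 ...
    .config("spark.hadoop.fs.s3a.endpoint", "http://alluxio-proxy:39999")
    # ... using path-style addressing, as is typical for S3-compatible stores.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# The application code itself is unchanged: it still just reads s3a:// paths.
df = spark.read.parquet("s3a://sales-data/2022/")
df.groupBy("region").count().show()
```

Only the endpoint configuration changes; the reads and the query logic are exactly what they would be against S3 directly.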
So I'm still trying to understand it.
So the client software itself ends up being deployed as part of the Spark functionality.
Are there other applications where that's the case?
Or, you know, Kafka, and there are probably a dozen different SQL and NoSQL databases out there, those sorts of things.
I mean, how would they deploy your client software?
Yeah, so maybe let's look at a different category of applications.
Let's say we are using something like PyTorch for machine learning, deep learning.
And in those scenarios, we provide something called a CSI driver on Kubernetes, a container storage interface driver, which makes Alluxio look like a local file system. So on the client side, we would install our CSI driver, so to say, which is able to interface with Alluxio. And then the applications themselves, the containers, will just talk to a mount point inside their containers, like a local file system.
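Here's a minimal sketch of what that looks like from the application's side, assuming a hypothetical mount path; the container just does ordinary file I/O:

```python
import os
from torch.utils.data import Dataset, DataLoader

# With the CSI driver, the Alluxio namespace appears as a plain directory
# inside the container. The mount path below is a made-up example.
MOUNT = "/mnt/alluxio/training-data"

class FileBytesDataset(Dataset):
    """Reads raw files through the mount point like any local file system."""

    def __init__(self, root):
        self.paths = [os.path.join(root, name) for name in sorted(os.listdir(root))]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as f:
            return f.read()  # decode/transform as your pipeline requires

loader = DataLoader(FileBytesDataset(MOUNT), batch_size=32, num_workers=4)
for batch in loader:  # each batch is a list of raw byte strings
    pass  # training step goes here
```

Nothing in the training code knows or cares that the "directory" is actually backed by a remote object store.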
What I hear is, there isn't a fixed way to deploy this. So if I have a container app and I don't want to deal with the networking of making external calls from the cluster to a cluster based on S3 or a different mount or whatever, I can build that app with the Alluxio cluster, or Alluxio nodes, within that cluster.
So if that's best for my application operations design, I can do that.
If I want the data to live independent of the app lifecycle or the app's instances,
then I can build a dedicated cluster and just simply make S3 calls to that cluster.
So it really depends on the application and the kind of application.
And whatever my operations are.
So if I'm a data team and I'm providing data to multiple applications in a public cloud,
then I build the cluster and it'd be independent of the individual app cluster.
So even, we were kind of focusing on Kubernetes, but it's not unique to Kubernetes. I could have AWS services. I could have GPU ML/AI instances running against this data hosted
in this cluster. As a data service provider, I'm just
managing the data independent of the applications and
just providing,
you know,
centralized caching and capability for multiple teams and applications.
Absolutely.
And we actually published a case study of an organization,
Expedia actually,
which is doing precisely that.
So this is something we published a couple of weeks back.
So that's why it's fresh in my memory. But they're using different services,
like they're using different variants of Spark and Trino,
open source flavors,
but also services like a Databricks or a Starburst in AWS too,
with a dedicated Alluxio cluster, as you were describing.
That brings up a question now. In this sort of solution, do you support,
let's say, multiple locations for the source data to the same target?
So let's say I've got multiple on-premise locations throughout America for high availability or something like that.
Can I have my application sitting in GCP talk to all three locations?
I mean, it might be separate mount points, I guess.
Is that in the configuration, I suppose?
Exactly.
And we do have a lot of people who are deploying Alluxio in that way.
So we provide something called a namespace, which looks like an object and file namespace.
And precisely what you said, we would have different mount points for the different data sources.
So it's essentially mapping a section of the Alluxio namespace to a different data source.
Yeah, we used to call that kind of thing a global file system.
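As an illustration of the idea, here's a hypothetical layout of such a unified namespace as seen through a POSIX mount. The mount names and backing sources are made-up examples:

```python
import os

# Hypothetical unified namespace seen through a POSIX mount. Each top-level
# directory is a mount point backed by a different source -- the paths and
# sources below are made-up examples:
#
#   /mnt/alluxio/warehouse -> s3://analytics-bucket/warehouse   (AWS)
#   /mnt/alluxio/logs      -> hdfs://onprem-nn:9000/logs        (on-prem)
#   /mnt/alluxio/archive   -> gs://cold-archive/2019            (GCP)
#
# An application in GCP just walks one tree; the data layer routes each
# subtree to the right system and region underneath.
ROOT = "/mnt/alluxio"
for mount in sorted(os.listdir(ROOT)):
    path = os.path.join(ROOT, mount)
    print(f"{mount}: {len(os.listdir(path))} entries")
```

The application never sees three systems in three regions; it sees one namespace with three directories.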
I also noticed on your website that you support different vendor storage.
I noticed NetApp and Dell.
I think MinIO is there as well.
Yeah, we support a huge variety of different stores on the south side. I would say the most common ones,
actually the most common protocols,
one is the S3 protocol,
which is extremely common,
followed by the protocols by the major cloud vendors
like GCS and ADLS from Azure,
but also HDFS for organizations who still have HDFS around.
For us, speaking to a MinIO or speaking to an Amazon S3 or speaking to a Cloudian, functionality-wise, it's the same.
Obviously, once you get into the operations, there are differences of how you would tune the system.
But functionality-wise, it's the same driver that we use to speak to different kinds of storage systems on the south side of Alluxio.
Yeah, so let's say a NetApp solution would have SMB3, it would have NFS version 3, maybe version 4, it would potentially have S3, those types of protocols.
And are you only supporting S3 in those sorts of environments?
No, we also do support a local file system interface on our south side.
So anything which speaks POSIX pretty much, we would also support that.
So since you mentioned NetApp, we actually collaborated with NetApp recently over the last year.
And they didn't even mention us in their earnings report, which came out a couple of weeks ago. We have heavily collaborated with them on the S3 front, because especially for the kinds of workloads and applications that are our market, these data-driven applications, the S3 interface is the more popular one compared to some of the other kinds of protocols that you mentioned.
How is something like your enterprise solution priced?
Yeah, our enterprise solution, right now, we price it based on the amount of resources that you allocate to Alluxio. So it's very similar to other vendors in our space, in which you price based on how much CPU and how much cache storage you have allocated to Alluxio. And those parameters are what we charge on.
The primary way in which we sell our software is still annual licenses. So based on the resources, we would give you a price for how many resource hours you can use across the year, and you would get into an annual contract with Alluxio.
So I would have thought that you might have the source data size
as being a component of the price.
Like if I wanted to take a petabyte data lake, for instance, sitting on my own computer systems, and I want to be able to access it through Azure or something like that, the client is going to take, you know, cache and storage and networking and EC2 instances, or whatever the counterpart is for Azure. But, you know, having a petabyte of storage under accessibility, I guess, that could be one of the components. But you're not doing that. It's really the amount of resources, the compute, storage, and networking resources consumed by the client wherever the client's deployed.
Exactly.
So in the scenario that you described, if you have like a couple of petabytes of data, but you only end up accessing half a petabyte, we wouldn't charge you
for everything. So like the second factor that I said, it's more like, at any given point in time, how much data would you be accessing? It doesn't matter if you have tons and tons of data.
It's kind of a working set measure, almost.
Exactly. It is exactly a working set measure. And I mean, generally this is agreeable to customers as well, because you don't want to price based on something that they merely have. You want to price based on the value they're getting out of Alluxio. If they're not accessing a lot of data, which is there for archival or historical purposes, then they're not really getting anything out of Alluxio. So why should Alluxio charge for that?
Yeah, I guess like a sample use case or how you value this would be if I had a bunch of ERP data that was in my data warehouse sitting on-prem,
and I go to ask that data warehouse, that traditional data warehouse,
a business logic question,
and I just don't have the CPU or capability to answer that question on-prem,
I deploy this solution to the cloud where I do have the CPU and TPUs to answer that question.
And at the end of the day, it should be kind of this thing that I can turn on and off to say,
instead of building, you know, SAP HANA solution or Spark solution on-prem, I can use it as needed
in the cloud and I should only pay by the drip.
And that drip is how much, how quickly can I get that business answer versus how quickly
I could have gotten it on prem.
Well, that's a great point.
I asked a question about HA or high availability earlier.
I'm assuming your multi-node solution uses those sorts of capabilities to support high availability. Is that correct?
Yeah, definitely. We do support high availability. For the component of Alluxio which is responsible for managing the metadata across the system, we kind of have a replicated state machine.
We use certain libraries for consensus,
and we are able to make sure that if any one node goes down,
we are still providing highly available access to data.
And the other thing to really note is that, I talked about the metadata portion of it, but if you talk about the data itself and you terminate Alluxio completely, this is one of the things which is core to our philosophy: whether Alluxio is there or not, you should still be able to access the data. So even if Alluxio is terminated and you lose data cached in Alluxio, you can always recover by accessing the underlying source directly.
There is this question about writing. I mean, obviously reading for BI
and machine learning is probably the predominant access. But if I were to, say, create an object using Alluxio, create it on-prem while I'm sitting in GCP in this case, are you able to create files or objects with Alluxio through the client, or is that not supported?
Yes, yes, yes. We are a read-write solution.
That brings up a lot of potential conflicts.
You know, where the data is at any instant in time.
Is it available at the source location or wherever you're actually storing it? How does it get there? How often is it updated? Those sorts of things.
So, I mean, there's a whole bunch of data integrity issues with respect to supporting this sort of proxy-write kind of thing across, you know, multiple clouds, right?
Multiple on-premise locations.
Wherever the client software is running,
you could potentially have access to an object bucket.
No, definitely. It's not an easy problem, and something that we've built up over the years. So for the write path itself, we have different policies or different ways of writing in certain scenarios. Let's say you're doing the computation in the cloud. What could happen is, say there's certain analysis that I want to do as an analyst, but when I'm writing, I'm writing it back to a bucket that I own, which doesn't really have the consistency problem that you described. But in other scenarios, I want the write to be written back to on-prem and consumed by
other applications running on-prem.
So we have different ways, or different policies, in Alluxio for when data should propagate from Alluxio to the underlying store. Should it go synchronously? Should it go asynchronously? And on top of that, once you're operating in these multiple environments, we have sophisticated mechanisms of synchronization: should the update be synchronized immediately, or can I bear eventual consistency, a lag of a few seconds?
If you look at some of the more transactional workloads on top these days, especially when you're looking at table formats like an Iceberg or a Hudi, you need somewhat different semantics operating in multiple environments compared to if you were just operating on, let's say, a raw Parquet file, where the semantics are looser, so an eventual-consistency kind of policy works better in those scenarios.
All right, well, this has been great. Keith, any
last questions for Adit?
No, not that we have time for. There's a ton; I think we
could spend another hour at least
talking through some of this.
Yeah, yeah, yeah. Adit,
anything you'd like to say to our listening audience
before we close? Yeah, maybe the
only thing I would say is that if you are a large enterprise, I would always encourage you to plan for agility.
Plan to be able to move or to make your applications reside where they should be
without really caring about the data gravity problem
because there are solutions for that, but always plan for agility.
Maybe that's the last thing I would say.
Well, this has been great.
Thank you very much for being on our show today.
Thank you for having me.
Until next time.
Next time, we will talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.