Grey Beards on Systems - 138: GreyBeards talk big data orchestration with Adit Madan, Dir. of Product, Alluxio
Episode Date: October 13, 2022. We have never talked with Alluxio before, but after coming back last week from Cloud Field Day 15 (CFD15) it seemed a good time to talk with other solution providers attempting to make hybrid cloud easier to use. Adit Madan (@madanadit), Director of Product Management, Alluxio, which is a data orchestration solution that's available …
Transcript
Hey everybody, Ray Lucchesi here with Keith Townsend.
Welcome to another sponsored episode of the Greybeards on Storage podcast,
a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
And now it's my pleasure to introduce Adit Madan, Director of Product for Alluxio.
I just got back from Cloud Field Day 15 last week and thought Alluxio would make an interesting continuation to those discussions.
So Adit, why don't you tell us a little bit about yourself and what Alexio does for cloud workloads?
Hi, everyone. This is Adit Madan. I'm Director of Product Management
at Alluxio, where I've been for several years. I would say I've spent the better part of my
working career at Alluxio, which is six years at this point. I have been working in the company
right since the beginning, started off in engineering, working across different roles before I ended up in product management a few years back.
It's been an exciting journey at Alluxio,
seeing the company evolve, go through different stages of usage with customers,
how the use cases have evolved. And with that, maybe I'll talk a little
bit about Alluxio itself. We are a company which started off as part of the AMPLab at UC Berkeley,
as you guys all know, that's the same lab which gave birth to the Spark project,
which is now Databricks.
So we were in the same research lab,
but we went off in a different direction.
So instead of trying to be yet another compute engine,
what we did was we decided to address a different part of the problem.
So we are trying to address the access of data across different compute engines,
across different environments, and we can talk more about that as we go.
So it's a data gravity issue, as we would like to say in the storage business.
Data is tough to move around and tough to access from lots of different places, considering
it's only located in one location, right?
Yeah, exactly.
So I think the problem of data access itself surfaces in different ways for companies in
different segments.
For large enterprises, it is exactly what you said.
They might have, for example,
for business intelligence applications,
they might have a variety of different data sources.
And since it's a large organization,
it's not uncommon to have these spread across
different silos in different regions of the country,
different parts of the country, different parts of the world,
or even between some data sources on-premises and some in the cloud.
So where we come in is providing whatever applications need access to data,
sometimes a federation of data across these sources.
We are providing a unified way of accessing it,
regardless of what the application on top is.
So your solution is open source software, is that correct?
Yes, we are open source software. The company itself follows the open core model: there's the open source project and the corresponding community edition, which is free to download, free to inspect the source code, and free to contribute to. And then we have the enterprise edition of the product, which is closed source and builds on top of the open source.
That's a rather large problem. Is there a target market or use case that the team is focusing on?
Yeah. So when I said, I mean, the larger vision, like I said,
I started off by saying any kind of application. So the first way in which we focus is that, when we say application, there are different kinds of data-driven applications. The market that we're focusing in on, since we talked about the likes of Spark and other engines like that, is initially large-scale analytics, BI, and SQL OLAP applications, not general applications, not like a general-purpose file system.
What we've built is specifically for the needs of initially analytics and then going down
into machine learning and deep learning.
That's how the company evolved.
So analytics historically has been, I'm guessing, Hadoop kinds of things.
And I'm thinking sequential access and things of that nature, maybe object storage kinds of stuff. So in your solution, you deploy software that runs in various locations of a
company's environment. Is that how it would work? Yeah. So I think a lot of times what happens is
the software that we provide is deployed in one region of the company's infrastructure,
and then it kind of evolves as we expand within the organization to having multiple instances
of the same software.
And the software provides sort of a protocol stack at both locations.
So, you know, let's say I'm sitting in AWS and I want to access S3 objects sitting in my enterprise on-prem.
So I'd have software at both locations, presumably.
And one would be, you know, a target.
One would be a client.
So the situation that you described, actually, we would need only one instance of our software in this case.
And that's a typical scenario which comes in for, let's say, large organizations who are looking for some agility, to be able to utilize the cloud but not completely move away from their on-prem infrastructure.
So it would be like a cloud bursting kind of solution or something like that?
Exactly. So for a cloud bursting kind of solution, we would deploy Alluxio.
Alluxio is always deployed close to the application which needs access to data.
So in this case, that's the application, the compute, which is running in the cloud.
So we would have Alluxio in the cloud providing access to data which
may reside on-premises. So that's really interesting from an architecture perspective.
I'm really interested to see how this kind of mesh or fabric works. And I guess we should start
there. There's the concept of a data mesh or a data fabric.
Where does Alluxio sit in the definition of those two different types of approaches?
You think there are two different types, Keith?
I mean, that's a whole different discussion.
I think you have a data mesh running on a data fabric.
Yeah, yeah, exactly.
I suppose.
I mean, you don't need a data fabric to have a data mesh.
What is a data mesh? Let's start there, Keith. So I
view a data mesh as kind of what we're talking about
now, this access layer, this ability to take data
from unique sources and provide a
consistent API or experience.
So a mesh of data.
A data fabric is specifically focused on how do I get the bits from point A to point B.
So one's logical and one's physical kind of view.
One takes care of the logical and one takes care of the physical movement of data.
Now you need to answer the question.
That's a great question and I think data mesh itself has been
a hot topic of discussion these days. So I like
the high level categorization of calling
a data mesh
the logical layer, or even like sometimes people say,
it's more of how the organization works,
who is the owner of data,
which team is responsible for operationalizing the data.
I feel like that's a lot of the conversation around data mesh itself.
Whereas, like you said,
the fabric itself,
it's encompassing kind of like,
almost like it's defining a layer
in the data stack,
which serves a particular purpose.
Whereas data mesh is more of a concept which can be realized using different tools.
Different vendors kind of have different tools for making up the solution, whereas the fabric approach is kind of prescribing an approach to that.
Connectivity and all that stuff. Yeah, yeah, yeah. No, I agree. I agree completely.
So back to this AWS cloud bursting compute solution here. So the S3 data is sitting in my
on-prem and you have a layer sitting in the AWS compute, I assume, doing some stuff to provide access to the EC2 instances sitting there to the data that's
sitting on-prem? Are you caching data? Are you cataloging the data? I mean, there's lots of
stuff that I could consider doing as, let's say, a data mesh solution for S3 data or HDFS data, or even NFS data, things of that nature.
I mean, so where does Alluxio fit into that sort of thing?
I mean, are you caching?
That's the first question. Are you going out and gathering all the metadata for the data sitting at the target location?
Or the source location, rather.
I'm sorry, my mistake.
No, that's a great question.
So what Alluxio is doing with this layer,
to your first question, yes, Alluxio is providing caching abilities as well.
Once you have to provide access to data across these different regions without any dependency on the compute running on-prem in this scenario, caching clearly is a necessity for this kind of a solution.
And I mentioned that it is a requirement for this kind of a solution because this kind of federation
could be constructed in different ways
which do not depend on caching itself.
So you could, when you're running compute in the cloud,
you could send your compute job on-premises,
get the results back, and then just transmit the results.
This kind of mechanism doesn't really depend on caching, but it has the downside that you're still dependent on your compute on-premises. So if you were actually in the situation in which the reason you wanted to burst to the cloud was that you didn't have enough compute resources on-prem, it's not really solving the problem. So yeah, Alluxio is caching, and Alluxio is providing a view of all of the data.
So it is collecting the metadata for whatever data is present across different sources.
But at the same time, once we get into what the entire solution looks like, we are not providing governance, for example.
We hook into systems which provide governance.
So Alluxio is not coming in and saying that Alluxio is the cataloging and governance solution.
Alluxio is plugging into a few different components in the stack to provide the solution itself.
When you say governance, you're talking about security, protection, access rights,
those sorts of things, or access logging kinds of things?
Exactly.
So if you think about governance, and just even limited to two of the things that you mentioned, logging and also access control, which teams or which individuals can access which data sets, Alluxio allows you to hook into different components to provide that solution.
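To make that hookup concrete, here's a minimal sketch of how a data access layer can delegate an authorization decision to an external policy engine such as Styra's OPA, which exposes a REST API for policy queries. The endpoint, policy path, and input shape here are illustrative assumptions, not Alluxio's actual integration:

```python
import requests

# Hypothetical OPA endpoint and policy path -- illustration only, not
# Alluxio's actual Ranger/Privacera/OPA integration.
OPA_URL = "http://localhost:8181/v1/data/datamesh/allow"

def is_access_allowed(user: str, dataset: str, action: str) -> bool:
    """Ask the policy engine whether `user` may perform `action` on `dataset`.

    The access layer stays a pass-through: the data-owner team writes the
    policy; the consumer side only asks for a yes/no decision.
    """
    resp = requests.post(
        OPA_URL,
        json={"input": {"user": user, "dataset": dataset, "action": action}},
        timeout=5,
    )
    resp.raise_for_status()
    # OPA wraps the policy's decision under a top-level "result" key.
    return resp.json().get("result", False)

if is_access_allowed("analyst@team-b", "s3://team-a-sales/2022/", "read"):
    print("serving data")
else:
    print("access denied")
```

The point of the pattern is that the consumer-side layer never owns the policy; it only enforces whatever decision the owner's policy engine returns.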
So this is, you know, kind of next level conversation that we'll get right into, which I'm not complaining.
One of the challenges when we're talking about stretching access of data from on-premises systems to the public cloud is that ETL of the metadata, of that who has access to what systems
so that when I'm making, not necessarily,
I'm assuming we're not making copies of the data,
but we're extending access of the data.
So when we inject some type of proxy for that data
to speed up access, et cetera, et cetera,
the question is how do we maintain access control, especially as we
begin to extend the capability? We're talking about OLAP analytics. I can never say that.
We're talking about analytics data. So we're not transacting on the data, but access control is
still critical because this data could still be
sensitive. So I guess the question is, how is access control extended to this new cloud environment?
If I'm building a net new app on EC2 instances that's hitting the Alluxio appliance, how do I
ensure my security policy is enforced across this new app?
Yeah, so maybe let's use the data mesh concepts and terminology to talk through that example.
Maybe let's imagine an enterprise which has two different teams. One is the owner of data, like who is responsible for deciding who can access
the data. And then there's another team which needs access to the data itself.
Now, if you just break it up on, say, that who has the responsibility of deciding
what are the policies of who should be able to access what,
it would be the data owner,
the team which is the owner of the data itself.
So that doesn't really change.
But if you look at where Alluxio comes in, Alluxio is on the consumer side. So team B, which needed access to data from team A, that is the team using Alluxio, and Alluxio is kind of hooking into the data governance tools like Apache Ranger, Privacera, or other mechanisms like Styra's OPA. There are many different modules which the data owner could use for enforcing these data policies.
So you become sort of like a pass-through for whatever the credentials that are required
on-prem.
You know, the application is presenting those credentials to the Alluxio solution at the consumer side where the compute is, and you're passing those credentials across, I guess. Is that what you're trying to say?
Yeah. So Alluxio does become a pass-through. In more technical terminology, it's impersonation. We impersonate the user, and whoever is enforcing the policies enforces them as if the user were accessing the data, not Alluxio.
So I was reading your website, and you talk about multiple cloud support. I mentioned HDFS and S3, but there must be like four or five other access protocols as well.
Could you talk about some of that?
Oh, yeah, definitely. So, a little bit of history of the project. When we began, as you guys know, we talked about Hadoop,
and we talked about generally the kinds of interfaces which were prevalent in the big data ecosystem, and that's where our HDFS interface came into play. The purpose of the HDFS interface, and the rest of the interfaces as well, and I'll get to those in a second, is simply to make it a lot easier to introduce Alluxio into the mix. So on the data API front, I would say HDFS used to be the most popular way of accessing Alluxio, but these days it's the S3 interface and the POSIX interface, the latter specifically for machine learning and deep learning applications. Those are the three main interfaces for accessing Alluxio across the variety of applications that a data platform may be onboarding.
And then different cloud support. Yeah, that's an interesting one.
Maybe we can spend a little bit of time on that as well.
On the support for multiple clouds,
I mean, we talked about one situation
in which you may want to access data
across these different regions.
We started talking about cloud bursting,
which is kind of on-prem to a cloud.
But increasingly what we're hearing from our customers is,
I mean, we all know that no one likes to be vendor locked, right? And vendor locked here even means no one likes to be tied to one specific cloud. I'm sure you've heard the same thing we have: most of the customers, if not all of the customers that we're dealing with, if they're using the cloud, they start off with one cloud, but they will definitely migrate to another cloud. Not migrate, I would say; they would also adopt, add in a second cloud, if not a third cloud, at some point in the future.
And some of our customers already have achieved that,
and others are kind of headed in that direction.
So for just keeping these kind of enterprises in mind,
one of the things that we also promote as something that Alluxio is solving is the fact that it's making your applications portable. Alluxio is not the only thing making them portable, but it's contributing in a significant way on the data API side,
such that you can just lift your applications
and run them in whichever environment is most suitable.
And most suitable can be for a couple of different reasons.
Sometimes most suitable
could just mean that you have access
to a particular service
from a particular cloud vendor,
which is more suitable
for the job at hand.
So it's kind of application semantics dependent.
And other times it's like a cost reason.
You may negotiate a better price from
a different cloud vendor. So this ability to just move your application without moving the data itself, this separation of application and storage, is critical here. Just the ability to move wherever without being constrained by the data gravity problem, which we started off with, that's really critical. And that's kind of what you might have seen behind our multi-cloud messaging.
BI kinds of things, or even AI/ML, we're talking about lots and lots of data.
Even though you're sitting there and caching things and stuff like that, when you're talking
about accessing, I don't know, terabytes, petabytes of data, we're still talking considerable amounts. You know, the latency becomes an issue, the bandwidth becomes an issue as well. How do you deal with some
of those sorts of things? Obviously, caching can deal with some of the latency things, but
at some point, you actually have to go and grab the data
from wherever it is, right?
Definitely.
So, and this is one of the first questions that we always get from prospects.
It's almost like we hear that it sounds good.
And if it worked, we would definitely use it.
But there's always this doubt that it's too good to be true in some ways.
So for this, caching plays a huge role. And in addition to caching, the first thing I would say on the network side, with latency and bandwidth, you have to keep in mind that there is a selection of data which moves. So we are not blindly moving, or even in the context of caching, we are not blindly caching
everything under the hood.
Just taking the example of like a BI application,
which might be operating on years of data,
let's say three plus years of data,
petabytes of data,
which is residing in one region
and you're accessing it from another region.
So we are able to select what needs to be moved.
And the second thing is that
there's a lot of capabilities
around preloading, prefetching, these policies in the layer that we provide, which are able to
eliminate the latency effects. And just taking this a little further, you only take the hit the first time. And the access patterns of these applications, based on what we've observed, and this is something we've validated across a lot of our community and enterprise users, are such that caching is effective. Those are a few things. I mean, just taking an extreme example, we have a lot of people who are
splitting their machine learning or deep learning pipeline by, let's say, pre-processing in the
cloud, but they want to own the GPUs on-premises
and run the application on-premises, while the data itself resides in a cloud object store.
And if you just look at the access pattern of these training jobs, they fetch the data once, which takes the latency and bandwidth hit. But then once the data is available, you keep reading the same data, with a little bit of a difference, and you keep doing this in a loop. So that's where it really makes the solution more effective.
Yeah, yeah, yeah, exactly. So I mean, A, they're batched, and B, they do a number of epochs across
the same data and things of that nature, randomized, of course.
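A quick back-of-the-envelope model shows why this access pattern favors caching. All the numbers below are illustrative assumptions, not measurements from Alluxio deployments:

```python
# Rough model of a multi-epoch training job reading a remote dataset
# with and without a co-located cache. All numbers are illustrative.

dataset_gb = 10_000     # ~10 TB of training data
wan_gbps = 10           # on-prem <-> cloud link bandwidth
local_gbps = 100        # read bandwidth once data is cached locally
epochs = 50             # passes over the same data

wan_pass = dataset_gb * 8 / wan_gbps      # seconds per full pass over the WAN
local_pass = dataset_gb * 8 / local_gbps  # seconds per full cached pass

no_cache = epochs * wan_pass                       # every epoch pays the WAN cost
with_cache = wan_pass + (epochs - 1) * local_pass  # only the first epoch does

print(f"no cache:   {no_cache / 3600:5.1f} hours")
print(f"with cache: {with_cache / 3600:5.1f} hours")
print(f"speedup:    {no_cache / with_cache:.1f}x")
```

With these made-up numbers the cached run finishes roughly 8x faster, because only the first epoch crosses the slow link.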
So that brings up the question of how big your cache can be.
So if I'm front-ending a petabyte of data and allowing customers to access pretty much all of that data in, let's say, a sequential pattern, stuff like that, we're still talking, you probably need a significant amount of cache, right?
Yeah. I mean, it really depends on the situation, but in a few of our larger examples, it's not uncommon to have half a petabyte of cache, for example.
Oh, we're talking real stuff. Okay, now I understand.
In the larger scenarios.
I mean, obviously it really depends
on what the working set of your data is.
And half a petabyte of cache
doesn't mean you're spending
the same amount of money
on your analytics platform.
If you just look at the amount of spend that you have
on storage versus GPUs or compute,
the storage spend is kind of a very small percentage
of the entire spend.
So with that in mind,
one of the things that comes up as a question is observability and improvement, and what knobs we can turn to improve latency and throughput, etc. So I would imagine the target audience for a lot of this is not necessarily IT infrastructure people. They're application developers,
people who are born in the cloud
and extending capabilities.
How do you help those operators
identify the knobs they can turn on the network
or the cloud provider side,
whether it's increasing the cache size from a storage perspective,
resizing that virtual machine that's doing the caching, versus simply doing a direct connect, and the speed of that direct connect.
So it's really the visibility of the performance of the data mesh, I guess I'd call it.
And what sort of knobs,
or how do you tell the users in this environment,
you know, what knobs they can play with
and what knobs they can't, I guess?
That's again a great question.
And I wouldn't claim that we've solved the problem entirely,
but we have made significant progress,
which I can definitely share
because as you can imagine,
and you pointed out a few things,
how big should my network pipe be?
Figuring out the answers to these kinds of questions is not trivial by any means.
Just to take one specific example of a collaboration that we've done
with Meta, actually. Meta is one of the users of our community edition, and that's how our open source also plays into our company strategy, in that a lot of the innovation for these kinds of problems happens with the internet giants. And when we were describing the problem, we mentioned two things: how big should the cache be, and how big should my network pipe be? So for these kinds of things, what we've done on the community side, and some of these things we are productionizing as well these days, is what we call cache insights.
And the workloads themselves are not static; this is not a one-time exercise. You need these kinds of insights as the workload keeps changing over time, and you might keep adding more teams to your platform. So this needs to be a continuous exercise.
And on the observability front, we actually baked in functionality into our client itself, which is providing insights based
on the access pattern itself. So it has kind of a decision tree which spits out answers to questions like: if my cache size is increased from 1x to 2x, what would be the impact on the workload? It is able to answer that kind of question because
it is seeing a lot of the access pattern. It's seeing what is hitting the cache, what is not
hitting the cache. And that's kind of some of the things that we are doing in that direction.
I wouldn't say it's a completely solved problem yet, but that's a step in the right direction.
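As a toy illustration of that kind of what-if analysis, you can replay an access trace through simulated LRU caches of different sizes and compare hit ratios. Alluxio's actual cache insights feature is richer than this sketch, and the trace below is made up:

```python
from collections import OrderedDict

def lru_hit_ratio(trace, capacity):
    """Replay an access trace through a simulated LRU cache holding
    `capacity` items and return the observed hit ratio."""
    cache, hits = OrderedDict(), 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict least recently used
    return hits / len(trace)

# Hypothetical trace: a hot working set re-read in a loop, plus cold scans.
trace = [f"hot-{i}" for i in range(100)] * 20 + [f"cold-{i}" for i in range(2000)]

for cap in (100, 200, 400):  # "what happens if I double the cache?"
    print(f"capacity {cap:4d}: hit ratio {lru_hit_ratio(trace, cap):.1%}")
```

Because the simulator sees every hit and miss, it can answer "what would 2x the cache buy me?" for this trace without actually provisioning the larger cache, which is the same basic idea behind the insights Adit describes.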
So you're giving sort of like a predictive view of what the performance of the application
would be if I were to double the network pipe or double a cache size or something, whatever
the parameters are.
Are those the major ones that affect the performance of the data mesh?
I would say sizing in the context of Alluxio especially, I mean, sizing the cache and sizing the network are definitely the major factors. The only other thing that we haven't mentioned is how many cores you need, which is kind of proportional to the concurrency of the workload itself. But yeah, I feel like we've captured all kinds of resources, right? We talked about CPU for concurrency, storage for cache, and then network, which are the three major factors.
So, I mean, you know, we've been talking a lot about Kubernetes in this world here, greybeards though we are.
So, is it a Kubernetes solution?
Does it support multiple nodes for its client support?
Can you scale up the number of nodes?
Or is it just a single virtual machine or dual virtual machine with high availability?
Well, the high availability question is a different one.
But, you know, so I guess is it multi-node solution?
That's the first question.
Definitely.
So Alluxio is a scale-out distributed system. It can be deployed on Kubernetes, which is increasingly becoming the de facto way of deploying Alluxio. It wasn't always the case, and I think there's still a migration happening, but increasingly, every new user we come across uses Kubernetes as the way they deploy, manage, and operate Alluxio.
So I guess
going the other direction, can you consume
this as a SaaS?
It could be. It's not there yet. I mean, when I said it could be, I meant, would you see value if there was a SaaS service? Yes. But Alluxio is not a SaaS service yet.
That's not something that we provide as of now.
Okay.
Well, that's good.
So it supports Kubernetes clusters.
So your client software would be deployed as containers in the Kubernetes cluster?
Is that how it would work?
Or would it be a separate Kubernetes cluster with your client software, somehow connected to other Kubernetes clusters?
So we have both ways actually, and for different enterprises, both kinds make sense.
We do have the situation in which, just as an example, let's say I'm using Spark on Kubernetes, running ephemeral clusters in the cloud. There you would deploy Alluxio as a separate Kubernetes cluster, which has a different lifecycle from the ephemeral Spark clusters themselves, with Alluxio also on Kubernetes. The client itself is embedded inside Spark. Our client for Spark is not a separate process; it's not running anywhere else, but is a library embedded in Spark itself. And like I said, once you're using something like an S3 API, you don't even need any custom client. So out of the box, Spark or any of your applications which can talk the S3 interface can interface with Alluxio without any changes.
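To sketch what "without any changes" can look like in practice, a Spark job might simply point the stock S3A connector at an S3-compatible Alluxio endpoint. The endpoint address, port, and bucket name here are assumptions for illustration:

```python
from pyspark.sql import SparkSession

# Hypothetical endpoint for an S3-compatible Alluxio proxy; the address,
# port, and bucket layout are illustrative assumptions.
spark = (
    SparkSession.builder.appName("s3-via-alluxio")
    # Point the stock S3A connector at the proxy instead of AWS S3 ...
    .config("spark.hadoop.fs.s3a.endpoint", "http://alluxio-proxy:39999")
    # ... using path-style addressing, as is typical for S3-compatible stores.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# The application code itself is unchanged: it still just reads s3a:// paths.
df = spark.read.parquet("s3a://sales-data/2022/")
df.groupBy("region").count().show()
```

Only the endpoint configuration changes; the reads and the query logic are exactly what they would be against S3 directly.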
So I'm still trying to understand it.
So the client software itself ends up being deployed as part of the Spark functionality.
Are there other applications where that's the case?
Or, you know, Kafka, and there are probably a dozen different SQL and NoSQL databases out there, those sorts of things.
I mean, how would they deploy your client software?
Yeah, so maybe let's look at a different category of applications.
Let's say we are using something like PyTorch for machine learning, deep learning.
And in those scenarios, we provide something called a CSI driver on Kubernetes, a container storage interface driver, which makes Alluxio look like a local file system. So on the client side, we would install our CSI driver, so to say, which is able to interface with Alluxio. And then the applications themselves, the containers, will just talk to a mount point inside their containers, like a local file system.
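Here's a minimal sketch of what that looks like from the application's side, assuming a hypothetical mount path; the container just does ordinary file I/O:

```python
import os
from torch.utils.data import Dataset, DataLoader

# With the CSI driver, the Alluxio namespace appears as a plain directory
# inside the container. The mount path below is a made-up example.
MOUNT = "/mnt/alluxio/training-data"

class FileBytesDataset(Dataset):
    """Reads raw files through the mount point like any local file system."""

    def __init__(self, root):
        self.paths = [os.path.join(root, name) for name in sorted(os.listdir(root))]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as f:
            return f.read()  # decode/transform as your pipeline requires

loader = DataLoader(FileBytesDataset(MOUNT), batch_size=32, num_workers=4)
for batch in loader:  # each batch is a list of raw byte strings
    pass  # training step goes here
```

Nothing in the training code knows or cares that the "directory" is actually backed by a remote object store.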
What I hear is, there isn't a fixed way to deploy this. So if I have a container app and I don't want to deal with the networking of making external calls from the cluster to a cluster based on S3 or a different mount or whatever, I can build that app with the Alluxio cluster, or Alluxio nodes, within that cluster.
So if that's best for my application operations design, I can do that.
If I want the data to live independent of the app lifecycle or the app's instances,
then I can build a dedicated cluster and just simply make S3 calls to that cluster.
So it really depends on the application and the kind of application.
And whatever my operations are.
So if I'm a data team and I'm providing data to multiple applications in a public cloud,
then I build the cluster and it'd be independent of the individual app cluster.
So even, we were kind of focusing on Kubernetes, but it's not unique to Kubernetes. I could have AWS services. I could have GPU ML/AI instances running against this data hosted
in this cluster. As a data service provider, I'm just
managing the data independent of the applications and
just providing,
you know,
centralized caching and capability for multiple teams and applications.
Absolutely.
And we actually published a case study of an organization,
Expedia actually,
which is doing precisely that.
So this is something we published a couple of weeks back.
So that's why it's fresh in my memory. But they're using different services,
like they're using different variants of Spark and Trino,
open source flavors,
but also services like a Databricks or a Starburst in AWS too,
with a dedicated Alluxio cluster, as you were describing.
That brings up a question now. In this sort of solution, do you support,
let's say, multiple locations for the source data to the same target?
So let's say I've got multiple on-premise locations throughout America for high availability or something like that.
Can I have my application sitting in GCP talk to all three locations?
I mean, it might be separate mount points, I guess.
Is that in the configuration, I suppose?
Exactly.
And we do have a lot of people who are deploying Alluxio in that way.
So we provide something called a namespace, which looks like an object and file namespace.
And precisely what you said, we would have different mount points for the different data sources.
So it's essentially mapping a section of the Alluxio namespace to a different data source.
Yeah, we used to call that kind of thing a global file system.
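As an illustration of the idea, here's a hypothetical layout of such a unified namespace as seen through a POSIX mount. The mount names and backing sources are made-up examples:

```python
import os

# Hypothetical unified namespace seen through a POSIX mount. Each top-level
# directory is a mount point backed by a different source -- the paths and
# sources below are made-up examples:
#
#   /mnt/alluxio/warehouse -> s3://analytics-bucket/warehouse   (AWS)
#   /mnt/alluxio/logs      -> hdfs://onprem-nn:9000/logs        (on-prem)
#   /mnt/alluxio/archive   -> gs://cold-archive/2019            (GCP)
#
# An application in GCP just walks one tree; the data layer routes each
# subtree to the right system and region underneath.
ROOT = "/mnt/alluxio"
for mount in sorted(os.listdir(ROOT)):
    path = os.path.join(ROOT, mount)
    print(f"{mount}: {len(os.listdir(path))} entries")
```

The application never sees three systems in three regions; it sees one namespace with three directories.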
I also noticed on your website that you support different vendor storage.
I noticed NetApp and Dell.
I think MinIO is there as well.
Yeah, we support a huge variety of different stores on the south side. I would say the most common ones,
actually the most common protocols,
one is the S3 protocol,
which is extremely common,
followed by the protocols by the major cloud vendors
like GCS and ADLS from Azure,
but also HDFS for organizations who still have HDFS around.
For us, speaking to a MinIO or speaking to an Amazon S3 or speaking to a Cloudian, functionality-wise, it's the same.
Obviously, once you get into the operations, there are differences of how you would tune the system.
But functionality-wise, it's the same driver that we use to speak to different kinds of storage systems on the south side of Alluxio.
Yeah, so let's say a NetApp solution would have SMB3, it would have NFS version 3, maybe version 4, it would potentially have S3, those types of protocols.
And are you only supporting S3 in those sorts of environments?
No, we also do support a local file system interface on our south side.
So anything which speaks POSIX pretty much, we would also support that.
So since you mentioned NetApp, we actually collaborated with NetApp recently over the last year.
And they didn't even mention us in their earnings report, which came out a couple of weeks ago. We have heavily collaborated with them on the S3 front, because especially for the kinds of workloads and applications that are our market, these data-driven applications, the S3 interface is the more popular one compared to some of the other kinds of protocols that you mentioned.
How is something like your enterprise solution priced?
Yeah, our enterprise solution, right now, we price it based on the amount of resources that you allocate to Alluxio. So it's very similar to other vendors in our space, in which you price based on how much CPU and how much cache storage you have allocated to Alluxio. And those parameters are what we charge on.
The primary way in which we sell our software is still annual licenses. So based on the resources, we would give you a price for how many resource hours you can use across the year, and you would get into an annual contract with Alluxio.
So I would have thought that you might have the source data size
as being a component of the price.
Like if I wanted to take a petabyte data lake, for instance, sitting on my own computer systems, and I want to be able to access it through Azure or something like that, the client is going to take, you know, cache and storage and networking and EC2 instances, or whatever the counterpart is for Azure. But, you know, having a petabyte of storage under accessibility, I guess, that could be one of the components. But you're not doing that. It's really the amount of resources, the compute, storage, and networking resources consumed by the client wherever the client's deployed.
Exactly.
So in the scenario that you described, if you have like a couple of petabytes of data, but you only end up accessing half a petabyte, we wouldn't charge you
for everything. So like the second factor that I said, it's more like, at any given point in time, how much data would you be accessing? It doesn't matter if you have tons and tons of data.
It's kind of a working set measure, almost.
Exactly. It is exactly a working set measure. And I mean, generally this is agreeable to customers as well, because you don't want to price based on something that they merely have. You want to price based on the value they're getting out of Alluxio. If they're not accessing a lot of data, which is there for archival or historical purposes, then they're not really getting anything out of Alluxio. So why should Alluxio charge for that?
Yeah, I guess like a sample use case or how you value this would be if I had a bunch of ERP data that was in my data warehouse sitting on-prem,
and I go to ask that data warehouse, that traditional data warehouse,
a business logic question,
and I just don't have the CPU or capability to answer that question on-prem,
I deploy this solution to the cloud where I do have the CPU and TPUs to answer that question.
And at the end of the day, it should be kind of this thing that I can turn on and off to say,
instead of building, you know, SAP HANA solution or Spark solution on-prem, I can use it as needed
in the cloud and I should only pay by the drip.
And that drip is how much, how quickly can I get that business answer versus how quickly
I could have gotten it on prem.
Well, that's a great point.
I asked a question about HA or high availability earlier.
I'm assuming your multi-node solution uses those sorts of capabilities to support high availability. Is that correct?
Yeah, definitely. We do support high availability. For the component of Alluxio which is responsible for managing the metadata across the system, we kind of have a replicated state machine.
We use certain libraries for consensus,
and we are able to make sure that if any one node goes down,
we are still providing highly available access to data.
And the other thing to really note is that, I talked about the metadata portion of it, but if you talk about the data itself and you terminate Alluxio completely, this is one of the things which is core to our philosophy: whether Alluxio is there or not, you should still be able to access the data. So even if Alluxio is terminated and you lose data cached in Alluxio, you can always recover by accessing the underlying source directly.
There is this question about writing. I mean, obviously reading for BI
and machine learning is probably the predominant access. But if I were to, say, create an object using Alluxio, create it on-prem while I'm sitting in GCP in this case, are you able to create files or objects with Alluxio through the client, or is that not supported?
Yes, yes, yes. We are a read-write solution.
That brings up a lot of potential conflicts.
You know, where the data is at any instant in time.
Is it available at the source location or wherever you're actually storing it? How does it get there? How often is it updated? Those sorts of things.
So, I mean, there's a whole bunch of data integrity issues with respect to supporting this sort of proxy-write kind of thing across, you know, multiple clouds, right?
Multiple on-premise locations.
Wherever the client software is running,
you could potentially have access to an object bucket.
No, definitely. It's not an easy problem, and something that we've built up over the years. So for the write path itself, we have different policies or different ways of writing in certain scenarios. Let's say you're doing the computation in the cloud. What could happen is, say there's certain analysis that I want to do as an analyst, but when I'm writing, I'm writing it back to a bucket that I own, which doesn't really have the consistency problem that you described. But in other scenarios, I want the write to be written back to on-prem and consumed by
other applications running on-prem.
So we have different ways, or different policies, in Alluxio for when data should propagate from Alluxio to the underlying store. Should it go synchronously? Should it go asynchronously? And on top of that, once you're operating in these multiple environments, we have sophisticated mechanisms of synchronization: should the update be synchronized immediately, or can I bear eventual consistency, a lag of a few seconds?
If you look at some of the more transactional workloads on top these days, especially when you're looking at table formats like an Iceberg or a Hudi, you need somewhat different semantics operating in multiple environments compared to if you were just operating on, let's say, a raw Parquet file, where the semantics are looser, so an eventual-consistency kind of policy works better in those scenarios.
All right, well, this has been great. Keith, any
last questions for Adit?
No, not that we have time for. There's a ton; I think we
could spend another hour at least
talking through some of this.
Yeah, yeah, yeah. Adit,
anything you'd like to say to our listening audience
before we close? Yeah, maybe the
only thing I would say is that if you are a large enterprise, I would always encourage you to plan for agility.
Plan to be able to move or to make your applications reside where they should be
without really caring about the data gravity problem
because there are solutions for that, but always plan for agility.
Maybe that's the last thing I would say.
Well, this has been great.
Thank you very much for being on our show today.
Thank you for having me.
Until next time.
Next time, we will talk to another system storage technology person.
Any questions you want us to ask, please let us know.
And if you enjoy our podcast, tell your friends about it.
Please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out. Thank you.