The Data Stack Show - 180: Data Observability and AI for Data Operations Featuring Kunal Agarwal of Unravel Data
Episode Date: March 6, 2024

Highlights from this week’s conversation include:
- The evolution of data operations (1:13)
- Unravel's role in simplifying data operations (2:17)
- Kunal’s journey from fashion to enterprise data management (5:23)
- The Unravel platform and its components (10:08)
- Challenges in data operations at scale (16:34)
- Users of Unravel within an organization (22:32)
- Calculating ROI on data products (25:55)
- Understanding the cost of data operations (27:01)
- Measuring productivity and reliability (30:59)
- Diversity of technologies in data operations (34:52)
- Efficiency in cost management (44:15)
- Implementing observability in AI (47:55)
- Challenges of AI adoption (50:17)
- Final thoughts and takeaways (51:36)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Hi, Data Stack Show listeners. I'm Pete Soderling, and I'd like to personally invite you to Data
Council Austin this March 26 to 28, where I'll play host to hundreds of attendees,
100 plus top speakers, and dozens of hot startups in the cutting edge of data science,
engineering, and AI. If you're sick and tired of salesy data conferences like I was,
you'll understand exactly why I started Data Council and how it's become known for being the best vendor-neutral, no BS, technical data conference around. The community that attends
Data Council are some of the smartest founders, data engineers, and scientists, CTOs, heads of
data, lead engineers, investors, and community organizers who are all working together to build
the future of data and AI. And as a listener to the Data Stack Show, you can join us at the event at a
special price. Get 20% discount off tickets by using promo code DATASTACK20. That's DATASTACK20.
But don't just take my word that it's the best data event out there. Our attendees refer to
Data Council as Spring Break for Data Geeks. So come on down to Austin and join us for an amazing time with the data community.
I can't wait to see you there.
Welcome to the Data Stack Show. Each week we explore the world of data by talking to the
people shaping its future. You'll learn about new data technology and trends and how data teams and
processes are run at top companies. The Data Stack Show is brought to you by Rudderstack,
the CDP for developers. You can learn more at rudderstack.com. We are here with Kunal from Unravel Data.
Kunal, thanks for spending a few minutes with us today.
Eric, Costas, thank you so much for having me here.
All right, give us your background.
How did you get into data?
And what are you doing today at Unravel?
Yeah, so Kunal Agarwal, founder and CEO of Unravel Data, which I started with my co-founder Shivnath Babu, who is a professor of computer science at Duke University.
We both started this company to simplify data operations.
We feel data engineers and data teams spend too much time firefighting issues rather than being productive on the data stack.
And we wanted to automate and simplify some of that.
Kunal, I really would like to chat about
how data operations have changed in the past 10 years.
It's extremely interesting that you've seen this whole, from the Hadoop days up to today,
all the changes that have happened.
And I have a feeling that the complexity around data operations has exploded, right?
Especially with having pretty much every one or two years new use cases around data coming in, right?
So even, let's say, observability in data.
What does it mean?
What did it mean five years ago, and what does it mean today, when we also have AI in the mix, for example, right?
So I'd love to get more into this journey, how things have changed, and what it means today to operate, to be an operator around data.
And of course, learn about Unravel and how it helps in that.
What about you?
What are like some topics that you're excited about?
Yeah, no, of course, there's never a dull moment in the life of a data team member,
especially for the last 10 years.
So we've gone from doing things with Hadoop as one Swiss Army knife, if you will,
to having a multi-system stack now, which used to run on-prem and is now primarily running on the cloud.
That's a mega change that's happened.
The other is we've gone from doing these batch workloads with ETL to now doing real-time
or near real-time workloads in production and not just as a science project.
And then we've gone from doing these BI, business intelligence, or just advanced analytics
workloads to now doing machine learning and AI in production. So if you're a part of a data team
as a data engineer or a data scientist or a data analyst, you've had to keep up with the demand
of your business and also had to keep retooling and reskilling yourself on how do you work on
a MapReduce-based system
to now a BigQuery and a Snowflake system.
It's incredible, the pace of change and the rate at which things are evolving in this ecosystem, right?
So that's a very exciting part for us because what Unravel is ultimately helping to do is
to simplify how these data engineers or data analysts are creating their applications, how they're making sure that these applications are reliable, that they work on time every time, and that they're able to scale in a very efficient manner.
It's not a linear scale in dollars versus productivity.
Can we bend the cost curve as these environments are scaling up so that they're starting to get more bang for their investments?
And as we now see, the most exciting thing, it's actually here right now, not even the near future is AI
and how those workloads are changing businesses and turning industries upside down.
So it's a really powerful industry to be a part of and really exciting time to be a part of this
industry, but it's not for the faint hearted. It's for people who are up for a challenge, who like change,
who like evolving,
who like to try out new things.
And that's what makes it exciting overall
for everybody in this industry.
And I'm sure you've had the same experience
too, Costas.
Yeah, 100%.
I can't wait to get deeper into all that.
Eric, what do you think?
Let's dive in.
Let's do it.
You know, I'm so excited to chat with you today and dig into all things data ops.
But your story actually, as a tech founder, started in the fashion industry.
You know, and you've come a long way in enterprise data ops management.
So go back to the beginning. How did you start in the fashion industry as a tech founder?
You know, I just spent a lot of time trying to figure out what to wear every day. I'm sure we all did, and we still don't look that good, do we? It came from an actual frustration of, you know, why don't we have something like we have,
you know, for songs that recommend what you should be listening to. You're not always
thinking about the exact song you want to listen to. It just shows up and it's the right song at
the right moment. So we decided to create an algorithm that helps you decide what to wear
based on a lot of different factors around where you are,
what the weather is like, what your friends are wearing, and then it picks out stuff that you
actually have in your wardrobe versus things that you should be getting in your wardrobe.
It was exciting, but I realized that B2B enterprise software is where I have more
experience and more of a liking. But if you break down even that fashion experience,
it's really a, call it a big data model that had a recommendation engine running on top of it,
based on a lot of data that helps you connect the dots and figure out what you should be wearing.
But that experience, and I was consulting with Oracle products back in the day, working with large enterprises.
I started to get the first exposure to technologies like Hadoop,
which obviously was very nascent.
We're talking back in 2012, 2013 timeframe.
It was very powerful.
You could get a lot of large-scale processing done
for a really cheap price because it's open-source software.
It could run on commodity hardware, unlike Vertica or Teradata back in the day, which cost
millions of dollars. But we realized that it needed to be a complete product. It needed to
not have rough edges. It needed to be simple, intuitive for more users to get on the platform
and start to use this powerful technology.
So that's when I met my co-founder, Shivnath, who was at Duke University.
And we figured that if we were able to simplify running applications in a high-performance way and make that automated, then that would reduce the amount of toil a data engineer spends in getting
their applications into production.
And that was the hypothesis that we started Unravel with.
And then since, we've actually extended the platform to obviously continue to focus on
performance and reliability, but then also start to think about efficiency and cost.
And then as Kostas was talking about earlier, the evolution has also led us to make sure
that we are able to support all technologies that data teams are using.
Back in the day, it was just one technology called Hadoop that everybody used.
Now, it's a whole zoo of animals, all complicated names, but really powerful stuff.
So we want to give users a choice and bring any technology that they're using, and then be able to get that same quality of service and an efficient way to go and scale your environment, right?
That's really a promise to the customers.
But yeah, it's been a fun journey from the fashion days to now, Eric. Yeah.
Before we jump into Unravel, I do have to ask: what was the most surprising thing you learned about the fashion industry, or even fashion consumers, with your diving into that world?
You know, interestingly enough, we learned that men engaged more than women did.
Yeah, much higher than anticipated and, you know, marginally higher than certain categories of women in different demographies.
You know, when you bring it down by regions and age, there were some men that were participating
and being more active about this than women were.
And I think the reason for that is women have so many other outlets
to discuss fashion and men did not.
And this became one of those places where they would actually engage
with and understand like, oh, what are my choices for,
you know, where I'm going to sit at.
But then you also have men who go all the other way,
like the Zuckerbergs and, you know, the Steve Jobs of the world.
They just have a uniform for that every day.
And, you know, come to think of it,
that may just be a better time saving way than to...
Yeah, yeah, totally.
But that was definitely, you know, interesting and insightful.
Yeah, maybe we're just much more clueless when it comes to fashion.
And so you kind of created this outlet where they could talk about it more.
Yeah, that's probably it.
Well, give us an overview of the Unravel platform.
There are multiple components here.
We talked about DataOps and maybe we can start just with a definition of DataOps.
How do you define DataOps?
Yeah. So it's rather simple. Think about all the stages your data pipeline or your code has
to go to get an outcome. All the code that you have to write, all the sequencing you have to do,
the infrastructure that it's running on, the services that it's touching.
It's a rather complicated tangling of wires, if you may.
That's the kind of visual that comes to mind.
And when something goes wrong or something's not behaving the way you're anticipating it to behave,
then you start to ask the questions of what's going on, why it's happening, and how do I
go and fix it?
And to answer those questions, you need this thing called observability, right?
That's the simplest way to think about it.
You need to understand everything that's happening inside to then be able to ask it questions.
So the Unravel platform at its center is an observability platform for data ops teams
and for data ops, really, that helps do a couple of things.
Number one, makes your applications highly performing.
So your business is depending on certain data pipelines
or certain AI models finishing correctly and on time,
otherwise revenues hurt or your products aren't advancing.
So Unravel makes sure that happens, that your
service level agreements internally and externally are met, which are called SLAs.
The second thing is, if you do have a reliability issue, then Unravel helps you troubleshoot that
and fix those issues in a proactive and automated fashion, which we'll dive into.
And then third is, nobody's running a small data environment these days
because every company is becoming a data company.
So when you've got them spending $100,000
to $1 million to $10 million
to the bigger company spending hundreds of millions of dollars,
you need to make sure that
you're doing it in an efficient manner.
And what we're seeing is
companies are wasting upwards of 30 to 40%
of the cloud bill by just doing wasteful things and inefficient things that they may not even be
aware of. There are some common things like keeping the tap on when you're brushing your teeth.
So you should be turning them off from something as mundane as that to writing more efficient code.
But there's a lot of, you know, efficiency to be gained out there.
So when we step back, we look at, hey, let's connect to everything.
So if a data team has 7, 12, 14 different components that they put together in their stack, Unravel connects and collects data from everywhere.
So we know absolutely everything that's going on.
And then collect data from all the layers of the stack,
horizontally as well as vertically.
So from your code all the way down to infrastructure,
see everything, gather everything, measure everything.
And once all this data is inside the unravel platform or our service,
that's when we run our algorithms and our AI models on top of it
to automatically detect what the problems are, so you don't have to go hunting for them.
Tell you why it's happening, so you don't have to do the cross-connection; it's connecting
the dots for you so you don't have to go understand why something happened. And then,
in some cases, give you an automatic resolution.
And in some cases, give you a guided remedy where it's not possible to automate things.
But at least tell you what to go and do to go and get out of this particular issue so that it stops the trial and error that's going on in your head.
Maybe I should try this out.
Maybe I should try that out.
You don't have to do that anymore. And what we've seen is if you approach it in this way, then you can save several hours per problem per engineer inside a company, which ultimately manifests itself in better productivity and efficiency rates across your organization, but also in the efficiency of your infrastructure.
But more importantly, you can now start to depend on your
data outcomes. Companies are betting their reputation and their money on data outcomes.
And if it doesn't work half the time, then it's useless. Now you can stand behind it and say,
you know what, this thing that we're launching, this recommendation engine that we're creating,
or this fraud prevention app that we're launching, it will work on time every time.
And that's when companies can start to confidently invest the second wave
of AI or any other applications that they may be creating out there.
Yeah, it makes total sense. I wanted to get into the analogy. I loved turning the tap off
while you're brushing your teeth. It made me think about something you said earlier, which was you started in data back
when, in terms of big data, Hadoop was really the main game in town.
And it made me think back to business intelligence originally was a finance function, right?
A lot of times it reported up to the CFO, right?
Which in many ways makes a lot of sense.
But then, you know, there are a couple of dynamics that happen.
Number one, the cost of storage just starts to plummet.
Storage and compute separate with this big migration to the cloud.
And so all of a sudden, even just that is this massive workflow optimization, right?
Wow, like, you know, we can be so much more efficient
than we used to.
We can run way more queries, et cetera.
Pipeline technology advanced significantly.
And so it's way easier to move data around
and cheaper to move data around, you know?
And so free-for-all is probably too strong of a term, but, you know, it's like, well, yeah, I mean, let's just load the data
warehouse, load the data lake. We can do all sorts of analytics, self-serve analytics, you know,
machine learning, all this sort of stuff. And now it's sort of, we're coming full circle, right?
And like, when you get the compute bill at the end of the month,
finance is like, okay, who's,
we got to figure out who's, you know,
who owes what on this big compute bill.
Can you kind of talk like, talk through that?
Cause you've lived through that story
and unravels lived through that story in many ways.
Yeah, no, you're absolutely right, Eric.
So if you break it down, there's three things that have increased, right? So the number of use cases for data has increased. We've gone
from this, hey, it's good for financial reporting and it's good for understanding our sales data
to now, you know, we want to create brand new products. We want to improve our operations,
right? So the use case for data has increased.
The data sets that we're capturing has increased.
We were only capturing a subset of our financial data and our sales data.
Now we know everything about the customer.
Right, right, right.
Yeah, it was just transactions mainly, and now it's every digital touchpoint.
Exactly.
And the users of the data technologies have increased as well.
Earlier, this was limited to the hardcore engineers.
Maybe the financial analysts who knew how to switch from Excel pivot tables to maybe getting into a more powerful system.
But that's really it.
Now, product guys are on it.
Marketing guys are on it.
Every department of the company wants to get on it.
Legal teams want to get on it, right?
So the number of people jumping on these platforms has increased.
By the way, all those three things are good things to happen because you can get some
great outcomes with data.
But back to your point around, this does become a mess as companies start to scale this out.
Because the promise that we had heard that cloud will solve all problems and world hunger
is actually not true, even though cloud is better suited for data analytics.
Absolutely.
It's got limitless compute, limitless storage that you can definitely scale out your systems
for sure.
But as companies started to democratize data access and give it away to a large number of audiences,
there started to be spurts of people using data analytics
in a fashion that it should not be used,
knowingly or unknowingly.
And a big part of that is the range of skills
that people have in these different departments.
Not everybody is an expert on the data systems.
And you may have, you know,
on the other end of the spectrum,
some people that are, you know,
who just know drag-and-drop tools,
maybe some SQL.
And unknowingly what's happening with them
is they're creating
inefficiencies in code, inefficiencies in the way these pipelines are being scheduled and run,
or just how these AI models are being used. So I'll give an example. Again, like a mundane one,
like turning the tap off while you're brushing your teeth: it could be a select star that a novice user does on mega tables.
And this is the case that I hear about from our customers every week.
And it racks up hundreds of thousands of dollars of bills.
And you just scratch your head like, who did that?
Why did they do that?
Sometimes a select star may be what you need to do on a table.
But then who did that? Why did you do that? How do we control it from happening next time? Is the question people start asking once
they're shocked. That's just one simple example. There are a hundred other ways in which people
can creep up on these inefficiencies. Other ones are, for example, in architectures that are not
serverless or even in serverless architectures, you have to understand what's the size of your warehouse
or what's the size of your containers.
And if you give people a small, medium, large, extra large,
guess what?
Everybody's choosing extra large.
Right.
Of course.
The most important, biggest, baddest workloads
compared to the next guy.
Nobody ever chooses small, maybe medium.
But then, you know, you run a profiler on that and you understand that you're only using
10% of the resources.
So it could have been one tenth of your cost, but people don't know that.
So, you know, I can go on.
There's so many of these inefficiencies that happen all the time.
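The warehouse-sizing arithmetic in that example is easy to sketch. Here is a hypothetical sanity check; the size tiers, costs, and the `right_size` helper are illustrative assumptions, not Unravel's actual model:

```python
# Illustrative warehouse right-sizing check (not Unravel's actual logic).
# If a profiler shows you only use 10% of an extra-large warehouse,
# a smaller tier can serve the same workload at a fraction of the cost.

# Hypothetical hourly costs: each tier doubles in cost and capacity.
TIER_COST = {"small": 2.0, "medium": 4.0, "large": 8.0, "xlarge": 16.0}
TIER_CAPACITY = {"small": 1, "medium": 2, "large": 4, "xlarge": 8}  # small = 1x

def right_size(current_tier: str, utilization: float) -> tuple[str, float]:
    """Pick the cheapest tier whose capacity covers observed peak usage."""
    needed = TIER_CAPACITY[current_tier] * utilization
    for tier in ("small", "medium", "large", "xlarge"):
        if TIER_CAPACITY[tier] >= needed:
            savings = 1 - TIER_COST[tier] / TIER_COST[current_tier]
            return tier, savings

# An extra-large warehouse running at 10% utilization:
tier, savings = right_size("xlarge", 0.10)
print(tier, f"{savings:.0%}")  # small 88%
```

That is roughly the "you're only using 10% of the resources, so it could have been one tenth of your cost" observation, expressed as a lookup.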
But even before you go to improving the system,
just understanding who's spending what
becomes a critical issue.
So companies have a policy of showback now
as the enterprises start to increase their usage.
Showback really is, hey, look,
we spent a million dollars last month.
We've got five departments.
Did you all spend $200,000?
No, this person spent $100,000, this person spent $300,000, et cetera, et cetera. So can we please break down and understand
who's spending what? And then companies are also going to a chargeback where the group actually
has to pay for what they spent out of that million dollars and that's how we're going to go and pay
this particular bill. So it's an interesting evolution because on-prem, you didn't have to think about that
because of a set of resources. So the worst you could do is Eric could steal from Costas
and Costas workloads would stop, but the bill would still be the same because it's hardware
that you are appreciating over time. Yeah.
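The showback arithmetic Kunal walks through reduces to a proportional allocation. A toy sketch, where the department names and usage numbers are hypothetical stand-ins for his $1M example:

```python
# Toy showback/chargeback breakdown: attribute a shared cloud bill to the
# departments that actually incurred it, rather than splitting it evenly.

def showback(total_bill: float, usage_by_dept: dict[str, float]) -> dict[str, float]:
    """Allocate the bill in proportion to each department's measured usage."""
    total_usage = sum(usage_by_dept.values())
    return {dept: total_bill * usage / total_usage
            for dept, usage in usage_by_dept.items()}

# Five departments, a $1M month. An even split would say $200K each,
# but measured usage tells a different story:
usage = {"marketing": 100, "product": 300, "finance": 50, "ml": 450, "legal": 100}
bill = showback(1_000_000, usage)
print(bill["ml"])  # 450000.0
```

Chargeback is the same breakdown with teeth: each group actually pays its allocated share instead of just seeing it on a report.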
Yeah, I mean, this is a fascinating conversation. And I agree, the showback and then the chargeback dynamic is, you know, super interesting. Can you help break down, you know, we're talking about larger companies here, where a smaller company isn't necessarily going to face these issues, because their workloads are fairly simple, right? And even their stack is simpler.
But at scale, when these things really become a problem, can you talk about who is the sort of
owner of unravel within an organization? Or who is that group? And can you just kind of break down
what are their day-to-day problems that they face? You know, I'm sure we have some of those people in our audience, and then some people who work with data in an org, but maybe they're not as familiar with that person's sort of day-to-day issues relative to this infrastructure.
Yeah, so multiple people in the data teams use Unravel. Let's start with the data engineers, because they're always near and dear to our heart.
So Unravel helps data engineers in a couple of ways.
Number one, when you are developing your application,
removing errors, removing bottlenecks from your code
is something that Unravel helps you out with automatically.
Putting that code into production
and making sure that it's running there,
meeting its SLAs,
every time, is something that Unravel also helps out with, using its AI engine to understand deviations in performance, what happened, whether something new was introduced.
And if you don't have something like Unravel, then data engineers get called into these production
issues or into the cycle that,
you know, moves their applications from dev to production when somebody's doing a code check or a code review, for example, right? The other side that we help data engineers out with is, again,
when it's running in a production system, your boss, the head of product or the head of business
unit may say, I need you to go and cut this cost down.
I need you to run this more efficiently.
And now these data engineers have to go and hunt for ways
in which they can reduce their costs,
which is also something that Unravel can help you automate
and tell you in plain English,
look, go and do this thing.
You will not sacrifice performance,
but your cost is going to improve by so much, right?
So that's one group of people that use Unravel on an everyday basis.
The other group of people are the centralized group of leaders and operators who are responsible
for making sure that this environment serves the purpose for every business unit, meaning
they may get a Databricks environment, they may get a Snowflake environment with some Kafka, with some Starburst, right? And they are providing this to their
business and saying, hey, use the stack and it will run well. So that's the other group that
uses Unravel to make sure that both the performance, reliability, and the cost part are taken
care of. And then this group is able to also set budgets and guardrails
for all these subgroups of products that are using this platform so that they can
proactively understand how the cost trends are going towards this month. So not surprised at
the end of the month. And then if there's any misuse or rogue usage,
you're catching it live rather than catching it in retrospect, where you've already burned
through the dollars, and you're able to fix that problem in real time. And ultimately,
when companies mature to becoming a true data-driven organization, where they're actually
generating revenue from their data applications and products,
then we have business leaders using our product
to go and understand what's the ROI
and what are the margins of running these data endeavors
to then ultimately go and generate revenue for the company
and can we improve that in a certain way.
So it really starts to go bottoms up
where people are running, you know, applications
all the way up to how those applications are serving the business.
Yeah.
One quick question.
And I know Kostas has a ton of questions, but I want to dig into the ROI question, because
you've mentioned a couple of times, you know, sort of data output or data product, right?
And I'm just going to pull an example, and tell me if this is a bad one, and maybe you can pick one. But I think about, you know, TurboTax. It's their system that allows
end users to submit their information, you know, and essentially file a tax return, right?
That's a hugely intensive data operation, because it's ingesting all this information,
it's running it through
all sorts of queries. I'm sure there's machine learning going on in the background. It has to
check it against all sorts of regulations. I mean, that thing is probably a gnarly app,
and it requires a huge amount of data infrastructure. And so can you walk us
through how would you think about calculating the ROI on that product
from a data and infrastructure standpoint?
Such a good example, Eric.
So you can take Intuit TurboTax that's ultimately costing you $10 a pop, right?
And then you walk backwards and you have to really understand what the unit cost of serving
just Eric is, to understand the margins you're making on that product.
So any data product has multiple stages.
You have to collect the data.
You have to cleanse the data.
Then you're running some algorithms on top of it.
And then you're getting some outcomes.
And I'm making it very simple.
All the engineers listening to me on this podcast probably like, yeah, that's like a hundred steps
for us. That's what they presented
in the board meeting.
Exactly. Especially the guys
at Intuit are like, Kunal, you're making this sound way too
simple. It's probably a
hundred nested
workflows, right? Running something
on Airflow, something running
on Spark.
And I know Intuit now is on Amazon.
They did this mega migration from on-prem, again, as part of the evolution.
So anyhow, what you need to ultimately do is understand what is the cost of one unit
of work.
So if you're running it on Spark, what's the cost of that Spark job?
And how many Spark jobs do you have?
And then understand that end-to-end from source to outcome, what's that cost of all
those different stages and multiple pipelines put together really is.
And then you have to think about what is the optimized cost version of doing that?
So there's a cost.
It's costing me $10,000 to run this pipeline.
But if I understand there's room for optimization,
then I can make this pipeline run for $6,000, for example, right?
Now, how many users can that $6,000 serve? Say it can serve 1,000 people.
Great, it's six bucks a pop, right?
On the cost side.
And then you want to bend the cost curve as you're scaling up.
So if it's for 1,000 people, what does it look like for 10,000 people?
And the answer should not be linear.
And if it's 100,000 people, right?
And that's how you start to scale it out.
And then you understand how much margin you can get.
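The unit-economics walkthrough above is a few lines of arithmetic. A sketch using Kunal's hypothetical numbers ($10 a pop on the price side, a $10,000 pipeline optimized to $6,000, serving 1,000 people); the 10x scaling figure is an invented illustration of "bending the curve":

```python
# Back-of-envelope unit economics for a data product, using the
# hypothetical numbers from the conversation.

price_per_user = 10.00        # what the product charges, e.g. $10 a pop
pipeline_cost = 10_000.00     # end-to-end cost of one pipeline run
optimized_cost = 6_000.00     # same pipeline after removing inefficiencies
users_served = 1_000

cost_per_user = optimized_cost / users_served      # six bucks a pop
margin_per_user = price_per_user - cost_per_user

# "Bending the cost curve": serving 10x the users should cost less
# than 10x. Say 10,000 users cost $45,000 instead of a linear $60,000:
cost_at_10x = 45_000.00
linear_cost_at_10x = optimized_cost * 10
curve_bent = cost_at_10x < linear_cost_at_10x

print(cost_per_user, margin_per_user, curve_bent)  # 6.0 4.0 True
```

The point of doing this from the design phase is exactly the one Kunal makes: you want to know the product is not feasible before you have spent millions finding out.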
Now, that's a very advanced, mature company that is using Unravel's data to be able to
do that.
But what we're encouraging people to do is start to think about that from the get-go
because you don't want to run a full project and spend millions of dollars to then come
to the outcome that, you know what, this is actually not a feasible product or this is
not a feasible project for our business to even get into.
This is especially true, Eric, in the age of AI, as everybody wants to create for themselves
an AI outcome.
But then if you get LLMs off the shelf, you're spending about $3 to $10 million a year.
But if you create your own LLM, you're looking at $150 to $200 million of spend.
So really understanding how are you going to measure those costs? How are you
going to break them down? And then thinking about products and what it could actually mean,
super important from the design phase itself. And that comes back to our philosophy of
measure everything, right? It's a philosophy of bring all your data into a data lake.
Now that you've done that, start to measure every process from the get-go. So you at least know
how much this costs from the get-go
to then think about how you should be scaling this out. Yeah, makes total sense. Okay, Kostas,
one more question. Forgive me. So we're talking about infrastructure, right? But I'd love to know,
you know, and I'll use the example of sales and marketing costs. You usually measure,
well, you measure it in a ton of ways, right? But when you're measuring it from a finance perspective, you'll measure like the spend, you know, okay, so how much spend are we, you know, marketing spend do we have?
And then you have a fully loaded cost, which includes all of the headcount, all the commissions on the sales side, right? When it comes to data and the types of ROI you're talking about,
how are organizations thinking about the human capital aspect of it, right? Because it's not
like these systems just run themselves, at least now, maybe in the future. But you have people
who need to run these systems, right? And how do you think about that as part of the cost equation there?
It is one of the bigger parts of the cost equation, actually.
So even on the data side, just the data stack side, it's infrastructure, it's the data sets itself, it's the services that you're using.
You know, all of that stuff adds up to your total cost of the stack.
And then you've got the cost of the people. The way we think about the cost of the people is
thinking about a measurement, like a throughput measurement of what kind of productivity are you
getting from a class of people. That's what we've seen works best. The productivity or throughput
metric could be anything that's relevant for your organization. It could be how many data pipelines per team, per member of the team. It could be how many AI models,
how many new pipelines are you able to generate every month with your team. And then, you know,
you could also map that back to how many issues, how many problems, how much downtime did you have in your environment, and then start to see the productivity of your team across that.
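A throughput metric like the one Kunal describes can be sketched in a few lines. Everything below is a hypothetical example of the shape of the calculation, not Unravel's actual model:

```python
# Hedged sketch: a per-team throughput metric of the kind described above.
# All function names and figures are invented examples.

def throughput(pipelines_shipped, team_size, months):
    """New pipelines delivered per engineer per month."""
    return pipelines_shipped / (team_size * months)

def reliability_adjusted(pipelines_shipped, incidents, team_size, months):
    """Same metric, discounted by incidents: net useful output per engineer-month."""
    return max(pipelines_shipped - incidents, 0) / (team_size * months)

# Example: a 5-person team ships 30 pipelines in 3 months, with 6 incidents.
raw = throughput(30, 5, 3)               # 2.0 pipelines per engineer per month
net = reliability_adjusted(30, 6, 5, 3)  # 1.6 after discounting incident rework
```

Mapping issues and downtime back against raw output, as suggested above, is what turns a vanity count into a productivity signal.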
What we have seen, though, Eric, is that people's productivity is nowhere near where it should be.
Even getting half productivity, meaning four hours of productive time a day out of the eight hours a data engineer is working, is average right now. That's what you're getting. So people are spending half their time firefighting,
wasting time on troubleshooting, debugging, fixing problems, things are breaking,
trying to stand them back up. Things were working yesterday, today they're not.
It's a complicated piece of tech that these guys are running, and unfortunately they haven't had enough time to train themselves.
People running on Oracle systems have been masters of Oracle systems over 20 years.
People running Databricks, Snowflake, BigQuery have been running them for two years, three years at most.
So they haven't gone through those experiences and sorted this out.
So productivity will get better as more maturity happens and more experience of these data
teams, but the business cannot stop.
The businesses are running because the competitors are creating amazing data outcomes, and they
just need to get theirs out in the market as well.
And that's where automation around what we do with Unravel,
you don't have to be an expert. It tells you in plain English how to go and fix certain things.
So you could be a person who's coming straight from Teradata onto Snowflake,
and you would know Snowflake overnight. And if you had any issues, you wouldn't be spending four
hours a day doing that. It'd be a couple of clicks and a minute or two if it's not completely automated.
Okay, I have a question about reliability, especially in the environment,
like the mature environment that you have seen in the enterprise
and let's say more like purely data-driven companies, right?
My experience, especially with enterprise, because what is interesting with them is that they've been around long enough, right, to go through many different products for what they are trying to do. And what I've seen in practice is that usually technology does not get replaced immediately; you usually end up with pretty much everything running together.
I think if anyone could take a look into a big account of a Fortune 100 company, they
would probably see pretty much every possible vendor in there operating. How is reliability managed when you have so many different systems and so many
steps that the data has to go through? Let's start from a technology perspective
for now, because when we get into the people
aspect of it, it gets
even more complicated.
But how
have you seen things working there
with all this diversity
of technologies operating
together at the end?
You're right, Kostas.
The people side is hard
because no one person is an expert in all the systems in the stack.
And that's an inherent problem.
On the technology side, look, people are choosing different technologies for different use cases.
That's a reason why they have different stacks.
And the other reason is just compliance, that certain data cannot move to the cloud.
That's why they have an on-prem version and a cloud version.
And then the third, as Eric was pointing out, is with democratization and opening up the data stack to the company, people are kind of encouraged to go.
You're like, hey, if you want to go spin something up, go spin something up, right?
If you want to start a Snowflake cluster, start that out.
And before you knew it, you had these bursts of clusters here and there.
And then before you know it, the entire company started to use it.
All these technologies are very different.
There are similarities, which end at, hey, that's a SQL engine. But if you worked in Presto and Trino, the way you triage that versus triaging a Spark SQL application is completely different.
So reliability is something that has always been an issue
since the early days from MapReduce.
And the only way to solve that is to understand
what's happening under the hood.
And a lot of people just don't have that skill set.
Like you know how to drive a car, but you don't know how to fix a car.
It's the same thing.
And what Unravel does is attacks that problem head on by automating all the steps that somebody
would do in triaging.
So collecting logs, collecting metrics, connecting the dots between
all of these different causes and effects, and then bubbling up and saying, look, there could
be a hundred things that could be causing this problem today, but this is what it is, and this
is how you need to resolve it. So instead of just giving you a check engine light, imagine it was more descriptive, and you didn't even have to take it to a mechanic. We just say, hey, there's this problem in your wheel, get this fixed. And that would be a faster way to resolve it. So that's where we have seen the cloud fallacy, that, hey, the cloud is a no-ops or low-ops solution, actually fall flat, because bad code is bad code, right? It doesn't matter where you write it. So you can have the same experience and problems no matter which environment you're running on, if the underlying cause of those problems is similar across these different environments. So while the cloud has made some things easier, it's not a silver bullet that will resolve all your reliability problems itself.
And the way it manifests itself, and coming a little bit to the people side of the question, is it could be an internal or an external application that you're running.
If it's an external one, like your consumers are running on it, say you're doing an online banking app, and if that doesn't work, your customers can't use your services.
And if it's an internal one, then there are people who are waiting for that report, who
are waiting for that analysis that business decisions are getting held up for.
Each of them has an SLA.
And what Unravel helps you do is guarantee those SLAs.
So we've seen in companies where those SLAs were missed 10% of the time, 7% of the time,
sometimes 20% of the time. So we've gone from 80% SLA to about a 99% SLA attainment
for those kinds of different data applications, just because a system is looking over them
and making sure that problems are caught proactively.
And there is a fast solution to fixing that problem without it hairballing into an even bigger issue.
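The SLA attainment figure Kunal cites (from 80% to about 99%) is just the fraction of runs that met their deadline. A minimal sketch, with invented run records:

```python
# Hedged sketch: measuring SLA attainment for data applications, as in the
# 80% -> 99% example above. The run records and dates are made up.
from datetime import datetime

def sla_attainment(runs):
    """Fraction of runs that finished at or before their deadline."""
    met = sum(1 for finished, deadline in runs if finished <= deadline)
    return met / len(runs)

runs = [
    (datetime(2024, 3, 4, 8, 50), datetime(2024, 3, 4, 9, 0)),  # met
    (datetime(2024, 3, 5, 9, 10), datetime(2024, 3, 5, 9, 0)),  # missed
    (datetime(2024, 3, 6, 8, 59), datetime(2024, 3, 6, 9, 0)),  # met
    (datetime(2024, 3, 7, 8, 30), datetime(2024, 3, 7, 9, 0)),  # met
]
attainment = sla_attainment(runs)  # 0.75, i.e. 75% of SLAs met
```

Tracking this per application, rather than one global number, is what makes a miss actionable for the team that owns it.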
And, okay, you said Unravel guarantees the SLAs, but there's also the human factor here, right?
Like, at some point, someone needs to go there and fix something, right? So, how is this
working between the technology and the person who is on call that day?
How is this relationship working with Unravel?
Yeah. So, before Unravel,
you would get the problem,
you would be notified about the problem much later, because now it's visible to somebody. So Kostas did not get his report. Kostas is the CEO of a company, and now in a Monday morning meeting with his exec team he was not able to make the decisions he needed to make. 10 AM, this problem gets logged. Somebody on the data team gets
called. The person on the data team understands if it's a code level problem or infrastructure
level problem, and then tries to ping the relative teams. And by the way, there's a big fight that's
happening over here right now. There's a lot of finger pointing going on. Infrastructure guys are
saying it's a code problem. Code guys are saying it's an infrastructure problem. I'm sure we've
all been there. And then it turns out, okay, say we've identified that it's become a
code level issue. Then we try to find the data engineer who actually created that application
and wrote that piece of code, but then go and debug and dissect that. So as you can see,
this is a very involved process, lots of people in it, lots of time spent. And then this person is going to dig into logs,
check out a lot of metrics.
And by the way, each unit of work can have a 100-page log.
So you look at thousands and thousands of logs to go and understand what's happening.
So it's a very inefficient process, really.
This used to take several hours in man-hours,
which could actually be days in terms of clock time.
And then, you know, don't forget about the loss in productivity.
That even happens on the business side
because the applications aren't working properly, right?
With Unravel, because we are able to do the identification proactively,
you will firstly understand this problem before you see this
problem.
It's like, hey, this application is not going to finish on time, so the Monday morning report is not going to be generated on time.
We'll notify you about that when the app is running and then tell you what you need to
do to go and fix that.
Secondly, because it's root causing the problem, there's no more finger pointing.
It's like, look, today's issue is infrastructure.
Today's issue is code. Today's issue is code.
Today's issue is data layout or your services itself.
So you can pinpoint it and bring the team together on one side of the table, rather than being combative. And then it's giving you a guided remedy, or it's taking an action on your behalf.
So in the guided remedy, it'll tell you what to go and do.
So depending on your role and permission, you go and do those actions and fix or improve the reliability
and performance of this application. But then in a lot of cases, Unravel can also take the action
on your behalf. So you can complete the loop of doing the action as well and see the results. So a lot of times we see people wanting to,
as a simple example,
prevent any app from spending more than $10,000 on the cluster,
as an example.
So Unravel can take that action on your behalf
and stop this data pipeline or this machine learning model
as soon as it nears $9,000, for example, right?
So that you don't have to suffer about it
and then resolve this
problem reactively.
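A proactive cost guardrail of the kind just described can be sketched as a simple policy check. The 90% stop threshold, the alert threshold, and the action names here are illustrative assumptions, not Unravel's actual API:

```python
# Hedged sketch of a proactive cost guardrail like the one described above:
# stop a pipeline before it blows through its budget. Thresholds and action
# names are invented for illustration.

def guardrail_action(spend_so_far: float, budget: float,
                     stop_fraction: float = 0.9,
                     alert_fraction: float = 0.75) -> str:
    """Decide what to do with a running job given its spend so far."""
    if spend_so_far >= budget * stop_fraction:
        return "stop"    # halt before the hard limit is actually hit
    if spend_so_far >= budget * alert_fraction:
        return "alert"   # warn the owner while there is still headroom
    return "allow"

# A job nearing $9,000 of a $10,000 cap gets stopped, as in the example above.
action = guardrail_action(9_000, 10_000)  # "stop"
```

Evaluating this on every cost update, rather than at the end of the run, is what turns a reactive bill shock into a proactive stop.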
So what we've done is improve efficiency,
improve the productivity of this team,
and made it more like teamwork,
that everybody's on the same team rather than being
on different teams, because when problems
happen, that's where finger-pointing starts, so you want to avoid
that as well. Yeah, 100%.
Okay,
and if we switch now
to the cost
management,
again, you have a very unique perspective here
because you have seen things happening on the cloud,
but you also have seen how things
work on-prem, right?
And by the way, there are
cases where you have a hybrid solution,
especially, as we said, in the
enterprise. You might have
data systems running on their own data centers and also have parts of the workloads that are running
on the cloud. But let's say the economics of one and the other are very different.
When you have your own data center, you bought your own hardware, you have it there.
You can't really go and ask for more hardware.
That's probably going to take some time to become available.
And on the cloud, you have a completely different situation. At any time, you can pretty much request whatever you want, right? So the equations there, of trying to figure out what the cost is
when you operate these workloads is different.
Can you help us understand a little bit the differences there
and what it means to operate efficiently on-prem
and what it means to operate efficiently in the cloud?
Yeah, so when you think about on-prem costs,
you're thinking about cost per machine,
the fully loaded cost per machine.
So the hardware for getting that machine,
all the software and services you're going to run on that.
So what's your licensing cost for everything?
And then depending on the type of hardware
you try to depreciate that over three to five years,
straight line depreciation.
So if it costs you $30,000,
it's about $10,000 a year, right?
Just roughly.
On the cloud, obviously,
it's pay by the drink,
you know, 20 cents per hour
for running one machine.
And then, you know,
you keep adding more machines,
keeps adding up, obviously.
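The two cost models can be put side by side with a quick back-of-envelope calculation. The $30,000 machine, three-year straight-line depreciation, and $0.20/hour rate come from the conversation; real-world rates and utilization vary widely, so this only shows the mechanics:

```python
# Hedged sketch: straight-line on-prem depreciation vs pay-by-the-drink cloud,
# using the rough figures from the conversation.

def onprem_yearly(machine_cost, years=3):
    """Straight-line depreciation: the same charge every year of the asset's life."""
    return machine_cost / years

def cloud_yearly(rate_per_hour, hours_per_day):
    """Cloud cost for one machine running hours_per_day, every day, for a year."""
    return rate_per_hour * hours_per_day * 365

onprem = onprem_yearly(30_000)      # $10,000 a year, as in the example
always_on = cloud_yearly(0.20, 24)  # one machine left running 24/7
bursty = cloud_yearly(0.20, 4)      # the same machine used 4 hours a day
```

The `bursty` case is why seasonal or experimental workloads favor the cloud, while large, predictable, always-on fleets are where on-prem tends to win.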
So there's a lot of differences
in how people approach
both of these equations.
In some cases, people say, look, if you have predictable workloads, stuff that just needs to run every day, it's not going to change.
It's going to be the same way every day.
It's better and cheaper to run it on-prem.
That's what we've seen across the majority of the enterprises, especially for large scale workloads. And then if you have experimental workloads,
things that you may be just trying out,
or you've got seasonality in your environment,
you've got Friday, Saturday, Sunday workloads
are bigger than Monday, Tuesday, Wednesday workloads,
for example.
In any kind of situation like that,
having a more liquid environment that can scale up and down is a better use of resources as well as cost.
That's the primary difference.
The way to start thinking about the cloud cost in particular is nobody knows what it's going to be on day one. You can have some sort of an idea
if you break down your workloads into CPU and memory and just the basic units,
you're never going to be right. So it's always good to, again, measure everything from day one
so you can start to see the trends and patterns of these things. So by the end of month two,
month three, you at least have an idea of what this yearly
cost could be.
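The "measure from day one, project by month two or three" idea amounts to extrapolating a yearly bill from a few observed months. A rough sketch with made-up numbers, not a forecasting product:

```python
# Hedged sketch: projecting a yearly cloud bill from the first few months of
# measured spend, as suggested above. Two naive projections for comparison.

def run_rate_projection(monthly_spend):
    """Average of the observed months, extended to 12 months (assumes flat spend)."""
    return sum(monthly_spend) / len(monthly_spend) * 12

def trend_projection(monthly_spend):
    """Extend the month-over-month growth of the last two observed months."""
    growth = monthly_spend[-1] - monthly_spend[-2]
    last = monthly_spend[-1]
    remaining = 12 - len(monthly_spend)
    future = sum(last + growth * (i + 1) for i in range(remaining))
    return sum(monthly_spend) + future

spend = [8_000, 10_000, 12_000]    # months one through three (hypothetical)
flat = run_rate_projection(spend)  # assumes spend stays at the average
growing = trend_projection(spend)  # much higher, since spend is climbing
```

The gap between the two projections is itself the useful signal: it tells you how quickly spend is trending away from the run rate, and where guardrails are needed.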
And then start to put proactive guardrails to avoid exactly the problem that Kostas,
you were talking about, which is, hey, yeah, cloud has infinite scale, but do we want to
give people that power because you don't have infinite money?
And how do we put some sort of guardrails against that? Now, obviously,
looking at just the numbers only tells you part of the story. You've got to talk to your
team and understand what they're actually trying to do. In some cases, they may not even know
that they're creating these inefficiencies. In some cases, it may be that that's their actual use case.
And they're like, yeah, spend $100,000 on that query because it was doing this amazing thing
and we needed to run that way.
And then put the guardrails appropriately. But then people who are running a hybrid environment,
they're also using it in a unique way because they're thinking of, let's use the power that we have on-prem. And only when we need to burst workloads, only when we need to scale up workloads,
then we use the cloud. But then
everybody's got their own patterns and
anti-patterns and how they run these things, but these are
the most common ones that we see.
That's super interesting.
One last question for me, because we're close
to the end here.
How have things changed
because of AI, and I'm
talking about data observability
here, and I don't necessarily care that much about how it changes in terms of helping
someone to perform observability, but more about how we implement observability when we are
implementing AI. It's different, I would assume, when you have BI. It's different when you have ML.
It might be even more different when you have AI, although they're similar with ML.
But what have you seen out there? I'm sure you have much more experience with that.
And I'm very curious to hear from a vendor what is missing today, or what works, when we're trying to actually bring the same value of observability, but when we are doing AI.
Yeah, look, AI is super interesting. Every company is rushing to create some innovative products with AI, or at least they're starting off with using AI to improve their own operations, right?
But when you break it down,
it's again a series of data steps
and sequencing of data steps that need to happen
to create meaningful AI outcomes.
So a couple of steps are actually similar
to say BI workloads,
where you would have your ETL or your ELT of bringing
data in and prepping that data as a common step.
So in fact, we almost always recommend to people: think about all your data apps as
being modular pieces, and think about what you can repeat and reuse, so that you do well on cost as well as efficiency.
So that's one of the ways.
But yeah, to answer your question, you still have to have something that can observe multiple systems.
Because AI is, again, not a one system or one technology-based app.
You need something for data ingestion,
you need something for data modeling,
you need something for running your AI algorithms
on top of, something to serve it, et cetera.
So you need observability that is capable
of measuring things in multiple services
across multiple environments.
And what we're seeing, this is becoming very real,
is people are actually moving
to a multi-cloud environment as well. So you need a technology that cuts across these
pieces too. Now with AI, you will again have more teams and more users using your data platform
because the ideas for AI-generated apps are going to come from everywhere in the organization.
You're going to have your legal teams, for example, jumping on and saying, hey, we can
use this data set.
We're doing these amazing things with AI for our company, which means that leaders need
to be even more careful and recognize that you're going to have varying skills of people.
And with that may come in more complexity and inefficiencies into your platform.
So having observability from the get-go to measure all these pieces is going to be even more crucial as you take on this AI work.
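Concretely, observability that "cuts across" ingestion, modeling, training, and serving usually starts with tagging every metric or cost event with its pipeline and stage, then rolling them up. A minimal sketch, with invented event records and field names:

```python
# Hedged sketch: rolling up cost events by (pipeline, stage) so a single AI
# app's whole chain is visible in one place. Records and names are made up.
from collections import defaultdict

def rollup(events):
    """Aggregate cost per (pipeline, stage) across all observed events."""
    totals = defaultdict(float)
    for e in events:
        totals[(e["pipeline"], e["stage"])] += e["cost"]
    return dict(totals)

events = [
    {"pipeline": "churn-llm", "stage": "ingest", "cost": 120.0},
    {"pipeline": "churn-llm", "stage": "train",  "cost": 900.0},
    {"pipeline": "churn-llm", "stage": "serve",  "cost": 75.0},
    {"pipeline": "churn-llm", "stage": "ingest", "cost": 30.0},
]
by_stage = rollup(events)
# {('churn-llm', 'ingest'): 150.0, ('churn-llm', 'train'): 900.0, ('churn-llm', 'serve'): 75.0}
```

The same rollup works across clouds as long as every service emits events with consistent tags, which is the hard organizational part of multi-system observability.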
Okay, that's great.
Eric, back to you.
Yes, Kunal.
Okay, I have to ask on a personal note,
now having done a consumer startup and an enterprise startup, would you ever go back to consumer? Is that an itch left to scratch again?
Eric, for sure. There are all these exciting things that need to yet be created on the consumer side, believe it or not.
But yeah, that's going to be one of the companies, you know, that I do create in the future.
I don't know how much in the future, but definitely an itch to scratch.
Awesome.
Well, thanks so much for joining us today.
We learned so much.
And best of luck with Unravel and your future consumer app.
Thank you.
Eric, Kostas, thank you so much for having me here.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.