The Data Stack Show - 106: Optimizing Query Workloads (and Your Snowflake Bill) with Vinoo Ganesh of Bluesky Data
Episode Date: September 28, 2022
Highlights from this week's conversation include: Vinoo's background and career journey (2:43), How to benchmark cost (7:54), How Bluesky addresses rising Snowflake bills (14:01), "Workload" as defined by Bluesky (17:14), Space for BI optimization (22:55), How products manage bill growth (28:34), How to optimize your workloads (35:37), Bluesky's partnerships (39:53), Getting real-time feedback on your work (44:50), Where to begin reevaluating your Snowflake game (50:47)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Today,
we are going to talk about a really interesting topic, and it's ROI related to all of the data
workloads that you run. Kostas, I know that you have questions about what the definition of
workload is. We want to dig into that. But we're going to talk with Vinoo from Blue Sky.
And what I'm really interested in is on their website,
they say, if you're spending $50,000 or more
on your Snowflake bill, you should talk to us
because we can help you drive better ROI,
which is fascinating.
So I want to know about that number.
I also want to know about their
relationship with Snowflake, right? Because reducing your Snowflake bill, like, are they friendly with
Snowflake? So that'll be interesting. So I have so many questions to ask, and then, you know, of course,
what does the product do. But how about you? Yeah, I mean, it's not that difficult to spend 50 grand on Snowflake, right? So you know how easy it is.
So yeah, it's going to be very interesting to hear war stories.
Let's say what they experienced with their customers.
I definitely would like to chat about the definition of workloads
and what they see out there in terms of what is the most expensive part of the operations around data.
And yeah, also the other thing, which I think is going to be quite
interesting is, I know that right now the product is focusing on Snowflake,
but what it means to take this kind of product, this kind of service and
deploy it on different data warehouses
or data lakes or data infrastructure in general.
So I think it's going to be a very interesting conversation.
There's a lot of discussion lately about the cost of Snowflake.
So I think it's the right timing to have this conversation today.
I agree.
Well, let's dive in and talk with Vinoo.
Yeah.
Vinoo, welcome to the Data Stack Show.
We are super excited to chat today. Thank you, Eric. Very excited to be here.
All right. Well, give us your background. You've done some really interesting things
in some really interesting industries. So tell us about your background and then what led you to
Blue Sky. Absolutely. I started my career off at Palantir.
Was there for almost seven years.
I did virtually every job you can imagine,
from software engineer,
building some of our core distributed systems,
to salesperson selling our product,
to deploying it,
in commercial, healthcare, and military environments,
before eventually leading our core compute team.
So every bit and byte of data that flowed through Palantir flowed through my team at
one point.
After Palantir, I realized that we had built these incredibly powerful analytical tools,
but a lot of our customers and a lot of just analytics tools consumers didn't have the
data size or scale to warrant the power of these tools.
So I decided to focus on that area and built a data as a service company.
Veraset, I think it's at about 15 million ARR now, still chugging along.
Oh, cool.
Really fabulous.
Thanks.
Yeah.
Still doing well.
Really focused on how do we take a huge amount of data, make it accessible and make that
data accessible to consumers who don't have to do these crazy expensive cleaning operations.
After that, an old friend from Palantir reached out, and I ended up joining Citadel,
the hedge fund, leading business engineering for Ashler Capital.
So building all the tools, technologies, managing the data engineering team for
the last-mile aspect of portfolio managers' alpha generation processes, just trying to help people make money effectively.
Before, actually, I guess right after, another mutual friend made an introduction and I got introduced to Blue Sky.
And Blue Sky, where I am right now, I'm on the founding team.
Our goal is really to provide an optimization mechanism and a mechanism for people to introspect their own query workloads and their own data cloud workloads and really get the maximum
ROI of their data cloud. And that means a number of dimensions.
Awesome. Okay. I have a question about your time at Palantir
because the spectrum of job titles that you mention
is astounding in many ways, right?
You just rarely ever hear of someone
who sort of goes from software engineering to sales,
to sort of owning the data platform.
And it sounds like multiple jobs in between.
I would just love to know, having that breadth of perspective inside of a single organization,
what was some of the most interesting things or unexpected things that you learned doing
such drastically different roles?
Absolutely.
I think this is one of the things that Palantir does best,
where it's almost like you can have 10 different jobs
just in the same umbrella company.
So first and foremost,
the reason that I ended up forward deploying,
as they would call it,
is that a lot of the design decisions that,
I guess, my fellow engineers and I made
on some of the early data storage products
were not always optimal.
And until we had real customer workloads,
and I'm using "workloads" again,
but that workload-level understanding,
there was no way we could have designed a system
that actually made sense.
So I think the first big and surprising thing
was truly how different,
anyone who's worked in production software will
know this, but how different production versus development is. Really just understanding
how we build software, how we actually develop a user focus, especially when the tools and
technologies aren't directly consumer facing. Like a distributed system or data storage system,
you wouldn't think of as being
particularly customer facing. But all the decisions that you make from a design perspective,
from everything from compliance to storage, all directly affect the user experience.
The second thing is really almost the value of having a technical slant,
not necessarily in your sales cycle, but augmenting your sales
cycle. Being able to actually communicate with the people procuring your software with a deeper
understanding of why things are implemented the way they are, some of the challenges and
limitations, I think all gave me a lot of respect for the engineering background that I had.
And conversely, going back to the engineering side,
really understanding how hard it is to move a contract from an initial POC to an enterprise
agreement. It's just so difficult. Yeah. Yeah, that's great. Actually, I'm glad you brought
that up because I was going to ask you from the engineering side, you know, a lot of our listeners are technical. And so, you know, that's really helpful to hear
that perspective on the sales side, right? Because I'm sure, you know, for salespeople,
building production software probably seems really, really hard, right? Moving a contract
is difficult too. Okay. Well, I know Kostas has a bunch of questions, but I actually want to start
with a really specific data point that you list on the Blue Sky website.
And I think this will be just a great jumping off point.
So I know that ROI, you know, ultimately kind of boils down to, you know, what is it producing. And I'm a marketer, so that $50K number stuck out to me.
It got my attention for those reasons alone. But I'd love to know why that specific breakpoint
and what does that number, whether or not it's the perfect number as a proxy for what Blue Sky
helped solve, what is represented underneath that? And I think specifically,
I'd love to know, how can we help our listeners benchmark cost even?
Absolutely. So I will say transparently, the 50K number was kind of a number that was just picked.
However, it's one of those, like a backronym, where we picked the number and then realized,
wow, this is actually indicative of something pretty powerful. I think what's been really interesting, especially with something
like the Snowflake ecosystem is starting off as a small scale user. And Snowflake is an incredibly
powerful tool. SQL is really easy to write. There's all these built-in integrations,
actually a blessing and a curse. But the number one thing that I've heard from our Snowflake customers
is the speed at which you can ramp up your Snowflake spend
by actually doing things that add business value
is unparalleled.
So you deploying like a Sigma or like a DBT,
these are incredibly powerful technologies and tools,
but they almost add this exponential growth aspect
to your Snowflake spend.
And so that number in particular, I think it almost is the beginning, almost the precipice
of, I'm now going to become a heavy Snowflake spender or heavy Snowflake user.
And so Blue Sky has customers and partners that go anywhere from that 50K number up to
the double-digit millions in Snowflake spend.
And so where you are on that data journey or that data process, and when you actually
decide to engage us, tells a lot about how you think about your utilization of a data
cloud.
So we picked that number largely because it really does look like the precipice of starting
to expand your utilization
and your snowflake footprint.
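For listeners who want to put a number on that precipice themselves, here is a minimal sketch of benchmarking spend from Snowflake's own metadata. It assumes access to the SNOWFLAKE.ACCOUNT_USAGE schema, and the dollars-per-credit figure is only a placeholder, since the real rate varies by edition and contract.

```sql
-- Approximate monthly spend per warehouse over the last six months.
-- CREDITS_USED comes from Snowflake's metering view; the $3/credit rate is
-- purely an illustrative assumption -- substitute your contracted rate.
SELECT
    DATE_TRUNC('month', start_time) AS month,
    warehouse_name,
    SUM(credits_used)               AS credits,
    SUM(credits_used) * 3.00        AS approx_dollars
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('month', -6, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, credits DESC;
```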
Super interesting.
And one follow-on question to that,
and this is probably multiple questions
packaged into a single one,
but when you think about the,
let's just use the examples that you mentioned, right?
So of course, SQL is easy to write. And I mean, it is wonderful that we live in an age where you can
deploy Snowflake, start writing SQL, get a huge amount of value in a really short amount of time.
But then when you think about a tool like Sigma, or even dbt, which might be a better example,
dbt in particular is pretty low in the stack in terms of where it interacts with the data,
and then it sort of pushes value out in a large variety of contexts, right? So you almost
have what I would call ROI fragmentation. So there's the cost side of it, but then how do you think about ROI in such a fragmented
way because it's touching so many parts of the business?
That's not necessarily a simple calculation.
Definitely.
I think maybe this is the finance side of me, but anytime I think about ROI, I think
about really like, am I effectively deploying my
capital as a business? And what I mean by that is not necessarily like, am I spending a certain
dollar amount, but the dollar amount that I'm spending, am I actually spending that in the
most effective way possible? You can kind of think about it in the, I was like using this car
analogy. Like if I'm driving around in a car, I can either be very gas efficient or like very
bad at consuming gas.
I'm slamming the brake or slamming the gas.
I'm going to burn through a lot of gas really quickly.
Even the car that I use matters, like whether I'm driving a Hummer around, it's going to
be guzzling gas like crazy.
So I'm paying for the gas either way.
Does that gas consumption actually add the value
that it should to my business? So when I think about ROI, I don't necessarily just think about,
am I getting a dollar value back for this effective cost of compute that I'm putting in?
Am I deploying that capital for my business in the most effective way possible. As a concrete example, I think dbt is a super powerful tool, right?
Being able to test and almost have like a CI, CD process around SQL is incredibly powerful.
Absent something like dbt, you can run a series of failed queries, like one after another
after another, racking up more and more cost. Now, the capital deployment
of just letting a query run and failing
is a horrible way to deploy capital.
But I could instead use a tool like dbt
and almost get all of that,
dbt has its own costs,
but get that failed query capital back
and deploy it against another business critical problem.
That to me is a much better ROI
and a much better deployment of capital.
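To make the failed-query point concrete, here is a hedged sketch of how you might surface that wasted capital from Snowflake's query metadata. It assumes ACCOUNT_USAGE access, and elapsed time is only a rough proxy for cost, since per-query credit attribution isn't directly exposed.

```sql
-- Who is burning compute on queries that never return a result?
-- TOTAL_ELAPSED_TIME is in milliseconds; used here as a rough proxy for cost.
SELECT
    user_name,
    warehouse_name,
    COUNT(*)                                  AS failed_queries,
    ROUND(SUM(total_elapsed_time) / 3.6e6, 1) AS failed_query_hours
FROM snowflake.account_usage.query_history
WHERE execution_status = 'FAIL'
  AND start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY failed_query_hours DESC;
```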
Yeah.
Makes total sense.
Exactly.
Okay.
I'm going to ask one more question,
but Costas, I feel like I've been hogging the mic.
Vinoo, could you help us understand?
So let's say I'm looking at my Snowflake bill.
It's 75 grand.
We're starting to have internal discussions around like,
okay, you know, we're getting some inquiry about like, wow, this cost is really ramped up.
Describe the process of how Blue Sky would come in and help us address that situation.
Absolutely. So the first thing is, as any engineer does, we start out with data, right?
We want to understand not just the data of what the bill is, but what actually makes
up that bill and why does it look the way that it does?
So the first thing that we do is we never need access to any of your business data or
anything other than metadata of your query history.
From that, we can actually tell using some proprietary algorithms.
First, where is your compute actually going?
Am I over-speccing my warehouses in Snowflake?
Do I have a bunch of idle compute?
Do I have these massive queries that take up thousands of credits after one execution?
So we really start with an understanding of
what is the
information that I have on the ground from Snowflake. From there, we start introspecting
by adding our own kind of flavor and opinions into our product. Some of the examples I gave,
like warehouse idle credits, or even an ability to look at a query and say, this insert, this table rather, is ordered by a particular column.
Consumers of that table should take advantage of that order-by and filter where they can.
Those are insights that we can display as well.
So it starts with the understanding and onboarding of the unique aspects of a data cloud.
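As an illustration of the kind of metadata-only check being described here, and not Blue Sky's actual algorithms, just a rough heuristic over the same inputs, you can compare metered credits against query activity to spot warehouses that are mostly paying for idle time or are over-specced.

```sql
-- Warehouses that consume a lot of credits relative to the query time they serve.
-- A big gap between credits and query hours often points to idle time or over-sizing.
WITH credits AS (
    SELECT warehouse_name, SUM(credits_used) AS credits_30d
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1
), activity AS (
    SELECT warehouse_name,
           COUNT(*)                        AS queries_30d,
           SUM(total_elapsed_time) / 3.6e6 AS query_hours_30d
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1
)
SELECT c.warehouse_name, c.credits_30d, a.queries_30d, a.query_hours_30d
FROM credits c
LEFT JOIN activity a USING (warehouse_name)
ORDER BY c.credits_30d DESC;
```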
Then we look forward.
So it's really easy to say, okay, well, we're in
this position now, let's just do like a P zero tourniquet cost cutting exercise, only to end up
in the same situation three months from now when the spend has grown. So what we instead do is also
provide tools and mechanisms for controlling costs from a guardrail perspective as you move forward.
And these are ways of coalescing
functionally equivalent queries
or semantically equivalent queries together
to actually attribute a cost
or even highlighting something like a misconfiguration
where I've sized a warehouse a particular way
when the workload doesn't actually warrant
that sizing of the warehouse.
So it's a data-driven approach that really starts with visibility before extending into this insights level of what you can manually change before building BlueSky's end vision,
which is an automated tuning and healing layer. Eventually, you're going to get tired of
implementing these insights. Maybe just turn on an autopilot and we can figure out what to
do for you. Super interesting. All right. Well, that's a great point on which to hand it off to
Costas. Costas, thank you for your patience. Thank you, Eric. Thank you. So Vinoo, let's
talk a little bit about workloads, right? I mean, people are using data warehouses,
obviously, for analytical purposes,
but there are many different things that are happening
in a data warehouse before we can get a dashboard
or a report or whatever, right?
So, can you help me understand,
like, how do you define a workload in Blue Sky?
And yeah, let's, we'll get deeper into that.
So let's start with this.
Sounds good.
To me, a workload is, you know, in the older terminology, there's like OLTP, OLAP,
and like our batch or streaming or analytical compute.
For me, a workload is really going back to that finance.
It is the way I'm deploying my capital in my data cloud.
So the workload can involve anything from me writing data, persisting that data, me
actually doing Snowflake's auto-clustering behind the scenes, to me repartitioning data,
me even doing things like a reverse ETL process
of writing data out of the cluster.
So these don't fall into necessarily batch analytical
or these clean definitions of what were previously
like your, I'm a batch compute heavy workload.
It's much more so how I'm utilizing that compute.
That's how I think about the workload.
So what are like, let's say some common categories of
compute utilization that you see out there?
So first and foremost is I would have never expected this before Blue Sky.
Although I think you can kind of guess it's there,
but the big ones are really BI, like all the business intelligence tools like Looker, Tableau, Sigma has some of these.
There's just such an inundation of wanting to get insights out of my data, my system, that BI actually accounts for a huge amount of the workload.
Now, whether or not these dashboards are actually actively used or consumed, they are the ones
writing these automated queries.
The challenge with BI, especially in terms of like a workload perspective, is a BI tool
is not working, let's say, nine to five.
It will execute its queries whenever it wants.
It will do data refreshes at any time.
So your heaviest consumers can actually be something like BI tooling.
So I think the second is, and I'm going to kind of pick the ones that I think are unique.
The second is maintenance.
And few people actually think about maintenance in terms of what needs to happen for your
data cloud to operate optimally.
And these are literally things like Snowflake's repartitioning or
re-clustering, where I want my data...
Like I want to partition, I'm using partitioning and clustering interchangeably
here, but I want to cluster my data a certain way, and there are some
maintenance operations that need to happen, Snowflake spins up compute behind
the scenes, to actually ensure that I'm able to read tables the way that I want and the tables
look semantically the way I want them to. So I kind of grouped that all into maintenance,
which is distinctly separate from even something like compliance. CCPA, GDPR, these workloads are
the right to delete. I actually grouped these into a separate type of workload because they involve both this
like linear scan or like taking advantage of some unique file format way of scanning
through your data, actually deleting and making incremental changes.
So I think these are the, and then of course you have your analytics, someone going on
writing a ML job or writing some kind of like just one-off SQL query to get a table back.
And you have your ETL pipelines that come from a variety of sources as well.
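A rough sketch of how that kind of bucketing can fall out of query metadata alone follows; the warehouse names, tags, and categories below are purely illustrative, and real classification is messier.

```sql
-- Bucket 30 days of query history into coarse "workloads" using naming
-- conventions and query types. Adjust the patterns to your own environment.
SELECT
    CASE
        WHEN warehouse_name ILIKE '%LOOKER%'
          OR warehouse_name ILIKE '%TABLEAU%'  THEN 'BI'
        WHEN warehouse_name ILIKE '%DBT%'
          OR query_tag ILIKE '%dbt%'           THEN 'ETL / transformation'
        WHEN query_type IN ('DELETE', 'MERGE') THEN 'maintenance / compliance'
        ELSE 'ad hoc / other'
    END                                       AS workload,
    COUNT(*)                                  AS queries,
    ROUND(SUM(total_elapsed_time) / 3.6e6, 1) AS compute_hours
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY compute_hours DESC;
```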
Yeah, it's interesting that you didn't mention ETL as one of, like, the main workloads out there.
Why is that?
Or you just included it as part of BI?
Like, how did you see like ETL being part of the workloads there?
It's a great question. So ETL is always the, you know, it's kind of the bedrock of like,
if I'm using data and there's going to be some cleaning process, some transformation process,
some load process, or our extract process, all of these kind of
live in the same ecosystem, almost. But the reason I don't think about ETL as prevalent of a workload is because it actually
tends to be the place that people are investing most of their time and energy. It's not the long
tale of, oh, I built this dashboard two years ago and forgot about it. It's really like, this is
clear business value because every day these tables need to be updated. They need to be transformed.
They need to be written to.
So deploying capital against ETL jobs is almost an easier justification than deploying it against BI tools that may not have the right consumers or may not generate as much business value.
Makes sense.
Okay.
You mentioned like maintenance, compliance. BI, in terms of what you've seen out there,
I would expect that BI is one of these things
that's kind of, let's say, predictable, in a way,
outside of, okay, let's say you have interactive analytics
where obviously you need to sit on top of your BI tool
and start experimenting with queries
and all that stuff.
But when you have dashboards, you can deploy quite a few different methodologies to optimize
the process.
Like materialization, for example, is one of them.
Or caching, right? There are tools, and BI is one of these processes that has been around
for a very long time. So database systems have really evolved around it, right? But what do you
see happening out there? Because obviously, there's a lot of space for optimization from what I understand. So is it like we are missing the right tooling to do that?
Is that it? Like, why is it that there is so much space
still for optimization when it comes to BI?
It's a great question.
I think in the past, BI fell in this category of like read-only
in the sense of I'd have a dashboard, it was executed once,
and it would just be, you know,
like on a page for someone to consume.
In the new world of data applications,
like there's a lot of these companies
like Streamlit, Houseware,
that are doing these really,
I think Snowflake actually just acquired Streamlit,
doing these really powerful
like data application creation.
You as a non-technical user, or I don't want to say
non-technical, but a less technical user, can now interact with the platform in a way that you
previously didn't really interact with it. Not just filtering, but I can actually bring in and
join with no-code or low-code solutions, other tables, and create new derivative data products
just from my own system.
So materialized view creation, caching, they all solve that root-node problem of, you know, compute happening over and over again, by
persisting that, but any of the derivative products, even notebooking
tools like Hex, I think is a really cool product as well, you can create
all of this derivative value and all of these derivative data products
that still kind of live in the realm of BI, although people are using Hex for ETL also,
but still kind of live in this BI tool, BI world, independent of, I guess, a previous
just like an individual dashboard that was just sitting there consuming data with no
one really looking at it.
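For the root-node case that materialization does solve, here is a minimal Snowflake sketch. The table and column names are made up, and note that Snowflake materialized views are single-table, Enterprise-edition objects whose background maintenance consumes credits of its own, which circles back to the "maintenance" workload above.

```sql
-- Pre-aggregate a hot dashboard query so BI refreshes read a small result
-- instead of rescanning the raw table on every refresh.
CREATE MATERIALIZED VIEW analytics.daily_order_totals AS
SELECT order_date,
       region,
       COUNT(*)    AS orders,
       SUM(amount) AS revenue
FROM analytics.orders
GROUP BY order_date, region;

-- The dashboard then hits the compact aggregate:
SELECT region, SUM(revenue) AS revenue_30d
FROM analytics.daily_order_totals
WHERE order_date >= DATEADD('day', -30, CURRENT_DATE())
GROUP BY region;
```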
Mm-hmm. And do you feel like we need new tooling to optimize this new,
let's say, I wouldn't say new workloads, but new facets of existing workloads?
Like what do you see there?
I mean, obviously there is an opportunity, that's why Blue Sky is out there, but
what should a database system
do to account
for these new ways of interacting
with data and make the full process more
performant at the end?
Yeah. So the interesting thing is SQL's
been around forever, right? Just the ANSI SQL
standard has existed. The
execution engine has been the thing that's
been particularly played with
over the years.
Snowflake is effectively Oracle without the DBAs, deployed in the cloud, that you can manage on your own. But the execution engine is the thing that actually does a lot of
the magic of Snowflake, the clustering and the ability that you can write a query and
potentially never have it fail, it can just keep spinning. Whereas if you do the same
thing in something like Spark, it just crashes. Those are double-edged swords. So if you look at
something like, well, we'll look at Databricks. I think Databricks and Spark are such a great
company with a really cool technology. They're investing so heavily, things like Photon,
Catalyst, all of these technologies that are really just made for
the purpose of optimizing a query execution. I would even say optimizing, making a query
execution more predictable. That's really what I think they're doing. So in terms of the need
of tooling, for me, it's for as long as we have people who are going to be authoring queries,
we're going to need people who either are educating folks
on how to author queries in the most optimal way
or automated tools and solutions that just abstract that problem away.
This may be an imperfect comparison,
but anyone who worked with the old C++ memory management things
has experienced challenges of memory leaks forever.
So knowing when to, like, malloc
or deallocate memory is really, really hard.
And so people built layers on top of that.
Java became one of the predominant technologies.
And then we have like G1 garbage collection,
all of these new ways of actually abstracting
that problem away.
So what I see us, or this space in particular, doing
is just adding a layer of abstraction
that handles the complexity of otherwise having to optimize low-level SQL code based on table
semantics or query semantics. Yeah. So I have a question, actually, and this is for you,
Vinoo, and Costas as well, because I know that you've looked at some of these tools.
One interesting dynamic, just to dig in a little bit deeper on some of the tools that allow an end user to actually drive up compute. Because those tools can offload compute to Snowflake, okay, well, this is enabling a lot more people
to do a lot more things, but it's creating a huge bill on the backend because you're just
hammering compute. How do you see those products managing that? Because that's, in my mind, a non-trivial
component of your product, sort of the optimization, right? I mean, there's literally
entire companies obviously built around query optimization, of course. Obviously that's
exhibit A with Blue Sky, but I'd love to hear your thoughts on that. Like, how do you
see those products managing that?
Do you want to go first, or shall I pick you?
Yeah, I can. I mean, I have, and that's also what I wanted to ask you.
Okay, traditionally, let's say the database system has the query optimizer, right? So you have, let's say, a piece of the technology that is one way or another responsible to go out there and make the best possible choices to execute the query in the best possible way.
Obviously, that's a really hard problem.
It will never be completely solved, blah, blah, blah, like all that stuff.
But at least you have access to that.
Traditionally, the DBA, that was the role of the DBA.
When things start going wrong, I can use the query optimizer,
the planner, the explain commands, blah, blah, blah,
all that stuff to see what's going wrong
and figure out ways to manually optimize things.
When we put so many layers of abstraction in between, and
I'm talking specifically for things
like Looker and BI
tools where you also have languages
that you use to model the
data, and there's another
piece of software there that takes
the data model definition
and does whatever it wants to do,
the user is like,
how do you even try to tackle
this problem, right?
Like a query that is generated by Looker that then is optimized by the query optimizer and
then turns into a plan and gets executed.
How do you even figure this out?
In my mind, at least, it's really, really hard, right? So how can we,
I mean, abstraction is good,
but it also adds complexity.
So how do you think
we can tackle this problem?
Absolutely.
And so I think,
I'm going to go back
to the metaphor
of how I think about Snowflake,
where anyone,
any query author,
I think of as
someone driving a car
and their goal is to get
from point A to point B.
The way that, or how much gas or how much fuel they consume on that journey doesn't just depend
on their ability to become the best driver in the world. It depends on so many things,
the kind of car they're driving, the environment they're driving in, who else is on the road.
And the kind of parallel here is, if I were to say a warehouse in Snowflake, so logical grouping
of compute cluster is the car, the optimal route selection or the optimal gas consumption to get
to that end route depends not only on the person, but also on the car. The best driver in the world
can still use a bunch of gas driving a Hummer to
wherever they want to go.
It's kind of the same thing.
If I'm authoring a query and my query optimizer is particularly amazing, it's done everything
perfectly, that's only a part of the equation.
The second part is, where do I choose to execute that query?
Am I bin-packed with like incredibly computationally expensive queries?
So I'm going to actually slow down and I can't scale up that much.
Am I going to be able to have any kind of data locality, depending on the
technology that I'm using at that point?
So all of these come together to, even if the query is written in the most basic
query, like select star from this table, there's
still so many other elements that are involved in my query execution.
When I say the level of abstraction, I also mean if we are able to, like we
could train every driver to drive optimally and even in that situation, all these
external factors could throw things for a loop.
So what I really mean is how do I augment that driver either by extending what the
query optimizer can do, but also by adding contextual information around
street conditions or road conditions, to keep the same
terminology, like what other queries are being executed, the car, the size of my
warehouse that I have, the number of clusters that I'm scaling up
and down. So abstracting that entire problem space away such that a user is executing an
individual query doesn't have to worry about that is incredibly powerful. Let me make this a little
more concrete. If I'm doing, so Snowflake has had a very interesting thing with this terminology
warehouse. A warehouse doesn't mean anything. It's just like a logical collection of EC2 instances. And so how people actually use
or name these warehouses really changes from organization to organization. If people use
Looker, their setup instructions say "Looker warehouse." Or if they're a little bit more
detailed about this, they'll say "Looker extra small warehouse," "extra large warehouse."
But the challenge is the logical grouping is dependent on the product, not the actual workload of that individual technology. So if Looker is actually coming in every day and firing a query
once every 24 hours, that happens to execute on this massive extra large warehouse that has a very high auto-suspend,
I'm going to be spending a lot of money on that one query.
It may even be overspecced.
So all of these problems coming together
and the contextual information is really how I think about solving this.
Instead of a layer of abstraction on top of just the query,
it's on top of the data cloud as a whole.
It's very interesting. And I want to add another dimension to this problem,
and I want to ask you specifically because you have also worked in the financial sector.
So you know how people get motivated to optimize based on the profit that we can have, the alpha that we can generate at the end, right?
We all strive for this alpha at the end.
So I keep remembering cases, for example, like BigQuery, right?
There was this case where you could use select star with limit 10, right?
So you would expect that the query engine would just read 10 values and return them.
No, it would go and actually scan the whole data set and then return just 10.
But at the same time, that's how the query pricing works, it's based on how much data it
reads during the operation.
So there's a lot of motivation there to actually do that because that's how the product can
make more money.
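A hedged illustration of that BigQuery behavior, with hypothetical table names: the exact bytes billed depend on partitioning and clustering, but the principle is that billing follows the columns scanned, not the rows returned.

```sql
-- Scans every column of the whole table even though only 10 rows come back,
-- because BigQuery prices on bytes read from the referenced columns.
SELECT * FROM my_project.analytics.events LIMIT 10;

-- Reads only two columns, and prunes to one day if the table is
-- partitioned on event_ts, so far fewer bytes are billed.
SELECT event_id, event_ts
FROM my_project.analytics.events
WHERE DATE(event_ts) = '2022-09-01'
LIMIT 10;
```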
And especially in consumption-based models, I think this is a very strong drive to guide
how the engineering teams there, or the product
teams in Snowflake, or whatever company there will make certain choices.
So how important do you think, outside of the technology itself, the abstractions
that we put there, are also these other factors, like the pricing models that the
companies have, or, I don't know, like the contracts and
the business side of things? Like, how much do they also affect, at the end, how much it will cost us and
how we should optimize, at the end, the workflows that we have? It's a great question. I think one
of the really interesting things is how Blue Sky approaches this. Normally
people would think, and it's entirely understandable,
Snowflake must hate us, right? Snowflake is like, you are taking all of our money away and it's causing a bunch of issues. So this has been completely opposite from Snowflake's actual
reaction. Blue Sky is actually a Snowflake partner, which is super interesting if you think
about it. And I think a lot of this goes back to Snowflake's consumption-based pricing model.
So arguably, you can look at it and say, they want you to spend as much as possible to pay
them as much as possible. But there's a danger underlying this. It's almost like looking at the
finance side. If you put all of your money in one stock, it's generally very high risk.
And so I think Snowflake recognizes that problem. And for them, this optimal deploying of capital has a multifaceted benefit.
First, they have solutions architects who kind of function like Oracle's DBAs, and
these solutions architects are really interested in helping companies grow their data cloud in a responsible way.
And I think that's incredibly powerful.
Because Snowflake realizes, if I'm spending my entire compute budget and we optimize this one query, maybe you try a new BI
problem or a new business problem, like a business venture, with this compute money that you've now
saved. You're actually further entrenched in the Snowflake ecosystem. So diversifying the workloads,
like diversifying your investment, is actually really beneficial. And I actually think that's
one of the best discoveries, I think, that Amazon had
as well, where even back in the startup that I was at previously, we would focus on, well,
if they gave us compute credits or some way of offsetting spend with private pricing,
I'm going to take that money and use Macie or Redshift and try something completely new.
So from the consumption-based pricing model perspective,
I actually think a diversified investment
is better for the data clouds.
And so even having their solutions architects
potentially at some point use Blue Sky
and say, here are the areas we can cut costs
or here's areas we can redeploy capital
is incredibly powerful.
I mean, Eric, one thing to your previous question,
building these data apps and like, you know,
almost Snowflake is now this like API, right?
It's almost like, I forget who called it this on LinkedIn somewhere,
but they were saying Snowflake's building their own like Apple app store
where you can build all these data apps backed by Snowflake.
And I think it's a really great characterization
because it actually shows that Snowflake is
now handling the backend computation of all of these tools and technologies and enabling
people higher in the stack to generate business value who ordinarily wouldn't be able to do as
much work given their lack of experience building some of these technical products.
So it's kind of the same thing.
If I am Snowflake, I'm not necessarily interested in just optimizing as much compute out of this one app developer
because it means that they can't spend money building, refining, doing other things.
So actually running those optimizations behind the scenes or deploying a tool that can help these folks who are creating data apps grow and scale, I think is really powerful.
And one example I will give, given that they are public, Houseware just won Snowflake's
startup challenge a few months ago.
And these guys are an awesome team.
They're building really cool products.
I'm not an investor, but I think they're actually really cool.
And so one of the things that they're doing
is helping people build these data apps
and being able to build a data app.
If you're a small company,
just like a CTH company,
you can have all the technology,
but be terrified of that big compute bill
from some user accidentally running you up.
So it's the guardrails around safe compute as well
that I think are really powerful.
Yeah, that makes a lot of sense.
And one last question for me before I hand it back to Eric.
Is Blue Sky right now, like, working only over Snowflake, or do you also support other data cloud
solutions?
So the irony of me right now is I actually didn't know anything about Snowflake
until I started working at Blue Sky.
My experience is like fairly heavily Spark and Databricks.
So right now we're focused on Snowflake for two reasons.
First, Snowflake is, you know, it's a big dominant player in the ecosystem.
And I think there's a lot of opportunity in Snowflake in particular. Just from a configuration or a query perspective, there's a lot we can do, especially
with also SQL. So right now we're focused on Snowflake, but that will almost certainly change
as time goes on. So do you see there more opportunities in systems similar to Snowflake or also in systems that are more like
Spark? The reason I'm saying that is because as a computation model, they're very different,
right? Very different types of deployments, different teams that are involved.
So how do you see the difference there between, like, a system like BigQuery or
Snowflake or Redshift and then systems that are more like Athena or EMR and Spark or Databricks?
How do you see the difference there? It's a good question. So I want to say, taking
a step back from the perspective of just Snowflake,
going to Blue Sky, I always say, not to sound like a broken record, but it's really about this
efficient deployment of capital, right?
So I would not necessarily say, I mean, this is great, right?
If we can optimize someone's spend, that's awesome.
But I wouldn't say my goal is to go into a customer and like bring their Snowflake spend
necessarily, like, as far down as humanly possible.
My goal is instead to have them effectively deploy their capital.
So if they have a bunch of failed queries or they're not using certain BI tools, that's
actually what I'm trying to address.
Not necessarily just negotiating their price down or something.
So I think when I look at something like Spark or Databricks, the number of levers that you have makes that problem like an N-dimensional problem.
In Snowflake, for example, I don't have to set something like Spark's driver memory or Spark's executor memory.
I execute the query and have some t-shirt-size warehouse size that it's going to run on.
But it really depends on like Snowflake to execute that completely
independently.
So I think when we move out of Snowflake and move to other technologies, I mean, BigQuery
doesn't have a lot of these knobs, neither does Redshift, but the dimensionality of how many
variables we have to tune does become more and more complicated.
Our goal is really to look from an organization or team-wide perspective.
Not necessarily like, this query is slow.
Let me optimize this individual query.
It's really across the organization.
Here's what you're trying to do.
Let me instead help you optimize that.
And I'll give you an example that may be interesting for some of the listeners.
Incremental pipelines we're seeing all over the place now.
So a table, it's appending over and
over and over. And oftentimes, because businesses are moving so quickly, we've noticed more than a
handful of cases where an incremental pipeline, a table is being incrementally appended to,
and the downstream consumers of that table still do a full linear scan of all of the table without
actually looking at the diffs. And that's actually a really hard problem to identify without a tool that's
actually looking for that as like a best practice.
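One way to avoid that full rescan in Snowflake is a stream on the incrementally loaded table, so downstream jobs read only the rows added since their last run. A minimal sketch with made-up names:

```sql
-- Track only appended rows on the incrementally loaded table.
CREATE OR REPLACE STREAM raw.orders_stream ON TABLE raw.orders APPEND_ONLY = TRUE;

-- The downstream job consumes just the diff; reading the stream inside a DML
-- statement advances its offset, so the next run sees only newer rows.
INSERT INTO analytics.orders_enriched (order_id, order_date, region, amount)
SELECT order_id, order_date, region, amount
FROM raw.orders_stream
WHERE METADATA$ACTION = 'INSERT';
```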
And so the dimensions that we can go or the areas we can expand in Snowflake
itself actually lend themselves to saying, given the multidimensional problem in
Databricks and other places, it's actually almost easier and more focused for us
to focus on optimizing this one sole area.
I mean, I will say this,
I, even being very knowledgeable on Spark,
like leading Palantir's Spark team,
I have no... I don't know how that dimensionality
is going to make it easier or harder for us.
It could be a really challenging space
or it could be something where we can apply
similar principles. Absolutely, absolutely. Makes total sense.
All right. So, Eric, all yours. Okay. I want to return to the car analogy
as we get close to the end of our time. So, one interesting thing, the car analogy with the
driver is really helpful, right? Because you have sort of training, you know, a driver trained to operate the vehicle in
a resource efficient way.
And then the vehicle, to your point, has a huge amount to do with it.
But, and I'm going to really extend the analogy probably to the point of breaking now to make
my point and formulate my question.
But if you think about this, I was actually thinking about this the other day because
I was driving a car from the 80s that's really old. And you can basically watch the gas gauge
go down while you're driving. I think it gets like eight miles to the gallon or something, you know, and also, like, it's not very fast. So if you like push
the car really hard, you're getting a lot of physical feedback as a driver that tells
you like, hey, you are definitely using a lot of fuel here, and the car you're
driving, it's like really loud and you
can see the gas gauge going down, right? So you're not only getting feedback on your own driving,
but you're also getting feedback, you know, from the vehicle itself. Then if you sort of look at
the modern version of that, right? Like you get in a Prius, you know, that's like a hybrid vehicle
and it will give you real-time feedback on the economy of your driving style, right? And even like the
efficiency of the vehicle itself and sort of conserving resources. So again, I'm drawing
the analogy out a little bit about that, but like, if you think about executing a query
in Snowflake, just in the raw SQL editor, you have like the little time counter. And that's like,
that is basically your only physical feedback. And then if you have a data app on top of that,
where you're doing something that doesn't give you any physical feedback, it creates this weird dynamic where it's hard as a user to actually get the information that you need in order to optimize while you're doing your job.
And I'd just love to hear your thoughts on that as part of this problem set.
I mean, I know that Blue Sky comes in and helps, and you've talked about, okay, do we have a completely automated solution?
But it is interesting.
I mean, to your point,
like I don't believe that there's,
you know, malicious,
like we're going to obscure all this so that our NDR is crazy, right?
I mean, you know,
it's like they have to have a balanced approach to that.
But there is actually not a lot of feedback
that helps you sort of
while you're doing the actual work itself,
adjust what
you're doing to account for resource usage. Yeah. So I had one of our customers ask me,
maybe a month ago, two months ago now, you know, why don't you just build a SQL query
linter that just tells you, gives you the red Microsoft Word squiggly lines that says this is
not optimal. And honestly, it's not a bad idea.
The reason that I think this is a challenging problem is because in your Prius or in whatever
car, you have that one dimension.
You have, I'd argue the brake and the gas are two sides of the same coin.
Yeah.
You can not slam the brake as aggressively, or not slam the pedal to put gas into it. The challenge is with that as your only
lever, actually the problem space of what you can do to fix whatever is being found on the dashboard
is much smaller. If I say, hey, this query is not being run optimally, or if I even said you're
doing a linear scan over this data set, the amount of knowledge and expertise it takes to figure out how to solve that problem in an optimal
way, it's actually pretty massive.
And even if I were to say something like, I mean, we'll use like Java as like garbage
collection, right?
I could just say in like C++, like you've allocated this thing.
My IDE is saying you didn't destroy it properly somewhere else in the code.
Well, those are great during effectively the product runtime.
But when they have external dependencies not in the same file, it becomes a really complicated, almost intractable problem.
Knowing what to do is the second step.
And it's not always easy, even for like really experienced query authors. We actually, I'm our deployment lead. So when I go to customers, I actually use BlueSky
and I will say, here's areas that I think you can do optimization. But for me, it's
still like, let me actually introspect your query. Let me understand the table, not just
schema, but like attributes at a fundamental level.
All of that effectively has to be surfaced to the point that you can tell a nice story around what needs to happen.
And that's almost why, rather than just kind of lint in Java and say, here's all the things
you can do to better optimize your memory, let's just build a garbage collection tool.
Let's just handle that for you.
That's kind of our end state.
Let's just handle some of these challenges for you.
And actually, the one additional dimension is
some of these challenges you may not even understand.
Like in a multi-JVM app,
if I have two things running
and I have one service
that is thrashing the hard drive
of whatever box it's running on,
me as a second service may have no
idea why my job is so slow or being queued or IO is as bad as it is. Yeah. Yeah. If I had to,
if I had to summarize that, it would be you as a driver, like shouldn't have to worry about all of
these various inputs because it's a much bigger problem than like gas pedal and brake pedal.
Exactly. You should just be able to focus on driving.
Exactly.
And I think one of the things Snowflake has done is Snowflake is actually, I don't know
if people think about them this way, but they're a giant multi-tenant system.
So everyone can share data with everyone else.
Everyone's executing queries in the same, technically the same AWS or GCP infrastructure as everyone
else.
So you're really part of this massive cluster that's doing all this computation that can
actually affect a lot around whether or not you're getting your queries surfaced and run
in the exact same time every single time.
Yep.
All right.
Well, we're close to the buzzer here.
So one more quick question for you, and this is advice for listeners. For anyone listening who is thinking, you know, maybe they
don't have a huge Snowflake bill, but this has gotten their wheels turning on, hmm, like I
wonder what, if anything, is super inefficient in, you know, the way that we're executing stuff on Snowflake,
like where would you have them start looking?
Like where's the best place to start doing that investigation?
I honestly think, so the first thing I would say is the Blue Sky team is
incredibly knowledgeable on this.
This is not just me like giving you a sales thing of saying, come talk to us.
But finding people that have
a lot of expertise in this space is actually really hard, specifically because when they develop
that expertise, they either have a lot of contextual information, like people at big Snowflake-consuming
companies know about their own unique patterns, but they don't necessarily see the swath of other
ways Snowflake is being used. So I would say,
you can reach out to the Blue Sky team.
The other thing I would do is,
there's a lot out there. Like, Snowflake's
definitive guide just came out.
And there's a lot of like great material in there.
There's blog posts.
And there's a lot of sessions like this,
like really people who are spending time in the space
who kind of share the tidbits of best practices,
like the order by or the auto suspend stuff that we discussed.
There's a lot of information.
We have a blog where we're slowly creating more and more content.
But the main thing I would really do is honestly is experiment.
Like try some of this stuff out.
You can do it.
And Snowflake has made it so easy just to try these queries
at a smaller capacity or even spin up a test instance.
So actually playing with the tools and technologies, I think is pretty powerful.
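A cheap way to do that experimentation, roughly in the spirit described here and with all names illustrative: clone the table you care about, which is zero-copy in Snowflake, and point your trial queries at a tiny warehouse that suspends itself quickly.

```sql
-- A scratch warehouse that costs very little when you forget about it.
CREATE WAREHOUSE IF NOT EXISTS scratch_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE;

-- Zero-copy clone: no extra storage until the clone diverges from the source.
CREATE TABLE analytics.orders_scratch CLONE analytics.orders;

USE WAREHOUSE scratch_wh;
SELECT region, COUNT(*) AS orders
FROM analytics.orders_scratch
GROUP BY region;
```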
So helpful. Vinoo, this has been such a great episode. I learned a ton. The multiple analogies
were great. So thank you so much for spending some time with us.
Absolutely. Thank you both so much. It's been awesome being here.
Glad we got a chance to chat.
So really appreciate the time and hope this was helpful.
My takeaway from this, Costas, there were a lot.
I love the car analogy.
I obviously dug into that multiple times. It was also really helpful to hear someone so technical mention how difficult it is to actually get
a sales contract from initial conversation to signature. And I just really appreciated that.
It was funny because Vinoo is obviously a brilliant person to even be able to perform all of those job functions. I mean, not only is he brilliant from sort of an engineering and data perspective, but interpersonally, obviously, to be able to actually be a salesperson too is a whole different skill set. So I think that's a really rare combination, but it was just really enjoyable to hear,
you know, sort of hear it from the other side
to hear an engineer say like,
well, I mean, it is so hard, you know,
to actually like get a sales contract through, right?
You know, whereas on the other side,
it's like, you don't understand how difficult it is
to like, you know, scale a distributed system,
you know, or, you know, whatever it is. So that
was my big takeaway, along with all the other great stuff, but that just made me smile. Yeah, yeah. Like, I
really... I think what I really enjoyed from the conversation is the definition of, uh, like the
workload as capital allocation. I think that was like very, very interesting to hear their
phrasing. And in general, like, this whole mental model of
how to think about your data infrastructure and how it is utilized and how you can optimize it
and what optimization means at the end. I think that was probably the most valuable part of the
conversation, at least for me, and hopefully for many listeners out there
who sooner or later will face the need to optimize also for cost and not just for performance
or SLAs in terms of latency and stuff like that. So yeah, we probably need another episode with him.
There is more stuff to discuss, and we'll get into the more detailed,
like, technical details of the solutions that they have.
So I'm looking forward to having him
on the show again in the future.
Absolutely.
Well, thank you so much for listening.
Subscribe if you haven't,
tell a friend about the show
and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on
your favorite podcast app to get notified about new episodes every week. We'd also love your
feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.