The Data Stack Show - 106: Optimizing Query Workloads (and Your Snowflake Bill) with Vinoo Ganesh of Bluesky Data
Episode Date: September 28, 2022
Highlights from this week's conversation include: Vinoo's background and career journey (2:43), How to benchmark cost (7:54), How Bluesky addresses rising Snowflake bills (14:01), "Workload" as defined by Bluesky (17:14), Space for BI optimization (22:55), How products manage bill growth (28:34), How to optimize your workloads (35:37), Bluesky's partnerships (39:53), Getting real-time feedback on your work (44:50), Where to begin reevaluating your Snowflake game (50:47)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Today,
we are going to talk about a really interesting topic, and it's ROI related to all of the data
workloads that you run. Kostas, I know that you have questions about what the definition of
workload is. We want to dig into that. But we're going to talk with Vinoo from Blue Sky.
And what I'm really interested in is on their website,
they say, if you're spending $50,000 or more
on your Snowflake bill, you should talk to us
because we can help you drive better ROI,
which is fascinating.
So I want to know about that number.
I also want to know about their
relationship with Snowflake, right? Because reducing your Snowflake bill, like, are they friendly with
Snowflake? So that'll be interesting. So I have so many questions to ask, and then, you know, of course,
what does the product do. But how about you? Yeah, I mean, it's not that difficult to spend 50 grand on Snowflake, right? So you know how easy it is.
So yeah, it's going to be very interesting to hear war stories.
Let's say what they experienced with their customers.
I definitely would like to chat about the definition of workloads
and what they see out there in terms of what is the most expensive part of the operations around data.
And yeah, also the other thing, which I think is going to be quite
interesting is, I know that right now the product is focusing on Snowflake,
but what it means to take this kind of product, this kind of service and
deploy it on different data warehouses
or data lakes or data infrastructure in general.
So I think it's going to be a very interesting conversation.
There's a lot of discussion lately about the cost of Snowflake.
So I think it's the right timing to have this conversation today.
I agree.
Well, let's dive in and talk with Vinoo.
Yeah.
Vinoo, welcome to the Data Stack Show.
We are super excited to chat today. Thank you, Eric. Very excited to be here.
All right. Well, give us your background. You've done some really interesting things
in some really interesting industries. So tell us about your background and then what led you to
Blue Sky. Absolutely. I started my career off at Palantir.
Was there for almost seven years.
I did virtually every job you can imagine,
from software engineer,
building some of our core distributed systems,
to salesperson selling our product,
to deploying it,
in commercial, healthcare, and military environments,
before eventually leading our core compute team.
So every bit and byte of data that flowed through Palantir flowed through my team at
one point.
After Palantir, I realized that we had built these incredibly powerful analytical tools,
but a lot of our customers and a lot of just analytics tools consumers didn't have the
data size or scale to warrant the power of these tools.
So I decided to focus on that area and built a data as a service company.
Veraset, I think it's at about 15 million ARR now, still chugging along.
Oh, cool.
Really fabulous.
Thanks.
Yeah.
Still doing well.
Really focused on how do we take a huge amount of data, make it accessible and make that
data accessible to consumers who don't have to do these crazy expensive cleaning operations.
After that, an old friend from Palantir reached out, and I ended up joining Citadel,
the hedge fund, leading business engineering for Ashler Capital.
So building all the tools, technologies, managing the data engineering team for
the last-mile aspect of portfolio managers' alpha generation processes, just trying to help people make money effectively.
Before, actually, I guess right after, another mutual friend made an introduction and I got introduced to Blue Sky.
And Blue Sky, where I am right now, I'm on the founding team.
Our goal is really to provide an optimization mechanism and a mechanism for people to introspect their own query workloads and their own data cloud workloads and really get the maximum
ROI of their data cloud. And that means a number of dimensions.
Awesome. Okay. I have a question about your time at Palantir
because the spectrum of job titles that you mention
is astounding in many ways, right?
You just rarely ever hear of someone
who sort of goes from software engineering to sales,
to sort of owning the data platform.
And it sounds like multiple jobs in between.
I would just love to know, having that breadth of perspective inside of a single organization,
what was some of the most interesting things or unexpected things that you learned doing
such drastically different roles?
Absolutely.
I think this is one of the things that Palantir does best,
where it's almost like you can have 10 different jobs
just in the same umbrella company.
So first and foremost,
the reason that I ended up forward deploying,
as they would call it,
is that a lot of the design decisions that,
I guess, my fellow engineers and I made
on some of the early data storage products
were not always optimal.
And until we had real customer workloads,
and I'm using "workloads" again,
but that workload-level understanding,
there was no way we could have designed a system
that actually made sense.
So I think the first big and surprising thing
was truly how different,
anyone who's worked in production software will
know this, but how different production versus development is. Really just understanding
how we build software, how we actually develop a user focus, especially when the tools and
technologies aren't directly consumer facing. Like a distributed system or data storage system,
you wouldn't think of as being
particularly customer facing. But all the decisions that you make from a design perspective,
from everything from compliance to storage, all directly affect the user experience.
The second thing is really almost the value of having a technical slant,
not necessarily in your sales cycle, but augmenting your sales
cycle. Being able to actually communicate with the people procuring your software with a deeper
understanding of why things are implemented the way they are, some of the challenges and
limitations, I think all gave me a lot of respect for the engineering background that I had.
And conversely, going back to the engineering side,
really understanding how hard it is to move a contract from an initial POC to an enterprise
agreement. It's just so difficult. Yeah. Yeah, that's great. Actually, I'm glad you brought
that up because I was going to ask you from the engineering side, you know, a lot of our listeners are technical. And so, you know, that's really helpful to hear
that perspective on the sales side, right? Because I'm sure, you know, for salespeople,
building production software probably seems really, really hard, right? Moving a contract
is difficult too. Okay. Well, I know Kostas has a bunch of questions, but I actually want to start
with a really specific data point that you list on the Blue Sky website.
And I think this will be just a great jumping off point.
So I know that ROI, you know, ultimately kind of boils down to, you know, what is it producing. And I'm a marketer, so that $50K number stuck out to me.
It got my attention for those reasons alone. But I'd love to know why that specific breakpoint
and what does that number, whether or not it's the perfect number as a proxy for what Blue Sky
helped solve, what is represented underneath that? And I think specifically,
I'd love to know, how can we help our listeners benchmark cost even?
Absolutely. So I will say transparently, the 50K number was kind of a number that was just picked.
However, it's one of those, like a backronym, where we picked the number and then realized,
wow, this is actually indicative of something pretty powerful. I think what's been really interesting, especially with something
like the Snowflake ecosystem is starting off as a small scale user. And Snowflake is an incredibly
powerful tool. SQL is really easy to write. There's all these built-in integrations,
actually a blessing and a curse. But the number one thing that I've heard from our Snowflake customers
is the speed at which you can ramp up your Snowflake spend
by actually doing things that add business value
is unparalleled.
So you deploying like a Sigma or like a DBT,
these are incredibly powerful technologies and tools,
but they almost add this exponential growth aspect
to your Snowflake spend.
And so that number in particular, I think it almost is the beginning, almost the precipice
of, I'm now going to become a heavy Snowflake spender or heavy Snowflake user.
And so Blue Sky has customers and partners that go anywhere from that 50K number up to
the double-digit millions in Snowflake spend.
And so where you are on that data journey or that data process, and when you actually
decide to engage us, tells a lot about how you think about your utilization of a data
cloud.
So we picked that number largely because it really does look like the precipice of starting
to expand your utilization
and your snowflake footprint.
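For listeners who want to put a number on that precipice themselves, here is a minimal sketch of benchmarking spend from Snowflake's own metadata. It assumes access to the SNOWFLAKE.ACCOUNT_USAGE schema, and the dollars-per-credit figure is only a placeholder, since the real rate varies by edition and contract.

```sql
-- Approximate monthly spend per warehouse over the last six months.
-- CREDITS_USED comes from Snowflake's metering view; the $3/credit rate is
-- purely an illustrative assumption -- substitute your contracted rate.
SELECT
    DATE_TRUNC('month', start_time) AS month,
    warehouse_name,
    SUM(credits_used)               AS credits,
    SUM(credits_used) * 3.00        AS approx_dollars
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('month', -6, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, credits DESC;
```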
Super interesting.
And one follow-on question to that,
and this is probably multiple questions
packaged into a single one,
but when you think about the,
let's just use the examples that you mentioned, right?
So of course, SQL is easy to write. And I mean, it is wonderful that we live in an age where you can
deploy Snowflake, start writing SQL, get a huge amount of value in a really short amount of time.
But then when you think about a tool like Sigma, or even dbt, which might be a better example,
dbt in particular is pretty low in the stack in terms of where it interacts with the data,
and then it sort of pushes value out in a large variety of contexts, right? So you almost
have what I would call ROI fragmentation. So there's the cost side of it, but then how do you think about ROI in such a fragmented
way because it's touching so many parts of the business?
That's not necessarily a simple calculation.
Definitely.
I think maybe this is the finance side of me, but anytime I think about ROI, I think
about really like, am I effectively deploying my
capital as a business? And what I mean by that is not necessarily like, am I spending a certain
dollar amount, but the dollar amount that I'm spending, am I actually spending that in the
most effective way possible? You can kind of think about it in the, I was like using this car
analogy. Like if I'm driving around in a car, I can either be very gas efficient or like very
bad at consuming gas.
I'm slamming the brake or slamming the gas.
I'm going to burn through a lot of gas really quickly.
Even the car that I use matters, like whether I'm driving a Hummer around, it's going to
be guzzling gas like crazy.
So I'm paying for the gas either way.
Does that gas consumption actually add the value
that it should to my business? So when I think about ROI, I don't necessarily just think about,
am I getting a dollar value back for this effective cost of compute that I'm putting in?
Am I deploying that capital for my business in the most effective way possible. As a concrete example, I think dbt is a super powerful tool, right?
Being able to test and almost have like a CI, CD process around SQL is incredibly powerful.
Absent something like dbt, you can run a series of failed queries, like one after another
after another, racking up more and more cost. Now, the capital deployment
of just letting a query run and failing
is a horrible way to deploy capital.
But I could instead use a tool like dbt
and almost get all of that,
dbt has its own costs,
but get that failed query capital back
and deploy it against another business critical problem.
That to me is a much better ROI
and a much better deployment of capital.
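To make the failed-query point concrete, here is a hedged sketch of how you might surface that wasted capital from Snowflake's query metadata. It assumes ACCOUNT_USAGE access, and elapsed time is only a rough proxy for cost, since per-query credit attribution isn't directly exposed.

```sql
-- Who is burning compute on queries that never return a result?
-- TOTAL_ELAPSED_TIME is in milliseconds; used here as a rough proxy for cost.
SELECT
    user_name,
    warehouse_name,
    COUNT(*)                                  AS failed_queries,
    ROUND(SUM(total_elapsed_time) / 3.6e6, 1) AS failed_query_hours
FROM snowflake.account_usage.query_history
WHERE execution_status = 'FAIL'
  AND start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY failed_query_hours DESC;
```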
Yeah.
Makes total sense.
Exactly.
Okay.
I'm going to ask one more question,
but Costas, I feel like I've been hogging the mic.
Vinoo, could you help us understand?
So let's say I'm looking at my Snowflake bill.
It's 75 grand.
We're starting to have internal discussions around like,
okay, you know, we're getting some inquiry about like, wow, this cost is really ramped up.
Describe the process of how Blue Sky would come in and help us address that situation.
Absolutely. So the first thing is, as any engineer does, we start out with data, right?
We want to understand not just the data of what the bill is, but what actually makes
up that bill and why does it look the way that it does?
So the first thing that we do is we never need access to any of your business data or
anything other than metadata of your query history.
From that, we can actually tell using some proprietary algorithms.
First, where is your compute actually going?
Am I over-speccing my warehouses in Snowflake?
Do I have a bunch of idle compute?
Do I have these massive queries that take up thousands of credits after one execution?
So we really start with an understanding of
what is the
information that I have on the ground from Snowflake. From there, we start introspecting
by adding our own kind of flavor and opinions into our product. Some of the examples I gave,
like warehouse idle credits, or even an ability to look at a query and say, this insert, this table rather, is ordered by a particular column.
Consumers of that table should take advantage of that order-by and filter where they can.
Those are insights that we can display as well.
So it starts with the understanding and onboarding of the unique aspects of a data cloud.
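As an illustration of the kind of metadata-only check being described here, and not Blue Sky's actual algorithms, just a rough heuristic over the same inputs, you can compare metered credits against query activity to spot warehouses that are mostly paying for idle time or are over-specced.

```sql
-- Warehouses that consume a lot of credits relative to the query time they serve.
-- A big gap between credits and query hours often points to idle time or over-sizing.
WITH credits AS (
    SELECT warehouse_name, SUM(credits_used) AS credits_30d
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1
), activity AS (
    SELECT warehouse_name,
           COUNT(*)                        AS queries_30d,
           SUM(total_elapsed_time) / 3.6e6 AS query_hours_30d
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1
)
SELECT c.warehouse_name, c.credits_30d, a.queries_30d, a.query_hours_30d
FROM credits c
LEFT JOIN activity a USING (warehouse_name)
ORDER BY c.credits_30d DESC;
```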
Then we look forward.
So it's really easy to say, okay, well, we're in
this position now, let's just do like a P zero tourniquet cost cutting exercise, only to end up
in the same situation three months from now when the spend has grown. So what we instead do is also
provide tools and mechanisms for controlling costs from a guardrail perspective as you move forward.
And these are ways of coalescing
functionally equivalent queries
or semantically equivalent queries together
to actually attribute a cost
or even highlighting something like a misconfiguration
where I've sized a warehouse a particular way
when the workload doesn't actually warrant
that sizing of the warehouse.
So it's a data-driven approach that really starts with visibility before extending into this insights level of what you can manually change before building BlueSky's end vision,
which is an automated tuning and healing layer. Eventually, you're going to get tired of
implementing these insights. Maybe just turn on an autopilot and we can figure out what to
do for you. Super interesting. All right. Well, that's a great point on which to hand it off to
Costas. Costas, thank you for your patience. Thank you, Eric. Thank you. So Vinoo, let's
talk a little bit about workloads, right? I mean, people are using data warehouses,
obviously, for analytical purposes,
but there are many different things that are happening
in a data warehouse before we can get a dashboard
or a report or whatever, right?
So, can you help me understand,
like, how do you define a workload in Blue Sky?
And yeah, let's, we'll get deeper into that.
So let's start with this.
Sounds good.
To me, a workload is, you know, in the older terminology, there's like OLTP, OLAP,
and like our batch or streaming or analytical compute.
For me, a workload is really going back to that finance.
It is the way I'm deploying my capital in my data cloud.
So the workload can involve anything from me writing data, persisting that data, me
actually doing Snowflake's auto-clustering behind the scenes, to me repartitioning data,
me even doing things like a reverse ETL process
of writing data out of the cluster.
So these don't fall into necessarily batch analytical
or these clean definitions of what were previously
like your, I'm a batch compute heavy workload.
It's much more so how I'm utilizing that compute.
That's how I think about the workload.
So what are like, let's say some common categories of
compute utilization that you see out there?
So first and foremost is I would have never expected this before Blue Sky.
Although I think you can kind of guess it's there,
but the big ones are really BI, like all the business intelligence tools like Looker, Tableau, Sigma has some of these.
There's just such an inundation of wanting to get insights out of my data, my system, that BI actually accounts for a huge amount of the workload.
Now, whether or not these dashboards are actually actively used or consumed, they are the ones
writing these automated queries.
The challenge with BI, especially in terms of like a workload perspective, is a BI tool
is not working, let's say, nine to five.
It will execute its queries whenever it wants.
It will do data refreshes at any time.
So your heaviest consumers can actually be something like BI tooling.
So I think the second is, and I'm going to kind of pick the ones that I think are unique.
The second is maintenance.
And few people actually think about maintenance in terms of what needs to happen for your
data cloud to operate optimally.
And these are literally things like Snowflake's repartitioning or
re-clustering, where I want my data...
Like I want to partition, I'm using partitioning and clustering interchangeably
here, but I want to cluster my data a certain way, and there are some
maintenance operations that need to happen, Snowflake spins up compute behind
the scenes, to actually ensure that I'm able to read tables the way that I want and the tables
look semantically the way I want them to. So I kind of grouped that all into maintenance,
which is distinctly separate from even something like compliance. CCPA, GDPR, these workloads are
the right to delete. I actually grouped these into a separate type of workload because they involve both this
like linear scan or like taking advantage of some unique file format way of scanning
through your data, actually deleting and making incremental changes.
So I think these are the, and then of course you have your analytics, someone going on
writing a ML job or writing some kind of like just one-off SQL query to get a table back.
And you have your ETL pipelines that come from a variety of sources as well.
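A rough sketch of how that kind of bucketing can fall out of query metadata alone follows; the warehouse names, tags, and categories below are purely illustrative, and real classification is messier.

```sql
-- Bucket 30 days of query history into coarse "workloads" using naming
-- conventions and query types. Adjust the patterns to your own environment.
SELECT
    CASE
        WHEN warehouse_name ILIKE '%LOOKER%'
          OR warehouse_name ILIKE '%TABLEAU%'  THEN 'BI'
        WHEN warehouse_name ILIKE '%DBT%'
          OR query_tag ILIKE '%dbt%'           THEN 'ETL / transformation'
        WHEN query_type IN ('DELETE', 'MERGE') THEN 'maintenance / compliance'
        ELSE 'ad hoc / other'
    END                                       AS workload,
    COUNT(*)                                  AS queries,
    ROUND(SUM(total_elapsed_time) / 3.6e6, 1) AS compute_hours
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY compute_hours DESC;
```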
Yeah, it's interesting that you didn't mention ETL as one of, like, the main workloads out there.
Why is that?
Or you just included it as part of BI?
Like, how did you see like ETL being part of the workloads there?
It's a great question. So ETL is always the, you know, it's kind of the bedrock of like,
if I'm using data and there's going to be some cleaning process, some transformation process,
some load process, or our extract process, all of these kind of
live in the same ecosystem, almost. But the reason I don't think about ETL as prevalent of a workload is because it actually
tends to be the place that people are investing most of their time and energy. It's not the long
tale of, oh, I built this dashboard two years ago and forgot about it. It's really like, this is
clear business value because every day these tables need to be updated. They need to be transformed.
They need to be written to.
So deploying capital against ETL jobs is almost an easier justification than deploying it against BI tools that may not have the right consumers or may not generate as much business value.
Makes sense.
Okay.
You mentioned like maintenance, compliance. BI, in terms of what you've seen out there,
I would expect that BI is one of these things
that's kind of, let's say, predictable, in a way,
outside of, okay, let's say you have interactive analytics
where obviously you need to sit on top of your BI tool
and start experimenting with queries
and all that stuff.
But when you have dashboards, you can deploy quite a few different methodologies to optimize
the process.
Like materialization, for example, is one of them.
Or caching, right? There are tools, and BI is one of these processes that has been around
for a very long time. So database systems have really evolved around it, right? But what do you
see happening out there? Because obviously, there's a lot of space for optimization from what I understand. So is it like we are missing the right tooling to do that?
Is that it? Like, why is it that there is so much space
still for optimization when it comes to BI?
It's a great question.
I think in the past, BI fell in this category of like read-only
in the sense of I'd have a dashboard, it was executed once,
and it would just be, you know,
like on a page for someone to consume.
In the new world of data applications,
like there's a lot of these companies
like Streamlit, Houseware,
that are doing these really,
I think Snowflake actually just acquired Streamlit,
doing these really powerful
like data application creation.
You as a non-technical user, or I don't want to say
non-technical, but a less technical user, can now interact with the platform in a way that you
previously didn't really interact with it. Not just filtering, but I can actually bring in and
join with no-code or low-code solutions, other tables, and create new derivative data products
just from my own system.
So materialized view creation, caching, they all solve that root-node problem of, you know, compute happening over and over again, by
persisting that, but any of the derivative products, even notebooking
tools like Hex, I think is a really cool product as well, you can create
all of this derivative value and all of these derivative data products
that still kind of live in the realm of BI, although people are using Hex for ETL also,
but still kind of live in this BI tool, BI world, independent of, I guess, a previous
just like an individual dashboard that was just sitting there consuming data with no
one really looking at it.
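For the root-node case that materialization does solve, here is a minimal Snowflake sketch. The table and column names are made up, and note that Snowflake materialized views are single-table, Enterprise-edition objects whose background maintenance consumes credits of its own, which circles back to the "maintenance" workload above.

```sql
-- Pre-aggregate a hot dashboard query so BI refreshes read a small result
-- instead of rescanning the raw table on every refresh.
CREATE MATERIALIZED VIEW analytics.daily_order_totals AS
SELECT order_date,
       region,
       COUNT(*)    AS orders,
       SUM(amount) AS revenue
FROM analytics.orders
GROUP BY order_date, region;

-- The dashboard then hits the compact aggregate:
SELECT region, SUM(revenue) AS revenue_30d
FROM analytics.daily_order_totals
WHERE order_date >= DATEADD('day', -30, CURRENT_DATE())
GROUP BY region;
```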
Mm-hmm. And do you feel like we need new tooling to optimize this new,
let's say, I wouldn't say new workloads, but new facets of existing workloads?
Like what do you see there?
I mean, obviously there is an opportunity, that's why Blue Sky is out there, but
what should a database system
do to account
for these new ways of interacting
with data and make the full process more
performant at the end?
Yeah. So the interesting thing is SQL's
been around forever, right? Just the ANSI SQL
standard has existed. The
execution engine has been the thing that's
been particularly played with
over the years.
Snowflake is effectively Oracle without the DBAs, deployed in the cloud, that you can manage on your own. But the execution engine is the thing that actually does a lot of
the magic of Snowflake, the clustering and the ability that you can write a query and
potentially never have it fail, it can just keep spinning. Whereas if you do the same
thing in something like Spark, it just crashes. Those are double-edged swords. So if you look at
something like, well, we'll look at Databricks. I think Databricks and Spark are such a great
company with a really cool technology. They're investing so heavily, things like Photon,
Catalyst, all of these technologies that are really just made for
the purpose of optimizing a query execution. I would even say optimizing, making a query
execution more predictable. That's really what I think they're doing. So in terms of the need
of tooling, for me, it's for as long as we have people who are going to be authoring queries,
we're going to need people who either are educating folks
on how to author queries in the most optimal way
or automated tools and solutions that just abstract that problem away.
This may be an imperfect comparison,
but anyone who worked with the old C++ memory management things
has experienced challenges of memory leaks forever.
So knowing when to, like, malloc
or deallocate memory is really, really hard.
And so people built layers on top of that.
Java became one of the predominant technologies.
And then we have like G1 garbage collection,
all of these new ways of actually abstracting
that problem away.
So what I see us, or this space in particular, doing
is just adding a layer of abstraction
that handles the complexity of otherwise having to optimize low-level SQL code based on table
semantics or query semantics. Yeah. So I have a question, actually, and this is for you,
Vinoo, and Costas as well, because I know that you've looked at some of these tools.
One interesting dynamic, just to dig in a little bit deeper on some of the tools that allow an end user to actually drive up compute. Because those tools can offload compute to Snowflake, okay, well, this is enabling a lot more people
to do a lot more things, but it's creating a huge bill on the backend because you're just
hammering compute. How do you see those products managing that? Because that's, in my mind, a non-trivial
component of your product, sort of the optimization, right? I mean, there's literally
entire companies obviously built around query optimization, of course. Obviously that's
exhibit A with Blue Sky, but I'd love to hear your thoughts on that. Like, how do you
see those products managing that?
Do you want to go first, or shall I pick you?
Yeah, I can. I mean, I have, and that's also what I wanted to ask you.
Okay, traditionally, let's say the database system has the query optimizer, right? So you have, let's say, a piece of the technology that is one way or another responsible to go out there and make the best possible choices to execute the query in the best possible way.
Obviously, that's a really hard problem.
It will never be completely solved, blah, blah, blah, like all that stuff.
But at least you have access to that.
Traditionally, the DBA, that was the role of the DBA.
When things start going wrong, I can use the query optimizer,
the planner, the explain commands, blah, blah, blah,
all that stuff to see what's going wrong
and figure out ways to manually optimize things.
When we put so many layers of abstraction in between, and
I'm talking specifically for things
like Looker and BI
tools where you also have languages
that you use to model the
data, and there's another
piece of software there that takes
the data model definition
and does whatever it wants to do,
the user is like,
how do you even try to tackle
this problem, right?
Like a query that is generated by Looker that then is optimized by the query optimizer and
then turns into a plan and gets executed.
How do you even figure this out?
In my mind, at least, it's really, really hard, right? So how can we,
I mean, abstraction is good,
but it also adds complexity.
So how do you think
we can tackle this problem?
Absolutely.
And so I think,
I'm going to go back
to the metaphor
of how I think about Snowflake,
where anyone,
any query author,
I think of as
someone driving a car
and their goal is to get
from point A to point B.
The way that, or how much gas or how much fuel they consume on that journey doesn't just depend
on their ability to become the best driver in the world. It depends on so many things,
the kind of car they're driving, the environment they're driving in, who else is on the road.
And the kind of parallel here is, if I were to say a warehouse in Snowflake, so logical grouping
of compute cluster is the car, the optimal route selection or the optimal gas consumption to get
to that end route depends not only on the person, but also on the car. The best driver in the world
can still use a bunch of gas driving a Hummer to
wherever they want to go.
It's kind of the same thing.
If I'm authoring a query and my query optimizer is particularly amazing, it's done everything
perfectly, that's only a part of the equation.
The second part is, where do I choose to execute that query?
Am I bin-packed with like incredibly computationally expensive queries?
So I'm going to actually slow down and I can't scale up that much.
Am I going to be able to have any kind of data locality, depending on the
technology that I'm using at that point?
So all of these come together to, even if the query is written in the most basic
query, like select star from this table, there's
still so many other elements that are involved in my query execution.
When I say the level of abstraction, I also mean if we are able to, like we
could train every driver to drive optimally and even in that situation, all these
external factors could throw things for a loop.
So what I really mean is how do I augment that driver either by extending what the
query optimizer can do, but also by adding contextual information around
street conditions or road conditions, to keep the same
terminology, like what other queries are being executed, the car, the size of my
warehouse that I have, the number of clusters that I'm scaling up
and down. So abstracting that entire problem space away such that a user is executing an
individual query doesn't have to worry about that is incredibly powerful. Let me make this a little
more concrete. If I'm doing, so Snowflake has had a very interesting thing with this terminology
warehouse. A warehouse doesn't mean anything. It's just like a logical collection of EC2 instances. And so how people actually use
or name these warehouses really changes from organization to organization. If people use
Looker, their setup instructions say "Looker warehouse." Or if they're a little bit more
detailed about this, they'll say "Looker extra small warehouse," "extra large warehouse."
But the challenge is the logical grouping is dependent on the product, not the actual workload of that individual technology. So if Looker is actually coming in every day and firing a query
once every 24 hours, that happens to execute on this massive extra large warehouse that has a very high auto-suspend,
I'm going to be spending a lot of money on that one query.
It may even be overspecced.
So all of these problems coming together
and the contextual information is really how I think about solving this.
Instead of a layer of abstraction on top of just the query,
it's on top of the data cloud as a whole.
It's very interesting. And I want to add another dimension to this problem,
and I want to ask you specifically because you have also worked in the financial sector.
So you know how people get motivated to optimize based on the profit that we can have, the alpha that we can generate at the end, right?
We all strive for this alpha at the end.
So I keep remembering cases, for example, like BigQuery, right?
There was this case where you could use select star with limit 10, right?
So you would expect that the query engine would just read 10 values and return them.
No, it would go and actually scan the whole data set and then return just 10.
But at the same time, that's how the query pricing works, it's based on how much data it
reads during the operation.
So there's a lot of motivation there to actually do that because that's how the product can
make more money.
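A hedged illustration of that BigQuery behavior, with hypothetical table names: the exact bytes billed depend on partitioning and clustering, but the principle is that billing follows the columns scanned, not the rows returned.

```sql
-- Scans every column of the whole table even though only 10 rows come back,
-- because BigQuery prices on bytes read from the referenced columns.
SELECT * FROM my_project.analytics.events LIMIT 10;

-- Reads only two columns, and prunes to one day if the table is
-- partitioned on event_ts, so far fewer bytes are billed.
SELECT event_id, event_ts
FROM my_project.analytics.events
WHERE DATE(event_ts) = '2022-09-01'
LIMIT 10;
```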
And especially in consumption-based models, I think this is a very strong drive to guide
how the engineering teams there, or the product
teams in Snowflake, or whatever company there will make certain choices.
So how important do you think, outside of the technology itself, the abstractions
that we put there, are also these other factors, like the pricing models that the
companies have, or, I don't know, like the contracts and
the business side of things? Like, how much do they also affect, at the end, how much it will cost us and
how we should optimize, at the end, the workflows that we have? It's a great question. I think one
of the really interesting things is how Blue Sky approaches this. Normally
people would think, and it's entirely understandable,
Snowflake must hate us, right? Snowflake is like, you are taking all of our money away and it's causing a bunch of issues. So this has been completely opposite from Snowflake's actual
reaction. Blue Sky is actually a Snowflake partner, which is super interesting if you think
about it. And I think a lot of this goes back to Snowflake's consumption-based pricing model.
So arguably, you can look at it and say, they want you to spend as much as possible to pay
them as much as possible. But there's a danger underlying this. It's almost like looking at the
finance side. If you put all of your money in one stock, it's generally very high risk.
And so I think Snowflake recognizes that problem. And for them, this optimal deploying of capital has a multifaceted benefit.
First, they have solutions architects who kind of function like Oracle's DBAs, and
these solutions architects are really interested in helping companies grow their data cloud in a responsible way.
And I think that's incredibly powerful.
Because Snowflake realizes, if I'm spending my entire compute budget and we optimize this one query, maybe you try a new BI
problem or a new business problem, like a business venture, with this compute money that you've now
saved. You're actually further entrenched in the Snowflake ecosystem. So diversifying the workloads,
like diversifying your investment, is actually really beneficial. And I actually think that's
one of the best discoveries, I think, that Amazon had
as well, where even back in the startup that I was at previously, we would focus on, well,
if they gave us compute credits or some way of offsetting spend with private pricing,
I'm going to take that money and use Macie or Redshift and try something completely new.
So from the consumption-based pricing model perspective,
I actually think a diversified investment
is better for the data clouds.
And so even having their solutions architects
potentially at some point use Blue Sky
and say, here are the areas we can cut costs
or here's areas we can redeploy capital
is incredibly powerful.
I mean, Eric, one thing to your previous question,
building these data apps and like, you know,
almost Snowflake is now this like API, right?
It's almost like, I forget who called it this on LinkedIn somewhere,
but they were saying Snowflake's building their own like Apple app store
where you can build all these data apps backed by Snowflake.
And I think it's a really great characterization
because it actually shows that Snowflake is
now handling the backend computation of all of these tools and technologies and enabling
people higher in the stack to generate business value who ordinarily wouldn't be able to do as
much work given their lack of experience building some of these technical products.
So it's kind of the same thing.
If I am Snowflake, I'm not necessarily interested in just optimizing as much compute out of this one app developer
because it means that they can't spend money building, refining, doing other things.
So actually running those optimizations behind the scenes or deploying a tool that can help these folks who are creating data apps grow and scale, I think is really powerful.
And one example I will give, given that they are public, Houseware just won Snowflake's
startup challenge a few months ago.
And these guys are an awesome team.
They're building really cool products.
I'm not an investor, but I think they're actually really cool.
And so one of the things that they're doing
is helping people build these data apps
and being able to build a data app.
If you're a small company,
just like a CTH company,
you can have all the technology,
but be terrified of that big compute bill
from some user accidentally running you up.
So it's the guardrails around safe compute as well
that I think are really powerful.
Yeah, that makes a lot of sense.
And one last question for me before I hand it back to Eric.
Is Blue Sky right now, like, working only over Snowflake, or do you also support other data cloud
solutions?
So the irony of me right now is I actually didn't know anything about Snowflake
until I started working at Blue Sky.
My experience is like fairly heavily Spark and Databricks.
So right now we're focused on Snowflake for two reasons.
First, Snowflake is, you know, it's a big dominant player in the ecosystem.
And I think there's a lot of opportunity in Snowflake in particular. Just from a configuration or a query perspective, there's a lot we can do, especially
with also SQL. So right now we're focused on Snowflake, but that will almost certainly change
as time goes on. So do you see there more opportunities in systems similar to Snowflake or also in systems that are more like
Spark? The reason I'm saying that is because as a computation model, they're very different,
right? Very different types of deployments, different teams that are involved.
So how do you see the difference there between, like, a system like BigQuery or
Snowflake or Redshift and then systems that are more like Athena or EMR and Spark or Databricks?
How do you see the difference there? It's a good question. So I want to say, taking
a step back from the perspective of just Snowflake,
going to Blue Sky, I always say, not to sound like a broken record, but it's really about this
efficient deployment of capital, right?
So I would not necessarily say, I mean, this is great, right?
If we can optimize someone's spend, that's awesome.
But I wouldn't say my goal is to go into a customer and like bring their Snowflake spend
necessarily, like, as far down as humanly possible.
My goal is instead to have them effectively deploy their capital.
So if they have a bunch of failed queries or they're not using certain BI tools, that's
actually what I'm trying to address.
Not necessarily just negotiating their price down or something.
So I think when I look at something like Spark or Databricks, the number of levers that you have makes that problem like an N-dimensional problem.
In Snowflake, for example, I don't have to set something like Spark's driver memory or Spark's executor memory.
I execute the query and have some t-shirt-size warehouse size that it's going to run on.
But it really depends on like Snowflake to execute that completely
independently.
So I think when we move out of Snowflake and move to other technologies, I mean, BigQuery
doesn't have a lot of these knobs, neither does Redshift, but the dimensionality of how many
variables we have to tune does become more and more complicated.
Our goal is really to look from an organization or team-wide perspective.
Not necessarily like, this query is slow.
Let me optimize this individual query.
It's really across the organization.
Here's what you're trying to do.
Let me instead help you optimize that.
And I'll give you an example that may be interesting for some of the listeners.
Incremental pipelines we're seeing all over the place now.
So a table, it's appending over and
over and over. And oftentimes, because businesses are moving so quickly, we've noticed more than a
handful of cases where an incremental pipeline, a table is being incrementally appended to,
and the downstream consumers of that table still do a full linear scan of all of the table without
actually looking at the diffs. And that's actually a really hard problem to identify without a tool that's
actually looking for that as like a best practice.
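One way to avoid that full rescan in Snowflake is a stream on the incrementally loaded table, so downstream jobs read only the rows added since their last run. A minimal sketch with made-up names:

```sql
-- Track only appended rows on the incrementally loaded table.
CREATE OR REPLACE STREAM raw.orders_stream ON TABLE raw.orders APPEND_ONLY = TRUE;

-- The downstream job consumes just the diff; reading the stream inside a DML
-- statement advances its offset, so the next run sees only newer rows.
INSERT INTO analytics.orders_enriched (order_id, order_date, region, amount)
SELECT order_id, order_date, region, amount
FROM raw.orders_stream
WHERE METADATA$ACTION = 'INSERT';
```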
And so the dimensions that we can go or the areas we can expand in Snowflake
itself actually lend themselves to saying, given the multidimensional problem in
Databricks and other places, it's actually almost easier and more focused for us
to focus on optimizing this one sole area.
I mean, I will say this,
I, even being very knowledgeable on Spark,
like leading Palantir's Spark team,
I have no... I don't know how that dimensionality
is going to make it easier or harder for us.
It could be a really challenging space
or it could be something where we can apply
similar principles. Absolutely, absolutely. Makes total sense.
All right. So, Eric, all yours. Okay. I want to return to the car analogy
as we get close to the end of our time. So, one interesting thing, the car analogy with the
driver is really helpful, right? Because you have sort of training, you know, a driver trained to operate the vehicle in
a resource efficient way.
And then the vehicle, to your point, has a huge amount to do with it.
But, and I'm going to really extend the analogy probably to the point of breaking now to make
my point and formulate my question.
But if you think about this, I was actually thinking about this the other day because
I was driving a car from the 80s that's really old. And you can basically watch the gas gauge
go down while you're driving. I think it gets like eight miles to the gallon or something, you know, and also, like, it's not very fast. So if you like push
the car really hard, you're getting a lot of physical feedback as a driver that tells
you like, hey, you are definitely using a lot of fuel here, and the car you're
driving, it's like really loud and you
can see the gas gauge going down, right? So you're not only getting feedback on your own driving,
but you're also getting feedback, you know, from the vehicle itself. Then if you sort of look at
the modern version of that, right? Like you get in a Prius, you know, that's like a hybrid vehicle
and it will give you real-time feedback on the economy of your driving style, right? And even like the
efficiency of the vehicle itself and sort of conserving resources. So again, I'm drawing
the analogy out a little bit about that, but like, if you think about executing a query
in Snowflake, just in the raw SQL editor, you have like the little time counter. And that's like,
that is basically your only physical feedback. And then if you have a data app on top of that,
where you're doing something that doesn't give you any physical feedback, it creates this weird dynamic where it's hard as a user to actually get the information that you need in order to optimize while you're doing your job.
And I'd just love to hear your thoughts on that as part of this problem set.
I mean, I know that Blue Sky comes in and helps, and you've talked about, okay, do we have a completely automated solution?
But it is interesting.
I mean, to your point,
like I don't believe that there's,
you know, malicious,
like we're going to obscure all this so that our NDR is crazy, right?
I mean, you know,
it's like they have to have a balanced approach to that.
But there is actually not a lot of feedback
that helps you sort of
while you're doing the actual work itself,
adjust what
you're doing to account for resource usage. Yeah. So I had one of our customers ask me,
maybe a month ago, two months ago now, you know, why don't you just build a SQL query
linter that just tells you, gives you the red Microsoft Word squiggly lines that says this is
not optimal. And honestly, it's not a bad idea.
The reason that I think this is a challenging problem is because in your Prius or in whatever
car, you have that one dimension.
You have, I'd argue the brake and the gas are two sides of the same coin.
Yeah.
You can not slam the brake as aggressively, or not slam the pedal to put gas into it. The challenge is with that as your only
lever, actually the problem space of what you can do to fix whatever is being found on the dashboard
is much smaller. If I say, hey, this query is not being run optimally, or if I even said you're
doing a linear scan over this data set, the amount of knowledge and expertise it takes to figure out how to solve that problem in an optimal
way, it's actually pretty massive.
And even if I were to say something like, I mean, we'll use like Java as like garbage
collection, right?
I could just say in like C++, like you've allocated this thing.
My IDE is saying you didn't destroy it properly somewhere else in the code.
Well, those are great during effectively the product runtime.
But when they have external dependencies not in the same file, it becomes a really complicated, almost intractable problem.
Knowing what to do is the second step.
And it's not always easy, even for like really experienced query authors. We actually, I'm our deployment lead. So when I go to customers, I actually use BlueSky
and I will say, here's areas that I think you can do optimization. But for me, it's
still like, let me actually introspect your query. Let me understand the table, not just
schema, but like attributes at a fundamental level.
All of that effectively has to be surfaced to the point that you can tell a nice story around what needs to happen.
And that's almost why, rather than just kind of lint in Java and say, here's all the things
you can do to better optimize your memory, let's just build a garbage collection tool.
Let's just handle that for you.
That's kind of our end state.
Let's just handle some of these challenges for you.
And actually, the one additional dimension is
some of these challenges you may not even understand.
Like in a multi-JVM app,
if I have two things running
and I have one service
that is thrashing the hard drive
of whatever box it's running on,
me as a second service may have no
idea why my job is so slow or being queued or IO is as bad as it is. Yeah. Yeah. If I had to,
if I had to summarize that, it would be you as a driver, like shouldn't have to worry about all of
these various inputs because it's a much bigger problem than like gas pedal and brake pedal.
Exactly. You should just be able to focus on driving.
Exactly.
And I think one of the things Snowflake has done is Snowflake is actually, I don't know
if people think about them this way, but they're a giant multi-tenant system.
So everyone can share data with everyone else.
Everyone's executing queries in the same, technically the same AWS or GCP infrastructure as everyone
else.
So you're really part of this massive cluster that's doing all this computation that can
actually affect a lot around whether or not you're getting your queries surfaced and run
in the exact same time every single time.
Yep.
All right.
Well, we're close to the buzzer here.
So one more quick question for you, and this is advice for listeners. For anyone listening who is thinking, you know, maybe they
don't have a huge Snowflake bill, but this has gotten their wheels turning on, hmm, like I
wonder what, if anything, is super inefficient in, you know, the way that we're executing stuff on Snowflake,
like where would you have them start looking?
Like where's the best place to start doing that investigation?
I honestly think, so the first thing I would say is the Blue Sky team is
incredibly knowledgeable on this.
This is not just me like giving you a sales thing of saying, come talk to us.
But finding people that have
a lot of expertise in this space is actually really hard, specifically because when they develop
that expertise, they either have a lot of contextual information, like people at big Snowflake-consuming
companies know about their own unique patterns, but they don't necessarily see the swath of other
ways Snowflake is being used. So I would say,
you can reach out to the Blue Sky team.
The other thing I would do is,
there's a lot out there. Like, Snowflake's
definitive guide just came out.
And there's a lot of like great material in there.
There's blog posts.
And there's a lot of sessions like this,
like really people who are spending time in the space
who kind of share the tidbits of best practices,
like the order by or the auto suspend stuff that we discussed.
There's a lot of information.
We have a blog where we're slowly creating more and more content.
But the main thing I would really do is honestly is experiment.
Like try some of this stuff out.
You can do it.
And Snowflake has made it so easy just to try these queries
at a smaller capacity or even spin up a test instance.
So actually playing with the tools and technologies, I think is pretty powerful.
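A cheap way to do that experimentation, roughly in the spirit described here and with all names illustrative: clone the table you care about, which is zero-copy in Snowflake, and point your trial queries at a tiny warehouse that suspends itself quickly.

```sql
-- A scratch warehouse that costs very little when you forget about it.
CREATE WAREHOUSE IF NOT EXISTS scratch_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE;

-- Zero-copy clone: no extra storage until the clone diverges from the source.
CREATE TABLE analytics.orders_scratch CLONE analytics.orders;

USE WAREHOUSE scratch_wh;
SELECT region, COUNT(*) AS orders
FROM analytics.orders_scratch
GROUP BY region;
```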
So helpful. Vinoo, this has been such a great episode. I learned a ton. The multiple analogies
were great. So thank you so much for spending some time with us.
Absolutely. Thank you both so much. It's been awesome being here.
Glad we got a chance to chat.
So really appreciate the time and hope this was helpful.
My takeaway from this, Costas, there were a lot.
I love the car analogy.
I obviously dug into that multiple times. It was also really helpful to hear someone so technical mention how difficult it is to actually get
a sales contract from initial conversation to signature. And I just really appreciated that.
It was funny because Vinoo is obviously a brilliant person to even be able to perform all of those job functions. I mean, not only is he brilliant from sort of an engineering and data perspective, but interpersonally, obviously, to be able to actually be a salesperson too is a whole different skill set. So I think that's a really rare combination, but it was just really enjoyable to hear,
you know, sort of hear it from the other side
to hear an engineer say like,
well, I mean, it is so hard, you know,
to actually like get a sales contract through, right?
You know, whereas on the other side,
it's like, you don't understand how difficult it is
to like, you know, scale a distributed system,
you know, or, you know, whatever it is. So that
was my big takeaway, along with all the other great stuff, but that just made me smile. Yeah, yeah. Like, I
really... I think what I really enjoyed from the conversation is the definition of, uh, like the
workload as capital allocation. I think that was like very, very interesting to hear their
phrasing. And in general, like, this whole mental model of
how to think about your data infrastructure and how it is utilized and how you can optimize it
and what optimization means at the end. I think that was probably the most valuable part of the
conversation, at least for me, and hopefully for many listeners out there
who sooner or later will face the need to optimize also for cost and not just for performance
or SLAs in terms of latency and stuff like that. So yeah, we probably need another episode with him.
There is more stuff to discuss, and we'll get into the more detailed,
like, technical details of the solutions that they have.
So I'm looking forward to having him
on the show again in the future.
Absolutely.
Well, thank you so much for listening.
Subscribe if you haven't,
tell a friend about the show
and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on
your favorite podcast app to get notified about new episodes every week. We'd also love your
feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.