Drill to Detail - Drill to Detail Ep.71 'The Rise of Snowflake Data Warehouse' With Special Guest Kent Graziano

Episode Date: July 22, 2019

Mark Rittman is joined in this episode of Drill to Detail by returning guest Kent Graziano, Chief Technical Evangelist for Snowflake, to talk about Snowflake Data Warehouse's cloud-first architecture... and recent product announcements at Snowflake Summit 2019.

Try Snowflake Data Warehouse for free
Watch Summit 2019 Keynotes
Snowflake Materialized Views: A Fast, Zero-Maintenance, Accurate Solution
Modern Data Sharing: The Opportunities Are Endless
Drill to Detail Ep. 5 'SnowflakeDB, and is Data Modeling Dead?' with Special Guest Kent Graziano

Transcript
Starting point is 00:00:00 Hello and welcome to Drill to Detail and I'm your host Mark Rittman. So I'm very pleased to be joined today by an old friend that I think I must have met the first time back in 2005 when I blogged about his conference paper on agile methods and data warehousing. And he's still at the forefront of the data warehousing industry today. So Kent Graziano, it's great to have you back on the show. And how are you doing? Great, Mark. Thanks for having me again. It was fun the last time. I'm looking forward to having another great conversation with you. Okay, Kent. So just for anyone who's not heard of you, tell us what you do currently. You work at Snowflake. What do you do there? And maybe do
Starting point is 00:00:50 a little bit of an introduction as to, I suppose, how you got there. Sure thing. So I'm the chief technical evangelist at Snowflake. And I've been with the company for a little over three and a half years now. And I'll say I accidentally tripped over the company attending a big data meetup in Denver, which was, oddly enough, put on by the Rocky Mountain Oracle User Group. And I got introduced to the company, the technology, loved the vision, loved the product, the way it was presented. And it just seemed to solve so many problems to me that I had seen as a data architect throughout my career
Starting point is 00:01:34 that I just had to become part of it. Career-wise, like you said, you and I met back in the early 2000s in the Oracle Development Tools user group. I started in the Oracle community actually in 1989, so way back version five of Oracle. And I was, you know, I've worked in the Oracle community for that entire time. I got to meet a lot of folks like yourself, presented conferences, eventually turned into the data warrior, is the name of my blog, and became very much focused after, I'll say, about the a book with Bill on data models and just went from there in doing all kinds of different data warehouses for all sorts of organizations. I've worked in and out of various industries, either as a consultant, sometimes as an employee.
Starting point is 00:02:40 I worked for HP for three years working on their internal data warehouse on NeoView. So I managed to get some exposure outside the Oracle world to other technologies. So that was really my introduction to the MPP world was HP NeoView. And then got to do a little work at SQL Server as well throughout my career as a data warehouse architect and consultant. And all along the way, thanks to the mentorship of the president of the Rocky Mountain Oracle User Group back when I started, I learned about doing presentations and giving back to the community. And so that sort of built my profile in the industry and where you met me, obviously, presenting at Oracle User Groups.
Starting point is 00:03:25 And all of that experience led me to where I am today, to being an independent consultant and then starting a blog and then getting the opportunity to go to work for this fantastic, what you now call the late stage startup company, Snowflake, and getting to be their chief evangelist globally. Yeah. And I mean, actually, in some respects, I owe you an apology because when you first joined there, I was, I wouldn't say I was skeptical, but I was certainly, I didn't, I suppose in a way, I didn't anticipate the meteoric growth that Snowflake's had. And I suppose the real world problems it solves for people and the way, I suppose, it has introduced cloud technology and cloud benefits to the sort of data warehousing world. And it's been, certainly been a meteoric sort of few years
Starting point is 00:04:14 for Snowflake, really. I mean, maybe just give us an idea of, for anybody who doesn't know, anybody who doesn't know who Snowflake are and I suppose how that product and company came about, maybe just do a little potted history of what Snowflake is and who they were. Oh, sure. So Snowflake is based in Silicon Valley, actually in San Mateo, California.
Starting point is 00:04:35 The company was founded in 2012 by two guys who were originally at Oracle who had an idea for building a brand new relational database that would be built specifically in and for the cloud and be able to take advantage of some of the key features that are available in the cloud that we don't have in on-prem systems. And so they actually designed and built a brand new architecture specifically to support high speed analytics. So it's got all the best features that we've all seen. It has the ability to scale incredibly like an MPP. So it's got MPP features. It's got columnar features. But it's fully relational, fully SQL-based,
Starting point is 00:05:27 and also addressed some big data issues that weren't being addressed by the mainstream vendors in that we can ingest semi-structured data directly into a column in a table in Snowflake and then write SQL against it. And this is one of the things that attracted me to the company. It was these features. These were problems that were in the industry
Starting point is 00:05:50 that people were having to spend a lot of money and a lot of engineering time trying to solve, I'll say manually, being able to parse out all this semi-structured data and just to put it in relational tables to run queries against it. Well, Snowflake invented a new data type. I mean, they were building a brand new database from scratch.
Starting point is 00:06:11 So why not invent a new data type that solves a problem that nobody else has solved? So all that came along. And so they spent about three years in stealth. And then in 2015, had their first GA release. There was already a couple of beta customers running. And I'll say in late, about six weeks later after they went GA is when I saw my first Snowflake presentation. And at that time, the company was about 80 people. After a month or
Starting point is 00:06:40 so there of me following the company, I had the opportunity to apply for a position and was hired in to be the evangelist for the company. And at that point, when I started, there was a hundred people in the company. And you talk about the meteoric rise. Well, let's talk about the meteoric growth of the company as well. The product was so promising and solving so many key problems that we were able to attract nearly a billion dollars in venture capital funding, which then allowed the company to grow. So little over three and a half years ago, late 2015, there was 100 people in the company. Now there's over 1300 people worldwide.
Starting point is 00:07:20 We've got went from being a company based in Silicon Valley, marketing primarily in the U.S., to having a huge presence all across EMEA, as you yourself have now experienced. And now we also have offices over in Asia-Pac, specifically in Australia, New Zealand, and Singapore. And it's been phenomenal. And I understand your skepticism when we first went there. It was the technology and what we are able to do in the way we're able to scale, handle the semi-structured data, seems unbelievable to experienced data warehousing people. Because those of us who have been in the industry i've been in it for almost 30 years now we've never seen technology that could do this and all promises from every vendor that they were going to handle all these kinds of problems turned out to be you
Starting point is 00:08:15 know um pretty far reaching and for the most part never really materialized so to see a product that was able to do this is quite stunning. And I very frequently had people asking me, it's like, well, you know, how is this possible? This doesn't seem like it could really do it, which is one of the reasons Snowflake's sales motion has always included doing a POC because it's seeing is believing. If you get people, I look at people like myself who have been doing this for years, folks that you and I know, Mark, in the Oracle Ace Director community, people in the Oak Table Network, those guys aren't going to believe it just because you told them it can do it. They need to see it because that's it. The seeing is believing because it is such a drastic difference from what is currently available in the legacy technologies and the on-prem technologies.
Starting point is 00:09:13 You really do have to see it. But one of the reasons and the way it was able to be accomplished was the fact that our founders did indeed write this from scratch. They didn't base it on any existing technology. It's not like another rev off of Postgres or a wrapper on top of Hadoop. It is a true relational database, SQL-based, designed though specifically for data warehousing, analytics, data science. And that's what allowed them to build this thing out and really make it the incredible engine that it is. And I think in part, the rise of the company and the uptake of the technology is because it actually does deliver on all the promises and people are seeing that and you know companies all over the world are able to put in you know very large
Starting point is 00:10:12 amounts of data and scale as they need without a lot of heartache engineering maintenance it just the architecture with this the multi-cluster shared data architecture, which separates the compute from the storage is a total game changer. It's a brand new architecture where before we had shared nothing and shared disk architectures. This is multi-cluster shared data.
Starting point is 00:10:40 It's completely separated that storage from the compute, which allows us to then design and create independent compute clusters to access the same common data set, which has been the dream of data warehousing for 30 years, to have that single source of truth for the data. Everybody's looking at the same data, but now we can do it without the contention of having everybody trying to query the data and load the data all at the same time against a finite set of resources. And that's where the architecture and the ability of the cloud changed everything to be able to have these independent compute nodes access that data so you don't have the contention, so you can run ETL with one set of compute
Starting point is 00:11:29 and run BI queries with another set of compute and have your data scientists pulling the data and doing advanced analytics on it with a different set of compute, but it's all against the same engine, and that's the same data. That's why it's so exciting to see this. And really why I do believe you've seen this meteoric growth is because once people saw what
Starting point is 00:11:52 it could do, and some bigger customers got on board and started telling people about what they were able to do, well, then people started looking at it more seriously and saying, how can we lower our total cost of ownership in our analytics platform? How can we get to market faster? And indeed, this is a follow on to what you first saw me doing, agile. How can we do agile data warehousing? How can we deliver faster? Well, one of the things that's always slowed us down was we might start building and gain a lot of momentum, have a highly
Starting point is 00:12:25 successful platform, but then we get overloaded with now we have too many users and the box slows down. And now we got to go through a procurement process to buy a bigger box. And then we got to port the data from one box to another or worse yet replicate the data. So now we've got the same data in multiple physical servers. And the architecture that Snowflake has built allows us to solve all of those problems so that we can now start building out an analytics platform incrementally and have success and
Starting point is 00:12:59 have growth and not be boxed into a corner in a data center. Let me stop you there. Let me stop you there. That's fantastic. I mean, there's a huge amount to unpack in what must be the longest ever response to a question on the podcast. It's great to be back on the show, Ken. So there's a huge amount in what you said there.
Starting point is 00:13:24 So let's kind of go back and pick through some of that there's um so so again to me and just as observation for me is is the thing that i thought was unambitious about snowflake was actually i suppose a masterstroke which was the fact it actually although it behaves differently to what we're used to seeing in the past um it actually you the way you manage snowflake the way you can do it use it's all it's all done through the command line, it's SQL. And the fact it kind of behaves like a data warehouse and things like zero copy clones and so on makes sense to people who are used to things like Oracle and SQL Server
Starting point is 00:13:54 and so on. It has the air of familiarity whilst, as you say, solving a few problems. So let's kind of take a step back. And I think to understand what Snowflake does, I think it's maybe good to understand what was there before and maybe what problems, what were the limitations of those architectures. So, you know, you and I have been doing this for a long time and, you know, you mentioned, I think, some ancient databases there back at the start. You look back at things like Teradata, Oracle and so on, you know, how does a typical on-premises data warehouse server work? And what were the limits that you started to hit back, you know, a, what, how does a typical on premises data warehouse server work? And what were the limits that you were starting to hit back at, you know, a few 510 years ago, really? Sure, the, the traditional boxes for one are just that they're boxes, right? It's a, it might be a server for a set of servers, a disk, a disk array.
Starting point is 00:14:47 But there were always, as you configured these, you had to know ahead of time pretty much what are we going to need? How much resources will we need for our data warehouse? And by then, with that knowledge, we then purchase a box, whether it's an appliance like Teradata or Exadata, or you're installing a SQL server or a Vertica or something like that onto a physical server, you've got to size the box. And I even remember the last consulting gig I had before coming to work at Snowflake, the first day, very first day, first conversation I had with the VP of IT, he said, I need to know how big a box I need to go by. And my answer was, I can't tell you
Starting point is 00:15:40 because I know nothing about what you want me to do yet. I don't know how big a data warehouse you're going to need. I don't know how much source data you have. You haven't told me how many users or what the applications are. So I can't tell you right now what size box you need to buy. And he followed that with, well, that's problematic because it's going to take at least six weeks. Once we figure out how big a box we need for us to get it, purchase it, and then probably a few more weeks to get it installed in our data center. And so that's the traditional world of having to have a piece of hardware and having sized that piece of hardware to the anticipated need in our data warehouse.
Starting point is 00:16:26 And this was true regardless of which of the databases or appliances you were working with. And by virtue of the decision you make on what to buy, that puts an end state boundary on how big you can get, how much CPU horsepower do you have so that it limits one of several things. It either limits the size of the data warehouse, how big can it get, how many terabytes of data could it be, or it limits the number of concurrent users. How many processes can we have running at any one time? And for the folks who built very successful data warehouses in those platforms, often they ran into the issues of, wow, we now have to regulate when our load process runs
Starting point is 00:17:23 and when our queries are being run because we can't run them both at the same time. We have to have an onboarding system now for adding new users because we're right near the limit. And maybe we can handle 10 more users, but we can't handle 100 more users. And so those were limitations that actually limited the ability for a data warehouse program to be successful within an organization. And certainly limited their ability to be agile and adjust to changes in the business environment and the demands of their constituents in trying to serve data. Those were all kind of limitations that we had in those environments. Okay. And so, again, back in those days, there was quite a big debate about, you know,
Starting point is 00:18:13 whether you went down MPP or whether you had shared nothing or shared everything or whatever. And yet you mentioned about the way that Snowflake works with its separation of compute and storage. I mean, that's something that we've heard about from Hadoop before, and it's an idea that's been around for a while. But why, I suppose, in a way, how does Snowflake's architecture differ from MPP or sort of shared everything? And why is that, in your view, a better way of architecting data warehouse servers today?
Starting point is 00:18:40 Yeah, so the key difference, well, there's a couple of key differences. First off, of course, is the fact that the storage and the compute are separated. So the storage, we'll say, taking a note from the big data world, being able to put all your data in a single centralized set of low-cost storage right now because this is a database it's wrapped with tables and schemas and databases just like we had in the traditional world and so this is to your point earlier you said it has that familiar feeling so yeah absolutely you you log into a snowflake and you see databases you see schemas you see tablesas, you see tables, you see views, you see sequences, you see constraints, you see users and roles.
Starting point is 00:19:32 So that's all very, very comfortable. way more people that know and understand SQL than any of the other programming languages, whether it's MapReduce or Java or even Python, that in the database world, in the world of analytics, people know SQL. So that was key to the architecture that this was a SQL-based data warehouse system. So we have the storage, and you see all these common objects. Then on the compute side, though, we now, with that separation in our architecture, you now have complete control over provisioning the compute. And I think of it as just-in-time provisioning.
Starting point is 00:20:24 I need to run a job. I need to run, let's call it, I need to do a big load. I want to load a whole bunch of data from my ERP system. So I write a traditional ETL-type batch process, but I have to have power for that. I need that compute. So I can create what we call a virtual warehouse in Snowflake and say, how big does that need to be? Is it one node? Is it two nodes? How big is that cluster? What kind of throughput do I need? And I can configure that on the fly, give it a name, and then start running my process. And I can turn on a thing called auto suspend. So when the process is done and the compute's no longer needed, it automatically spins itself down.
Starting point is 00:21:08 So I'm not paying for that resource when nothing's happening. And that's, again, that's a concept of elasticity and pay-as-you-go that we get with the cloud. And this is where back to why we are so unique in that our founders wrote this system to take advantage of those sorts of things. Now, we wanted to have another set of users who are going to query, say your finance department needs to run month-end reports. We can configure a separate set of compute for them. And it's going to run against the same database and
Starting point is 00:21:45 databases as that ETL does. So we're not having to replicate the data, but yet we've completely been able to separate the compute resource so our ETL now can run during the day. It doesn't have to be in a limited batch window. We can start addressing things like near real-time loads while our report users are running their queries and the thing that the differentiator there over the shared nothing and shared disk is that separation the compute from storage but then our cloud services layer that wraps around that, that does the transaction management, the ACID compliance, make sure that you have read consistent views. So I fire off a query. I don't want that data changing in the middle of the query execution.
Starting point is 00:22:38 Yeah, the ETL is running under the covers, but we've isolated that. So we now have a read consistent view and our cloud services layer allows us to do all of that so that we can have multiple virtual warehouses running with different constituents, asking different kinds of questions and different kinds of queries. You could have a streaming process loading. You could have a batch process loading.
Starting point is 00:23:04 You can have somebody running a data science algorithm against this data. And it's all been separated and managed for you. And this is the coolest part about this, is the end user, or even the database architect, the data warehouse architect, doesn't have to set anything up. It's all built in. So they don't even have to worry about this. All they need to figure out is how much compute does this process need, and then set on the auto suspend and auto resume. Then when people hit it, it turns on. When they're done with it, it automatically turns off. And that's something that no other system that I've seen can do and put it all together in this nice data warehouse SQL wrapper
Starting point is 00:23:52 that we all understand. Okay. Okay. So let's maybe contrast this. I mean, let's maybe contrast this with other approaches that have been taken so um so you another another um another data warehouse database that that also has ability to scale is so big query or any i suppose any of those i suppose serverless um maybe less um less kind of like um data warehousey sort of like um sort of services things that are more like table as a service and so on. I mean, you know, if you look at how scalable, say, BigQuery is or things like that, how, again, why were the design choices taken that Snowflake did? And, you know, why would, as opposed to you really, why do I find that Snowflake is more used than BigQuery, really, in client projects?
Starting point is 00:24:42 And this is, you know, my observations over the last couple of years. And in part, it's even the answer to the question is in the way you even ask the question. Those query services are just that. They're query services. They are not fully encapsulated databases. And so someone confirming a traditional data warehouse background, yes, those query services have some great use cases that they're absolutely useful for,
Starting point is 00:25:12 but they aren't necessarily useful as useful across a broader array, like a data warehouse is, and like a databases. There are more use cases can be served by one than the other. And that's where I think the challenge is, is in looking at what are the requirements? And this is really gets down to what's the actual business use case? Is it a point solution that there's a very specific thing that we need to do this and this is all we need to do? Or are we talking about enterprise data warehousing and an enterprise data analytics platform that, you know, as companies grow, they want to minimize some of that maintenance
Starting point is 00:25:55 and the hand coding. And so we're really looking for, like you said, that familiarity of a common standard database interface that we already have staff who knows how to do that, especially true at the enterprise level, right? You've got database administrators and database architects and data warehouse designers that understand this world. And I think in part that that's why we've seen this uptake is because the learning curve is much lower. I mean, you yourself can probably speak to this better than I, because I know you spent a lot of time coming up to speed on BigQuery and implementing it with your customers and contrast that to the learning curve of logging into Snowflake and writing create table and select
Starting point is 00:26:47 from. It's a very different experience as well. And I think in really the, you talk about the design choice. Why did our founders make that choice? Well, because, well, for one, they were database people and they had seen the market. They saw where the market was going. They could see the need and demand throughout the years of the big data and data lakes and all of that coming to fruition and say, you know, we need a technology that's going to be familiar and easy to use for the, you know, millions of database professionals out there in the world who are trying to make data more useful to the organizations and really start to treat data as an asset. And we need to make it simpler for people to do that and allow them to get more and more data in to petabyte scale and get more and more users in as we try to reach this,
Starting point is 00:27:57 the idea of data democratization and the citizen data scientist to be able to add all these people into the system and have it not be an onerous task so that it's much simpler to do. Okay. Okay. So, okay, other extreme then. Why, if you're looking for familiarity and you're looking for fully, you know, a full range of SQL functions and queries and so on, why shouldn't, I mean, I suppose in a way, why is Snowflake leading the market, I suppose, in data warehouse cloud services and not, for example, the traditional vendors, the SAPs, the Oracles, the SQL servers that have taken what is a very well-known database engine added a degree of elasticity to uh to those services and made them available to all their customers in the cloud
Starting point is 00:28:50 you know how comes snowflake is doing so well in that you know with that kind of competition because in reality and this is something a quote from benoit one of our founders when I first started, those are all being built on the existing code base. And what they discovered themselves through their years of experience was it was very difficult, if not impossible, to refactor that original code to truly take advantage of the cloud and these capabilities, which is why they invented an entirely new architecture. All the other folks you're talking about, they're either shared nothing or they're shared disk, right? And you just cannot get that full elasticity and flexibility that Snowflake has with those other technologies.
Starting point is 00:29:44 In some cases, people have effectively just taken an existing technology and put it on a VM in the cloud and called it a cloud data warehouse. Well, okay, it's a data warehouse in the cloud, but it's not a cloud-based data warehouse. It's not a cloud-native data warehouse. It wasn't written for the cloud. It was ported to the cloud, similar to what they used to do when porting from one operating system to another. The fundamentals of the architecture and the functionality are still pretty much the same. They may be able to make it a little easier by putting a nice front end on it and a little more GUI-driven
Starting point is 00:30:28 from an administration perspective and obfuscate some of the maintenance and work that needs to be done under the covers by making it truly easier on the front end for the administrators to do. But in some cases, if you're talking about the things that are platform as a service,
Starting point is 00:30:48 your DBA still has to do all the same work as if it was in a data center. Because in truth, it is in a data center. It's just not a data center you own, right? It's a data center that your vendor owns or a cloud provider owns. And so the work is still there. And that was, again, part of the goal that our founders had
Starting point is 00:31:09 was to make this much simpler. And anything that was even remotely redundant or rote that an administrator would do, well, we can automate that. So let's do that. Let's automate that and make it so much easier so that we can have agile data warehousing and we can have companies start small and grow to as big as they need to
Starting point is 00:31:35 without having to go through a crazy procurement cycle because they suddenly, you know, they spec'd a box thinking it would last for five years and they exploded themselves as a company. And in three years, they've used it all up. And now they've got to go back to the well and get more budget and go out and go through a procurement process. And all of that's gone with Snowflake. It's all gone.
Starting point is 00:32:01 Okay. So let's get on to it. So as I said, you know to to recap on this bit really i mean it's it's been um certainly been uh a pleasant surprise to see how well you guys have done and uh you know full disclosure my company is a snowflake partner i work on many snowflake projects now and it's actually the fact that i was working on them all the time that i actually contacted you and said this is actually probably a good time to uh to have another kind of conversation because um
Starting point is 00:32:25 certainly i use the product all the time now and for me it's features like say zero copy cloning it's it's it's it's all all the stuff around that you know that makes it very um nice to work with really um but one thing i'm conscious of is is since we first spoke a couple years ago or since i probably last saw you in the states um there's been a bunch of uh product um enhancements features initiatives and so on that i thought would be interesting to get your opinion on as a person who knows databases very well and data warehousing very well. Maybe I'm interested with these as to, you know, what problem these things were solving and what's the real kind of, I suppose, innovation behind them. So, I mean, the first one I want to speak to you about is the work that's
Starting point is 00:33:03 going on you guys are doing around, I think, data share houses or journey data exchanges and so on. Tell us a bit what that is and tell us, I suppose, how it leverages the Snowflake architecture to do this and so on. Sure. So, yeah, so data sharing or I think our marketing term on that was the data share house. So as a feature, what that allows companies to do, it allows you to build a, I refer to it as a curated data mart in your Snowflake account and associate it with an object we simply call a share and say anyone who has access to the share can see this
Starting point is 00:33:46 data so it might be a schema it might be a single table it might be a set of views but it allows you to encapsulate a set of objects and then grant access to those objects to other snowflake account holders so now I log into my Snowflake account and I have a shared database that you built. And I can query it. It's a read-only database. I can't update it. It's read-only.
Starting point is 00:34:16 But this has eliminated the need to export data. First, you always had to design it, right? So if I want to share data with you today i had to go and design my database then i have to build an export process probably to flat files i have to put it up onto a secured ftp site somewhere you download those flat files and you build an etl process to then load that into your data warehouse. And of course, that all takes time and money. And how often do I refresh this data? You know, is it monthly, quarterly, annually? Well, we may decide it's such a pain in the neck, I'm only going to refresh it annually. But really,
Starting point is 00:34:57 you would like to have it refreshed monthly. Well, there's all that mechanism in place with the data sharehouse. Now, we can update it as frequently as you want, and you now see it instantly. So I can get to a near real-time update of this shared data set that's going to deliver more value to the consumer faster without any additional work on their part. And that is groundbreaking. I mean, we talk about like nielsen who invented the idea of data sharing over 100 years ago right they are selling the tv ratings right collecting them and selling them to various agencies now there's no more data transport so and it solved it solved another interesting problem I hadn't even thought of originally was
Starting point is 00:35:46 there's no more redundant data and no more redundant storage necessary. So if I build you a 100 gigabyte database of shared data, well, you've got to have 100 gigabytes somewhere in order to consume it and to access it. Well, now with Snowflake, and this is specifically because of the architecture with the separation of compute from storage, that storage is all in the provider's Snowflake account, but it's only once. It doesn't have to be replicated now. And I can share that to any number
Starting point is 00:36:21 of other Snowflake customers. And that whole concept... But why would they do that? Well, there's really two use cases. One is the data for good use case. So data that people just want to share to help other people augment their analysis for various reasons.
Starting point is 00:36:43 So nonprofit organizations, NGOs, these are all great use cases where they've collected, say, population data, and they need some sort of study done on that, but they don't have the expertise to do the study. Well, they can provide a curated, anonymized set of that data in a snowflake share and share that to any number of other organizations that can potentially augment that data and produce the kind of analysis and reports that they're looking for. And then the monetization is, of course, the other one, just like my example with Nielsen, is people who sell data, who collect data and sell it to any number of consumers. And so you build a multi-tenant style data share that when somebody logs in, they only
Starting point is 00:37:36 see the data pertinent to them, right, that they are actually allowed to see. And so that's really where it scales. And this is where the network effect comes in. Provider number one creates a data set, shares it to 20 other downstream consumers. Each of them in turn may augment that data with their own data that they have in their data warehouse. They can now join it to this in their data warehouse, and they can create a refined data set that they may share back to the provider, or they may share downstream to their consumers. And so you get this radical explosion of the usage of the data,
Starting point is 00:38:21 but also at the same time, allowing organizations to monetize that data. We're talking about data as a true asset now, something that can have a monetary value put on it. And so organizations that didn't have the capacity from either a resource perspective or skills perspective to necessarily do traditional data sharing and data subscriptions and selling data to their consumers. Now they find themselves that, hey, they can now think about that. They can now think about just sharing data with or without a charge to their business partners for betterment of their ecosystem. These are all now opportunities there. And so this now has grown into the announcement at our summit back in June of the launching of the Snowflake Data Exchange. And the data exchange is Snowflake customers who are data aggregators. they have data that they believe other people will want access to and will find value from. There are some data sets that are free.
Starting point is 00:39:30 There are other data sets that require a subscription. But it's now a matter of just signing up for it, and then you get a share into your Snowflake account, and you can start using that data immediately. And so this is a whole new aspect to the data warehousing and analytics world that is just so much easier. Some people are referring it to the new data economy. You've seen all these articles. People are saying data is the new oil. And people are really thinking of data as an asset.
Starting point is 00:40:06 And there is now a certain aspect of the economy that is growing around this. And Snowflake is in the forefront of that with first the data sharehouse concept and the ability to do the sharing. And now having a data exchange, a platform where that's really a data marketplace now that the companies who never thought of it before now can look at the data they have and say, there's other people that can benefit from this data. I'm going to make it available to the world through the Snowflake data sharing. Okay. Okay. So, I mean, that's okay. Well, so other stuff that was announced at Snowflake Summit, there was support for GCP as a cloud platform. I mean,
Starting point is 00:40:53 I suppose in a way that in some respects, that's not very interesting. It's just another way you can consume Snowflake, but it also, I suppose, in a way is quite interesting. I suppose, technically how you've done that is interesting, but I suppose it may be the different use cases it opens up i mean maybe just tell us about what being available on gcp means in practice and uh you know again why why and um and what value is there in that well snowflake was designed from the ground up to be cloud agnostic so the founder's original vision is we did not want to be locked into a particular cloud. So the system itself is self-contained, if you will, and was designed to write to the APIs of the underlying cloud providers.
Starting point is 00:41:36 So that's what allowed us to develop on Amazon and then port to Azure and now port over and have an implementation on Google Cloud is the guts of Snowflake, the intellectual property and the unique functionality of Snowflake is encapsulated in the Snowflake engine itself. It's being powered by the blob storage and the compute of the underlying cloud provider. Now, why this is coming up now is simply the demand. There are companies for various reasons that have their allegiance, if you will,
Starting point is 00:42:13 for whatever reasons, technical or economic, to the different major cloud providers. And so the demand is there, and we are an agile engineering company. The demand is now to the point that it is on our prioritized backlog, if you want to talk in Scrum terminology now. So it's floated to the top. Now it's time to go work on that.
Starting point is 00:42:37 It's time to now build out our offering on Google because we have enough of a customer base that is saying, yes, this is interesting. We want to be able to use Snowflake on Google. And it may be because they have a lot of data already on Google. And that's been kind of what I've seen is, you know, companies who have a lot of data on AWS, they tend to go with Snowflake on AWS. Likewise with Azure and now with Google is that the folks that are going to be, that have a lot of data and a lot of investment in their data already on Google would prefer to access that with Snowflake on Google simply because it's going to be the most
Starting point is 00:43:23 convenient for them and it's actually gonna be the most convenient for them and it's actually going to be the lowest cost for them because otherwise they've got to pull the data out of Google over somewhere else for it to be accessible to Snowflake in another platform. Okay. So tell us a bit about Snowpipe. We've been looking at one of the places, one of the customers I work with is actually putting Snowpipe in place now
Starting point is 00:43:44 to bring in event data and bring in real-time data. But just tell us, what is Snowpipe, and what's it used for, and how does it also relate to things like the Kafka integration that's been announced recently? Yeah, so Snowpipe is our serverless data loading offering. And it really works with you who drop the data into your blob storage, whether it's S3 on Amazon or Google, or sorry, Azure blob storage,
Starting point is 00:44:17 and we can automatically pick up those files and load them into tables in Snowflake. So it's a continuous loading feature and it is serverless. So based on the size and number of files that it senses on a pipe, it automatically under the covers spins up the compute automatically and loads the data and then turns it off. And it saves you that administration because previous to SnowPipe, the primary mechanism for loading data to Snowflake was a copy command, but you had to have a virtual warehouse configured. And so you had to size it and make sure that it was available
Starting point is 00:44:54 when you're running your loading processes. And so this allows people to not have to do that. And actually, it was the first serverless function that we that we introduced to our ecosystem there's an api for that so folks writing javascript and python and other things can address it directly using the api we call that the snow pipe expert mode and then there's an auto ingest feature where it's simply define the pipe, define the endpoints, drop the data in kind of like a, like you mentioned Kafka, like a Kafka queue, and it just picks it up out of the blog storage and loads it in. So that's the basics. But we are also, as you said, announced at our summit that
Starting point is 00:45:40 we are getting native Kafka integration as well. So somebody can use Kafka to stream data right into Snowflake, where previous to that, people used Kafka and would stream it, drop it into an S3 bucket, and then Snowpipe would pick it up and load it in. Well, now we can make that a more seamless process by loading directly into Snowflake through Kafka. And it really, the goal of all these kinds of things is to have an ecosystem that provides choice to the customer. So whatever their preferred engineering method is,
Starting point is 00:46:19 whatever preferred ETL they may already have today or ELT process they have today. We want it to be as simple as possible for them to move into Snowflake without necessarily having to retool everything. Now, of course, as you know, we're seeing more and more demand, especially with IoT data, to do near real-time continuous feeds, streams of data. So there is certainly an increase in people using things like Kafka. And so it's incumbent upon us to provide that opportunity for them. And so working with the various tool vendors to help them build the connectors into
Starting point is 00:47:08 Snowflake and have the facilities to do these sorts of things is part of what we do in trying to make sure that we are able to put our customers first and give them the functionality and tools that they need to be successful. Okay. Okay. What about another thing I saw on my list of things here that I've seen that are interesting about Snowflake? JavaScript stored procedures. So, of course, you and I are very familiar with the idea of stored procedures, and we spent many, many years of our previous careers kind of working and consulting in this.
Starting point is 00:47:39 Tell us what problem JavaScript stored procedures solve, and I suppose, in a way, how are they different to the things that you and I would have been used to in the past, things like PLSQL and that sort of thing? Well, it's, I mean, like you said, stored procedures, there was a demand for it, and we've now added it. But we chose JavaScript because we didn't want to create yet another proprietary procedural language like the other database vendors have done. You mentioned PL SQL. PL SQL is something that Oracle invented. And so the choice to go with JavaScript is it's something that more people know, right? We want to make this, again, accessible to the broader audience and also at the same time not create something that is so proprietary
Starting point is 00:48:27 that nobody else can figure out how to use it right and so that's really that's that's why the choice of JavaScript we didn't want to go and invent yet another procedural language in order to support stored procedures inside of snowflake and from early on when I joined the company, we already had user-defined functions. And user-defined functions in Snowflake can be done with either JavaScript or just straight SQL. And so this is just following, I'll say, in kind
Starting point is 00:48:59 with our stored procedures being JavaScript stored procedures. Okay, okay. our stored procedures being JavaScript stored procedures. Okay. Okay. So we recently had one of the product managers for BigQuery on the show a while ago, and he was talking about, we were talking about the feature called the BI engine that's come out in BigQuery recently. And it made me kind of think, you know, a lot of, one of the things I'd really like with Snowflake, or certainly I've perhaps seen a need for is something that gives us um you know i
Starting point is 00:49:29 suppose more split second response times and and beyond you'd get with a kind of column store database i mean what's i mean i know there's been um materialized views have come out with snowflake recently maybe tell us about those and then maybe kind of if you can talk me maybe talk about where you might see this going or or the problems to be solved in this area or whatever, really. Yeah. So, well, the materialized views, first off, are just they are performance enhancement. specific subsets on very, very large data sets that may be required for dashboards or other kinds of reporting to optimize that performance so somebody's not having to go through and, you know, I'll say do aggregations against 100 billion rows
Starting point is 00:50:17 every time you execute the query, you build a materialized view that does that aggregation. And then under the covers, what we have is another serverless feature that keeps it up to date. So as the underlying table is updated, we are automatically re-syncing that materialized view. And in order, again, I think of it in terms of kind of an ELT process without having to do the ELT, right?
Starting point is 00:50:48 You know, why code it when you can have the system do it? So this is, again, one of those features that we've put in to enable that sort of thing. to what you're talking about, about these very small millisecond response sorts of BI queries, that's definitely something we have some... Yeah, we have some people, customers doing that already with Snowflake, and it is really a matter of figuring out the optimal data model in some cases to facilitate those query patterns,
Starting point is 00:51:27 as well as getting the right, I'll say right sizing their virtual warehouse and our multi-cluster warehouse feature that allows us to horizontally scale a virtual warehouse comes into play there if you're if you're suddenly having a burst of a thousand queries come in against a particular table to to do all these kinds of little queries one of the features you can take advantage of is our multi-cluster warehouse where you may start off with a single cluster you say a small, which is two nodes, and then there's an inbound and a lot more queries, and it starts queuing the queries. It'll automatically spin up another cluster in parallel
Starting point is 00:52:18 and load balance it. And so that's how we address that particular need for, I'll say, high level of concurrency on queries at a particular time is this automatic horizontal scaling of a virtual warehouse. Okay, fantastic. I mean, certainly I found that when I've been, well, first of all, I was asked to do a snowflake tuning exercise for a customer. And what you rapidly find out
Starting point is 00:52:45 is there is very little to tune, which is good, obviously. I mean, it's a column store database. There aren't indexes, there aren't kind of, and so on really. And again, my advice to the customer was to look at the data model and pre-transform and all the classic kind of data warehousing things that we used to do. And I think that certainly other vendors might be able to say, oh, we've got OLAP servers built in, we've got this, we've got that. But it adds complication to it, really. And I mean, just maybe just to explain to us again, how does the multi-cluster thing work? Because in a way, if compute is separated from storage, why is there a need to have a sort of second or third cluster? And how does that work in practice, really?
Starting point is 00:53:27 Oh, sure. Yeah, yeah. So, yeah, the compute, as you said, is separated from the storage. But imagine, if you will, a dashboard that typically on a typical day, 10, 15, 20 users may be accessing it and running the reports on these dashboards. And so that requires a small virtual warehouse. But month end rolls around, and there's several hundred people that need access to that same data. Well, you could go in and potentially
Starting point is 00:54:00 resize that virtual warehouse to something larger. But in truth, the best way to handle concurrency is to have additional nodes. But because you've got a dashboard and say, you know, the dashboard's using this warehouse, you don't want to go in and say, well, if it's user A, go to warehouse one, and if it's user B, go to warehouse two. So instead, we created this thing called the multi-cluster warehouse,
Starting point is 00:54:28 where we can say this virtual warehouse, which might be two nodes, can add parallel clusters that are exactly the same size, and then we automatically load balance those queries. So if there's 100 queries come in and they start to queue, we'll automatically spin up a second parallel cluster and take some of those queued clusters
Starting point is 00:54:52 and move them over to the second cluster so that they can be running indeed in parallel with the other queries. And if the queuing continues, we can spin up a third and a fourth and a fifth, and you can configure it to go all the way up to 10. And so things like the classic Black Friday, you now don't have to build your system for the peak load.
Starting point is 00:55:15 You can configure Snowflake to automatically scale out to handle that peak load, and then when the load passes, it automatically turns itself off. So you're not paying for those additional compute resources when indeed nobody actually needs them. And so this is one of those set it, forget it things. Once you realize that that's the use case for this particular environment, you make it a multi-cluster warehouse by simply saying you define a minimum number of nodes and a maximum number of nodes, and then the system handles the rest. So I like to say no more pagers going off on Black Friday
Starting point is 00:55:56 saying the system's down because we ran out of resources. If you configure it using this, it automatically scales and everybody's happy. The SLAs are maintained, and when it's's over it just scales itself right back down yeah excellent and actually yeah that's that's brilliant um so ken look i'm conscious of the time i've kept you now um and it's been really good catching up with you and uh you know i've um you actually you're over in the uk quite a bit aren't you actually i think you're with your um your your food your food photographs and your foodie kind of interests you uh you're over here quite a bit, aren't you, actually? I think with your food photographs and your foodie kind of interests,
Starting point is 00:56:27 you're over here quite a bit and obviously for Snowflake. But tell us a bit about kind of when you're next speaking and when you're next kind of in Europe and that sort of thing. Well, let's see. Next speaking in the US is actually the Northern California Oracle User Group. I am going to be giving a keynote there and also talking about cloud data warehousing.
Starting point is 00:56:50 How'd you do that? I was invited. The president of the group invited me to come and give this talk. So, yeah. And in truth, Mark, one of the things that's happening in the user community world, even in specifically the Oracle community that you and I are so fond of. We spent many years supporting these communities
Starting point is 00:57:10 and we still do, is that the leadership of the Oracle user groups is seeing the need to provide a diversity of information to their members because the world is no longer homogenous. And to be an Oracle expert is phenomenal, and it's a great career achievement. But as our companies are evolving and we're moving to the cloud, it's not just one technology anymore.
Starting point is 00:57:37 Just like with Kafka and the other streaming methods that we've had to learn, it's no longer, as you and I met, doing Oracle Warehouse Builder. Yeah, the story has changed and so I'll say the the enlightened leadership of several of these major user groups has seen that they need to expand beyond the one vendor policy and really become an educational organization to their members, to empower their members to be successful in their careers. So that's really the truth of it. The next time I'm in Europe, I am scheduled, I am now going to be speaking at the first ever
Starting point is 00:58:12 worldwide Data Vault consortium in Europe. It's traditionally been held in the US and was held here in Vermont back in May. But in September, we're going to be having the first ever event of that kind in Hanover, Germany. And I will actually be coming over to talk about data modeling schema on read. Another one of my, as you know, my favorite topic is data modeling. So Snowflake is actually a sponsor at that event. And then it looks like I may be speaking in Amsterdam, well, Utrecht, specifically at Big Data Expo. We're still working out the details on that. And a Snowflake-sponsored event in Zurich, Switzerland, all within about a two-week time period there in early September.
Starting point is 00:58:59 About to head off on holiday myself here for a couple of weeks, as it is that time of year. And so, yeah, those are the right when I come back, I go to Northern California to do the one talk. And then it's off to Germany and the Netherlands and Switzerland for my next round of talks there in Europe. Fantastic. I mean, just thinking back to when we recorded the first episode with you and the title was is data modeling dead and yet ironically today i actually sent you an email asking if you put me in touch with somebody at snowflake you could help one of my customers with their data modeling questions around snowflake um and and i think that's the kind of the irony not the irony but the uh in a way it shows uh shows how things have kind of not changed but it shows what a good bet I think Snowflake made.
Starting point is 00:59:46 And also probably what a good bet you made in terms of, you know, going to work with Snowflake and sort of focus on that technology. Absolutely. Yes. So how do people get a trial then of Snowflake just to kind of round things off? It's incredibly easy. You simply go to Snowflake.com. There's a button in the upper right corner of our website and you simply click on that and you put in a little bit of information and you get a 30 day free trial. Excellent. Excellent. Well, Kent, it's been great speaking to you. Thank you very
Starting point is 01:00:18 much. Enjoy your holiday and enjoy yourself over in Europe later in the year. If you do make it back to the UK, give me a shout and I'll buy you dinner again. And well done. It's been great to speak to you. Yeah, it's great. Thanks for having me again, Mark. Appreciate it. Thank you.
