Drill to Detail - Drill to Detail Ep. 5 'SnowflakeDB, and is Data Modeling Dead?' with Special Guest Kent Graziano

Episode Date: October 18, 2016

...

Transcript
Starting point is 00:00:00 Hello and welcome to Drill to Detail, the podcast series about the world of big data, business intelligence and data warehousing, and the people who are out there leading the industry. I'm your host, Mark Rittman, and Drill to Detail goes out twice a month, with each hour-long episode featuring a special guest, either from one of the vendors whose products we talk about all the time, or someone who's out there implementing projects for customers or helping them understand how they work and how they all fit together. You can subscribe for free at the iTunes Store podcast directory, and for show notes and details of past episodes visit the Drill to Detail website at www.drilltodetail.com,
Starting point is 00:00:46 where you'll also find links to previous episodes and the odd link to something newsworthy that we'll probably end up discussing in an upcoming show. So my guest in this episode is Kent Graziano. Now, Kent is someone that I've known for many years, actually, through the user group scene around the Oracle world; we were both involved with the Oracle Development Tools User Group. And more recently, Kent's actually joined Snowflake. So Kent, do you want to introduce yourself first of all? Sure. Thanks, Mark.
Starting point is 00:01:15 Thanks for having me on the podcast here. Yeah, you and I go back quite a ways there with the Oracle Development Tools User Group and being on the board of directors. And if I remember correctly, I think you and I were the founding chairs for the BIDW SIG for ODTUG back in the day, as they say. I have been in the database world for quite a long time, for the majority of my career, though I did start off as a Fortran programmer doing scientific programming before Oracle and all the database things came along.
Starting point is 00:01:56 But I've been working with Oracle tools and data modeling, Oracle databases, for over 25 years and been very active in the user community that entire time. I started off doing OLTP-type data modeling, eventually got introduced to Bill Inmon and Claudia Imhoff and got involved in data warehousing, and made a migration a number of years back to being more focused on building data warehouses and business intelligence and advanced analytics type systems, primarily in Oracle, but then eventually moving into SQL Server because, as we know, we have a very heterogeneous world out there these days in the database community. So I did get some experience working with a lot of other databases as well, all doing data warehousing. And then kind of a natural evolution of that is we're now starting
Starting point is 00:02:58 to move into the cloud world. And along came this company, Snowflake Computing, and I got introduced to them last year. And after looking at their technology and a lot of things involved in that company, I had an opportunity to go to work for them. And so I've kind of made that transition. I am still an Oracle ACE Director in data modeling, but now working in the cloud world and big data world as well through my role at Snowflake. Fantastic. And I think, Kent, you're one of the few people that's actually been doing this longer than me. Certainly when I started in this kind of area, I suppose back in about the year 2000, there were a few people then that I probably looked up to, really, in a way.
Starting point is 00:03:48 And who I kind of see as my peers now; I think there's yourself, people like Dan Vlamis, a few people really that were around then and are still around now. And I think, you know, one thing that's impressed me, Kent, with what you've done is how you've kept yourself current. You're working with Snowflake DB now, as you said, and I remember working with you in the past on Designer 2000 and that kind of CASE modeling tool years ago. So it's great to have that history, but it's also great to be, I suppose, relevant now in these days as well. So, Kent, the reason that we're having this conversation is there's been a rumbling debate going on recently about data modeling, as I'm sure you've been following. And in particular, there were a couple of tweets I think you posted, and I posted something back as well
Starting point is 00:04:28 afterwards, about data modeling these days. And I think I made a point at the time saying, the problem with data modeling is, it's all very good, but customers don't want to pay for it. So the carefully crafted dimensional models and third normal form models and all these kind of things that you've done in your career, do you think there is still the demand for that now? I mean, do you still see that these days, in the world of NoSQL databases, of the Gartner BI Magic Quadrant and so on? What's your view on the perception of data modeling these days? Well, it's a really interesting area, because there's kind of two parts to this question.
Starting point is 00:05:11 And, you know, the first one is you mentioned, is there a demand for it? And in my experience, I've been doing, as you said, data modeling for a long time, probably over 25 years in various forms. And the reality, from my perspective, is there's never truly been a demand for it. It's not like people are knocking down the doors hiring data modelers and data architects, at least in most organizations. It's an important aspect of design and understanding a business, but it's often not valued as much as, you know, a data architect like myself would think it should be. So the demand has waned a little bit here with the advent of big data and NoSQL and Hadoop and kind of the data exploration
Starting point is 00:06:08 tools that you yourself have gotten so involved with. There's definitely been a decline, but it's a situation where there are certain use cases where, yeah, arguably data modeling isn't needed, I'll say, up front. And I like to reframe the question now to more of: is data modeling not needed, or is it really a question of when data modeling is needed in the lifecycle of, in particular, a business intelligence project? So my view is there's always a price. The price always has to be paid at some point. And I think one of the things, to my mind, is that, and we've all done this before, we get so wrapped up in the correct way of doing things.
Starting point is 00:07:00 And, you know, we talk to customers and we are so passionate about a well-designed dimensional model and so on that, perhaps put in the wrong way, it can be seen as a blocker for a project. I think it's a good point you've got there, really, that there's always a price to pay, there's always a price to be paid at some point, particularly with the schema-on-read, schema-on-write kind of debate. So going back to that point again, Kent, what does that mean in practice then, really? Well, schema on read, schema on write, that's a perfect framework to think about. This is where we have some of this semi-structured data, and, you know, people call it unstructured, but in
Starting point is 00:07:39 reality it's semi-structured. When you start talking about schema on read, well, the word schema is still in there, right? There's still a design, there's a pattern, there's a structure; it just may not be something that we know right away. So the data discovery tools that you yourself have been working with the last couple of years help you find and discover the pattern within that data, right? And at some point, though, you need to record that pattern if you want to develop any sort of a repeatable process. We start building out metrics, and they need to be audited. Saying, well, the numbers changed because the schema changed, and we don't have any control over the schema, is generally going to be unacceptable to auditors in particular.
Starting point is 00:08:33 The numbers can't just randomly change because we're doing a dynamic schema on read, and the structure changed, and so now the results have changed. That's not a repeatable process. It's not going to really help the business move forward in some areas. In other areas, where you're doing data science and you're doing true data discovery and trying to find new insights, that's different. But when it comes down to it, I'll say the hardcore business and business intelligence and analytics that a lot of businesses need, they still need the model to be stable. You may start in a schema-on-read scenario, but at some point we need to morph that into a more solidified, we'll say schema-on-write style design so that we can instantiate it. We can get a repeatable process loading the same data, curating the data. You've used that term for the last couple of years, and I really like that concept: there's data discovery, and then there's some curation where we take that data we've discovered, and where it's coming from and all of that, and turn it into a more curated process, into a stable, repeatable framework that then allows the business to really get the value out of the data long term and have it reproducible and usable by a much larger audience than a small data science team. So we can get it out to, again, it's the data democratization concept, get it to the masses, and not have manager A and manager B looking at the same data source, building their reports, and coming up with two different answers because we're using a flexible schema approach. We really do need to have them getting the same answer from a business perspective. Otherwise, our businesses are going to be back to silos of data, arguing with each other
Starting point is 00:11:10 over whose data is right and what answer is right. And you could have drastically different answers that would imply different business decisions, which could positively or negatively affect the business outcomes. I think it's also, I suppose, we've got a phrase in the UK, horses for courses, you know. And I guess it's kind of common sense, really. I mean, there's always been an element of people doing their own thing, really, and I think
Starting point is 00:11:35 that especially these days, business users can go onto the internet, they can get a BI tool in the cloud very easily, they can download Tableau, they can do whatever. There's always that, and I think that meets a certain demand. Sometimes people will want to analyze data for themselves. They don't want to go through the whole long process of getting a dimensional model put together, going through IT and so on. And I think there'll always be cases of that, really. And I think what's good that's coming out of a lot of the self-service
Starting point is 00:12:09 thing recently is that we've recognized that there are different kinds of situations, and sometimes full data modeling isn't needed. But then, as you say, as soon as you start to have a shared number that you all have to rely on, this is where it comes in, really. And I think the challenging thing, really, and I think this is for all tool vendors, is how you transition between those two modes. If you take Oracle BI, for example, you've got data discovery tools, as we know, and you've even got Data Visualization Desktop, but there's no, I suppose, easy way of taking the discoveries and the insights that have been gathered from people using that and using that to feed back to the curated model, really.
Starting point is 00:12:50 And it's kind of, you know, do you agree? Do you agree there's challenges there, really, that bridging these two worlds is a challenge? Yes, I do. And I think you mentioned how I've tried to stay up on technology, and my career has evolved because of these types of situations. The theoretical data architecture, data governance perspective of, well, we've got to have a complete enterprise data model in order to make anything happen, that's out the window, in my opinion. Long term, yes, it would be great to have a high-level conceptual, logical model of the data the business deals with. But the practical realities then come down to: how do you translate that? You look at what we found in the data discovery area, look at what the requirements are in the, I'll say, more auditable business intelligence curated model area, and put the two together. And you need somebody who can bridge that particular gap. And that's not going to be your data scientist who's writing MapReduce routines. And it's probably
Starting point is 00:14:10 not going to be your Tableau report developer, who needs a decent schema to point at in order to derive any value out of it. It's somewhere in between. So I think the role of the, I hate to use the term, enterprise data modeler, but that kind of a person who can bridge that gap, who can talk to people in the business, as well as the data scientists, as well as the technologists who are building the databases, the DBAs, the people building out the metadata repositories for the BI tools, and the governance people, and be able to build a repeatable model based on the actual business requirements, driving towards whatever the business value is that the organization is looking for. And that whole concept, to me, then blends very nicely into agile and taking an agile approach: not this monolithic enterprise data model that we spend years and millions of dollars on, arguing over little bits of the semantics of what we call something, but rather looking at some recent discovery the data scientists have come up with and talking to
Starting point is 00:15:33 the business and saying, how can we turn this into something of long-term value? Or is there any long-term value? That's the first question. Is there long-term value to what the data scientists have found? Or is this a short-term thing, we need the data now, we're going to use it, and we can actually throw it away later? Or does it move into a curated model that the analysts and the people who need metrics and self-service BI can access in a consistent and concise and sane manner, so we don't end up with a whole bunch of data silos? So, Kent, I don't know if you saw the Gartner BI Magic Quadrant, the most recent one, the famous one where Oracle were kind of out of the leaders quadrant. Out of the whole report, there was talk there about referencing
Starting point is 00:16:35 Gartner's kind of modern BI platform, and something that was a bit of a shocker in there, but actually I can see the logic of it, is to say that, well, in a BI tool, upfront metadata modeling should be optional. I can kind of get that in a way, but also that, where possible, a lot of the semantic discovery, the data model development I suppose, should be as automatic as possible. You know, so looking at structures and saying, well, I can tell looking at that column that these are all names of cars in there.
Starting point is 00:17:07 So the column perhaps is, you know, a vehicle or something, looking at kind of common data. I mean, how much do you think that this process can be automated using, you know, machine learning, using expert systems? Do you think it's a reasonable thing to think that a lot of the work that is done now by hand could be automated for the users? I think it's actually a reasonable thing to think that some of what we've been doing by hand can be automated. Over the last couple of years, prior to joining Snowflake, I got to be part of the Boulder BI Brain Trust, an organization of independent analysts that Claudia Imhoff put together. And in that organization, I got to see a lot of these newer tools and got to see some really cool technology that was doing exactly what you're
Starting point is 00:17:56 describing, where it could automatically profile a data set and start making relatively educated guesses as to, well, you know, is this an ID? This is text-based. This is clearly a date. And do some semantic analysis of the data. The interesting part was when they went to try to bring in data from disparate sources, say something over on Hadoop, something in a Mongo, something in an Oracle, and bring it into this sort of federated data warehouse environment and try to automate how we join that data together. The tools were fairly good at ferreting out those relationships based on common data types and common domains, where they would do some profiling of the domain values, you know, high values, low values, the length of character strings, things like that, and go, well, these two things kind of look like they might be the same. And then they could put forward to the user that perhaps these are join columns: this is something we can relate these two data sets on, based on these attributes here and these attributes in the second source. And it did, honestly, it did do a fairly good job of doing that. But in the end, it still came down to a human being who knew the business and knew the data at some reasonable level to look at it and go, oh, yeah, yeah, that's an account code in both systems, two completely different names. But yes, that's how we should join those two things together.
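The kind of column profiling Kent describes the tools doing might look something like the SQL sketch below; the tables (source_a, source_b) and columns (acct_cd, account_code) are invented for illustration, and this is a simplification of what a real profiling tool would compute.

    -- Profile two candidate join columns from different sources and compare their domains.
    SELECT 'source_a.acct_cd' AS candidate,
           COUNT(DISTINCT acct_cd)  AS distinct_values,
           MIN(acct_cd)             AS low_value,
           MAX(acct_cd)             AS high_value,
           MIN(LENGTH(acct_cd))     AS min_length,
           MAX(LENGTH(acct_cd))     AS max_length
    FROM source_a
    UNION ALL
    SELECT 'source_b.account_code',
           COUNT(DISTINCT account_code),
           MIN(account_code),
           MAX(account_code),
           MIN(LENGTH(account_code)),
           MAX(LENGTH(account_code))
    FROM source_b;
    -- Similar profiles (domain, range, length) suggest the two columns may be join keys;
    -- as Kent says, a human who knows the business still has to confirm it.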
Starting point is 00:19:48 And the great part from a data discovery perspective was, of course, as an individual, you wouldn't necessarily have to be an expert on the data sources themselves, but the tool is now presenting to you the possibility that, hey, from this data source and from data source B, these two things might be able to join together. And you can look at it as a business person or a business data analyst and go, oh, yeah, I recognize that. That's our account code. Yes, accept that. Put it into the model, and now we can run some reports joining these two
Starting point is 00:20:26 disparate data sets together. And so it still required a human being who understood the business and the semantics of the data. And so I think we can get there, and over time I expect there will be more and more of that. Will that turn into the curated model? Maybe yes, maybe no. Again, it may give you a baseline now for doing some agile modeling and saying, okay, now we see this. And then as the business using the tool
Starting point is 00:20:59 sees value in joining those data sets together, now I think we move it off into a little agile project that says, okay, let's take these bits of data from these two different sources. Let's put it into a curated model, automate the load processes, automate the integration process, and roll it out into what you and I would have originally called a data mart. It may not even be a star schema. It may just be a wide, flat, denormalized table. Again, depending on the tool, it might be a star schema. It might not. There's no telling right now.
Starting point is 00:21:36 Again, it depends on what tool you want to access it with these days. But I can see that kind of process evolving. So I don't think they're off the mark about the idea of having automatic data modeling, though I don't think we're there yet where we could say all data modeling tasks could be completely automated. I think we're still quite a ways off from that. Okay. So we could probably talk all day about data modeling. I mean, there's other areas; I think there's the whole question about building kind of
Starting point is 00:22:10 transactional systems on things like MongoDB. I think I see a bit of that happening now, which is kind of interesting. We'll come back to data modeling, I think, towards the end, when we talk about another area that you've been involved in, Data Vault. But just for now, actually, I'm curious to hear a bit more about Snowflake DB that you've been working with. So you're a long-term developer with Oracle. I first heard about Snowflake, I think, when Greg Rahn went there a while ago, after working on Impala and, actually before that, working at Oracle.
Starting point is 00:22:42 And you're there now. So tell me, I mean, Kent, you're quite technical. What's the kind of, I suppose, in a way, what is special about Snowflake? What does it do differently? And why? Give us a bit of background here, what Snowflake DB is, first of all.
Starting point is 00:22:55 And then we'll get into a bit more kind of the ways and whys and so on. What is it, first of all? Okay, so fundamentally, Snowflake is a data warehouse as a service. So it's in the cloud. It is a fully relational database. It's MPP.
Starting point is 00:23:17 It's columnar. So it's been designed specifically for high-speed analytics, if you will. So it's designed as a data warehouse platform, not as an OLTP platform. So that's kind of the first bit. And one of the very unique things about it is that it was designed from the ground up to be a cloud-native database.
Starting point is 00:23:45 So it's not a repurposing or refactoring of any existing code base. Two of the technical founders came from Oracle, Benoit and Thierry; both you and I have good friends who worked with them at Oracle. They started this new project with a blank sheet of paper, a blank text editor, whatever it was they were using, and started coding. So they started from zero lines of code and built a brand new database
Starting point is 00:24:21 that was specifically designed to be cloud resident and work with, I'll say, the unique features that are available in the cloud that we didn't have available in on-prem systems. So let's just put that in some context then, with examples of what people might understand cloud databases to be. So first of all, let's take, for example, Oracle: there's an Oracle Cloud, there's an Oracle Database Cloud Service. Okay, that's one that's out there at the moment. You've also got things like Redshift; I think the most famous, you know, cloud-native kind of database is Redshift. Okay, so contrast, just in a technical sense, Snowflake to, say, Oracle Database Cloud Service. And this isn't a kind of knocking
Starting point is 00:25:03 Oracle thing, you know, but obviously that is an example, I suppose, of taking an on-premise database that's probably general purpose, and it's been moved into the cloud, and it's being used for more than just data warehousing. And you've got something like Redshift, which is probably the canonical version of this. How does Snowflake differ from those two, really? Okay, well, at a very fundamental level, neither of those, whether it's Redshift, which is based on Postgres, or Oracle, which is obviously Oracle,
Starting point is 00:25:35 neither of them was written with the cloud architecture in mind. I hate to use the word legacy, but for the purpose of this conversation, we'll say that legacy code... Classic, classic. Yeah, that's a better term, I like that better. The classic code base that they have evolved out of was written, you know, some of it 30 years ago, specifically for optimizing performance on known hardware, on a set hardware basis: an operating system, a certain number of CPUs, taking advantage of a certain kind of memory, and having a certain amount of storage available in a box, right?
Starting point is 00:26:26 Whether it was a server under your desk or a huge rack in a data center. Those companies did a great job, you know, Oracle in particular, because obviously I'm a huge Oracle fan, I've been doing it forever and am still an Oracle ACE Director. They did a great job building a multipurpose database that could scale within the confines of the hardware environments. And of course, they've built the engineered systems like Exadata, where the software is engineered specifically for that Exadata hardware, and the hardware has been optimized as well as the software to get the best performance.
Starting point is 00:27:10 So they're coming from a completely different paradigm than Snowflake. So we now get to the cloud, and you hear about the elastic cloud. Well, what does that mean? That means you're able to go beyond the confines of hardware that you've pre-purchased in a data center. Today, for on-prem systems, whether it's an Exadata box or even Postgres on a server, you have to do capacity planning and say, how big a load do I expect to need? How much data do I expect to have? And then spec out in some detail and perhaps even do an RFP to go get the hardware to support the software you're going to install on it and the applications you're going to build with it. With the Elastic Cloud, we now have the ability to do things like grab space as we need it,
Starting point is 00:28:15 expand and add compute power as we need it, so we can expand it and contract it. And that's one of the things that we've done with Snowflake is it is truly dynamic. You don't have to pre-allocate space. As you load data, the Snowflake service grabs that space from within the cloud resources. In this case, it happens to be AWS, Amazon Web Services. We're out there. We'll grab the space you need. And the beauty of that is you only get charged for the space you're using. So you don't have to
Starting point is 00:28:52 go buy, you know, 500 terabytes that you might not need for five years. As you load the data, the storage dynamically expands. If you purge data, it dynamically contracts, and the price goes down along with it. So you're not stuck with, oh well, my max was, I one day loaded 10 terabytes, and so now I'm getting charged for 10 terabytes of space, when the next day I deleted five of those terabytes. So how would it, I mean, that obviously would be how it compares to, say, Oracle, and I suppose the devil's advocate argument is you could run Oracle on AWS, you know, and you could kind of allocate more CPUs dynamically; cost-wise that's quite difficult, isn't it, because you haven't got the same, you know, almost demand-based pricing and so on there. But what about, say, Redshift? I mean, that would be the
Starting point is 00:29:44 classic comparison, you know, Redshift to Snowflake. I mean, you said obviously it's based on Postgres, it's different there, but how else is it different to, say, Redshift, apart from the fact it's not based on legacy code? And that's a very interesting conversation as well, because Redshift is one of our top competitors, and so we run into them all the time. And I have actually looked at the Redshift administration manual, which is as big or perhaps bigger than some of the DBA manuals for Oracle. Because it was based on classic Postgres code, there are still things you have to do inside of Redshift, like partitioning keys. You still have to put in partition keys.
Starting point is 00:30:33 And when you do that, because it's an MPP-type environment, so how many CPUs do you have? Well, I've got four CPUs, I've got four nodes, so I'm going to stripe. I have to come up with a partition key and stripe the data now across those four nodes. Well, in Snowflake, we don't have to do that. We've done a separation between the compute and the storage. So you have one common set of storage, and then you have independent compute nodes. So we have no partitioning keys that have to be defined. There's nothing for a DBA to do in that regard. They simply have to load the data and model the data to a certain extent. Then, when you query it with the compute you've allocated, say I've allocated a four-node cluster, it will grab that data, move it over into SSD cache,
Starting point is 00:31:28 and stripe it appropriately based on the fact that you have four nodes. And we have a dynamic optimization engine that does all that for you. That's why I would say this is a service. Well, on Redshift, you do have to build that into the design of your data model. Now, say, it turns out four nodes is not enough on your Redshift cluster, and you need to go to eight nodes. Well, now you've got that problem because the storage and the compute is tightly coupled in Redshift. How do I move that data that's striped across four nodes now to eight nodes?
Starting point is 00:32:05 Well, there's only one way to do that. You have to unload the data and then reload it and restripe it. And Amazon has a nice utility for that, which the customer's administrator has to go in and use. You have to set up your eight-node cluster, quiesce your four-node cluster, export all the data from the four-node cluster, and import it into the eight-node cluster,
Starting point is 00:32:34 and then move everybody over to work on your eight-node cluster and turn off your four-node cluster. So you do have the ability to go back and forth, but there's a certain amount of administration and time. There is a utility that will do this all for you, so you don't have to manually do the exports and manually do the imports; it'll do the migration for you. But even the Redshift manual does say it could take anywhere from a few minutes to multiple hours to potentially days, depending on how much data you have. So imagine you have a 20-terabyte data warehouse, and that has to be completely exported and then restriped and imported to the new cluster. That is going to take some physical time. And while that's happening, your data warehouse, your current four nodes, has to be in read-only mode, so now you have to shut off your updates.
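For contrast, here is roughly what the Snowflake side of that story looks like in SQL, as Kent describes it: storage is just loaded, compute is a separately sized virtual warehouse, and resizing is a single statement. This is a minimal sketch; the table, column, and warehouse names are invented for illustration.

    -- Define a table: no distribution, partition, or striping keys to declare.
    CREATE TABLE web_events (
      event_id   NUMBER,
      event_time TIMESTAMP_NTZ,
      payload    VARIANT    -- semi-structured JSON can land here as-is
    );

    -- Compute is a named virtual warehouse, sized independently of the storage.
    CREATE WAREHOUSE reporting_wh WITH WAREHOUSE_SIZE = 'SMALL';

    -- Need more horsepower? Resize in place: no unload, no reload, no restriping.
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';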
Starting point is 00:33:22 So I was talking, I met someone for dinner in London last night who works as, I think, a CTO or product manager at an e-commerce startup, and they were using Redshift to do their kind of work.
Starting point is 00:33:47 And I think he echoed a lot of things you're saying there about the amount of administration that's involved. And also because it's a kind of column store database, it was quite slow to load. I think that's kind of inherent in that sort of thing. But what they've done is they've actually, and this is interesting, they've actually moved from Redshift to Google,
Starting point is 00:34:03 which is Google BigTable and BigQuery and so on. BigQuery, yeah. So they've actually gone from a pure relational database in the cloud to using BigQuery and so on there. I mean, again, I suppose a devil's advocate question to you is: is it slightly weird now to actually build a new relational database running in the cloud, when actually the world is moving towards using Hadoop and various SQL-on-Hadoop engines and so on? You must have had that thought yourself, really. What's the justification for building another relational database now, really, do you think? Yeah, that's an excellent question, Mark.
Starting point is 00:34:43 Because obviously, when I ran across Snowflake and I first saw it and I looked at it, I'm like, wow, this is pretty interesting. The separation of compute from storage, all of that was fascinating. The thing that got me, and this may be the justification here, is it's native SQL. It's SQL. And one of their phrases, and we do use this when we're doing our pitch, is if you know SQL, we can have you working in Snowflake within two hours. And not only that, Snowflake handles semi-structured data, JSON and XML data, and it doesn't take much longer for you to then be able to actually query JSON structures out of Snowflake. Whereas none of the things you mentioned, the Hadoops, the NoSQLs, none of those were built to natively handle SQL.
Starting point is 00:35:52 And SQL, as we've been saying at a lot of our conferences recently, we're saying yes SQL, not NoSQL. Yes, SQL. There are more people in the world who understand SQL than certainly understand MapReduce or Hive or Impala or any of those. And so the real justification for this, and I had this conversation with one of our founders, is we're eliminating a lot of the complexity. Yes, you can do all kinds of great stuff with these new open source projects, but it does require a certain level of engineering expertise that most people, even in IT, don't have. And arguably, there are many DBAs who may be world-class DBAs, but it's a stretch for them to go do some of the things that need to be done over in the Hadoop and the NoSQL world to get the same performance, to get the same results, and deliver the same value to their business that they can with a relational database. They can go to something like Snowflake, which is a pure relational database model but written specifically, again, for the cloud, and it's a very easy transition. I mean, it took me no time at all. I was writing blog posts and working with Snowflake within a week of going to work for the company, and building reverse-engineered data models within a month. Thanks to, you mentioned Greg Rahn earlier, I had seen his presentation at Oak Table World last year on querying JSON. I went through his demo and that was pretty much all I needed to
Starting point is 00:37:41 learn how to be able to query JSON out of a Snowflake table that's been loaded. It's a true schema on read. You load the JSON document in, but I can write SQL now to tease it apart without having to write MapReduce or some other complex ETL and put it into relational tables. It's in a relational table, but as a JSON object. Very cool technology to be able to do that. And I think that's the justification there. It's not only just the relational, but being able to merge the relational
Starting point is 00:38:16 with the semi-structured in an environment that is familiar to the majority of people in the database world and easily accessible to them. And then, consequently, all of your standard ETL and BI tools can point to it and use it via either an ODBC or JDBC connector, just like Redshift, Oracle, SQL Server, Teradata, you name it. So we're on an even playing field there, but now you've got a high-performance, truly cloud-native database engine to work with.
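As a rough sketch of the kind of schema-on-read query Kent is describing, assuming a table with a single VARIANT column holding raw JSON documents (the table and field names here are invented for illustration), Snowflake's path notation and FLATTEN let you pull the structure apart in plain SQL:

    -- Land raw JSON documents in a single VARIANT column (schema on read).
    CREATE TABLE raw_orders (doc VARIANT);

    -- Tease the document apart with SQL path notation; no MapReduce, no upfront ETL.
    SELECT
      doc:order_id::NUMBER       AS order_id,
      doc:customer.name::STRING  AS customer_name,
      item.value:sku::STRING     AS sku,
      item.value:qty::NUMBER     AS qty
    FROM raw_orders,
         LATERAL FLATTEN(input => doc:items) item;   -- one row per element of the items array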
Starting point is 00:38:50 Yeah, yeah, definitely. I mean, I noticed, I think I saw somewhere, you've done a piece of work with Stewart at Red Pill where he's connected OBI, certainly, to Snowflake. I mean, it strikes me that's one of the benefits, really. And I don't know if you noticed, there was a post on Curt Monash's blog at the weekend, or last week, on the value of analytic databases.
Starting point is 00:39:17 I think you retweeted the mention I made of it, really. And it was interesting. He's obviously a big advocate of the new world of all the stuff I've been talking about recently, but his view was that for most workloads in an analytic sense, you know, you might as well move to Hadoop now, or the various technologies from that, but there is one particular use case, one particular scenario, where high-end, or in this case cloud-based, relational databases make sense, and that's in a high-end BI kind of context, when you've got a structured data set and you want to index it and you want to
Starting point is 00:39:59 summarize it, you want to have fast access to that data and so on, you can't beat an analytic database, and I put Snowflake in that category. For that kind of thing, you're creating more work for yourself than you need to if you try and reproduce some of that in some of the new world technology. But I suppose, in a way, I mean, you and I have been doing this for a long time now, and the comment I made at the time when I retweeted this thing was to say analytic databases aren't going away, but they're the new MOLAP, in a way. And I remember the days when you'd have at the core of your data warehouse a fast, high-performance analytic engine that in those days would be an OLAP server
Starting point is 00:40:36 or a data mart or whatever. And I put it to you, Kent, would you say that relational databases now are the kind of data mart, in a way? They're going to be there, but the total scope of data perhaps is bigger? What do you think? Or do you think actually that all this Hadoop stuff is overrated, and in fact all the data should be in, say, Snowflake, for example? Well, I'm going to give you the classic consultant answer: it depends, right? Yeah, on the use case. Now, I do think that for certain use cases, yeah, the Hadoop world is overrated, because whether it's the semi-structured JSON-type data or truly structured data, the relational database is still the place to go. So from that perspective, I don't see it as being a data mart. But look at, Claudia Imhoff has now got her extended data warehouse architecture, sort of the evolution of the Corporate Information Factory.
Starting point is 00:41:40 And it includes Hadoop and concepts of data lakes and all that. And there are certain types of data, particularly media, that are never going to work well in a relational database. Some people have made it work, but again, you have to ask why. I do see those use cases where the Hadoop world and MongoDB and things like that are certainly much more appropriate. But with services like Snowflake, you could build your data lake in a Snowflake database and evolve the schema as you move through the process, because you can ingest semi-structured data, and that's really the majority of what people are doing out in the NoSQL and Hadoop world: things like web logs and machine logs, the whole IoT thing.
Starting point is 00:42:41 And all of that is truly, it is semi-structured. And the reality turns out to be, if it's being machine generated, the machine is not randomly generating strings of data. There is a structure because there's a program in that machine that produces that stream of data, which means it has a knowable structure. Now, great, we want to ingest it as fast as possible. So schema on read, absolutely the way to go, right? But as we then discover that data and we go through the data and find the business value
Starting point is 00:43:15 and we need to tease out the important bits of those streams of data, those then can move from, I'll say, what we would have called in our old reference architecture, your staging area, which I think today we can maybe call a data lake, right? That we're including the semi-structured data, and that's your data lake. You can have it in a cloud-native data warehouse service like Snowflake, load it as fast as you can, use your data discovery tools, use Tableau, use all those things against the data lake. And as we get into the curation process, then we can create another database that is more structured. And maybe it's in a more relational
Starting point is 00:43:59 form, third normal form or Data Vault form or something of that nature. And then again, we get onto our layer for the really critical BI pieces, where we get into our star schemas and snowflake schemas and aggregates that our BI analysts are going to point at, so that they get that easy-to-use, I'll say managed, self-service environment that the report writers are used to, where they can look at something that is going to look like an Excel pivot table to them. And they can do their analysis and slice and dice and drill up and drill down, the MOLAP, as you put it earlier. But I think we can still have that all within the same environment with, again, a service like Snowflake.
Starting point is 00:44:49 But media, yeah. I mean you don't do drill-down analysis on a media file. You may drill across to a media file as a reference based on looking at some metrics. But having that out in the Hadoop world makes complete sense. It still makes sense. So there's a place for all of it. And so I like Claudia's extended data warehouse architecture because it encompasses all of these.
Starting point is 00:45:18 There's multiple styles and approaches to the data, the curated versus less curated data discovery, a place for the data scientists to have their exploration warehouse, if you will, as well as for the financial analysts who need the auditable numbers to do their analysis on, that they can justify at the end of the year. So it covers the entire gamut. Lots of opportunities. So, Kent, actually, just to interrupt you there, you mentioned Data Vault there. Now, I must seem sometimes like your worst nightmare, because a lot of things that you talk about, I take a
Starting point is 00:45:56 devil's advocate position on, really. But Data Vault is an interesting one. Just to very briefly say what I think it is, and you can talk a bit in a moment, and we do have show notes with the podcast as well, so we'll put some links to the Data Vault material and so on. There's a data modelling technique called Data Vault, which you've been involved in, Kent, for a while, and I won't butcher it by describing it all.
Starting point is 00:46:20 You'll do that concisely in a second. But, you know, it's interesting with Data Vault. First of all, again, with the move towards no data modeling, is Data Vault, you know, something that, I suppose, missed its time? And also, I've noticed in some of the ways that you use Data Vault, the way that you write stuff, have satellites and so on and so forth, it suited itself quite well to working with Hadoop. So when you and I were going through the topics for the podcast today, I thought it'd be interesting to
Starting point is 00:46:49 take a quick look back at what Data Vault was, just briefly, and get any ideas from you, really, as to, I suppose, is it still relevant today? Are there some things from Data Vault that we could take forward into the new world of what we're doing? But Kent, just for anybody who's not heard of Data Vault before, do you want to very briefly explain what it is? And then answer the question for me: how relevant is it these days? So just introduce, first of all, what it is. Sure. So Data Vault is not only a modeling technique,
Starting point is 00:47:20 but there's also now an entire methodology that goes with it for developing enterprise data warehouse repositories. So from a modeling perspective, it has three structures, hubs, links, and satellites, which are a very repeatable style of structure. And there's a whole bunch of rules around how you model these things. Dan Linstedt is the gentleman that invented it, who's a good friend of mine. I've been working with him since very shortly after he released the concept to the public
Starting point is 00:47:53 back in Denver in 2002. We found it was, let me say this, it's pattern-based, right? And because it's pattern-based, and, as you mentioned before, you and I came out of the Oracle CASE and Designer 2000 world and the ideas of generating code, well, if you have a pattern-based modeling approach,
Starting point is 00:48:16 you can even generate the models based on the sources. So explain what a pattern-based approach is, then, really. You mentioned pattern-based there; what does that mean? Okay, so within Data Vault we have three structures, hubs, links, and satellites, and they're very, very specific. I'll just talk about hubs, because we probably don't have time to talk about all of them. So a hub is a table that contains a business key. So what's a business key? Customer name, perhaps. Country name.
Starting point is 00:48:53 Let's go with country name as being a business key. Not some sort of a surrogate key, but from a business perspective, if you were to ask a business person how to identify a particular object, a country, if they're going to write a query, what are they going to query on? Maybe it's the country abbreviation, maybe it's the country name. And so that's the concept of a business key. So it's a very narrow table. It has however many attributes are necessary to define the business key. And then it has a surrogate key, similar to what we do in dimensional modeling, so that when you're doing joins, you can join the other tables without necessarily having to have a five-part natural key, which is what the business key is. Now, because of the structure, it includes,
Starting point is 00:49:35 always includes a surrogate key, always includes one or more columns that are the business key, which would be a secondary unique key constraint, and it has a load date and a record source column as metadata. With something like that, it's very easy to model when you understand the semantics of the business. You can design a lot of hubs very, very quickly, within a day, and then how do you load that?
Starting point is 00:50:00 Very simple inserts: if it's not in there, you insert it, with a not exists. And so there's lots of opportunity to turn this into something that's automatable. Two companies I know for sure, WhereScape and AnalytixDS, both have ETL environments that can work with these patterns and automatically generate all the code to populate these tables. So it's a very repeatable process. When I first used it, what I discovered was it was very easy to teach relational modelers how to do it, because it really is based on normalization theory. I mean, what's the rule with third normal form and Codd, right? It's the key, the whole key, and nothing but the key.
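To make the hub pattern concrete, here is a minimal sketch of what Kent describes: a surrogate key, a business key, a load date, and a record source, plus the simple insert-where-not-exists load. The table, sequence, and staging names are invented for illustration, and the exact conventions vary by Data Vault implementation.

    -- A hub for the 'country' business key.
    CREATE TABLE hub_country (
      hub_country_key  NUMBER       NOT NULL,  -- surrogate key
      country_code     VARCHAR(10)  NOT NULL,  -- the business key
      load_date        TIMESTAMP    NOT NULL,
      record_source    VARCHAR(50)  NOT NULL,
      CONSTRAINT pk_hub_country PRIMARY KEY (hub_country_key),
      CONSTRAINT uk_hub_country UNIQUE (country_code)
    );

    CREATE SEQUENCE seq_hub_country;

    -- Loading is just an insert of any business keys we have not seen before.
    INSERT INTO hub_country (hub_country_key, country_code, load_date, record_source)
    SELECT seq_hub_country.NEXTVAL, s.country_code, CURRENT_TIMESTAMP, 'CRM_STAGE'
    FROM (SELECT DISTINCT country_code FROM stage_customers) s
    WHERE NOT EXISTS (
      SELECT 1 FROM hub_country h WHERE h.country_code = s.country_code
    );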
Starting point is 00:50:59 Right. And so that's a hub, and at a very fundamental level it's very simplistic, it's very easy to understand. You know, getting there and figuring out what the business key is, of course, is always a challenge, but it gave me the opportunity to be more agile in my modeling. It gives you the opportunity now to model a data warehouse in a very incremental manner. And the relationships are through these link tables; if you miss a relationship, it's easy to add, because you're going to add a table. You don't have to change
Starting point is 00:51:28 an existing table. So we've nearly eliminated re-engineering in the Data Vault methodology. Given where we're going with data discovery and schema on read, I actually think it's even more relevant today for building a curated model in an incremental fashion as we're doing that discovery,
Starting point is 00:51:49 where with the old style, I'll say, third normal form, you've got foreign keys. And if you missed a foreign key relationship, you end up having to change the model, and you've got to drop tables and reload the data. With Data Vault, you don't have to do that. Likewise, with star schemas, if you have a fact table and you left out a dimension, well, what do you have to do? You have to drop that fact table, rebuild it, reload the data, because the data is at a different level of granularity. In the Data Vault methodology, the Data Vault model I'm describing with hubs, links, and satellites really is the
Starting point is 00:52:25 central repository. I'm trying to remember the term we used in the Oracle reference architecture. Yeah, so you're saying it's the foundation layer then, really? You talk as if it's actually a substitute for the star schema as well. I mean, it's not, right? So it's more an alternative way of modeling the foundation. Right, and what the speed and power of a platform like Snowflake have enabled even more is building a virtual star schema, a virtual reporting layer, on top of the Data Vault, and that has turned out to be very automatable as well, because it's pattern-based. A hub plus a satellite fundamentally is a dimension. A link plus its satellites is fundamentally a fact table. And so tools have evolved to actually generate that code
Starting point is 00:53:30 to design, on the fly, new star schemas based on what's in your Data Vault. So as you evolve the Data Vault, you can iteratively evolve the reporting layer, and it can be virtual, as in it uses views. I've done this now with Oracle, SQL Server, and Snowflake: the star schema that I expose to the report writers and to Tableau and BusinessObjects and OBIEE is all actually views.
Starting point is 00:54:00 And with the power of the databases we now have today, speed is not the problem anymore. We don't have the performance problem. In the old days, star schemas were instantiated because you couldn't get the speed of query. You could not have complex multi-table joins perform at any sort of scale. But I'll say between Exadata, Snowflake, and even in SQL Server I did last year, I was able to do this. I've been able to virtualize those layers so that building the foundation layer really was the foundation for the warehouse.
Starting point is 00:54:41 And we could evolve it from there over time as we learn the new business requirements. And that's one of the reasons I love Data Vault as an approach to data warehousing, because it really puts you on a firm footing for doing agile and doing iterative development.
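A minimal sketch of the virtual star schema idea Kent describes, a hub plus its satellite exposed as a dimension through a view. It is written here in Snowflake-flavored SQL (QUALIFY), and the satellite table and its columns are invented for illustration; real generators also handle history, multiple satellites, and link-based fact views.

    -- Expose hub + current satellite row as a dimension, without instantiating it.
    CREATE VIEW dim_country AS
    SELECT
      h.hub_country_key  AS country_key,
      h.country_code,
      s.country_name,
      s.region,
      s.load_date        AS effective_from
    FROM hub_country h
    JOIN sat_country_details s
      ON s.hub_country_key = h.hub_country_key
    -- keep only the most recent satellite row per hub key
    QUALIFY ROW_NUMBER() OVER (PARTITION BY s.hub_country_key ORDER BY s.load_date DESC) = 1;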
Starting point is 00:55:02 One other thing I guess I should mention, because we're talking about Hadoop and all of that: Dan came out about two years ago with Data Vault 2.0. And this last year at the World Data Vault Conference, he and his partner Sanjay actually demonstrated using Hadoop as a staging area for Data Vault. And one of the interesting keys to doing that, and it's kind of tongue-in-cheek, the word key, is they started using MD5 hash functions
Starting point is 00:55:31 to calculate the surrogate key instead of using sequence generators, because the Hadoop world didn't have sequence generators. And we also found there were bottlenecks if you're doing really high-speed loads; some sequence generators on various platforms couldn't keep up. But you can calculate the surrogate key instead, because, after all, what is a surrogate key?
Starting point is 00:55:53 It's a replacement for the natural key. It doesn't have to be an integer. You and I over the years have gotten used to it being an integer because that's the way the databases supported it. Well, Dan came up with this idea of, well, what if we use a hash algorithm and hash the attributes in the table together? That, in the end, does turn out to form a unique hash key. And so we've switched to using those for the primary keys throughout Data Vault. And that allows you then to calculate a hash over in the Hadoop world and in various NoSQL databases. So you can stage the data over there, calculate the key there, and have a way to communicate now between the relational world and the Hadoop unstructured world, using these hash keys as sort of virtual join keys, similar to what we talked about before. So Data Vault has adapted to the new world of Hadoop; it's still very relevant.
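Continuing the country example, a sketch of the hash-key idea: derive the hub key from the business key itself so that any platform, Hadoop or relational, computes the same value independently, with no shared sequence generator. The MD5-over-trimmed-uppercase convention shown here is a common Data Vault 2.0 pattern rather than necessarily Kent's exact one, and the names are invented.

    -- Derive the hub key from the business key; any platform can compute the same value.
    SELECT
      MD5(UPPER(TRIM(country_code)))  AS hub_country_hkey,
      country_code,
      CURRENT_TIMESTAMP               AS load_date,
      'CRM_STAGE'                     AS record_source
    FROM stage_customers;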
Starting point is 00:56:35 I think the thing that got my interest in it again was actually, I think, you presented on that last year at Oak Table World. And I think you're probably doing it a disservice there, because you talk about the fact that it's adapted to the
Starting point is 00:57:21 Hadoop world, for example, but the impression I got from speaking to you, and I think to some other people, was that the nature of Data Vault, in that it's very much a case of insert and append all the time, lends itself very well to working with Hadoop as a data store, compared to, say, dimensional modeling, where you have got these updates to dimension values and so on and so forth. Am I right in saying that Data Vault is actually very well suited to Hadoop, more so than, say, other ways of modeling? Is that the case, or did I misunderstand that? No, you are hitting it right on the head. I had, you know, kind of even forgotten about that aspect, being able to use Data Vault, as you said, as an insert-append model rather than an update approach to data. Yes, that absolutely lends itself to the Hadoop world, if you want to use Hadoop as your data source, because that is the model, if you will, in the Hadoop world: it's insert only, and
Starting point is 00:58:11 tracking all history, which was another goal of Data Vault: all the facts, all the time. We're going to track and keep the history in a Data Vault, hence the term vault. We're going to put all that history in there so that you can do whatever time-based analytics you want to do, to see what has changed over time.
Starting point is 00:58:42 And this is one of the things that the Data Vault approach took from the dimensional world: the concept of type 2 slowly changing dimensions. We don't overwrite anything in the Data Vault; we create a new row. So if the data has changed from the last load, we timestamp it and we insert it, and of course that works very well either in a relational database or in a NoSQL approach.
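A sketch of that insert-only, type-2-style change tracking on a satellite: compare a hash of the descriptive attributes against the most recent row and insert a new timestamped row only when something has changed. The table, column, and source names are invented, and real Data Vault loaders vary in the details.

    -- Insert a new satellite row only for new or changed business keys.
    INSERT INTO sat_country_details
      (hub_country_hkey, country_name, region, hash_diff, load_date, record_source)
    SELECT s.hub_country_hkey, s.country_name, s.region, s.hash_diff,
           CURRENT_TIMESTAMP, 'CRM_STAGE'
    FROM (
      SELECT MD5(UPPER(TRIM(country_code)))                                  AS hub_country_hkey,
             country_name,
             region,
             MD5(COALESCE(country_name, '') || '|' || COALESCE(region, ''))  AS hash_diff
      FROM stage_customers
    ) s
    LEFT JOIN (
      SELECT hub_country_hkey, hash_diff,
             ROW_NUMBER() OVER (PARTITION BY hub_country_hkey ORDER BY load_date DESC) AS rn
      FROM sat_country_details
    ) cur
      ON cur.hub_country_hkey = s.hub_country_hkey AND cur.rn = 1
    WHERE cur.hub_country_hkey IS NULL        -- a business key we have never seen
       OR cur.hash_diff <> s.hash_diff;       -- attributes changed since the last load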
Starting point is 00:59:15 Excellent. Well, you know, Kent, that's fantastic. I think we've run out of time now, so thank you very much for this. It's three very topical topics there: data modeling in the world of, I suppose, desktop BI tools; Data Vault, which I think is probably not so well known, but it's definitely an area worth looking at, so I'm going to put some links in the show notes for people to look at; and also, you know, Snowflake, which, given the fact you've gone there, and the fact that some people I very much respect have gone there, people may be interested in, and it's been interesting to get from you, I suppose, a technical rationale as to why that approach was taken and so on. So, Kent,
Starting point is 00:59:46 thanks very much for coming on the show, and, yeah, thank you very much, and see you soon. All right, well, thanks for having me on, Mark, and yeah, I guess I'll see you in another week or so at Oracle OpenWorld. Yeah, excellent. Okay, thanks a lot. Sure thing.
