The Data Stack Show - 195: Supply Chain Data Stacks and Snowflake Optimization Pro Tips with Jeff Skoldberg of Green Mountain Data Solutions
Episode Date: June 26, 2024

Highlights from this week's conversation include:
Jeff's Background and Transition to Independent Consulting (0:03)
Working at Keurig and Business Model Changes (2:16)
Tech Stack Evolution and SAP HANA Implementation (7:33)
Adoption of Tableau and Data Pipelines (11:21)
Supply Chain Analytics and Timeless Data Modeling (15:49)
Impact of Cloud Computing on Cost Optimization (18:35)
Challenges of Managing Variable Costs (20:59)
Democratization of Data and Cost Impact (23:52)
Quality of Fivetran Connectors (27:29)
Data Ingestion and Cost Awareness (29:44)
Virtual Warehouse Cost Management (31:22)
Auto-Scaling and Performance Optimization (33:09)
Cost-Saving Frameworks for Business Problems (38:19)
Dashboard Frameworks (40:53)
Increasing Dashboards (43:29)
Final thoughts and takeaways (46:28)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to the Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human
challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new
data technologies and how data teams are run at top companies. Welcome back to the show. We are here with Jeff Skoldberg from
Green Mountain Data. Jeff, welcome to the Data Stack Show. We're super excited to have you.
Thanks so much for having me. Excited to be here.
All right. Well, you are calling in from the
Green Mountains. So, you know, you come by the name of your consulting practice, honestly,
but give us a little background. How'd you get into data? Sure. So I first started in data in
2008 when I was working at Keurig as a call center business analyst. I was with Keurig for about 12 years doing data, a couple of those years actually on the BI team. So I kind of transitioned from business analyst, someone who is just knee-deep in data all the time, to having data actually in my job title, even though I'd been doing it the whole time.
Yeah. And then about five years ago, I said, Hey, let me try and do this on my own,
see what happens. So about five years ago, I left Keurig to become an independent consultant. And since then, I've been helping companies kind of on their analytics
journey, doing end-to-end stuff. So business analyst, the data pipeline, solution architecture,
as well as the last mile, which is analytics, the dashboard. So that's what I've been doing
for the last five years. So Jeff, one of the topics I'm really excited to dive into is cost optimization.
That's one that's near and dear to my heart, previously being in a CTO role.
So I'm really excited to dive into that one.
Any topics you're excited to cover?
Cost optimization sounds awesome.
We can talk about some frameworks that I use with my clients as I'm walking them through
their business problems and we're thinking about how we're going to solve the problems that they come to me with.
That's another one. And we could also talk about maybe how the data landscape has evolved over time.
Yeah. Okay. That sounds great. All right, let's dig in. Jeff, one thing I love is that
we get to talk about sort of your tenure at Keurig a little bit. It seems less and
less common, especially in the world of data, for someone to spend over a decade at a single company.
And not only did you do that, but what a time, like what a decade. I mean, you sort of chose the years,
you know, to span that time at Keurig from sort of before some of the most high impact
technologies we use today were even invented. So sort of start at the beginning, like,
how did you get into, was that like an early job out of school? How did you even get into data at
Keurig? Absolutely. So I went to school at University of Vermont, and this was my first
job out of college. And Keurig being the largest employer in Vermont at the time, remote work not
being as popular as it is now, it was basically my only choice for like working at a big company.
My first data role in 2008 there was as a call center business analyst, where we were looking at average wait time, average talk time, and average handle time, which is like the sum of the two, and using that for forecasting models: when should people be taking their lunch breaks or their 15-minute breaks, etc.? So we were really building a staffing model based off of our call history and those types of metrics. So to kind of keep a very long story short,
eventually I moved into a role as supply chain analyst
where I had very smart managers and mentors
who were really great at coming up with KPIs
to manage the business.
And then I was responsible for bringing them the data
so they can really fulfill those KPIs.
So we could go into some examples maybe later in the show
if we want to dive deeper there.
But it was very much an exercise of like,
hey, we're going to try and reduce XYZ cost.
And now we need the data to understand XYZ cost.
And let's automate that.
So we're not paying our analysts to pull data.
We're paying our analysts to really present it and show where the pain points and the weak spots are. The bulk of my career at Keurig, I would say eight of those 12 years, was as a supply chain analyst.
And then I moved into an actual BI developer role on the BI team when Keurig decided SAP HANA, or just SAP in general, was going to be their ERP system of choice.
My manager made sure that me as a supply chain analyst had unfettered access to the SAP HANA database.
So when he had a business question about how the business is doing, is something in stock?
What's our in-stock percentage? Are things shipping on time, etc.?
That I would be able to write those queries and give them the data.
And then shortly after that, I joined the BI team as an official member of the BI team,
even though I was unofficial BI that entire time.
Yeah.
Very cool.
Okay.
I have a couple, I have a couple of questions.
So one is about the Keurig business and sort of the digital nature of it over that time period. So in terms of distribution channels, and you said supply chain, was there a big shift to more online sales over that time period? And did that have a big impact on sort of your work with the data?
So it's interesting. It actually kind of went the opposite,
where it went more B2B in the long run.
So when I started there in 2007,
they were the largest shipper in Vermont.
Or I'm sorry,
they were the largest shipper in New England.
So this is before people bought everything on Amazon.
Yeah.
So of course, Amazon is the largest shipper
in New England now.
Like, I don't have to look it up or fact-check that. You just kind of know, right? Yeah. But in 2007, Keurig Green Mountain was the largest shipper in New England. They were doing about 20,000 orders a day out of the Waterbury distribution center.
Holy cow. Yeah. Wow. Pretty amazing.
And that didn't grow a lot, their consumer direct. It basically stayed about static over the years, plus or minus. What really exploded was their grocery distribution, their club distribution, club being like Sam's, Costco, BJ's. And retail, so like Bed Bath & Beyond, we would call that the retail channel. So that's obviously separate from grocery. Yeah. And as they became a nationwide brand, looking at metrics like what percent of grocery stores are we in, they very quickly got to 100 percent.
Wow.
Of, you know, the national brands at least. So it was just absolutely explosive growth.
Wow. Okay. So now walk us through the
changes in the tech stack, right? Because I mean, you have sort of business model change and then
explosive growth. And then the other variable is the changing tech landscape, right? And
Absolutely.
The explosive growth generally means
we have all sorts of new data needs,
all sorts of new processes to build
way more data than we were ever processing before.
So give us that story.
So I was very fortunate that I landed in a company
that knew what business intelligence was.
And they had a BI team since before I worked there.
So I started there in 2007, my first data role in 2008. I don't know how long they had a BI team for,
but they had one back then. A small one, but they had a dedicated team called the BI team,
which is really cool because a lot of places that I've done consulting at, they're not there yet.
So I was really fortunate to learn in an environment
where they were thinking about sustainable data assets,
where they were thinking about,
oh, well, Jeff, you might do it this one way in Excel
because that's all you know how to do.
But let's show you like how you would do it in a database
and how you structure your data to support this KPI
to automate it and stuff.
So I actually had really great mentorship
on that tech side as well. So that's really cool that they were light years ahead of their time.
But on the other hand, the actual tech that they were using was, of course, Microsoft SQL Server
as their business warehouse. Their ERP system was an Oracle-based product called
PeopleSoft, which is like, I mean, that's kind of ancient history now, but it was very much a Microsoft shop.
And it was interesting to see it evolve to SAP.
And honestly, it almost felt like we were coming into SAP Business Objects, I forget, what is it called?
BW, SAP Business Warehouse, was like a huge downgrade from SQL Server Analysis Services, like the cubes that you have in SQL Server.
Going from Analysis Services to Business Objects and Business Warehouse was a tough pill to swallow.
But then coming into SAP HANA, which acts like a relational database, it acts like a modern data warehouse.
Some people might not know about HANA, so I'll just explain real quick.
It's an in-memory database.
So when you turn it on, it takes all of your tables and it loads them in RAM. So this runs on massive servers that are just specially designed for SAP HANA,
usually manufactured by Hewlett-Packard.
So they have
a partnership
where Hewlett-Packard is designing
machines specifically for
this database. I had no idea about that.
It's totally
wild. We're talking like terabytes of RAM.
Wow, I had no idea about that. That's very cool HANA trivia.
Yeah. And so we were able to do forecast analytics on a six-billion-point data set using a live connection in Tableau, and you were able to just drag and drop and it would just come up on the screen, maybe with like a one-second delay or something like that, computing live on a six-billion-point data set.
Wow, that's wild.
Totally. So that was really cool. That was my first, I will say, big data. Six billion, you're kind of getting there in the big data realm. You need specialized equipment to process six billion rows, and that's how I define big data.
Some people will say number of terabytes is big data. I define it as, do you need specialized equipment to actually process this data? And certainly with that 6 billion rows, you do.
That was my first kind of experience there with big data. And it was really fun to
be able to optimize HANA, learn about the partitioning, learn how we're going to be clustering our data and organizing it. And using a union instead of a join to do those forecast accuracy comparisons was a huge speed boost.
All of the little performance tuning tips that you learn along the way, it was really enjoyable.
Very cool.
And when did Tableau come into the picture along with SAP or?
Yeah, I think they implemented Tableau in like 2016. So medium early adopters there,
like Tableau was certainly around in 2010 and then kind of had a huge growth.
Were there any major changes in the pipelining over time, especially as you acquired additional
data sources, right? I mean, I'm sure you had,
you know, all sorts of pipelines from all these different, like distributors, vendors,
all that sort of stuff. Sure. So a lot of my analysis came from two places, SAP directly,
our ERP system, our system of record, and our demand planning system, which was called Demantra,
an Oracle based product.
And we have Demantra re-forecasting every combination of product and customer every
single day.
And not only is it doing that once every single day, it's doing it for multiple versions of
the forecast.
So you can have your pure system generated forecast.
You could have your sales outlook forecast.
So like where the sales reps want to put their little juju on the forecast.
That's your sales outlook.
You can have your annual operating plan, which is what you set at the beginning of the year.
And that doesn't really get adjusted.
And then all together, maybe you have like nine versions of the forecast. So it was forecasting every combination of customer
product and then nine different sets of those every single day. And we're saving all of these
snapshots for a certain amount of time before we drop them. And there's certain snapshots that we
save forever. So like one week leading up to the sale, we really care about that. 13 weeks before the sale, we really care about that.
And the reason why we cared about those two things is because one week leading up to the
sale, you can make the K-Cup.
13 weeks leading up to the sale, you can buy the Keurig Brewer and get it there from China.
Six months leading up to the sale and one year leading up to the sale.
Those are the lags that we would keep forever.
And this was years ago,
so I'm improvising here a little bit,
but more or less it's directionally accurate.
And so we had to come up with a process
to ingest all of this data, snapshot it,
and then delete certain slices of the snapshots
as the time lapsed,
and then actually do the comparison.
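To make that retention pattern concrete, here is a minimal, generic SQL sketch of the snapshot-pruning step. The table and column names (forecast_snapshots, snapshot_date, order_week) are hypothetical, and the kept lags of 1, 13, 26, and 52 weeks are illustrative stand-ins for the lead times Jeff mentions, not the actual configuration:

```sql
-- Hypothetical daily forecast snapshot table: one row per product, customer,
-- forecast version, snapshot_date, and forecasted order_week.
-- Prune old snapshots, but keep the lags we care about forever.
DELETE FROM forecast_snapshots
WHERE snapshot_date < DATEADD(week, -13, CURRENT_DATE)                    -- outside the rolling window
  AND DATEDIFF(week, snapshot_date, order_week) NOT IN (1, 13, 26, 52);   -- not a kept lag
```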
So that was like my main play in SAP HANA. You were asking me about multiple systems and
let me circle back to that. They also had IRI data. IRI is like cash register data. So Sam's, Costco, BJ's, Walmart. What did you, Eric, or you, John,
actually buy at the cash register?
And it's collected at the row level like that.
Where then they can actually,
it's actually pretty creepy what they can do with data.
They could be like,
what's the average credit score
of the person that buys our K-Cups?
Like they could actually like,
you know, they can do that
because when they get IRI data, they know, unless you pay in cash, they know who
bought it. And luckily, I wasn't responsible for that pipeline. One of my friends and colleagues
was responsible for that pipeline. So I just got to consume that data as a source within HANA.
That got piped into HANA as well. And I was able to act as a consumer.
You say luckily. I assume that one, I bet,
was a huge pain.
I think it was a big lift because they didn't have
the integrations that they might have today.
So, you're dealing
with automating exports and stuff
like that. Sure.
Lots of FTP servers.
At large volumes.
Very large volumes, exactly,
because it's every sale.
Fascinating.
And this is so fun.
I feel like we're learning things that we haven't,
like the details of some of this stuff
that we haven't really talked about on the show.
Oh, totally.
I love supply chain analytics because it's its own niche thing.
So like consumer packaged goods is certainly its own niche thing,
but just general supply chain,
any like if you make heaters,
I can help you.
If you make airplanes,
I can help you, you know,
because supply chain of like
knowing how much product
we have in stock today,
when is more product coming in
or when are we making more product?
How is that comparing up
against our demand for product?
That's all stuff that now is,
I almost have like a template for it.
It's like easy problems for me to solve.
So I really enjoy talking about this stuff.
And those are the types of clients where I really excel,
even though I've had clients
in all types of industries at this point.
I have, John, I know you probably have a ton of questions,
but one last question on this subject for me, sort of looking back over your time and all the change that happened
across a number of different vectors, the technology landscape has changed drastically,
but what hasn't changed from your perspective, right?
I mean, and what kind of made me think about this a little bit was you saying they had
a BI team, you know, before you even joined.
What types of things haven't changed, even though the world you described in 2008 is just so drastically different in many ways from a technological standpoint for a company that would be setting up the same sort of infrastructure?
Star schema.
It hasn't, like, seriously. Like, I learned about star schema in 2008 when I was at Keurig, because that's how they designed their cubes. And nowadays we could take liberties, and every year at Coalesce they're going to have a talk called "Is Kimball Dead?" or "Is Star Schema Dead?" And they kind of say, yeah, like, you don't really need to do that anymore. But realistically, what we're doing is we're just not following all of the Kimball best practices of like surrogate keys
and like foreign keys and all of this stuff. But we're still keeping our facts and dimensions
separate. And we're still joining them right before we send the data to Tableau. We take our
fact table and we join it to our customer table, our product table, and our fiscal calendar. And then we send one big table to Tableau. So it's not really a star schema
because we didn't do it by the book of how all the extra steps and added time and complexity
that Kimball said you should do. You don't really need to do that anymore. But this concept of facts
and dimensions is timeless.
Absolutely. And actually, I hear Joe Reis talking a lot about how people don't care about data modeling anymore, and how the average data model is very sloppy and not following a rigid technique. And to a certain extent, I think that's kind of okay. Like, on one hand, it's really cute to have a perfect star schema, but think about the six weeks extra that it would take versus just getting the answer, like, here's my query, and this gets the answer. Sometimes you have to look at time to value. So you always have to look at time to value, right, Eric?
Yes. Such a tired phrase, but so true.
It's so true. Yeah, ROI and time to value.
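As a concrete sketch of the pattern Jeff describes, keeping facts and dimensions separate upstream and joining them right before the BI tool, here is roughly what that last-mile "one big table" model might look like. The schema and column names (fct_sales, dim_customer, dim_product, dim_calendar) are hypothetical, not his actual models:

```sql
-- One Big Table for Tableau: facts and dimensions stay separate upstream,
-- then get joined in a final model and shipped as a single wide table.
CREATE OR REPLACE TABLE reporting.sales_obt AS
SELECT
    f.order_id,
    f.order_date,
    f.quantity,
    f.net_revenue,
    c.customer_name,
    c.channel,              -- e.g. grocery, club, retail
    p.product_name,
    p.brand,
    cal.fiscal_week,
    cal.fiscal_year
FROM analytics.fct_sales    AS f
JOIN analytics.dim_customer AS c   ON f.customer_key = c.customer_key
JOIN analytics.dim_product  AS p   ON f.product_key  = p.product_key
JOIN analytics.dim_calendar AS cal ON f.order_date   = cal.calendar_date;
```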
Yeah. John, I've been monopolizing the mic.
Yeah, so let's dig into some of that cost optimization thing. You know, we talked before the show that, starting in let's say 2008, you probably had servers that were, maybe you had them in Rackspace, right? Like, that's a nice 2008 hosting company. Maybe our servers are in Rackspace, and then at some point, let's say in 2010, 2012-ish, you get into AWS, but then you still have servers that are in AWS. And then eventually you've got things that come out like Lambda, and then further and further down the road of basically pay per hour and then pay per second. So there's this evolution here. And I guess maybe walk us through, from your data experience, how that impacted your job
on a regular basis?
Like I was thinking about this in 2008 or 2010 or 2012.
Now I'm thinking about this in 2024.
What's the evolution there
specifically around the cost optimization?
Sure.
I'll kind of think about this in three chunks.
There's my pre-SAP HANA era, where you had the world that you described.
It's like fixed costs, yeah.
More or less fixed costs. Your data gets bigger, you need to spend more, but you know what you're going to pay for a year, because you pay for a year up front. Right. Then there's like the HANA days, and then there's my post-2018 days, and that's where I've seen more change. So from 2008 to 2018, the big change that I went through personally in my data endeavor was this adoption of SAP HANA. I'd like to talk more about 2018 onward, as now I'm getting into Snowflake and pay-for-compute models and stuff like that. It is such a drastic mentality shift. One little pre-note
is that not all companies are there yet. I have one of my main
clients today is using Amazon Redshift and they're not using the serverless. They're using
the pay per month. So it's like they know what they're getting up front. It's a couple thousand
dollars a month and it doesn't change. It doesn't matter if you use it or don't use it. It's a fixed
cost. There's something to be said for that, because they know up front what they're going to spend, and there's not going to be a, whoops, I ran a BigQuery query without a LIMIT clause and I accidentally spent five thousand dollars. You know, that is actually a problem that people have in BigQuery.
Yeah, no, it is. I've done that before, not quite to that extent, but yeah. I mean, we had an analyst at the last company exploring some Google Shopping data.
And I think it was like 500 bucks.
Yeah.
Just for like 20, 30 minutes because of the data set size.
Totally.
But to your point, companies budget annually
based on a fixed cost model.
Like that's what your budget is.
It's like whatever it is, you know,
$30,000 for a warehouse for you know
per month a large company maybe per year at a smaller company but like but then you have these
variable costs so then it becomes like as a if you're in any sort of leadership position they're
actually really challenging to manage the cost up or down you know obviously like cheaper is always
better in general but then like you manage it too well and you lose your budget and down you know obviously like cheaper is always better in general but then like you manage
it too well and you lose your budget and then you know then the next year you're fighting to get the
budget back it's a tough problem whereas before it was like like let's buy this let's get it
depreciated over three years you know it's an asset like we can you know there's it's not
operating it's not just in that operating cost budget like bucket like it is now remote for a lot of companies which i know you can buy
reserved you know reserved instances and and you know sign a snowflake contract for multiple years
and accomplish some of the same things but i think the mindset though is tough i've always had a
tough you know problem with kind of finance and IT and figuring out that.
How do we explain? How much is it going to be like, well, we don't know?
It's not a good answer, right? Yeah, totally. So I think it comes down to
your organization needs to grow expertise in cost optimization for the warehouse that you're using.
Yeah. And I'll talk about Snowflake
because that's what I know the most about.
I think that most companies
grossly overestimate their usage in the first year
and then go way over budget of their usage in the second year.
Yeah.
So Snowflake, I think it's still $25,000 minimum investment
if you don't want to pay
on a credit card.
So you could sign up,
credit card,
your account costs $20 a month
if you just use it
just a little bit.
Yeah.
Like whatever it is.
But if you want to get,
okay, now we want you
to send us an invoice
instead of,
and you want a little bit
of a discount.
Right.
You have like $25,000
minimum investment
to deal with that sales rep.
Most clients aren't hitting that their first year because they're implementing.
They don't have a lot of usage.
Year two, all of a sudden it's like, oops, we spent a lot of money.
And so you really kind of have to figure out how to rein it in.
And there's certain things, certain areas, that we can look at to rein it in. Would you like me to go into a few of those?
Yeah. I think one interesting, well, I think the why behind it, like, why did it blow up, right? And I think the positive answer would be, we're trying to democratize data. We've got all these analysts writing queries that didn't have access before. We have, you know, BI tools where people are building their own dashboards and maybe even composing their own queries in the tool.
So all that is theoretically a positive thing.
But you did just democratize data, which means from a cost standpoint, you just gave a bunch of people access to this thing where the meter runs every time, you know, a new person uses it.
Well, and
hopefully that's the case, but oftentimes the transformation step is spinning the warehouse more than people actually using the data.
Sure.
And so you want to know, what's your ratio of having your dbt jobs running in your warehouse versus users actually using the data? And the other thing is, a lot of companies are paying more to load their data into Snowflake than their Snowflake bill is actually costing them, meaning that their Fivetran bill is higher than their Snowflake bill.
Yeah.
You want to know that. Most companies, I think, if that's you, you want to know that, and you want to fix that right now, basically, because it shouldn't cost a lot of money to load data into Snowflake.
So that's kind of looking at the entire step.
Fix it by using something other than Fivetran.
Is that what you're saying?
It is actually.
Okay.
I didn't know if you had some magic Fivetran optimization you wanted to share with us.
No.
I mean, sorry.
It's like, you know, I have my opinions, but Fivetran has a very well-known pricing problem in the industry.
And it's like, I'm not here to talk down about any company by any means.
But if you're paying by the row, you want to get less by the row.
But paying by the row is really tough.
So let me just give an example.
This forecast accuracy thing that we talked about before, where they're re-forecasting every product every day, that means the entire data set changed every day.
It's not like event data where you're just getting new events.
And it's like, oh, I want Fivetran to track in new events.
That means that I have a couple billion new rows every day.
So right off the bat, you can't use Fivetran, because you would never be able to afford it. Right. But a lot of times in supply chain data, when you're dealing with MRP, which is materials requirements planning, or you're dealing with inventory snapshots, or you're dealing with forecasting, again, you have more data that changes every day than doesn't.
Right. Yeah.
Like, by product, how much stock
did I have at this
warehouse today?
And how much did I
have yesterday?
Almost all of it
changed.
Right.
If you're a busy
warehouse.
So, you don't want
to pay by row
for those types
of things.
You want to either
come up with your
custom loading
or come up with...
I love tools.
I actually prefer
tools over
custom programs. I think that
when teams adopt tools, they can move faster. They can be a leaner and smaller team by adopting a few
smart tools. So you just want to look at how the tool is priced and make sure that it fits your
use case. Well, I think in Fivetran's defense, they have done a nice job of developing a lot of quality
connectors, because if you compare them to open source connectors, a lot of the top connectors are just clearly better. So in like the marketing space, they've got, like, if you compare, I don't know, like HubSpot or Salesforce connectors versus some of the open source ones, and you're not just a massive operation.
It makes a ton of sense.
So it does make a lot of sense for a lot of data sources.
So the data sources that aren't in the list of what I just mentioned,
where you just have massive amounts of changing records every day.
For systems like HubSpot, where even Salesforce, you could do okay.
I've seen Salesforce run out of control on the Fivetran side as well.
But a lot of these connectors, I agree with you, they are high quality, maybe the highest quality of
reliability and accuracy and ease of use. When I evaluate
another extract load tool, I basically say, is it Fivetran simple?
And if it's not Fivetran simple, it's kind of just off the list.
And the answer is usually no.
Right? Yeah.
Like, there are some good ones out there
that are, you know, in the mix
now, but there are a lot of them that, like,
this is just going to be too hard, or you need a more
technical person to manage. It's a good tool, but you need
a more technical person to manage it.
Yeah. But circling back to
where we started is, you want to be hyper aware of, am I spending more on loading data to Snowflake than my entire Snowflake bill?
Because that means you're a little bit upside down.
Right.
And then you also want to know, like, am I spending more on my transformation compute than my end users are consuming?
And really see if you can get it to be that your end users are the cause of your high Snowflake bill, because that's what you really want. Because that means you're paying the price for democratizing data, but that's money well spent, whereas the other money might not be money well spent.
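One rough way to check both of those ratios, if warehouses happen to be split by workload, is to compare credits by warehouse over the last month from Snowflake's account usage data. This is a sketch with made-up warehouse names (LOADING_WH, TRANSFORM_WH, REPORTING_WH); the Fivetran side of the comparison has to come from the Fivetran invoice itself:

```sql
-- Rough cost split over the last 30 days: loading vs transformation vs end users.
SELECT
    warehouse_name,
    SUM(credits_used)                                    AS credits,
    ROUND(100 * SUM(credits_used)
          / SUM(SUM(credits_used)) OVER (), 1)           AS pct_of_total
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP)
GROUP BY warehouse_name
ORDER BY credits DESC;
```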
Well, and the other thing too is, I like to think of it as a pull model versus a push model. The push model is like, open Fivetran, check all the boxes, like, connect everything you can find an API key for, check all the boxes, get all the data in. Versus a pull model of like, hey, there's a business requirement, somebody actually needs something, pull it through: check the box in Fivetran, transform it in dbt, you know, deliver it in your BI tool of choice. That's an efficient model. But it's kind of easier to do it the other way, right? Of like, hey, business, look at all this data we have for you, we have everything possible from, and then you just list out every system the business uses, it could be 20 systems. And then, A, they're overwhelmed, B, you're going to waste a ton of money, you know, constantly ingesting and storing that data. Do you see people struggle with that too, where they just kind of check all the boxes,
suck it all in, and aren't using most of it? I do, and every
single Fivetran customer that I've helped has gone back and
unchecked boxes months later. It's like, how many boxes can we
uncheck? What's being used? If it's not being used, let's uncheck it.
It is that hoarder mentality, though. We're like, yeah, we might need that, you know. Like, we can get it again later.
It'll be free, right? It'll be there.
Exactly, exactly.
One thing I was thinking about related to cost optimization: you said you sort of have these frameworks that you use to help your clients think through data projects.
Do they incorporate like are they sort of sensitive to the cost optimization?
Could you explain your framework?
So I wish that they were.
So maybe it's my lack of being able to communicate this to the actual IT teams instead
of business teams. But a lot of times, at least with the clients that I have, and I'll say that
I stick with clients for a long time. So it's usually a few clients. They have not been using
my framework to optimize their costs, unfortunately, though I wish they were. We're using my framework to solve business problems, but not always to optimize the costs. But I was going to say, we could shift into the framework, or there's one or two more things I could say about cost optimization.
Yeah, totally. Sorry, I didn't mean to change gears, but yeah, let's run with the cost optimization.
Sure. So, just to give people a few tips, the number one mistake that I see out there, which is a very expensive mistake, is to use your virtual warehouses in Snowflake like cost centers. So you create like a supply chain extra small, a finance extra small, an HR extra small, and so now you're trying to apply like a cost center,
or the department who's using the warehouse to the warehouse.
Because a lot of times I'll log into the client Snowflake account
and all three of those extra small warehouses that I just said
are on at the same time when one of them could just be on.
So if you just had like reporting extra small,
now instead of paying to have three warehouses on
for three different teams,
you just have one warehouse on.
That's like kind of like my number one tip
is have as few virtual warehouses as you can,
even to the extent of like just one of each size
and then just use those.
And then the way that you actually apply your cost centers and figure out who's using the data is through either query tags or putting comments in your dbt models, or there's a tool for this.
So I'm a huge fan of select.dev.
They're a Snowflake cost optimization company.
They have kind of turnkey ways that you can see who's using what data.
And what they do is they apply a comment to every query that gets executed in dbt or Tableau, however you specify it. So then I could say, okay, my supply chain team spent $30 consuming data and my HR team spent $100 consuming data. It's a much better way of allocating the cost centers than by splitting out your virtual data warehouses.
Yeah, let me comment on that. I made that mistake early on.
I did too. That's how everyone did it. You had to learn.
Yeah, but go ahead, John. Sorry.
Yeah, I made that mistake early on, not too badly, but I had an ingestion warehouse, a transformation warehouse, and a reporting warehouse. So I split it by workload.
And then the logic of like,
oh wow, all three of those are running. Guess what? That's three times as much as one of them running.
So then we just tried like, hey, what if we just literally combined everything into one?
And everything was fine. Didn't have any performance problems at all.
And then turn on the auto-scaling.
Then, so yeah, sure, for your extra small, let that scale up to maybe three nodes, or, yeah, three clusters, actually.
Yeah, let's actually talk about auto-scaling just for a second. But I'm curious, what's the trade-off? So like,
you have a smaller warehouse that runs for longer, right? To do something versus a larger warehouse
that can run for shorter,
but it's more expensive.
Like what's the, how do you think about that?
So the way I think about it is,
when you're talking about scaling up
to the next size warehouse.
Yeah, right.
If the query runs for more than two minutes,
then you could try and get it under one minute.
So basically, the one minute is the minimum billing increment.
So if you're running a query for exactly a minute,
you've had 100% utilization.
So you don't want queries running for less than a minute
on a larger warehouse size,
but you want them running the shortest amount of time possible, down to about one minute.
So that's one kind of target that you can go for.
Nice.
And the way that you could tell if scaling up is actually going to help you
is if you look in the query profile, there's something called like disk spillage.
And then there's like spillage to remote storage,
which means they actually spilled it to S3.
Okay.
So a virtual warehouse is actually, it's a computer, right?
So it's trying to process as much
as it can in RAM, and when
the RAM runs out, it's spilling to disk.
But when you fill up
the hard drive on that virtual computer,
it's now dumping out to S3.
And if it's dumping to S3,
you know for a fact that going to the next warehouse size up, it's going to have a bigger hard drive, and it won't dump to S3, because now it has a bigger hard drive.
And more RAM, right?
And it had more RAM.
It had more everything.
That's right.
But looking at disk spillage, and specifically the remote spillage, is how you can tell if scaling up will help.
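If you would rather scan for this across all recent queries than open query profiles one at a time, the same signals are exposed as columns in Snowflake's ACCOUNT_USAGE.QUERY_HISTORY view. A sketch; the 30-day window and the LIMIT are arbitrary choices:

```sql
-- Queries that spilled to remote storage are prime candidates for a bigger warehouse.
SELECT
    query_id,
    warehouse_name,
    warehouse_size,
    ROUND(total_elapsed_time / 1000, 1)               AS seconds,
    ROUND(bytes_spilled_to_local_storage  / 1e9, 2)   AS gb_spilled_local,
    ROUND(bytes_spilled_to_remote_storage / 1e9, 2)   AS gb_spilled_remote
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP)
  AND bytes_spilled_to_remote_storage > 0
ORDER BY gb_spilled_remote DESC
LIMIT 50;
```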
And then you don't want to scale up to the point that your query is running in two seconds. You want it to run for about a minute. Let's say the query is taking an hour on an extra small. Yeah, you know, you could go orders of magnitude up until that thing is running in a minute, and then it's basically costing the same.
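To put rough numbers on that rule of thumb, using Snowflake's published credit rates (an extra small is 1 credit per hour, and each size up doubles it) and assuming the query parallelizes cleanly, which it often won't do perfectly: a query that takes 64 minutes on an extra small costs about 64/60 ≈ 1.07 credits. On a large (8 credits per hour) it would ideally finish in about 8 minutes, 8 × 8/60 ≈ 1.07 credits; on a 2X-large (32 credits per hour), about 2 minutes, still roughly 1.07 credits. Same spend, much faster answer, which is why pushing a long-running query up in size until it lands near the one-minute mark tends to be roughly cost-neutral in the best case.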
Yeah, nice. That's super helpful.
That is, yeah, that is very cool.
And then there's this concept of scaling horizontally. So you can have an extra small warehouse,
and then you could say min clusters and max clusters.
So I could call it Jeff's extra small, okay?
And I'll say max clusters is three,
and min clusters is one.
That means, and that's a concurrency issue.
So if only I'm using it,
it'll just be one cluster running.
If now all of a sudden there's like 40 or 50 Tableau users
using that same warehouse at the same time,
it'll just spin up automatically an extra cluster of Jeff's extra small.
So now it's now scaling horizontally to handle concurrency
versus scaling up to handle a harder query to process.
So I think three is a good number just to start with.
So use the smallest warehouse you can.
Let it scale up to three if you think you need to.
And for Tableau, a lot of times I'll start Tableau with a medium
just because I want Tableau to be a little bit faster.
But again, letting it scale up to three if it needs to.
And it almost never does.
It has to be at least eight queries in the queue for it to scale up one more cluster.
Well, if you think about your average workload, right, at a mid-sized company or even a larger company, I would guess that there's certain peak hours, and then peak times of the month, where you may have, you know, 100 people or 80 people or whatever.
But the majority of the time, even during the workday,
is going to be a fraction of that.
So that makes a lot of sense.
Totally, totally.
Yeah, and that's what's really nice is that Snowflake
will then automatically handle it by,
hey, what's the number of queries I have in my queue right now?
Okay, let's turn on another one just to kill that queue.
Yeah, it's nice.
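For reference, the knobs Jeff is describing correspond to a couple of parameters on a Snowflake warehouse. A minimal sketch with a made-up warehouse name, not any client's actual configuration; note that MIN_CLUSTER_COUNT and MAX_CLUSTER_COUNT (multi-cluster warehouses) require the Enterprise edition:

```sql
-- A single shared reporting warehouse instead of one per department.
-- MIN/MAX_CLUSTER_COUNT handles concurrency (scaling out);
-- WAREHOUSE_SIZE handles heavy single queries (scaling up);
-- AUTO_SUSPEND stops the meter quickly when nobody is querying.
CREATE WAREHOUSE IF NOT EXISTS reporting_xs
  WAREHOUSE_SIZE    = 'XSMALL'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 60        -- seconds of inactivity before suspending
  AUTO_RESUME       = TRUE;
```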
Yeah, so that's my best cost saving tip.
That's almost like the free lunch one that anyone could do.
You could do it right now without making a huge impact to your organization.
You do have to be a little bit careful because you might break some jobs
which were using the warehouses that you're deleting.
So you do them one at a time, see what breaks.
Well, first understand ahead of time
what you think is going to break, fix it,
turn off one, see what breaks,
you know, that type of thing.
So love it.
Super helpful.
Why don't we talk about your frameworks
to round out the show?
I'd love to.
So when a client comes to me
with a particular business problem, and it's normally the business
teams reaching out to me more so than the IT teams.
I'll just kind of put that on the table that it's very much like there's a business problem
that someone's trying to solve.
There's two different frameworks that I walk them through.
And the overarching idea that sits on top of these two frameworks is this concept that the only reason to use data
is to control a business process.
So you want to use data to control something.
Now, it's not an FYI thing.
So example, how much did we sell last week?
You want to, let's even generalize it further,
a sales dashboard.
Why have a sales dashboard
unless you're trying to control your sales
to meet that target?
Like a sales dashboard isn't an FYI thing.
It's a, are we marching towards our goal?
And if we're behind our target
for this point in the year,
what are the reasons why I'm behind my target?
So that's kind of the overarching thing
that sits on top of my two frameworks.
But framework number one is just,
comes from the Six Sigma manufacturing methodology.
And within Six Sigma,
they have this process called
define, measure, analyze, improve, control. And you can see this framework is set up so that it ends with control. So first we start thinking about controlling a business process, and then we say, here are the steps to actually control a process. We're going to define it: what you're trying to measure and what your problem is. Then we figure out how to measure it, analyze the result, and decide what you're going to do to improve it. And then you just get to this point where all you're doing is you're
using the dashboard to control the process and make sure that the process is in control. And
you can apply this to anything. You can apply it to sales, to forecast accuracy, to your snowflake
spend. So the one thing that I love about this tool called Select that I mentioned earlier is
that it has dashboards that show you where you're spending and what your most expensive queries are.
But it really comes down to this fact that we're going to use it to control the process of
getting our spend under control. So that's kind of framework one. And then framework number two is, what should a dashboard do, what should be on a dashboard? Well, every dashboard should do three things. It should say: where are we today, what is the trend, and what is the call to action? And so if we unpack each one of those three. So, where are we today? That's like, at the top, they call it a BAN, a big-ass number, at the top. That's the good or the bad. Like, yeah, this Tableau dashboard came to my email, I see a picture of the dashboard in the body of the email, and there's a green check at the top next to a number, I know I'm good. Or there's a red X at the top next to the number, I know I'm bad. So that's the where are we today. And if we think about,
we'll just use the sales dashboard example
because it's a very simple concept
for people to think about.
If you're behind your target, it's an X, right?
So number two of all dashboards must do
is what is the trend?
So this might be your sales and weekly buckets
on a line chart.
So the simplest tools in Tableau, the simplest chart types are the best chart types.
So we have a line chart that says, here's our sales by week.
Here's our target by week, which weeks were above and below the target.
And same week last year, maybe.
So maybe you have three lines on a chart.
So you could see where this year is comparing to your target and last year.
So now you have a really good picture of the trend.
And then the third thing is, what is the call to action? And that said another way is, what are the items within
my business topic that are driving my business results? And if we think about a sales dashboard,
it's like, well, these three sales reps are really behind on their sales. They're the problem. Or
these three product IDs are really behind on the sales.
It's these products aren't doing well,
or these brands, or these package categories,
or whatever it happens to be.
But at the bottom of the dashboard,
below the trend chart,
is going to be a bunch of different bar charts, usually,
that can then act as filters on the stuff above it.
So I could click on this sales rep's name and then the whole dashboard filters to just
that sales rep and I can see how far behind is he?
What are his products?
How are his products doing?
And so that brings it all together of we now have where are we today?
What is the trend?
What's the call to action?
And I can use that whole thing to more or less bring the process under control.
So it's the manager's job who's consuming the dashboard to then use it to then go figure out why the problems are the problems.
That's my framework.
Yeah, I love it.
Okay, one question right off the bat, and I think I know the answer,
but do you find that this increases or decreases the number of dashboards in a company?
I think it does increase, because they want to control more points, right? But I do like this idea of, I mean, Ethan Aaron always talks about 10 dashboards, which I don't think is a reasonable number of dashboards for an organization that has more than 10 departments. I mean, just think of any company that makes something that goes on the grocery shelf, right? They have manufacturing, they have distribution, they have purchasing, human resources, they have finance. They have more than 10 departments. You can't just have 10 dashboards, right?
Yeah, right.
But each team should not have more than 10. And each individual team maybe should only have like five KPIs that they're really looking at.
So if you're the supply chain team, you should care about things that are in stock, things that are late, things that aren't shipping on time, and like how your inventory is and is your forecast accurate.
But maybe that's hopefully five things that I just said.
And then anything else that you want to measure, you say which one of those five is going.
And then you end up with maybe 50 dashboards for the whole entire organization.
But to get back to your question, it does beg a lot more questions.
And they say, well, that was so effective.
What else can we control?
Yeah.
And I just did this at one of my clients where the first dashboard went live.
They put it on the monitors throughout the building and everyone was looking at
it as they were walking around throughout the building and instantly this particular
KPI within like one week got so much better, like hundreds of percent better.
Basically.
Like it was just like a marked improvement because people's
names were on the dashboard.
If you put someone's name on a dashboard
with a problem next to their name,
they're going to go clean up that problem.
And the problem started getting cleaned up right away.
And you want to be careful
because you don't want a punitive culture,
but you also want an effective culture as well.
Right. Tell me about this.
This is something that I guess
I hadn't thought about in this context,
but I've done before, where it's almost like you, from a dashboard perspective,
like dashboards and anything you put up. So you mentioned like that first week,
it got a lot better, right? But that same dashboard a year later, like people just
walk by it and they ignore it. So one of the things that I've done in the past, say you've got five metrics, is keeping it visually fresh, but also just almost like rotating, because it's like, all right,
team, we're going to focus on like improving this metric and almost like visually like,
all right, like that's the biggest number.
Everything else is still there.
And like, and you know, almost like keeping people's interest as part of the strategy and
keeping people's focus, where, like, you're focusing on five things versus one thing. So let's pick out what we think is most impactful this month or this quarter, focus on that, make that really big, make that the focus, and put everything else out there, because we don't want to completely drop off on the other four, but we want to focus on the one. Have you seen people implement that strategy or thought about that?
So one thing that I've done is, when it's time to, like, hey, this dashboard is no longer the thing that we need, it's not the hot topic anymore, because we've achieved control, basically.
Yeah.
If you just put a watch on it, you can sunset the dashboard and just do like a Slack alert that's like, hey, if this number goes above 300,
Got it.
I want to know, because then maybe I can pull that dashboard out of the archive and see what's happening. Yeah, so that's, I think, just putting a watch on things. So you could use Tableau alerts, you could use Slack alerts if you have a pipeline tool.
Cool. Yeah. Nice. Awesome.
We're at the buzzer, and this has been an amazing conversation, Jeff. I mean, I don't actually think we've discussed SAP HANA on the show maybe ever, but we learned some really interesting things about it. We learned how to optimize Snowflake, and love your framework on the
dashboarding. I mean, that's, that really is just, I was thinking back through,
you know, unfortunately in, in all the sort of analytics and BI stuff I've done,
we never used a framework that clear. But as you were describing
that, I was thinking back on which ones worked really well? Which ones can I look back on and
be like, that was actually extremely effective. And they basically all aligned with almost the
exact, it's like, oh yeah, the big number at the top where it's like, this is good or bad.
And that's really the only thing that is like, yeah, that was really effective.
Awesome. I'm so glad to hear the conversation was enjoyable. And yeah, thanks a lot, guys, for having me on and asking great questions.
Absolutely, absolutely. Well, have a great one up in the Green Mountains, and we'll see you out in the data sphere, on LinkedIn.
On LinkedIn.
Peace.
The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.