The Data Stack Show - 195: Supply Chain Data Stacks and Snowflake Optimization Pro Tips with Jeff Skoldberg of Green Mountain Data Solutions
Episode Date: June 26, 2024

Highlights from this week's conversation include:
Jeff's Background and Transition to Independent Consulting (0:03)
Working at Keurig and Business Model Changes (2:16)
Tech Stack Evolution and SAP HANA Implementation (7:33)
Adoption of Tableau and Data Pipelines (11:21)
Supply Chain Analytics and Timeless Data Modeling (15:49)
Impact of Cloud Computing on Cost Optimization (18:35)
Challenges of Managing Variable Costs (20:59)
Democratization of Data and Cost Impact (23:52)
Quality of Fivetran Connectors (27:29)
Data Ingestion and Cost Awareness (29:44)
Virtual Warehouse Cost Management (31:22)
Auto-Scaling and Performance Optimization (33:09)
Cost-Saving Frameworks for Business Problems (38:19)
Dashboard Frameworks (40:53)
Increasing Dashboards (43:29)
Final thoughts and takeaways (46:28)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to the Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human
challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new
data technologies and how data teams are run at top companies. Welcome back to the show. We are here with Jeff Skoldberg from
Green Mountain Data. Jeff, welcome to the Data Stack Show. We're super excited to have you.
Thanks so much for having me. Excited to be here.
All right. Well, you are calling in from the
Green Mountains. So, you know, you come by the name of your consulting practice, honestly,
but give us a little background. How'd you get into data? Sure. So I first started in data in
2008 when I was working at Keurig as a call center business analyst. I was with Keurig for about 12 years doing data, a couple of those years actually on the BI team. So I kind of transitioned from business analyst, someone who is just knee-deep in data all the time, to having data actually in my job title, even though I'd been doing it the whole time.
Yeah. And then about five years ago, I said, Hey, let me try and do this on my own,
see what happens. So about five years ago, I left Keurig to become an independent consultant. And since then, I've been helping companies kind of on their analytics
journey, doing end-to-end stuff. So business analyst, the data pipeline, solution architecture,
as well as the last mile, which is analytics, the dashboard. So that's what I've been doing
for the last five years. So Jeff, one of the topics I'm really excited to dive into is cost optimization.
That's one that's near and dear to my heart, previously being in a CTO role.
So I'm really excited to dive into that one.
Any topics you're excited to cover?
Cost optimization sounds awesome.
We can talk about some frameworks that I use with my clients as I'm walking them through
their business problems and we're thinking about how we're going to solve the problems that they come to me with.
That's another one. And we could also talk about maybe how the data landscape has evolved over time.
Yeah. Okay. That sounds great. All right, let's dig in. Jeff, one thing I love is that
we get to talk about sort of your tenure at Keurig a little bit. It seems less and
less common, especially in the world of data, for someone to spend over a decade at a single company.
And not only did you do that, but what a time, like what a decade. I mean, you sort of chose the years,
you know, to span that time at Keurig from sort of before some of the most high impact
technologies we use today were even invented. So sort of start at the beginning, like,
how did you get into, was that like an early job out of school? How did you even get into data at
Keurig? Absolutely. So I went to school at University of Vermont, and this was my first
job out of college. And Keurig being the largest employer in Vermont at the time, remote work not
being as popular as it is now, it was basically my only choice for like working at a big company.
My first data role in 2008 there was as a call center business analyst, where we were looking at average wait time, average talk time, and average handle time, which is like the sum of the two, and using that for forecasting models: when should people be taking their lunch breaks or their 15-minute breaks, etc.? So we were really building a staffing model based off of our call history and those types of metrics. So to kind of keep a very long story short,
eventually I moved into a role as supply chain analyst
where I had very smart managers and mentors
who were really great at coming up with KPIs
to manage the business.
And then I was responsible for bringing them the data
so they can really fulfill those KPIs.
So we could go into some examples maybe later in the show
if we want to dive deeper there.
But it was very much an exercise of like,
hey, we're going to try and reduce XYZ cost.
And now we need the data to understand XYZ cost.
And let's automate that.
So we're not paying our analysts to pull data.
We're paying our analysts to really present it and show where the pain points and the weak spots are. The bulk of my career at Keurig, I would say eight of those 12 years, was as a supply chain analyst.
And then I moved into an actual BI developer role on the BI team when Keurig decided SAP HANA, or just SAP in general, was going to be their ERP system of choice.
My manager made sure that me as a supply chain analyst had unfettered access to the SAP HANA database.
So when he had a business question about how the business is doing, is something in stock?
What's our in-stock percentage? Are things shipping on time, etc.?
That I would be able to write those queries and give them the data.
And then shortly after that, I joined the BI team as an official member of the BI team,
even though I was unofficial BI that entire time.
Yeah.
Very cool.
Okay.
I have a couple, I have a couple of questions.
So one is about the Keurig business and sort of the digital nature of it over that time period. So in terms of distribution channels, and you said supply chain, was there a big shift to more online sales over that time period? And did that have a big impact on sort of your work with the data?
So it's interesting. It actually kind of went the opposite,
where it went more B2B in the long run.
So when I started there in 2007,
they were the largest shipper in Vermont.
Or I'm sorry,
they were the largest shipper in New England.
So this is before people bought everything on Amazon.
Yeah.
So of course, Amazon is the largest shipper
in New England now.
Like, I don't have to look it up or fact-check that. You just kind of know, right? Yeah. But in 2007, Keurig Green Mountain was the largest shipper in New England. They were doing about 20,000 orders a day out of the Waterbury distribution center.
Holy cow. Yeah. Wow. Pretty amazing.
And that didn't grow a lot, their consumer direct. It basically stayed about static over the years, plus or minus. What really exploded was their grocery distribution, their club distribution, club being like Sam's, Costco, BJ's. And retail, so like Bed Bath & Beyond, we would call that the retail channel. So that's obviously separate from grocery. Yeah. And as they became a nationwide brand, looking at metrics like what percent of grocery stores are we in, they very quickly got to 100 percent.
Wow.
Of, you know, the national brands at least. So it was just absolutely explosive growth.
Wow. Okay. So now walk us through the
changes in the tech stack, right? Because I mean, you have sort of business model change and then
explosive growth. And then the other variable is the changing tech landscape, right? And
Absolutely.
The explosive growth generally means
we have all sorts of new data needs,
all sorts of new processes to build
way more data than we were ever processing before.
So give us that story.
So I was very fortunate that I landed in a company
that knew what business intelligence was.
And they had a BI team since before I worked there.
So I started there in 2007, my first data role in 2008. I don't know how long they had a BI team for,
but they had one back then. A small one, but they had a dedicated team called the BI team,
which is really cool because a lot of places that I've done consulting at, they're not there yet.
So I was really fortunate to learn in an environment
where they were thinking about sustainable data assets,
where they were thinking about,
oh, well, Jeff, you might do it this one way in Excel
because that's all you know how to do.
But let's show you like how you would do it in a database
and how you structure your data to support this KPI
to automate it and stuff.
So I actually had really great mentorship
on that tech side as well. So that's really cool that they were light years ahead of their time.
But on the other hand, the actual tech that they were using was, of course, Microsoft SQL Server
as their business warehouse. Their ERP system was an Oracle-based product called
PeopleSoft, which is like, I mean, that's kind of ancient history now, but it was very much a Microsoft shop.
And it was interesting to see it evolve to SAP.
And honestly, it almost felt like we were coming into SAP Business Objects, I forget, what is it called?
BW, SAP Business Warehouse, was like a huge downgrade from SQL Server Analysis Services, like the cubes that you have in SQL Server.
Going from Analysis Services to Business Objects and Business Warehouse was a tough pill to swallow.
But then coming into SAP HANA, which acts like a relational database, it acts like a modern data warehouse.
Some people might not know about HANA, so I'll just explain real quick.
It's an in-memory database.
So when you turn it on, it takes all of your tables and it loads them in RAM. So this runs on massive servers that are just specially designed for SAP HANA,
usually manufactured by Hewlett-Packard.
So they have
a partnership
where Hewlett-Packard is designing
machines specifically for
this database. I had no idea about that.
It's totally
wild. We're talking like terabytes of RAM.
Wow, I had no idea about that. That's very cool HANA trivia.
Yeah. And so we were able to do forecast analytics on a six-billion-point data set using a live connection in Tableau, and you were able to just drag and drop and it would just come up on the screen, maybe with like a one-second delay or something like that, computing live on a six-billion-point data set.
Wow, that's wild.
Totally. So that was really cool. That was my first, I will say, big data. Six billion, you're kind of getting there in the big data realm. You need specialized equipment to process six billion rows, and that's how I define big data.
Some people will say number of terabytes is big data. I define it as, do you need specialized equipment to actually process this data? And certainly with that 6 billion rows, you do.
That was my first kind of experience there with big data. And it was really fun to
be able to optimize HANA, learn about the partitioning, learn how we're going to be clustering our data and organizing it. And using a union instead of a join to do those forecast accuracy comparisons was a huge speed boost.
All of the little performance tuning tips that you learn along the way, it was really enjoyable.
Very cool.
And when did Tableau come into the picture along with SAP or?
Yeah, I think they implemented Tableau in like 2016. So medium early adopters there,
like Tableau was certainly around in 2010 and then kind of had a huge growth.
Were there any major changes in the pipelining over time, especially as you acquired additional
data sources, right? I mean, I'm sure you had,
you know, all sorts of pipelines from all these different, like distributors, vendors,
all that sort of stuff. Sure. So a lot of my analysis came from two places, SAP directly,
our ERP system, our system of record, and our demand planning system, which was called Demantra,
an Oracle based product.
And we have Demantra re-forecasting every combination of product and customer every
single day.
And not only is it doing that once every single day, it's doing it for multiple versions of
the forecast.
So you can have your pure system generated forecast.
You could have your sales outlook forecast.
So like where the sales reps want to put their little juju on the forecast.
That's your sales outlook.
You can have your annual operating plan, which is what you set at the beginning of the year.
And that doesn't really get adjusted.
And then all together, maybe you have like nine versions of the forecast. So it was forecasting every combination of customer
product and then nine different sets of those every single day. And we're saving all of these
snapshots for a certain amount of time before we drop them. And there's certain snapshots that we
save forever. So like one week leading up to the sale, we really care about that. 13 weeks before the sale, we really care about that.
And the reason why we cared about those two things is because one week leading up to the
sale, you can make the K-Cup.
13 weeks leading up to the sale, you can buy the Keurig Brewer and get it there from China.
Six months leading up to the sale and one year leading up to the sale.
Those are the lags that we would keep forever.
And this was years ago,
so I'm improvising here a little bit,
but more or less it's directionally accurate.
And so we had to come up with a process
to ingest all of this data, snapshot it,
and then delete certain slices of the snapshots
as the time lapsed,
and then actually do the comparison.
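To make that retention pattern concrete, here is a minimal, generic SQL sketch of the snapshot-pruning step. The table and column names (forecast_snapshots, snapshot_date, order_week) are hypothetical, and the kept lags of 1, 13, 26, and 52 weeks are illustrative stand-ins for the lead times Jeff mentions, not the actual configuration:

```sql
-- Hypothetical daily forecast snapshot table: one row per product, customer,
-- forecast version, snapshot_date, and forecasted order_week.
-- Prune old snapshots, but keep the lags we care about forever.
DELETE FROM forecast_snapshots
WHERE snapshot_date < DATEADD(week, -13, CURRENT_DATE)                    -- outside the rolling window
  AND DATEDIFF(week, snapshot_date, order_week) NOT IN (1, 13, 26, 52);   -- not a kept lag
```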
So that was like my main play in SAP HANA. You were asking me about multiple systems and
let me circle back to that. They also had IRI data. IRI is like cash register data. So Sam's, Costco, BJ's, Walmart. What did you, Eric, or you, John,
actually buy at the cash register?
And it's collected at the row level like that.
Where then they can actually,
it's actually pretty creepy what they can do with data.
They could be like,
what's the average credit score
of the person that buys our K-Cups?
Like they could actually like,
you know, they can do that
because when they get IRI data, they know, unless you pay in cash, they know who
bought it. And luckily, I wasn't responsible for that pipeline. One of my friends and colleagues
was responsible for that pipeline. So I just got to consume that data as a source within HANA.
That got piped into HANA as well. And I was able to act as a consumer.
You say luckily. I assume that one, I bet,
was a huge pain.
I think it was a big lift because they didn't have
the integrations that they might have today.
So, you're dealing
with automating exports and stuff
like that. Sure.
Lots of FTP servers.
At large volumes.
Very large volumes, exactly,
because it's every sale.
Fascinating.
And this is so fun.
I feel like we're learning things that we haven't,
like the details of some of this stuff
that we haven't really talked about on the show.
Oh, totally.
I love supply chain analytics because it's its own niche thing.
So like consumer packaged goods is certainly its own niche thing,
but just general supply chain,
any like if you make heaters,
I can help you.
If you make airplanes,
I can help you, you know,
because supply chain of like
knowing how much product
we have in stock today,
when is more product coming in
or when are we making more product?
How is that comparing up
against our demand for product?
That's all stuff that now is,
I almost have like a template for it.
It's like easy problems for me to solve.
So I really enjoy talking about this stuff.
And those are the types of clients where I really excel,
even though I've had clients
in all types of industries at this point.
I have, John, I know you probably have a ton of questions,
but one last question on this subject for me, sort of looking back over your time and all the change that happened
across a number of different vectors, the technology landscape has changed drastically,
but what hasn't changed from your perspective, right?
I mean, and what kind of made me think about this a little bit was you saying they had
a BI team, you know, before you even joined.
What types of things haven't changed, even though the world you described in 2008 is just so drastically different in many ways from a technological standpoint for a company that would be setting up the same sort of infrastructure?
Star schema.
It hasn't, like, seriously. Like, I learned about star schema in 2008 when I was at Keurig, because that's how they designed their cubes. And nowadays we could take liberties, and every year at Coalesce they're going to have a talk called "Is Kimball Dead?" or "Is Star Schema Dead?" And they kind of say, yeah, like, you don't really need to do that anymore. But realistically, what we're doing is we're just not following all of the Kimball best practices of like surrogate keys
and like foreign keys and all of this stuff. But we're still keeping our facts and dimensions
separate. And we're still joining them right before we send the data to Tableau. We take our
fact table and we join it to our customer table, our product table, and our fiscal calendar. And then we send one big table to Tableau. So it's not really a star schema
because we didn't do it by the book of how all the extra steps and added time and complexity
that Kimball said you should do. You don't really need to do that anymore. But this concept of facts
and dimensions is timeless.
Absolutely. And actually, I hear Joe Reis talking a lot about how people don't care about data modeling anymore, and how the average data model is very sloppy and not following a rigid technique. And to a certain extent, I think that's kind of okay. Like, on one hand, it's really cute to have a perfect star schema, but think about the six weeks extra that it would take versus just getting the answer, like, here's my query, and this gets the answer. Sometimes you have to look at time to value. So you always have to look at time to value, right, Eric?
Yes. Such a tired phrase, but so true.
It's so true. Yeah, ROI and time to value.
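As a concrete sketch of the pattern Jeff describes, keeping facts and dimensions separate upstream and joining them right before the BI tool, here is roughly what that last-mile "one big table" model might look like. The schema and column names (fct_sales, dim_customer, dim_product, dim_calendar) are hypothetical, not his actual models:

```sql
-- One Big Table for Tableau: facts and dimensions stay separate upstream,
-- then get joined in a final model and shipped as a single wide table.
CREATE OR REPLACE TABLE reporting.sales_obt AS
SELECT
    f.order_id,
    f.order_date,
    f.quantity,
    f.net_revenue,
    c.customer_name,
    c.channel,              -- e.g. grocery, club, retail
    p.product_name,
    p.brand,
    cal.fiscal_week,
    cal.fiscal_year
FROM analytics.fct_sales    AS f
JOIN analytics.dim_customer AS c   ON f.customer_key = c.customer_key
JOIN analytics.dim_product  AS p   ON f.product_key  = p.product_key
JOIN analytics.dim_calendar AS cal ON f.order_date   = cal.calendar_date;
```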
Yeah. John, I've been monopolizing the mic.
Yeah, so let's dig into some of that cost optimization thing. You know, we talked before the show that, starting in let's say 2008, you probably had servers that were, maybe you had them in Rackspace, right? Like, that's a nice 2008 hosting company. Maybe our servers are in Rackspace, and then at some point, let's say in 2010, 2012-ish, you get into AWS, but then you still have servers that are in AWS. And then eventually you've got things that come out like Lambda, and then further and further down the road of basically pay per hour and then pay per second. So there's this evolution here. And I guess maybe walk us through, from your data experience, how that impacted your job
on a regular basis?
Like I was thinking about this in 2008 or 2010 or 2012.
Now I'm thinking about this in 2024.
What's the evolution there
specifically around the cost optimization?
Sure.
I'll kind of think about this in three chunks.
There's my pre-SAP HANA era, where you had the world that you described.
It's like fixed costs, yeah.
More or less fixed costs. Your data gets bigger, you need to spend more, but you know what you're going to pay for a year, because you pay for a year up front. Right. Then there's like the HANA days, and then there's my post-2018 days, and that's where I've seen more change. So from 2008 to 2018, the big change that I went through personally in my data endeavor was this adoption of SAP HANA. I'd like to talk more about 2018 onward, as now I'm getting into Snowflake and pay-for-compute models and stuff like that. It is such a drastic mentality shift. One little pre-note
is that not all companies are there yet. I have one of my main
clients today is using Amazon Redshift and they're not using the serverless. They're using
the pay per month. So it's like they know what they're getting up front. It's a couple thousand
dollars a month and it doesn't change. It doesn't matter if you use it or don't use it. It's a fixed
cost. There's something to be said for that, because they know up front what they're going to spend, and there's not going to be a, whoops, I ran a BigQuery query without a LIMIT clause and I accidentally spent five thousand dollars. You know, that is actually a problem that people have in BigQuery.
Yeah, no, it is. I've done that before, not quite to that extent, but yeah. I mean, we had an analyst at the last company exploring some Google Shopping data.
And I think it was like 500 bucks.
Yeah.
Just for like 20, 30 minutes because of the data set size.
Totally.
But to your point, companies budget annually
based on a fixed cost model.
Like that's what your budget is.
It's like whatever it is, you know,
$30,000 for a warehouse for you know
per month a large company maybe per year at a smaller company but like but then you have these
variable costs so then it becomes like as a if you're in any sort of leadership position they're
actually really challenging to manage the cost up or down you know obviously like cheaper is always
better in general but then like you manage it too well and you lose your budget and down you know obviously like cheaper is always better in general but then like you manage
it too well and you lose your budget and then you know then the next year you're fighting to get the
budget back it's a tough problem whereas before it was like like let's buy this let's get it
depreciated over three years you know it's an asset like we can you know there's it's not
operating it's not just in that operating cost budget like bucket like it is now remote for a lot of companies which i know you can buy
reserved you know reserved instances and and you know sign a snowflake contract for multiple years
and accomplish some of the same things but i think the mindset though is tough i've always had a
tough you know problem with kind of finance and IT and figuring out that.
How do we explain? How much is it going to be like, well, we don't know?
It's not a good answer, right? Yeah, totally. So I think it comes down to
your organization needs to grow expertise in cost optimization for the warehouse that you're using.
Yeah. And I'll talk about Snowflake
because that's what I know the most about.
I think that most companies
grossly overestimate their usage in the first year
and then go way over budget of their usage in the second year.
Yeah.
So Snowflake, I think it's still $25,000 minimum investment
if you don't want to pay
on a credit card.
So you could sign up,
credit card,
your account costs $20 a month
if you just use it
just a little bit.
Yeah.
Like whatever it is.
But if you want to get,
okay, now we want you
to send us an invoice
instead of,
and you want a little bit
of a discount.
Right.
You have like $25,000
minimum investment
to deal with that sales rep.
Most clients aren't hitting that their first year because they're implementing.
They don't have a lot of usage.
Year two, all of a sudden it's like, oops, we spent a lot of money.
And so you really kind of have to figure out how to rein it in.
And there's certain things, certain areas, that we can look at to rein it in. Would you like me to go into a few of those?
Yeah. I think one interesting, well, I think the why behind it, like, why did it blow up, right? And I think the positive answer would be, we're trying to democratize data. We've got all these analysts writing queries that didn't have access before. We have, you know, BI tools where people are building their own dashboards and maybe even composing their own queries in the tool.
So all that is theoretically a positive thing.
But you did just democratize data, which means from a cost standpoint, you just gave a bunch of people access to this thing where the meter runs every time, you know, a new person uses it.
Well, and
hopefully that's the case, but oftentimes the transformation step is spinning the warehouse more than people actually using the data.
Sure.
And so you want to know, what's your ratio of having your dbt jobs running in your warehouse versus users actually using the data? And the other thing is, a lot of companies are paying more to load their data into Snowflake than their Snowflake bill is actually costing them, meaning that their Fivetran bill is higher than their Snowflake bill.
Yeah.
You want to know that. Most companies, I think, if that's you, you want to know that, and you want to fix that right now, basically, because it shouldn't cost a lot of money to load data into Snowflake.
So that's kind of looking at the entire step.
Fix it by using something other than Fivetran.
Is that what you're saying?
It is actually.
Okay.
I didn't know if you had some magic Fivetran optimization you wanted to share with us.
No.
I mean, sorry.
It's like, you know, I have my opinions, but Fivetran has a very well-known pricing problem in the industry.
And it's like, I'm not here to talk down about any company by any means.
But if you're paying by the row, you want to get less by the row.
But paying by the row is really tough.
So let me just give an example.
This forecast accuracy thing that we talked about before, where they're re-forecasting every product every day, that means the entire data set changed every day.
It's not like event data where you're just getting new events.
And it's like, oh, I want Fivetran to track in new events.
That means that I have a couple billion new rows every day.
So right off the bat, you can't use Fivetran, because you would never be able to afford it. Right. But a lot of times in supply chain data, when you're dealing with MRP, which is materials requirements planning, or you're dealing with inventory snapshots, or you're dealing with forecasting, again, you have more data that changes every day than doesn't.
Right. Yeah.
Like, by product, how much stock
did I have at this
warehouse today?
And how much did I
have yesterday?
Almost all of it
changed.
Right.
If you're a busy
warehouse.
So, you don't want
to pay by row
for those types
of things.
You want to either
come up with your
custom loading
or come up with...
I love tools.
I actually prefer
tools over
custom programs. I think that
when teams adopt tools, they can move faster. They can be a leaner and smaller team by adopting a few
smart tools. So you just want to look at how the tool is priced and make sure that it fits your
use case. Well, I think in Fivetran's defense, they have done a nice job of developing a lot of quality
connectors, because if you compare them to open source connectors, a lot of the top connectors are just clearly better. So in like the marketing space, they've got, like, if you compare, I don't know, like HubSpot or Salesforce connectors versus some of the open source ones, and you're not just a massive operation.
It makes a ton of sense.
So it does make a lot of sense for a lot of data sources.
So the data sources that aren't in the list of what I just mentioned,
where you just have massive amounts of changing records every day.
For systems like HubSpot, where even Salesforce, you could do okay.
I've seen Salesforce run out of control on the Fivetran side as well.
But a lot of these connectors, I agree with you, they are high quality, maybe the highest quality of
reliability and accuracy and ease of use. When I evaluate
another extract load tool, I basically say, is it Fivetran simple?
And if it's not Fivetran simple, it's kind of just off the list.
And the answer is usually no.
Right? Yeah.
Like, there are some good ones out there
that are, you know, in the mix
now, but there are a lot of them that, like,
this is just going to be too hard, or you need a more
technical person to manage. It's a good tool, but you need
a more technical person to manage it.
Yeah. But circling back to
where we started is, you want to be hyper aware of, am I spending more on loading data to Snowflake than my entire Snowflake bill?
Because that means you're a little bit upside down.
Right.
And then you also want to know, like, am I spending more on my transformation compute than my end users are consuming?
And really see if you can get it to be that your end users are the cause of your high Snowflake bill, because that's what you really want. Because that means you're paying the price for democratizing data, but that's money well spent, whereas the other money might not be money well spent.
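One rough way to check both of those ratios, if warehouses happen to be split by workload, is to compare credits by warehouse over the last month from Snowflake's account usage data. This is a sketch with made-up warehouse names (LOADING_WH, TRANSFORM_WH, REPORTING_WH); the Fivetran side of the comparison has to come from the Fivetran invoice itself:

```sql
-- Rough cost split over the last 30 days: loading vs transformation vs end users.
SELECT
    warehouse_name,
    SUM(credits_used)                                    AS credits,
    ROUND(100 * SUM(credits_used)
          / SUM(SUM(credits_used)) OVER (), 1)           AS pct_of_total
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP)
GROUP BY warehouse_name
ORDER BY credits DESC;
```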
Well, and the other thing too is, I like to think of it as a pull model versus a push model. The push model is like, open Fivetran, check all the boxes, like, connect everything you can find an API key for, check all the boxes, get all the data in. Versus a pull model of like, hey, there's a business requirement, somebody actually needs something, pull it through: check the box in Fivetran, transform it in dbt, you know, deliver it in your BI tool of choice. That's an efficient model. But it's kind of easier to do it the other way, right? Of like, hey, business, look at all this data we have for you, we have everything possible from, and then you just list out every system the business uses, it could be 20 systems. And then, A, they're overwhelmed, B, you're going to waste a ton of money, you know, constantly ingesting and storing that data. Do you see people struggle with that too, where they just kind of check all the boxes,
suck it all in, and aren't using most of it? I do, and every
single Fivetran customer that I've helped has gone back and
unchecked boxes months later. It's like, how many boxes can we
uncheck? What's being used? If it's not being used, let's uncheck it.
It is that hoarder mentality, though. We're like, yeah, we might need that, you know. Like, we can get it again later.
It'll be free, right? It'll be there.
Exactly, exactly.
One thing I was thinking about related to cost optimization: you said you sort of have these frameworks that you use to help your clients think through data projects.
Do they incorporate like are they sort of sensitive to the cost optimization?
Could you explain your framework?
So I wish that they were.
So maybe it's my lack of being able to communicate this to the actual IT teams instead
of business teams. But a lot of times, at least with the clients that I have, and I'll say that
I stick with clients for a long time. So it's usually a few clients. They have not been using
my framework to optimize their costs, unfortunately, though I wish they were. We're using my framework to solve business problems, but not always to optimize the costs. But I was going to say, we could shift into the framework, or there's one or two more things I could say about cost optimization.
Yeah, totally. Sorry, I didn't mean to change gears, but yeah, let's run with the cost optimization.
Sure. So, just to give people a few tips, the number one mistake that I see out there, which is a very expensive mistake, is to use your virtual warehouses in Snowflake like cost centers. So you create like a supply chain extra small, a finance extra small, an HR extra small, and so now you're trying to apply like a cost center,
or the department who's using the warehouse to the warehouse.
Because a lot of times I'll log into the client Snowflake account
and all three of those extra small warehouses that I just said
are on at the same time when one of them could just be on.
So if you just had like reporting extra small,
now instead of paying to have three warehouses on
for three different teams,
you just have one warehouse on.
That's like kind of like my number one tip
is have as few virtual warehouses as you can,
even to the extent of like just one of each size
and then just use those.
And then the way that you actually apply your cost centers and figure out who's using the data is through either query tags or putting comments in your dbt models, or there's a tool for this.
So I'm a huge fan of select.dev.
They're a Snowflake cost optimization company.
They have kind of turnkey ways that you can see who's using what data.
And what they do is they apply a comment to every query that gets executed in dbt or Tableau, however you specify it. So then I could say, okay, my supply chain team spent $30 consuming data and my HR team spent $100 consuming data. It's a much better way of allocating the cost centers than by splitting out your virtual data warehouses.
Yeah, let me comment on that. I made that mistake early on.
I did too. That's how everyone did it. You had to learn.
Yeah, but go ahead, John. Sorry.
Yeah, I made that mistake early on, not too badly, but I had an ingestion warehouse, a transformation warehouse, and a reporting warehouse. So I split it by workload.
And then the logic of like,
oh wow, all three of those are running. Guess what? That's three times as much as one of them running.
So then we just tried like, hey, what if we just literally combined everything into one?
And everything was fine. Didn't have any performance problems at all.
And then turn on the auto-scaling.
Then, so yeah, sure, for your extra small, let that scale up to maybe three nodes, or, yeah, three clusters, actually.
Yeah, let's actually talk about auto-scaling just for a second. But I'm curious, what's the trade-off? So like,
you have a smaller warehouse that runs for longer, right? To do something versus a larger warehouse
that can run for shorter,
but it's more expensive.
Like what's the, how do you think about that?
So the way I think about it is,
when you're talking about scaling up
to the next size warehouse.
Yeah, right.
If the query runs for more than two minutes,
then you could try and get it under one minute.
So basically, the one minute is the minimum billing increment.
So if you're running a query for exactly a minute,
you've had 100% utilization.
So you don't want queries running for less than a minute
on a larger warehouse size,
but you want them running the shortest amount of time possible, down to about one minute.
So that's one kind of target that you can go for.
Nice.
And the way that you could tell if scaling up is actually going to help you
is if you look in the query profile, there's something called like disk spillage.
And then there's like spillage to remote storage,
which means they actually spilled it to S3.
Okay.
So a virtual warehouse is actually, it's a computer, right?
So it's trying to process as much
as it can in RAM, and when
the RAM runs out, it's spilling to disk.
But when you fill up
the hard drive on that virtual computer,
it's now dumping out to S3.
And if it's dumping to S3,
you know for a fact that going to the next warehouse size up, it's going to have a bigger hard drive, and it won't dump to S3, because now it has a bigger hard drive.
And more RAM, right?
And it had more RAM.
It had more everything.
That's right.
But looking at disk spillage, and specifically the remote spillage, is how you can tell if scaling up will help.
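If you would rather scan for this across all recent queries than open query profiles one at a time, the same signals are exposed as columns in Snowflake's ACCOUNT_USAGE.QUERY_HISTORY view. A sketch; the 30-day window and the LIMIT are arbitrary choices:

```sql
-- Queries that spilled to remote storage are prime candidates for a bigger warehouse.
SELECT
    query_id,
    warehouse_name,
    warehouse_size,
    ROUND(total_elapsed_time / 1000, 1)               AS seconds,
    ROUND(bytes_spilled_to_local_storage  / 1e9, 2)   AS gb_spilled_local,
    ROUND(bytes_spilled_to_remote_storage / 1e9, 2)   AS gb_spilled_remote
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP)
  AND bytes_spilled_to_remote_storage > 0
ORDER BY gb_spilled_remote DESC
LIMIT 50;
```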
And then you don't want to scale up to the point that your query is running in two seconds. You want it to run for about a minute. Let's say the query is taking an hour on an extra small. Yeah, you know, you could go orders of magnitude up until that thing is running in a minute, and then it's basically costing the same.
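To put rough numbers on that rule of thumb, using Snowflake's published credit rates (an extra small is 1 credit per hour, and each size up doubles it) and assuming the query parallelizes cleanly, which it often won't do perfectly: a query that takes 64 minutes on an extra small costs about 64/60 ≈ 1.07 credits. On a large (8 credits per hour) it would ideally finish in about 8 minutes, 8 × 8/60 ≈ 1.07 credits; on a 2X-large (32 credits per hour), about 2 minutes, still roughly 1.07 credits. Same spend, much faster answer, which is why pushing a long-running query up in size until it lands near the one-minute mark tends to be roughly cost-neutral in the best case.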
Yeah, nice. That's super helpful.
That is, yeah, that is very cool.
And then there's this concept of scaling horizontally. So you can have an extra small warehouse,
and then you could say min clusters and max clusters.
So I could call it Jeff's extra small, okay?
And I'll say max clusters is three,
and min clusters is one.
That means, and that's a concurrency issue.
So if only I'm using it,
it'll just be one cluster running.
If now all of a sudden there's like 40 or 50 Tableau users
using that same warehouse at the same time,
it'll just spin up automatically an extra cluster of Jeff's extra small.
So now it's now scaling horizontally to handle concurrency
versus scaling up to handle a harder query to process.
So I think three is a good number just to start with.
So use the smallest warehouse you can.
Let it scale up to three if you think you need to.
And for Tableau, a lot of times I'll start Tableau with a medium
just because I want Tableau to be a little bit faster.
But again, letting it scale up to three if it needs to.
And it almost never does.
It has to be at least eight queries in the queue for it to scale up one more cluster.
Well, if you think about your average workload, right, at a mid-sized company or even a larger company, I would guess that there's certain peak hours, and then peak times of the month, where you may have, you know, 100 people or 80 people or whatever.
But the majority of the time, even during the workday,
is going to be a fraction of that.
So that makes a lot of sense.
Totally, totally.
Yeah, and that's what's really nice is that Snowflake
will then automatically handle it by,
hey, what's the number of queries I have in my queue right now?
Okay, let's turn on another one just to kill that queue.
Yeah, it's nice.
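For reference, the knobs Jeff is describing correspond to a couple of parameters on a Snowflake warehouse. A minimal sketch with a made-up warehouse name, not any client's actual configuration; note that MIN_CLUSTER_COUNT and MAX_CLUSTER_COUNT (multi-cluster warehouses) require the Enterprise edition:

```sql
-- A single shared reporting warehouse instead of one per department.
-- MIN/MAX_CLUSTER_COUNT handles concurrency (scaling out);
-- WAREHOUSE_SIZE handles heavy single queries (scaling up);
-- AUTO_SUSPEND stops the meter quickly when nobody is querying.
CREATE WAREHOUSE IF NOT EXISTS reporting_xs
  WAREHOUSE_SIZE    = 'XSMALL'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 60        -- seconds of inactivity before suspending
  AUTO_RESUME       = TRUE;
```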
Yeah, so that's my best cost saving tip.
That's almost like the free lunch one that anyone could do.
You could do it right now without making a huge impact to your organization.
You do have to be a little bit careful because you might break some jobs
which were using the warehouses that you're deleting.
So you do them one at a time, see what breaks.
Well, first understand ahead of time
what you think is going to break, fix it,
turn off one, see what breaks,
you know, that type of thing.
So love it.
Super helpful.
Why don't we talk about your frameworks
to round out the show?
I'd love to.
So when a client comes to me
with a particular business problem, and it's normally the business
teams reaching out to me more so than the IT teams.
I'll just kind of put that on the table that it's very much like there's a business problem
that someone's trying to solve.
There's two different frameworks that I walk them through.
And the overarching idea that sits on top of these two frameworks is this concept that the only reason to use data
is to control a business process.
So you want to use data to control something.
Now, it's not an FYI thing.
So example, how much did we sell last week?
You want to, let's even generalize it further,
a sales dashboard.
Why have a sales dashboard
unless you're trying to control your sales
to meet that target?
Like a sales dashboard isn't an FYI thing.
It's a, are we marching towards our goal?
And if we're behind our target
for this point in the year,
what are the reasons why I'm behind my target?
So that's kind of the overarching thing
that sits on top of my two frameworks.
But framework number one is just,
comes from the Six Sigma manufacturing methodology.
And within Six Sigma,
they have this process called
define, measure, analyze, improve, control. And you can see this framework is set up so that it ends with control. So first we start thinking about controlling a business process, and then we say, here are the steps to actually control a process. We're going to define it: what you're trying to measure and what your problem is. Then we figure out how to measure it, analyze the result, and decide what you're going to do to improve it. And then you just get to this point where all you're doing is you're
using the dashboard to control the process and make sure that the process is in control. And
you can apply this to anything. You can apply it to sales, to forecast accuracy, to your snowflake
spend. So the one thing that I love about this tool called Select that I mentioned earlier is
that it has dashboards that show you where you're spending and what your most expensive queries are.
But it really comes down to this fact that we're going to use it to control the process of
getting our spend under control. So that's kind of framework one. And then framework number two is, what should a dashboard do, what should be on a dashboard? Well, every dashboard should do three things. It should say: where are we today, what is the trend, and what is the call to action? And so if we unpack each one of those three. So, where are we today? That's like, at the top, they call it a BAN, a big-ass number, at the top. That's the good or the bad. Like, yeah, this Tableau dashboard came to my email, I see a picture of the dashboard in the body of the email, and there's a green check at the top next to a number, I know I'm good. Or there's a red X at the top next to the number, I know I'm bad. So that's the where are we today. And if we think about,
we'll just use the sales dashboard example
because it's a very simple concept
for people to think about.
If you're behind your target, it's an X, right?
So number two of all dashboards must do
is what is the trend?
So this might be your sales and weekly buckets
on a line chart.
So the simplest tools in Tableau, the simplest chart types are the best chart types.
So we have a line chart that says, here's our sales by week.
Here's our target by week, which weeks were above and below the target.
And same week last year, maybe.
So maybe you have three lines on a chart.
So you could see where this year is comparing to your target and last year.
So now you have a really good picture of the trend.
And then the third thing is, what is the call to action? And that said another way is, what are the items within
my business topic that are driving my business results? And if we think about a sales dashboard,
it's like, well, these three sales reps are really behind on their sales. They're the problem. Or
these three product IDs are really behind on the sales.
It's these products aren't doing well,
or these brands, or these package categories,
or whatever it happens to be.
But at the bottom of the dashboard,
below the trend chart,
is going to be a bunch of different bar charts, usually,
that can then act as filters on the stuff above it.
So I could click on this sales rep's name and then the whole dashboard filters to just
that sales rep and I can see how far behind is he?
What are his products?
How are his products doing?
And so that brings it all together of we now have where are we today?
What is the trend?
What's the call to action?
And I can use that whole thing to more or less bring the process under control.
So it's the manager's job who's consuming the dashboard to then use it to then go figure out why the problems are the problems.
That's my framework.
Yeah, I love it.
Okay, one question right off the bat, and I think I know the answer,
but do you find that this increases or decreases the number of dashboards in a company?
I think it does increase, because they want to control more points, right? But I do like this idea of, I mean, Ethan Aaron always talks about 10 dashboards, which I don't think is a reasonable number of dashboards for an organization that has more than 10 departments. I mean, just think of any company that makes something that goes on the grocery shelf, right? They have manufacturing, they have distribution, they have purchasing, human resources, they have finance. They have more than 10 departments. You can't just have 10 dashboards, right?
Yeah, right.
But each team should not have more than 10. And each individual team maybe should only have like five KPIs that they're really looking at.
So if you're the supply chain team, you should care about things that are in stock, things that are late, things that aren't shipping on time, and like how your inventory is and is your forecast accurate.
But maybe that's hopefully five things that I just said.
And then anything else that you want to measure, you say which one of those five is going.
And then you end up with maybe 50 dashboards for the whole entire organization.
But to get back to your question, it does beg a lot more questions.
And they say, well, that was so effective.
What else can we control?
Yeah.
And I just did this at one of my clients where the first dashboard went live.
They put it on the monitors throughout the building and everyone was looking at
it as they were walking around throughout the building and instantly this particular
KPI within like one week got so much better, like hundreds of percent better.
Basically.
Like it was just like a marked improvement because people's
names were on the dashboard.
If you put someone's name on a dashboard
with a problem next to their name,
they're going to go clean up that problem.
And the problem started getting cleaned up right away.
And you want to be careful
because you don't want a punitive culture,
but you also want an effective culture as well.
Right. Tell me about this.
This is something that I guess
I hadn't thought about in this context,
but I've done before, where it's almost like you, from a dashboard perspective,
like dashboards and anything you put up. So you mentioned like that first week,
it got a lot better, right? But that same dashboard a year later, like people just
walk by it and they ignore it. So one of the things that I've done in the past, say you've got five metrics, is keeping it visually fresh, but also just almost like rotating, because it's like, all right,
team, we're going to focus on like improving this metric and almost like visually like,
all right, like that's the biggest number.
Everything else is still there.
And like, and you know, almost like keeping people's interest as part of the strategy and
keeping people's focus, where, like, you're focusing on five things versus one thing. So let's pick out what we think is most impactful this month or this quarter, focus on that, make that really big, make that the focus, and put everything else out there, because we don't want to completely drop off on the other four, but we want to focus on the one. Have you seen people implement that strategy or thought about that?
So one thing that I've done is, when it's time to, like, hey, this dashboard is no longer the thing that we need, it's not the hot topic anymore, because we've achieved control, basically.
Yeah.
If you just put a watch on it, you can sunset the dashboard and just do like a Slack alert that's like, hey, if this number goes above 300,
Got it.
I want to know, because then maybe I can pull that dashboard out of the archive and see what's happening. Yeah, so that's, I think, just putting a watch on things. So you could use Tableau alerts, you could use Slack alerts if you have a pipeline tool.
Cool. Yeah. Nice. Awesome.
We're at the buzzer, and this has been an amazing conversation, Jeff. I mean, I don't actually think we've discussed SAP HANA on the show maybe ever, but we learned some really interesting things about it. We learned how to optimize Snowflake, and love your framework on the
dashboarding. I mean, that's, that really is just, I was thinking back through,
you know, unfortunately in, in all the sort of analytics and BI stuff I've done,
we never used a framework that clear. But as you were describing
that, I was thinking back on which ones worked really well? Which ones can I look back on and
be like, that was actually extremely effective. And they basically all aligned with almost the
exact, it's like, oh yeah, the big number at the top where it's like, this is good or bad.
And that's really the only thing that is like, yeah, that was really effective.
Awesome. I'm so glad to hear the conversation was enjoyable. And yeah, thanks a lot, guys, for having me on and asking great questions.
Absolutely, absolutely. Well, have a great one up in the Green Mountains, and we'll see you out in the data sphere, on LinkedIn.
On LinkedIn.
Peace.
The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.