The Data Stack Show - 124: Pragmatism About Data Stacks with Pedram Navid of West Marin Data
Episode Date: February 1, 2023

Highlights from this week's conversation include:
- Pedram's journey into the world of data (4:05)
- What should the data stack at an early-stage startup look like? (9:53)
- New ideas surrounding access control for data (24:45)
- What can data teams learn about complexity from software engineering? (30:55)
- Scaling up instead of scaling out in processing data (37:40)
- Why DuckDB is making so much noise in the market (41:06)
- Final thoughts and takeaways (53:25)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome to the Data Stack Show. If you have followed LinkedIn or Substack
influencers in the data space, you've probably come across Pedram Navid. He is a really smart
guy, has written some really helpful articles on lots of data-related things. I actually found his content researching several
topics before meeting him. And we got the chance to meet him and invited him on the show.
And I'm super excited to chat with him, Kostas. He started out in finance and the financial world
with data, and then was at several startups in the Bay Area, most recently Hightouch,
and now he's running his own consultancy.
So where am I going to start with my questions?
That's a difficult part.
I think one thing that I do really want to dig into with him, which we haven't talked
a ton about on the show, is data stacks at early stage companies. You know, we've
talked with a lot of startup founders who have created startups, especially in the data space,
obviously. We've talked with a lot of data practitioners at various sizes of companies.
And I don't know if we've talked with many data practitioners who have done this at multiple very early stage startup companies
in the SaaS space. And so I think that's a really helpful thing to think through for me and for a
lot of our listeners by getting an opinion from someone who's done this multiple times over about
what do you actually need in that stage as a company in terms of your data stack? And then
the other question I want to ask is, are you thinking about scale?
You know, because generally startups need to become hyper growth, or at least that's
the plan.
So those are my two big questions.
What does the data stack look like?
And then how do you think about building it in a way that can scale, you know, if you
hit the jackpot?
Yeah, for me, I want to start with learning from him, like what's the difference
between working in a very hard and regulated industry, like finance, where
he initially was working at and then going and working like in a series A startup.
Yeah, that's huge.
And also what is helpful to keep from the work on a big and probably bureaucratic
like organization, when you go and work in a chaotic environment, like a Series A,
post-product-market-fit but pre-growth, let's say, stage
company, where things are changing constantly. But it would be awesome to
hear from him what he found useful from his experience in doing that.
That's one thing.
And the other thing is that Pedram is, like, exposed to all the new things
that are happening in these industries.
Like to hear from him what's his take and opinion on some
technologies like DuckDB, for example.
Oh yeah.
This whole thing of, okay, let's scale up or scale out, what we should do with infrastructure and how we should process our data.
So yeah, let's start with him.
Let's do it.
Pedram, welcome to the Data Stack Show.
It's been a long time coming.
Thanks.
Glad to be here.
All right. Well, we'll start where we always do. Give us your background, especially the parts about how you got into data
in the first place. It's an old story now. I started at a bank a long, long time ago, and
we had data coming in from a vendor through PowerPoint slides.
And it had two columns on it, one for this month and one for last month.
And that was all the data we had.
Yeah.
And every month they would send us a new PowerPoint slide, replacing one column with the other.
And so I think my boss asked, is there a way where we can kind of figure out what's going on month by month over a trend?
And so I would hand copy
this data from PowerPoint to Excel.
One thing led to another,
and I built a dashboard.
Eventually, I learned VBA
because I got tired of
doing things manually.
And that was really
the gateway drug into
the rest of my career.
Python, R, data science, all that happened through the span of 12 years.
Then I moved to the Bay Area and I thought, you know, enough banking, let's jump into startup life.
Worked at a few different startups, first as the data scientist, eventually the data engineer, because
I thought data science just took too long to get results. And one thing led to another, and most
recently I ended up at Hightouch as their head of data, doing data, marketing, and product.
Oh, so many questions. Okay. One thing from the early part of the story:
it sounds like you sort of went through your learnings, you know,
VBA through to Python and then other subsequent
languages and methodologies there. Were you doing all that at a bank? And if
so, were you sort of teaching yourself and bringing that technology into the bank? And the reason I
ask is, you know, traditionally we think about banks as sort of being resistant to sort of
technological change, especially if they're getting data, you know, delivered in PowerPoint.
So we'd love to hear a little bit more about that journey and how you brought those technologies in.
I mean, what was that like?
It was difficult to say the least.
So VBA was allowed because Microsoft Excel was allowed.
And so you were allowed to use that.
I learned VBA on my own, painfully, slowly.
I think as most people learn it. I doubt many people go to school for VBA.
So that was just the beginning.
And then as I was searching,
I found about this thing called Python.
And I probably wasn't supposed to download it
to my bank laptop, but I did.
And so that helped a little bit with the automation.
And again, it was all really self-driven, self-taught,
just trying to solve problems I didn't want to do myself.
I was like purely motivated by laziness.
And I mean, I think to this day,
that's still the driving factor behind what I do.
I love it.
As we move towards things like R
to actually do real business modeling and analysis,
that's when I got the most resistance.
We were doing like compensation modeling
for 12,000 employees in Microsoft Excel.
And we were passing down this one spreadsheet
back and forth.
And...
FTP?
No, email.
Oh, man.
Maybe SharePoint, if you were lucky. So as it's moving through hands, everyone's changing these models, they're dragging and dropping, and
stuff is changing and things are breaking that no one knows about, obviously, right? And six months go by,
you've rolled out your compensation model, and you've got to figure out why the numbers aren't right. And you go back and you find
that some guy
accidentally filled the wrong
column in the spreadsheet.
Or even worse, the executive would change
their mind on what the package should look like every
five minutes. Then you go back and update
50 different apps, try to recalculate things.
So I thought there must be a
better way. I learned about
this thing called R. I was learning about data science on the side.
And so I thought, what if I put all this logic into code
instead of into workbook
and try to automate some of this work?
Our VP was very upset.
He did not like it.
He thought R was a black box.
And I realized what he
was mad about wasn't
using R. He just wanted a spreadsheet.
So I would
do all the work in R and then just output it to a spreadsheet
at the end of the day and give it to him.
You can still have that.
And then everything was fine.
It was always a work of
appeasing stakeholders that never ends.
Yeah.
Yeah, that's such a good insight.
It's funny to hear the concept of R being a black box
because, I mean, nothing could be further from the truth.
Completely open source.
Yeah, perception is reality.
That's super helpful.
Okay, well, let's...
So then let's fast forward to the move to the Bay Area.
You were involved in multiple startups, most recently Hightouch,
and did a bunch of data stuff at early-stage startups. So, you know, in our chat beforehand,
you were saying, you know, sort of seed to Series A stage at those companies.
And one thing I'm really interested in
that I've wanted to ask you for a while is
what your take is on what the data stack
at an early stage startup should look like.
And there are a couple of motivating factors. One, I'm selfishly interested
because, you know, I'm involved with that every day. But it's not something we've talked about
on the show a ton. You know, we've talked with people running startups, running data startups.
You know, we've talked with enterprises, but we haven't really honed in on,
okay, you're a really early stage company.
You know, what does your data stack look like?
And then, well, I'll follow up with part B of the question.
But yeah, so you're series A, you know,
sort of late seed stage,
and you're running data at that company.
What do you actually need?
And you can't just say it depends, right?
Well, just explain what the dependencies are.
Yeah, let's say, all right.
My motivating factor whenever I do things is I need something that I don't need to babysit,
and I'm willing to trade off costs for
engineering time because
I'm just one person
and again I'm very lazy but I'm also
probably busy doing other things
I need something that just works
and I think
in those early stage startups
your data is usually not very big
Yep, right.
And I might say it'd be blasphemy,
but I might argue that your data is not that valuable
when you're first starting out.
It's good to have.
Can you unpack that a little bit more?
Yeah.
I agree with you, but I think that's really helpful.
If the goal of data is to help drive decisions,
at an early stage company,
you don't have that much data, right?
Because there's not much happening yet.
And you probably know
every customer you have.
And you probably know
how you close that deal
and where you got it from.
So what are you really learning
from a really complex data stack right now?
You're not building models.
You're not scoring leads.
You're not doing marketing attribution.
At the end of the day,
you're maybe counting revenue
and maybe a number of customers.
That's really the value when you're first starting out.
Now, it's good to start with that stuff
because as those complex questions build over time,
having a nice foundation can make it easier to answer those things.
But I think we don't need to invest,
unless like data is your product,
you probably don't need to invest a ton into your data stack
in the early days.
No, that makes sense.
I think, you know, one specific example of that I've experienced multiple times is that
things like multi-touch attribution are extremely powerful, but you actually have to have a
pretty huge amount of data and generally a lot of paid programs running in order for a
multi-touch attribution model to really be additive in terms of shifting marketing budget
right? And when you're not spending a ton of money, you know, you can spend a lot of time developing a
model that might be accurate. But at the end of the day, it's like, well, okay, we're going to move 10 grand from
this bucket to this bucket. It's not a huge deal. That's super interesting. Okay. How about scale,
though? Because in an ideal world, these early stage startups hit hyper growth and scale really
quickly. And when that happens, tons of stuff breaks across the company, you know, which is just the way that things go. And,
you know, people have to fix all sorts of stuff, you know, from org charts to data stack. So
how do you think about that aspect of it? Right? Like early on, you want something that just works.
It's a small team.
Do the tools available scale?
How do you think about that side of it?
That is a really good question. So let's go through the whole stack.
On the ingest side, there's a few options.
There's your Fivetrans, there's your Airbytes, and so on.
And those,
I mean, that scales
as long as your wallets are deep.
Right?
So, that's probably
fine when you're first starting out because you don't want to invest
too heavily into that.
It's hard to anyway.
So,
that is something you can always take down the road
and decide, do we want to keep using
this or should we build something internally to help
reduce cost? You can pay to push that decision
off. Exactly. Yeah. Until it's too
painful and then you can deal with it.
On the data warehouse side,
you're probably not going to go wrong with
Snowflake or BigQuery.
You probably don't need Databricks,
I would assume.
And I can't see a good reason to use Redshift anymore.
Yeah, probably.
I mean, I doubt you'll hit
scaling limits with Snowflake.
Again,
BigQuery is a bit more questionable.
But again,
you really got to be pushing numbers
to be hitting problems there.
And what else do you need to do?
It's dbt for modeling, which, sure, you'll probably hit limits there again.
But if you're at the scale where you're hurting yourself
through what's capable through that stack,
then you've got really good problems.
You must have a ton of data and a ton of business.
And you can just throw engineers at it at that point.
So I would welcome that issue.
If the stack I built today doesn't really scale,
then that's great.
Let's hire more people and fix it.
Yep.
100%.
100%.
Yeah.
I think I'm thinking about some of our,
you know,
large customers. And yeah, you have to be at a pretty big scale
to sort of, you know,
I'm thinking about ones that have migrated off,
you know, Redshift into,
you know, almost going fully
onto like data lake infrastructure, right?
But you're talking about like unbelievable,
unbelievable scale
when you sort of outpace like, you know,
basic warehouse stuff, which is super interesting.
You could probably get away with Postgres as the data warehouse, if you really wanted to,
in the early days, right?
That's probably what you will hit limits on.
So that's where I think maybe just go with Snowflake and hope you don't.
But if you're cost conscious and you just wanted something cheap and simple, Postgres is pretty strong and powerful.
Yeah, super interesting.
Okay, other than the tools that you just mentioned, and then I'll pass the mic over to Kostas because, of course, the rhythm of the show is that I monopolize and then he does.
What are the nice-to-haves for you?
Right, so I understand like the core infrastructure.
So you have ingest, you have warehousing,
you have a modeling layer,
you know, in the early stage, that's all you need.
Are there any sort of, okay,
you have a larger budget than you expected.
So I'm going to just, you know,
I'm going to do some quality
of life or some, do you have any preferences around things that you would add to that stack?
I don't believe in quality of life for the data team. I just haven't seen one that like
increases my quality of life enough to justify the expense. For me, it's much more like tactical,
like planning out for the future. So I've got my basic data stack.
Probably going to need BI, right?
So maybe we can start with...
I was going to ask about that if you didn't mention it.
You probably will need BI at some point.
Maybe you start with Superset, and it's pretty cheap and free.
Maybe you decide you need a semantic layer because the demands
on your team are growing high
and then you move to
a Looker or a Lightdash.
That's all.
These are all
valid places to be.
There's Metabase.
There's nothing wrong
with any of those.
I think those are all
highly dependent on your team.
I'm going to call that
a nice to have.
You probably need it
at some point.
It's just like,
when is the right time?
Product Analytics
is another one.
So getting data from RudderStack into Amplitude
or any of the other ones out there.
Feature adoption and sort of understanding.
Yeah, activation, growth, all that kind of funnel stuff.
That, I mean, that's usually driven by demand,
not by something you just want to do for fun, right?
So if your marketing team and your product teams are asking for this stuff, you got to find a solution.
And the solution usually isn't writing SQL queries for funnels because nobody wants or knows how to do that.
Instead, you give them something self-serve.
That's kind of how I look at it.
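To make that point concrete, here is a rough sketch of the funnel counting a product-analytics tool automates (the users and event names are made up for illustration, and this version ignores event ordering and timestamps, which real funnel analysis cares about):

```python
# Count how many users reach each ordered step of a funnel.
# Hypothetical events: (user_id, event_name) pairs.
events = [
    ("u1", "signup"), ("u1", "create_project"), ("u1", "invite"),
    ("u2", "signup"), ("u2", "create_project"),
    ("u3", "signup"),
]
steps = ["signup", "create_project", "invite"]

def funnel_counts(events, steps):
    # Group event names by user.
    by_user = {}
    for user, name in events:
        by_user.setdefault(user, set()).add(name)
    # A user "reaches" step i only if they also did every prior step.
    counts = []
    reached = set(by_user)
    for step in steps:
        reached = {u for u in reached if step in by_user[u]}
        counts.append(len(reached))
    return counts

print(funnel_counts(events, steps))  # [3, 2, 1]
```

Even this toy version takes real SQL effort to express per-funnel in a warehouse, which is part of why handing stakeholders a self-serve tool usually wins.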
Everything else just seems, I don't know.
I need something motivating for me
to go get it.
There's data quality, which is always
one people talk about. There's catalog,
there's metadata.
Those all
seem nice to have, but would I go out and
spend my marketing or my data dollars on it?
Not unless I had
a pressing need. Yeah.
Would you throw sort of orchestration tools into that bucket? I mean, I think about the cataloging
and orchestration... again, we're talking about early-stage startups here. We're not talking about the
validity of these tools in general, right? Because at scale, like, obviously data teams are running all these things. But the cataloging piece and the orchestration piece, I sort of see as really a next level, where you have a growing data team and you have a level of complexity where, you know, those have a lot more appeal.
But in the early stages, like they actually add more complexity in some ways than quality of life.
100%.
I mean, at the end of the day,
how big is your data team, right?
Do you really need a catalog
when you're the one building every table?
How long does it take, maybe?
Right?
So, I mean, we can build a catalog
and pretend that we'll put it
in front of all our stakeholders
and they'll go look at it.
They never do.
They never will.
That's just not a thing
that they're ever going to do.
Data catalog is for the data team.
At the end of the day, if I'm the data team,
I don't really need one.
Problems of scale are what those tools tend to address.
In the early days, those aren't your problems.
Yeah, super interesting.
Okay, actually, one more question in that same train of thought.
Sorry, Kostas.
Have you learned any lessons around when to introduce
or even how to introduce tooling?
Because I think you make a really interesting point
on something like a cataloging tool
where you can take something that, in and of itself, is very useful, can be extremely useful to teams to drive data discovery, etc., especially at scale. But you can introduce it without context in a way that really paints those tools in a bad light. Or even, I mean, you could even think about, in some cases, a tool like dbt, which,
you know, feels ubiquitous to us in the industry, right.
But can seem redundant to someone whose context is, well, just write SQL right on your warehouse.
You know, that seems redundant, right?
Have you learned any lessons on like when and how to introduce tooling in a way that drives wider adoption?
If it's something that you have a lot of conviction about.
Not talking about the quality of life stuff, but something you have conviction about.
I don't know if I should be honest.
I think the tooling I tend to introduce is always driven by demand at the end of the day.
And so when I look at tools that are more cross-functional,
no one cares about the tools I use internally.
I mean, why would they?
It's like caring whether or not someone's using Svelte.
It doesn't matter what the engineering team uses.
That's a concern for them.
Most of the concerns for the data team
are really data team concerns.
No one cares if you're using dbt or not,
or Snowflake or BigQuery.
Those are your sort of issues.
I think where it becomes tricky
is each stakeholder tooling.
So your BI layer is really that interface
between your team and other teams.
Cataloging is similar.
It's that interface between your team and other teams.
Although I would argue cataloging
is really most useful within data teams.
So that's really the way I look at it.
And if it's something that's external-facing,
like the Amplitudes, like the axis,
and the Lookers, and the Lightdashes,
then it's definitely a mutual discussion about
what are your needs?
What types of workflows are you going to use, and
let's try a POC together.
It will never be me just making
a decision for everybody, but I want
stakeholders involved so that they have
buy-in, and they can see the value
of the decisions we're making.
At the end of the day, they'll be consuming
this far more than I will, so
let's make sure that they do.
And for the most part that's worked. They tend to love the
tools that we pick together.
That's great.
Well said.
Wonderful advice.
All right, Costas.
Thank you, Eric.
Thank you for giving me the microphone.
So, Pedram, I have a question.
It's been like, I don't know, like five, 10 years now that there is some kind of
like explosion in terms of, I'm calling it like innovation or new products or
whatever, like when it comes to working with data, right? Like the modern
data stack. If you just take, like, a map of the modern data stack, it's all the
different, like, products that are sold.
A lot, right?
And you will hear about quality, about storage, modeling, semantic layers.
I don't know, meta semantic layers, whatever.
There is one thing, though, that I don't hear that much about, and maybe it's my fault, but I'd love your thoughts on that, because you came from a very regulated industry,
banking, right?
And you moved into like series A companies where obviously like things are like much
more scrappy when it comes like to how we regulate access around data. But what's going on with access control over the data that we have?
Like, how do we control what's going on with this data or who has access to that?
Or how we share it?
How do we process it?
Or when someone comes and says, oh, I have the right to be forgotten or whatever, and going through, like, every Excel
reference, like every reference in an Excel document you have in your company,
you have to remove me.
So what have you seen there?
What's your opinion?
And is it my fault that I don't hear that much about that?
It's definitely not your fault.
I would blame the marketers on this one again.
They're not doing a great enough job of educating you.
There are two companies
I know of in this space, so it is
not very big.
Immuta, I think, is one.
And I just talked to one
called Jetty today, actually, about this.
And they're
both trying to approach this, I guess, problem
of access control
and visibility into
who has access to what.
And the problem is there's just so many
tools that you have to
regulate access on.
If you think of, you have your data
in Snowflake and it goes into Looker.
Just those two tools.
That's probably two completely different sets of
ways of managing permissions.
And it's not enough to manage it
just on Snowflake and hope the rest
works, because of the way that
it's going to work, you might have access to
finance data in Looker that you wouldn't
expect. So getting
that right, I think it's really hard.
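A toy sketch of that failure mode (the role and dataset names here are invented): if the BI tool connects to the warehouse through its own shared service account, then the BI tool's permission model, not the warehouse's, decides what a viewer actually sees:

```python
# Hypothetical illustration: two tools, two separate permission models.
# The warehouse restricts the analyst role, but BI tools often query through
# a shared service account that already has broad warehouse access, so the
# BI tool's own grants decide what renders in a dashboard.
warehouse_grants = {"analyst": {"sales", "product"}}           # no finance access
bi_tool_grants = {"analyst": {"sales", "product", "finance"}}  # misconfigured: too broad

def can_query_warehouse(role, dataset):
    # Direct warehouse access is governed by warehouse grants.
    return dataset in warehouse_grants.get(role, set())

def can_view_in_dashboard(role, dataset):
    # Dashboard access is governed only by the BI layer's grants.
    return dataset in bi_tool_grants.get(role, set())

# The analyst is blocked in the warehouse but not in the dashboard:
print(can_query_warehouse("analyst", "finance"))   # False
print(can_view_in_dashboard("analyst", "finance")) # True
```

That mismatch between the two grant sets is exactly the gap the tools mentioned here try to make visible.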
And I don't think many
startups are actually thinking about it or worried about it.
I think it's pretty open in the early days of who has access to data,
and people tend to lock things down not because of the regulatory side, but more because
people aren't using the data correctly, at least in my experience.
I tend to default to having things open initially. And then that always backfires
because everybody's going in and querying data,
coming up with answers, and they're always wrong.
And they're asking you to check their queries for them.
You're like, ah, wait a minute.
No, no one gets access anymore.
That's the type of access control that we have
with startups, really.
Banking is totally different.
Obviously, it's very regulated,
to an incredible degree. I think we had a
typo on a field in a dashboard, and I requested it to be fixed, and it was a three-to-four-week
estimate, because it had to go through, like, a different team, and you had to pay with brown
dollars, and it had to
come back and get approved and all this stuff. It's like all these layers,
just to fix the typos.
So I never want to work in that environment again.
But it's probably something we could learn from, you know, maybe hearing
a little bit more about who has access to what and how we manage permissions
across the data stack for sure.
Yeah.
Yeah.
I think you made, like, a very good point there.
It's not just about, I mean, the data only, it's the overall resources around
data that you have to govern somehow.
And it's not only security or like privacy, it's also like how easily
things can turn into a mess. Like I've seen, like when you,
for example, you have a big engineering team and you give access like to everyone on the
Snowflake instance, like the things that will happen there are not good. Eric knows, Eric knows
very well because I think one of the results of this policy
was having a database named after his name on Snowflake.
That bad boy is still in production.
Really?
So it still lives?
Eric DB lives on.
Eric DB, Eric DB will live on.
I will give it up when RudderStack IPOs. But yes, Eric DB
still runs production dashboards.
Well, yeah, because, like, after a while, when
it just starts having many people getting served from these resources,
it's not that easy to decommission it.
Definitely. I think it's expensive, because...
Not everyone knows
how Snowflake charges you.
Yeah.
If you're doing a small query
every five seconds,
well, the data's small.
How much would it cost?
Well, it costs $20,000
over a year.
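The arithmetic behind a number like that is worth spelling out. A back-of-envelope sketch, with assumed prices (an XS warehouse at 1 credit per hour and roughly $2.50 per credit; these figures are illustrative, not from the episode): a query every five seconds means the warehouse never auto-suspends, so you pay for every hour of the year.

```python
# Back-of-envelope Snowflake cost for a tiny query arriving every 5 seconds.
# Assumptions (hypothetical; check your own contract): an XS warehouse bills
# 1 credit/hour while running, a credit costs ~$2.50, and the steady query
# stream never lets the warehouse auto-suspend, so it runs 24/7.
HOURS_PER_YEAR = 24 * 365
CREDITS_PER_HOUR = 1
DOLLARS_PER_CREDIT = 2.50

annual_cost = HOURS_PER_YEAR * CREDITS_PER_HOUR * DOLLARS_PER_CREDIT
print(f"${annual_cost:,.0f}/year")  # in the ballpark of the $20,000 mentioned
```

The data being small is irrelevant here: the cost is driven by warehouse-on time, not bytes scanned.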
So,
I think
people will care about governance
eventually at some point
and it's just like,
how many times have you gotten burned before you do?
Yeah, I didn't really care about
governance at my first
startup, but I certainly cared about it
at my last one.
It's easy to see how things
go wrong. People make mistakes.
And no data team
wants to be faced
with another question
about why two numbers
don't match.
Because this guy over there
went and queried something
and got what they thought
was the right number.
And now it's your job
to go and unwind
this 15-page query
that they wrote
to figure out why
these two numbers
are different.
That's a very,
very good point.
And it brings me like to like my next question.
So, okay.
Resource management in general, and like in a pretty complex environment, it's
not anything new in engineering, right?
Just think about someone like an SRE or a DevOps engineer in a medium-sized startup on AWS.
Like the complexity is just like crazy over there.
That's why we have products like HashiCorp's Terraform,
Pulumi, all these things out there.
So software engineering has like many years now that is dealing with complexity.
And complexity is part of productization, not just like complexity
because the problem is complex at its root as a science problem.
There's a lot of like discussion about bringing, let's say best practices from
software engineering into the data space.
Good example of that is dbt, for example, right?
Like how it enables workflows and best practices from software engineering.
Where do we stand with that?
Do you think there's like more that like data teams can learn
from software engineering?
Like, should data teams, in the end, just become
software engineering teams and do the same things?
Or there's some kind of like space or new priorities there that are like,
you know, applicable only for data teams?
It is a really good question.
Certainly dbt has helped, I think.
I remember the old days where data teams, and many still do this,
your SQL queries were saved in a text file on your desktop,
and there was no version control.
You just had to ask someone how they ran something,
and they would send it to you by email, right?
So we've come a long way, I would say,
especially on the data
modeling transformation side.
A lot of the tools in the ecosystem are also
moving towards that model right there.
Building
in things like version control
and declarative, like YAML
configuration, or how you set these things
up. I think that's
all great, but I do
wonder if data teams
themselves are sometimes missing the bigger picture of how
these things work together. If I think back to
the older data engineering types of people,
they tended to come in through more technical backgrounds, right?
They came in through computer science or software engineering, and they learned about all the trade-offs there were between performance and how data moves between systems and what it means for data to use a cache or to go to your drive or disk
or to go through the network, and what all those things meant for response times. That type of stuff,
I think, most engineers kind of understand and know well. And then all the associated stuff that
comes around it, like deploying to containers, Kubernetes, and all this.
It was kind of like they learned this stuff because they had to.
And I think that knowledge has been really helpful.
I do think there's a lot of people coming up in data outside of that.
And maybe they haven't had exposure to that side of the world. And I do see it sometimes biting us a little bit when we're starting to move
data into what is really a production-ized setting without some of that
understanding of what software engineers have learned over the years.
So maybe our tooling is good, but I don't think the conversation about how
we think about moving that stuff around
has really happened yet.
What does it mean to
query data in Snowflake?
How does that actually work?
And
what does it mean to transfer data outside of regions?
And what does that look like in COGS?
And that type of thing.
So I think that type of stuff we still need to maybe do a better job of.
It's still early days,
but when you look at it from five years ago,
we've definitely come a long way.
Do you think it's tooling that is missing,
or let's say knowledge or best practices?
I think the tooling is actually pretty good these days.
It's really best practices,
it's knowledge,
and I think it's learning from each other.
We don't tend to talk too much
about this stuff, right?
When I look at the talks people do in data,
it's sometimes about the tooling itself,
but it's rarely about how we move stuff
into production,
or how we thought about different trade-offs in terms of performance characteristics. That type of question doesn't come up enough, in my mind, versus some of the other types of talks we're having right now.
Yeah, that's an excellent point.
How can we change this?
Better conferences, more collective processes?
I should be writing more about this stuff too.
Like I'm just as guilty as anyone else.
It is happening.
People are asking questions.
Jacob Matson, for example,
he created the modern data stack in a box
not too long ago.
And with that project, I'd really like to see him work
to build Docker and Kubernetes into it.
So if that's something you want to learn more about, you should check it out.
It's a GitHub repo.
It has all that stuff in there.
It's still early days, but I mean, hopefully this is part of that conversation too.
Yeah, that's great.
You mentioned like conferences.
Do you have any favorite conference out there? Like any, I don't know, like conference that you really got a lot of value,
not from the networking part and like all these things,
but also like from, you know, like the content that was created
and how it was delivered as part of the conference.
On the data side, not a ton.
I am really jealous of some of the like software engineering conferences
that I see out there, like PyCon, for example, has always been really good.
RStudio used to have a good conference a few years back. I think less so now.
It's become much more ecosystem, platform focused.
I think all conferences kind of end up that way at some point.
If they're run by a vendor, though, maybe that's just inevitable.
NormConf, I have to give a shout out to that one.
That looks really good.
By Vicky Boykis.
It's up in a few weeks actually, and it's free.
It's online, like 18 hours long.
You should definitely check that one out.
A lot of good people are talking at that one.
That's cool.
Well, some great resources.
Cool.
And okay.
Next, my next question is about, you mentioned when you were talking with Eric about
starting out and what the data stack looks like for your company
depending on the scale you are at. There is, or at least it feels like there is,
some kind of change in the mindset of people in the industry right now: instead of going and using systems that scale out, to try and build
systems that scale up, right?
And I think like a very good example of that is DuckDB, right?
Something that you can run locally, it's going to fry your CPU because it's going to use like every last register of the last
core in there, like to process data.
And people are interested in that.
What's your, like, what's your take on that?
Like, how do you feel about it?
I'm still trying to figure it out, I think, is my take.
I really like DuckDB.
I use it locally a lot
but
to me it's like SQLite
like
a great tool
for the right context
but
you rarely will deploy an application using SQLite.
You usually move to Postgres, right?
Or MySQL.
But it could be great to have SQLite for your test cases because it'll run faster.
You don't have to set up infrastructure.
Like that's fine.
DuckDB to me feels like it's either middleware within someone else's application stack or a great tool to use locally because you don't want to move data around.
That totally makes sense.
But if your production data is in your cloud data warehouse, I don't know how bringing it locally to your laptop is going to solve any of that.
It's a tough argument to make.
I don't know, but we'll see.
Yeah, I haven't seen the use case for it, but
that doesn't mean it's not out there.
Okay.
So how do you typically use it yourself?
Like, for example, me, I mean, okay, whenever I need to do
something quick with data, I prefer to do it in SQL, obviously,
and I don't want to load the data somewhere first, you know, that kind of stuff.
Yeah.
Like that could be like great, right.
And you can do that like with quite a lot of data also.
It's like, it can scale like pretty well, like on your laptop.
But how do you use it?
What's some interesting use cases for you?
I use it the exact same way.
So I'm working on a little side project
to do entity resolution
and benchmarking different methods using it.
And so DuckDB is great for that
because I have a couple files on my laptop.
I want to read them in.
I don't want to spin up Postgres.
Perfect.
I'll load it into DuckDB.
I can run some SQL,
do some aggregation on top of it.
That works pretty well.
That's really the only use case I have.
But I've heard of other people
doing more important things with it.
So I've heard of people using it
as part of an ETL pipeline,
where they deploy it to production
to speed up some type of transformation they're doing.
And so, I mean, that kind of makes sense, right?
It's just another tool in your toolbox.
Yeah.
But for me, it's really been, I guess, just local development and playing around and not having to spin up more infrastructure to play with things.
Yeah.
Why do you think that it has created so much noise in the market?
The reason I'm asking you is because recently I was thinking, because I
downloaded ClickHouse and played around with ClickHouse local, and to
be honest, ClickHouse doesn't have that much of a different experience
for working with local data, right?
Like it's a single binary, you download it, it has a lot of tooling,
like amazing support for importing data and querying the data.
Amazing performance too.
Like you can do similar things as you do with DuckDB, but okay.
ClickHouse has been known for different kind of use cases.
I've never heard anyone say, let me download it to do something local, right?
But so why DuckDB?
What did they do so right
to create this kind of perception in the industry?
I have no idea, to be honest.
And I'm always scared to speculate because they'll come after me.
I don't know. I mean, people
love it. So they must be doing something right.
Like, it's
a genuinely useful tool.
Mode uses it.
Companies are using it in their production application
as part of middleware. That totally
makes sense to me.
It's nice having
a way to read
a bunch of CSV and Parquet files
on your computer.
That was traditionally a little bit harder to do.
But it's fab.
So, I mean, it's great.
I don't know why it became
so popular and so loud.
Yeah, I don't know.
It just took the world by storm.
I can't speculate on why, but I'm happy for that.
Okay.
Which brings me like to my last question before I give the mic back to Eric.
Marketing and content around these technologies, right?
There's a lot of education that needs to happen.
Like, when you educate people how to use tools.
But maybe, I don't know, even with DuckDB, probably they did something right
with distribution of the technology, which always includes marketing there somehow.
Maybe one day we'll learn what's the magic there.
But you've also worked at Hightouch, right?
And at Hightouch, again, you were part of a team and a product
that was new in the industry. Reverse ETL was something new at that point.
So based on your experience, like what are like some really good tools for
reaching out to people out there and
helping them to understand the value of the tools and become better data engineers
or data scientists or whatever, when they have to work with data?
Yeah.
I don't know if it's a tool but
I mean the way I always look at it is
like
where are the people
who you think would benefit from your product
and then
if you truly believe
that your product has value
how do you teach them about that value
at the end of the day
that's all I think marketing is.
And when viewed from that lens,
it makes it easier to think of
what are the possible steps you could do.
So I can walk through how I thought about it at Hightouch.
At Hightouch, I knew what the product did.
It helped move data, for example,
from your warehouse to Salesforce. That was one
very simple use case, right?
And I knew who benefited from that. It was
people like me who used to have to write this code
manually, usually through the Python
integration. And so
having a good understanding
of
what the value is and who it's for,
marketing becomes very
easy.
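The kind of hand-written glue code described above can be sketched like this. This is a hypothetical, simplified version: `sync_rows` and the stand-in transport are invented names, and a real integration would call something like Salesforce's bulk API instead of a local function.

```python
from typing import Callable, Dict, Iterable, List

def sync_rows(
    rows: Iterable[Dict],
    send_batch: Callable[[List[Dict]], None],
    batch_size: int = 200,
) -> int:
    """Push warehouse rows to a destination in batches -- the glue
    code reverse ETL tools replaced. `send_batch` would wrap a real
    API call in practice."""
    sent, batch = 0, []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            send_batch(batch)
            sent += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        send_batch(batch)
        sent += len(batch)
    return sent

# Stand-in transport: collect batches instead of calling an API.
batches = []
n = sync_rows(({"id": i} for i in range(450)), batches.append, batch_size=200)
print(n, [len(b) for b in batches])  # 450 [200, 200, 50]
```

Every team that needed this ended up maintaining some variant of it, retries and all, which is the pain point the product addressed.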
It's okay, well,
if people like me would benefit from this,
how do I reach them?
Well, do they know what reverse ETL is?
And in the early days, the answer was no.
So we had to educate.
And so a lot of my work was spent around educating people on what it is,
what the value is, what it means,
why it's different from X, Y, and Z.
Once we kind of had a good bit of understanding
of what that was, so the next question is,
how do we make people aware of our company, Hightouch, right?
And that's a little bit harder.
And there's no shortcut.
It's just, to me, just like constantly creating content
to bring people to our website that data people
would find genuinely useful. And so I would just write about things I was curious about for the most part,
or things I had learned. I think those two things are great places to start. And so I would create
content on things like the difference between Airflow, Dagster, and Prefect. Something I'd always wondered about, and if you go and Google it, you won't find much. You'll find,
you know, marketing pieces that talk about them a little bit, but no one had actually tried all
three and written about it. So that's what I did. I downloaded all three and wrote about it.
And that became a great source of traffic to our website because it was the only thing that
had covered all those things.
And so that's usually the way I think about it.
It's like, how do I generate something useful for people that I have a unique perspective on that hasn't been done before?
If you can do that, then hopefully that will bring people to your website.
Yeah, makes a lot of sense.
Eric, the microphone is yours.
Excited? Oh, I'm so excited. I am so excited. Oh yeah.
Was that a... that's me or Pedram? So Pedram, you're now consulting, you know, which is relatively recent.
And you came out of doing sort of data and marketing at, most recently, a venture-backed data company, right? So, you know, the marketing vortex in the data world, in venture-backed companies for data vendors, is,
you know, it's pretty intense. I mean, that's what I live in every day.
But now you're consulting, right? So you have companies that bring you problems and you need to figure out
the best way to solve them. Have you had any changes of perspective going from the world of
venture-backed data vendor to a company's paying me to help them solve pretty specific problems?
I think I quickly realized how far ahead we all are of our customers.
When I started to talk to them.
The modern data stack, the number of companies out there that are actually
implementing it is very small.
The number of companies who know about it are small.
The number of companies who know about DBT is actually quite small.
You talk to most of these companies, they don't even have data teams at the time.
Now, maybe that's selection bias because you're talking to me.
But a lot of companies out there don't have a data team.
They have people who know what they want and have found ways to get it,
for better or for worse, often for worse, which is, again, why they're talking to me.
So I think we have been in a bubble.
I certainly have been in a bubble over the past couple of years.
And I think a lot of our vendors are kind of guilty of that.
Pushing a system
that's actually pretty complex
out to people.
And not to say that it's not useful
or good. It's the same one I will
implement a lot of the time.
I think we often forget
how far ahead we are
and where we need to start
a conversation with people like we probably
can't talk to people about the merits of, like, data diffing within a data warehouse
when they don't even know that they need a data warehouse, right? So a lot of my work is really going back to basics and trying to figure out, like, how do we
teach people what this data stack is all about without confusing them. That's already hard enough.
And then probably the harder thing is to show them what the actual value is of doing all this work.
because if at the end of the day,
you put in all this work and all they get is a report,
well, they were already getting that
before they started talking to you.
And so hopefully you can say,
well, what you were doing before served this need.
But let's talk about
not just doing what you were doing before,
but all the things that we can start to do
now that your data is centralized
we can bring in data from three or four
different systems
we can start to be really
nuanced about how we look at attribution
and we can look at
all the way down to your
product level to see
where different channels interact
with each other when people
want to activate or
make revenue.
That's when I think people can start to
kind of see what's actually possible
with data. What they
come to you is, hey, I need to know how many
customers I have. And if you
just sort of stop the conversation there and give them that with the
data warehouse, it's like, great.
Why did I pay this much
money for this?
Right. I could have kept doing that for what
you charged me. But if you can start to
bring the focus around, like the whole point of this is to
actually bring data in from different systems and start answering questions that
you weren't able to answer before, and they're actually going to give you
insight into your business, then I think you can start to sell them on this idea. And that's
where most customers are. They're nowhere near where we are today, where we're talking about
version control, data modeling, observability, and all this stuff. No one has any clue what any of
that stuff means. Okay, last question, and I would love for you to speak to our listeners who are... and of course with podcast analytics it's really difficult to
know how large this subset is. I already have millions of viewers.
Millions and millions. How do you break out of that bubble? If you are working in a context,
I'll try to broaden it.
If you're working in a context
where you're sort of in the data echo chamber
and that's your job day to day,
how do you break out of the bubble?
That's a good question.
Get off Twitter and get off Slack
and go meet real companies.
I don't know
yeah
like how do you talk to people who aren't even talking to you
I think it's a tough thing to do
I don't know
talk to people who aren't in data
as much as you can
when you go outside
talk to people
and ask them
what questions they're asking of data,
how they're solving the same problems
that you're solving.
Because at the end of the day,
these people are doing this stuff.
Like I've seen people do marketing
attribution in Salesforce.
I have no idea how it's done,
but I know it's a pretty common thing
that people do.
And it's like,
well, they don't have a warehouse.
How are they doing this stuff?
So the more you can talk
to people outside of
the data world,
the better I think it will be.
All of us.
Yeah.
Such, such sage wisdom, Pedram.
This has been
a really wonderful show.
It's flown by
and we'd love to have you
back on soon.
This was great.
Happy to come back anytime.
My takeaway, Kostas,
which has been a recurring theme
throughout the show,
even from some of the very,
very early episodes,
is that generally keeping it simple
is the best policy.
And if you hear, you know,
Pedram, who is probably more than anyone,
Pedram, who is probably more than anyone,
you know, familiar with the most cutting edge tooling
in the data space,
you know, even, you know,
stuff that very small
startup companies are building.
You know, he picked a couple
of core pieces of technology
and said, this is what you need. And when you start to break it with scale, then you've hit the jackpot, you know? And so when you talk to practitioners, I just love how simple it is for them. They don't lean on marketing terminology to describe technology. They just talk about the utility of various things that
are required of them in their job. And it really is pretty simple. And so I guess,
you know, per some of the conversation that we had about working with them, it can get really tricky to navigate all the marketing terminology.
And I'm of course, someone who's creating that problem
actively in the data space.
I love the simplicity.
Yeah.
I think Pedram has a very pragmatic approach to things, which is,
first of all, it's like super valuable for someone who's doing
his job of being a consultant, right?
Because at the end, if you are a consultant, one of the biggest values that you can deliver
to your customer is to go and guide them and help them focus on what really matters
for them and make the right choices.
Right.
So it's pretty difficult to avoid this FOMO hype, you know, it's
everywhere, like everyone being a cheerleader of something. So it's, I don't
know, I really enjoyed the conversation with him because it was very, you know,
down to earth and very pragmatic.
And so yeah, like he talked about like the real problems and when you have
the problems and when you don't have the problem, so I really enjoyed the
conversation with him and he should be writing more and communicating this style of talking about
what's going on in the industry because it's super useful and it's missing. I think we need
more voices like that. I agree. All right. Well, thanks for tuning in. Subscribe if you haven't,
tell a friend and we will catch you on the next one. We hope you enjoyed this episode of the
Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack,
the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.