The Data Stack Show - 124: Pragmatism About Data Stacks with Pedram Navid of West Marin Data

Episode Date: February 1, 2023

Highlights from this week's conversation include:
- Pedram's journey into the world of data (4:05)
- What should the data stack at an early-stage startup look like? (9:53)
- New ideas surrounding access control for data (24:45)
- What can data teams learn about complexity from software engineering? (30:55)
- Scaling up instead of scaling out in processing data (37:40)
- Why DuckDB is making so much noise in the market (41:06)
- Final thoughts and takeaways (53:25)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. If you have followed LinkedIn or Substack influencers in the data space, you've probably come across Pedram Navid. He is a really smart guy, has written some really helpful articles on lots of data-related things. I actually found his content researching several
Starting point is 00:00:46 topics before meeting him. And we got the chance to meet him and invited him on the show. And I'm super excited to chat with him, Kostas. He started out in finance and the financial world with data, and then was at several startups in the Bay Area, most recently High Touch, and now he's running his own consultancy. So where am I going to start with my questions? That's a difficult part. I think one thing that I do really want to dig into with him, which we haven't talked a ton about on the show, is data stacks at early stage companies. You know, we've
Starting point is 00:01:26 talked with a lot of startup founders who have created startups, especially in the data space, obviously. We've talked with a lot of data practitioners at various sizes of companies. And I don't know if we've talked with many data practitioners who have done this at multiple very early stage startup companies in the SaaS space. And so I think that's a really helpful thing to think through for me and for a lot of our listeners by getting an opinion from someone who's done this multiple times over about what do you actually need in that stage as a company in terms of your data stack? And then the other question I want to ask is, are you thinking about scale? You know, because generally startups need to become hyper growth, or at least that's
Starting point is 00:02:11 the plan. So those are my two big questions. What does the data stack look like? And then how do you think about building it in a way that can scale, you know, if you hit the jackpot? Yeah, for me, I want to start with learning from him, like what's the difference between working in a very hard and regulated industry, like finance, where he initially was working at and then going and working like in a series A startup.
Starting point is 00:02:46 Yeah, that's huge. And also what is helpful to keep from the work on a big and probably bureaucratic like organization, when you go and work in a chaotic environment, like a series A pre-post-product market, but pre-growth, let's say, stage combining where things are like changing constantly, but it would be awesome to hear from him, like what he found useful from his experience in doing that. That's one thing. And the other thing is that, Pedrom is like exposed to all the new things
Starting point is 00:03:28 that are happening in these industries. Like to hear from him, like what's his take and opinion on some technologies like Dr. B, for example. Oh yeah. This whole thing of, okay, let's scale out or scale out what we should do with infrastructure and how we should process our data. So yeah, let's start with him. Let's do it.
Starting point is 00:03:58 Pedram, welcome to the Data Sack Show. It's been a long time coming. Thanks. Glad to be here. All right. Well, we'll start where we always do. Give us your background, especially the parts about how you got into data in the first place. It's an old story now. I started at a bank a long, long time ago, and we had data coming in from a vendor through PowerPoint slides. And it had two columns on it, one for this month and one for last month.
Starting point is 00:04:30 And that was all the data we had. Yeah. And every month they would send us a new PowerPoint slide, replacing one column with the other. And so I think my boss asked, is there a way where we can kind of figure out what's going on month by month over a trend? And so I would hand copy this data from PowerPoint to Excel. One thing led to another, and I built a dashboard.
Starting point is 00:04:54 Eventually, I learned VBA because I got tired of doing things manually. And that was really the gateway drug into the rest of my career. Python, R, data science, all that happened through the span of 12 years. Then I moved to the Bay Area and I thought, you know, enough banking, let's jump into startup life.
Starting point is 00:05:20 Worked at a few different startups and the data scientists eventually the data engineer because i thought data science just took too long to get results and one thing led to another and most recently i ended up at high touch as their head of data doing data marketing and product oh so many questions okay one thing from the early part of the story, were you, so it sounds like you sort of went through your learnings, you know, sort of, you know, VBA, you know, through to Python and then, you know, other subsequent, you know, subsequent languages and methodologies there. Were you doing all that at a bank? And if so, were you sort of teaching yourself and bringing that technology into the bank? And the reason I
Starting point is 00:06:08 ask is, you know, traditionally we think about banks as sort of being resistant to sort of technological change, especially if they're getting data, you know, delivered in PowerPoint. So we'd love to hear a little bit more about that journey and how you brought those technologies in. I mean, what was that like? It was difficult to say the least. So BBA was allowed because Microsoft Excel was allowed. And so you were allowed to use that. I learned BBA on my own, painfully, slowly.
Starting point is 00:06:41 I think as most people learn it, I doubt many people go to school for BBA. So that was just the beginning. And then as I was searching, I found about this thing called Python. And I probably wasn't supposed to download it to my bank laptop, but I did. And so that helped a little bit with the automation. And again, it was all really self-driven, self-taught,
Starting point is 00:07:07 just trying to solve problems I didn't want to do myself. I was like purely motivated by laziness. And I mean, I think to this day, that's still the driving factor behind what I do. I love it. As we move towards things like R to actually do real business modeling and analysis, that's when I got the most resistance.
Starting point is 00:07:30 We were doing like compensation modeling for 12,000 employees in Microsoft Excel. And we were passing down this one spreadsheet back and forth. And... FTP? No, email. Oh, man.
Starting point is 00:07:48 Maybe SharePoint if you were lucky so as it's moving through heads everyone's changing these models they're dragging and dropping and stuff is changing and things are breaking that no one knows obviously right and six months go by you rolled out your competition model and you got to figure out why the numbers aren't right. And you go back and you find that some guy accidentally filled the wrong column in spreadsheet. Or even worse, the executive would change their mind on what the package should look like every
Starting point is 00:08:15 five minutes. Then you go back and update 50 different apps, try to recalculate things. So I thought there must be a better way. I learned about this thing called R. I was learning about data science on the side. And so I thought, what if I put all this logic into code instead of into workbook and try to automate some of this work?
Starting point is 00:08:38 Arvifi was very upset. He did not like it. He thought R was a black box. And I realized what he was mad about wasn't using R. He just wanted a spreadsheet. So I would do all the work in R and then just output it to a spreadsheet
Starting point is 00:08:54 at the end of the day and give it to him. You can still have that. And then everything was fine. It was always a work of appeasing stakeholders that never end. Yeah. Yeah, that's such a good insight. It's always a work of appeasing stakeholders that never end. Yeah. Yeah, that's such a good insight. It's funny to hear the concept of R being a black box
Starting point is 00:09:12 because, I mean, nothing could be further from the truth. Completely open source. Yeah, perception is reality. That's super helpful. Okay, well, let's... So then let's fast forward to move to the Bay Area, you were involved in multiple startups, most recently High Touch, and did a bunch of data stuff at early stage startups. So, you know, in our chat beforehand,
Starting point is 00:09:39 you were saying, you know, sort of, you know, seed to series A, you know, stage of those companies. And one thing I'm really interested in that I've wanted to ask you for a while is what your take is on what the data stack at an early stage startup should look like. And there are a couple of motivating factors. One, I'm selfishly interested because, you know, I'm involved with that every day. But it's not something we've talked about on the show a ton. You know, we've talked with people running startups, running data startups.
Starting point is 00:10:20 You know, we've talked with enterprises, but we haven't really honed in on, okay, you're a really early stage company. You know, what does your data stack look like? And then, well, I'll follow up with part B of the question. But yeah, so you're series A, you know, sort of late seed stage, and you're running data at that company. What do you actually need?
Starting point is 00:10:42 And I can't just say it depends, right? Well, just explain what the dependencies are. Yeah, let's say, all right. My motivating factor whenever I do things is I need something that I don't need to babysit, and I'm willing to trade off costs for engineering time because I'm just one person and again I'm very lazy but I'm also
Starting point is 00:11:11 probably busy doing other things I need something that just works and I think in those early stage startups your data is usually not very big yep right and I might say it'd be blasphemy but I might argue that your data is usually not very big. Yep. Right. And I might say it'd be blasphemy, but I might argue that your data is not that valuable
Starting point is 00:11:29 when you're first starting out. It's good to have. Can you unpack that a little bit more? Yeah. I agree with you, but I think that's really helpful. If the goal of data is to help drive decisions, at an early stage company, you don't have that much data, right?
Starting point is 00:11:53 Because there's not much happening yet. And you probably know every customer you have. And you probably know how you close that deal and where you got it from. So what are you really learning from a really complex data stack right now?
Starting point is 00:12:10 You're not building models. You're not scoring leads. You're not doing marketing attribution. At the end of the day, you're maybe counting revenue and maybe a number of customers. That's really the value when you're first starting out. Now, it's good to start with that stuff
Starting point is 00:12:27 because as those complex questions build over time, having a nice foundation can make it easier to answer those things. But I think we don't need to invest, unless like data is your product, you probably don't need to invest a ton into your data stack in the early days. No, that makes sense. I think, you know, one specific example of that I've experienced multiple times is that
Starting point is 00:12:53 things like multi-touch attribution are extremely powerful, but you actually have to have a pretty huge amount of data and generally a lot of paid programs running in order for a multi-touch attribution model to really be additive in terms of shifting marketing budget right and when you're not spending a ton of money you know you can spend a lot of time developing a model that might be accurate at the end of the day you It's like, well, okay, we're going to move 10 grand from this bucket to this bucket. It's not a huge deal. That's super interesting. Okay. How about scale though? Because in an ideal world, these early stage startups hit hyper growth and scale really quickly. And when that happens, tons of stuff breaks across the company, you know, which is just the way that things go. And,
Starting point is 00:13:49 you know, people have to fix all sorts of stuff, you know, from org charts to data stack. So how do you think about that aspect of it? Right? Like early on, you want something that just works. It's a small team. Does, do the tools available scale? How do you think something that just works. It's a small team. Do the tools available scale? How do you think about that side of it? That is a really good question. So if you look up, let's go through the whole stack. On the InGeft side, there's a few options. There's your FibedCran, there's your Airbytes, and so on.
Starting point is 00:14:27 And those, I mean, that scales as long as your wallets are deep. Right? So, that's probably fine when you're first starting out because you don't want to invest too heavily into that. It's hard to anyway.
Starting point is 00:14:41 So, that is something you can always take down the road and decide, do we want to keep using this or should we build something internally to help reduce cost? You can pay to push that decision off. Exactly. Yeah. Until it's too painful and then you can deal with it. On the data warehouse side,
Starting point is 00:14:58 you're probably not going to go wrong with Snowflake or BigQuery. You probably don't need Databricks, I would assume. And you can't see a good reason to use Redshift anymore. Yeah, probably. I mean, I doubt you'll hit scaling limits with Snowflake.
Starting point is 00:15:15 Again, BigQuery is a bit more questionable. But again, you really got to be pushing numbers to be hitting problems there. And what else do you need to do? It's DBT for modeling, which sure, you'll probably hit, again, limits there. But if you're at the scale where you're hurting yourself
Starting point is 00:15:34 through what's capable through that stack, then you've got really good problems. You must have a ton of data and a ton of business. And you can just throw engineers at it at that point. So it would welcome that issue. If the stock I built today doesn't really scale, then that's great. Let's hire more people and fix it.
Starting point is 00:15:54 Yep. 100%. 100%. Yeah. I think I'm thinking about some of our, you know, large customers and yeah, you have to be at a pretty big scale to sort of, you know, large customers. And yeah, you have to be at a pretty big scale to sort of, you know,
Starting point is 00:16:08 I'm thinking about ones that have migrated off, you know, Redshift into, you know, almost going fully onto like data lake infrastructure, right? But you're talking about like unbelievable, unbelievable scale when you sort of outpace like, you know, basic warehouse stuff, which is super interesting.
Starting point is 00:16:31 You could probably get away with Postgres if you really wanted to, the data warehouse in the early days, right? That's probably what you will hit limits on. So that's where I think maybe just go with Snowflake and hope you don't. But if you're cost conscious and you just wanted something cheap and simple, Postgres is pretty strong and powerful. Yeah, super interesting. Okay, other than the tools that you just mentioned, and then I'll pass the mic over to Kostas because, of course, the rhythm of the show is that I monopolize and then he does. What are the nice- haves for you?
Starting point is 00:17:09 Right, so I understand like the core infrastructure. So you have ingest, you have warehousing, you have a modeling layer, you know, in the early stage, that's all you need. Are there any sort of, okay, you have a larger budget than you expected. So I'm going to just, you know, I'm going to do some quality
Starting point is 00:17:25 of life or some, do you have any preferences around things that you would add to that stack? I don't believe in quality of life for the data team. I just haven't seen one that like increases my quality of life enough to justify the expense. For me, it's much more like tactical, like planning out for the future. So I've got my basic data stack. Probably going to need BI, right? So maybe we can start with... I was going to ask about that if you didn't mention it. You probably will need BI at some point.
Starting point is 00:17:57 Maybe you start with a superset and it's pretty cheap and free. Maybe you decide you need a semantic layer because the demands on your team are growing high and then you move to a looker or a light down. That's all. These are all valid places to be.
Starting point is 00:18:13 There's Metabase. There's nothing wrong with any of those. I think those are all highly dependent on your team. I'm going to call that a nice to have. You probably need it
Starting point is 00:18:20 at some point. It's just like, when is the right time? Product Analytics is another one. So getting data from RutterStack into Amplitude or any of the other ones out there. Feature adoption and sort of understanding.
Starting point is 00:18:36 Yeah, innovation, growth, all that kind of funnel stuff. That, I mean, that's usually driven by demand, not by, you just want to do for fun, right? So if your marketing team and your product teams are asking for this stuff, you got to find a solution. And the solution usually isn't writing SQL queries for funnels because nobody wants or knows how to do that. Instead, you give them something self-serve. That's kind of how I look at it. Everything else just seems, I don't know.
Starting point is 00:19:06 I need something motivating for me to go get it there's like data quality is always one people talk about there's catalog there's metadata those all seem nice to have but would I go out and spend my marketing or my data dollars on it
Starting point is 00:19:22 not unless I had a pressing need yeah would you throw sort of orchestration tools into that bucket i mean i think about the cataloging and orchestration again we're talking about early stage startups here we're not talking about the validity of these tools in general right because at scale like obviously data teams are running all these things, but the cataloging piece and the orchestration piece, I sort of see as really a next level where you have a growing data team and you have a level of complexity where, you know, those have a lot more appeal. But in the early stages, like they actually add more complexity in some ways than quality of life. 100%. I mean, at the end of the day,
Starting point is 00:20:07 how big is your data team, right? Do you really need a catalog when you're the one building every table? How long does it take, maybe? Right? So, I mean, we can build a catalog and pretend that we'll put it in front of all our stakeholders
Starting point is 00:20:19 and they'll go look at it. They never do. They never will. That's just not a thing that they're ever going to do. Data catalog is for the data team. At the end of the day, if I'm the data team, I don't really need one.
Starting point is 00:20:32 Problems of scale are what those tools tend to address. In the early days, those aren't your problems. Yeah, super interesting. Okay, actually, one more question in that same train of thought. Sorry, Kostas. Have you learned any lessons around when to introduce or even how to introduce tooling? Because I think you make a really interesting point
Starting point is 00:20:55 on something like a cataloging tool where you can take something that inherently, in and itself is very useful, can be extremely useful to teams to drive data discovery, etc., like especially at scale. without context in a way that really paints those tools in a bad light. Or even, I mean, you could even think about in some cases, like a tool like dbt, which, you know, feels ubiquitous to us in the industry, right. But can seem redundant to someone whose context is, well, just write SQL right on your warehouse. You know, that, that seems redundant, right. Have you learned any lessons on like when and how to introduce tooling in a way that drives wider adoption? If it's something that you have a lot of conviction about. Not talking about the quality of life stuff, but something you have conviction about.
Starting point is 00:21:56 I don't know if I should be honest. I think the tooling I tend to introduce is always driven by demand at the end of the day. And so when I look at tools that are more cross-functional, no one cares about the tools I use internally. I mean, why would they? It's like caring whether or not someone's using Svelte. It doesn't matter what the engineering team uses. That's a concern for them.
Starting point is 00:22:19 Most of the concerns for the data team are really data team concerns. No one cares if you're using PPT or not, or Snowflake or BigQuery. Those are your sort of issues. I think where it becomes tricky is each stakeholder tooling. So your BI layer is really that interface
Starting point is 00:22:35 between your team and other teams. Cataloging is similar. It's that interface between your team and other teams. Although I would argue cataloging is really most useful within data teams. So that's really the way I look at it. And if it's something that external focus, like the amplitudes, like the axis,
Starting point is 00:22:55 and looker, and the light dashes, then it's definitely a mutual discussion about what are your needs? What types of workflows are you going to use, and let's try all this POC together. It will never be me just making a decision for everybody, but I want stakeholders involved so that they have
Starting point is 00:23:14 buy-in, and they can see the value of the decisions we're making. At the end of the day, they'll be consuming this far more than I will, so let's make sure that they do. And for the most part that's worked, they tend to love the tools that we fit together. That's great.
Starting point is 00:23:31 Well said. Wonderful advice. All right, Costas. Costas Pintasilaouiheva Thank you, Eric. Thank you for giving me the microphone. So, Pedram, I have a question. It's been like, I don't know, like five, 10 years now that there is some kind of like explosion in terms of, I'm calling it like innovation or new products or
Starting point is 00:23:53 whatever, like when it comes to working with data, right, I would have a modern data stack, if you just take like a map of the modern data stack, it's all the different like products that it's sold. A lot, right? And you will hear about quality, about storage, modeling, semantic layers. I don't know, meta semantic layers, whatever. There is one thing though that I don't hear that much and maybe it's my fault, but I love your thoughts on that because you are also coming, you came from a very regulated industry, banking, right?
Starting point is 00:24:35 And you moved into like series A companies where obviously like things are like much more scrappy when it comes like to how we regulate access around data. But what's going on with access control over the data that we have? Like, how do we control what's going on with this data or who has access to that? Or how we share it? How do we process it? Or when someone comes and says, oh, I have the right to be forgotten or whatever, going like every whatever Excel reference, like the reference in an Excel document you have in your company, you have to remove me.
Starting point is 00:25:13 So what have you seen there? What's your opinion? And is it my fault that I don't hear that much about that? David Pérez- It's definitely not your fault. I would blame the marketers on this one again. So they're not doing a great enough job of educating you. it's definitely not your fault. I would blame the marketers on this one again. They're not doing a great enough job of educating you. There are two companies
Starting point is 00:25:30 I know of in this space, so it is not very big. Immuta, I think, is one. And I just talked to one called Jetty today, actually, about this. And they're both trying to approach this problem of act as control and about this. They're both trying to approach this, I guess, problem
Starting point is 00:25:45 of access control and visibility into who has access to what. And the problem is there's just so many tools that you have to regulate access on. If you think of, you have your data in Snowflake and it goes into Looker.
Starting point is 00:26:02 Just those two tools. That's probably two completely different sets of ways of managing permissions. And it's not enough to manage it just on Snowflake and hope the rest works because of the way that it's going to work. You might have access to finance data in Looker that you can
Starting point is 00:26:18 expect. So getting that right, I think it's really hard. And I don't think many startups are actually thinking about it or worried about it. I think it's really hard and i don't think many startups are actually thinking about it or worried about it i think it's pretty open in the early days of who has access to data and people tend to lock things down not because of the regulatory side but more because people aren't using the data correctly at least in my experience i tend to default to having things open initially. And then that always backfires
Starting point is 00:26:45 because everybody's going in and querying data, coming up with answers, and they're always wrong. And they're asking you to check their queries for them. You're like, ah, wait a minute. No, no one gets access anymore. That's the type of access control that we have with startups, really. Banking is totally different.
Starting point is 00:27:00 Obviously, it's very regulated to an incredible degree where it took i think we had a typo on a field in a dashboard and i requested it to be fixed and it was a three to four week estimate because it had to go through like a different team and you had to pay with brown dollars and it come back and get approved and all this stuff, it's like all these layers, just to fix the typos. So I never want to work in that environment again.
Starting point is 00:27:34 But I, it's probably something we could learn about, you know, maybe hearing a little bit more about who has access to what and how we manage permissions across the data stack for sure. Yeah. Yeah. I think you made those like a very good point. It's not just about, I mean, the data only, it's the overall resources around data that you have to govern somehow.
Starting point is 00:27:56 And it's not only security or like privacy, it's also like how easily things can turn into a mess. Like I've seen, like when you, for example, you have a big engineering team and you give access like to everyone on the Snowflake instance, like the things that will happen there are not good. Eric knows, Eric knows very well because I think one of the results of this policy was having a database named after his name on Snowflake. That bad boy is still in production. Really?
Starting point is 00:28:34 So where did he live? Eric DB lives on. Eric DB, Eric DB will live on. I will give it up when BetterSack IP is. But yes, Eric DB is live on. I will give it up when better sack IPS, but yes, Eric DB still runs production dashboards. Henry Suryawirawan, Well, yeah, because like after a while, like when it just starts having like many people getting serve likes from these resources, it's not that easy to decommission it.
Starting point is 00:29:01 Like it's, it's a Definitely. It's a nice... I think it's expensive because they... Not everyone knows how Snowflake charges you. Yeah. If you're doing a small query every five seconds, well, the data's small. How much would it cost?
Starting point is 00:29:17 Well, it costs $20,000 over a year. So, I think people will care about governance eventually at some point and it's just like how many times have you gotten burned before you do
Starting point is 00:29:33 yeah I didn't really care about governance at my first startup but I certainly cared about it at my last one you just it's easy to see how things go wrong people People make mistakes. And no data team wants to be faced
Starting point is 00:29:48 with another question about why two numbers don't match. Because this guy over there went and queried something and got what they thought was the right number. And now it's your job
Starting point is 00:29:58 to go and unwind this 15-page query that they wrote to figure out why these two numbers are different. That's a very, very good point.
Starting point is 00:30:05 And it brings me like to like my next question. So, okay. Resource management in general, and like in a pretty complex environment, it's not anything new in engineering, right? It's just think about someone with like a necessary or like a DevOps in a medium-sized like startup doing a WS. Like the complexity is just like crazy over there. That's why we have products like Hasek or Terraform, Ponomi like all these things out there.
Starting point is 00:30:37 So software engineering has like many years now that is dealing with complexity. And complexity is part of productization, not just like complexity because the problem is complex at its root as a science problem. There's a lot of like discussion about bringing, let's say best practices from software engineering into the data space. Good example of that is dbt, for example, right? Like how it enables workflows and best practices from software engineering. Where do we stand with that?
Starting point is 00:31:16 Do you think there's like more that like data teams can learn from software engineering? Is like data teams at the end should become just engineering, software engineering teams and just for the same things? Or there's some kind of like space or new priorities there that are like, you know, applicable only for data teams? It is a really good question. Certainly DBT has helped, I think.
Starting point is 00:31:45 I remember the old days where data teams, and many still do this, your SQL queries were saved in a text file on your desktop, and there was no version control. You just had to ask someone how they ran something, and they would send it to you by email, right? So we've come a long way, I would say, especially on the data modeling transformation side.
Starting point is 00:32:09 A lot of the tools in the ecosystem are also moving towards that model right there. Building in things like version control and declarative, like YAML configuration, or how you set these things up. I think that's all great, but I do
Starting point is 00:32:28 wonder if data teams themselves are sometimes missing the bigger picture of how these things work together. If I think back to the older data engineering types of people, they tended to come in through more technical backgrounds, right? They came in through computer science or software engineering, and they learned about all the trade-offs there were between performance and how data moves between systems and what it means for data to use a cache or to go to your drive or disk or to go through the network and what all those things meant for response tasks that type of stuff i think most engineers kind of understand and know well and then all the associated stuff that
Starting point is 00:33:22 comes around it with like deploying do containers, Kubernetes and all this. It was kind of like they learned this stuff because they had to. And I think Noe has been really helpful. I do think there's a lot of people coming up data outside of that. And maybe they haven't had exposure to that side of the world. And I do see it sometimes biting us a little bit when we're starting to move data into what is really a production-ized setting without some of that understanding of what software engineers have learned over the years. That's maybe our tooling is good, but I don't think the conversation about how
Starting point is 00:34:04 we think about moving that stuff around has really happened yet. What does it mean to Corey, Dan, and Snowflake? How does that actually work? And what does it mean to transfer data outside of regions? And what does that look like in COGS?
Starting point is 00:34:22 And that type of thing. So I think that type of stuff we still need to maybe do a better job of. It's still early days, but when you look at it from five years ago, we've definitely come a long way. Do you think it's tweaking that is missing or let's say knowledge or best practices? I think the tooling is actually pretty good these days.
Starting point is 00:34:46 It's really best practices, it's knowledge, and I think it's learning from each other. We don't tend to talk too much about this stuff, right? When I look at the talk people do in data, it's sometimes about the tooling itself, but it's really about how we move stuff
Starting point is 00:35:03 into production, or how we thought about different trade-offs in terms of performance characteristics. That type of questioning doesn't come up enough in my mind versus some of the other types of talks we're having right now. Yeah, that's an excellent point. How can we change this? Better conferences, more collective processes. I should be writing more about this stuff too. Like I'm just as guilty as anyone else. It is happening.
Starting point is 00:35:32 People are asking questions. Jacob Madsen, for example, he created the modern data stack in a box not too long ago. And that project, I really see him work with him to build Dr. Kubernetes into it. So if that's something you want to learn more about, you should check out.
Starting point is 00:35:50 It's GitHub repo. It has all that stuff in there. It's still early days, but I mean, hopefully this is part of that conversation too. Yeah, that's great. You mentioned like conferences. Do you have any favorite conference out there? Like any, I don't know, like conference that you really got a lot of value, not from the networking part and like all these things, but also like from, you know, like the content that was created
Starting point is 00:36:19 and how it was delivered as part of the conference. On the data side, not a ton. I am really jealous of some of the like software engineering conferences that I see out there, like PyCon, for example, has always been really good. RStudio used to have a good conference a few years back. I think less so now. It's become much more ecosystem, platform focused. I think all conferences kind of end up that way at some point. If they're run by a vendor, though, maybe that's just inevitable.
Starting point is 00:36:57 Normcom, I have to give a shout out to that one. That looks really good. By Vicky Boykis. I'm up in a few weeks weeks actually, so it's free. It's online, like 18 hours long. We definitely checked that one out. A lot of good people are talking about that one. That's cool.
Starting point is 00:37:14 Well, some great resources. Cool. And okay. Next, my next question is about, you mentioned when you were talking with Eric about starting and what's like the data stack like for, for in your company and like depending on the scale you are at, there is, or at least it feels like there is some kind of change in the mindset of people in the industry right now, instead of going and using systems that scale out, like to try and build like systems that scale up, right?
Starting point is 00:37:54 And I think like a very good example of that is DuckDB, right? Something that you can run locally, it's going to fry your CPU because it's going to use like every last register of the last core in there, like to process data. And people are interested in that. What's your, like, what's your take on that? Like, how do you feel about it? I'm still trying to figure it out, I think, is my take. I really like.tp
Starting point is 00:38:27 I use it locally a lot but to me it's like SQLite like a great tool for the right context but you rarely will deploy an application using SQLite.
Starting point is 00:38:47 You call me move to Postgres, right? Or MySQL. But it could be great to have SQLite for your test cases because it'll run faster. You don't have to set up infrastructure. Like that's fine. StuffedDB to me feels like it's either middleware within someone else's application stack or a great tool to use locally because you don't want to move data around. That totally makes sense. But if your production data isn't in your cloud data warehouse, I don't know how bringing it locally to your laptop is going to solve any of that.
Starting point is 00:39:23 It's a tough argument to make. I don't know, bringing it locally to your laptop is going to solve any of that. It's a tough argument to make. I don't know, but we'll see. Daniel P Leprincea- Yeah, I haven't seen the use case for it, but that doesn't mean it's not out there. Okay. So how do you typically use it yourself? Like, for example, me, like I, I mean, okay, whenever I'm like, need to do something like quick with data and I prefer to do it in SQL obviously.
Starting point is 00:39:45 And I don't want like to load the data, you know, like that kind of stuff. Yeah. Like that could be like great, right. And you can do that like with quite a lot of data also. It's like, it can scale like pretty well, like on your laptop. But how do you use it? What's some interesting use cases for you? I use it the exact same way.
Starting point is 00:40:08 So I'm working on a little side project to do entity resolution and benchmarking different methods using it. And so WPB is great for that because I have a couple files on my laptop. I want to read them in. I don't want to spend up Postgres. Perfect.
Starting point is 00:40:26 I'll load it into WPB. I can run some SQL, do some aggregation on top of it. That works pretty well. That's really the only use case I have. But I've heard of other people doing more important things with it. So I've heard of people using it
Starting point is 00:40:40 as part of an ETL pipeline, but they now deploy it to production to speed up some type of transformation they're doing. And so, I mean, that kind of makes sense, right? It's just another tool in your toolbox. Yeah. But for me, it's really been, I guess, just like local development and playing around and not having to spin up more infrastructure to play the thing. Yeah.
Starting point is 00:41:06 Why do you think that it has created so much noise in the market? The reason I'm asking you is because like recently I was thinking, because I'd have liked to download ClickHouse and around like with big ClickHouse and to be honest, like ClickHouse doesn't have that much of a different experience for working with local data, right? Like it's single binary, you download these, like it has a lot of tooling, like amazing support, like for importing data and like creating the data. Amazing performance too.
Starting point is 00:41:34 Like you can do similar things like as you do with.tb, but okay. ClickHouse has been known for different kind of use cases. I've never heard anyone say, let me download it to do something local, right? But so why DuckDB? What did they do so right? And they create this kind of perception in the industry. I have no idea, to be honest. And I'm always scared to speculate because they'll come after me.
Starting point is 00:42:07 I don't know. I mean, people love it. So they must be doing something right. Like, it's a genuinely useful tool. Mode uses it. Companies are using it in their production application as part of middleware. That totally makes sense to me.
Starting point is 00:42:23 It's nice having a way to read a bunch of CSV and Parquet files on your computer. That was traditionally a little bit harder to do. But it's fab. So, I mean, it's great. I don't know why it became
Starting point is 00:42:40 so popular and so loud. Yeah, I don't know. It just took the world by storm. I can't speculate on why, but I'm happy for that. Henry Suryawirawanacik, Okay. Which brings me like to my last question before I give the mic back to Eric. Marketing and content around these technologies, right? There's a lot of education that needs to happen.
Starting point is 00:43:05 Like, when you educate people how to use tools. But maybe, I don't know, even with DuckDB, probably they did something right with distribution of the technology, which always includes marketing there somehow. Maybe one day we'll learn what's the magic there. But you've been also like in, you've worked at Hytouch, right? And like at Hytouch, again, you were like part of a team and the product that was new in the industry, like reverse ETL was like something like that point. So based on your experience, like what are like some really good tools for
Starting point is 00:43:43 reaching out to people out there and helping them to understand the value of the tools and become better data engineers or data scientists or whatever, when they have to work with data? Yeah. I don't know if it's a tool but I mean the way I always look at it is like where are the people
Starting point is 00:44:12 who you think would benefit from your product and then if you truly believe that your product has value how do you teach them about that value at the end of the day that's all I think marketing is. And when viewed from that lens,
Starting point is 00:44:28 it makes it easier to think of what are the possible steps you could do. So I can walk through how I thought about it at Hightouch. At Hightouch, I knew what the product did. It helped move data, for example, from your warehouse to Salesforce. That was one very simple use case, right? And I knew who benefited from that. It was
Starting point is 00:44:50 people like me who used to have to write this code manually, usually through the Python integration. And so having a good understanding of what the value is and who it's for, marketing becomes very easy.
Starting point is 00:45:05 It's okay, well, if people like me would benefit from this, how do I reach them? Well, do they know what reverse ETL is? And in the early days, the answer was no. So we had to educate. And so a lot of my work was spent around educating people on what it is, what the value is, what it means,
Starting point is 00:45:22 why it's different from X, Y, and Z. Once we kind of had a good bit of understanding of what that was, so the next question is, how do we make people aware of our company, iDutch, right? And that's a little bit harder. And there's no shortcut. It's just, to me, just like constantly creating content to bring people to our website that data people
Starting point is 00:45:47 would find genuinely useful and so i would just write about things i was curious for the most part or things i had learned i think those two things are great places to start and so i would create content on things like the difference between Airflow, Dynastar, and Prefect. Something I've always wondered, and if you go and Google it, you won't find much. You'll find, you know, marketing pieces that talk about them a little bit, but no one's actually tried all three and written about it. So that's what I did. I downloaded them all three and wrote about it. And that became a great source of traffic to our website because it was the only thing that had covered all those things. And so that's usually the way I think about it.
Starting point is 00:46:28 It's like, how do I generate something useful for people that I have a unique perspective on that hasn't been done before? If you can do that, then hopefully that will bring people to your website. Yeah, makes a lot of sense. Eric, the microphone is yours. Excited? Oh, I'm so excited. I am so excited. Oh yeah. Was that a, that's me or bedroom? now consulting, you know, which is relatively recent. And you came out of doing sort of data and marketing at, you know, venture backed, most recently a venture-backed data company, right? So, you know, the marketing vortex, you know, in the data world, you know, in venture-backed companies for data vendors is, you know, it's pretty intense. I mean, that's what I live in every day.
Starting point is 00:47:39 But now you're consulting, right? So you have companies that bring you problems and you need to figure out the best way to solve them. Have you had any changes of perspective going from the world of venture-backed data vendor to a company's paying me to help them solve pretty specific problems? I think I quickly realized how far ahead we all are of our customers. When I started to talk to them. The modern data stack, the number of companies out there that are actually implementing it is very small. The number of companies who there that are actually implementing it is very small. The number of companies who know about it are small.
Starting point is 00:48:27 The number of companies who know about DBT is actually quite small. You talk to most of these companies, they don't even have data teams at the time. Now, maybe that's selection bias because you're talking to me. But a lot of companies out there don't have a data team. They have people who know what they want and have found ways to get it, for better or for worse, often for worse, which is, again, why they're talking to me. So I think we have been in a bubble. I certainly have been in a bubble over the past couple of years.
Starting point is 00:49:02 And I think a lot of our spenders are kind of guilty of that. Pushing a system that's actually pretty complex out to people. And not to say that it's not useful or good. It's the same one I will implement a lot of the time. I think we often forget
Starting point is 00:49:20 how far ahead we are and where we need to start a conversation with people like we probably can't talk to people about the merits of like data dipping within a data warehouse when they don't even know that they need a data warehouse right so a lot of my work is really going back to basics and trying to figure out like how do we teach people what this data stack is all about without confusing them that's already hard enough and then probably the harder thing is to show them what the actual value is of doing all this work because if at the end of the day,
Starting point is 00:50:07 you put in all this work and all they get is a report, well, they were already getting that before they started talking to you. And so hopefully you can say, well, what you were doing before served this need. But let's talk about not just doing what you were doing before, but all the things that we can start to do
Starting point is 00:50:26 now that your data is centralized we can bring in data from three or four different systems we can start to be really nuanced about how we look at attribution and we can look at all the way down to your product level to see
Starting point is 00:50:41 where different channels interact with each other when people want to activate or make revenue. That's when I think people can start to kind of see what's actually possible data. What they come to you is, hey, I need to know how many
Starting point is 00:50:57 customers I have. And if you just sort of stop the conversation there and give them that with the data warehouse, it's like, great. Why did I pay this much money for this? Right. I could have kept doing that for what you charged me. But if you can start to bring the focus around, like the whole point of this is to
Starting point is 00:51:17 actually bring data in from different systems and start answering questions that you weren't able to answer before and they're actually going to give you insight into your business business then i think like you can start to sell them on this idea and that's where most customers are they're nowhere near where we are today where we're talking about version control data modeling observability and all this stuff no one has any clue what any of that stuff means okay last question and i would love for you to speak to our listeners who are and of course with podcast analytics it's really difficult to know how large this subset is that i already have millions of viewers millions and millions how do you break out of that bubble? If you are working in a context,
Starting point is 00:52:07 I'll try to broaden it. If you're working in a context where you're sort of in the data echo chamber and that's your job day to day, how do you break out of the bubble? That's a good question. Get off Twitter and get off Slack and go meet real company
Starting point is 00:52:26 I don't know yeah like how do you talk to people who aren't even talking to you I think it's a tough thing to do I don't know talk to people who aren't in data as much as you can when you go outside
Starting point is 00:52:43 talk to people and ask them what questions you're asking with data, how they're solving the same problems that you're solving. Because at the end of the day, these people are doing this stuff. Like I've seen people do marketing
Starting point is 00:52:55 attribution in Salesforce. I have no idea how it's done, but I know it's a pretty common thing that people do. And it's like, well, they don't have a warehouse. How are they doing this stuff? So the more you can talk
Starting point is 00:53:06 to people outside of the data world, the better I think it will be. All of us. Yeah. Such, such sage wisdom, Pedram. This has been a really wonderful show.
Starting point is 00:53:18 It's flown by and we'd love to have you back on soon. This was great. Happy to come back anytime. My takeaway, Kostas, which has been a recurring theme throughout the show,
Starting point is 00:53:31 even from some of the very, very early episodes, is that generally keeping it simple is the best policy. And if you hear, you know, Pedram, who is probably more than anyone, is the best policy. And if you hear, you know, Pedram, who is probably more than anyone, you know, familiar with the most cutting edge tooling
Starting point is 00:53:52 in the data space, you know, even, you know, stuff that very small startup companies are building. You know, he picked a couple of core pieces of technology and said, this is what you need. And when you start to break it with scale, then you've hit the jackpot, you know? And so when you talk to practitioners, I just love how simple it is for them. terminology to describe technology. They just talk about the utility of various things that are required of them in their job. And it really is pretty simple. And so I guess,
Starting point is 00:54:35 you know, per some of the conversation that we had about working with them, it can get really tricky to navigate all the marketing terminology. And I'm of course, someone who's creating that problem actively in the data space. I love the simplicity. Yeah. I think, Pedronka has like a very pragmatic approach to things, which is, first of all, it's like super valuable for someone who's doing his job of being a consultant, right?
Starting point is 00:55:08 Because at the end, if you are a consultant, one of the biggest values that you can deliver to your customer is go and like the guy and help them like focus on what really matters for them and make the right choices. Right. So it's pretty difficult, like to avoid this Fogo hype, you know, like it's everywhere, like, you know, like a cheerleader of something, so it's, I don't know, I really enjoyed the conversation with him because it was very, you know, down to earth and very pragmatic.
Starting point is 00:55:50 And so yeah, like he talked about like the real problems and when you have the problems and when you don't have the problem, so I really enjoyed the conversation with him and he should be writing more and communicating this style of talking about what's going on in the industry because it's super useful and it's missing. I think we need more of voices. I agree. All right. Well, thanks for tuning in. Subscribe if you haven't, tell a friend and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
Starting point is 00:56:30 We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rutterstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
