The Data Stack Show - The PRQL: How Would You Define a Data Pipeline? Featuring the RudderStack Eng. Team

Episode Date: December 10, 2021

On the PRQL this week, Eric and Kostas bring in some of the RudderStack engineering team to discuss data pipelines and preview episode 66 of the Data Stack Show. ...

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show prequel. We just recorded an episode with the head of data infrastructure at Robinhood, and man, it was a very interesting conversation. This is a special prequel episode because we have myself, Mitesh, and Sumant all in San Francisco, where Kostas lives, this week. So we're actually able to invite Mitesh and Sumant to come on this prequel. So do you want to just do a really quick intro and say what you do at RudderStack? I am Mitesh. I manage the infrastructure team and also the control plane team at RudderStack. Sumant?
Starting point is 00:00:44 Hey, yeah, I'm Sumant. I lead the engineering here at RudderStack, like the multiple teams. So this is exciting because I know these prequels are super short, but we have two people who live and breathe data every single day. And so we're going to ask their opinions
Starting point is 00:00:59 on some of the topics we covered with Sri from Robinhood. So here's my question. And this is kind of something we talk about a lot on the show is putting a definition to a term that people maybe take for granted. And Kostas has a great question. In the context of data, how would you define a pipeline? So who wants to go first? Mitesh, since you went first in the intro, why don't you take this one first?
Starting point is 00:01:30 Yeah, a pipeline is something through which data moves in a reliable fashion. Nowadays, data has become so distributed. So everyone wants the data to be in either their centralized place or something. But the problem, one of the biggest problems, is moving it reliably. So what I feel is a pipeline is something through which data flows in a very reliable way. So even if anything goes down, we are confident that data will reach the desired destination. Ah, interesting. Yeah.
Starting point is 00:01:53 So there's sort of a resiliency component to the definition of a pipeline. Super interesting. All right. Sumant. Yeah. For me, like, a pipeline is something, uh, yeah, like where the data flows and there are no leaks. It's important that, like, everything is reliably delivered to the destinations.
Starting point is 00:02:11 And, like, as the pipelines are getting better and better, you also probably need to have a, like, better hold on the quality of the data that also flows through your pipe. So it's like a pipe which has a shape and all the other thing that goes to that shape is, like, delivered and also, like, goes in the right shape. Yeah, very cool. All right, Kostas, you're up. Yeah. First of all, guys, thank you so much for visiting me here in San Francisco.
Starting point is 00:02:37 I was so lonely, so I'm very, very happy that you're all here. And Eric, I would say that I think the main outcome is that a pipeline is something very, very personal, right? Like, everyone has a very different relationship with it. So it's probably something quite important, right? So guys, my question is, I think, easier. I mean, you've been in the profession of engineering for quite a while. Both of you are very, very experienced. You have seen many different things in this past, let's say, I don't know, decades, right? Which technology would you say is the most influential one when it comes to data infrastructure?
Starting point is 00:03:27 Which one changed, let's say, the space very, very radically? Mitesh, you first. Yeah, I think storage, like S3, the capability of storing data on, like, very cheap storage, like basically dividing storage and compute. I think that changed everything drastically, because now storage is so cheap that you can store an insane amount of data, and that is at such a low cost. And, like, removing out compute, now you are only paying for what you are actually doing with the data. So I think separating storage and compute changed the, like, complete paradigm altogether. Yeah, yeah, that's a great point.
Starting point is 00:04:14 Yeah, for me, it's Kafka. So it basically changed how companies build their data infrastructure. So, like, it's a single stop solution for a lot of things that companies use internally. That's very interesting. Cool, let's see if our guests agree with all that. Yeah, absolutely. Actually, having recorded the show, I can go ahead and tell our listeners there's lots of interesting content on all of those subjects. So Mitesh, Sumant, thank you for enlightening us, and I'm excited to spend some time with you in San Francisco. Yeah, looking forward to it, Eric. Thanks, Kostas and Eric, for having us. Thank you.
Starting point is 00:04:56 And be sure to subscribe so you can catch the next episode with the Head of Data Infrastructure at Robinhood. It's a good one that you're not going to want to miss.
