The Data Stack Show - The PRQL: How Would You Define a Data Pipeline? Featuring the RudderStack Eng. Team
Episode Date: December 10, 2021
On the PRQL this week, Eric and Kostas bring in some of the RudderStack engineering team to discuss data pipelines and preview episode 66 of the Data Stack Show. ...
Transcript
Welcome to the Data Stack Show prequel.
We just recorded an episode with the head of data infrastructure at Robinhood, and man,
it was a very interesting conversation.
This is a special prequel episode because myself, Mitesh, and Sumant are all in San Francisco this week, where Kostas lives.
So we're actually able to invite Mitesh and Sumant to come on this prequel.
So do you want to just do a really quick intro and say what you do at RudderStack?
I am Mitesh. I manage the infrastructure team and also the control plane team at RudderStack.
Sumant?
Hey, yeah, I'm Sumant.
I lead engineering here at RudderStack,
across multiple teams.
So this is exciting
because I know these prequels are super short,
but we have two people who live and breathe data
every single day.
And so we're going to ask their opinions
on some of the topics we covered
with Sri from Robinhood.
So here's my question.
And this is kind of something
we talk about a lot on the show is putting a definition to a term that people maybe take for
granted. And Kostas has a great question. In the context of data, how would you define
a pipeline? So who wants to go first? Mitesh, since you went first in the intro, why don't you
take this one first?
Yeah, a pipeline is something through which data moves in a reliable fashion.
Nowadays, data has become so distributed.
So everyone wants the data to be in their centralized place or something.
But one of the biggest problems is moving it reliably.
So what I feel is that a pipeline is something through which data flows in a very reliable way. So even if anything goes down, we are confident that the data will
reach the desired destination.
Ah, interesting.
Yeah.
So there's sort of a resiliency component to the definition of a pipeline.
Super interesting.
All right.
Sumant.
Yeah.
For me, a pipeline is something where the data flows and there are no
leaks.
It's important that everything is reliably delivered to the destinations.
And as the pipelines are getting better and better, you also probably need to
have a better hold on the quality of the data that flows through your pipe.
So it's like a pipe which has a shape, and everything that goes through it
is delivered and also goes in the right shape.
Yeah, very cool.
All right, Kostas, you're up.
Yeah.
First of all, guys, thank you so much for visiting me here in San Francisco.
I was so lonely, so I'm very, very happy that you're all here.
And Eric, I would say that I think the main outcome is that a pipeline is
something very, very personal, right? Everyone has a very different relationship with it, so
it's probably something quite important, right?
So guys, my question is, I think, easier. I mean, you've been in the profession of
engineering for quite a while. Both of you are very, very experienced.
You have seen many different things. In these past, let's say, I don't know, decades, right, which
technology would you say is the most influential one when it comes to data infrastructure?
Which one changed, let's say, the space very, very radically?
Mitesh, you first.
Yeah, I think storage, like S3. The capability of storing data on very cheap storage,
basically dividing storage and compute. I think that
changed everything drastically, because now storage is so cheap that you can store an insane
amount of data at a very low cost. And with compute separated out,
you are only paying for what you are actually doing with the data. So I think separating storage
and compute changed the complete paradigm altogether.
Yeah, yeah, that's a great point.
Yeah, for me it's Kafka. It basically changed how companies build their data infrastructure.
It's like a one-stop solution for a lot of things that companies use
internally.
That's very interesting. Cool, let's see if our guests agree with all that.
Yeah, absolutely. Actually, having recorded the show, I can go ahead and tell our listeners
there's lots of interesting content on all of those subjects. So Mitesh, Sumant, thank you for
enlightening us, and I'm excited to spend some time with you in San Francisco.
Yeah, looking forward to it, Eric. Thanks, Kostas and Eric, for having us.
Thank you.
And be sure to subscribe so you can catch the next episode with the Head of Data Infrastructure
at Robinhood. It's a good one that you're not going to want to miss.