The Data Stack Show - Shop Talk: Kostas Settles the Real-Time vs. Streaming Debate
Episode Date: November 18, 2022
In this bonus episode, Eric and Kostas talk shop around the topic of streaming. ...
Transcript
Welcome to the Data Stack Show Shop Talk. Kostas, we have talked with people who built amazing data technology at companies like Netflix, Uber, and LinkedIn. But you and I actually don't record our own talks about data very much, even though we talk about data together a ton. And so Brooks had this amazing idea of just recording some of the conversations that you and I have before and after the show about data and our opinions on it. And really, this has been one of my favorite things that we do. So welcome to Shop Talk. It is where Kostas and I share opinions and thoughts on a personal level about what we're seeing in the data space. And it really is simple. We ask one another a question and the other one tries to answer it. So without further ado, here is Shop Talk.
Welcome to the Data Stack Show Shop Talk, where Kostas and I talk shop about all things data and probably share too much of our personal opinions.
Kostas, I believe that it's my turn this time.
It is, yes.
You can torture me as much as you want.
Any question you might have, I'm all yours.
Okay, I saw an interesting company launch.
This was on Hacker News recently, and they call themselves a real-time API data connector. Basically, I guess they say, like, sub-minute. You never have more than sub-minute drift in between data sources. And I mean, I looked at it, and this is just at a very, very high level. I haven't actually read the docs or anything like that. But just based on the architectural diagram on their marketing site, which we all know is the definitive source of truth for every product's architecture.
Yeah.
It kind of seems like, you know, almost real-time APIs
are almost like a streaming ETL,
if you will, right?
Like what you would
traditionally load in batch,
you're now loading,
you know, essentially in real-time
with sub-minute latency
to some sort of data store downstream.
Anyways, this got me thinking,
and I'm really interested to know
your thoughts on this.
So there are multiple sort of streaming technologies in this sort of vein, right?
You know, you have Materialize, right, which is sort of streaming SQL.
You have a number of other technologies. And one thing I'm interested to know is, do you think that these will get super wide market
adoption or is, say, sub-minute latency really only a problem for a certain subset of companies?
Or actually, I have a third flavor.
Okay.
Do you think that it will become so cost-effective that it just doesn't matter?
Right?
Like, why not stream in real time if you can, because it's cheap?
Like, right now, part of the challenge is, like, the infrastructure to do that at
scale, you know, tends to be, like, pretty hefty.
I mean, I don't know.
I would disagree a little bit with that.
Like, I don't think that it's that expensive to, like, start working
with these systems, right?
Like they scale up and down.
Yeah.
Like if you have a lot of data, obviously it's going to be more expensive, but
if there's more data, it means that like there is a reason you have more data.
Hopefully it represents also like that your business is generating more
data.
It's about your growth, right?
Like,
Yeah, I guess maybe another way to say it would be, like, cost has multiple vectors, right?
Like if you just set up a set-it-and-forget-it, like, 24-hour job that you never look at unless it fails, versus, like, managing Kinesis, those are very different.
Like, yeah, I mean, okay.
I don't know exactly how it works, like, from a quick look at the website.
Like, oh yeah.
MovingLake is the tool.
Yeah.
I didn't even mention the tool.
Yeah.
Okay.
First of all, we are talking about a very early product, which means they are probably still trying to figure out the product-market fit, right?
Yeah.
So we need to keep that in our minds.
They're going to be batch ETL, traditional batch ETL in like three weeks when they pivot.
I mean, yeah, I don't know.
Like, for example, if you go to the connectors and select the Bank of America connector that they have, you will see that of the entities that they support, like the three of them that they mention here, one of them is real-time and the other two are not, right?
And that's because obviously Bank of America does not offer, like, a live feed over accounts and sub-accounts, but it does over transactions.
So that's what you can get in real time, right?
I mean, there are, let's say, inherent limitations with what the systems are willing to expose in real time and what not, right?
Yep.
And the way that I see it, from what I see here, yeah, obviously you can push data from your Bank of America API or account to a database, right? But I don't think you're going to be generating that much data.
Unless you are, I don't know, I mean, how many transactions a day can you have on a bank account, right?
Usually they're, like, thousands, like, maybe millions.
So why I'm saying that, and what I'm trying to say, is that with real time, when we are approaching these questions, we need to always start from the use case: what are we trying to achieve with these systems, and how are we going to be using the data from the systems?
Yep.
Right.
Like, yeah, if I need the transactions to consolidate, let's say, all the transactions for my P&L at the end of the month or whatever, do I need to do that in real time?
Probably not.
Do I need the transactions to create a notification so that, I don't know, a salesperson can do something as soon as possible?
Yeah.
Is this "as soon as possible" sub-millisecond?
No, we are still working with humans.
They are not going to react that fast, right?
Do we use these transactions to do HFT, like, high-frequency trading?
Oh, yeah, but then again, we're talking about a completely different type of system, right?
Yeah, yeah, that's it. Yeah, yep.
So streaming is one thing, real-time is another thing, okay?
Yeah.
Streaming and like...
That's a great distinction.
You have pull, push versus pull, and all that stuff.
It provides different ergonomics around working with your data, right?
Real-time has to do with latency.
Like, how fast you have to react to any piece of information, right? Let's say you are the system that scans the sky for inbound nuclear warheads from the enemy.
You probably would like to react pretty fast, right?
And you want to guarantee that, right?
So it's going to be fast.
You don't want it to be fast one time and a little slower another time, you know?
Right?
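To make the point about guaranteed latency concrete, here is a minimal Python sketch. The latency samples and the 200 ms deadline are made-up numbers for illustration, not anything measured or mentioned in the episode. It shows why an average can look fine while a single slow response still breaks a real-time guarantee.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_deadline(samples, deadline_ms):
    """A hard guarantee is about the worst case, so check the slowest sample, not the mean."""
    return max(samples) <= deadline_ms


# Hypothetical response times in milliseconds: nine fast responses and one slow outlier.
latencies_ms = [12, 11, 13, 12, 11, 12, 13, 11, 12, 950]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms, p50={p50_ms} ms, p99={p99_ms} ms")
print("meets 200 ms deadline:", meets_deadline(latencies_ms, 200))
```

In this made-up data the mean sits under the assumed deadline, yet the slowest response does not, which is the "fast one time, slower another time" situation described above.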
So it heavily depends on what you are trying to do with the data that you have.
Right.
And what are the notifications or the real-time dashboards that you are going to build, and who is consuming them? So my question to you as a marketer: one of the very standard go-to-market strategies when it comes to data was, like, oh, marketing needs real-time data.
They need, I don't know, sub-minute latencies and stuff like that.
Is it true?
Like, what do you think marketers need when it comes to data?
Well, I would start out by saying I do think some of these technologies are really compelling because, you know, from a marketing perspective or even a product perspective, you've been able to do the analytics thing, like real-time analytics, for quite some time, right? I mean, real-time web analytics or real-time product analytics. There are really great products out there that do that.
But also, as it's becoming easier to get more data together and to compute interesting things with separate sets of data, some of the infrastructure actually allows you to compute some of this stuff in, say, near real time. So instead of just observing a user behavior and then seeing that in a dashboard, sort of with a direct line, like product analytics through whatever pipeline, you're actually doing some sort of compute along the way that includes additional data, which is really compelling, right? Because then you get a lot more insight downstream, even if it's in, let's just say, the same dashboard that you're looking at, right? You have some sort of compute along the way. So that is very compelling, because the amount of context and fidelity that you can get is way, way higher, potentially. Still pretty hard to do, actually, like, technologically, you know? I mean, it's not like the patterns are a mystery, but it's also a lot of pieces that you have to put together and run. So I would say I agree with you that it really depends on the use cases.
Right.
So let's take an example of like a situation in the real world where real time, you know,
or near real time or whatever.
Actually, we should probably discuss like the definition of real time because actually
it's sort of at the root of the issue. Well, let's say you have some sort of app that,
you know, like a ride sharing app or, you know, whatever, some sort of like transportation thing.
Weather can be a really big influence on that, right? So if you think about like customer
acquisition from a marketing standpoint, you know, or app activation, right? Like we want to increase usage or get people beyond their first ride or first interaction
or whatever that is. Weather can be a big driver for that, right? So rain is coming, you know,
go ahead and book your ride or schedule it or, like, whatever that is, right? So
from that standpoint, you actually need to pull a bunch of data in, run a bunch of compute, and in a pretty quick manner, send out a message to certain users in a certain location to try to get them to take that... you know, that level of, let's call it, creating a personalized
experience based on a high level of context on those particular users, particular situation
in a particular location.
That also includes a lot of context around like their individual usage of your, you know,
service or whatever.
So those things, sure. I would argue though that
the companies who will truly benefit from that level of detail and that level of infrastructure
tend to be like really large companies with really large user bases. Right? Like that's not very common.
Yeah.
Yeah, I agree with you.
What I would like to add, especially because we started talking about this because of, what's the name, MovingLake?
MovingLake.
Yeah.
And please, like, that's not like, I'm not trying to say anything bad about them.
Right.
Let's just make this clear.
But sure.
Well, all the Hacker News comments already did that for you.
So there's probably nothing you can...
Yeah.
Like, to be honest, I have huge respect for someone who is trying to build something like this today.
Okay.
Like, it takes a lot of, how to say that, it's not exactly an easy market to penetrate right now.
It isn't.
There are many solutions out there, right?
So I have huge respect for people who are trying to do that.
And usually what happens is that, like, you need to understand a little bit how you start a company, right? And how you start building a product. You have an idea in general, right? You know where you want to be, but at the same time, you need to differentiate enough so you can have a starting point. Yep. Okay. So yes, you throw something out there, you try to create a new
way of, let's say, solving a problem.
And that's, let's say, the conversation starter with the market.
That's what you see here.
It's like a conversation starter.
Like, hey, like we are solving this problem.
Is it important for you?
Cool.
Come here.
Like, that's how we solve it.
Maybe it's not the right way to do it.
Maybe it is.
We don't know.
But sometimes you need to start.
And that's what we see here in a company like MovingLake.
And again, huge respect for what they are doing, because this is the ugly part of building a company where everyone can easily have an opinion.
I can very easily say, this is going to fail.
Yeah, obviously.
Like it's easy to say at this point that like it's going to fail.
Right.
But that's not the point here.
It's not like you're trying to... what you're trying to do is start a dialogue with the market until you figure out what the real opportunity is and how exactly to approach the opportunity in the market that you have chosen, right? In this case, it's data management. So that's how we should see these things.
And yeah, tomorrow, is it going to be real time? Is it going to be batch? Is it going to be something else? Maybe it's both, right? We'll see. I think it's going to be very, very interesting, now that I have a first impression of what MovingLake is, to revisit that in six months.
In another discussion, and see where the product is six months from now.
And try to understand what happened in between.
Right?
Like, what generated the changes that hopefully we're going to observe.
We really should do that
and we can like replay clips
of this conversation.
You know,
and then they raise
like a huge amount of money
and are super successful
and then...
Yeah, and hopefully, yeah, guys, if you're going to do that, let us know.
Maybe, who knows, you might find some angel investors.
Yeah.
I'm not moving enough real-time transactions from my Bank of America account via API to write big checks.
Yes, it doesn't have to be a big check.
It can be a small check.
It's true.
A check is still a check, right?
That is true.
That is very true.
I will say, for one, I'm excited.
I'm bullish on real-time stuff.
I think as the experience gets better and better and more accessible... a lot of times, even in my job, we don't need to know stuff in real time, but it's really nice to.
It's really convenient.
I think it can go wrong. Looking at stuff too often, looking at numbers too often, can actually be unhealthy or a distraction.
But when you think about things like campaigns that you're running or product launches or other things like that, it is kind of cool to see, like, is there initial resonance?
You know, it's kind of neat.
I don't know.
I'm excited.
What I would add to this is that moving data around in real time is not that hard.
What is much more complicated, and where it really gets hard to set, let's say, very strict SLAs, is when you have to process the data in real time.
If you want to execute very complex queries where you have, I don't know, joins between tens of tables and I don't know how many aggregations, blah, blah, blah, all these things, this is hard, right?
That's where things start to get really, really hard.
So yeah, moving the data around is one thing. Processing the data and making it available for someone to consume is a completely different kind of problem.
Because that's really where all the value is created though, actually, right?
Like, well, I mean, if you're trying to get some sort of insight that requires compute, that's actually where most of the value is created, when the data lands downstream.
Yeah.
I mean,
It depends.
Like you're trying to like to build a service that it's more like Zapier, let's
say, right?
Where you want like to trigger something when something happens.
That's one thing, right?
You don't need to do like any crazy kind of like processing there, right?
Like it's more about like how many requests you can, like how much data
like in the unit of time, like you can, you can process internally the requests.
Now, if you want like to get the data and also do very complicated algorithms on the data,
now that's a different thing.
That's why we usually see in the Lambda architecture, you see you have the bots in the streaming
or real-time part of the architecture, where most of the huge cases around the real-time
come more to do with like notifications, because in notifications, usually you don't have like to go and
process like a lot of like different data, do like a lot of rambling around the data.
It's more about like taking a look into the data and see like, oh, is there something
like that I need to act upon because like the temperature is like higher than it
should be.
You know, yeah, I'm exaggerating a little bit, simplifying things too much, but
that's where you see like the, I think for everyone who wants like to understand better,
like the distinction between like the two paradigms, like studying inclementations of
the Lambda architecture and how like companies did that and for what reason, I think it's an excellent starting point.
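As a rough illustration of the split described above, here is a toy Python sketch of a Lambda-style setup. Everything in it is hypothetical: the sensor names, the 30-degree threshold, and the function and class names are assumptions for the example, not how any particular company implemented it. The batch side recomputes an aggregate over all historical events, the speed side only looks at each new event to decide whether to raise a notification, and a serving function merges the two.

```python
from collections import defaultdict

THRESHOLD_C = 30.0  # hypothetical alert threshold, chosen only for this example


def batch_view(master_events):
    """Batch layer: recompute (total, count) per sensor from the full event history."""
    totals = defaultdict(lambda: [0.0, 0])
    for sensor, temp in master_events:
        totals[sensor][0] += temp
        totals[sensor][1] += 1
    return {sensor: (total, n) for sensor, (total, n) in totals.items()}


class SpeedLayer:
    """Speed layer: per-event check for notifications plus a small incremental view."""

    def __init__(self):
        self.recent = defaultdict(lambda: [0.0, 0])
        self.alerts = []

    def on_event(self, sensor, temp):
        # The notification path only inspects the single incoming event.
        if temp > THRESHOLD_C:
            self.alerts.append((sensor, temp))
        acc = self.recent[sensor]
        acc[0] += temp
        acc[1] += 1


def serve_average(sensor, batch, speed):
    """Serving layer: merge the batch view with events seen since the last batch run."""
    b_total, b_n = batch.get(sensor, (0.0, 0))
    r_total, r_n = speed.recent.get(sensor, (0.0, 0))
    n = b_n + r_n
    return (b_total + r_total) / n if n else None


history = [("a", 20.0), ("a", 22.0), ("b", 25.0)]
batch = batch_view(history)

speed = SpeedLayer()
speed.on_event("a", 24.0)
speed.on_event("b", 31.0)

print(serve_average("a", batch, speed))
print(serve_average("b", batch, speed))
print(speed.alerts)
```

The notification check touches one event at a time, which is why it is cheap to run with low latency, while the heavier aggregation is left to the batch recomputation.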
Totally agree.
Totally agree.
All right.
Well, Brooks tells us we're at the buzzer. I could talk about this for a long time, but we have so many more Shop Talks to dig into.
And next time it's your turn.
So I can't wait to see what you ask.
You know, Kostas, we learned so much from the data leaders that we talked to,
but I learned so much from picking your brain,
and actually your questions really make me think really hard.
So I appreciate Shop Talk.
I think it makes me a sharper thinker.
Well, it's fun.
Like, I think it's good to just sit and chat about the stuff that we experience.
And yeah, I hope people enjoy it.
That's why I'll keep asking for people to reach out.
Please do this.
Come on, folks.
Like, you can do that.
Like, send an email.
Yeah.
And let us know how you feel, and what your opinions are, like, your experience with the show.
So please do that, so me and Eric, we can keep being happy.
Please.
Of course.
And of course,
we try to take the same types
of questions to,
you know,
data leaders from all sorts
of companies,
large and small.
So definitely subscribe to the main show if you haven't yet. Tons of really good episodes there
and tons of really good thoughts from data leaders, you know, really around the world. So
definitely subscribe if you haven't and we'll catch you on the next Shop Talk.