The Data Stack Show - Data Debrief: What Open Source Data Projects Have Come Out of Facebook, Whoops, *Meta?
Episode Date: October 29, 2021
On this week's debrief, Kostas and Eric talk about the variety of open source projects that come from Facebook. ...
Transcript
Okay, welcome to the third edition of the Data Stack Show Debrief.
And Kostas and I were just commenting on the blurred background,
which I actually think the fidelity is pretty good for both of us.
My background's slightly moodier.
Yours looks much more blurry than mine for some reason.
I think it's lighting.
Yeah, you're a more dark kind of person.
I'm the light of the show.
That's right, yes.
We're talking about like darkness in my heart.
Okay, so this is something that I couldn't wait for the debrief.
So we've talked with people from some really cool companies.
So Uber, Netflix, amazing technologies have emerged from those companies.
A lot of cool technology has emerged from Facebook,
but it seems like it's just not quite as... It doesn't show up as much in conversation at the
very least, at least on the show among our audience. And is that because they're embroiled
in so much controversy? I mean, of course, Uber has controversy, but there are some things from a
leadership standpoint, there's all of the on the ground, like legal implications of the actual
delivery of the service and employment. But Facebook is embroiled in all sorts of controversy
around content, they get deeply involved in politics, because of the nature of what's being
shared on the platform. And so just, and for me,
being a little bit of an outsider to the data space,
hearing that Presto came out of Facebook
was a little bit novel to me, I guess,
because I've come to expect to hear
that those technologies emerge from other companies.
But Kostas, you have seen the space emerge.
So what's your take on that?
Yeah, okay. Presto is a very interesting piece
of technology in terms of like how it has matured, and it's not new. It's been around for a while.
And, I mean, okay, I knew that it was coming from Facebook. What I didn't know was all this
story around Trino: how Trino came up, how the governance of the
project broke into two different projects and all that stuff.
So I would say that, okay, on one side, I think that probably Facebook didn't manage
that part very well, how they manage the governance of the project and all these things that are
very, very important
when it comes to large-scale open-source projects.
On the other hand, I think that one of the reasons
that we didn't hear until now that much about Presto
is because it took a while for Presto to become,
how to say that, something that makes sense
to be used outside of very large enterprises.
And I think that's something from the conversation that we had with Justin.
And if you hear what Justin was saying was that we started with Hadoop, right?
And we had to wait until today to have the data lake at the level of maturity
where Presto or Trino or Starburst can actually be used on top of
that to become like the query engine that is going to do the analytics.
So I think because of the nature of the project itself that was stateless in a way, like it
didn't have storage, it never like was a complete database product, right,
as Snowflake was, it had to wait until all these data lake related technologies matured enough
to become much more available. And I think that we will start hearing more and more about this
project, especially as part of like this data mesh movement, where it naturally fits because
of the decentralized nature of the technology.
Yeah, well, it's certainly something we're hearing more among RudderStack customers, right?
Like, Presto as a requirement is coming up more and more.
I think on the other side of the conversation, React actually is something that came out
of Facebook that has widespread adoption.
It's less in the realm of the show in terms of data processing, et cetera,
but certainly a technology that has seen widespread adoption that came out of Facebook.
Yeah, yeah, yeah.
They have quite a few, let's say, important open source projects.
I think RocksDB also comes from them.
But they have, I mean, if someone goes to their open source repositories,
there's a big wealth of very interesting projects
that come from them.
Okay, of course, not all of them are as relevant
as Presto or React, for example,
which is probably the dominant framework
to do front-end development.
But yeah, I think there was some kind of mismanagement around Presto.
Yeah, well, I was reading, of course, there's the famous quote,
someone at Facebook said, the greatest minds of our generation are figuring out how to try to get someone to click on an ad,
which is certainly a drastic oversimplification.
But great for a podcast debrief to bring that quote
up. But it is interesting to think about Presto and React and then a number of other things
that really the world is benefiting from in many ways as a result of, I guess, those great minds
trying to get you to click on a Facebook ad. That's true. That's true.
Okay. So second question, because we're running up on time here on the debrief. I don't know if we have a time on these, Brooks,
probably for our mental health and other people's mental health, we should keep it to like five
minutes, but data mesh. Do you have any updated thoughts on data mesh? I mean, subjective,
somewhat controversial. I feel I have a little bit more clarity, but what's the Kostas take?
We probably need like a full episode with Justin to discuss about it.
It's still a vague concept for me.
It would be interesting probably at some point outside of vendors who are part, let's say, of this data mesh pattern or whatever,
to also find someone to chat with who has actually implemented data mesh architecture and get some insights from there.
It's early. There is a reason that it exists, absolutely.
I'm not saying something against data meshes, but it's something that needs more clarity.
And I think we should try in this show to bring this clarity in this concept,
for this concept.
Yeah, absolutely.
Brooks, mark it down.
We need someone who has implemented a data mesh.
Kostas, any other takeaways?
I mean, that was a fun show.
Super interesting.
Ah, yeah.
I mean, it's amazing when you get to chat with people
that they were doing data 10 years ago
and they are like, okay, serial entrepreneurs in a way
because it's like the second company that's here.
Sure.
It's very interesting to hear the perspective from these people
that they have lived both eras, let's say, of this market.
So it was a very, very interesting perspective.
And I really want to thank Justin for that.
The description that he gave about the landscape, talking about how Snowflake, you can think of it
as the Teradata of today, Fivetran like the Informatica, which makes sense.
Yeah, yeah, for sure.
Those are great ones. But like, it takes someone who has lived through all these iterations of the
market to have this kind of insight.
Yeah, for sure. One thing, actually, we're just going to run this one long
since it's our only,
it's only our third one,
but the Iterable-Snowflake connection
that you mentioned
and Justin's insight
that they're probably
also running on Snowflake
was really helpful for me.
And I'm sure like Justin,
it'd be interesting to talk to him
just about that subject.
I think that that dynamic will create a massive market in and of itself just within the Snowflake ecosystem.
Right. So like if you think about all the companies that are using Snowflake as a data warehouse,
and then you think about being able to integrate with other companies who are however
they're doing it, but presumably running Snowflake on the back end, yeah, so that you have almost
like a marketplace data mart that's readily accessible in your cloud data warehouse. Like,
I agree with Justin in that, like, that's not the end-all be-all in terms of the broader data stack when it comes to the complexity faced by Fortune 500 companies.
But without a doubt, that's going to be a huge market in and of itself and hot take on the debrief.
I think a huge way that Snowflake grows significantly in the next five years just because that type of functionality is pretty huge.
Yeah, actually, I would suggest to our listeners to go and read the first pages of the S-1 filing
of Snowflake because that's exactly what they describe there. And the way that they describe it
is by using the term network effects. And that's
exactly what they're trying to create with this data sharing mechanism. Because suddenly you get
Iterable that has its own customers, that they have a good reason to also use Snowflake,
and you create some very, very strong network effects there that if you manage to implement them and create them,
yeah, like it's going to be probably much more,
how to say that, much more impressive
compared to what like Teradata managed to do on the market.
Yeah.
I don't think that it's easy to do it,
especially when on the other side,
you have all this openness that comes
with all
these open source projects. And of course, companies, especially the big companies,
they know exactly what vendor lock-in means, right? So it remains to be seen if they are
going to succeed in this vision, but it's amazing to see how clear the vision is
and how the execution is, of going from a data warehouse
to a data cloud that has network effects,
which is amazing.
Like it's very, very impressive
from like a business strategy perspective.
For sure.
Frank Slootman.
No wonder the RudderStack CEO went to work for him
out of
doctorate school. One point on that, though, that I will say that's interesting, which is
in some sense, history repeating itself, or maybe the beginning signs of it is
marketing and marketing slash go-to-market data tend to be the tip of the spear when it comes to
data technology, because the needs there create a significant amount of demand
within an organization. We've talked about this with a concept of CDPs where it's like, okay,
well, the initial CDPs actually focus on like marketing execution and customer journey
engagement. When in reality, the original intent was customer data across the entire stack.
And my sense is that you're starting
to see that with the network effects that Snowflake is trying to create with the marketplace concept.
A lot of the big initial data set availability or integrations relate directly to marketing.
And I'm not surprised by that, but it's really interesting to see history repeating itself
in that regard in terms of marketing and go-to-market data being the tip of the spear.
Maybe that's just, I'm projecting my own experience on that.
So check me if that's not accurate.
Yeah, yeah.
I think, and okay, what I'm going to say
might sound like a little bit controversial to some people, but when it comes to technological progress,
there are two major drives behind it
that we humans don't want to talk that much about them.
One is marketing and the other is porn.
Which, that Venn diagram overlaps a lot.
You don't have to go into detail.
Yeah, exactly.
So there are like some very,
actually outside of like joking,
but like I mean what I'm saying,
like there is this very interesting fact
about Betamax versus VHS in the 80s
and VHS winning
because it was adopted
by the porn industry.
Sure.
So outside of like,
okay, like the moral or whatever,
like issues of like this conversation,
there are some very strong drives behind technology.
And yeah, marketing is one of them for sure.
Like it's the first reason
that someone is looking for data.
Yeah.
All right.
Well, you heard it here in this debrief.
That was probably longer than five minutes,
but great. Thanks for joining
us. Subscribe to the Data Stack Show if you haven't yet. You can subscribe on your favorite
podcast network. And of course, subscribe to us on YouTube here where you're watching this video.
Lots more interesting content coming up.