The Data Stack Show - 196: Why Big Query Was a Big Deal, Observability AI, and How AI is Like a Guy at the Bar, Featuring David Wynn of Edge Delta
Episode Date: July 3, 2024

Highlights from this week's conversation include:
- David's Background and Career (0:49)
- Econometrics Work at UPS (3:14)
- Challenges with Time Series Data and Tools (7:15)
- Working at Google Cloud (11:28)
- BigQuery's Significance (13:51)
- Comparison of Data Warehouse Products (17:23)
- Learning different cloud platforms (20:17)
- Coherence in GCP (23:04)
- Observability and data analysis (32:44)
- Support for Iceberg format in BigQuery (36:31)
- AI in Observability (40:25)
- AI's Role in Observability (43:39)
- AI and Mental Models (46:04)
- Final thoughts and takeaways (48:32)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to the Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human
challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new
data technologies and how data teams are run at top companies. Welcome back to the show. We're here with Dave Wynn. Dave, welcome to the Data Stack Show. We're excited to chat with you.
Absolutely. Glad to be here.
All right. Well, we know you work for Edge Delta in observability, but give us the brief overview of where you came from before that.
Oh man.
Well, if we want to go back far enough, there was a cold and snowy night in
February of 1984 and a cry rang out at four in the morning, which is
unusual for that time of year.
But if we fast forward a little bit from there.
So I've been a lifelong geek and I bounced around a number of different places from doing
econometrics at UPS headquarters in Atlanta to hopping around a few startups in Silicon Valley
with some ETL software and some observability software. And then I was at Google Cloud for a
number of years doing all things there, both on the compute and on the data side.
Until finally I am here at a new startup, where we're doing observability. So a little bit of this, a little bit of that. Very cool. Nice. So one of the topics we talked about before the show
was your time at Google. And we talked a little bit about BigQuery. So I'm interested in digging
in a little bit more there, because you were at Google, I think, during some of the crucial years.
It's the best data warehouse product that exists in the clouds, because it's the closest you can get to a SQL API with really not having to worry about any of the back end, right? No knocks on Athena. Athena is great for what it does, don't get me wrong. But if I've got to open another tab and start managing all of the S3 stuff and have all my Parquet files in just the right format, I'm not having the best day if that's what's happening.
So BQ just makes it super easy to dump data in there at the appropriate time
in the right increment in the right spot.
And you can just go about and start querying it, whether you're talking at the gig scale or at the petabyte scale, it just doesn't matter.
So I found that really slick.
Awesome.
Well, tons to talk about.
Let's dive in.
Yeah, let's do it.
Dave, there are so many topics that we want to dig into.
I'm actually interested about your, so you studied economics and then right out of school,
you did econometrics work at UPS.
What did you do there?
And what does an economist hired by UPS figure out for them?
So it's a great question.
I joined, actually, I was hired to be part of one group and was transferred to a different
group before my first day.
Yeah, nice. But the way this worked is that
I joined the forecasting team. And this is 2008, where we were coming into a giant recession.
And the thing that this team was responsible for was predicting as far out into the future as
possible. How do we know how much we need to ship where and when? Great, so we need to maintain all these time series that give us forecasts around all these things.
In the past, the models didn't actually have to be
that sophisticated because UPS more or less tracked GDP,
period, within a percent or two.
So not that hard to predict.
You could just sort of hit the button.
Until 2008: wait a minute, suddenly some things aren't quite working right, the numbers look slightly different. So there was a guy who thought that maybe we could bring some more econometrics into the forecast to figure that out. They hired a PhD to implement some of the key ideas. And they hired a grunt to do all of the work in Excel to make scale happen. And I'm not going to say which role I served, except that I don't have a PhD. Let's go with that.
Wild. Okay. So, I mean, dig into that
a little bit more. What did you, I mean, what did you find? How did you address the problem of,
you know, sort of your core
metric that had been used to forecast the business changing? Oh my gosh. Well,
so we're almost thinking about it more on the operational side of what I was responsible for.
So we had a number of time series that we published as a team to the wider organization.
This is before we were using a database. Like, someone passed around a CD of SQL Server 2005 as a, whoa, maybe we should try this.
And we were before that.
And in that sense, we were sharing Excel files about the different series that we maintained. The PhD, who is still there and is doing tremendous work, as I understand it, her name is Juana Mazzara and she's great. She would literally read academic papers and books and things that were published and translate the different processes that were available into stuff that would matter for a time series.
And so that would work once.
And then I had to figure out how to do it more for all of them.
And this involved a lot of VBA, where I was very grateful for IntelliSense, whatever level of intelligence you would consider 2008 IntelliSense to have, just to help me keep moving and get going.
But a lot of these operations were very annoying, right?
So you can't just fill down across the whole row
because you've got month resets, right?
And you've got different moving averages
and you've got different,
all different like little operations that,
if I was doing it today,
I probably would have functionalized everything
and gone about it that route.
But I wasn't quite that smart.
But younger Dave, who had fewer of what I'm going to call these blonde hairs on my chin, didn't quite know that. So we were trying to do everything half in GUI land and half in VBA.
And that led to a lot of files with a lot of very particular changes.
And you know, it was a job.
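For illustration, here is a minimal sketch of the "functionalize everything" approach Dave says he would take today, assuming pandas and invented daily shipment data; it is not the actual UPS workbook logic.

```python
# A hedged sketch: each operation is a small, pure function applied
# uniformly across every series, instead of 250 hand-pasted workbooks.
import pandas as pd

def monthly_reset_cumsum(series: pd.Series) -> pd.Series:
    """Cumulative volume that resets at each month boundary."""
    return series.groupby([series.index.year, series.index.month]).cumsum()

def trailing_average(series: pd.Series, window: int = 7) -> pd.Series:
    """Simple trailing moving average over `window` observations."""
    return series.rolling(window=window, min_periods=1).mean()

# Hypothetical daily shipment volumes for one of many series.
idx = pd.date_range("2008-01-01", periods=90, freq="D")
volume = pd.Series(range(90), index=idx, dtype=float)

# The same functions run over every series, no copy-paste per workbook.
features = pd.DataFrame({
    "month_to_date": monthly_reset_cumsum(volume),
    "7d_avg": trailing_average(volume, 7),
})
print(features.head())
```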
Yeah.
Yeah.
Okay.
Just out of curiosity,
because I want to jump actually to the,
there's so much to talk about with Google,
and I know John has a ton of questions,
but time series data at that scale is very interesting, right?
And so the tool set that you're talking about
sounds, you know, particularly painful.
What stack would you use today, right?
I mean, there are, like, time series databases like Influx and other tools like that
that are really good. Maybe that's not the right tool for what you were doing,
but what stack would you put together today to do that?
And before you answer that, what percentage of the time did you spend waiting for Excel
to become responsive again after you made a critical change and you didn't
hit save?
So I'll start with the latter question first.
None at all, because I was managing 250 Excel books.
Why would I put these all in one workbook
when I could do this copy and paste 250 times?
Got it.
So performance problem solved.
Yes.
That is the original brute force.
Wow.
Okay.
Yeah.
Wasn't expecting that.
To answer your former question: this was actually true with a different issue that we had when I was there, where we had a Microsoft Access application that wasn't idempotent and it needed to be run on a scheduled basis. And there was a problem with it that we didn't notice, because the non-idempotency had messed it up, and it wasn't until a different consumer, one that was rather important, noticed that our forecasts were the same week to week for a few weeks running. And we were like, oh. So redoing that now, I've become much more on the idempotent train, much more on the functional train, where I would be trying to bake in as much of that as we can. Storage is cheap, which was true enough even then, but it's much more true now. And so why would we bother trying to mutate state in place when we could just have a much clearer lineage about how these things get transformed from place to place? So it's all of the really cool and interesting stuff we talk about, like organizing a project correctly and making sure that your tables are well named, and all that other good stuff where you don't just hope that someone else can read VBA and has your brain.
I often wonder if weather forecasts ever work that way, where somebody somewhere is like, oh, the weather forecast is the same as it was last week, and somebody didn't run something.
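As an illustration of that idempotent, functional style, here is a minimal sketch with an invented Forecast type; it is not the actual UPS or Access application logic.

```python
# A hedged sketch of the idempotent, functional style described above:
# never mutate a forecast in place; derive each stage from immutable
# inputs so re-running a scheduled job yields the same output, and stale
# results (the "same forecast for weeks" bug) become easy to detect.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)  # frozen=True forbids in-place mutation
class Forecast:
    as_of: date
    values: tuple  # immutable, e.g. weekly volumes

def adjust_for_seasonality(fc: Forecast, factor: float) -> Forecast:
    # Pure function: returns a NEW Forecast, the input is untouched.
    return Forecast(fc.as_of, tuple(v * factor for v in fc.values))

raw = Forecast(date(2008, 10, 6), (100.0, 102.0, 101.0))
adjusted = adjust_for_seasonality(raw, 1.05)

# Idempotent: running the job twice with the same inputs gives the same
# result, so a scheduler can safely retry it.
assert adjust_for_seasonality(raw, 1.05) == adjusted
```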
Yeah.
You know?
I absolutely would believe that.
At a local level, that's gotta be possible.
For sure.
I often wonder if the precision is cut off because there's like moss on the thermometer and it's just like, sorry, we can only get to one decimal point. Hopefully that's good enough.
Yeah, it's like literally an instrumentation problem.
Yeah, it could be. It's probably why they did it at airports, because, you know, they have to keep those instruments clean.
Yeah, good point.
Yeah, I mean, you've got to think about the meteorologist, you know, where it's raining outside and the forecast comes in and it's different than the actual.
Absolutely. Have you guys read that article? Gosh, it was on Hacker News the other week about crazy real life bugs. And the bug was the Wi-Fi works when it's raining and not when it's not.
No.
Fantastic.
So it's worth looking up to hear the whole story. But basically, there was a guy who came back in
from college, and he was always tech support first thing, but his dad was also very capable. And he was like, yeah, I don't know why, but the Wi-Fi only works when it rains. I haven't looked into it yet. Right. Which, depending on where you live, is like, well, you're not logging onto the internet that much.
For sure. Well, and also it's backwards from what you would expect, and a whole host of things.
So the long story short is that they got their internet from a microwave beam from an opposing house. And what had happened 20 years
ago was that someone had planted a tree. And so when it rained, it weighed the leaves down enough
to be clear. And when it stopped raining, it was just enough to block most of the
signal. That's awesome. I love
that. So when it snows, it's
just like perfect internet.
I would imagine so.
As I recall, the article didn't cover wintertime, so I can't speak to that, but I would imagine.
Man, that is so great.
Okay, well, just a reminder to
the listeners, we're here with David Wynn from Edge Delta, and we're chatting about Wi-Fi signals, VBA, forecasting, and breaking up Excel workbooks. But John, you had a bunch of Google questions. I have some too. But David, just give us a brief overview of your time at Google. Because you worked on all sorts of stuff, but how long were you there
and what were the sort of the biggies?
Yeah, I was there for about seven years.
I started when there were few enough customer engineers across the entire world to fit in one training room, which we did once, and we didn't have enough training material to last the entirety of two days.
So half of a day was scheduled for five-minute lightning talks
from every person in the room,
which was fascinating
because it could be on any topic that you wanted.
And so that was the vibe.
It was young and it was fun.
We also didn't have all of the enterprise things that people routinely demand when I joined, like being able to directly peer to Google. That was a pretty big one. That didn't exist at that time.
This was also before Kubernetes and before GKE and before various other things.
So I was out there talking to people directly about architecture, how they could migrate to the cloud, how they could re-architect so that things might be more effective across the entire suite of products.
I like to joke that the job was not that hard.
All you had to do was know the some 300 products that we had,
know the some 400 products that AWS had,
know some 500 open source offerings,
and how they all fit together in every conceivable scenario.
It's not that big a deal.
But that, you know, that led to an interest in
basically all different flavors of stuff.
Because at some point I was territorial, where I would cover the entirety of the West Coast, because that's how territories go when you're early, down to smaller and smaller territories.
And then I started focusing on an industry because I have tried to quit video games several times.
And I'm sure I'll succeed one of these days. But I figured maybe I should make that more of a job thing, because we had several notable gaming customers at GCP. Niantic, makers of Pokémon Go, was probably the biggest one early on, but there have been others, like Unity and Apex Legends, and various other things have also used different degrees of Google Cloud, which I may or may not have had a hand in.
And so, yeah, like,
that's where I was for most of that time doing the customer-facing architecture side,
and then also doing a little bit of partner stuff as well.
Nice. So, Google Cloud, I mean, if I picked, like, seven years to be at Google Cloud, it seems like those were some of the most transformational years.
Totally.
And I would argue today is pretty... well, we'll see, we'll see.
Yes. Oh, we could probably have a chat about that for sure. But yes, it was definitely big times.
Yeah, for sure. So we want to talk about BigQuery, but were there any others? You know, in your years there, was there a product that comes out, you guys are introducing a new product, and you're like, wow, this is going to be incredible? Or maybe we just talk BigQuery, if there's no other product that you felt that way about inside the ecosystem.
I don't think so. I think BQ is really the one that I'm the most enamored with, just because it delivers so well on the core promise and solves so much. Whereas something like Dataflow was too complicated, and I tried to understand it for a couple of halves there.
I had it on my OKRs to try and figure this thing out. And usually when I had trouble,
I would go ask the team and they'd be like, go read the source code. And I'm like,
the last Java that I saw was at J.L. Mann High School in computer science AB, and I cannot read that. That was back in, like, Java 5 or something, when they didn't have decorators, and I don't know what any of the syntax means anymore, so no, thank you. But I also knew enough about it to advise people on what architecture patterns they needed and what the common pitfalls were. But even the Python SDK that they built, I think, was just a little bit beyond what's pretty reasonable for people to get. So I think that BQ hits the right thing. I think VMs
are very commoditized. I think GKE is great and is definitely probably the best Kubernetes platform,
but I mean, that's borderline commoditized as well because everybody's doing Kubernetes.
Well, but I think you bring up a good point that I think a lot of companies struggle with
is like they can have a brilliant solution to a problem
that is not accessible enough to enough people
to make a difference.
For sure.
Right?
And it seems like you're saying that the BigQuery
kind of hit that like brilliant solution to a problem
and very accessible to a large number of people.
Yeah.
The only challenge that you would really have is migrating the data and getting it in there.
That was really the only one. Because if you have a petabyte
capable system, your next problem is
getting a petabyte of data in there. Sure. In order to make use of it.
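For illustration, a hedged sketch of that "just dump data in" workflow using the official google-cloud-bigquery Python client; the project, dataset, table, and bucket names below are invented.

```python
# A minimal sketch of loading staged Parquet files into BigQuery.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

table_id = "my-project.my_dataset.shipments"  # hypothetical table
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Load Parquet files already staged in Cloud Storage.
load_job = client.load_table_from_uri(
    "gs://my-bucket/shipments/*.parquet",  # hypothetical path
    table_id,
    job_config=job_config,
)
load_job.result()  # block until the load completes

print(client.get_table(table_id).num_rows, "rows loaded")
```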
Yeah. Sure. What I'm interested in, actually, interestingly enough, we have not talked about BigQuery much on the show. Which, I love that in recent shows we've covered a bunch of topics.
We got into the other day, we got into the details of SAP HANA.
Yeah.
Details of that.
Yeah.
You know, that was great.
Yeah.
Got some hardware.
Yeah, totally.
That was awesome.
That's good stuff oh yeah totally
it was great
but in terms of
you know when you
I think there's sort of this perception of like
you know you have Snowflake
you have Databricks
and then BigQuery is
you know the third one on the list
but all the headlines
go towards Snowflake and Databricks
and, you know, I mean, part of that could be because Snowflake and Databricks, that's sort of the main thing they do, whereas Alphabet is gigantic and Google Cloud is, you know, a sprawling list of products only to be dethroned by the AWS, you know, portfolio of products. But in terms of sort of the Snowflake, Databricks, BigQuery,
give your perspective on that.
I'd be interested.
And one other thing here,
like think about what we're talking about.
We should be talking about Microsoft Azure SQL,
AWS Redshift, Athena, whatever, and BigQuery.
That should be the conversation.
Man, I need to get out of the
social data sphere
and stop reading the headlines
about
Battle Royale.
I think it's a really significant conversation, and that's clearly who should be in it. And then Oracle. We just skipped over Oracle. Those are the four people that should be in this conversation. Only one of them is, which is a big deal. Like, you know, Snowflake and Databricks are great too, but it's a big deal that BigQuery is in that conversation. And I'd be interested if you have any thoughts on why. How did that team win? How did that team beat out, you know, all these other products that should be just as viable, theoretically?
So first, I'm not going to dunk on Databricks or Snowflake. Those are both great products.
Oh, yeah.
And I've lost to, I'm not going to say which one, but I've lost to one of them more than
I would care to admit when I was studying BigQuery.
Sure.
The challenge that I think comes with it is, especially when you're talking about a hyperscaler,
there's a question of how much do I have to commit
in terms of getting a return on what this is, right?
Because if you're running most of your application
in AWS or in Azure,
you probably will just use whatever they have
and kind of suck it up and deal with it, right?
And most people, for better or worse, I would argue worse, but that's not what we're here to talk about, will not have GCP as their default, and so they'll miss out on what goodness this could provide. So I think that really is what holds people back. Whereas you look at Snowflake, you look at Databricks, a big core value prop is multi-cloud: do the whole thing, it doesn't matter where. It's like, yep, that's not a thing that BigQuery
could for a long time talk about. They just recently
got towards that in the last couple of years I was there with federated queries and stuff.
But even then, now you have even less of a tie to this platform that I don't know if I want to have to go learn and figure all out. And I have to give some empathy to that, because here's a little bit of humble pie that I'm going to go ahead and talk about eating.
I was in Google Cloud for like a bunch of years.
I think I'm a pretty sharp guy, mostly.
I thought I understood what cloud was.
I hadn't really dabbled with AWS until about a month and a half ago, too terribly much. And I was very humbled at how different these two things were
in so many respects.
And I can see, perhaps, a lot of the architectural decisions
they've made of like, oh, I see how they got there.
I don't understand why I have to open so many browser tabs.
And I don't understand why all of the instructions are out of order.
I don't know if you guys already know AWS,
but trying to learn AWS in 2024 is insane
because there is no from ground zero tutorial out there
that is up to date.
They're all half old with APIs and stuff.
It's a monster.
It's an absolute monster.
All I have to say is that at GCP, that doesn't really exist.
Someone is in charge of making sure it all
works together.
Boy, is that a change. But I
recognize that other people don't
want to take on what is
they expect to be that madness times
two, if not more so.
I hear it. This was almost 10 years ago, but I had to buy, like, Pluralsight classes to get through some of it. I did a bunch of digital modernization efforts, you know, almost 10 years ago now, and even then the documentation was either fairly inaccessible or, just like you said, I don't know. So I just got a Pluralsight subscription and walked through some of the classes.
ChatGPT definitely failed me in terms of trying to get me up to speed on AWS, because it was some number of versions behind.
You should have asked Alexa.
I don't own an Alexa device.
Or
one of the Anthropic models maybe would be better.
Maybe they train those on the Amazon manual.
I should really sign up for Claude.
I understand that one to be a bit more linguistically advanced,
though not technically advanced.
Yeah, that's what I've heard.
It is.
I mean, one thing, just to return to what you said about the system working together, and then also considering what you said about, I don't remember the specific name, but Google's implementation of Beam.
We don't have a lot of people on naming. They're not very good at it.
I mean, that's a very difficult thing to get really good at, especially with that level of product catalog. But I mean, there is a lot to be said for, okay, this is a combined platform. And if I were just going to go to market, and I could buy anything I wanted, and, you know, build this perfect thing, that's great. But the reality for a lot of people is like, whoa, these things work together. And so even if it's not ideal, it's just a pipeline, right?
It's going to run. And so did you see that dynamic a lot where it's like the advantage of a connected
ecosystem can outweigh the challenge maybe or like the rough edges of an individual product. Yeah. I think a lot
of GCP customers can testify
to that for sure. And I
think that has to do with the different development approaches
that the different hyperscalers have. So
AWS famously built
on two-pizza teams, right? You've got features and stuff that need to be shipped by relatively small teams. What that means is that your interface boundaries grow a lot, and what we see in 2024, if again you're coming to this new, is there are so many checkboxes, they're so out of order, they have such different expectations around all of these, because this team built that checkbox, this team built that checkbox, this one did this, and you can feel it. Whereas in GCP, someone is in charge of the console and the flow of it, and it just makes so much more top-down, coherent sense. So whether or not the dashboarding solution inside of GCP is, like, the greatest thing since sliced bread, it definitely works, and it definitely plugs straight into BigQuery and takes advantage of a ton of optimizations that they have under the hood that keeps everything fresh in a way that is harder to do when you're not.
Shout out to all of the dashboarding solutions that do great stuff,
not trying to knock any of them,
but there's just more cohesion that you can take from that perspective.
Yeah.
Okay, one more question for me on Google Cloud.
Do you think that Google's,
and this is a, how do I want to ask this? So Google's different business units, you know,
at least I've never worked for Google, but just from my experience, you know, sort of building
some technology on Google in a previous life, like even product, like individual products can
have like parts of them that are pretty disconnected,
not to the level of the Amazon sort of two pizza checkboxes are out of order.
But one interesting thing, at least as a user of BigQuery, I use the Google Cloud to scaffold
out a bunch of personal projects.
And it is very approachable.
Just even using their different APIs and other things is very approachable.
And so you can build out a project really quickly, right? And just everything works
with BigQuery and it is super nice. Do you think that comes from Google's competency
in consumer-facing products, right? I mean, that's really where they came from was
deeply consumer-facing. Like Gmail.
Yeah, exactly. You know, search Gmail
where there's this significant emphasis on,
you know, sort of emphasizing like simplicity and flow.
Or is that disconnected?
Because you could tell me either way
and I wouldn't necessarily be surprised,
but I'm curious.
Obligatory disclaimer here
that all opinions announced here in this podcast
are solely the property of David Wynn and not of any particular entity. This does not constitute investment, legal, or technical advice. Please consider everything I say stupid.
I don't think so. I think, because here's the really interesting thing about Google
Cloud. What was Google Cloud's first product? Do you guys remember?
Storage, but I actually don't know.
So that's S3 you're thinking of.
S3 was the first product for AWS,
which was released in 2004.
But not for Google Buckets.
But not for Google, no.
Google's first product was App Engine,
which is the entire development platform built in one.
Now, the reason for that is
that is how Google developers work internally.
And so the idea was, down to the part where they actually run it on infrastructure inside of Borg, inside of the thing that runs Google.
This is the development model that we use here. Everyone should use the development model here.
That didn't catch on for a lot of reasons, partly at least because people would have
to rewrite a lot of applications they didn't want to rewrite.
So, oh, okay.
Maybe if we want to go get this market faster and more directly. I think AWS had a much better approach to that, where it's like: let us give you exactly what you are familiar with, IT teams, and we will slice it up for you and charge you by the slice and have a nice little thing right here. Whereas Google tried to bring a bit more of the Google way of doing things.
So when we talk about projects,
which I do think is a meaningful boundary that they drew in GCP early on,
I think that was more of a happy accident
from the way that App Engine was structured.
Because I do think it's a much more coherent way to organize stuff
than, I mean, does AWS have a project boundary now? I feel like you
can do some things with ACLs and stuff like
that, but mostly it's still just like, I hope you logged in
with the right account because here it goes.
I don't know.
Azure has resources that are
kind of like projects.
Google has the most clear boundary.
I mean, you can tag things
in AWS and you can have different accounts
and you can have a unified
account with sub accounts. But beyond that, I don't know. Yeah. It is really nice in Google,
though. Like the other day I had this, you know, 250-page PDF exported from a note-taking app on an iPad. And I was like, I don't know why, but I wanted to experiment with OCR stuff. And Google has some really very cool products around that. And I mean, spinning up a project is required because those are pretty heavy, and so they require you to add billing. Well, I mean, we could discuss why they require you to add billing. That one makes sense, because if you really hammer the system. But I was like, this is unbelievable, you know? I ran a test in a couple minutes. It was super cool.
Yeah, it's good stuff. Highly recommended, especially if you like light blue. It's got a pretty tight theme on there.
Yeah, it does. I will go on record, and I assume Sundar is listening to this. Sundar, I'm going to go ahead and tell you something
that I didn't get a chance to tell you in person,
which is that the old logo for Google Cloud
was better with the rivets.
It should come back.
I recognize it didn't have all four colors
and that maybe is branding standards
and like as a thing, but it felt nice.
Anyway, that's my high horse.
I'll just step up and step off real quick.
And Sundar, the Data Stack Show has a message for you.
We would love for you to come on the show and talk about data at Google.
The Google logo,
you know,
and the Google logo.
Really?
You set the agenda.
It's great.
Yeah.
Yes.
Yes.
Uh,
okay.
That was great.
That was great.
Just as a reminder to the listeners who were driving and trying to look at maps
and Twitter at the same time,
we are here with David Wynn from Edge Delta,
and we're talking about all things Google Cloud.
What was next on the list, though?
Well, we got to talk some about observability, for sure.
Yes. I want to put this in here because I'm curious about your take on BigQuery. So, open source table formats, specifically Iceberg, are making a lot of splashes, big splashes. And the concept is great, right? Like, you can have this open storage concept that can be in S3 or GCP, like whatever storage you want, and then you're less locked into all these products. So then that pushes all the battles
up to the compute layer, right? So you got a Snowflake engine, you got a Databricks engine.
It's good for the consumer.
It's good for the consumer, allegedly. So where does GCP stand with that? Do you think this could be a thing today? Can you use
GCP today to access data
in Iceberg? You know, that's
a great question, but I'm afraid
Iceberg came along a little bit
after I left GCP, so I
am not sure that I'm
really equipped to answer it.
I'm asking Google.
Okay, perfect. This is obviously where we have to ask Gemini.
I mean, it seems like, directionally, right, that would be: okay, Amazon, Oracle, Microsoft, this is your chance.
Have a query engine that basically is just compute and accesses data in Iceberg. Like, ready, go.
Do you think any of them will do it?
I mean, functionally, that's what Athena
and BigQuery already do, right?
They have separate compute stacks
on top of some either proprietary
or non-proprietary formats
that they can spin up at will.
But embracing a new open source standard
is really the question.
Obviously, they're capable technologically, but will they play in the Iceberg world?
Yeah, I mean, they took on Parquet. I don't see any reason they wouldn't take on Iceberg.
Yeah.
Right, like it's inevitable.
The much more interesting questions to me are
as we evolve our understanding and our practice
of what we need to do as data analysts and data engineers,
how does that change what we need to do?
Right?
Because I just came back from Monitorama in Portland last week.
A shout out to the organizers.
It was a great conference, much appreciated. One of the dominant themes there was this: in observability, we've got
this concept of observability data, but it's not in tables, it's not an open
format, it's not in anything like that.
Like there is this concept of open telemetry, which does standardize the
line protocol a little bit and has an agent associated with it, but mostly
there's a whole bunch of all kinds of stuff floating around here from log
lines to time series data to trace information, which is sort of logs, but
with parent IDs and back and forth and stuff. The old approach was what I call the Patrick Star model: why don't we take all of the data over here and put it over here, so that at least then I don't have to go to 80,000 machines if I want to see if something went wrong? That's like a level one improvement, for sure, and it was viable at gigs of data per day. But now we've got terabytes of data per day.
Now we've got hundreds of terabytes of data per day.
Now we've got some of the big organizations
are generating a petabyte of observability data per day.
And it's like, we've got to take this one step back
and think about, okay, what are we doing here?
Yeah.
Because you just can't move that all across the wire fast enough to matter. And so Edge Delta is obviously helping to push this forward, but there are other people in the same vein of: we need to push that distribution down, as far toward where the data is created as possible, so we can do aggregations and filtering and routing and stuff where all of that data is created. That's one method to think about it. But, you know, we might even have to think about what kind of data it is we're making and how we use it. Because, man, we've got to tie the dots on these things.
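A toy illustration of that push-the-work-down idea: filter and pre-aggregate log lines where they are produced and forward only a compact summary. This is a generic sketch, not Edge Delta's actual pipeline.

```python
# Instead of shipping every raw log line to a central store, an edge
# process aggregates locally and forwards only a small summary object.
from collections import Counter

def summarize_logs(lines):
    """Aggregate raw log lines into per-level counts plus error samples."""
    counts = Counter()
    error_samples = []
    for line in lines:
        level = line.split(" ", 1)[0]  # assume "LEVEL message" format
        counts[level] += 1
        if level == "ERROR" and len(error_samples) < 5:
            error_samples.append(line)  # keep a few raw examples
    return {"counts": dict(counts), "error_samples": error_samples}

raw = [
    "INFO request handled in 12ms",
    "ERROR upstream timeout talking to billing",
    "INFO request handled in 9ms",
]
# Only this small summary crosses the wire, not every line.
print(summarize_logs(raw))
```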
Can I talk about my worst meeting at UPS? You don't have to say yes, but maybe I'll go ahead and say it anyway.
When I was at UPS, one of the things that inclined me to get out of data
analysis was I had been given a charge and put together a dashboard and an
analytical report on, I honestly don't even remember what, I remember working
on it for, I think it was a week or something, you know, good chunk of time,
particularly as a young guy who didn't know what I was doing. And then I walk into the meeting and
I start talking and I can see the guy's face change right away. And within 30 seconds,
he stops me and he says, David, this looks great, but I wanted to let you know, this isn't what we're looking for.
I wanted to see this and this and this.
And I was like, huh, that was a week's worth of effort for a whole host of things that
I thought were interesting bubble up style that I'm now being told to do a little bit
more top down style in a different direction.
And it's like, huh.
Something about this went wrong that I
wasted a week here. Did he just ask for hard-coded
values? He's like, can you just code these values?
I want it to be up and to the right.
Man, I really wish I remembered
the specifics, but I mostly remember
his face.
Yeah, it was like, you know,
if you go in with data to
present something, and in the first 30 seconds, that big blinking red light on your internal dashboard is like, we've lost the audience here.
Yes.
There's such a thing. I'm always interested in ways that we can tie this type of stuff closer together. And I feel like as analysts
and engineers, we can get a little bit caught up in the properties of what this is without thinking
enough about how it ties back to the greater objectives of what we have that's actually going
on here, right? And I think the next wave you'll see in observability, but honestly a little bit from the analytical side too, as we start taking more and more control over our data via open formats or what have you, is that this needs to line up with the thing that we all need to do here. So how do we tie those things together so that we don't burn any cycles we don't need to? Yeah.
For those keeping score at home, by the way, talking about open table formats, straight from Gemini: yes, you can query Apache Iceberg data in Google Cloud using BigQuery. BigQuery supports the Iceberg format through BigLake Metastore.
BigLake Metastore. They're upping their naming game.
Big Lake. Yeah.
That's kind of cool, though. Yeah, I kind of like it. I like that they made sure you knew it was big.
Yes. And lake. They got the data lake. Big, it's a lake.
What's the largest lake? The number of customers that wanted to change the name of data lake,
they're like, we don't want a data lake.
We want a data ocean.
We want a data galaxy.
And you're just like, yeah, man, absolutely.
Keep going.
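For illustration, a hedged sketch of what that Gemini answer can look like in practice: defining a BigLake external table over Iceberg metadata in Cloud Storage and querying it with ordinary SQL from the BigQuery Python client. The project, connection, and bucket names are invented, and the exact DDL options may vary by BigQuery version.

```python
# A minimal sketch, assuming a BigLake connection already exists.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_dataset.iceberg_events
WITH CONNECTION `my-project.us.my-biglake-connection`
OPTIONS (
  format = 'ICEBERG',
  uris = ['gs://my-bucket/warehouse/events/metadata/v3.metadata.json']
)
"""
client.query(ddl).result()  # create the external table

# Query the Iceberg data exactly like a native BigQuery table.
rows = client.query("SELECT COUNT(*) AS n FROM my_dataset.iceberg_events")
for row in rows.result():
    print(row.n)
```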
So our conversation about observability reminds me of something
that we've talked about with
RudderStack and AutoTrack.
You remember?
Oh, AutoTrack.
Yeah.
Right.
So there's this problem.
I'll let Eric describe it.
But it's what you're saying.
We're like, hey, let's just like go in and collect everything.
Right.
And you have this decoupled, like, technical team that's like, I don't know what's useful, I'll collect and store everything. And then downstream, you know, a business team that's like, I don't care about any of this stuff. And it's not, I might care about some of it; it's, I literally will never care about this piece of it. So every moment wasted engineering, collecting, tracking, storing, retrieving is complete waste.
Well, the interesting thing about that, so the context is, you know, RudderStack collects
user behavioral data, you know, so telemetry from like your website or app, etc. And early on in the
life of the company, they had experimented with an auto track feature, which is basically you
install the script on your site, and it just tracks every change in the DOM on your website,
and just sends that as a payload. Sounds noisy.
So noisy.
Now, something really interesting.
I don't know if you listened to this show,
but we had someone from the analytics company Heap
on the Data Stack Show.
And Heap's, one of their big differentiators
was AutoTrack.
And they stuck with it and actually ended up figuring out how to make it work.
But listen to this.
This astounded me.
It took their engineers, because they, like our model is very different.
We send everything, you know, to the warehouse or whatever.
We don't actually store any of the data.
We have, you know, sort of standardized schemas or whatever. But Heap is an analytics tool. So not only do they do collection, but they actually provide an analytics visualization, you know, layer or whatever. But I think the guy said it took their engineering team like five or six years to build a system that had reasonable SaaS COGS on AutoTrack.
Wow.
And then they did an immense amount of work
to reduce the noise.
And now it allowed them to do some very interesting things
because if you can actually solve those two problems,
then you do have an interesting data set to work with.
Right.
But that was astounding, right?
And it was actually, I mean, I seem to remember,
I can't remember the exact details of the conversation,
but the founders had to be extremely opinionated
both with operators and investors to say,
we are going to have really bad cogs
until we figure this problem out.
And so not only is there noise,
but the infrastructure impact
and underutilization of what is required
under the hood to even process that is significant.
Same with observability.
Absolutely.
Well, I mean, you're basically talking about
a different form of observability, right?
When you're zeroing in on user behavior, that's not what we traditionally
think of in observability because we're looking at the application and its actions.
Yeah, sure.
Presumably, usually users initiate those actions.
Yeah, a lot of timestamped messages. Okay, so let's talk about, we can't not talk about AI when we're talking about petabytes of disparate data.
I was just going to say, can you imagine how disappointed everyone's going to be that we've gotten this far without putting A and I together?
It's like a game every week, honestly.
We're like, how long can we push this conversation without talking about AI?
We do pretty good.
Although we did disregard Microsoft and Oracle in favor of, you know, the darlings of the valley.
So we at least checked that box, you know, and we talked about Iceberg.
And we talked about Iceberg.
You know, so, okay.
But legitimate question.
I mean, when you think about petabytes of data, petabytes of different types of data in a context of observability, like, of course,
you go to, I mean, of course, the default is, can AI solve that problem, right? But it's a
machine learning application, right? Where you're looking for, you're looking for anomalies in like
a giant, you know, stream of data.
But how do you think about that at Edge Delta?
Yeah.
So we are using a very hybrid approach of traditional approaches with sort of your standard type of alerts and search and various other things that go on that people would expect,
as well as some machine learning driven sort of dynamic behavior.
But it just makes alerting a little bit easier, because we're re-baselining everything for you. So in that sense, we're letting the model handle that. It is very hard for a constantly changing application to have fixed alerts that make sense over a long enough period of time, because you get drift. There's just not a great way to do it.
And currently, we're
solving it by the fact that SREs hopefully
remember what alerts they have. And if they
have gotten quiet for too long, they'll go in and check
them. Or if they've gotten too noisy,
they will fix them and
not just send them straight to spam.
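A minimal sketch of re-baselining in the spirit described here: score each new point against a rolling window of recent behavior instead of a fixed threshold, so the baseline drifts along with the application. This is a generic z-score illustration, not Edge Delta's actual model.

```python
# Flag a value as anomalous relative to the last `window` observations.
import statistics

def is_anomalous(history, value, window=60, z_threshold=3.0):
    """Compare `value` against the mean/stdev of the last `window` points."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # not enough data to establish a baseline yet
    mean = statistics.fmean(recent)
    stdev = statistics.stdev(recent)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_threshold

latencies = [100 + (i % 5) for i in range(120)]  # steady-ish baseline
print(is_anomalous(latencies, 103))   # False: within normal drift
print(is_anomalous(latencies, 400))   # True: way outside the baseline
```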
We've
recently experimented as
well with putting some LLM AI
on top of our anomaly detection.
So that is some very high signal to noise type stuff.
I call it almost like the 2 a.m. checklist: if you get paged at three in the morning because something has gone wrong, and you are like, oh my God, why did I ever keep this monitor this bright? I just want to do a thing and get back to sleep. So we've added a little LLM in there to just give you,
hey, maybe you want to look at this.
Maybe you want to look at this.
Yeah, yeah, yeah.
It doesn't auto-do anything, on purpose. Because, sure, there are people that think that's the way forward.
Not at 2 a.m.
It is.
Oh.
If it gets the alert to go away,
there are absolutely SREs that would push the button and do that. If there's no rollback
button, but there's an AI fix button, they would
absolutely do that. I know, that's the reason
it's not the solution.
So it just makes suggestions for you along the lines
of, hey, you might want to look at this, you might want to look at this,
based on the anomaly and the information that we could all
correlate across the different information.
So that's the direction that we are taking with it.
I personally am on record as thinking that AI is not going to, quote unquote, fix observability, which I liken to: hey, we've got a petabyte of data, let's dump it in there. And it's like, yeah, how are you going to train that model? Are you ready to spend all that? And that's just the COGS side.
Even more importantly, if developers are good at one thing, first of all, any
developer listening to the podcast right now is amazing and never
makes these errors, but other developers, right?
All the other ones.
If you've met any other developers, they're really good at creating
new ways for software to screw up.
And the idea that you will have a dataset that has all of the errors that you could
want to track in the future is comical.
Yeah.
And I've told the story several times of my favorite database error. When I was working at the ETL company, I got a database error that said: you haven't paid us. And I was like, what? And it turns out we were using a Salesforce syncing tool to go from local database writes. You basically circumvented Force.com, because you could write to your local database and they would handle the syncing into Salesforce for you. But we forgot to pay them. So, I was like, I've never seen that error again. Is it useful to have an LLM train on that? No.
And that analog holds to the infinite ways that we can combine bits together.
So I'm very skeptical of the idea that AI is coming to fix observability in
particular. And similarly, I'm a little bit skeptical of it sort of in the broad too,
though that's a bit more of an open question. So the idea just in this one example would be like,
okay, we've got an AI in place. It is going to be able to, never having seen this before,
read this error message that says something very generic, like you haven't paid us, to know that there's some vendor out there that you're using to sync from A to B, and to prompt somebody like, hey, you need to go pay them.
Yeah, that makes sense that that wouldn't work. Insane context requirements.
Yeah, exactly.
Well, I really think one of the big challenges we have is that if we're not directly at the frontier of research, we're getting a lot of second-degree assessments of what AI can do and what it can't do. And so I think what we really need, even for people in our position, and I can't speak to how familiar you guys are with it as well, I'm not making any disparagements, is better mental models of what it's like. And so the one that I give to everybody is this.
An LLM, our current understanding of LLMs
as of time of recording,
is it's a bit like a guy in a bar
who has overheard 10,000 hours of conversations
about motorcycles.
So he's never seen one, he's never touched one,
he's never ridden one,
but if you ask him any question about a motorcycle, he probably knows the answer,
but occasionally he might compliment how your torque smells.
And so you've got to...
He doesn't know the difference.
It's not his fault.
And the way they work, at least based on our current understandings, again, at time of recording, is they don't reason, but they're associative. The reason that chain of thought works is that it snaps the words into a reasoning-looking object. And so when a lot of AI products and startups pitch themselves on the
idea that AI will be able to think or reason or do the decision-making part, I'm pretty skeptical. But if it can do
some of the things that computers are very good at, like computers never get tired. So I think
they're very good at brainstorming, pulling together different associative ideas. You know,
a lot of baseline stuff that can help clear the blank page problem. I think AI will be great for
like a hundred different paper cuts
in normal everyday life, just like data was,
you know, 20 years ago or something.
Like data is going to change everything.
It hasn't knocked out the economy.
It's just made all of the little things that we do a little bit different.
And I think we'll see that too.
And of the 10,000 hours of the guy listening at the bar,
he was drunk for several hundred of them,
but we don't know which ones, right?
Yeah.
He's not so
sure about...
He's got some hazy gaps, right?
Or he wasn't paying attention to who was drunk and who wasn't.
Well, yeah.
He keyed into every conversation about
motorcycles, including the one where he's like,
guys, I just had the most amazing day.
And it's like, okay, that guy sounded like he had fun. I'm going to remember that.
Yes.
Training on Reddit data is the analogy there.
It's too good. All right. Well, we're at the buzzer. I think one of my big takeaways is that being an SRE is like having children in that, you know, that really bad gut feeling that you get when you're like,
our house is way too quiet.
Something's wrong.
This is an excellent analogy.
Unless there is a type of, maybe a type of pet owner, that has a very defined cage and fence, where they're okay with silence because they know exactly where everything is. You can have those SREs out there, but not too many.
All right.
Well, Dave, thanks so much for joining us on the show.
It was an absolute blast.
And we'd love to have you back sometime soon.
Absolutely.
We'll do it.
Take care, guys.
The Data Stack Show is brought to you by RudderStack, the warehouse-native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.