The Data Stack Show - 84: Why Are Analytics Still So Hard? With Kaycee Lai of Promethium
Episode Date: April 20, 2022

Highlights from this week's conversation include:
- Kaycee's background and career journey (2:34)
- Why analytics are hard (7:28)
- Defining "data management" (11:47)
- Defining "data virtualization" (15:57)
- The relationship between data virtualization and ETL (18:34)
- Where a company should invest first (21:40)
- Building without a Frankenstein stack (25:19)
- How Promethium solves data stack issues (27:53)
- Giving context to data (35:14)
- Cataloging: background, at Promethium, future (39:29)
- Who uses data catalogs (48:00)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
On April 27th, we have another Data Stack Show livestream coming up. We love doing these because we get to record a show live, and you get to join and ask your questions in real time. It's super fun. We are tackling the subject of data quality on this show, and I'm super excited. We have people from Bigeye, Metaplane, and Lightup, and we're working on getting Great Expectations in the mix. But Costas, my question for you is: data quality companies seem to be exploding across the data landscape. Why do you think this has become such a big deal, such a hot topic, with such a proliferation of companies starting in the space?
Yeah.
I mean, you know, as we make it easier and easier for people to get access to their data, they start focusing on implementing insights and reports and even data applications, whatever we want to call them, on top of that. And then everyone realizes that, you know, we are still in the classic garbage in, garbage out kind of situation, right? If your data is bad, no matter how good your dashboards or your application are, the outcome at the end is going to be bad too, right? So making sure that we have a good understanding of the quality of the data, and how much we can trust the data that we have, is quite important. And it's actually pretty tough to do.
Even defining what data quality is, it's not an easy task. It's not an easy task to understand or figure out where data quality should live, either. Is it a pipeline thing? Is it a data warehouse thing? Is it at the collection level of the data? Or is it at the BI level? Maybe it's in every place. We don't know. We're still working on trying to figure out the answers to these questions.
And it's a space right now where there are a lot of things happening, a lot of innovation. And I think it's going to be awesome to have all these great people in one place and hear from them about their experience, what made them get into this problem space, what they've learned, and what the challenges are. So I think there's plenty to learn, and I'm very, very excited and looking forward to chatting with all these people.
Absolutely.
You can register at datastackshow.com/live. That's datastackshow.com/live. And if you register, you'll be entered to win one of our nifty Drop mechanical keyboards. I just plugged mine in yesterday, Costas, and it is awesome. So definitely make sure you register, and we will catch you on the next Data Stack Show livestream.
Welcome to the Data Stack Show. Today, we're going to talk with Kaycee from Promethium, who has a really interesting background. And I'm always interested, Costas, in talking to people who build technology based on not just, sort of, seeing a market opportunity, maybe, or, you know, thinking of a cool technology, but who have worked in contexts around the problem and have just repeatedly experienced different kinds of pain that relate to the same problem. And that's what Kaycee experienced, and that's why he built Promethium. I'm really interested. He talks a lot, in some of his blogs and materials, about how analytics are pretty difficult, even though we live in an age of modern tooling.
And I want to ask him why that is.
I think it's something that, you know,
different people in different roles and companies
feel different pain around,
but it can be kind of hard to articulate,
like why are analytics still actually pretty hard,
you know, and why are they huge projects,
even at, you know, mid-sized companies?
So anyways, that's my question.
I think it's a great opportunity to learn more about a new term,
which is data fabric.
So I'd love to learn more about it and put some context around it.
Why we need the new term and what it means and how it relates to the rest
of the technologies that we use.
And also revisit an older term, which I have the feeling is related to the data fabric, and that's the data catalog.
We have talked about many different, let's say, data management, data governance-related tools so far.
I think data cataloging is not something that we have touched that much.
Although I think it's quite important.
And I'd love to hear from Kaycee about the data catalog, how to use it, why to use it, and its evolution into the data fabric.
All right.
Well, let's dive in and talk with Kaycee.
Let's do it.
Kaycee, welcome to the Data Stack Show.
Hey, thanks for having me.
All righty.
Well, let's start where we always do.
I'd love to hear about your background and then what led you to Promethium.
Yeah, thanks. So my background, a little bit mixed. Got a little bit of go-to-market as well as product, as well as financial analysis kind of all mixed in.
And it probably explains how I got into the data management space and how I became the founder of a data analytics company.
So I started my career actually as a business and data analyst.
The guy crunching numbers, getting insights.
And as I like to say, the guy getting yelled at by my executives for always taking too long with those insights,
which led me to do everything from take a SQL class and learn how ETLs work and why data warehouses were structured the way they
were and why I couldn't get a data mart refresh every minute, why it had to be every three months.
So my journey kind of led from there into being more on the go-to-market side with sales,
business development, marketing, and then eventually back to product management.
And 20 years later, after starting as an analyst, I somehow ended up as president and COO of
a data catalog company selling data management tools. And one of the things I realized when
doing that was that the problem not only didn't go away, it actually got a lot worse in 20 years.
So when I was a young guy, crunching numbers, I was lucky enough to have one data warehouse, one BI tool. And most customers
we talk to today, unfortunately, for them to be competitive and leverage their data,
they have to get data from multiple databases, SaaS applications, data warehouse, data lakes,
multiple clouds. And to make it all worse, they can't even standardize on a single BI tool.
And so this is a challenge that I saw a lot in my old job as president and COO of Waterline Data.
And it led me to want to find a way where, gosh, can we just make analytics easy for
people, please?
And can we make it so that it doesn't matter what type of data source you have, it doesn't matter what kind of BI tool you have? Can we actually streamline this process so that you don't pay a tax just to try and use your data? And that's been sort of where I exercise the product
management background in me as well as kind of the go-to-market in terms of figuring out
that product market fit and how do you actually deliver a product that hasn't been built before because the old way was simply creating more of the same problems over and over again.
Definitely. Okay, I have so many questions there, but I have to ask a question first. So I snuck around on your LinkedIn and noticed that early, early in your career, you were an analyst at the Federal Reserve. And so I'm just interested to know: what did you work on? What types of problems were you trying to solve? And did you discover anything that really interested you or surprised you in that role?
I'm not sure I'm at liberty to say, Eric. I'm kidding.
It wasn't that exciting, trust me. When you work for the Fed, you actually do a lot of macroeconomic analysis, right? Looking at housing trends and big trends, stuff like that.
Specifically, I was also looking at things that were affecting the banking landscape,
like things that were driving M&A, regulations, how some of those regulations can monitor and
enforce the monetary policies. So I would say that was the day job. I was in the statistics
department, so I was doing a lot of number crunching. Believe it or not, in my spare time,
I realized someone should actually build a database of all the different M&A activity that's happening.
And so I actually found time to actually do that.
And that's where I kind of really got interested in the whole...
No way.
Yeah, I know.
You know, when you work for the government, you actually have a lot of time. It's not like that as the founder of...
That's so interesting.
I don't know what you're talking about.
Don't know what I'm talking about.
Thank you for entertaining me. Okay, I want to dig into a question. I think, you know, you said analytics is hard, and people experience that in so many different ways, right? I mean, on sort of the, I'll use a marketing example, you know, because I'm a marketer by trade, but it's like, okay, I'm just trying to get these events into Google Analytics and get my Google Analytics accurate. And, you know, it's like, okay, well, that's painful, right? But then on the other end of the spectrum, it's like, okay, I have, you know, legacy systems, I have new systems, I have multiple lines of business, I have, you know, all this sort of stuff, right? And it's really fragmented. Could you help us understand, from your perspective, why is analytics hard? And I agree with you. It seems crazy that it's still hard today, because the tooling has gotten, the different tools have gotten, way better in many ways, but it is still hard for sure.
It is. I look at it in a couple of different ways. One way I look at it is that the analytics landscape has changed a lot as we morphed from everyone just putting everything in their databases first, and then they said, hey, don't do it in the database, put it in your data warehouse. And then from there, we had new paradigms with data lakes, with Hadoop, and then with cloud and so forth.
And that's okay.
Like, I feel like, okay,
those are shifts that we can deal with, right?
The thing that made it worse,
in my opinion, is, you know,
the vendors, these damn vendors who make the data management tools.
If you look, it's kind of crazy, right?
It's like, hey, I'm only going to do this piece
of the whole data management process. And I'm only going to do it for this platform, for this type of data.
And I don't know who started that trend, but it became vogue to start doing that.
Like, well, hey, if those guys can only do it for RDBMS, I'm going to do it for HDFS, or I'm going to do it for time series, or I'm going to do it for, you know, whatever, just on AWS. So I think the challenge has been a lot of the data management
tools only do part of the workflow and only address part of the environment or part of a data
type. Now, that may not be so bad if you never, ever change your data infrastructure. So if you said, I'm forever on this cloud, this environment, this data warehouse. Awesome. The problem is that never happens, right? Every day there's a better, newer data warehouse, data lake, some SaaS application out there. And I've never seen anyone say, we're going to keep what we have, we're never going to innovate and get something new.
The problem is once the business unit starts consuming analytics, consuming reports,
and it starts getting operationalized, good luck telling someone that you're going to shut that
down as you buy something new. Never happens, right? It never happens. You end up keeping it
and then you say, oh, for the new stuff, I'm going to move it onto a new platform. And so
what happens is you end up having to support
the legacy data management tools on top of the legacy
analytics structure and so forth.
So this is where it gets hard.
And then to make it worse,
the knowledge of the human being
doesn't necessarily go back 30 years,
especially the tech knowledge.
So today's grad, if you threw a mainframe at them, they would kind of look at you funny.
Like, what are you, what are you doing?
Why are you giving me this?
Right.
So, and the tech stack is moving so quickly.
So I also find that it's also very hard for data teams to even do this.
And so this is why it's become super challenging.
And then the last nail in the coffin is, I think, you know, 10, 20, 30 years ago, it was okay to make data-driven decisions once in a while, right? And that was kind of the norm. But I think companies have shown us that, hey, if you can be data-driven, like Amazon, like Facebook, like the Googles of the world, you can go out and really kick butt and do really, really well. And so companies now realize, I have to be data-driven. You look at the pandemic, it's actually kind of taught us a lesson. You can't take three months to make a decision; you may not be around in three months.
So it's forced people now into this mad rush of, oh my gosh,
I have to somehow make it work with all this legacy,
this stuff I have to deal with and the knowledge gap that I have.
So I think that in my opinion is one of the leading factors of why things are as hard as they are today.
Kaycee, we started the conversation and you used the term data management a few times.
Because I'm old.
Well, you know, with it, there's also wisdom. There's also wisdom there. So I would say you're wise. But can we spend some time defining what
data management is based on like your experience so far, because you know, I,
I feel like one of the, not exactly problems, but like something that's very
interesting, like with this industry is that we use a lot of different terms, and the semantics are not very clear.
Everyone has a slightly different meaning of how we use it.
Drives me nuts. Drives me absolutely nuts. I know what you're talking about.
Yeah. So, how do I define data management? Okay. My high-level definition of data management is all the stuff you do after the data lands in the database, the warehouse, the lake.
So after it lands, the write has been committed.
It's all the stuff that you do to get the insights.
That's my rough overview definition of data management, right?
And so it is the ETL process.
It is the data cataloging process.
It is the prep.
It is the modeling.
It is the query, query optimization, query federation, the SQL, the process, and even
to some extent, the visualization process, right?
But I would say, I think the hard part, when people have a negative reaction to the word data management, a strong visceral reaction to data management, it's because they're reliving some traumatic events they used to experience through those processes that I just talked about.
Yeah.
Is there like a minimum set of, let's say, activities that every company needs to have? I mean, I would assume that, okay, if you would like to consume data, some kind of visualization tool is going to be needed, right? Or a database system. What is, in your experience, let's say, let's name it, the minimum viable data stack, right? How do you define it?
Yeah.
Well, so I'm going to start at the two ends, right?
The first end is basically where the data originates, right?
And so this is SaaS applications, RDBMS, the data sources.
And I would even include the data lake and data warehouse there. I know, yes, you pull data and put it in there,
but for the purpose of analytics,
I kind of think of data sources
as anything that stores, houses, or generates data.
So I kind of put that in one end.
And then at the end of the stack,
on the other end,
is your typical BI tool visualization.
But that's even evolved, right? I would say the last few years we've gone beyond that; the dashboard isn't enough anymore. People want narration, people want storytelling. You know, there's this trend of, hey, maybe I don't want to have to go look at a dashboard every single time to figure out what's going on. Maybe I just want you to tell me. Maybe I want the tool to tell me: this is what I should care about. Right.
So, but I would say, you know, let's call that the insight part for the moment. Right.
And so for the most part, every organization has somewhat figured that out.
Right.
They figured out where the data is coming from, where it's stored.
They figured out kind of how to visualize it.
And for the most part, anything in between, I think that's where it gets messy, right?
I've seen everything from like
data scientists and data engineers
doing crazy Python and, you know,
extraction with R and Scala
to, you know, cobbled up bespoke solutions
that you may have had three SIs over 20 years
come in and build for you.
I think the best practice has been to, you know, put in different data management tools. Like, you know, you might have a data catalog for discovery, for governance. You might have a tool to do, you know, prep and modeling, and obviously an ETL or a pipelining tool to get the data into a data warehouse or somewhere else. And from there, we've also seen things like data virtualization technologies as well.
So I would say, for the most part, I would classify it as: you have your discovery and governance layer, right? You have your prep and modeling layer. And you have your, I'll call it, access layer. And in the access layer is where I would lump ETL, the moving of the data, the pipelining of data, as well as the query of the data. I would lump, you know, virtualization there as well. So I would say these are the three broad categories, right, in the middle of, you know, the data sources and the, you know, BI visualization and analytics tools.
Mm-hmm.
Mm-hmm.
That's very interesting.
And, okay, I have another question about another term you mentioned, and that's data virtualization. So, what's that?
Wow, how much time do we have? That term is so misused. Like, you know, the storage guys have a definition of data virtualization. I'm sure the virtualization guys, the VMware guys, have a different definition. And then there's data virtualization in more of the data management space. It's been around for a while; Cisco had a version of that a while back. Popular ones: Dremio, and Starburst obviously talks about that. So I'll talk about the more recent, as well as more relevant to analytics, definition. And that is really the ability to use a layer that allows you to have virtual access to the data sources, where you don't have to do the ETL first.
You don't have to load the data first before you can query it.
And being able to also do federated queries, because if you look at something like Starburst, you actually abstract out
the SQL query execution engine, right?
So away from the underlying data sources.
So when you can do that, then you can actually push a lot of the operations like joins, aggregation,
so forth away from the underlying data sources.
And you can actually do parallel processing or the SQL execution so you can get better
performance.
But it also means it is now possible to actually run the query
where you're joining data from multiple sources,
which before that, you know,
we would never think about that, right?
We would say, oh my gosh, no way.
I have to do the ETL, you know, transform everything,
land it into, you know,
one single data warehouse and do that.
And so I think, you know,
when I say data virtualization,
I'm really talking about kind of more recent
incarnations of data virtualization, right? A la Dremio, a la Starburst, those types of technologies.
Now, let's say I assume the role of a data engineer who is pretty new in this discipline. I hear you describing, let's say, this data stack, and you mentioned both data virtualization and also ETL. But when I hear you describing virtualization, it makes me feel like we don't really need ETL, right, if we have, let's say, data virtualization there. So how do you see the relationship there between the two? And I want to ask you to give me an answer that's as pragmatic as possible, right? Like, what do you think happens at the end out there? Are we going to kick ETL out or not?
Yeah, well, it's a good question, Costas, because I think my thinking has actually evolved, right? I would say two years ago, if you had a video of me somewhere, you know, I was probably out there protesting with a sign: no ETL. We've got data virtualization, networks are fast, you know, CPU and memory on servers are good enough. We don't need it.
Well, I have to say in the last year or so, I've had to change my mind and it comes from
just practical experience with customers, right?
So I'll tell you what I mean.
So I think data virtualization is fantastic when you're exploring, when it's ad hoc, when you have an idea, you're
not sure yet. What data virtualization allows you to do is get you a quick way to validate,
you know, is this the data you're looking for? Will it answer your question without waiting for
the complex task, waiting for the data to be loaded and so forth. So that's awesome, right? The harsh reality is that physics exists, right?
I love, I love my brothers at Starburst, right? And all the data virtualization offerings.
But when you get into a customer environment and they say, hey, I've got 12 billion rows across these two tables, one's in the cloud and one's in an on-prem Postgres database, and I need you to run this query, and I don't want it to take more than a minute... Physics, man. Like, look, no matter how many nodes you add, how much memory, there's still that extra hop that you're going to take. Right.
And so I've kind of changed my thinking, as I've seen in customer deployments. The data virtualization is a good way to start off with if it's ad hoc stuff.
But when you're talking about operational pipeline,
operational jobs, operational analytics,
and you have SLAs to meet in terms of time
and in a big data sense,
you're not going to win versus a dedicated query.
You're just not going to win, especially if it's against the data warehouse that is being tricked out.
And the data engineers know exactly how to tune it for performance.
And so this is where I do see them coexisting.
I don't see it as a replacement.
I'd say, hey, look, the best practice I always tell our customers is use the data virtualization and make sure this is what you want very quickly.
And then if this is what you want, you know, build the pipeline, right?
But now you have full confidence that this pipeline is actually going to deliver exactly
what you want, which is a lot better than before, when you would trial-and-error with the ETL, the complex pipeline, right? Only to have it break multiple times before you figure out, you know, this is the one that you want.
So it actually is a nice marriage. And I actually think that is a good way to actually combine the
two technologies to get the best of both worlds.
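Kaycee's "validate with virtualization, then build the pipeline" workflow is concrete enough to sketch in code. Below is a minimal illustration, assuming a Trino/Starburst-style federated engine and its Python client; the host, catalog, schema, and table names are hypothetical placeholders, not anything from the show.

```python
# A minimal sketch, assuming a Trino/Starburst-style engine and the `trino`
# Python client. Host, catalog, and table names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # hypothetical coordinator
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# Step 1: ad hoc federated join across an on-prem Postgres catalog and a
# cloud warehouse catalog -- no ETL, just validate this is the data we want.
cur.execute("""
    SELECT o.region, SUM(o.amount) AS revenue
    FROM postgres.sales.orders o
    JOIN warehouse.crm.accounts a ON o.account_id = a.id
    GROUP BY o.region
    LIMIT 100
""")
print(cur.fetchall())

# Step 2: once the shape of the answer looks right, promote the same logic
# into an operational pipeline (materialize it in the warehouse) so repeated
# runs get local performance instead of paying the federation hop each time.
cur.execute("""
    CREATE TABLE warehouse.analytics.revenue_by_region AS
    SELECT o.region, SUM(o.amount) AS revenue
    FROM postgres.sales.orders o
    JOIN warehouse.crm.accounts a ON o.account_id = a.id
    GROUP BY o.region
""")
cur.fetchall()  # consume the result to let the DDL complete
```

The two steps mirror the advice above: pay the federation hop once to validate, then materialize so the operational SLA doesn't depend on it.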
Yeah, makes a lot of sense. Makes a lot of sense. Where do you think a company should start from? Let's say you have a medium-sized or small company that's at the point where they want to start implementing some kind of data initiative and build their data warehouse and all that stuff. Where should they invest first? Is it the ETL, the virtualization, or both?
Yeah. And this is where I'm going to be a little controversial, because what I'm going to say is something you're never going to find in any book or manual that you read out there. I know every book, manual, or consultant you talk to is going to say: start with the data warehouse, start with the data lake. Move everything into one single place so you can find everything. That's what everyone is going to tell you, and that is the current conventional wisdom. The problem with that approach, which we've seen time and time again, is two problems.
One, take it from an old infrastructure guy, moving data sucks, it's hard, it's complex, things break.
So you're going to have to have a long project that probably won't even finish on time after you actually build a data warehouse in the end.
Two, take it from an ex-catalog guy: whatever you just moved in there, you're probably not going to be able to find 80% of that stuff. So your users are now really mad at you for the next 18 months, asking why they can't leverage the data. So this is where
conventional wisdom breaks. And look, it was relevant 20 years ago when you didn't have that
many different data sources, when you didn't have that much data. But when you
now have millions and billions and tens of billions of tables and multiple data sources
and types in a single place, this is just the problem you're going to run into. And so my
suggestion is don't start with that. Do that last. Actually start with, if you can, a data discovery process, right?
And data discovery process meaning, you know, and some will use a catalog, but it doesn't
have to be, but a way to which you know where your data assets are, number one.
When I say data assets, I mean the whole gamut, right?
I mean tables, views, and queries, right?
Start with knowing what you have, where it is, and then start with knowing what people are actually using.
So you have a way to actually prioritize.
Because a lot of people, when they think about doing these types of data lake migrations, data warehouse migrations,
they think they have to move everything.
And I can tell you, nobody uses every single table.
Nobody actually uses every single query.
In fact, most people have a lot of orphan queries, stale queries,
or even stale ETL
jobs and stale tables.
Start with the ones that people actually care about and people are actually using and use
that as a basis to say, okay, this is what we want.
This is what we want.
Let's really make sure we know how to optimize that from a discovery, governance, and performance
perspective.
And if you can do that and you know people are going to use it,
then actually building your data,
like your data warehouse,
first with that set of data
is going to give you the best experience.
It's going to give you the fastest experience
of getting that data,
like data warehouse up and running.
And your customers are actually really happy with you
because they're not waiting 18 months
for you to tell them,
okay, it's ready to use.
So I would say start with that discovery process to rationalize what you have and where it
is and why people are using it and what are the most popular ones.
Then from there, like I said, the data virtualization is great for you to validate, and then having a data warehouse or data lake for that fast local performance comes next. So that's kind of the three steps that I would recommend.
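The discovery-first step Kaycee recommends, ranking assets by actual usage before migrating anything, can be sketched in a few lines. This assumes a generic query log with hypothetical field names; real warehouses expose similar query-history metadata under different names and views.

```python
# A hedged sketch of usage-based prioritization over a generic query log.
# The log rows (dicts with "table_name" and "queried_at") are hypothetical;
# real warehouses expose similar query-history views under other names.
from collections import Counter
from datetime import datetime, timedelta

def rank_tables_by_usage(query_log, days=90):
    """Count recent accesses per table; orphan/stale tables sink to the bottom."""
    cutoff = datetime.utcnow() - timedelta(days=days)
    counts = Counter(
        row["table_name"] for row in query_log if row["queried_at"] >= cutoff
    )
    # Most-used first: migrate and optimize these; skip what nobody touches.
    return counts.most_common()

# Example: two live tables and one orphan that would not make the migration cut.
log = [
    {"table_name": "orders", "queried_at": datetime.utcnow()},
    {"table_name": "orders", "queried_at": datetime.utcnow()},
    {"table_name": "accounts", "queried_at": datetime.utcnow()},
    {"table_name": "old_backup_2019", "queried_at": datetime(2019, 1, 1)},
]
print(rank_tables_by_usage(log))  # [('orders', 2), ('accounts', 1)]
```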
Kaycee, question on that.
So this is such an interesting topic.
So you talked a lot in the context of,
okay, you kind of already have these disaggregated sources
and the go-to conventional wisdom today is
just get everything collected into a warehouse or a data lake.
Let's just imagine a world
where you can start over from scratch, right?
You don't have the legacy, which, I know, just entertain me here, but yeah.
But let's say you are, you know,
you are starting out
and you didn't have to deal with,
you know, sort of this legacy,
you know, sort of integration debt
and technical debt, you know, from a Franken legacy, you know, sort of integration debt and technical
debt, you know, from a Frankenstein stack and all that sort of stuff.
Would that change the way that you approach, you know, sort of augment or building out
or sort of, you know, scaffolding the analytics, you know, infrastructure and practice inside
of a company?
Temporarily, yes.
And why I say temporarily, yes, is I've seen many examples
where people say, hey, we're building it from scratch. The problem is, where does the data
come from? And there's only so much of it you can control for that's your internal data.
The minute a business starts expanding, we have to take on new partnerships. Oh, hey, look,
their data is from another source and they have to pipe that to us.
Yep.
Um, the minute marketing starts going, Hey, I actually need this type of
third-party data, new social media feeds.
There's new data being added.
Right.
And what happens is, this is just a cycle that has played itself out over and over again: you can probably get that started for, like, your core main app, your net-new thing. And the minute someone says, we're growing as a company, we're coming out with a new line of business, a new app, someone there in the IT and development stack goes, well, I don't want to be on the same stack as yours. No way, man. I want independence to be able to do my own thing. So what happens is, it's a temporary solution where you get to say, I'm going to build this thing from scratch. Eventually you get into the world where,
Oh my gosh, I do have data in multiple places.
Sure.
They might be newer systems, right?
You could, you could say, Hey, all my data are now all cloud data, you know,
platforms and data warehouses, but it's still, there's still separate formats.
There's still separate APIs you got to connect to.
And if you try to do cross-source analytics, you're still going to run into the problem that I just talked about.
For sure. For sure.
Temporarily. You could be living in bliss for a little bit, but eventually you've got to pay the piper, man.
Yeah. It's like, you know, you start a new job and your calendar's empty and your inbox is empty, and you're like, wow, I have so much time just to work on stuff. And then, you know, two or three weeks later, the train is off the tracks.
Okay.
I'd love to hear.
So we've talked a ton about the problem.
This has been super helpful.
How do you solve some or all of those types of issues with Promethium? And how does the product do it?
Like, what's your approach?
Yeah.
Well, number one, I would say,
whatever you think you've figured out,
there's nothing more humbling
than actually going out to customers
and getting kicked in the face.
And so we've had the luxury of getting kicked many times.
You know, I used to be much better looking.
You have some aggressive customers.
Have you worked with
the new engineers?
No, I'm kidding.
No, no, it's all good.
I think, you know,
what you think you can do
in the real world,
it's always very different, right?
And so one of the things
that we realized very early on
was you got to connect
to every data source out there for the most part, right?
Like, just assume every customer you're going to walk into that has this problem probably has a smattering of relational data sources, data lakes, data warehouses, cloud, Hadoop, you name it, times two, right? That's a given. Just assume that's going to happen.
So right off the bat, that means you do have to know how to connect everything very quickly.
And when I say connect, I actually mean being able to figure out what they have or being
able to show people what they have very quickly.
So the old way of, well, I'm going to load everything into me.
I am going to connect, but I'm going to copy everything into me.
Well, that's a horrible idea, right?
Because you're now actually creating yet another data silo.
And number two, the performance impact of actually going and scanning every system to do that, it's awful.
So some of the earlier versions of data catalogs went through that problem.
And I can tell you, a lot of times, it would take six months to just finish scanning, right?
By which time, you're now behind by six months.
So at Promethium, we've actually figured out not only how to connect,
but how to very quickly within minutes
kind of give you a logical view of every table,
every query, every view that you have.
And then you got to figure out
when you deal with enterprises
that not everyone's a good citizen,
not everyone puts all their data
in the database, the data warehouse.
So you have to figure out
how to get them from, like, Git repos.
I kid you not, right?
And being able to do the same thing. So just the ability to connect and give you this normalized view is one thing that Promethium does. And literally in minutes, literally in one day: we've had customers tell us, you've shown me in 15 minutes what my existing legacy data catalog guys took a year and a half to show, right? So you get that global visibility in one day, very quickly.
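A rough illustration of "connect and show a logical view without copying the data" is reading table and column metadata through SQLAlchemy's inspector rather than scanning rows. The connection URL here is a hypothetical placeholder, and this is a generic sketch, not Promethium's actual mechanism.

```python
# A generic sketch of metadata-only connection: list tables and columns via
# SQLAlchemy's inspector instead of scanning or copying rows. The URL is a
# hypothetical placeholder; this is not Promethium's actual implementation.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://analyst@warehouse.example.internal/sales")
inspector = inspect(engine)

# Enumerating from the catalog layer touches no table data, so even a large
# system can be mapped quickly.
for table in inspector.get_table_names(schema="public"):
    columns = [col["name"] for col in inspector.get_columns(table, schema="public")]
    print(table, columns)
```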
Then the next thing you need to do is
help them understand the meaning behind the data
and what it can be used for.
I think this is where the drawbacks of a lot of data catalogs show up. Like, yeah, they can tell you the metadata information and so forth, but is that really that helpful?
If I'm trying to know, can I use this table to answer a specific question?
Right.
Or is it more helpful if I tell you this table has been used to answer these five
questions that are actually very similar to the one that you're asking.
So that ability to actually extract context and how it's actually
being used is super important.
And then the last part that I think is even more important is the ability to actually
let you use the data.
So a lot of the metadata tools, they're only metadata only, or if they do have some
quote-unquote preview, it's very light.
It's a small subset, and you have to move data to preview it, or you have to pay a huge tax.
So this is where Promethium has actually figured out a very lightweight way of actually seeing what's inside and previewing it, but then, when you need to, letting you actually work with the data, right? Actually letting you join, letting you build queries, letting you build visualizations very quickly, on the fly.
And that's a whole different experience because before what people are used to
is I found it, I needed to use it. Let me go call someone else and let's hope they have access to
it and let's hope they can get it for me.
Let's hope that they validate or I can't find it.
It hasn't been built.
Crap.
I can't do it.
Let me go get someone else to do it.
And so, you know, being able to actually do that actually all the way through
is where Promethium shines. And then the part that I think a lot of folks
overlook is you have to make this into a seamless workflow because today these all exist as separate
processes potentially done by separate people using separate tools. I can't naturally assume that what I find in my catalog I can instantly build on the fly, query, virtualize. That doesn't exist today, normally. And so how do you make that not just easy and clear and intuitive, but also performant? You've got to make sure that it performs fast. So for us, our goal has always been: get to the answer in three minutes. Try as hard as you can.
Have a single easy workflow, but get to the answer in three minutes.
And what we found is with analytics,
you don't necessarily need to answer the question right away.
With analytics, when I was an analyst,
most oftentimes, whatever you think that you're going to answer,
it actually wasn't what you're going to answer until you start working the data.
And we look at the data and be like, ooh, hey, I didn't think about that.
Or, ooh, wait a minute, there's something else.
Or, wow, this is wrong.
And so the faster you can get to those points of iteration, as I call it, the better your analysis will be.
The longer it takes, this is where things start getting hairy. It's really like, well, maybe I could convince my boss to just accept this.
All right, let's use the data from three years ago.
So those are the things Promethium can actually help with: not only giving you that fast connection and understanding, but actually allowing you to work with it, all in a single platform
with end-to-end collaboration
between the business person and the data team.
Here's the thing.
As a data analyst,
I have no idea what the marketing guy
really is going to do with it
until the very end, right?
And as the marketing guy,
you have no idea
how to do the gnarly extractions, right?
And pull the data from different data sources.
So why do we actually make you wait until someone finishes that task, only for Eric to say,
hey man, this ain't it.
This is not what I need.
Why not have them collaborate in real time together?
Right?
And that's kind of sort of this new era
of collaborative analytics
that we can just bring into the table.
Well, the dirty secret is that
the marketing person may not know
at all what they want to do
anyways.
I won't tell if you won't tell.
I know Costas
has a bunch of questions.
One specific question for me before
I hand the mic over.
You mentioned giving context to the data, right?
So show me everything that I have, right?
Which is really useful.
I mean, goodness gracious, like even in small companies like ours,
it's like there's nooks and crannies already in the warehouse, you know?
So that's helpful.
And then you talked about, you know,
this table has been used to answer five other
questions like this one. Yeah. That on the surface feels like it's, there's a very high
level of subjectivity and like context there. How do you do that? I mean, are you like, you know,
sort of diffing SQL queries that have been run on the table or like, how does that even work?
Wow.
How much time do you have?
This is actually part of the secret sauce, right? Of Promethium, and one that we actually have a patent on. And so we figured it out kind of both ways.
One is, how do you actually figure out the semantics and also the context, right, of a query or a table? How do you figure out the relationship it has with other tables and other queries? Because you also have to... it's almost kind of like understanding it from a graph perspective, right? A graph database perspective, to understand the multiple relationships that could actually exist between multiple objects. And the object could be a table, it could be a query, it could be a view, right? It could be a tag, right? It could be a BI tool, right? And so figuring out how all these semantic objects actually map to each other is hard, but actually it's very useful, number one.
But number two is also taking advantage of crowdsourcing, right? Knowing what people have rated, reviewed, frequency of access, those types of metrics come into play. So one of the things I learned early on is that very rarely can you rely on one metric to determine viability or relevance, right? Oftentimes in an organization, we look for multiple data points. We look for: has this been used by someone I trust, number one, right? So who actually uses it. Frequency, right? When was the last time, how often was it used? That gives people a level of comfort. And crowdsourcing, right? Four stars, thumbs up, thumbs down. Believe it or not, that gives people comfort. And then, you know, some people want to get a little deeper. They want to look at lineage. Show me, tell me where it actually came from, show me what happened. Show me the transformation that actually happened, and prove it. That way, I can get a sense of comfort with how it was actually built.
And so with Promethium, we actually realized, number one, every organization has multiple things they look at, and it's never just one, unfortunately. But what we found is that everyone probably uses the same six or seven things; they might just weigh them differently. And so we've actually figured out, number one, not only how to get those things, but how to actually create an algorithm to rank, right, based upon those six or seven things in terms of relative importance, and then have it tuned. So it's kind of like our own little PageRank, if you will, right? That kind of determines the level of accuracy, and there's a scoring behind it. And if you don't like it, right, you vote it down. If you don't like it, you don't access it. And so it's always live. And this is where customers
have actually started seeing a lot of value because it's not static.
The problem with data catalogs and data governance tools is that they're kind of static. It's what someone says, or it's what the profile of that data says. But if you don't actually know how it's actually being used, and not just the data itself, but also parts of it and so forth, you don't really get a complete picture.
And so this is where we've been able to do this.
So you don't realize it, but as you're using Promethium to answer questions and very quickly build things out,
you're actually contributing to the governance as well
because you're actually contributing to one of those factors
in the scoring.
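The multi-signal scoring Kaycee describes can be sketched as a weighted sum of normalized signals. The signal names and weights below are illustrative assumptions; Promethium's actual, patented algorithm is not public.

```python
# A hedged sketch of a multi-signal relevance score ("our own little
# PageRank"). Signal names and weights are illustrative assumptions, not
# Promethium's actual patented algorithm.
DEFAULT_WEIGHTS = {
    "avg_rating": 0.25,        # four stars vs. two stars
    "access_frequency": 0.25,  # how often it's used
    "recency": 0.20,           # when it was last used
    "trusted_users": 0.15,     # used by someone I trust
    "lineage_known": 0.15,     # provenance/transformations are visible
}

def relevance_score(signals, weights=DEFAULT_WEIGHTS):
    """Weighted sum of signals normalized to [0, 1]; live usage keeps it fresh."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)

# Example: a well-rated, frequently used table with known lineage scores high.
print(relevance_score({
    "avg_rating": 0.8,
    "access_frequency": 0.9,
    "recency": 0.7,
    "trusted_users": 1.0,
    "lineage_known": 1.0,
}))
```

Because signals like frequency and thumbs-downs update as people use the system, the score stays live rather than static, which is the contrast with traditional catalogs drawn above.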
Fascinating.
That's a very interesting case.
I don't remember if we talked in this show before about data catalogs.
So I'm going to ask you a bit more about it because I also understand that the cataloging
process is something that is quite important in general, but also it has like a central
role in the product itself, if I understand correctly.
Right. So can you give us a little bit of, let's say, background about cataloging? Because it's not a really new term, right? Like, as you said, you worked in the past with catalogs, you had high expectations of catalogs at some point in your career. You got hurt by them, probably have some kind of trauma. How did it start? How are you innovating on it at Promethium? And also, if possible, talk a little bit about the future, too.
Yeah, that's a good question. So I think, you know, catalogs started decades ago as kind of just a way for, yeah, DBAs to be able to annotate things or find things, right? And, you know, you've heard terms like data dictionary, right? It came from just putting in little things that helped me understand what this term actually means, what this column actually means.
And then people started adding in things like lineage, right,
to really understand how the, you know,
because as things move from the sources,
some into the data warehouse, transformation can take place.
And so you want to understand, hey, how did the transformation happen?
So lineage started becoming a big thing.
And then you have things like data quality scores, et cetera, that allow people to rank, you know, the trustworthiness of the data and so forth.
So I would say they all kind of started with a very heavy governance influence for data catalog.
That's where most of them actually have that background.
The one thing that most catalogs have in common is really search.
If you ask most people why they want one, the number one reason is always search.
And I would tell you that once someone's actually bought and implemented a catalog, if you ask them,
hey, which feature do you actually use, right?
Search and tagging is like 80 to 90%.
The rest is like, and the reason is sometimes the rest
either doesn't work that well, it's hard to implement, but for the most part, I
would say you get a catalog to number one, find where things are, find where
they come from and a way to put some sort of information that helps you assess
whether or not this is good or bad.
That's the high level, you know, view of kind of how catalogs usually work.
And then what we find is that recently, people, because like I said, the problem of multiple
data sources, multiple data types and so forth, people are asking more from their catalog.
I want to profile, I want to see cardinality, I want to see statistics and so forth.
And you can do that.
Where the catalog stopped, unfortunately, was always: either I'm going to find things that are already there, meaning good data sets that you can use, right? Or I'm going to find things that are raw tables, which you probably wouldn't know what to do with. And so a catalog never allowed you to actually build. It never allowed you to actually experiment with the data, other than saying, hey, I found it, and Costas said this is good, Eric said this is bad. They each gave it four stars, and it comes from this source, and so forth. What happens is that the user has to make a lot of interpretations before they even know whether or not they can actually use it.
And so if you look at most usage, most catalogs are being used by a data governance team.
Really to, you know, quote unquote, manage whether or not something should be used, whether it poses a risk, who should access it and so forth.
And the reason why is because that next step of actually using it for analytics,
which requires you to actually work with the data, that's a separate persona. It's a separate requirement that a catalog doesn't address, right? The catalog is, like, 80, 90% metadata, right?
It doesn't do that building part.
And so it doesn't worry about performance.
It doesn't worry about scale.
It doesn't worry about, you know, can you actually answer questions and query optimization and all that stuff.
So because of that, you know, take away the marketing.
It's actually not as useful as you might think for analytics, right?
Because if you can't do all that, how are you going to really help a data analyst, right?
Or a data engineer determine this is the data set to use to answer a question. Mm-hmm.
So that's kind of, I would say, catalogs circa 2017, 2018. And so where Promethium has gone is, we realized that the most natively intuitive thing to everyone, regardless of function, regardless of creed, et cetera, is search. Nobody needs to really teach you how to use search.
It's the most natural thing to use.
So we didn't start out wanting to build a catalog.
What we realized was almost everyone can search.
Almost everyone's used to the notion of tags.
Almost everyone's used to the notion of ratings and reviews.
I know four stars means something's better than two stars.
I know thumbs up means it's better than thumbs down, right?
And so if you can leverage the catalog as an entry point
or use that capability as an entry point
and build on top of that.
So once I find something or once I think
these five things are what I want,
then add the building part to it, right?
Then figure out a way for people to prep and model, query, and see the results. Then you have a way
to very quickly get most people
to actually be able to work with
the data as opposed to
having to stop post-discovery
and then having to go and ask someone else to use another
tool.
This is where
you've seen the term data fabric come up.
I'll talk about data fabric, data pipeline, data mesh, and data catalog, right?
Because they're actually not the same.
But the problem is the marketing is so darn confusing, right?
So right now I'm seeing a lot of catalog guys go, ah, I'm a data fabric. Okay.
So the way to think about a data fabric, in terms of what it should do and what it needs to have, is actually as a fabric. Yes, it needs to have the ability to connect to multiple data sources. It needs to have catalog-like functionality, or metadata, in fact metadata governance. But it actually needs to have the data modeling and access layer that you can actually use, and then be able to have some sort of coordination and orchestration layer saying: this is who uses it, this is how you should use it, this is what you should do next, right? That's kind of the broad overall definition of what a data fabric is. It's Gartner's definition as well.
Now we've taken that and kind of modified it a bit in the sense that we think that the
access layer should be
both direct and federated, right? Because if you still require people to move data,
it's not going to be a good experience. And we also believe that you do need visualization because
for a lot of people, that is a better way to validate whether or not this is what you're
looking for or not. I challenge anyone to say,
I can send you a 50-page SQL query
and you can tell me this is the data you're looking for.
Maybe Costas can, because you look like a smart guy. I think you can do it. But if you gave it to me, I wouldn't be able to, my friend.
But I can look at a pie chart and I can say,
yeah, it's probably looking pretty good.
Or the narration and the storytelling to be able to tell you the value.
So we actually kind of go above and beyond
what the standard Gartner definition of a data fabric is. Now, that means
it's doing all these things that cataloging is not doing. Cataloging is stopping at the metadata
management and discovery. It's not getting to the access layer, it's not getting to prep
or visualization. Now data pipeline is just the moving of the data, right? And if you think about,
you know, my dad was an English teacher,
so I probably spend way too much time
analyzing words and their meanings.
So a pipeline has a connotation of it's steel,
it's rigid, right?
Once I put it in, it's just going to move.
Whereas fabric, it's loose, it's flexible, right?
And that is because a fabric is as flexible as a question.
If I ask some questions, a wrong question, I ask another one.
I'm going to iterate.
I'm going to change a question.
A fabric allows you to kind of on the fly, very quickly change what you're looking for,
very quickly build what you're looking for, have that flexibility.
So this is kind of how I think we think about the world. And data mesh: our friends at Starburst, I know, have done a lot of work on the data mesh. I think a data fabric and a data mesh kind of coexist together. I think a data mesh is a framework, right, that encompasses a lot of things, and you can have a data fabric in the data mesh framework. So that's kind of how I see those things.
Yeah.
Yeah.
I agree and it makes sense.
And I think we are still in the process of properly defining all these terms and understanding them.
One last question from me.
So you mentioned at some point that the data catalogs were mainly used by the data governance people in the organization, right? After this evolution of the data catalog into the data fabric, who are the people who use and consume this tool?
Yeah, so we're seeing data analysts and data engineers
now actively using the data fabric to be able to automate
the building of data sets, automate the building of on-demand SQL queries, et cetera, right?
I think the next evolution, or the next iteration, is, as you layer on no-code, and later on NLP and NLG, a lot of business users. Right now, I would say even the kind of fairly technical business analysts could also use the data fabric.
But I think the goal, at least the goal I have, is: how do we get the data fabric into the hands of even the completely non-technical people, the guys that just want to ask a question and get an insight?
And that's where a lot of work around NLP, NLG, and AI, and free text search and so forth
is really going to come to play and
kind of take that to the next level.
So that's the part that makes it very, very interesting because now the fabric can actually
span to laymen, citizens, business folks, business analysts, data analysts, data engineers,
and even the governance team.
And I think when you can have that, that's where things start to make sense, and you can have governance, analytics, BI, all under the same framework. Today that's a necessity, because these crazy governance rules like GDPR and CCPA, they're really hard. They're really, really hard to actually be compliant with. And I can't think of a way to do it if you still live in the siloed world that most people have, where everything is: I only do this, I only do this. It's going to be a nightmare. And so I see the data fabric as: finally, there is a way to actually do this, to drive velocity in decision-making, but also do it in a way that automatically takes care of governance.
I feel like we have more to chat about, to be honest.
But I know that we are close to time here, so I'd like to allow Eric to ask any last questions that he might have. So, Eric, all yours.
I think we're at the buzzer.
I think, I think Brooks is telling us we have to close it down.
Kaycee, whenever we run long, we know that it's a topic that is not only deeply interesting and valuable to us, but also to our audience.
So thanks for digging in.
Thanks for letting us get a little technical.
And thanks for teaching us about not only data catalog,
but helping to further demystify data mesh, data fabric,
and all the other terms that marketers like me
proliferate across the industry.
Am I going to see a blog from you on the data mesh and data fabric now?
Oh, man, I'd have to dig pretty deep for that one.
But cool.
Well, thank you again for your time.
It's been great to have you on the show.
Yeah.
Thank you guys for having me.
I had an absolute blast.
So appreciate it.
Thank you, Kaycee.
Thank you.
I'm glad that Kaycee let me ask him about working as an analyst at the Federal Reserve.
I just had to know, and he said it was boring, which I kind of expected, but at least he built
a database in his spare time, which is pretty cool. I'm glad he was a good sport. I was really
interested by what sounds like a system that learns about the value and context of data over time, which they've built at Promethium. And I mean, it sounds like they
even have a patent on it. That was pretty interesting and is a really compelling way
to think about the challenge of data governance and sort of a self-optimizing system.
And I don't know if we've talked to a guest who's brought up that approach yet, which is really interesting. So that was my thing to think about for the week.
Ah, yeah. I totally agree. I think this time I'm going to think about the same thing as you. It is a bit surprising that we haven't heard more about this ever-changing nature of data and how it impacts the things that we do, the products that we build, the infrastructure that we design, and those things. So yeah, I think it's something very, very interesting. Intuitively, I would say it's very important. It probably makes more sense to face this challenge when we are talking about data cataloging, because, I mean, it makes a lot of sense that this temporal nature of data is more obvious there. But I think it's something that has a much broader impact on pretty much all data and the products around it. So I think we both should keep a mental note to ask more about that, I guess, from now on.
I agree.
All right. Well, many more great
episodes coming up. Subscribe if
you haven't, and we will catch you on the next show.
We hope you enjoyed this episode of the
Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new
episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.