Drill to Detail - Drill to Detail Ep.105 ‘Data Catalogs, Data Discovery and Data Lineage for the Modern Data Stack’ with Special Guest Shinji Kim
Episode Date: June 1, 2023. Mark is joined by Shinji Kim, Founder and CEO of Select Star, to talk about their mission to re-invent data catalogs, data discovery and data lineage for the modern data stack.
Data Discovery vs. Data Observability: Understanding the Differences for Better DataOps
Select Star and dbt Labs Partner for Better Data Discovery on dbt
Select Star: Free Trial
Transcript
Hello and welcome to another episode of Drill to Detail and I'm your host Mark Rittman.
So I'm very pleased to be joined today by Shinji Kim, founder and CEO of SelectStar.
So welcome to the show, Shinji, and it's great to have you join us.
Thanks, Mark. Great to be here.
So Shinji, maybe to start with: what's your role at SelectStar, and what does SelectStar do?
Sure. So I'm the founder and CEO of SelectStar.
We are an automated data discovery platform that helps everyone to find and understand their own data in the organization.
Before we get into, I suppose, the detail of SelectStar and what you do
and the way it works with the modern data stack and so on,
let's maybe start with a little bit about you.
So I think you worked at Akamai, first of all, and then Concord Systems.
Is that correct?
And maybe just tell us your route really into the industry
and how you got to be founder of SelectStar.
Sure. So SelectStar is my second company in the data space. The first company I started was called Concord Systems back in 2014, when I was working at a mobile ad network. We were growing very fast, and we were dealing with a lot of event data. We spun out a technology to process this event data in real time.
And later Akamai bought my company to kick off their new strategic initiatives around IoT.
And that's how I ended up at Akamai.
And from there, I started working with a lot of enterprise customers where they had a lot of need around large scale real time events processing.
And I noticed that a lot of these companies were now getting supported and also moving to the cloud to be able to collect, process, and also store gigantic amounts of data. The new challenge that was starting to creep up, though, was how the organization could actually utilize that data after the collection, processing, and storage.
So this was a very interesting problem for me.
I felt like as a data practitioner in the past, as a data engineer or data scientist
or as a product manager, when I had to use data, having good context on the data has always been key for me to make better data-driven decisions.
And today, so many companies and so many people within the organization may have access to the data warehouse or to the data repository. At the same time, it's actually very hard to utilize and leverage the data that they have, because there is just too much of it and the data models are constantly changing.
So this is really where I was coming from with SelectStar.
I wanted to solve the problem around data discovery, so that anyone who has access to data is able to utilize it for what they need to do, without having to go around asking others. That single source of truth for data context is what we are trying to automate with SelectStar today.
Let's get into some of those topics a bit more really in this conversation.
And we talk about discoverability, data catalogs, data lineage, and why at the moment they're particularly hot topics in the modern data stack space.
So maybe start with discoverability and data catalogs.
Okay. So first of all, let's start with what you mean by discoverability, really. What does that mean to you, what does it look like in action, and why is it valuable to people?
Sure.
The way that we define data discovery, or data discoverability, is really around data consumers being able to find and understand data, so that they can actually utilize it fully as it was intended. A data catalog is more of, I guess, a feature that can help, because through data catalogs you can actually see all the metadata, or what is available as a data asset. In order to make discovery really work, the data catalog is one of the pieces, and I see full data discovery and data discoverability as a capability of a data platform that needs to be there in order for data to be truly utilized within an organization. Does that make sense?
Yeah, yeah. So you mentioned data assets there. So are we just talking about tables in a database,
or is the definition of a data asset kind of wider?
I mean, what would be within the scope, really,
of what you consider for this problem to be solved?
Yeah, a lot of data catalogs do catalog all the assets. So, just like what you mentioned, this would include tables and dashboards, but it could also include assets like documentation, or pipelines, diagrams, things like that related to the data models.
What we also find important regarding data discovery, and improving the discoverability of data, is also surfacing the data relationships between the assets. Where did the data originate from? Are there any dependencies? And where is the data actually being used? By whom? What type of analysis has already been done, and how can this data be joined with other data assets?
And I think that's one of the key aspects of discoverability, because most of the time you don't just use one table or one dashboard.
You are going to look at multiple different data assets and then try to figure out what
would be the best way to slice and dice the data or create a model on top of it.
So I think first, getting all the assets in one place, which now I would say a lot of good cloud data warehouses and lakes are built for, is the start. And beyond having some catalog, for discovery you also need a way to define and be able to see these relationships. And this is one part that we try to detect automatically, by providing automated data lineage, popularity, ER diagrams, and so forth.
Okay.
So the way I came across SelectStar a while ago was one of our clients
was looking to do exactly what you're saying there.
And the question came up at the time, really, as to why we couldn't just do this by reading in, I suppose, the information schema from, say, Snowflake, maybe connecting to Looker, and bringing the repository for the project through from GitHub.
Why is it difficult or why is it a challenge or harder than you think
to just do this yourself by connecting to each of those stores' metadata layer
and bringing it in yourself?
What's the challenge in that?
What's the hidden kind of complexity, really?
With a lot of traditional data catalogs, or many open source projects, if you were to try to connect all the metadata on your own, that would give you a baseline of having all the metadata together. But the aspect of understanding the relationships between these assets will be very hard. Because in order to do so, you also need to understand how the data is actually used, and this can come from many different ways. For SelectStar, and many other companies, they try to understand the relationships of the data by parsing through SQL queries and activity logs, looking into the query plan, things like that.
And that stage of parsing and processing does require a significant amount of R&D time and effort to make it really work. So I would say, yeah, that's one of the reasons why it's hard to do this on your own, especially if you want really high quality on the data discoverability side.
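The DIY baseline Mark describes, reading the warehouse's own catalog views, really is easy to get started with; as Shinji says, it's the relationships that are hard. A minimal sketch of that baseline, using SQLite's built-in catalog as a stand-in for Snowflake's INFORMATION_SCHEMA (the table and column names here are made up for illustration):

```python
import sqlite3

# Build a toy "warehouse" with two tables, standing in for Snowflake.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (id INTEGER, amount REAL, user_id INTEGER);
    CREATE TABLE dim_users (user_id INTEGER, email TEXT);
""")

# The DIY catalog: list every table and its columns from the engine's
# own metadata, the moral equivalent of querying
# INFORMATION_SCHEMA.COLUMNS in Snowflake.
catalog = {}
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
for (table,) in tables:
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    catalog[table] = [c[1] for c in cols]  # c[1] is the column name

print(catalog)
```

This hands you the list of assets, but it says nothing about how raw_orders relates to dim_users, which dashboards read them, or who queries them; that is the part that needs query-log parsing.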
Yeah. Okay.
What about the concept of this tribal knowledge, really, or the knowledge of the organization? I mean, things that you can't necessarily work out from just looking at data dictionaries and so on, but the knowledge of the team. Is that also part of the information you think is important to be discoverable and stored with what you're doing?
Yeah, for sure. There's so much we can do automatically just by looking at the queries, activities, and history. But there is a lot of business context, and semantic-level context, that the data models and metadata may not represent. So I think that is one aspect of the context
that we usually try to make it as easy as possible
for our users to add as a form of documentation,
defining metrics or referring to or mentioning other data assets
so that there is a linkage between multiple assets
within the same environment.
I would say true data discoverability really happens from both the automation side as well as contributions from the data producers and consumers utilizing the data.
So this is a space where SelectStar aren't the only solution, I suppose.
And there's been other attempts before SelectStar to try and do what you're talking about.
So when you came to look at this as an opportunity, what were the solutions that you saw on the market or the approaches you saw on the market that were there?
And why did you think that they weren't enough?
And what is the unique innovation or difference that SelectStar brought to the market, do you think?
How did you solve it in a uniquely new and valuable way, do you think?
Sure, that's a good question. So when I decided
to start SelectStar as a company, I was doing a lot of different market research, talking to
various sizes of the companies that had this problem around data discovery. What I found
was that there are usually like three camps. First and foremost, companies that have tried documenting everything
in their wiki or Google spreadsheets or some shared documentation. And this can be a good
start, especially for smaller companies. At the same time, as companies grow and as data models change, it's very hard to keep it up to date.
So this is one part where, as companies grow, as their data teams grow, and as they have more data consumers, even outside of the data team, they start looking for more of a solution that will be integrated with their data warehouse or their data pipelines.
So they go out and start evaluating different tools or open source projects out there. Usually the other two options that they come across would be, one, an open source project like Apache Atlas or Amundsen or DataHub. I think those were kind of the main ones at the time; they were fairly new, but starting to get traction.
Or you can go with a proprietary solution under the category of enterprise data catalog.
And these are tools like Alation, Collibra, Talend.
Data catalogs, or enterprise data catalogs, have been around, I guess, since databases have been around.
For the open source ones, what I'm continuously hearing is that they're fine to start a POC with, but in order for the customer to actually use them in production and make them available for everyone, it actually does take a lot more effort to ensure that the ingestion of metadata happens regularly and without much trouble.
So there is a maintenance effort on the engineering side that needs to go in. And then in order for the catalog
to be actually used by the data team,
it also requires a lot of handholding
because a lot of the open source projects
are really built by developers for other developers.
So there are not as many features built in that are friendly to less technical users.
That's kind of hard for adoption purposes.
And if users don't adopt the solution, then your discovery platform is going to get outdated
also very quickly.
On the enterprise data catalog side, I've talked to a number of customers that have either done a POC or signed up for a year or two of these tools, which ended up just not really being adopted inside the company. First and foremost, because integration just takes a long time, like about six to nine months minimum.
They need to generally have like a dedicated data governance team or some data stewards.
And then getting it to be really useful within the organization takes another stretch of time. So hearing a lot of this about the options currently in the market got me thinking, you know, why isn't there a solution that is easy to use, easy to get started with, and more native to the cloud and the new modern data stack tools that are coming out in the market? And that's kind of where I felt there was an opportunity to fill the gaps, and, yeah, how I wanted to position SelectStar.
Okay, interesting. So in your experience, you mentioned POCs there and adoption. What's the aha moment, really, with your product, where if people do this thing, they see the value in it? What do you find that point is, really?
Yeah, so for us, a lot of our differentiation comes from what we call fast time to value.
It's easy to set up, usually within 15 minutes or so, regardless of how much data you have. And it's also easy to see the initial value of the platform without doing a lot of manual work on top. We are not just providing a tool; we are providing a platform that brings you insights about your data and its relationships that you didn't know about.
So a lot of our customers feel this aha moment about their underlying data model or lineage. Even though they've been utilizing their BI tool and their data warehouse for a couple of years, they discover that, oh, here are some dashboards that I thought would be safe to deprecate, but here are some users that are still looking at this dashboard. And also, oh, I didn't know that there are so many people still querying this raw data table versus our gold table, for example. So I think these little insights that they find as they navigate through SelectStar are really the aha moment that we bring, just within their first day of trying out SelectStar.
And it's a fairly smooth process, which is, I would say, even today, hard to find from a lot of other vendor solutions, because you would have to wait for your POC environment to be ready if you want to test it out with your own data, and go through a long sales process.
Okay. So, yeah, and you mentioned data lineage there.
And so when I did a POC with SelectStar, the context was we had, I think, Snowflake, dbt, and Looker as the stack. And the business problem was trying to understand, like you said, for a particular dashboard, what were the tables and what were the sources and so on that were needed to populate that dashboard. And then also, as well, sort of taking a source and saying, this source here, what objects does this enable downstream, really, in our data model? And that linking of the metadata from, say, dbt Cloud or dbt, through Snowflake and through into Looker, that connection there was the bit you couldn't easily do yourself using just the individual metadata stores.
So maybe let's talk about data lineage, right?
And which I think is the next,
you've got discoverability catalogs
and lineage is probably the thing
that draws people to this really.
So maybe just to outline,
what does data lineage mean?
And why is it important for people really
to understand the lineage of their dashboards and so on?
Sure. At core, what data lineage really shows is how different data sets are related to each other
because during ETL, or during transformation, a lot of data will be either picked or changed to create another set of data that will be more relevant for specific use cases.
So just as an example, you may start with production database tables like users, companies, accounts, or transactions, whereas I might have a table that's more focused on the analysis. So it could be more like customer transactions, which may have joined data between the customer user data, as well as the transactions and where the transaction happened, and so forth.
In lineage, you will be able to see all the source tables that make up the customer transactions table. And what that means is you'll be able to understand the dependencies, because if the customer transactions table is out of date, or something's off, that's probably because one of the source tables is also out of date, for example. So, yeah, I think that's the very simple version of data lineage: it really gets you to understand where the data is coming from and where it's going. For the example that you gave, with Snowflake, dbt, and Looker all combined together, we look at data lineage as really this end-to-end data flow.
Where was data originated?
And where is it going?
Where are all the places it's going?
And how are they arriving at different places?
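The simple users/transactions example above boils down to a small dependency graph; a minimal sketch of walking it upstream, the way you would when the customer transactions table looks stale (all table names here are hypothetical, not SelectStar's model):

```python
# Upstream lineage for the example tables: each key depends on the
# tables in its list. Names are illustrative only.
upstream = {
    "customer_transactions": ["users", "transactions", "locations"],
    "transactions": ["raw_events"],
    "users": [],
    "locations": [],
    "raw_events": [],
}

def all_sources(table):
    """Return every table, direct or transitive, that feeds `table`."""
    seen = set()
    stack = [table]
    while stack:
        for dep in upstream.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# If customer_transactions is out of date, these are the suspects:
print(sorted(all_sources("customer_transactions")))
# ['locations', 'raw_events', 'transactions', 'users']
```

The hard part a discovery platform takes on is not this traversal but deriving the edge list itself from queries and logs.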
So in order to compose that lineage, we look at, first and foremost, any of the SQL queries that make up the creation of the tables or views, or any ways that data is being entered. I'm talking about when you're doing an insert, update, or merge that selects from other database rows and columns; those will all be considered in order to generate lineage, and marked as the sources of the tables.
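That parsing stage is where the R&D effort mentioned earlier lives. A deliberately naive sketch of extracting lineage from a single statement shows the idea; the regexes here are purely illustrative and would break on CTEs, subqueries, and quoted identifiers, which is exactly why production-grade parsing is hard:

```python
import re

# Toy lineage extraction: find the target of a CREATE/INSERT/MERGE and
# the tables referenced after FROM or JOIN. A real implementation needs
# a full SQL parser; this regex approach is only a sketch.
TARGET = re.compile(
    r"(?:CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW)|INSERT\s+INTO|MERGE\s+INTO)"
    r"\s+([\w.]+)", re.I)
SOURCE = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.I)

def naive_lineage(sql):
    """Return (target table, set of source tables) for one statement."""
    target = TARGET.search(sql)
    sources = set(SOURCE.findall(sql))
    return (target.group(1) if target else None), sources

sql = """
CREATE OR REPLACE TABLE analytics.customer_transactions AS
SELECT u.id, t.amount
FROM raw.users u
JOIN raw.transactions t ON t.user_id = u.id
"""
print(naive_lineage(sql))
# target: analytics.customer_transactions;
# sources: raw.users and raw.transactions
```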
And on top of that, for something like dbt, which is considered more the transformation layer, we will map which model or which part of the pipeline a table is actually connected to, by also reading through the metadata of dbt, like the manifest.json.
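dbt's manifest.json artifact does expose model-to-model dependencies directly; a minimal sketch of reading them follows. The manifest fragment below is handwritten for illustration and carries only a fraction of the fields a real manifest has:

```python
import json

# A handwritten fragment in the shape of dbt's manifest.json: each node
# lists the nodes it depends on.
manifest = json.loads("""
{
  "nodes": {
    "model.shop.customer_transactions": {
      "depends_on": {"nodes": ["model.shop.stg_users",
                               "model.shop.stg_transactions"]}
    },
    "model.shop.stg_users": {"depends_on": {"nodes": []}},
    "model.shop.stg_transactions": {"depends_on": {"nodes": []}}
  }
}
""")

# Invert it into model -> direct parents, the edge list a lineage
# graph is built from.
edges = {
    name: node["depends_on"]["nodes"]
    for name, node in manifest["nodes"].items()
}
print(edges["model.shop.customer_transactions"])
# ['model.shop.stg_users', 'model.shop.stg_transactions']
```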
For the Looker side, and also for a lot of the BI tools, we have a separate integration with the API. So we will first map out all the assets of those BI tools, and then we will start looking at how these dashboards are generated. Are they coming from some LookML query? Is there a custom SQL query involved? And if we are looking at LookML, what does the model actually look like? So we have a separate LookML parser that connects the LookML models, based on the connection that Looker has. And this is a very intricate part of the integration that we get into a lot, because every BI tool has its own way of treating data sets. So, yeah, something like Tableau will have a virtualized version of data sources, and a lot of the dashboards are reading off those published data sources or data extracts. For Power BI, or Mode, they also have different ways of how they pull out data. Some of them are all live connections; some of them have their own data models built in. So the important part here,
and in order for lineage to truly make sense to our end users, is that we really try to bring the native models of the BI, how the BI tool looks at and treats the data assets, into the SelectStar platform, rather than just saying, oh, here are the dashboards and here are the tables.
We want to show what type of hoops or processes
it goes through between the tools as well, because
this also gives not just more detailed information, but opportunities for our customers to be
able to remodel, optimize, and make their data model between the data warehouse and
BI tools better.
Okay.
Because one of the things that the client we were working with at the time wanted to be able to do was to understand not only which sources the information came from, so for a metric in Looker, for example, not only which tables and which sources it came from, but how the data had been transformed and what had been done to it en route to actually being displayed to them.
So what had been filtered out or how had this been calculated?
Is that something that, I mean, certainly I can see how that could be a challenge.
Is that something that SelectStar can surface or is it too hard?
Or what's your view on understanding what's happened to the data en route, as well as where it comes from?
Yeah, so this is a level of detail we started surfacing since, I would say, the second half of last year. We've noticed that, beyond just metadata being mentioned within a query, we can also introspect whether this field, or this value of the data, is being used as-is, or whether it was aggregated, say a transactions column going into a total transactions column, or whether there was any other transformation done. And, last but not least, has this field been used as a filter or as an actual value?
And those details all make a difference in the customer's eyes, because when they are deprecating a column, they don't just want to see the places it's used as a value; they also want to know if the field, or its values, are translated into any of the dropdowns that act as a filter within a dashboard.
Another way we are leveraging this usage of the column during transformation is by understanding which are the as-is transformations, where there's no transformation done and we are just using the data as-is. In those cases, we can safely propagate documentation, like column-level documentation or column-level tags, downstream and upstream, so that if you have documented one table, your source table, that will be translated and propagated all the way downstream, wherever that field is being used.
And this specific feature actually increases the column documentation fill rate for our customers, usually two to five times on average, which is pretty cool to see, because nobody likes doing detailed documentation all the time, especially at the column level. Whereas with this feature, yeah, customers can document once and have it updated everywhere, without having to manually find where else it can be used or applied.
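The as-is propagation idea can be sketched as a small fixed-point pass over column-level lineage edges. Everything here, the edge list and the table and column names, is made up for illustration; the conversation doesn't describe SelectStar's actual mechanism in this detail:

```python
# Column-level lineage edges tagged with whether the value passes
# through untransformed ("as_is") or is changed. Names are illustrative.
edges = [
    ("raw.users.email", "stg.users.email", "as_is"),
    ("stg.users.email", "mart.customers.email", "as_is"),
    ("raw.orders.amount", "mart.orders.total_amount", "aggregated"),
]

def propagate(docs):
    """Copy documentation downstream across as-is edges until stable."""
    docs = dict(docs)
    changed = True
    while changed:
        changed = False
        for src, dst, kind in edges:
            if kind == "as_is" and src in docs and dst not in docs:
                docs[dst] = docs[src]
                changed = True
    return docs

# Documenting the source column once fills the stg and mart copies too,
# while the aggregated column is left for a human to describe.
docs = propagate({"raw.users.email": "Primary contact email."})
print(docs)
```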
Okay. Interesting. Interesting.
And I suppose a new thing, something that's come to the market recently and has certainly had quite an impact, is this concept of metrics layers and semantic layers, from dbt Labs and from other places.
How does that fit into your thinking, really?
Is that something that is separate to what you're doing,
or can you imagine the dbt semantic layer being part of the metadata you ingest as well?
I wouldn't call it a competitor.
We already have a lot of dbt customers that integrate their dbt docs with us, and supporting dbt metrics and the semantic layer for dbt Cloud is in our plans this year.
We already have a feature called Metrics, where we allow customers to define a metric, whether it's a column or a measure from your BI tool, and we will start pulling all the suggested dimensions and dashboards that are related to that metric automatically, so that you don't have to manually define them, and you can just add high-level documentation that defines what it means semantically and what part of the business process it fits in.
And so I'm quite excited about the new releases and updates that dbt is making. For us, as more of a discovery platform, it's something that we will integrate closely with.
Okay, okay. So just to get to the last part, the last pillar of the product, really: data governance. What do you think of when we talk about data governance, and how does it relate to the other capabilities of SelectStar?
So data governance is a fairly large term. I think I used to call it more data management, but governance also makes sense, mainly because the purpose of governance is trying to keep data in control.
With data democratization and the modern data stack overall, many companies now give access, and whether they realize it or not, a lot of people have access to data. And it's not just about giving the access; with that access, you want to make sure that the data is also being used correctly: that people are finding the right data, that they're utilizing it correctly, and that you have good quality data, so that people actually trust the data they're using. Data governance has been around for a while, especially for enterprises, but more from a data access perspective. And then it has evolved towards meeting compliance or privacy requirements for treating sensitive data.
The way that we think about data governance is that in today's era of the modern data stack and modern organizations, given how companies work with and treat data, it's important to have a good understanding and a good landscape of what your data warehouse, data lake, data tools, and models look like, in order for you to run an effective data governance program.
So that really comes down to the data managers, or director and C-levels, or it could be the data governance team, being aware of: what are the most important data sets that we need to make sure have quality checks, the right owners, and up-to-date documentation? What are the data sets that should probably be deprecated, so they don't confuse the data consumers, which can also save a lot of cost? Who should generally be the owners of these data sets and tables? And defining what the levels of sensitivity of data are, and what the general process will be for people to get access.
A lot of it, I feel, is a balance between making sure that data is in control, and remembering that you collected and stored all this data so that more people can actually use it.
And I think that is the part that is important,
and that's where we see data discovery really come into play.
Okay, okay.
So what's next then for SelectStar?
It sounds like you've got a pretty sort of solid set of features for doing discovery, cataloging, and data lineage and so on.
What's the next thing you're trying to solve really?
And what do you think the next problem to be solved in the industry is then really?
Yeah, it's been a really exciting last couple of years, building features, working with customers, scaling our platform.
I would say there are still, you know, a lot of different problems to solve. Where we are heading is really: we have a really good set of these pillars of data discovery, including lineage, usage analytics like popularity, access control, and so on and so forth.
What we are planning to do going forward, and more long term, is to have features that are more specific to the use cases that we see. So, for example, what I just discussed about data governance: we have all the capabilities to support that, but it's not necessarily packaged as "here is data governance for you". So these are aspects that we are starting to look into, so that it's a lot easier for our customers to leverage SelectStar for specific purposes. And then there are always,
you know, more integrations and next-level features that we are working on that leverage these baseline pillars as well. So, yeah, I'm very excited for this year and all the releases that we have planned.
Good. And you must be integrating ChatGPT into it as well? It seems everybody is trying to put a conversational element into their product.
Yeah, so it's a yes. It's this month's kind of big thing, really.
So,
how do people find out more about SelectStar? How would they read about it, trial it, and so on?
Yeah, I mean, everything's at selectstar.com. You can start a free trial, and you can also read different white papers, blog posts, different interviews that we've done, and conference talks that we gave in the past, on the website under Resources. And if you join our newsletter, we send a monthly newsletter that highlights some of the things that we've done over the last month, as well as places that we will be, like different conferences or events coming up. So, yeah, our website basically has everything; we try to make sure that it's all included there, so that people can easily find the news and events coming up.
Fantastic, that's really good. Well, Shinji, thank you very much for coming on the show. It's been great speaking to you, and great having you talk us through the space in general and what SelectStar are doing.
So thank you very much and appreciate that.
And, yeah, good luck with the product going forward.
And thank you.
Thanks, Mark, for having me on the show.
And, yeah, this was a great chat. Thank you.