Screaming in the Cloud - How Data Discovery is Changing the Game with Shinji Kim
Episode Date: September 22, 2022

About Shinji
Shinji Kim is the Founder & CEO of Select Star, an automated data discovery platform that helps you to understand & manage your data. Previously, she was the Founder & CEO of Concord Systems, a NYC-based data infrastructure startup acquired by Akamai Technologies in 2016. She led the strategy and execution of Akamai IoT Edge Connect, an IoT data platform for real-time communication and data processing of connected devices. Shinji studied Software Engineering at the University of Waterloo and General Management at Stanford GSB.

Links Referenced:
Select Star: https://www.selectstar.com/
LinkedIn: https://www.linkedin.com/company/selectstarhq/
Twitter: https://twitter.com/selectstarhq
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
Software problems should drive innovation and collaboration,
not stress and sleeplessness and threats of violence.
That's why so many developers are realizing the value of AWS AppConfig feature flags.
Feature flags let developers push code to production,
but hide that feature from customers so that the developers can release their feature when it's ready.
This practice allows for safe, fast, and convenient software development.
You can seamlessly incorporate AppConfig feature flags into your AWS or cloud environment and ship your features with excitement, not trepidation and fear.
To get started, go to snark.cloud slash appconfig.
That's snark.cloud slash appconfig.
I come bearing ill tidings.
Developers are responsible for more than ever these days.
Not just the code that they write, but also the containers and the cloud infrastructure
that their apps run on because serverless means it's still somebody's problem.
And a big part of that responsibility is app security from code to cloud.
And that's where our friend Snyk comes in.
Snyk is a frictionless security platform that meets developers where they are, finding and fixing vulnerabilities right from the CLI, IDEs, repos, and pipelines.
Snyk integrates seamlessly with AWS offerings like CodePipeline, EKS, ECR, and more, as
well as things you're likely to actually be using.
Deploy on AWS, secure with Snyk. Learn more at snyk.co slash scream.
That's s-n-y-k dot c-o slash scream.
Welcome to Screaming in the Cloud. I'm Corey Quinn. Every once in a while, I encounter a company
that resonates with something that I've been doing on some level. In this particular case,
that is what's happened here, but the story is slightly different.
My guest today is Shinji Kim, who's the CEO and founder at SelectStar.
And the joke that I was making a few months ago was that SelectStar should have been the name of the Oracle ACE program instead.
Shinji, thank you for joining me and suffering my ridiculous, basically amateurish and sophomoric database-level jokes
because I am bad at databases. Thanks for taking the time to chat with me.
Thanks for having me here, Corey. Good to meet you.
So SelectStar, despite being the only query pattern that I've ever effectively been able
to execute from memory, what you do as a company is described as an automated data discovery platform. So I'm going
to start at the beginning with that baseline definition. I think most folks can wrap their
heads around what the idea of automated means, but the rest of the words feel like it might mean
different things to different people. What is data discovery from your point of view?
Sure. The way that we define data discovery is finding and understanding data. In other words, think about how discoverable your data is in your company today. How easy is it for you to find the datasets, fields, and KPIs of your organization's data? And when you are looking at a table, column, or dashboard report, how easy is it for you to understand the data underneath? Everything encompassed by that is how we define data discovery.
When you talk about data lurking around the company in various places,
that can mean a lot of different things to different folks.
For the more structured data folks, which I tend to think of as the organized folks who are nothing like me, that tends to mean things that live inside of, for example, traditional relational databases or things that closely resemble that.
I come from a grumpy old sysadmin perspective.
So I'm thinking, oh yeah, we have a
Jira server in the closet and that thing's logging to its own disk. So that's going to be some
information somewhere. Confluence is another source of data in an organization. It's usually where
insight and knowledge of what's going on goes to die. It's one of those write once, read never
type of things. And when I start thinking about what data means, it feels like even that is
something of a squishy term. From the perspective of where SelectStar starts and stops, is it bounded to data that lives within relational databases?
Does it go beyond that?
Where does it start?
Where does it stop?
So we started the company with the intention of increasing the discoverability of data, and hence providing automated data discovery capability to organizations. And the part where we see this as most effective is where the data is currently being consumed today. This can be a data warehouse or data lake, but it's where your data analysts and data scientists are querying data, where they're building dashboards and reports, and where your main data mart lives. So for us, that is primarily a cloud data warehouse today, which usually has a relational data structure.
On top of that, we also do a lot of deep integrations with BI tools.
So that includes tools like Tableau, Power BI, Looker, Mode.
Wherever these queries from the business stakeholders,
BI engineers, data analysts, data scientists run,
this is the point of reference we use to auto-generate documentation, data models, lineage, and usage information to give back to the data team and everyone else
so that they can learn more about the data set
they are about to use.
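To ground that in something concrete: here is a minimal sketch of how a popularity-style metric can be derived from warehouse query logs, assuming Snowflake's ACCOUNT_USAGE.ACCESS_HISTORY view (other warehouses expose comparable logs under different names, and Select Star's actual models are more sophisticated than this):

```sql
-- Hedged sketch: rank tables by how many distinct users queried them
-- in the last 30 days. Assumes Snowflake's ACCOUNT_USAGE.ACCESS_HISTORY;
-- this is an illustration, not Select Star's implementation.
SELECT
    obj.value:"objectName"::STRING AS table_name,
    COUNT(DISTINCT ah.user_name)   AS unique_users,   -- the "popularity" signal
    COUNT(*)                       AS access_count
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.direct_objects_accessed) AS obj
WHERE ah.query_start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY unique_users DESC, access_count DESC;
```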
So given that I am seeing an increased number of companies out there talking about data discovery,
what is it that SelectStar does that differentiates you folks from
other folks using similar verbiage in how they describe what they do?
Yeah, great question. There are many players popping up, and traditional data catalogs are definitely starting to offer more features in this area. The main differentiator that we have in the market today is what we call fast time to value. Any customer starting with SelectStar gets to set up their instance within 24 hours.
And they'll be able to get all the analytics and data models, including column-level lineage, popularity, ER diagrams, who the top users are, and how other people are utilizing that data.
Literally in a few hours, 24 hours max. And I would say that is the main
differentiator. Most of the customers have pointed out that setup and getting started has been super
easy, which is primarily backed by a lot of automation that we've created underneath the
platform. On top of that, by making it super easy and simple to use, it becomes very clear to users that it's not just for the technical data engineers and DBAs. This is also designed for business stakeholders, product managers, and other folks to start using as they learn more about how to use data.
Mapping this a little bit toward the use cases that I'm the most familiar with,
this big source of data that I tend to stumble over is customer AWS bills.
And that's not exactly a big data problem, given that it can fit in memory
if you have a sufficiently exciting computer.
But we use Tableau to wind up slicing and dicing that, because at some point Excel falls down. From my perspective, the problem with Excel
is that it doesn't tend to work on huge data sets very well. And from the position of Salesforce,
the problem with Excel is that it doesn't cost a giant pile of money every month.
So those two things combined, Tableau is the answer for what we do. But that's sort of the
end all for us. That's
where it stops. At that point, we have dashboards that we've built and queries that we run that spit
out the thing we're looking at, and then that goes back to inform our analysis. We don't inherently
feed that back into anything else that would then inform the rest of what we do. Now, for our use
case, that probably makes an awful lot of sense, because we're here to help our customers with
their billing challenges, not take advantage of their data to wind up informing some giant model
and mispurposing that data for other things. But if we were generating that data ourselves as a
part of our operation, I can absolutely see the value of tying that back into something else.
You wind up almost forming a reinforcing cycle that improves the quality of data over time and lets you understand what's going on there. What are some of the outcomes that
you find that customers get to by going down this particular path? Yeah, so just to double-click on what you just talked about: the way we see this is that we analyze the metadata and the activity logs, the system logs and user logs, of how that data has been used.
So part of our auto-generated documentation for each table, each column, each dashboard,
you're going to be able to see the full data lineage, where it came from, how it was transformed in the past, and where it's going to.
You will also see what we call popularity score.
How many unique users are utilizing this data inside the organization today?
How often?
Utilizing these two core models and analyses that we create, you can start by first mapping out the data flow, and then determining whether or not a dataset is something you would want to keep, or keep running the data pipelines for. Because once you start mapping these usage models of tables versus dashboards, you may find that there are recurring jobs creating materialized views and tables that feed dashboards nobody looks at anymore. Looking at data lineage as a concept, a lot of companies use it to find dependencies: what is going to break if I make this change to a column or table, as well as to debug any issues currently happening in their pipeline.
So especially when you have to debug a SQL query or pipeline that you didn't build yourself, but you need to find out how to fix it, this is a really easy way to instantly find out where the data is coming from.
But on top of that, if you start adding this usage information, you can trace through where the main compute is happening: which large raw tables are still being queried instead of the more summarized tables that should be used, which tables and datasets continue to get created to feed dashboards, and whether those dashboards are actually being used on the business side.
So with that, we have customers that have saved
thousands of dollars every month
just by being able to deprecate dashboards and pipelines
that they were afraid of deprecating in the past
because they weren't sure
if anyone's actually using this or not. But adopting SelectStar was a great way to do a full spring cleaning of their data warehouse, as well as their BI tool. And there's the additional benefit of decluttering so many old, duplicated, and outdated dashboards and datasets in their data warehouse.
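For the dependency-finding use of lineage described above, a hedged sketch: walking upstream from a dashboard-backing view with a recursive CTE, assuming Snowflake's ACCOUNT_USAGE.OBJECT_DEPENDENCIES view. The view name here is hypothetical, and real column-level lineage also requires parsing query history to capture INSERT and CTAS flows that this view misses:

```sql
-- Hedged sketch: find everything upstream of a (hypothetical) view that
-- feeds a dashboard. Assumes Snowflake's ACCOUNT_USAGE.OBJECT_DEPENDENCIES.
WITH RECURSIVE upstream (object_name, depth) AS (
    SELECT referenced_object_name, 1
    FROM snowflake.account_usage.object_dependencies
    WHERE referencing_object_name = 'REVENUE_DASHBOARD_VIEW'  -- hypothetical name
    UNION ALL
    SELECT d.referenced_object_name, u.depth + 1
    FROM snowflake.account_usage.object_dependencies AS d
    JOIN upstream AS u
      ON d.referencing_object_name = u.object_name
    WHERE u.depth < 10  -- guard against runaway recursion
)
SELECT DISTINCT object_name, depth
FROM upstream
ORDER BY depth;
```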
That is, I guess, a recurring problem that I see in many different pockets of the industry as a whole.
You see it in the usability space. You see it in the cost control space.
I even made a joke about Confluence that alludes to it.
This idea that you build a whole bunch of dashboards and use them to inform all kinds of charts and other systems, but then people are busy. It feels like there's no "and then." One of the most depressing things in
the universe that you can see after having spent a fair bit of effort to build up those dashboards
is the analytics for who internally has looked at any of those dashboards
since the demo you gave showing it off to everyone else. It feels like in many cases,
we put all these projects and amount of effort into building these things out that then don't
get used. People don't want to be informed by data. They want to shoot from their gut.
Now, sometimes that's fine. We're talking about observability tools that you use to trace down outages, and, well, our site's really stable, we don't have to look at that: great, awesome use case. The business-insight level of dashboard just feels like something you should really be checking a lot more than you are. How do you see
that? Yeah, for sure. I mean, this is why we also update these usage metrics and lineage every 24 hours for all of our customers automatically.
So it's just up to date.
As for the part that more customers are asking for, and where we are heading: earlier I mentioned that our main focus has been on analyzing data consumption and understanding consumption behavior, to drive better usage of your data and to make data usage much easier. What we are now starting to see is more customers wanting to extend those capabilities to the part of their stack where the data is being generated. So connecting the same kind of analysis and metadata collection to production databases, Kafka queues, and wherever else the data is first being generated is one of our longer-term goals. And then you'll really have that insight up to the source level: whether the data should even be collected, or whether it should enter the data warehouse at all.
One of the challenges I see across the board in the data space is that so many products tend to have a very specific point
of the customer lifecycle
where bringing them in makes sense.
Too early, and it's: data? What do you mean, data? All I have are these logs, and their purpose is basically to inflate my AWS bill because I'm bad at removing them. And on the other side, it's: great, we pioneered some of these things and have built our own enormous internal system that does exactly what we need to do. It's like, yes, Google, you're very smart. Good job.
And most people are somewhere between those two extremes.
Where are customers on that lifecycle or timeline when using SelectStar makes sense for them?
Yeah, I think that's a great question. Most of the time, the best point for customers to start using SelectStar is after they have their cloud data warehouse set up. Either they have finished their migration, or they're starting to utilize it with their BI tools, and they are noticing that it's not just 10 to 50 tables that they're starting with. Most of them have more than hundreds of tables, and they're feeling that this is starting to go out of control: we have all this data, but we are not 100% sure what exactly is in our database. This usually happens more in larger companies, companies at 1,000-plus employees.
And they usually find a lot of value out of SelectStar right away because we'll start
pointing out many different things.
But we also see a lot of forward-thinking, fast-growing startups at the size of a few hundred employees; they now have a five-to-ten-person data team, and they are really creating the right single source of truth of their data knowledge through SelectStar. So I think you can start anywhere from when your data team grows beyond five people and you're continuing to grow. Because every time you onboard a data analyst or data scientist, you have to go through basically the same type of training on your data model. And it might actually look different each time, because data models, and the new features and new apps you're integrating, change so quickly. So I would say it's important to have that base early on and then continue to grow. But we do also see a lot of companies coming to us after having thousands, or tens of thousands, of datasets, when it's really very hard to operate and onboard anyone. And this is a place where we really shine to help their needs as well.
Sort of the "I need a database" to "help, I have too many databases" pipeline, where at some point people start wanting to bring organization to the chaos. One thing I like about your model
is that you don't seem to be making the play
that every other vendor in the data space tends to,
which is, oh, we want you to move your data
onto our systems at the end.
You operate on data that is in place,
which makes an awful lot of sense
for the kinds of things that we're talking about.
Customers are flat out not going to move their data warehouse over to your environment just because the data gravity is ludicrous.
Just the sheer amount of money it would take to egress that data from a cloud provider, for example, is monstrous.
Exactly.
And security concerns.
We don't want to be liable for any of the data. This is a very specific decision we made very early on at the company: to not access data, to not egress any of the real data, and to provide as much value as possible utilizing just the metadata and logs. And depending on the type of data warehouse, it can also be really efficient, because the query history and metadata system tables are indexed separately; it's usually a much lighter load on the compute side. That has definitely worked to our advantage, especially being a SaaS tool.
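As a small illustration of that metadata-only design choice: everything a catalog like this needs can be read from system views, for example INFORMATION_SCHEMA (Snowflake's flavor is shown below, which includes a COMMENT column), without ever touching the rows themselves:

```sql
-- Hedged sketch: cataloging from metadata alone. Only system views are
-- read; no row-level customer data is accessed or egressed.
SELECT table_schema,
       table_name,
       column_name,
       data_type,
       comment                  -- column description, where one exists
FROM information_schema.columns
WHERE table_schema <> 'INFORMATION_SCHEMA'
ORDER BY table_schema, table_name, ordinal_position;
```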
This episode is sponsored in part by our friends at Sysdig.
Sysdig secures your cloud
from source to run.
They believe, as do I,
that DevOps and security
are inextricably linked.
If you want to learn more
about how they view this,
check out their blog.
It's definitely worth the read.
To learn more about how they are absolutely getting it right from where I sit, visit sysdig.com and tell them that I sent you.
That's S-Y-S-D-I-G dot com.
And my thanks to them for their continued support of this ridiculous nonsense.
What I like is just how straightforward the integrations are.
And it's clear you're extraordinarily agnostic
as far as where the data itself lives.
You integrate with Google's BigQuery,
with Amazon Redshift, with Snowflake,
and then on the other side of the world
with Looker and Tableau and other things as well.
And one of the example use cases you give
is find the upstream table in BigQuery
that a Looker dashboard depends on.
That's one of those areas where I see something like that and, oh, I can absolutely see the value of that.
I have two or three DynamoDB tables that drive my newsletter publication system that I built because I have deep-seated emotional problems and I take it out on everyone else via code.
But there's a small contained system that I can still fit in my head, mostly. And I
still forget which table is which in some cases. Down the road, especially at scale, "okay, where is the actual data source that's informing this? Because it doesn't necessarily match what I'm expecting" is one of those incredibly valuable bits of insight. It seems like that is something that often gets lost; the provenance of data doesn't seem to survive. And ideally, you know,
you're staffing a company with reasonably intelligent people who are going to look at
the results of something and say, that does not align with my expectations. I'm going to dig,
as opposed to the, oh yeah, that seems plausible. I'll just go with whatever the computer says.
It's an ocean of nuance between those two, but it's nice to be able to establish
the validity of the path
that you've gone down in order to set some of these things up. Yeah, and this is also super
helpful if you're tasked to debug a dashboard or pipeline that you did not build yourself.
Maybe the person has left the company, or maybe they are out of office,
but this dashboard has been broken and you're, quote unquote, on call for data.
What are you going to do?
Without a tool that can show you the full lineage, you'll have to start digging through somebody else's SQL code and try to map out where the data is coming from and whether it's calculating things correctly. Usually it takes a few hours just to get to the bottom of the issue.
And this is one of the main use cases our customers bring up every single time: this is now the go-to place whenever there are any data questions or data issues.
The first golden rule of cloud economics is step one: turn that shit off. When people are not using something, you can optimize the hell out of it however you want, but nothing's going to beat turning it off.
One challenge is when we're looking at various accounts and we see a Redshift cluster,
and it's, okay, that thing's costing a few million bucks a year,
and no one seems to know anything about it. They keep pointing to other teams, and it turns into
this giant finger-pointing exercise where no one seems to have responsibility for it.
And very often, our clients will choose not to turn that thing off. Because on the one hand,
if you don't turn it off, you're going to spend a few million bucks a year that you otherwise would not have had to.
On the other, if you delete the data warehouse and it turns out, oh yeah, that was actually
kind of important.
Now we don't have a company anymore.
It's a question of which is the side you want to be wrong on.
And on some level, leaving something as it is, rather than doing something else, is always the more defensible answer.
Just because the first time your cost-saving exercises take out production, you're generally not allowed to save money anymore.
This feels like it helps get to that source of truth a heck of a lot more effectively than tracing individual calls and turning into basically data center archaeologists.
Yeah, for sure. I mean, this is why, from the get-go, we try to give you all the tables of your database, ordered by popularity. So you can see, overall, across all your tables, whether that's thousands or tens of thousands, the most used, with the most dependencies, at the top. And you can also filter to all the database tables that haven't been touched in the last 90 days. Just having this high-level view gives the data platform team a lot of ideas for how they can optimize usage of their data warehouse.
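A hedged sketch of that "untouched in the last 90 days" filter, again assuming Snowflake's ACCOUNT_USAGE views (other warehouses expose similar metadata, e.g. BigQuery's INFORMATION_SCHEMA.JOBS):

```sql
-- Hedged sketch: tables with no recorded access in the last 90 days.
-- objectName in ACCESS_HISTORY is fully qualified (DB.SCHEMA.TABLE).
WITH last_access AS (
    SELECT obj.value:"objectName"::STRING AS fq_table_name,
           MAX(ah.query_start_time)       AS last_accessed
    FROM snowflake.account_usage.access_history AS ah,
         LATERAL FLATTEN(input => ah.direct_objects_accessed) AS obj
    GROUP BY 1
)
SELECT t.table_catalog, t.table_schema, t.table_name, la.last_accessed
FROM snowflake.account_usage.tables AS t
LEFT JOIN last_access AS la
  ON la.fq_table_name =
     t.table_catalog || '.' || t.table_schema || '.' || t.table_name
WHERE t.deleted IS NULL             -- still-live tables only
  AND (la.last_accessed IS NULL     -- never accessed at all
       OR la.last_accessed < DATEADD(day, -90, CURRENT_TIMESTAMP()));
```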
What's more, I tend to find an awful lot of customers are still relatively early in their data journey. An awful lot of the marketing that I receive from the various AWS mailing lists I've found myself on, because I had the temerity to open accounts, has been along the lines of, oh, data discovery is super important.
But first, they presuppose that I've already bought into this idea that, oh, every company
must be a completely data-driven company, the end, full stop.
And yeah, we're a small bespoke services consultancy.
I don't necessarily know that that's the right answer here.
But then it takes it one step further and starts to define the idea of data discovery as: ah, you'll use it to find PII or otherwise sensitive or restricted data inside of your data sets, so you know exactly where it lives. And sure, okay, that's valuable, but it also feels like a very narrow definition compared to how you view these things.
Yeah. Basically, the way that we see data discovery is that it's starting to become an essential capability for you to monitor and understand how your data is actually being used internally. It gives you insights around, sure, what are the duplicated datasets, which datasets have descriptions or not, which may contain sensitive data, and so on and so forth. But that's still around the characteristics of the physical datasets. The part that I think is really important around data discovery, and that is not being talked about as much, is how the data can actually be used better.
So treat it as more of a forward-thinking mechanism. Encouraging more people to utilize data, or to use the data correctly, instead of trying to contain this within just one team, is really where I feel data discovery can help.
And in that regard, the other big part of data discovery is really opening up and having that transparency, even just within the data team. Within the data team, they always feel like they have access: to the SQL queries, and you can just go to GitHub and look at the database itself. But it's so easy to get lost in a sea of metadata that is laid out as just a list. There isn't much context around the data itself. And that context, along with the analytics on the metadata, is what we are really trying to provide automatically.
So eventually, this can also be seen as a way to monitor your datasets, like how you currently monitor your applications through Datadog, or your website with Google Analytics. This is something that can be used as the go-to source of truth around what the state of your data is, how it's defined, and how it maps to different business processes, so that there isn't much confusion around data. Everything can be called the same thing, but underneath, it can actually mean very different things. Does that make sense?
No, it absolutely does. I think that this is part of the challenge in trying to articulate value that is specific to this niche across an entire industry. So much of the messaging is aimed at one or two pre-imagined customer profiles. And that has the side effect of making customers for whom that model doesn't align look and feel like they're doing something wrong, or it makes it look like the vendor who's pitching this is somewhat out of touch.
I know that I work in a relatively bounded problem space,
but I still learn new things about AWS billing
on virtually every engagement that I go on, just because you always get to learn more about how customers view things and how they view
not just their industry, but also the specificities of their own business and their own niche.
I think that that is one of the challenges historically with the idea of letting software
do everything. Do you find the problems that you're solving tend to be global in nature,
or are you discovering strange depths of nuance on a customer by customer basis at this point?
Overall, a lot of the problems that we solve, and the customers that we work with, are very industry-agnostic. As long as you have many different datasets that you need to manage, there are common problems that arise regardless of the industry you're in. We do observe some industry-specific issues, where your data is unstructured, or your data is primarily events, or depending on what the data looks like. But primarily, because most BI solutions and data warehouses operate as relational databases, this is the part where we really try to build a lot of best practices and common analytics that we can apply to every customer that's using SelectStar.
I really want to thank you for taking so much time to go through
the ins and outs of what it is you're doing these days. If people want to learn more, where's the
best place to find you? Yeah, I mean, it's been fun talking here. We are at selectstar.com; that's our website. You can sign up for a free trial. It's completely self-service, so you don't need to get on a demo, but we'll also help you onboard, and we're happy to give a free demo to whoever is interested. We are also on LinkedIn and Twitter under SelectStarHQ. We're happy to help any companies that have these issues around wanting to increase the discoverability of their data, and that want to help their data team, and the rest of the company, utilize data better.
And we will, of course, put links to all of that in the show notes. Thank you so much for
your time today. I really appreciate it. Great. Thanks for having me, Corey.
Shinji Kim, CEO and founder at SelectStar. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment that I won't be able to discover, because there are far too many podcast platforms out there and I have no means of discovering where you've said that
thing unless you send it to me.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business
and we get to the point. Visit duckbillgroup.com to get started. This has been a HumblePod production. Stay humble.