Screaming in the Cloud - How Data Discovery is Changing the Game with Shinji Kim

Episode Date: September 22, 2022

About Shinji
Shinji Kim is the Founder & CEO of Select Star, an automated data discovery platform that helps you to understand & manage your data. Previously, she was the Founder & CEO of Concord Systems, a NYC-based data infrastructure startup acquired by Akamai Technologies in 2016. She led the strategy and execution of Akamai IoT Edge Connect, an IoT data platform for real-time communication and data processing of connected devices. Shinji studied Software Engineering at University of Waterloo and General Management at Stanford GSB.

Links Referenced:
Select Star: https://www.selectstar.com/
LinkedIn: https://www.linkedin.com/company/selectstarhq/
Twitter: https://twitter.com/selectstarhq

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Software problems should drive innovation and collaboration, not stress and sleeplessness and threats of violence. That's why so many developers are realizing the value of AWS AppConfig feature flags.
Starting point is 00:00:58 Feature flags let developers push code to production, but hide that feature from customers so that the developers can release their feature when it's ready. This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig feature flags into your AWS or cloud environment and ship your features with excitement, not trepidation and fear. To get started, go to snark.cloud slash appconfig. That's snark.cloud slash appconfig. I come bearing ill tidings. Developers are responsible for more than ever these days.
Starting point is 00:01:39 Not just the code that they write, but also the containers and the cloud infrastructure that their apps run on because serverless means it's still somebody's problem. And a big part of that responsibility is app security from code to cloud. And that's where our friend Snyk comes in. Snyk is a frictionless security platform that meets developers where they are, finding and fixing vulnerabilities right from the CLI, IDEs, repos, and pipelines. Snyk integrates seamlessly with AWS offerings like CodePipeline, EKS, ECR, and more, as well as things you're likely to actually be using. Deploy on AWS, secure with Snyk, learn more at snyk.co.
Starting point is 00:02:22 That's s-n-y-k dot c-o slash scream. Welcome to Screaming in the Cloud. I'm Corey Quinn. Every once in a while, I encounter a company that resonates with something that I've been doing on some level. In this particular case, that is what's happened here, but the story is slightly different. My guest today is Shinji Kim, who's the CEO and founder at SelectStar. And the joke that I was making a few months ago was that SelectStars should have been the name of the Oracle ACE program instead. Shinji, thank you for joining me and suffering my ridiculous, basically amateurish and sophomoric database-level jokes because I am bad at databases. Thanks for taking the time to chat with me.
Starting point is 00:03:09 Thanks for having me here, Corey. Good to meet you. So SelectStar, despite being the only query pattern that I've ever effectively been able to execute from memory, what you do as a company is described as an automated data discovery platform. So I'm going to start at the beginning with that baseline definition. I think most folks can wrap their heads around what the idea of automated means, but the rest of the words feel like it might mean different things to different people. What is data discovery from your point of view? Sure. The way that we define data discovery is finding and understanding data.
Starting point is 00:03:50 In other words, think about how discoverable your data is in your company today. How easy is it for you to find data sets, fields, KPIs of your organization data. And when you are looking at a table, column, dashboard report, how easy is it for you to understand that data underneath? Encompassing on that is how we define data discovery. When you talk about data lurking around the company in various places,
Starting point is 00:04:24 that can mean a lot of different things to different folks. For the more structured data folks, which I tend to think of as the organized folks who are nothing like me, that tends to mean things that live inside of, for example, traditional relational databases or things that closely resemble that. I come from a grumpy old sysadmin perspective. So I'm thinking, oh yeah, we have a Jira server in the closet and that thing's logging to its own disk. So that's going to be some information somewhere. Confluence is another source of data in an organization. It's usually where insight and knowledge of what's going on goes to die. It's one of those write once, read never type of things. And when I start thinking about what data means, it feels like even that is
Starting point is 00:05:03 something of a squishy term. From the perspective of where Select starts and stops, is it bounded to data that lives within relational databases? Does it go beyond that? Where does it start? Where does it stop? So we started the company with an intention of increasing the discoverability of data and hence providing automated data discovery capability to organizations. And the part that where we see this as the most effective is where the data is currently being consumed today.
Starting point is 00:05:38 So this is like where the data consumption happens. So this can be a data warehouse or data lake, but this is where your data analysts, data scientists are courting data. They're building dashboards reports on top of, and this is where your main data mart lives. So for us, that is primarily a cloud data warehouse today, usually has a relational data structure.
Starting point is 00:06:05 On top of that, we also do a lot of deep integrations with BI tools. So that includes tools like Tableau, Power BI, Looker, Mode. Wherever these queries from the business stakeholders, BI engineers, data analysts, data scientists run, this is a point of reference where we use to auto-generate documentation, data models, lineage, and usage information to give it back to the data team and everyone else
Starting point is 00:06:40 so that they can learn more about the data set they are about to use. So given that I am seeing an increased number of companies out there talking about data discovery, what is it that SelectStar does that differentiates you folks from other folks using similar verbiage in how they describe what they do? Yeah, great question. There are many players popping up and also traditional data catalogs definitely starting to offer more features in this area.
Starting point is 00:07:14 The main differentiator that we have in the market today, we call it fast time to value. Any customer that is starting with SelectStar gets to set up their instance within 24 hours. And they'll be able to get all the analytics and data models, including column-level lineage, popularity, ER diagrams, who the top users are, and how other people are utilizing that data. Literally in a few hours, 24 at most. And I would say that is the main differentiator. Most of the customers have pointed out that setup and getting started has been super easy, which is primarily backed by a lot of automation that we've created underneath the
Starting point is 00:08:01 platform. On top of that, just making it super easy and simple to use, it becomes very clear to the users that it's not just for the technical data engineers and DBAs to use. This is also designed for business stakeholders, product managers, and folks to start using as they are learning more about how to use data. Mapping this a little bit toward the use cases that I'm the most familiar with,
Starting point is 00:08:31 this big source of data that I tend to stumble over is customer AWS bills. And that's not exactly a big data problem, given that it can fit in memory if you have a sufficiently exciting computer. But using Tableau to wind up slicing and dicing that because at some point Excel falls down. From my perspective, the problem with Excel is that it doesn't tend to work on huge data sets very well. And from the position of Salesforce, the problem with Excel is that it doesn't cost a giant pile of money every month. So those two things combined, Tableau is the answer for what we do. But that's sort of the
Starting point is 00:09:03 end all for us. That's where it stops. At that point, we have dashboards that we've built and queries that we run that spit out the thing we're looking at, and then that goes back to inform our analysis. We don't inherently feed that back into anything else that would then inform the rest of what we do. Now, for our use case, that probably makes an awful lot of sense, because we're here to help our customers with their billing challenges, not take advantage of their data to wind up informing some giant model and mispurposing that data for other things. But if we were generating that data ourselves as a part of our operation, I can absolutely see the value of tying that back into something else.
Starting point is 00:09:40 You wind up almost forming a reinforcing cycle that improves the quality of data over time and lets you understand what's going on there. What are some of the outcomes that you find that customers get to by going down this particular path? Yeah, so just to double-click on what you just talked about, the way that we see this is how we analyze the metadata and the activity logs, system logs, user logs of how that data has been used. So part of our auto-generated documentation for each table, each column, each dashboard, you're going to be able to see the full data lineage, where it came from, how it was transformed in the past, and where it's going to. You will also see what we call popularity score. How many unique users are utilizing this data inside the organization today? How often?
Starting point is 00:10:38 And utilizing these two core models and analysis that we create, you can start looking at first mapping out the data flow and then determining whether or not this data set is something that you would want to continue keeping or running the data pipelines for. Because once you start mapping these usage models of tables versus dashboards, you may find that there are recurring jobs that creates all these materialized views and tables that are feeding dashboards that are not being looked at anymore. So with this mechanism, by looking initially data lineage as a concept, a lot of companies use data lineage in order to find dependencies. What is going to break if I make this change in the column or table, as well as just debugging any of issues that is currently happening in their pipeline.
Starting point is 00:11:44 So especially when you have to debug a SQL query or pipeline that you didn't build yourself, but you need to find out how to fix it. This is a really easy way to instantly find out where the data is coming from. But on top of that, if you start adding this usage information, you can trace through where the main compute is happening,
Starting point is 00:12:08 which largest raw table is still being queried instead of the more summarized tables that should be used versus which are the tables and data sets that is continuing to get created, feeding the dashboards, and then is those dashboards actually being used on the business side? So with that, we have customers that have saved thousands of dollars every month just by being able to deprecate dashboards and pipelines that they were afraid of deprecating in the past
Starting point is 00:12:41 because they weren't sure if anyone's actually using this or not. But adopting SelectStar was a great way to kind of do a full screen claim of their data warehouse as well as their BI tool. And this is an additional benefit to just having to declutter so many old, duplicated, and outdated dashboards and datasets in their data warehouse. That is, I guess, a recurring problem that I see in many different pockets of the industry as a whole. You see it in the usability space. You see it in the cost control space. I even made a joke about Confluence that alludes to it.
Starting point is 00:13:23 This idea that you build a whole bunch of dashboards and use it to inform all kinds of charts and other systems, but then people are busy. It feels like there's no and then. Like one of the most depressing things in the universe that you can see after having spent a fair bit of effort to build up those dashboards is the analytics for who internally has looked at any of those dashboards since the demo you gave showing it off to everyone else. It feels like in many cases, we put all these projects and amount of effort into building these things out that then don't get used. People don't want to be informed by data. They want to shoot from their gut.
Starting point is 00:14:01 Now, sometimes that's helpful. We're talking about observability tools that you use to trace down outages. And well, our site's really stable. We don't have to look at that. Very awesome. Great. Awesome use case. The business-insight level of dashboard just feels like something you should really be checking a lot more than you are. How do you see that? Yeah, for sure. I mean, this is why we also update these usage metrics and lineage every 24 hours for all of our customers automatically, so it's always up to date. And as for the part that more customers are asking for, where we are heading to: earlier I mentioned that our main focus has been on analyzing data consumption and understanding the consumption behavior to drive better usage of your data, or to make data usage much easier.
Starting point is 00:14:52 The part that we are starting to now see is more customers wanting to extend those feature capabilities to their stack of where the data is being generated. So connecting the similar amount of analysis and metadata collection for production databases, Kafka Qs, and where the data is first being generated is one of our longer term goals. And then you'll really have more of that up to the source level of whether the data should be even collected or whether it should even enter the data warehouse phase or not. One of the challenges I see across the board in the data space is that so many products tend to have a very specific point of the customer lifecycle where bringing them in makes sense. Too early and it's data.
Starting point is 00:15:51 What do you mean data? All I have are these logs and their purpose is basically to inflate my AWS bill because I'm bad at removing them. And on the other side, it's great. We pioneered some of these things and have built our own internal enormous system that does exactly what we need to do.
Starting point is 00:16:06 It's like, yes, Google, you're very smart. Good job. And most people are somewhere between those two extremes. Where are customers on that lifecycle or timeline when using SelectStar makes sense for them? Yeah, I think that's a great question. Most of the time, the best place where customers would use SelectStar for is after they have their cloud data warehouse set up. Either they have finished their migration, they're starting to utilize it with their BI tools, and they are starting to notice that it's not just like, you know, 10 to 50 tables that they are starting with. Most of them have
Starting point is 00:16:46 more than hundreds of tables. And they're feeling that this is starting to go out of control because we have all these data, but we are not 100% sure what exactly is in our database. And this usually just happens more in larger companies, companies at 1,000 plus employees. And they usually find a lot of value out of SelectStar right away because we'll start pointing out many different things. But we also see a lot of forward-thinking, fast-growing startups that are at the size of a few hundred employees, you know, they now have between five to 10 person data team, and they are really creating the right single source of truth of their data knowledge through SelectStar. So I think you can start anywhere from when your data team size is like beyond five and you're continuing to grow.
Starting point is 00:17:47 Because every time you're trying to onboard a data analyst, data scientist, you will have to go through like basically the same type of training of your data model. And it might actually look different because data models and the new features, new apps that you're integrating, this changes so quickly. So I would say it's important to have that base early on and then continue to grow. But we do also see a lot of companies coming to us after having thousands of data sets or tens of thousands of data sets that it's really like very hard to operate and onboard to anyone. And this is a place where we really shine to help their needs as well. Sort of the, I need a database to the help. I have too many databases pipeline, where at some point people start to wanting to bring organization to the chaos. One thing I like about your model is that you don't seem to be making the play
Starting point is 00:18:49 that every other vendor in the data space tends to, which is, oh, we want you to move your data onto our systems at the end. You operate on data that is in place, which makes an awful lot of sense for the kinds of things that we're talking about. Customers are flat out not going to move their data warehouse over to your environment just because the data gravity is ludicrous. Just the sheer amount of money it would take to egress that data from a cloud provider, for example, is monstrous.
Starting point is 00:19:19 Exactly. And security concerns. We don't want to be liable for any of the data. And this is like a very specific decision we've made very early on on the company to not access data, to not egress any of the real data, and to provide as much value as possible just utilizing the metadata and logs. And depending on the types of data warehouses, it also can be really efficient because the query history or the metadata systems tables are indexed separately. Usually it's much lighter load on the compute side. And that definitely has worked well for our advantage, especially being a SaaS tool. This episode is sponsored in part by our friends at Sysdig. Sysdig secures your cloud
Starting point is 00:20:11 from source to run. They believe, as do I, that DevOps and security are inextricably linked. If you want to learn more about how they view this, check out their blog. It's definitely worth the read. To learn more about how they view this, check out their blog. It's definitely worth the read.
Starting point is 00:20:25 To learn more about how they are absolutely getting it right from where I sit, visit sysdig.com and tell them that I sent you. That's S-Y-S-D-I-G dot com. And my thanks to them for their continued support of this ridiculous nonsense. What I like is just how straightforward the integrations are. And it's clear you're extraordinarily agnostic as far as where the data itself lives. You integrate with Google's BigQuery, with Amazon Redshift, with Snowflake,
Starting point is 00:20:55 and then on the other side of the world with Looker and Tableau and other things as well. And one of the example use cases you give is find the upstream table in BigQuery that a Looker dashboard depends on. That's one of those areas where I see something like that and, oh, I can absolutely see the value of that. I have two or three DynamoDB tables that drive my newsletter publication system that I built because I have deep-seated emotional problems and I take it out on everyone else via code. But there's a small contained system that I can still fit in my head, mostly. And I
Starting point is 00:21:25 still forget which table is which in some cases. Down the road, especially at scale, okay, where is the actual data source that's informing this? Because it doesn't necessarily match what I'm expecting is one of those incredibly valuable bits of insight. It seems like that is something that often gets lost. The provenance of data doesn't seem to work. And ideally, you know, you're staffing a company with reasonably intelligent people who are going to look at the results of something and say, that does not align with my expectations. I'm going to dig, as opposed to the, oh yeah, that seems plausible. I'll just go with whatever the computer says. It's an ocean of nuance between those two, but it's also, it's nice to be able to establish
Starting point is 00:22:04 the validity of the path that you've gone down in order to set some of these things up. Yeah, and this is also super helpful if you're tasked to debug a dashboard or pipeline that you did not build yourself. Maybe the person has left the company, or maybe they are out of office, but this dashboard has been broken and you're, quote unquote, on call for data. What are you going to do? You're going to, without a tool that can show you a full lineage,
Starting point is 00:22:35 you'll have to start digging through somebody else's SQL code and try to map out like where the data is coming from. If this is calculating it correctly. Usually it takes a few hours to just get to the bottom of the issue. And this is one of the main use cases that our customers bring up every single time as more of like, this is now the go-to place every time there is any data questions or data issues. The first golden rule of cloud economics is step one, turn that shit off.
Starting point is 00:23:10 When people are not using something, you can optimize the hell out of however you want, but nothing's going to be turning it off. One challenge is when we're looking at various accounts and we see a Redshift cluster, and it's, okay, that thing's costing a few million bucks a year, and no one seems to know anything about it. They keep pointing to other teams, and it turns into this giant finger-pointing exercise where no one seems to have responsibility for it. And very often, our clients will choose not to turn that thing off. Because on the one hand, if you don't turn it off, you're going to spend a few million bucks a year that you otherwise would not have had to.
Starting point is 00:23:46 On the other, if you delete the data warehouse and it turns out, oh yeah, that was actually kind of important. Now we don't have a company anymore. It's a question of which is the side you want to be wrong on. And in some levels, leaving something as it is and doing something else is always a more defensible answer. Just because the first time your cost-saving exercises take out production, you're generally not allowed to save money anymore. This feels like it helps get to that source of truth a heck of a lot more effectively than tracing individual calls and turning into basically data center archaeologists. Yeah, for sure. I mean, this is why from the get-go,
Starting point is 00:24:27 we try to give you all your tables of your database, just order by popularity. So you can also see overall, like from all the tables, whether that's thousands or tens of thousands, you're seeing the most used, has the most number of dependencies on the top, and you can also filter it by all the database tables that hasn't been touched in the last 90 days.
Starting point is 00:24:54 And just having this high-level view gives a lot of ideas to the data platform team of how they can optimize usage of their data warehouse. For more, I tend to say an awful lot of customers are still relatively early in their data journey. An awful lot of the marketing that I receive from various AWS mailing lists that I've found myself on because I had the temerity to open accounts has been along the lines of, oh, data discovery is super important. But first, they presuppose that I've already bought into this idea that, oh, every company
Starting point is 00:25:31 must be a completely data-driven company, the end, full stop. And yeah, we're a small bespoke services consultancy. I don't necessarily know that that's the right answer here. But then it takes it one step further and starts to define the idea of data discovery as, ah, you'll use it to find a PII or otherwise sensitive or restricted data inside of your data sets. So, you know exactly where it lives and sure. Okay. That's valuable, but it also feels like a very narrow definition compared to how you view these things. Yeah, basically, the way that we see data discovery is it's starting to become more of an essential capability in order for you to monitor and have understand how your data
Starting point is 00:26:19 is actually being used internally. It basically gives you the insights around, sure, like what are the duplicated datasets? What are the datasets that have that descriptions or not? What are something that may contain sensitive data and so on and so forth? But that's still around the characteristics of the physical datasets. Whereas I think the part that's really important around data discovery that is not being talked about as much is how the data can actually be used better. So have it as more
Starting point is 00:26:54 of a forward thinking mechanism. And in order for you to actually encourage more people to utilize data or use the data correctly, instead of trying to contain this within just one team is really where I feel like data discovery can help. And in regards to this, the other big part around data discovery, hence, is really opening
Starting point is 00:27:19 up and having that transparency just within the data team. So just within the data team, they always feel like they do have that access to the SQL queries and you can just go to GitHub and just look at the database itself. But it's so easy to get lost in the sea of metadata that is just laid out as just a list. There isn't much context around the data itself. And that context,
Starting point is 00:27:48 along with the analytics of the metadata is what we are really trying to provide automatically. So eventually, this can be also seen as almost like a way to monitor the data sets, like how you're currently monitoring your applications through a data dog or your website as if with your Google Analytics. This is something that can be also used as more of a go-to source of truth around what your state of the data is, how that's defined,
Starting point is 00:28:24 and how that's being mapped to different business processes so that there isn't much confusion around data. Everything can be called the same, but underneath it actually can mean very different things. Does that make sense? No, it absolutely does. I think that this is part of the challenge in trying to articulate value that is specific to this niche across an entire industry. at one or two pre-imagined customer profiles. And that has the side effect of making customers for whom that model doesn't align look and feel like either doing something wrong or makes it look like the vendor who's pitching this is somewhat out of touch.
Starting point is 00:29:16 I know that I work in a relatively bounded problem space, but I still learn new things about AWS billing on virtually every engagement that I go on, just because you always get to learn more about how customers view things and how they view not just their industry, but also the specificities of their own business and their own niche. I think that that is one of the challenges historically with the idea of letting software do everything. Do you find the problems that you're solving tend to be global in nature, or are you discovering strange depths of nuance on a customer by customer basis at this point? Overall, a lot of the problems that we solve and the customers that we work with is very
Starting point is 00:29:56 industry agnostic. As long as you are having many different data sets that you need to manage, there are common problems that arises regardless of the industry that you're in. We do observe some industry-specific issues because your data is either, it's a non-structure data where your data is primarily events or depending on how the data looks like.
Starting point is 00:30:21 But primarily because most of the BI solutions and data warehouses are operating as a relational databases. This is a part where we really try to build a lot of best practices and the common analytics that we can apply to every customer that's using SelectStar. I really want to thank you for taking so much time to go through
Starting point is 00:30:46 the ins and outs of what it is you're doing these days. If people want to learn more, where's the best place to find you? Yeah, I mean, it's been fun talking here. So we are at selectstar.com. That's our website. You can sign up for a free trial. It's completely self-service, so you don't need to get on a demo, but we'll also help you onboard and happy to give a free demo to whoever that is interested. We are also on LinkedIn and Twitter under CollectStarHQ. Yeah, I mean, we're happy to help for any companies that have these issues around wanting to increase their discoverability of data and want to help their
Starting point is 00:31:27 data team and the rest of the company to be able to utilize data better. And we will, of course, put links to all of that in the show notes. Thank you so much for your time today. I really appreciate it. Great. Thanks for having me, Corey. Shinji Kim, CEO and founder at SelectStar. I'm cloud economist Corey Quinn, and this is Thanks for having me. with an angry comment that I won't be able to discover because there are far too many podcast platforms out there and I have no means of discovering where you've said that thing unless you send it to me. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duck Bill Group. We help companies fix their AWS bill by making it smaller and less
Starting point is 00:32:27 horrifying. The Duck Bill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started. This has been a HumblePod production. Stay humble.
