The Data Stack Show - 110: How Can Data Discovery Help You Understand Your Data? Featuring Shinji Kim of Select Star

Episode Date: October 26, 2022

Highlights from this week's conversation include:

Shinji's background and career journey (3:35)
Defining "data discovery" (6:03)
The best conditions to use Select Star (8:45)
Where Select Star fits on the data spectrum (13:38)
Why Select Star is needed (17:35)
How Select Star uses metadata (21:02)
Exposing data queries (27:04)
Composing queries into metadata (33:27)
Automating BI tools (37:28)
Limits to data governance (41:39)
Maintaining economies of scale (48:56)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. We are going to talk with Shinji from SelectStar today. And Costas, this is a really interesting, they describe the company in a really interesting way. They call it data discovery,
Starting point is 00:00:38 but they sit in the data governance space. And so I'm really interested to ask about what data discovery means and then how that plays into data governance, because you don't, you know, when I think about the term governance, I don't think about terminology like discovery. So that is, you know, that's an interesting, not a juxtaposition of terms, but, you know, an interesting, an interesting set of terms to describe the product. So that's what I'm going to ask about. How about you? Yeah, I guess like before you govern the data, you have to discover like what kind of data you have, right? So it's probably like requirements for governance. I think historically also like data governance was all built around data
Starting point is 00:01:29 catalogs and dictionaries. So I guess there is like an overlap there. But anyway, yeah, I mean, it's like a very, very interesting topic. I think we see many companies out there, like new startups, that one way or another they implement like a small part of like data governance and they try like to create a new category of products. But at the end, like, okay, well, these are like things that one way or another, they are already in use
Starting point is 00:02:05 out there in the industry. And so, yeah, what I really appreciate with SelectStar is how they are not trying to create a new category. Right? Like they say, they are the governors of what we are doing. And yeah, I'd love to hear what are the challenges of building a product like this today, right? I thought these different pieces of infrastructure out there, itself and what kind of challenges they are facing during the building the product. Absolutely. Let's dive in and talk with Shinji.
Starting point is 00:02:55 Yeah, let's do it. Shinji, welcome to the Data Sack Show. We are so excited to chat with you today. Thanks for having me. Excited to be here. All right. Well, let's start where we always do. Can you tell us about your background, which is fascinating, by the way, you've had heavy duty data engineering roles. You are a successful founder with a previous company, which is really cool. So we want to hear about that. But tell us about your journey and what led you to starting SelectStar. Sure. Yeah.
Starting point is 00:03:32 I'm very excited to be here. I have a computer science background, worked as a software engineer, data scientist, product manager, data engineer in the past. Primarily, yeah, doing sales forecasting, modeling, or looking into a lot of SQL, or building ETL pipelines, I would say is something that I used to do 10, 12 years ago. Fast forward, I started a company in 2014 called Concours System.
Starting point is 00:04:03 We were having a lot of trouble scaling our real-time data pipeline at a mobile ad network that was growing very fast, hitting about 10 billion events a day. You know, Blink over Spark streaming didn't really exist back then. Built our own solution and spun it out as a company
Starting point is 00:04:20 and that's Concourse Systems. So that company, you know, after two years, I sold it to Akamai. Now it's a product called IoT Edge Connect, where it's designed to process billions of data points coming from different IoT devices around the globe, utilizing Akamai's CDN Edge network. And from that experience at Akamai,
Starting point is 00:04:44 starting to work with global enterprises, especially in automotive and consumer electronics space, I've noticed that the next frontier of issues that are starting to happen was around utilizing data, utilizing data that companies have collected and have processed and is sitting in the data warehouse, being able to find and understand and hence being able to actually pull insights and analysis out of data is actually starting to become a lot harder than it should be. So this is why I started SelectStar so that we can give more context around data so that any data analyst, data scientist, or anyone that has access to data can easily find and utilize data as they need.
Starting point is 00:05:38 Love it. And I'm going to steal one of Costas' classic questions here because he loves to talk definitions, but could you give us your definition? You call SelectStar a data discovery tool. What's your definition of data discovery? You kind of covered a couple of the pain points there at a high level, but would love
Starting point is 00:05:59 your definition of data discovery. Yeah. I mean, just utilizing the term discovery in a way, like we believe that data discovery is all around increasing the discoverability of your data. So that means basically being able to find and understand data such that you have access to. So what that means is that
Starting point is 00:06:24 regardless of where you're starting forum, meaning even if you may not know what the field or the dataset may be called by just typing a certain keyword or something that's related, you should be able to find the right datasets that you're looking for. And on top of that, whenever you are looking at a dataset, it's not, you know, sometimes you can try to query the data to see what's inside, but more often, you, the data may, even though you may see the data, it may not be exactly what you're looking for because it might have been filtered from before or was aggregated differently than what you expected, so on and so forth. So truly understanding the data is the other component of, I would say, data discovery.
Starting point is 00:07:14 And a lot of that will come from understanding where the data came from, who's using this inside the company and where it's being used and how this is being used. So we call that a data context. So data discovery can really happen, well, cannot happen without context. So having a lot of context, providing that context, I would say is what data discovery platforms do. Yeah, super helpful. So can we dig into some of the conditions? I would say both from a technology standpoint, but also from maybe sort of a team or company standpoint that are big drivers for companies who needed a tool like SelectStar. And I'm thinking, you know,
Starting point is 00:08:07 and I think our listeners probably span a wide range, right? We probably have some that say like, okay, well, data discovery for me is, you know, we're running Soflake and I can just scroll through all the tables and I have a pretty good sense of what's going on, you know, to companies that have, you know, probably a mixture of on-prem and cloud infrastructure across data lakes, warehouses, production databases, and a pretty complex setup.
Starting point is 00:08:31 So could you help describe where on that spectrum do you see a lot of your customers in terms of their stack? Sure. So I would say traditionally, enterprise data catalog and other mechanisms to increase data discoverability was introduced mainly because companies have so many different sources of data. So just putting them all into one place
Starting point is 00:09:02 as a place that is an inventory that you can search through, that was primarily the need. But today, that's really not always the case. There are still a lot of different sources of data, but all of those data are arriving at a single location, whether that is a data lake or data warehouse. So the main issue around data discoverability, it's not that you don't know where to go. It's just that even when you are in Snowflake, even when you have access to all the data, you have to sift through hundreds and thousands of different schemas and database tables in order to locate what you are looking for. And even then, you may not be 100% sure whether this is something that you're truly
Starting point is 00:09:52 going to use or not. So to me, there are three main levers that I see when companies really need the data discovery tools. So first and foremost is once you have a single source of truth and you have like a data warehouse where all of your data is arriving in one place and you're recognizing that the number of tables are growing, number of schemas are growing, and it's starting to become a little bit more cumbersome to just refer to, well, you can just like use this schema and, you know, there are only like 10 tables or 20 tables there. So as the number of tables go beyond the hundreds, I think this is like one area that you would want to put in some type of like data dictionary, some documentation to start putting together more context around the data sets.
Starting point is 00:11:07 The other part that we see as a lever is data team's growth. So this also happens as data grow, but sometimes, you know, you may have just a small data team that manages like large volume of data sets. And from here, as you are trying to bring on new data analysts or data scientists or data engineers, for them to ramp up, it just takes a lot longer time than you anticipate because there's so much tribal knowledge within the data sets. And you may also realize that this tribal knowledge is also very hard to transfer without significant effort. And you may realize this because, you know, someone decides to leave the company and you're realizing, oh, like shit, like how are you going to make sure that,
Starting point is 00:11:46 you know, all the pipelines and materialized views that they built is going to, you know, stay sound even after they leave. So I think just overall training, onboarding, and making sure all the data team, all the data team members are on the same page. This is one of the goals that customer trust to customers try to achieve. Last but not least, there is also, I would say, some companies that start thinking about data discoverability because it gets triggered by data governance. So data governance initiatives where you may need to understand the actual data flow and the model of how the data pushes through different stack or different models. You need to start putting together either documentation or start tracking where the
Starting point is 00:12:39 data is heading to. And this is very hard to do manually. So having an automated discovery tool is another reason why companies start looking for solutions like SelectStar. Yeah. Super interesting. Yeah. It's funny. I mean, I just, I think earlier this week or maybe the end of last week, you know, there was a Slack message floating around
Starting point is 00:13:05 and it was a new analyst asking people who'd been around for a long time, what is this table? And it's like, okay, you gotta, you know, find the people who've been around the longest in the company and who've been plus the stack and the data and the reporting. You know, so even, you know, we're not a huge company, but we see that, we see that every day.
Starting point is 00:13:31 One more question from me before I hand it over to Costas. So you mentioned data governance and, you know, one of the funny things about the data industry is there's all sorts of terminology, right? So you have data governance, data observability, data quality, et cetera. Where do you see SelectStar fitting into that spectrum, right? Because you do, you know, discoverability, you know, isn't necessarily, you know, it's related to quality, it's related to governance,
Starting point is 00:14:07 like there are some components there, but I would just love to know where did you see SelectStar on that spectrum? Yeah, I think that's a great question. So we see ourselves as more of a horizontal platform that thinks and interacts with different stack of data, whether that is on the data like generation side to the data storage side to BI side. So, or the transformation side, we, you know, want to plug in and bring out all the metadata together in one place and also give you more of a comprehensive analytics
Starting point is 00:14:46 of that metadata so that you have one place that you can find out what is happening and how different data assets are related to another, regardless of the tool that you are looking at. So we see data discovery as more of like a capability that supports data demo protection because this allows everyone to be able to easily find and understand data. Secondly, data governance.
Starting point is 00:15:15 So governance regarding if you want to, and the main difference that I see data governance and data, which both goes hand in hand. A lot of the metadata management really is around being able to collect all the metadata, house it in one place, having a unified metadata model that you can collect and utilize, including operational metadata as well as the catalog type of metadata. Governance is really kind of like a layer on top of that that allows you to add ownership policies and ways to put a taxonomy on top of that. So we see discovery as a supporting capability
Starting point is 00:16:00 for all those use cases because once you have this really good amount of discoverability, which is backed by the auto-generated context that we review, whether that is including like data lineage, popularity, who's using more top users, and different query things, during run, or like entity relationships that we see, like these are the components that you can use to drive the policies, to define which ones that you want to measure the quality for,
Starting point is 00:16:36 and to also give access to whom. So that's kind of how we see data discovery playing in those use cases. Super helpful. All right, Costas, I have been monopolizing. Please take the mic. Thank you. Thank you so much.
Starting point is 00:16:54 All right. So, Cindy, you were going through how things worked when you were doing with Eric. And you were saying that the data has to be collected into one place. Usually this is like the data warehouse. And on top of like the data warehouse, then we can stop like creating like the data catalogs and enable data discovery. And I'm wondering like, when we have like all the data in the data warehouse, why do we need like another service on top of the catalog that the data warehouse has, you know, to like, to understand what kind of data we have there and what we got to do with this data, like why the database itself is not enough?
Starting point is 00:17:34 Diana Kanderkiaiheva- Sure. The catalog that the database has, or what we call information schema, is really just the structure of the physical metadata, and that is pretty much like where we are starting from as well. But what data discovery platform does on top of that is way more than just showing the metadata. It is, I mean, we will go through all the SQL query history, all the activity logs, and also connections that other tools may have with that data warehouse to try to put together where the data is flowing into and how the
Starting point is 00:18:15 data is being used. So really the thing that we do at SelectStore is really parsing through different SQL query history and then piecing that together into a form of different data models that everyone can consume. Like column-level lineage, regardless of where you're starting from, you can see all the upstream sources of data as well as the downstream effect that one metadata change may have. You can see like how many people were utilizing this table or column in the last 90 days and how often has it been used, which rich dashboards or other tables or queries like this is flowing into. So you can see the impact of like where this data is going to. And going further, we are now showing more like an ER
Starting point is 00:19:11 diagram so that even if data warehouses or data lakes have lost the property like primary key and foreign key, those can be detected. Or if they are already in there, that plus any of the joins that we recognize can be put together as more of a data model so that you have a visual of how different tables are connected to each other. These are all the parts that I would say is more of a... It's not a core feature of the databases today. And I think also when you look at data discovery, it's a lot more powerful as you are starting to connect,
Starting point is 00:19:57 you know, different tools you're using on top of the data warehouse. So being able to see your Looker dashboard, which upstream tables it's using, or which look at mount views, like upstream tables it's using, or which look at our views, like where this is coming from, and vice versa. By changing this production route table or this specific field, how many dashboards may crash in your Tableau server? Things like that, I think that is when it becomes a lot more interesting as a data discovery platform. Mm-hmm.
Starting point is 00:20:27 All right. That's awesome. You mentioned a couple of different sources of, let's say, information or metadata that is used to create, let's say, the user experience around SelectStar. The first one you mentioned was the metadata itself that's coming from Data Warehouse or the database system, right? Can you tell us a little bit more about that?
Starting point is 00:20:51 Like when you say metadata from the Data Warehouse, what kind of metadata we are talking about and how do you use them? Yeah. So there are like the, like main, I guess, like base level, quote unquote, like metadata catalog data, including like names of the tables, columns, schema, database, the comments of that. So hence, like this kind of gives us the structure of like which table belongs to where, things like that. And then there's a part of operational metadata that we collect. So when was this last created? When was this last updated?
Starting point is 00:21:31 What may be the DBL or DML that's related to this? Like who's queried this the last time? And some databases also provide things like the row count, how big the table is, things like that. So those are things that we utilize to basically try to give a snapshot of what that table looks like and what the current state of the table to them is. So that's kind of what we we consider as like the kind of like a core main metadata. And then there is the aspect around, I think I mentioned the table column comments, but this is the description side. So description, usually, you know, we consider as like a part of the metadata because it's
Starting point is 00:22:21 already baked into the database itself. And then on top of that, there is the logs that we collect. So the logs, I wouldn't call it like a fully metadata, but it is something that we can basically parse through to generate metadata because I define metadata as data that describes the data set. So yeah, so the query logs will tell us, who's querying what, how long does it take, and what are the resources that it uses from that query? And from that query itself, which tables and columns are being queried?
Starting point is 00:22:56 Is it being queried directly? Is it getting, you know, and whenever we are looking at the actual transformations, is it being transferred, being transformed, or is it getting aggregated, or is the data actually being used as it is? These are all like details of kind of like, you know, extracting the info, metadata information about the query so that we can analyze it under the same umbrella to make the, like what we call our like a table page, page column page, our object data asset page
Starting point is 00:23:34 to be much richer. Does that make sense? Yeah. And then last but not least, there's the user metadata as well. So it's, it's more lightweight, but just more of like, you know, how many users, like which users, like how is it, you know, how are they logging in? When's the last time this person was logged in? Like things like that.
Starting point is 00:23:53 Henry Suryawirawanaclapurnawakilapurna. Yeah. Yeah. You actually like answered like some of my follow up questions, to be honest, because I wanted to also like, I was thinking about the other sources. Actually I will, but before we do that, about the metadata that each data warehouse like is exposing out there. Have you noticed like any considerable difference between the different technologies that you are integrating with?
Starting point is 00:24:19 Is there like, are all the systems out there on par? Or is, you wish is there something more out there in terms of the core metadata, right? We're not talking about queries and logs and stuff like that right now. I think the core metadata are primarily similar from one database to another.
Starting point is 00:24:40 But the way to retrieve metadata and the type of access that we need just to get the metadata can be drastically different. Okay. default, we try to carve out or give recommendations to our clients that give us access only to the metadata, but not to their actual data. And every database
Starting point is 00:25:15 is slightly different in this way where, for example, with Snowflake, we may have just the access to their metadata database called just Snowflake database, account usage. Whereas for something like Oracle or Postgres, there are very, or Redshift, there are very specific types of tables that we would require for access. Instead of, I guess, getting quote unquote, like metadata database access, because those will be generally only available for admins.
Starting point is 00:25:49 Yeah. So it's yeah, that's where I think things get very tricky. The other part is also how the logs are generated. So yeah, some data warehouses, like this will be already in a table, like for a query history, whereas some other data warehouses, we would need to ask the customer to like, you know, enable the logging and then enable like other types of logging too. And then also have them to point to like, you know, their CloudWatch logs to go to like, you know, this bucket or, yeah.
Starting point is 00:26:23 So there can be like more of this integration setup that could be required depending on the database we are working with and for BI it's a whole another story. Alex Hidalgo- Of course. Yeah. So it's different. Yeah. Alex Hidalgo- Yeah. We'll get there.
Starting point is 00:26:38 We'll get there. All right. And the next thing that I found like quite interesting, you mentioned is that like there's a lot of metadata that's generated by parsing the queries. My first question is, have you seen any kind of like resistance from like your customers so far to expose these queries to your service? The reason I'm asking is because you mentioned already that, okay, you are a metadata platform. You don't want to get access to the data itself. But queries many times can reveal, let's say, through the logs, also important information that shouldn't be leaked. It's always a little bit tricky. So how do you see that? Like so far, like how do you see, uh, customers and like the
Starting point is 00:27:29 companies out there react to that? Yeah. So few things. First and foremost, this has always been a concern of mine when I first started SelectStar. So my last company, we had an on-prem software where we just deployed to our customer's environment. So we didn't have to worry about this.
Starting point is 00:27:50 But with Akamai, as I was building more of a platform as a service product, this, you know, provenance of data and security on the data side is something that I just like grew up, had a lot more interest in. So with SelectStar, I, from the get-go, we got our type 2 audit and making sure that everything is treated as confidential information of customer. Even though if it's just the metadata, we will first make sure that we're not getting as much data as possible. And then anything that we bring to our system, we will treat it with the best amount of enterprise security that we can add in. So that's the first thing.
Starting point is 00:28:36 Second part is with logs, it really kind of comes down to depending on what kind of queries you're running. So first of all, we have a very specific types of queries that we process, you know, mostly around the creation of the queries like DDL, DML, things like that, but also a lot of select queries. So what we allow our customers to do is allowing them any sensitive fields in SelectStar. And if it's already tagged as PII or sensitive, from our parsing perspective, we will strip out all those values before we fully process it in SelectStar. So anything that you might come across through SelectStar, whether that is a query or something else, because you can look up different queries from SelectStar. You can look at how this people was created
Starting point is 00:29:36 or what are the popular queries that people use to utilize this table. You can look up other people's queries by going into their profile pages or team pages. So when these queries show up, we will strip out the values if the field was already defined as it, if it's a PII or something. So that's like another way
Starting point is 00:30:02 where we are ensuring any of the sensitive data itself does not enter our platform. And last but not least, we are starting to allow more customers initially just as a trial. And this is something that will be available in the future. But for them to just like load the metadata themselves or load the models themselves. So if they want to strip out any specific parts or if they give us a certain configuration as more of like, we don't want any of the queries of like this user to come through, then those are fairly straightforward settings that we can adjust so that like,
Starting point is 00:30:46 you know, we, we basically filter out or we don't touch any of those queries from the get go. Does that make sense? Henry Suryawirawanaclueyen- Oh, absolutely. It's like super interesting. Like to see like the the complexity of like building a product around that. And like, it's many things that like people don't realize, they didn't have reason to go through this kind of process or using this kind of products or establishing this kind of
Starting point is 00:31:16 processes, right? And it's a big part of the product itself, the complexity of building like such a product. It's not just the metadata that you have to process. It's also like all these processes that there are for a good reason there around like security, around like auditing and all that stuff that we need like to pay attention and make sure that we provide all the functionality that our customers need. Right?
Starting point is 00:31:43 Yeah. Yeah. So this is like something that we also see, like, you know, not just to like strip out the data, like, you know, marking something insensitive or PII. It is something that customers can really leverage to specify any of their sensitive data sets to the rest of the company.
Starting point is 00:32:01 And because these data sets are not exposed to the intellects whatsoever, even if the end user may have access to it, they can actually understand that, oh, this is not something that I should easily share or freely share with others, right? The older data, when and where is it being used? And where did it go?
Starting point is 00:32:25 So any of these reporting around GDPR, CCPA has been one of the areas that some of our customers are starting to use SelectStar for. And this happens, one from the usage perspective, but the other also as a lineage perspective, because then you can follow the trail of where the data exactly ended up in, and then get kind of like the usage information or audit logs of all of those fields altogether. And okay. About like parsing the queries, what kind of information you are looking for there? Like how do you decompose, let's say, the query into a metadata that can be used on a data catalog?
Starting point is 00:33:11 Like, what are you doing there? That's probably one of, like, I wanted to ask this question, like, since we started. So I'm really happy that I can do it now. So your question is, like, what are we actually doing in the parser with the SQL? Yeah. Yeah. So I think that in the high level, there are a few things we actually doing in the parser with the SQL stuff? Yeah. Yeah. So I think that in the high level, there are a few things we are doing.
Starting point is 00:33:29 I mean, we are not trying to like, you know, like reverse engineer the query. So I think there are like many different things that have happened until it gets executed in the database. Like if you are using DBT or if you are using any like a template, like some things may happen. But what we care about is at the end of the day, how did they execute and how does that look like on the database perspective?
Starting point is 00:34:01 So a few things that we are looking for is that first of all, we are looking for is, first of all, we are looking at, like, what are the fields and tables that are actually being selected? And so that is coming from, like, just looking at, you know, it can be through different CPEs and the different, like, net set queries, but we will look at, like, you know, so how are, like, the result of this query being mapped and where, where, where did that source come from?
Starting point is 00:34:32 Around that source, we will like, you know, we will try to define, you know, to match the existing metadata that we have. Because we already, and that's kind of almost like a natural picture. Sometimes we may not find everything, but as I have this all loaded into SLEPT, we should be able to find it. And from there, as we are looking at the field level connection, that's when we will try to determine
Starting point is 00:35:02 whether we can tell what the general relationship is. Is the field being generated as it is? It's just like, or is this field that is being aggregated or transformed? These are some of the details that we try to build in. There are more things that we plan to do and want to do, but that's kind of like more of the extent of how we look at the queries. So a lot of these information then we, our parser exports that will go into kind of like our backend model. So different backend models.
Starting point is 00:35:40 So this would like, would be used. So for any of the select queries, this will go into more of like the usage and like we'll put on, we have a popular is for, for every field, every table. So this is like where I just get added to. If this was a DML, doDL queries, then we will add this as more of information for our lineage. So that's like determining for this specific field, this is a source of the field. And then here is a target and it's, it's been, you know, propagated as, you know, in this manner, like things like that is basically what our like core parsing does.
Starting point is 00:36:22 Yeah. Super interesting. Does that answer your question? FELIPE HOFFAEYEN. Yeah. Yeah. I mean, okay. Like, why can't keep like discussing about that for a long time?
Starting point is 00:36:31 LORINA RENATO MANGINI- Yeah. We're going definitely deep here. FELIPE HOFFAEYEN. Yeah. We have a... LORINA RENATO MANGINI- Yeah. And then we have a, I was going to say, we have like a similar parsing that we use for DBP or for ML, like YAML files, like, you know, and, you know, for ETL, it's a little
Starting point is 00:36:47 bit different, right? So we try to basically follow through the data model each stack has and map it to like the meta data model that we have of lineage, popularity, entity relationship, things like that. Interesting. And okay, let's leave the databases behind for a little bit and let's go to the BI tools, right? Because you also have like to deal operate with them and like pull out like information and okay. Database systems obviously have been developed from day one to access information.
Starting point is 00:37:27 But, okay, BI tools are mainly like visualization tools, right? So how can you automate this process? How can you go to a system like Looker and pull out useful metadata? Yeah, so it doesn't happen overnight. We have to go in and look at the BI tool's data model, and we try to map out
Starting point is 00:37:55 how, like, our integration process requires mapping out, like, which metadata of the BI tool will map to the BI metadata model that we have. And if there are parts that our BI metadata tool may not have, what are the other models that we may need to augment in order to support that BI integration? And every integration that we have will have lineage,
Starting point is 00:38:27 popularity, top users, and the kind of like the integration into like any of the data, the actual, like the database connections. And all of these can, they are, happens very differently per BI tool. So for example, with Looker, we get a lot of the Looker, like it's for dashboards, user information directly from their API.
Starting point is 00:38:56 But your Looker API does not expose LookML information. It, it just gives you maybe just the explore information but it does not tell you like you know the the actual view which view it's coming from and which specific connection does that view have so for those like we usually either get a snapshot of the LookML repo from the customer, or we connect directly to their repo as a read-only mechanism to load and parse the LookML views so that we can bridge the connection between the data warehouse and Docker. On the other hand, for something like Tableau, Tableau has a metadata API that will expose a lot of this through a GraphQL API, but we also use their REST API
Starting point is 00:39:57 because it gives the other information that we need. But for something like Tableau, companies use it to run dashboards through the Tableau data model, which can come from the API. But some of the workflows and dashboards will run through our parser against the connection that we see on Tableau and then bring out the result and add it to the language. So it's yeah, it really kind of depends on the BI tool, but we try to basically look at how does this like BI tool actually work and what is their view of the data and how, and hence, how are they defining the dashboard and charts and metrics. And we try to basically map them because eventually the view that we have and some of our customers have like multiple BI tools. When they want to see all dashboards together, like by just typing a keyword, we do have areas that we want to consolidate this under like the same umbrella too, right? So yeah, so those are, that's the exercise that our team has to go through
Starting point is 00:41:27 to ensure that integration is done correctly. Yeah, yeah. That's super interesting. Okay, one last question from me and then I'll give the microphone back to Eric. Okay. Many times, you know, like, especially like people with, that are coming like with an engineering background, like we tend to, let's say, abstract things to a point that become too ideal, right?
Starting point is 00:41:51 Like we are talking about like the data stack and we have like the ETL pipelines, the data warehouse, then we have some consumers that are going to be some BI tools and like all that stuff. But the reality is many times like it's a little bit different, like, right? Like every company is doing things just a little bit like in a different way. Users are not exactly always following their rules. I mean, I do that like many times, like I get frustrated with the tool, and I just export the CSV and push it into a Google Sheet and just do what I want to do,
Starting point is 00:42:29 and that's all the day, right? And I'm sure you see that also in companies out there. In your opinion, how much tools that have to do with governance, that they have to do with data discoverability, and providing, let's say, this kind of infrastructure for the users inside the company and the company itself to have the best possible visibility on how data is getting used. What are the limits there, in your opinion? Do you think that we will... Is it a problem? Do you think that we have to still work to do with products like SelectStar
Starting point is 00:43:15 to provide more coverage around governance? Or is it like at the end, it's okay? We just have to accept that people will not always follow, let's say like the rules and like there are always going to be exceptions and that's fine, it's part of like designing the product and we are doing like to cover, let's say 80, 90% of like the use cases out there and always keep in mind that like some things might be a little bit different. Did I confuse you because I feel like I just talked too much, but let me know. Like I can rephrase the question.
Starting point is 00:44:00 Yeah, no, I can see where you are coming from. Like, I guess one question from here is, like, do we get to just, you know, accept the fact that people are going to do their own thing and it's not all understandable, discoverable, you know, whatnot. I think that given that we are still working with systems, still working with something that has like parsed and hence like, you know, has compiled and is out there. We see it as something that we can process. And our job is to process it and show you what it actually looks like so that's additional, like in the same level of like, you know, like if, like meaning if we can parse everything, what you've done, then what's the point of trying to like augment the automation by adding more, adding something manual, especially if you cannot maintain it. The big part of automation in the beginning is so that you can save a lot of time and you don't have to do this manually. At the same time, I think that actually the bigger ROI of automation is that you don't have to maintain it because you don't have to go and update or delete or add anything because your data model changes so fast. And the part that we recommend our customers to maintain more manually would be more around business processes, documenting business processes, you know, making sure that those are clearly defined
Starting point is 00:46:07 and has the domain data model that's connected to it. These are things that we will be able to automatically define for you. But in terms of lineage or usage, things like that, these are, I think, much better for us to lead it to the machine to figure it out. That's how I think about it. But it also means like, yes, we do have a lot of work to do to ensure that everything, you know, is being processed fully and correctly. Yeah.
Starting point is 00:46:44 Yeah. But that's a good thing because it means that there is an opportunity there, right? Like to build products and businesses. So that sounds like, like pretty big opportunity actually. So Eric, all yours. I think we are close to the buzzer as we usually say, but... Eric Bozidarzak Oh, I stole one of your typical questions. All right.
Starting point is 00:47:06 You stole my line. Oh, yeah, yeah, yeah, yeah. I'd like to take my revenge. Yeah. So, all yours. Shinji, actually, my question is sort of getting maybe to some of the practical implications of what you and Koss has just discussed.
Starting point is 00:47:26 And it's around, I would say, at what point you reach economies of scale in terms of coverage, and then what the relationship between ongoing work and new data coming in looks like typically because you know the one of the blessings and curses of modern data tools and sort of the you know way way more cost efficiency in terms of data storage right like it's it's never been cheaper to pull in a huge amount of any type of data and store it, right? And so at a lot of companies you see, even smaller companies, you see, you know, call it whatever you want. Well, I'll say like maybe large amounts of potentially like unnecessary data flowing into the system in part just because
Starting point is 00:48:25 it's technically very easy to fold the data in right and so when you think about you know cataloging and discovery it it seems like you kind of hit an economy of scale where you get like a certain amount of coverage but then there's always new data flowing into the organization. So what does that look like? Like, how do you manage that relationship? And is it a lot of ongoing maintenance? And, you know, what do you see on the ground with companies that are dealing with that? I mean, the fact that there are new data sets
Starting point is 00:48:59 being added every day and new models being created or dashboards being created every day is the pure reason why somebody would adopt SelectStar than other tools or other ways to manually track this metadata. Because every 24 hours at minimum, we will update all your metadata, all your lineage popularity, the way that it's being used. And you can always trust that whenever you're looking at SelectStar, it is up to date. And it has all the latest and greatest information.
Starting point is 00:49:37 It's really the only way to keep your metadata in one place. Only when it's really automated, in my opinion. And I think in terms of cost perspective, cost is moving through compute, right? It's cheaper to store it, but now you are running a query and that's going to cost you some amount. It's easy to schedule the query and it's going to be, you know, I don't know, $1, $2, but it's going to add up. So this is another reason why it's important to be aware of the usage patterns because we actually see a lot of customers ending up saving their cloud data costs because they end up finding these pipelines and these
Starting point is 00:50:32 materialized views and chemicals being created, fueling dashboards that are not being looked at anymore. Yep. And that can really only come through by having both lineage and usage model popularity together. So I think that's more of an ongoing effort and whenever it's an ongoing thing that you want to monitor or just refer to as the right place to go, then you would want that tool to be automated and is up to date with the state of the data, like, you know,
Starting point is 00:51:07 storage and, you know, consumption, wherever that's happening. Yep, that makes total sense. And so do you see a lot of select star customers, really, the automation runs and they're essentially just reviewing new information, right? It seems like there's just very little adjustment to the model. Twofold. First six months of a lot of customers using SelectStar, they use it if it's their Google for data.
Starting point is 00:51:41 They already have a lot of data. And through SelectStar, they are finding information about this table or that dashboard or this pipeline that they didn't know about before. You know, even you hear customers all the time saying that they've been using this data warehouse or they've been using Rooker for the last two years. And SelectStar is telling them things that they didn't know about for their own data model. And this is more of the value that we drive in the beginning
Starting point is 00:52:15 so that you are actually, you didn't have to do any work after connecting your database, but you're seeing a lot of new things that you didn't know about your data. And usually after three to six months, a lot of our users start using this data to either deprecate old dashboards and tables. They start putting some taxonomy together. They may add some documentation.
Starting point is 00:52:40 And this is where the real data management, data governance really starts happening. And they have a lot of cues, a lot of directions and hints to start off because they're getting this visibility and insight of how their data is being used today. Yeah, what a great answer. I don't think I asked that question very well, but you answered it perfectly in terms of sort of the lifecycle of usage and like when you get over the learning curve of sort of the initial economies of scale,
Starting point is 00:53:19 sort of understanding or learning about your data and then sort of taking action on it? Yeah, I think a lot of it is because like they get to find something new every time they come back to select star. So this actually fuels like really good user engagement cycle. And as they start inviting others, as they are referring to different select star links, they realize, oh, as I'm going to start sharing this with more of my data teams, this product manager and whatnot, I better also put some more documentation
Starting point is 00:53:55 or I better put some more context. Yep. Because we are referring to this anyway, because this already has more than half of the documentation automatically filled in and is being updated. Yep. Love it. Very cool. Super helpful.
Starting point is 00:54:13 Well, this has been such an interesting show. Love the concept of Google for data. You know, Google for your own data. I think that's, I mean, that sounds exciting, you know, even to me. So I know a lot of our listeners will be interested in it. So Shinji, thank you so much for taking the time to talk with us and teach us about data discovery. Awesome. Thanks for having me here.
Starting point is 00:54:37 This was fun. I think we went pretty deep on multiple subjects, but yeah, you guys have been great. What an interesting show. I think one of my big takeaways was I just appreciated how Shinji was, she had strong opinions about what types of things a machine should handle. And I know that's come up on the show a number of times,
Starting point is 00:55:02 but I'm just a huge fan of, you know, if you can remove like the unnecessary laborious parts of working from data from, you know, say the data engineer, you know, the analyst role, like it allows them to not only probably enjoy their work more, but be more valuable because they're spending time doing more valuable stuff. And so I appreciated that at the same time, it's a really challenging problem as evidenced by some of the technical things that you discussed, right? So just ending something to a machine, you know, doesn't always work out perfectly. So anyways, that was my big takeaway and what I'll be thinking about.
Starting point is 00:55:50 How about you? Yeah. I mean, like, it's very interesting to hear like how complex of a problem data governance is, and there are like so many moving parts and you have to synthesize like information from... and integrate information from so many different sources. So it's pretty hard to make something, to make a product that can actually help you in data governance, the experiences that are on the product that you have to build and like the, say the reliability of the technology underlying, it's kind of crazy.
Starting point is 00:56:31 I think it's a very big challenge and I have a feeling that like more and more companies out there that they position themselves like one way or another, like, but they are close to data governance, they will end up up in this at the end, in this category. Even if we want to take the broader category of governance and break it down in those smaller pieces, it'll be the same thing, and you need all of them in order to, at the end, deliver what the company is looking for, what the customer is looking for. Yeah.
Starting point is 00:57:01 So anyway, I'm looking forward to have like another episode with you in the future, like discuss more about all these challenges with like integrating with all the different sources and applications and how to pass all this information and how you integrate information and present it at the end to the user, that's like something that we didn't do that much, but I'd love like to continue the conversation with you in the future. Absolutely. All right. Well, thank you again for joining us on the Data Stack Show. Subscribe if you haven't, tell a friend and we will catch you on the next one.
Starting point is 00:57:37 We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
