Drill to Detail - Drill to Detail Ep.49 'Trifacta, Google Cloud Dataprep and Data Wrangling for Data Engineers' With Special Guest Will Davis
Episode Date: February 5, 2018

Mark Rittman is joined by Will Davis from Trifacta to talk about the public beta of Google Cloud Dataprep, Trifacta's data wrangling platform, and topics including metadata management, data quality and data management for big data and cloud data sources.

Links:
- Google Cloud Dataprep on Google Cloud Platform
- "Google Cloud Dataprep: Spreadsheet-Style Data Wrangling Powered by Google Cloud Dataflow"
- "A New Cloud-Based Data Prep Solution from Google & Trifacta"
- Trifacta website
- "A Breakthrough Approach to Exploring and Preparing Data"
- Trifacta platform architecture
- "Garbage In, Garbage Out: Why Data Quality Matters"
- "How to Put an Effective Metadata Strategy in Place"
Transcript
So hello and welcome to another episode of Drill to Detail, and I'm your host, Mark Rittman.
A few weeks ago you might have noticed a post on my Medium blog about Google Cloud Data Prep,
a new data wrangling tool I've been working with in the day job and at home in my own data feeds.
So I'm very pleased, therefore, to be joined this episode by Will Davis from Trifacta,
the vendor who many of you will know, and who actually partnered with Google to bring out Cloud
Data Prep. So Will, welcome to the show. And why don't you introduce yourself to our listeners?
Thanks, Mark. Thanks for having me on the show. It's great to be able to speak
with you today. Yeah, so my name is Will Davis. I head up product marketing at Trifacta. I've been
here for almost four years now, so quite a bit of time. I'm one of the elder statesmen in terms of
tenure at the company and have been in the data and analytics space for, you know, the past 10 years. So I've been involved in the market anywhere from data infrastructure to, you know, analytics and visualization.
And now, in my time at Trifacta, I kind of play in between, you know,
data platforms and the downstream consumption or visualization of data.
And yeah, happy to talk to you today.
Excellent.
So looking through your LinkedIn profile, as I'd always do when I have guests on here,
you have quite an interesting work history.
You worked at ClearStory Data, Greenplum, GoodData, and so on.
So quite a kind of, I suppose, an interesting set of companies there,
and all pretty cutting edge as well.
Yeah, so hopefully I don't get pigeonholed in data, even though it is my area of expertise. But yeah, I started my career at GoodData, in the data space. You know, at the time GoodData was getting started with business operations in the US, and had an engineering and development team based in the Czech Republic in two different locations. So that's where I got my start.
GoodData was really a software-as-a-service BI and data warehousing company
and saw just the struggles that organizations have to simply leverage data
to make decisions and to improve the efficiency of their business.
And then from there, moved on to Greenplum
and headed up the go-to-market that Greenplum had
into the big data space.
And the company had been acquired by EMC
and then was moving into the parallel processing space
with not only their parallel database,
but also entering into the Hadoop space.
So spent a good amount of time there
working both with Greenplum as an individual entity, but also with the broader EMC and VMware team.
And now that company has spun into what's now called Pivotal, which has been doing very well.
I think they've pivoted a lot more towards cloud at this point.
Then from there, I went to ClearStory Data, which was a Spark-based cloud data visualization product.
And now they do a little bit of data prep in their product.
And that was a great experience as well,
learning from their CEO, Sharmila Mulligan,
on launching a company and a lot about my function,
which is in product marketing and marketing.
And then from there, I've been at Trifacta for quite some time now.
Excellent. I was about to say, maybe introduce who Trifacta are and what you do,
but I noticed you've been all over the press recently
the last couple of days with your funding round.
So why don't you just tell us who Trifacta are
and what's the funding you've had recently
and what's the purpose of that and so on?
Yeah, so Trifacta, we are a data wrangling company,
or I think what is also referred to as self-service data preparation or data preparation.
The company was founded out of joint research that was taking place at Stanford and UC Berkeley.
So of the three founders, Joe Hellerstein is a professor in parallel systems and database technology at UC Berkeley.
He partnered with one of the experts in data visualization, Jeffrey Heer, who was a professor at Stanford. And Jeff was, you know, one of the inventors of d3.js, which if you're doing any
data visualization in the browser, you're probably leveraging D3. And they had a PhD student, Sean Kandel, who, you know, as part of his PhD project came up with this prototype called Data Wrangler.
And it was an interactive web-based data cleaning product that, you know, he had brought out during his PhD work at Stanford. And in the matter of a few months,
that product was accessed by tens of thousands of people and gained a lot of notoriety within
the data space. And so then they went on to, you know, raise some money from Accel and Greylock,
two of the top tier venture capital firms in Silicon Valley and started a company. And so, you know, Trifacta has really been the commercialization of that joint research
that Joe, Jeff, and Sean were working on.
And dating back even before Data Wrangler, Joe Hellerstein had a project called Potter's
Wheel that was initially started in 1997.
And the real focus of that was, how do you make data cleaning and data
preparation, structuring, all the work you need to do to get data ready for any type of analysis,
how do you make that more intuitive, more efficient, and also more interactive? So they
were looking at existing methods to do data preparation, whether it was based in code or whether it was based
in existing technologies such as ETL, and really focused on making a more visual, intuitive,
and efficient way to do that.
So that's really our focus.
And the company, upon our initial go-to-market, was really focused on the big data space.
So we were primarily focused on the Hadoop ecosystem, and going to market with the leading vendors in the Hadoop space, whether it's Cloudera, Hortonworks, MapR, and companies such as that.
Still a huge focus of our company, but we've continued to expand into cloud.
We have a desktop version that's free. We have a hosted cloud version as well, for smaller team, departmental use,
and recognizing that the needs around data preparation and more efficiency and getting
data ready to do something with it spans across any type of user, any type of data,
any type of environment, not just the big data world. I think the nice thing about starting with big data is that you're tackling the hardest,
most difficult environments and ecosystems to take on.
And so we've taken those learnings,
working with some of the world's most advanced organizations
and how they utilize data in some very large scale,
Hadoop based environments,
and then applying those learnings to ongoing development
and work and spreading out the product to different ecosystems. So I know you asked about
the funding that we had, or the announcement we had recently. So we did announce, actually yesterday
(so I've been working around the clock the past few weeks), a round of $48 million, which is going to be able to fuel us to accelerate our
growth over the next few years. And what was exciting about the round: we did
have a number of strategic investors, which I think was especially unique with this fundraise.
Companies such as Ericsson, Deutsche Börse, New York Life, and Google
were investors in the round in addition to some other venture capital and private equity firms.
But what's nice about those strategics is actually a few of them started as customers. So
New York Life, large scale insurance company, started as a customer of Trifacta's and then
recognized the opportunity that we were going after and the value that we were creating for their team and actually wanted to move forward with an investment.
Same thing happened with Deutsche Börse, the company based in Germany that manages the stock exchange there.
Similar to that, started as a customer, recognized the value of what we're doing and the market opportunity and decided to invest.
And then the other piece of that was Google. So Google is a company that you mentioned earlier
we've been partnering with
and have a collaboration around cloud data prep with them.
And we started that relationship
as collaborating on a joint product
that Google was bringing to market
within their cloud platform called Google Cloud Data Prep.
And through that experience,
they also were interested in investing in the
company too, and were part of this round of financing.
Fantastic. Well, I know from the product marketing people where I work that you must be pretty busy at the moment with the funding round going on, so thanks very much for coming on. And you mentioned there Cloud Dataprep. Now, I want to go into that in a bit more detail later on, but just again, for anybody that hasn't heard of that product, just kind of, I suppose, paint a picture. What is that? And how does it relate to the other data integration, data loading tools that you get with Google Cloud, like, say, Cloud Dataflow and that sort of thing?
Yeah, so Cloud Dataprep is a product, a service that you can use through Google Cloud.
So it's essentially the ability to access data that is in the Google
Cloud ecosystem. So the product supports Google Cloud Storage, so the file system on Google Cloud,
it also supports access to BigQuery. So you have the ability to actually access data that's in
Google Cloud, and explore data through Trifacta's interface.
So we've essentially embedded the Trifacta interface into the Google Cloud ecosystem.
So you're able to actually access, explore, and start wrangling data that lives within
Google Cloud.
And so that product is allowing you to sort of build up a wrangling workflow. If you have a multitude of data sets that are living within Google Cloud that you want to explore, clean, prep, join together, and then create some sort of output for doing some analysis, let's say in BigQuery, you'd be able to do that within our products.
And then we support Cloud Dataflow as a processing engine. So essentially, you're accessing data
through the interface that we've developed
and brought to the Google Cloud ecosystem,
build up a workflow of transformations
that you want to apply to that data,
and then that set of transformations
will run as an infinitely scalable job
through Cloud Dataflow on Google Cloud,
and then we'll be able to output
to Cloud Storage or BigQuery.
So it's essentially a more visual, more intuitive way to clean and prepare data within the Google
Cloud ecosystem.
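To make that flow concrete, here's a minimal sketch of what an equivalent hand-coded pipeline might look like with the Apache Beam Python SDK, which is what Cloud Dataflow executes. The project, bucket, and table names are illustrative assumptions, not anything from the episode.

```python
# Minimal sketch: read rows from BigQuery, apply one cleaning step,
# write the results back, running on Cloud Dataflow via Apache Beam.
# Project, bucket and table names below are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def clean_row(row):
    """One 'wrangling' step: trim whitespace and standardise case."""
    return {"name": (row.get("name") or "").strip().title()}

options = PipelineOptions(
    runner="DataflowRunner",              # execute as a Cloud Dataflow job
    project="my-project",                 # hypothetical GCP project
    temp_location="gs://my-bucket/tmp",   # hypothetical staging bucket
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromBigQuery(
           query="SELECT name FROM `my-project.raw.events`",
           use_standard_sql=True)
     | "Clean" >> beam.Map(clean_row)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:clean.events",
           schema="name:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
```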
Yeah, I mean, I use it every day at home as well in my spare time, which I always think
is a great endorsement of a product, really, if you do use it voluntarily.
And the great thing is that, effectively, every job you run
is just charged at the cost of the Dataflow jobs you run in the background.
And it interfaces with BigQuery.
It's about the only tool I can see around that's easy to use that links in with that.
So it's a really easy tool to use and quite pleasurable, really.
Yeah, so how's your experience been with using the tool?
I'd love to hear about it.
It's been good, yeah.
I mean, I've been using it to bring in feeds from things like Strava, bringing in feeds from all different places, really.
And I suppose we're going into this later on,
but actually making sure the data is standardised.
When it's things like, I suppose, fitness feeds,
you've got things like maybe weight readings that don't have a reading every day.
And so you're doing things like filling in the gaps between data
and then doing things like rolling up to the month and then looking at what the change month on month is.
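For illustration, that kind of gap-filling and monthly roll-up looks roughly like this in pandas; the file and column names are made up:

```python
import pandas as pd

# Hypothetical daily weight feed with gaps (not every day has a reading).
df = pd.read_csv("weight.csv", parse_dates=["date"]).set_index("date")

daily = df["weight_kg"].resample("D").mean()  # put readings on a daily grid
daily = daily.ffill()                         # fill gaps with the last reading

monthly = daily.resample("M").mean()          # roll up to a monthly average
month_on_month = monthly.diff()               # change month on month

print(monthly.tail())
print(month_on_month.tail())
```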
I mean, it's been brilliant.
And that's the reason I was quite keen to get you guys on this show, because it's a tool I use every day.
So, you know, I'm very impressed with it.
Great.
Yeah.
I always love to hear what types of data customers are using and how they're leveraging the tool.
So it's great to be able to hear your experience, and even
better that you're actually enjoying it too.
Yes, yes, yes.
I mean, we'll talk later on about that because I think it's an interesting tool.
It'd be interesting to see where it's going and so on.
But I mean, looking at data preparation as an industry or as a sort of a market sector,
I mean, this seems to come out of nowhere in the last couple of years.
And prior to that, it was just enterprise ETL tools,
but there was no tooling, I suppose,
that suited, I suppose, more power users,
more business users.
I mean, what led to the idea of this
and what market niche does it serve
and what user persona does it serve?
Yeah, it's funny.
We have a lot of conversations with companies.
So the first few years at Trifacta, we were really focused on evangelizing what data wrangling, or data preparation, or self-service ETL, even was. And so it took a lot of work to sort of build the category, or create the category, and also create clarity within the market
of what we were trying to do or what tools or vendors in this space
were really trying to do.
And I think there, because, you know, when you have technology trends
and then, you know, data prep becomes this hot thing
and then every vendor claims they do data prep.
Yeah, yeah, yeah. And there was a general awareness that in data science work and that sort of thing, a lot of the work, a lot of the time, went in preparing the data. So I think there was a kind of a niche there to be filled. But certainly, you know, it was a market dominated by ETL or scripting, wasn't it, really?
Yeah. So our focus is really, we want to go after the people that know the data best to do this work.
So I think if you look at ETL technologies and how that process works within an organization,
essentially you have some business person who has requirements around some data that they want to analyze,
or some end dashboard that they want to be able to develop. And they essentially have to go to their IT or
ETL developer with a set of requirements, hand them over, and then have that person,
when they find the time, implement those transformations and then build a data
mart or build some sort of end analysis that they can access. And there was just so much broken within that sort of handoff and process organizationally
that, you know, we saw a huge need.
Analysts, data scientists, data engineers, or even, you know, business people that,
you know, are data savvy, that understand how to use Excel or Tableau or tools like this, that wanted to be able to explore, prepare, and bring together data themselves and do
it in a very intuitive, efficient, and visual manner.
And the use cases you were talking about earlier in terms of recognizing nulls in your data
or data quality issues, and to be able to do that really quickly and easily in a visual
interface, we saw a huge need in the market for that. So I think the way we differentiate from legacy or traditional technologies would be, one,
that our users are different.
So that's probably the biggest difference.
The people that are using Dataprep or using Trifacta are not going to typically be your
ETL developer.
They're going to be data analysts. It's a more self-service
vision for how this work is done, one that sort of opens up this bottleneck that organizations face,
where you have only a few people doing the data prep work. We want to sort of broaden that out
and reduce this 80% stat that we use a lot: that 80% of any analysis is spent on data preparation.
And the other piece of this is that the data is different. So if you look at the data today that's coming in,
it's, you know, multi-structured; you can't sort of manage the schema. So you're
taking in data from outside sources, and it's always different, it's always coming in a different
structure, and it's more diverse. So you're handling data from all sorts of different files, databases, APIs, different maybe third-party sources as well.
And so the ability to quickly understand what's in that data and gain context for it so you can
then define how it can be leveraged for analysis is really critical. And that's one of the things
that we focus on a lot. We have an internal name for a use case
that we think about as this concept of data onboarding,
sort of taking external or unfamiliar data,
cracking it open in Trifacta,
and then setting up rules of how you want to prepare that data
or blend it together with other data sets
that you might want to use downstream for analysis.
And I think the other thing that's really different today
is that the speed of business
and the speed of how you need to react to data is just so much faster than it was maybe
five or ten years ago. And so organizations are prioritizing speed, and they're doing that
in any means necessary. And that's essentially what our tool is developed for is we're trying
to make the process of taking something that's raw or diverse and putting it in some sort of
standardized format that you can then use for data visualization, use for machine learning, or use for data science downstream.
Yeah, so I guess data lakes and startups and all those kinds of, you know, use cases and companies are the obvious kind of users of this. But I suppose the other thing is the rise of the idea of data engineers who want to code everything themselves. I mean, is that, to your mind, a kind of competitor? Is that idea a competitor to what you're doing, or is it complementary? I mean, what's your view on data engineering in that sort of area?
Yeah, I think we see the role of the data engineer becoming more critical
within the organization and we do see use cases of Trifacta for them.
I think the one thing we differentiate or how we view data engineers,
it'd actually be interesting to get your input on that,
is I think we view data engineers as individuals
within organizations that move big blocks around,
whether it's systems, whether it's big databases
or even data sources, move those big blocks around and then provision
data, provision systems so that end users can have self-service access to them.
In a lot of cases, data engineers will need to do some provisioning of data into a certain
format that their end users can leverage.
So there might be some initial cleaning or preparation that then they can provide to
their teams that then their teams can go on and begin using and, you know, doing their
work in a self-service fashion.
But, you know, I think that it depends case by case and organization by organization,
the skill set of the team that they're working with.
But I think the biggest differentiation would be that the data engineers
are the ones that are handling large-scale systems or large-scale databases,
and provisioning all of that
so that end users can have access to it.
So, you know, a lot of cases we do have data engineers using our technology.
We love that.
I mean, we're not trying to say that we don't want them to use it.
We see definitely use cases and value for them,
and they see it as well.
But moving that sort of provisioned data
into something that's going to be useful downstream
is probably more where you'll see our sweet spot
with the end users of the data
that the data engineers are provisioning.
Does that make sense at all?
Yeah, I mean, to take my kind of use case
where I work at the moment, Qubit, I mean,
I as a product manager, a technical product manager, would use Cloud Dataprep to maybe do
something that's more tactical, or more driven by business requirements. Or maybe it's
to do with a new customer coming on board, and we're bringing on some new files from them,
and it's more of a kind of one-off job, really, where we want to be using BigQuery and Cloud Dataflow in the background, but we don't necessarily want to
be coding it and so on. Whereas the engineers would be more likely to use, I don't know, Airflow
or something like that, or Dataflow itself, building something that's more of a
kind of, I suppose, engineering requirement that's going to last for a long time and so on.
So it's more, I suppose, tactical and business-focused versus
engineering-focused, and maybe a system that's going to be around for a while, really.
Yeah, it's interesting you say that because I think our focus initially has really been those
ad hoc exploratory types of use cases, right? And one of the things we are looking to really
not only develop in the product more effectively, but also evangelize more, is the operationalization of workflows.
So it's funny that you said, hey, I view Trifacta or Cloud Dataprep as this ad hoc, exploratory type of thing.
And, you know, that's exactly what, you know, we get used for a lot.
But we also want to make sure that once you actually define a workflow or define some job that is really valuable for your organization, that you can actually set that on a schedule and you can parameterize that.
You can version that.
You can get monitoring and alerting on that.
You can get performance statistics on how those jobs run.
So sort of the enterprise hardening and operationalization of transformation workflows is definitely something that we want to be able to take on beyond just the ad hoc nature of our technology as well.
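As a sketch of what that operationalization means when you hand-code it, here is roughly how a scheduled, monitored run might be declared in Apache Airflow (a tool that comes up elsewhere in this conversation); the DAG name, the CLI command, and the alert address are all hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: run a saved transformation recipe nightly, with
# retries and failure alerting (the scheduling, parameterization and
# monitoring described above).
with DAG(
    dag_id="nightly_wrangle_run",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
        "email_on_failure": True,
        "email": ["data-team@example.com"],
    },
) as dag:
    run_recipe = BashOperator(
        task_id="run_recipe",
        # Hypothetical CLI, parameterized by the execution date.
        bash_command="run_wrangle_job --recipe sales_prep --run-date {{ ds }}",
    )
```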
Yeah, definitely.
I mean, I actually do use your product on a schedule at home.
So for my own data flows, my own kind of aggregation of data, putting it into a fact table, things like that,
I actually use the scheduling feature to run that. I think it runs overnight, or every
few hours or whatever. So yes, that is there as well, really. In fact, I've probably got
one fairly complex kind of, I suppose, workflow there, where it has multiple steps going
in, and each bit then aggregates
another bit as well. I mean, coming from my background, I understand that, but certainly
it can do that as well. And the fact that it can interface with Cloud Storage is useful
as well. I mean, it's a good product, and especially the way it's
charged: the fact that it's only charged at the cost of the Dataflow job underneath there
is fantastic, really.
Yeah, I mean, during the private
beta and public beta with Cloud Dataprep, we wanted to make sure that we were pricing for
mass adoption, and, you know, making sure that we get feedback and get people using the
technology. And so far that's been tremendous. And I guess that's how the whole genesis of
this conversation started, right? I saw your blog and reached out, and glad we're here today.
Yeah, fantastic.
But one thing I'm conscious of, though,
is that you're more than just Cloud Dataprep.
And I think it's what also got me interested
was that I knew of your company name beforehand,
and obviously knew of the market before.
And, you know, looking at what you do
and your products beyond that,
it's interesting to think what your differentiators
and what other kind of product areas you work in as well.
I mean, just for the benefit of the listeners,
what are the, I suppose, the unique differentiators for Trifacta's technology compared to the competition?
I mean, things like the pluggable engine, that sort of thing.
How does it work, really?
Yeah, so I would start with architecture is one thing.
You talk about Cloud Dataprep and why Google selected Trifacta.
It started with architecture.
So one of the unique things in our architecture
is we are abstracting the logic you're generating
in the application.
So when you're building wrangling recipes
and different transformation steps,
that gets abstracted into our own language,
which is called Wrangle.
It's a domain-specific language for data transformation. And so the interface and the language are consistent across
any environment. So you can use Trifacta on your desktop running against a single desktop machine.
You can run Trifacta in a completely parallel environment. And the interface, the workflow, the logic you're creating as part of that,
as part of using the product is completely consistent.
And we just are able to plug into different environments
depending upon where your data resides
or depending upon where you're using the product.
So the same recipe or workflow you generate using our free product, Wrangler,
would be completely transferable to an infinitely scalable environment
on Cloud Dataflow.
And so that's one of the unique aspects of our architecture.
And it was one of the compelling points when Google was evaluating different
data preparation and ETL providers to
partner with around this Cloud Dataprep product: they saw that our architecture was so unique,
and fit so well into the Google Cloud ecosystem, and that we were able to simply plug into
Cloud Storage, BigQuery, and Dataflow so seamlessly and quickly, that it was a huge differentiator for us.
So I would say architecture is definitely
one of the key elements of that.
And we're able to take recipes and run them
on a desktop using our own Photon engine,
or in Amazon using Spark on EMR,
in an on-prem Cloudera cluster leveraging Spark,
or in Google with Cloud Dataflow.
And we support Azure as well.
So in any environment: same logic, same metadata,
same workflow. It's just completely pluggable.
And as the world becomes more cloud-centric,
more hybrid, multi-cloud,
this interoperability is really key
in terms of allowing organizations to have confidence
that regardless of what happens on the computing side
or on the downstream analytics side,
that we're able to plug in and be able to future-proof their investments in
Trifacta, which is really nice.
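To illustrate the idea of that pluggable architecture (and only the idea; this is not Trifacta's actual Wrangle language or internals), here is a toy sketch where a recipe step is held as an engine-neutral description and compiled to either pandas or Spark:

```python
from dataclasses import dataclass

import pandas as pd

# An engine-neutral description of one recipe step.
@dataclass
class SplitStep:
    column: str
    delimiter: str
    into: tuple

def run_pandas(step, df):
    """Compile and run the step against a pandas DataFrame."""
    df[list(step.into)] = df[step.column].str.split(step.delimiter, n=1, expand=True)
    return df

def run_spark(step, df):
    """Compile and run the same step against a Spark DataFrame."""
    from pyspark.sql import functions as F
    parts = F.split(df[step.column], step.delimiter)
    return (df.withColumn(step.into[0], parts.getItem(0))
              .withColumn(step.into[1], parts.getItem(1)))

# The same declarative recipe can target either engine.
recipe = [SplitStep(column="full_name", delimiter=" ", into=("first", "last"))]
df = pd.DataFrame({"full_name": ["Ada Lovelace", "Alan Turing"]})
for step in recipe:
    df = run_pandas(step, df)
print(df)
```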
I mean, I would also say I would be remiss if I didn't mention the user
experience, and how we leverage machine learning to sort of guide users through
the transformation process.
I mean, one of the light bulbs that went off for me
when I first saw the product was the ability
to simply interact with data through dragging and
clicking on different elements of your data.
And then those simple interactions
with elements of your data,
whether it's the delimiter,
whether it's a data quality issue,
kick off all of these suggestions of,
hey, do you want to delete this element? Do you want to drop this? Do you want to extract this?
Do you want to split here? We prompt all of these suggestions based on simple interactions in the interface
that users can then choose from and then, you know, build their workflow through just clicking,
interacting with data, which I think is a huge differentiator.
And to be able to get feedback and previews of how each of these transformations are impacting the data in real time immediately.
And so that's a huge difference from some of the other approaches to this problem. And if you even
look at ETL processes, where you have to sort of set up a whole process, run the job, and then view
the results at the end of that,
you're actually constantly validating
every single step you're building in our interface
through immediate feedback
of how each transformation step
would actually impact the data.
Yeah, definitely.
I mean, just to say kind of how that works,
I mean, you can take a, you know,
you've got a column of data
and there's maybe sort of a few characters of it you want.
There's a space there or something,
or there's some kind of delimiter.
You just kind of drag your mouse over that
and just highlight the bit you want.
And then you get a series of suggestions back saying,
do you want to split on this column?
Do you want to split it into these things here?
Do you want to do this?
Do you want to do that?
And particularly over things like date data types
or anything really where you can see visually
how you should split it, how you should work with it.
But to actually code that as SQL,
particularly when you're working with BigQuery, where you've got legacy SQL and standard SQL and the
differences in the syntax there.
I mean, it's just fantastic the way that works, really.
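For a flavour of what hand-coding that one suggestion involves, here is a sketch using the google-cloud-bigquery Python client with standard SQL; the project, dataset, and column names are invented:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# The hand-coded equivalent of "split this column on the first space":
# in BigQuery standard SQL, SPLIT() returns an ARRAY you index with OFFSET.
sql = """
SELECT
  SPLIT(full_name, ' ')[SAFE_OFFSET(0)] AS first_name,
  SPLIT(full_name, ' ')[SAFE_OFFSET(1)] AS last_name
FROM `my-project.demo.people`
"""

for row in client.query(sql).result():
    print(row.first_name, row.last_name)
```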
Yeah, I mean, that's definitely one of the unique aspects of that.
And if you think about an efficiency gain from the process in its own right, I mean,
you are constantly iterating, constantly iterating. And that fast iteration has been proven to be the key to efficiency. If you look at test-driven
development or any other approaches to whether it's building software or things of that nature,
this constant iteration, constant testing, constant feedback loop has been proven to be
both more efficient and also providing higher quality. And so we're providing this
within the data wrangling space, or within the data preparation space, and that allows users to move
more quickly and have more accuracy in the work that they're doing.
Okay. So obviously you work in product marketing, and, I mean, as you know, one of the things that's important about that
is knowing what, I suppose, your place in the room is, what your part of the market is.
And, I suppose, there's always a temptation to expand further and so on.
And looking at, I suppose,
the competition that's out there,
you know, competitors to you have,
I suppose, broken out from that space
to do other things.
You've got competitors that would, say,
add analytics into it.
So they might start with a data integration
and preparation and then start to add analytics in.
Is that something you guys have thought about?
Or is there a reason why you stick
with what you're doing at the moment?
Yeah, so we've been pretty focused
in just saying, hey, we're the best of breed
data wrangling product.
We continue to be that.
And we want to make sure that
from not only a product strategy perspective,
but also from a go-to-market strategy,
we want to make sure that interoperability
is key. So, you know,
we have really tight
integrations with, you know,
the platforms we deploy on, whether that's
Cloudera, Hortonworks,
Amazon Web Services,
Microsoft Azure, and obviously
Google Cloud, or the downstream
analytics, machine learning,
or visualization technologies that we would support, whether that's, you know, Tableau, Qlik, a company like DataRobot,
you know, Domino Data, and various others.
And then also you see in the data cataloging space, a number of, you know, companies pop
up that are gaining popularity, companies like Alation, Waterline Data, Collibra.
And so we want to play Switzerland, we want to interoperate with all those technologies and recognize as, you know, a growing but still relatively, you know, small company.
We're about, you know, five years, six years old at this point, you know, really maintaining our focus on data wrangling.
And we see a huge market opportunity there. And we also see just, you know, a lot of challenges that we are continually trying to take on and, you know, build features for and build product for.
And so that right now, you know, we get that question asked a lot.
Do you eventually see yourself going into analytics or data cataloging?
And right now we're primarily focused on or exclusively focused on data wrangling.
And I don't see that changing for quite some time.
Yeah, I suppose data cataloging is interesting, isn't it?
I mean, I think there are vendors out there, as you say, that are looking at this.
And particularly, I suppose, with the work you've been doing around machine learning and trying to suggest potential wrangles and so on there.
What's your thoughts on how that market might evolve?
And I suppose just to define this, really: it's to try and help users to, I suppose, infer
the meaning of data, and catalog it for them, and so on. Is that an area that you think could be
interesting for innovation in the future?
Oh, absolutely. I think those companies in that space are, you know, doing very well from what I understand, and also are poised to grow even
further. I think that the critical
piece, as we view it, is that data catalogs have to be independent. They have to plug into every
platform and application; they can't be tied to a single process, let's say, like data wrangling.
And the reason we believe that is you would create a silo in terms of a catalog, and if you
have application-specific catalogs for every application of data, then you're just creating more and more silos and more and more data governance issues.
So we believe that having a centralized data catalog that is not tied to visualization or data prep or data science, but is exclusively focused on cataloging, is critical. So we partner with those vendors that are doing that, because, you know, you have to
make sure that if you have a data catalog, that has to be a centralized point of truth
and it can't be just creating another silo.
But, you know, from the value of that from an end user's perspective: I mean,
being able to pick up a data set and understand who else is using it, what that data set has in it, where it's being used in different types of analysis, and what the trust score is, or how you validate that the data is accurate,
I think it's tremendously valuable and definitely see a huge need for that and increasing need for that as this space matures.
Yeah, I guess the flip side of focus, though, is that you potentially become a feature of
something else, or you become considered a feature of something else. And, I suppose, for
example, I think it's Tableau that have added data wrangling features to their BI product
as well. How do you kind of position what you're doing compared to that, and what's your
view on vendors that just add it as a feature into their product? Yeah, so I mean, Tableau coming out with Maestro,
their Dataprep product.
One, we find that incredibly validating.
I mean, going to Tableau's conference,
I think two years ago,
and seeing data wrangling everywhere.
I mean, it was great.
Our team was just saying, this is awesome.
It's free promotion for us,
and validating that this is a need.
So it was great.
And we're friends with the Tableau team.
I mean, Pat Hanrahan, who is one of the founders of Tableau, is very tight with Jeff Heer, who is another Stanford guy.
And we're really close with the executives and founding team at Tableau and will continue to be. I think, you know, similar to the idea around
data cataloging, you know, we see diversity of inputs and diversity of outputs in wrangling.
And a lot of our customers will have Tableau downstream, but they'll also have Qlik. They'll
also have, like, MicroStrategy, or another BI tool. I mean, within single departments you could have, you know, 10 different analytics or visualization tools that
are being used downstream. And so, you know, our ability to, once again, be able to support
diversity of inputs, so whether it's files, databases, you know, cloud storage, things
like that, and also support diversity of outputs, multiple downstream analytics or consumption applications, is critical to us.
So I think if you are tying your data prep process to a single application or downstream use,
then it's very limiting.
It's also, you know, not what we're seeing as the uses of our technology.
I think almost every customer we deal with is outputting the results of Trifacta
into multiple different technologies or repositories,
so only supporting a select few,
or a single analytics application,
is not the usage that we're seeing
dominate the market at this point.
So I suppose the only criticism I've got of Cloud Dataprep
is that it only connects to BigQuery and to Google Cloud Storage.
I mean, is that something that will, and obviously you can't talk too much about roadmap and so on,
but is that something that you envisage maybe extending to things like, you know, other parts of the Google cloud ecosystem?
Or is it going to be a case that anything beyond that, you go to your main products, really?
Yeah, I mean, so one, we'd love for you to talk to us directly.
If you have uses for Trifacta outside of Google Cloud,
we'd love to be able to start a conversation
and figure out how we can help you.
We're having conversations with Google now
in terms of the future of that product, where it's going.
We are going to GA Cloud Dataprep in the next few months,
and, you know, are discussing plans for that,
and also plans for the eventual features
that Cloud Dataprep will have.
So I can't share a ton there.
But what I can share is that,
hey, if you're using Cloud Dataprep
and enjoying it, and have other data sources
or use cases that you want to take
on, we'd love to have a conversation with you.
So just to be clear then, the product's
in beta at the moment, isn't it? So it will be
GA soon and everything we're saying now may
well change at some point and so on there.
It's great there's a public beta as well,
which is good. So
if a customer now had, say,
a system built
in Cloud Dataprep and they were looking to, say, transition to
the full product from you, I mean, obviously it would mean porting bits and so on, but
how much work would be involved in that? And conceptually, is it quite similar? Is it a big
task to do that, really?
It's pretty easy, actually. I mean, that's part of our
architecture. The uniqueness of it is that every workflow, every wrangling
recipe you create within Cloud Dataprep, you can seamlessly run in any other environment.
So whether that's an on-prem environment, a different cloud-based environment, or in
sort of your own Trifacta instance hosted on Google Cloud,
that can seamlessly plug in there.
So it's not an invasive process by any means.
It's simply just porting over all the workflows
and metadata you've generated in that product
into a different instance of it.
Yeah, I mean, looking myself at some of the other products,
it seems it's the same language between them.
You just export the scripts and so on there.
So it seems one of the easier migrations, I think, or upgrades that I could have seen out there.
So that looks quite good.
I mean, just to kind of move on: reading through the Trifacta website blog,
there's quite a few good, I suppose, thought leadership posts there, and things about, I guess,
what you're thinking about and maybe problems that your company is looking to solve in the future.
And it'd be interesting to talk through a couple of those with you, just to get your views on, I suppose, where
the market's going and where you guys are going. And you mentioned that one of the posts you had
was about data quality, and data quality for new-world data sources. You know, is this something that
is becoming an issue now, or that people are becoming more aware of? And do you think, I suppose, the big data world
has got away with it so far, a little bit? I mean, what's your view on that?
Yeah, so obviously as a data cleaning technology, data quality is
really important to our customers and users. And I think one of the use cases where that shines through
quite a bit, a use case we have stumbled across that has been
one of our more dominant ones, is around compliance. So, you know, global banks, banks in
the US, banks in Europe that we have worked with, you know, have to submit data to different
regulatory bodies to make sure that they are in compliance with government regulations, that they,
you know, are able to do stress tests, that they have a certain amount of money set aside to be able to deal with different global events. And they need to be able to show any transformation or manipulation of input data that they performed in the process
to then output the results that they're giving to these regulatory bodies.
And if you're a head of the bank or if you're a head of compliance at these banks,
which the compliance groups of the banks have been growing significantly
over the past few years, you want to make sure that you are very, very confident
that the data that you're submitting to these government agencies is accurate and you have transparent lineage on that. So, you know,
I think in those use cases, data quality is incredibly important. And, you know, I think
even more broadly, I think if you have use cases around marketing or things like that,
I think they might not need pixel-perfect data as the result.
They're more sort of optimizing for speed,
and speed of results.
But even then, you want to make sure
that the data you're reporting against
is actually accurate.
And I think within organizations,
a lot of people have lost confidence
in terms of the data that is being brought to them
as the sort of, hey, this is the published analysis
that we all validate.
If you actually don't have visibility
into how someone came up with those numbers,
the different data sources that made up that analysis,
then it's impossible to sort of get buy-in on that.
So I think one of the unique elements of Trifacta
that we like to preach is that, in a workflow,
you can see not only, at a
high level, all the different data sets that made up the end analysis, but also every single
transformation step that was applied to the data to get that result.
And so if someone ever has questions around how you got to an analysis, you can simply show
them the workflow and the different recipes that made up that workflow, to
sort of show them what you did.
So I guess that's probably fairly topical with GDPR now.
I mean, I think that's something where how data was combined
and calculated and the algorithms involved in it and so on
is particularly relevant at the moment, isn't it?
Yeah, I mean, what's the date, May 25th?
Yes, yeah.
We're so obsessed with Brexit over here
that we've forgotten everything else that's going on. But I think that's the other day of Armageddon, I think, for
our financial services industry over here.
Yeah, so we've obviously seen a lot of interest in
leveraging our technology for the GDPR-type use cases. I think, you know, some of the things that
I think are interesting, that we have talked about internally, is how do you leverage Trifacta to get recommendations on what data might be sensitive within, you know, a table or a file,
and then, you know, allow them to mask that data or remove that data from certain repositories or
things like that. So it is something we have definitely had conversations around and are looking
at, and, you know, feel like there's quite a few different opportunities for us there,
but we're also being very careful with how we dive into that,
because I know there's a lot going on there,
and making sure that whatever offerings we provide
or solutions we provide are well thought out and also hardened.
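Purely as an illustration of the kind of sensitive-data detection and masking being discussed (not Trifacta's implementation), a regex-based sketch might look like this:

```python
import re

# Rough patterns for two common kinds of sensitive value.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def flag_sensitive_columns(rows):
    """Recommend columns that look sensitive, based on sample values."""
    flagged = {}
    for row in rows:
        for col, value in row.items():
            for kind, pattern in PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    flagged.setdefault(col, set()).add(kind)
    return flagged

def mask(value):
    """Mask a value, keeping only a hint of its shape."""
    return value[:2] + "*" * max(len(value) - 2, 0)

sample = [{"name": "Ada", "contact": "ada@example.com"},
          {"name": "Alan", "contact": "+44 20 7946 0000"}]
print(flag_sensitive_columns(sample))  # e.g. {'contact': {'email', 'phone'}}
print(mask("ada@example.com"))         # ad*************
```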
Yes.
Yeah. On your blog as well, you mentioned metadata strategies and master data management. And I guess that's an
interesting topic to me, because coming from, I suppose, the old corporate ETL world, that
was very much top of the table: you'd be talking about that in enterprise architecture meetings. But
in the new world I work in, where it's all startups and so on, you know, getting anyone to listen and talk about metadata is hard. But it's actually important. So where do you guys
think that's important, and where do you think you might be able to contribute to that a little bit?
So it's funny. People at Trifacta, I think a lot of them were ex-Informatica employees.
And it's really interesting to hear.
Everyone's an ex there, aren't they?
Yeah, it's interesting.
It's really interesting to hear their takes on master data management and this single source of the truth.
They actually probably have a lot of horror stories, and are less believers than you might think.
At the same time, you know, metadata is obviously critical.
You know, understanding the context for the different data you're looking at is incredibly important.
And a lot of the work that is done in Trifacta is actually defining metadata.
So if you have a raw JSON file and you're having to define
rows and columns out of that and what those rows and columns mean, a lot of the wrangling process is generating metadata related to different attributes within a data source.
So we have a lot of features within the product to allow users to recognize, hey, this is a time-based element.
This is a geographic element.
These are different data types. And then, with the integration with data catalogs I
talked about earlier, you can understand business context for how that data is used or what is the
makeup of a single data set. So I think we're less concerned around having a single source of truth
or master data management and having data dictionaries in that sense, not really our focus,
but making sure that users are able to define metadata related to their data
and be able to publish that so that other users,
other applications can read that and understand that is really critical.
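As a small sketch of that "defining rows and columns out of raw JSON" step, in pandas, with an invented feed structure:

```python
import pandas as pd

# A hypothetical raw JSON feed: nested objects, no predefined schema.
records = [
    {"user": {"id": 1, "city": "Leeds"}, "ts": "2018-01-05T08:30:00", "value": 71.2},
    {"user": {"id": 2, "city": "Berlin"}, "ts": "2018-01-05T09:10:00", "value": 68.4},
]

# Flatten the nested objects into columns; this is where the metadata
# (what each row and column means) starts to get defined.
df = pd.json_normalize(records, sep="_")

# Declare types: a time-based element, a geographic/categorical element.
df["ts"] = pd.to_datetime(df["ts"])
df["user_city"] = df["user_city"].astype("category")

print(df.dtypes)
```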
Okay, excellent.
Well, look, we're almost out of time now.
So just where would anybody, where would people find out about Cloud Dataprep and also your products?
Where on the web is there material online, that sort of thing?
Yeah.
So for Trifacta,
you know,
we have website trifacta.com.
We have a pretty big presence on LinkedIn,
on Twitter.
So,
and I think Facebook too.
So feel free to,
to,
to,
you know,
join us in those different social media outlets
and you'll get the latest and greatest
of what we're doing.
And also interesting articles or blogs
from different people within the organization
that you might find valuable.
Cloud Dataprep, we do talk about it on our website,
but it's also on the Google Cloud website.
I think it's, what, Cloud Dataprep,
or googleclouddataprep.com.
So yeah, once again, it's in public beta. So you can go sign up and use that product if you're a Google Cloud customer.
And then if you're interested in using Trifacta too, we have a free downloadable desktop version
of our product that can handle up to a hundred megabytes and that's free for as long as you'd
like. So there's no time-based limit.
You can go to trifacta.com,
download Wrangler, and get going in a
matter of minutes. That's great. Well, Will,
thank you very much for coming on the show. It's been great to speak to you.
Have a nice rest of the day, and
yeah, it's been good to speak to you.
Yeah, Mark, it was a pleasure. Thanks for having me on.
Thank you. Cheers. Thank you.