Drill to Detail - Drill to Detail Ep.49 'Trifacta, Google Cloud Dataprep and Data Wrangling for Data Engineers' With Special Guest Will Davis
Episode Date: February 5, 2018

Mark Rittman is joined by Will Davis from Trifacta to talk about the public beta of Google Cloud Dataprep, Trifacta's data wrangling platform, and topics including metadata management, data quality and data management for big data and cloud data sources.

Links:
- Google Cloud Dataprep on Google Cloud Platform
- "Google Cloud Dataprep: Spreadsheet-Style Data Wrangling Powered by Google Cloud Dataflow"
- "A New Cloud-Based Data Prep Solution from Google & Trifacta"
- Trifacta website
- "A Breakthrough Approach to Exploring and Preparing Data"
- Trifacta platform architecture
- "Garbage In, Garbage Out: Why Data Quality Matters"
- "How to Put an Effective Metadata Strategy in Place"
Transcript
So hello and welcome to another episode of Drill to Detail, and I'm your host, Mark Rittman.
A few weeks ago you might have noticed a post on my Medium blog about Google Cloud Data Prep,
a new data wrangling tool I've been working with in the day job and at home in my own data feeds.
So I'm very pleased, therefore, to be joined this episode by Will Davis from Trifacta,
the vendor who many of you will know, and who actually partnered with Google to bring out Cloud
Data Prep. So Will, welcome to the show. And why don't you introduce yourself to our listeners?
Thanks, Mark. Thanks for having me on the show. It's great to be able to speak
with you today. Yeah, so my name is Will Davis. I head up product marketing at Trifacta. I've been
here for almost four years now, so quite a bit of time. I'm one of the elder statesmen in terms of
tenure at the company and have been in the data and analytics space for, you know, the past 10 years. So I've been involved in the market anywhere from data infrastructure to, you know, analytics and visualization.
And now, in my time at Trifacta, I kind of play in between, you know,
data platforms and the downstream consumption or visualization of data.
And yeah, happy to talk to you today.
Excellent.
So looking through your LinkedIn profile, as I'd always do when I have guests on here,
you have quite an interesting work history.
You worked at ClearStory Data, Greenplum, GoodData, and so on.
So quite a kind of, I suppose, an interesting set of companies there,
and all pretty cutting edge as well.
Yeah, so hopefully I don't get pigeonholed in data, even though it is my area of expertise. But yeah, I started my career at GoodData, in the data space. You know, at the time GoodData was getting started with business operations in the US, and had an engineering and development team based in the Czech Republic in two different locations. So that's where I got my start.
GoodData was really a software-as-a-service BI and data warehousing company
and saw just the struggles that organizations have to simply leverage data
to make decisions and to improve the efficiency of their business.
And then from there, moved on to Greenplum
and headed up the go-to-market that Greenplum had
into the big data space.
And the company had been acquired by EMC
and then was moving into the parallel processing space
with not only their parallel database,
but also entering into the Hadoop space.
So spent a good amount of time there
working both with Greenplum as an individual entity, but also with the broader EMC and VMware team.
And now that company has spun into what's now called Pivotal, which has been doing very well.
I think they've pivoted a lot more towards cloud at this point.
Then from there, I went to ClearStory Data, which was a Spark-based cloud data visualization product.
And now they do a little bit of data prep in their product.
And that was a great experience as well,
learning from their CEO, Sharmila Mulligan,
on launching a company and a lot about my function,
which is in product marketing and marketing.
And then from there, I've been at Trifacta for quite some time now.
Excellent. I was about to say, maybe introduce who Trifacta are and what you do,
but I noticed you've been all over the press recently
the last couple of days with your funding round.
So why don't you just tell us who Trifacta are
and what's the funding you've had recently
and what's the purpose of that and so on?
Yeah, so Trifacta, we are a data wrangling company,
or I think what is also referred to as self-service data preparation or data preparation.
The company was founded out of joint research that was taking place at Stanford and UC Berkeley.
So of the three founders, Joe Hellerstein is a professor in parallel systems and database technology at UC Berkeley.
He partnered with one of the experts in data visualization, Jeffrey Heer, who was a professor at Stanford. And Jeff was, you know, one of the inventors of d3.js, which if you're doing any
data visualization in the browser, you're probably leveraging D3. And they had a PhD student, Sean Kandel, who, you know, as part of his PhD project came up with this prototype called Data Wrangler.
And it was an interactive web-based data cleaning product that, you know, he had brought out during his PhD work at Stanford. And in the matter of a few months,
that product was accessed by tens of thousands of people and gained a lot of notoriety within
the data space. And so then they went on to, you know, raise some money from Accel and Greylock,
two of the top tier venture capital firms in Silicon Valley and started a company. And so, you know, Trifacta has really been the commercialization of that joint research
that Joe, Jeff, and Sean were working on.
And dating back even before Data Wrangler, Joe Hellerstein had a project called Potter's
Wheel that was initially started in 1997.
And the real focus of that was, how do you make data cleaning and data
preparation, structuring, all the work you need to do to get data ready for any type of analysis,
how do you make that more intuitive, more efficient, and also more interactive? So they
were looking at existing methods to do data preparation, whether it was based in code or whether it was based
in existing technologies such as ETL, and really focused on making a more visual, intuitive,
and efficient way to do that.
So that's really our focus.
And the company, upon our initial go-to-market, was really focused on the big data space.
So we were primarily focused on the Hadoop ecosystem, and going to market with the leading vendors in the Hadoop space, whether it's Cloudera, Hortonworks, MapR, and companies such as that.
Still a huge focus of our company, but we've continued to expand into cloud.
We have a desktop version that's free. We have a hosted cloud version as well, for smaller team, departmental use,
and recognizing that the needs around data preparation and more efficiency and getting
data ready to do something with it spans across any type of user, any type of data,
any type of environment, not just the big data world. I think the nice thing about starting with big data is that you're tackling the hardest,
most difficult environments and ecosystems to take on.
And so we've taken those learnings,
working with some of the world's most advanced organizations
and how they utilize data in some very large scale,
Hadoop based environments,
and then applying those learnings to ongoing development
and work and spreading out the product to different ecosystems. So I know you asked about
the funding that we had, or the announcement we had recently. So we did announce, actually yesterday
(so I've been working around the clock the past few weeks), a round of $48 million, which is going to be able to fuel us to accelerate our
growth over the next few years. And what was exciting about the round: we did
have a number of strategic investors, which I think was especially unique with this fundraise.
Companies such as Ericsson, Deutsche Börse, New York Life, and Google
were investors in the round in addition to some other venture capital and private equity firms.
But what's nice about those strategics is actually a few of them started as customers. So
New York Life, large scale insurance company, started as a customer of Trifacta's and then
recognized the opportunity that we were going after and the value that we were creating for their team and actually wanted to move forward with an investment.
Same thing happened with Deutsche Börse, the company based in Germany that manages the stock exchange there.
Similar to that, started as a customer, recognized the value of what we're doing and the market opportunity and decided to invest.
And then the other piece of that was Google. So Google is a company that you mentioned earlier
we've been partnering with
and have a collaboration around cloud data prep with them.
And we started that relationship
as collaborating on a joint product
that Google was bringing to market
within their cloud platform called Google Cloud Data Prep.
And through that experience,
they also were interested in investing in the
company too, and were part of this round of financing.
Fantastic. Well, I know from the product marketing people where I work that you must be pretty busy at the moment with the funding round going on, so thanks very much for coming on. And you mentioned there Cloud Dataprep. Now, I want to go into that in a bit more detail later on, but just again, for anybody that hasn't heard of that product, just kind of, I suppose, paint a picture. What is that? And how does it relate to the other data integration, data loading tools that you get with Google Cloud, like, say, Cloud Dataflow and that sort of thing?
Yeah, so Cloud Dataprep is a product, a service that you can use through Google Cloud.
So it's essentially the ability to access data that is in the Google
Cloud ecosystem. So the product supports Google Cloud Storage, so the file system on Google Cloud,
it also supports access to BigQuery. So you have the ability to actually access data that's in
Google Cloud, and explore data through Trifacta's interface.
So we've essentially embedded the Trifacta interface into the Google Cloud ecosystem.
So you're able to actually access, explore, and start wrangling data that lives within
Google Cloud.
And so that product is allowing you to sort of build up a wrangling workflow. If you have a multitude of data sets that are living within Google Cloud that you want to explore, clean, prep, join together, and then create some sort of output for doing some analysis, let's say in BigQuery, you'd be able to do that within our products.
And then we support Cloud Dataflow as a processing engine. So essentially, you're accessing data
through the interface that we've developed
and brought to the Google Cloud ecosystem,
build up a workflow of transformations
that you want to apply to that data,
and then that set of transformations
will run as an infinitely scalable job
through Cloud Dataflow on Google Cloud,
and then we'll be able to output
to Cloud Storage or BigQuery.
So it's essentially a more visual, more intuitive way to clean and prepare data within the Google
Cloud ecosystem.
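To make that flow concrete, here's a minimal sketch of what an equivalent hand-coded pipeline might look like with the Apache Beam Python SDK, which is what Cloud Dataflow executes. The project, bucket, and table names are illustrative assumptions, not anything from the episode.

```python
# Minimal sketch: read rows from BigQuery, apply one cleaning step,
# write the results back, running on Cloud Dataflow via Apache Beam.
# Project, bucket and table names below are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def clean_row(row):
    """One 'wrangling' step: trim whitespace and standardise case."""
    return {"name": (row.get("name") or "").strip().title()}

options = PipelineOptions(
    runner="DataflowRunner",              # execute as a Cloud Dataflow job
    project="my-project",                 # hypothetical GCP project
    temp_location="gs://my-bucket/tmp",   # hypothetical staging bucket
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromBigQuery(
           query="SELECT name FROM `my-project.raw.events`",
           use_standard_sql=True)
     | "Clean" >> beam.Map(clean_row)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:clean.events",
           schema="name:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
```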
Yeah, I mean, I use it every day at home as well in my spare time, which I always think
is a great endorsement of a product, really, if you do use it voluntarily.
And the great thing is that, effectively, every job you run
is just charged at the cost of the Dataflow jobs you run in the background.
And it interfaces with BigQuery.
It's about the only tool I can see around that's easy to use that links in with that.
So it's a really easy tool to use and quite pleasurable, really.
Yeah, so how's your experience been with using the tool?
I'd love to hear about it.
It's been good, yeah.
I mean, I've been using it to bring in feeds from things like Strava, bringing in feeds from all different places, really.
And I suppose we're going into this later on,
but actually making sure the data is standardised.
When it's things like, I suppose, fitness feeds,
you've got things like maybe weight readings that don't have a reading every day.
And so you're doing things like filling in the gaps between data
and then doing things like rolling up to the month and then looking at what the change month on month is.
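For illustration, that kind of gap-filling and monthly roll-up looks roughly like this in pandas; the file and column names are made up:

```python
import pandas as pd

# Hypothetical daily weight feed with gaps (not every day has a reading).
df = pd.read_csv("weight.csv", parse_dates=["date"]).set_index("date")

daily = df["weight_kg"].resample("D").mean()  # put readings on a daily grid
daily = daily.ffill()                         # fill gaps with the last reading

monthly = daily.resample("M").mean()          # roll up to a monthly average
month_on_month = monthly.diff()               # change month on month

print(monthly.tail())
print(month_on_month.tail())
```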
I mean, it's been brilliant.
And that's the reason I was quite keen to get you guys on this show, because it's a tool I use every day.
So, you know, I'm very impressed with it.
Great.
Yeah.
I always love to hear what types of data customers are using and how they're leveraging the tool.
So it's great to be able to hear your experience, and even
better that you're actually enjoying it too.
Yes, yes, yes.
I mean, we'll talk later on about that because I think it's an interesting tool.
It'd be interesting to see where it's going and so on.
But I mean, looking at data preparation as an industry or as a sort of a market sector,
I mean, this seems to come out of nowhere in the last couple of years.
And prior to that, it was just enterprise ETL tools,
but there was no tooling, I suppose,
that suited, I suppose, more power users,
more business users.
I mean, what led to the idea of this
and what market niche does it serve
and what user persona does it serve?
Yeah, it's funny.
We have a lot of conversations with companies.
So the first few years at Trifacta, we were really focused on evangelizing what data wrangling, or data preparation, or self-service ETL, even was. And so it took a lot of work to sort of build the category, or create the category, and also create clarity within the market
of what we were trying to do or what tools or vendors in this space
were really trying to do.
And I think there, because, you know, when you have technology trends
and then, you know, data prep becomes this hot thing
and then every vendor claims they do data prep.
Yeah, yeah, yeah. And there was a general awareness that in data science work and that sort of thing, a lot of the work, a lot of the time, went in preparing the data. So I think there was a kind of a niche there to be filled. But certainly, you know, it was a market dominated by ETL or scripting, wasn't it, really?
Yeah. So our focus is really, we want to go after the people that know the data best to do this work.
So I think if you look at ETL technologies and how that process works within an organization,
essentially you have some business person who has requirements around some data that they want to analyze,
or some end dashboard that they want to be able to develop. And they essentially have to go to their IT or
ETL developer with a set of requirements, hand them over, and then have that person,
when they find the time, implement those transformations and then build a data
mart or build some sort of end analysis that they can access. And there was just so much broken within that sort of handoff and process organizationally
that, you know, we saw a huge need.
Analysts, data scientists, data engineers, or even, you know, business people that,
you know, are data savvy, that understand how to use Excel or Tableau or tools like this, that wanted to be able to explore, prepare, and bring together data themselves and do
it in a very intuitive, efficient, and visual manner.
And the use cases you were talking about earlier in terms of recognizing nulls in your data
or data quality issues, and to be able to do that really quickly and easily in a visual
interface, we saw a huge need in the market for that. So I think the way we differentiate from legacy or traditional technologies would be, one,
that our users are different.
So that's probably the biggest difference.
The people that are using Dataprep or using Trifacta are not going to typically be your
ETL developer.
They're going to be data analysts. It's a more self-service
vision for how this work is done, one that sort of opens up this bottleneck that organizations face,
where you have only a few people doing the data prep work. We want to sort of broaden that out
and reduce this 80% stat that we use a lot: that 80% of any analysis is spent on data preparation.
And the other piece of this is that the data is different. So if you look at the data today that's coming in,
it's, you know, multi-structured; you can't sort of manage the schema. So you're
taking in data from outside sources, and it's always different, it's always coming in a different
structure, and it's more diverse. So you're handling data from all sorts of different files, databases, APIs, different maybe third-party sources as well.
And so the ability to quickly understand what's in that data and gain context for it so you can
then define how it can be leveraged for analysis is really critical. And that's one of the things
that we focus on a lot. We have an internal name for a use case
that we think about as this concept of data onboarding,
sort of taking external or unfamiliar data,
cracking it open in Trifacta,
and then setting up rules of how you want to prepare that data
or blend it together with other data sets
that you might want to use downstream for analysis.
And I think the other thing that's really different today
is that the speed of business
and the speed of how you need to react to data is just so much faster than it was maybe
five or ten years ago. And so organizations are prioritizing speed, and they're doing that
in any means necessary. And that's essentially what our tool is developed for is we're trying
to make the process of taking something that's raw or diverse and putting it in some sort of
standardized format that you can then use for data visualization, use for machine learning, or use for data science downstream.
Yeah, so I guess data lakes and startups and all those kinds of, you know, use cases and companies are the obvious kind of users of this. But I suppose the other thing is the rise of the idea of data engineers who want to code everything themselves. I mean, is that, to your mind, a kind of competitor? Is that idea a competitor to what you're doing, or is it complementary? I mean, what's your view on data engineering in that sort of area?
Yeah, I think we see the role of the data engineer becoming more critical
within the organization and we do see use cases of Trifacta for them.
I think the one thing we differentiate or how we view data engineers,
it'd actually be interesting to get your input on that,
is I think we view data engineers as individuals
within organizations that move big blocks around,
whether it's systems, whether it's big databases
or even data sources, move those big blocks around and then provision
data, provision systems so that end users can have self-service access to them.
In a lot of cases, data engineers will need to do some provisioning of data into a certain
format that their end users can leverage.
So there might be some initial cleaning or preparation that then they can provide to
their teams that then their teams can go on and begin using and, you know, doing their
work in a self-service fashion.
But, you know, I think that it depends case by case and organization by organization,
the skill set of the team that they're working with.
But I think the biggest differentiation would be that the data engineers
are the ones that are handling large-scale systems or large-scale databases,
and provisioning all of that
so that end users can have access to it.
So, you know, a lot of cases we do have data engineers using our technology.
We love that.
I mean, we're not trying to say that we don't want them to use it.
We see definitely use cases and value for them,
and they see it as well.
But moving that sort of provisioned data
into something that's going to be useful downstream
is probably more where you'll see our sweet spot
with the end users of the data
that the data engineers are provisioning.
Does that make sense at all?
Yeah, I mean, to take my kind of use case
where I work at the moment, Qubit, I mean,
I as a product manager, a technical product manager, would use Cloud Dataprep to maybe do
something that's more tactical, or more driven by business requirements. Or maybe it's
to do with a new customer coming on board, and we're bringing on some new files from them,
and it's more of a kind of one-off job, really, where we want to be using BigQuery and Cloud Dataflow in the background, but we don't necessarily want to
be coding it and so on. Whereas the engineers would be more likely to use, I don't know, Airflow
or something like that, or Dataflow itself, building something that's more of a
kind of, I suppose, engineering requirement that's going to last for a long time and so on.
So it's more, I suppose, tactical and business-focused versus
engineering-focused, and maybe a system that's going to be around for a while, really.
Yeah, it's interesting you say that because I think our focus initially has really been those
ad hoc exploratory types of use cases, right? And one of the things we are looking to really
not only develop in the product more effectively, but also evangelize more, is the operationalization of workflows.
So it's funny that you said, hey, I view Trifacta or Cloud Dataprep as this ad hoc, exploratory type of thing.
And, you know, that's exactly what, you know, we get used for a lot.
But we also want to make sure that once you actually define a workflow or define some job that is really valuable for your organization, that you can actually set that on a schedule and you can parameterize that.
You can version that.
You can get monitoring and alerting on that.
You can get performance statistics on how those jobs run.
So sort of the enterprise hardening and operationalization of transformation workflows is definitely something that we want to be able to take on beyond just the ad hoc nature of our technology as well.
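As a sketch of what that operationalization means when you hand-code it, here is roughly how a scheduled, monitored run might be declared in Apache Airflow (a tool that comes up elsewhere in this conversation); the DAG name, the CLI command, and the alert address are all hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: run a saved transformation recipe nightly, with
# retries and failure alerting (the scheduling, parameterization and
# monitoring described above).
with DAG(
    dag_id="nightly_wrangle_run",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
        "email_on_failure": True,
        "email": ["data-team@example.com"],
    },
) as dag:
    run_recipe = BashOperator(
        task_id="run_recipe",
        # Hypothetical CLI, parameterized by the execution date.
        bash_command="run_wrangle_job --recipe sales_prep --run-date {{ ds }}",
    )
```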
Yeah, definitely.
I mean, I actually do use your product on a schedule at home.
So for my own data flows, my own kind of aggregation of data, putting it into a fact table, things like that,
I actually use the scheduling feature to run that. I think it runs overnight, or every
few hours or whatever. So yes, that is there as well, really. In fact, I've probably got
one fairly complex kind of, I suppose, workflow there, where it has multiple steps going
in, and each bit then aggregates
another bit as well. I mean, coming from my background, I understand that, but certainly
it can do that as well. And the fact that it can interface with Cloud Storage is useful
as well. I mean, it's a good product, and especially the way it's
charged: the fact that it's only charged at the cost of the Dataflow job underneath there
is fantastic, really.
Yeah, I mean, during the private
beta and public beta with Cloud Dataprep, we wanted to make sure that we were pricing for
mass adoption, and, you know, making sure that we get feedback and get people using the
technology. And so far that's been tremendous. And I guess that's how the whole genesis of
this conversation started, right? I saw your blog and reached out, and glad we're here today.
Yeah, fantastic.
But one thing I'm conscious of, though,
is that you're more than just Cloud Dataprep.
And I think it's what also got me interested
was that I knew of your company name beforehand,
and obviously knew of the market before.
And, you know, looking at what you do
and your products beyond that,
it's interesting to think what your differentiators
and what other kind of product areas you work in as well.
I mean, just for the benefit of the listeners,
what are the, I suppose, the unique differentiators for Trifacta's technology compared to the competition?
I mean, things like the pluggable engine, that sort of thing.
How does it work, really?
Yeah, so I would start with architecture is one thing.
You talk about Cloud Dataprep and why Google selected Trifacta.
It started with architecture.
So one of the unique things in our architecture
is we are abstracting the logic you're generating
in the application.
So when you're building wrangling recipes
and different transformation steps,
that gets abstracted into our own language,
which is called Wrangle.
It's a domain-specific language for data transformation. And so the interface and the language are consistent across
any environment. So you can use Trifacta on your desktop running against a single desktop machine.
You can run Trifacta in a completely parallel environment. And the interface, the workflow, the logic you're creating as part of that,
as part of using the product is completely consistent.
And we just are able to plug into different environments
depending upon where your data resides
or depending upon where you're using the product.
So the same recipe or workflow you generate using our free product, Wrangler,
would be completely transferable to an infinitely scalable environment
on Cloud Dataflow.
And so that's one of the unique aspects of our architecture.
And it was one of the compelling points when Google was evaluating different
data preparation and ETL providers to
partner with around this Cloud Dataprep product: they saw that our architecture was so unique,
and fit so well into the Google Cloud ecosystem, and that we were able to simply plug into
Cloud Storage, BigQuery, and Dataflow so seamlessly and quickly, that it was a huge differentiator for us.
So I would say architecture is definitely
one of the key elements of that.
And we're able to take recipes and run them
on a desktop using our own Photon engine,
or in Amazon using Spark on EMR,
in an on-prem Cloudera cluster leveraging Spark,
or in Google with Cloud Dataflow.
And we support Azure as well.
So in any environment: same logic, same metadata,
same workflow. It's just completely pluggable.
And as the world becomes more cloud-centric,
more hybrid, multi-cloud,
this interoperability is really key
in terms of allowing organizations to have confidence
that regardless of what happens on the computing side
or on the downstream analytics side,
that we're able to plug in and be able to future-proof their investments in
Trifacta, which is really nice.
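To illustrate the idea of that pluggable architecture (and only the idea; this is not Trifacta's actual Wrangle language or internals), here is a toy sketch where a recipe step is held as an engine-neutral description and compiled to either pandas or Spark:

```python
from dataclasses import dataclass

import pandas as pd

# An engine-neutral description of one recipe step.
@dataclass
class SplitStep:
    column: str
    delimiter: str
    into: tuple

def run_pandas(step, df):
    """Compile and run the step against a pandas DataFrame."""
    df[list(step.into)] = df[step.column].str.split(step.delimiter, n=1, expand=True)
    return df

def run_spark(step, df):
    """Compile and run the same step against a Spark DataFrame."""
    from pyspark.sql import functions as F
    parts = F.split(df[step.column], step.delimiter)
    return (df.withColumn(step.into[0], parts.getItem(0))
              .withColumn(step.into[1], parts.getItem(1)))

# The same declarative recipe can target either engine.
recipe = [SplitStep(column="full_name", delimiter=" ", into=("first", "last"))]
df = pd.DataFrame({"full_name": ["Ada Lovelace", "Alan Turing"]})
for step in recipe:
    df = run_pandas(step, df)
print(df)
```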
I mean, I would also say I would be remiss if I didn't mention the user
experience, and how we leverage machine learning to sort of guide users through
the transformation process.
I mean, one of the light bulbs that went off for me
when I first saw the product was the ability
to simply interact with data through dragging and
clicking on different elements of your data.
And then those simple interactions
with elements of your data,
whether it's the delimiter,
whether it's a data quality issue,
kick off all of these suggestions of,
hey, do you want to delete this element? Do you want to drop this? Do you want to extract this?
Do you want to split here? We prompt all of these suggestions based on simple interactions in the interface
that users can then choose from and then, you know, build their workflow through just clicking,
interacting with data, which I think is a huge differentiator.
And to be able to get feedback and previews of how each of these transformations are impacting the data in real time immediately.
And so that's a huge difference from some of the other approaches to this problem. And if you even
look at ETL processes, where you have to sort of set up a whole process, run the job, and then view
the results at the end of that,
you're actually constantly validating
every single step you're building in our interface
through immediate feedback
of how each transformation step
would actually impact the data.
Yeah, definitely.
I mean, just to say kind of how that works,
I mean, you can take a, you know,
you've got a column of data
and there's maybe sort of a few characters of it you want.
There's a space there or something,
or there's some kind of delimiter.
You just kind of drag your mouse over that
and just highlight the bit you want.
And then you get a series of suggestions back saying,
do you want to split on this column?
Do you want to split it into these things here?
Do you want to do this?
Do you want to do that?
And particularly over things like date data types
or anything really where you can see visually
how you should split it, how you should work with it.
But to actually code that as SQL,
particularly when you're working with BigQuery, where you've got legacy SQL and standard SQL and the
differences in the syntax there.
I mean, it's just fantastic the way that works, really.
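For a flavour of what hand-coding that one suggestion involves, here is a sketch using the google-cloud-bigquery Python client with standard SQL; the project, dataset, and column names are invented:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# The hand-coded equivalent of "split this column on the first space":
# in BigQuery standard SQL, SPLIT() returns an ARRAY you index with OFFSET.
sql = """
SELECT
  SPLIT(full_name, ' ')[SAFE_OFFSET(0)] AS first_name,
  SPLIT(full_name, ' ')[SAFE_OFFSET(1)] AS last_name
FROM `my-project.demo.people`
"""

for row in client.query(sql).result():
    print(row.first_name, row.last_name)
```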
Yeah, I mean, that's definitely one of the unique aspects of that.
And if you think about an efficiency gain from the process in its own right, I mean,
you are constantly iterating, constantly iterating. And that fast iteration has been proven to be the key to efficiency. If you look at test-driven
development or any other approaches to whether it's building software or things of that nature,
this constant iteration, constant testing, constant feedback loop has been proven to be
both more efficient and also providing higher quality. And so we're providing this
within the data wrangling space, or within the data preparation space, and that allows users to move
more quickly and have more accuracy in the work that they're doing.
Okay. So obviously you work in product marketing, and, I mean, as you know, one of the things that's important about that
is knowing what, I suppose, your place in the room is, what your part of the market is.
And, I suppose, there's always a temptation to expand further and so on.
And looking at, I suppose,
the competition that's out there,
you know, competitors to you have,
I suppose, broken out from that space
to do other things.
You've got competitors that would, say,
add analytics into it.
So they might start with a data integration
and preparation and then start to add analytics in.
Is that something you guys have thought about?
Or is there a reason why you stick
with what you're doing at the moment?
Yeah, so we've been pretty focused
in just saying, hey, we're the best of breed
data wrangling product.
We continue to be that.
And we want to make sure that
from not only a product strategy perspective,
but also from a go-to-market strategy,
we want to make sure that interoperability
is key. So, you know,
we have really tight
integrations with, you know,
the platforms we deploy on, whether that's
Cloudera, Hortonworks,
Amazon Web Services,
Microsoft Azure, and obviously
Google Cloud, or the downstream
analytics, machine learning,
or visualization technologies that we would support, whether that's, you know, Tableau, Qlik, a company like DataRobot,
you know, Domino Data, and various others.
And then also you see in the data cataloging space, a number of, you know, companies pop
up that are gaining popularity, companies like Alation, Waterline Data, Collibra.
And so we want to play Switzerland, we want to interoperate with all those technologies and recognize as, you know, a growing but still relatively, you know, small company.
We're about, you know, five years, six years old at this point, you know, really maintaining our focus on data wrangling.
And we see a huge market opportunity there. And we also see just, you know, a lot of challenges that we are continually trying to take on and, you know, build features for and build product for.
And so that right now, you know, we get that question asked a lot.
Do you eventually see yourself going into analytics or data cataloging?
And right now we're primarily focused on or exclusively focused on data wrangling.
And I don't see that changing for quite some time.
Yeah, I suppose data cataloging is interesting, isn't it?
I mean, I think there are vendors out there, as you say, that are looking at this.
And particularly, I suppose, with the work you've been doing around machine learning and trying to suggest potential wrangles and so on there.
What's your thoughts on how that market might evolve?
And I suppose just to define this, really: it's to try and help users to, I suppose, infer
the meaning of data, and catalog it for them, and so on. Is that an area that you think could be
interesting for innovation in the future?
Oh, absolutely. I think those companies in that space are, you know, doing very well from what I understand, and also are poised to grow even
further. I think that the critical
piece, as we view it, is that data catalogs have to be independent. They have to plug into every
platform and application; they can't be tied to a single process, let's say, like data wrangling.
And the reason we believe that is you would create a silo in terms of a catalog, and if you
have application-specific catalogs for every application of data, then you're just creating more and more silos and more and more data governance issues.
So we believe that having a centralized data catalog that is not tied to visualization or data prep or data science, but is exclusively focused on cataloging, is critical. So we partner with those vendors that are doing that, because, you know, you have to
make sure that if you have a data catalog, that has to be a centralized point of truth
and it can't be just creating another silo.
But, you know, from the value of that from an end user's perspective: I mean,
being able to pick up a data set and understand who else is using it, what that data set has in it, where it's being used in different types of analysis, and what the trust score is, or how you validate that the data is accurate,
I think it's tremendously valuable and definitely see a huge need for that and increasing need for that as this space matures.
Yeah, I guess the flip side of focus, though, is that you potentially become a feature of
something else, or you become considered a feature of something else. And, I suppose, for
example, I think it's Tableau that have added data wrangling features to their BI product
as well. How do you kind of position what you're doing compared to that, and what's your
view on vendors that just add it as a feature into their product? Yeah, so I mean, Tableau coming out with Maestro,
their Dataprep product.
One, we find that incredibly validating.
I mean, going to Tableau's conference,
I think two years ago,
and seeing data wrangling everywhere.
I mean, it was great.
Our team was just saying, this is awesome.
It's free promotion for us,
and validating that this is a need.
So it was great.
And we're friends with the Tableau team.
I mean, Pat Hanrahan, who is one of the founders of Tableau, is very tight with Jeff Heer, who is another Stanford guy.
And we're really close with the executives and founding team at Tableau and will continue to be. I think, you know, similar to the idea around
data cataloging, you know, we see diversity of inputs and diversity of outputs in wrangling.
And a lot of our customers will have Tableau downstream, but they'll also have Qlik. They'll
also have, like, MicroStrategy, or another BI tool. I mean, within single departments you could have, you know, 10 different analytics or visualization tools that
are being used downstream. And so, you know, our ability to, once again, be able to support
diversity of inputs, so whether it's files, databases, you know, cloud storage, things
like that, and also support diversity of outputs, multiple downstream analytics or consumption applications, is critical to us.
So I think if you are tying your data prep process to a single application or downstream use,
then it's very limiting.
It's also, you know, not what we're seeing as the uses of our technology.
I think almost every customer we deal with is outputting the results of Trifacta
into multiple different technologies or repositories,
so only supporting a select few,
or a single analytics application,
is not the usage that we're seeing
dominate the market at this point.
So I suppose the only criticism I've got of Cloud Dataprep
is that it only connects to BigQuery and to Google Cloud Storage.
I mean, is that something that will, and obviously you can't talk too much about roadmap and so on,
but is that something that you envisage maybe extending to things like, you know, other parts of the Google cloud ecosystem?
Or is it going to be a case that anything beyond that, you go to your main products, really?
Yeah, I mean, so one, we'd love for you to talk to us directly.
If you have uses for Trifacta outside of Google Cloud,
we'd love to be able to start a conversation
and figure out how we can help you.
We're having conversations with Google now
in terms of the future of that product, where it's going.
We are going to GA Cloud Dataprep in the next few months,
and, you know, are discussing plans for that,
and also plans for the eventual features
that Cloud Dataprep will have.
So I can't share a ton there.
But what I can share is that,
hey, if you're using Cloud Dataprep
and enjoying it, and have other data sources
or use cases that you want to take
on, we'd love to have a conversation with you.
So just to be clear then, the product's
in beta at the moment, isn't it? So it will be
GA soon and everything we're saying now may
well change at some point and so on there.
It's great there's a public beta as well,
which is good. So
if a customer now had, say,
a system built
in Cloud Dataprep and they were looking to, say, transition to
the full product from you, I mean, obviously it would mean porting bits and so on, but
how much work would be involved in that? And conceptually, is it quite similar? Is it a big
task to do that, really?
It's pretty easy, actually. I mean, that's part of our
architecture. The uniqueness of it is that every workflow, every wrangling
recipe you create within Cloud Dataprep, you can seamlessly run in any other environment.
So whether that's an on-prem environment, a different cloud-based environment, or in
sort of your own Trifacta instance hosted on Google Cloud,
that can seamlessly plug in there.
So it's not an invasive process by any means.
It's simply just porting over all the workflows
and metadata you've generated in that product
into a different instance of it.
Yeah, I mean, looking myself at some of the other products,
it seems it's the same language between them.
You just export the scripts and so on there.
So it seems one of the easier migrations, I think, or upgrades that I could have seen out there.
So that looks quite good.
I mean, just to kind of move on: reading through the Trifacta website blog,
there's quite a few good, I suppose, thought leadership posts there, and things about, I guess,
what you're thinking about and maybe problems that your company is looking to solve in the future.
And it'd be interesting to talk through a couple of those with you, just to get your views on, I suppose, where
the market's going and where you guys are going. And you mentioned that one of the posts you had
was about data quality, and data quality for new-world data sources. You know, is this something that
is becoming an issue now, or that people are becoming more aware of? And do you think, I suppose, the big data world
has got away with it so far, a little bit? I mean, what's your view on that?
Yeah, so obviously as a data cleaning technology, data quality is
really important to our customers and users. And I think one of the use cases where that shines through
quite a bit, a use case we have stumbled across that has been
one of our more dominant ones, is around compliance. So, you know, global banks, banks in
the US, banks in Europe that we have worked with, you know, have to submit data to different
regulatory bodies to make sure that they are in compliance with government regulations, that they,
you know, are able to do stress tests, that they have a certain amount of money set aside to be able to deal with different global events. And they need to be able to show any transformation or manipulation of input data that they performed in the process
to then output the results that they're giving to these regulatory bodies.
And if you're a head of the bank or if you're a head of compliance at these banks,
which the compliance groups of the banks have been growing significantly
over the past few years, you want to make sure that you are very, very confident
that the data that you're submitting to these government agencies is accurate and you have transparent lineage on that. So, you know,
I think in those use cases, data quality is incredibly important. And, you know, I think
even more broadly, I think if you have use cases around marketing or things like that,
I think they might not need pixel-perfect data as the result.
They're more sort of optimizing for speed,
and speed of results.
But even then, you want to make sure
that the data you're reporting against
is actually accurate.
And I think within organizations,
a lot of people have lost confidence
in terms of the data that is being brought to them
as the sort of, hey, this is the published analysis
that we all validate.
If you actually don't have visibility
into how someone came up with those numbers,
the different data sources that made up that analysis,
then it's impossible to sort of get buy-in on that.
So I think one of the unique elements of Trifacta
that we like to preach is that, in a workflow,
you can see not only, at a
high level, all the different data sets that made up the end analysis, but also every single
transformation step that was applied to the data to get that result.
And so if someone ever has questions around how you got to an analysis, you can simply show
them the workflow and the different recipes that made up that workflow, to
sort of show them what you did.
So I guess that's probably fairly topical with GDPR now.
I mean, I think that's something where how data was combined
and calculated and the algorithms involved in it and so on
is particularly relevant at the moment, isn't it?
Yeah, I mean, what's the date, May 25th?
Yes, yeah.
We're so obsessed with Brexit over here
that we've forgotten everything else that's going on. But I think that's the other day of Armageddon, I think, for
our financial services industry over here.
Yeah, so we've obviously seen a lot of interest in
leveraging our technology for the GDPR-type use cases. I think, you know, some of the things that
I think are interesting, that we have talked about internally, is how do you leverage Trifacta to get recommendations on what data might be sensitive within, you know, a table or a file,
and then, you know, allow them to mask that data or remove that data from certain repositories or
things like that. So it is something we have definitely had conversations around and are looking
at, and, you know, feel like there's quite a few different opportunities for us there,
but we're also being very careful with how we dive into that,
because I know there's a lot going on there,
and making sure that whatever offerings we provide
or solutions we provide are well thought out and also hardened.
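Purely as an illustration of the kind of sensitive-data detection and masking being discussed (not Trifacta's implementation), a regex-based sketch might look like this:

```python
import re

# Rough patterns for two common kinds of sensitive value.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def flag_sensitive_columns(rows):
    """Recommend columns that look sensitive, based on sample values."""
    flagged = {}
    for row in rows:
        for col, value in row.items():
            for kind, pattern in PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    flagged.setdefault(col, set()).add(kind)
    return flagged

def mask(value):
    """Mask a value, keeping only a hint of its shape."""
    return value[:2] + "*" * max(len(value) - 2, 0)

sample = [{"name": "Ada", "contact": "ada@example.com"},
          {"name": "Alan", "contact": "+44 20 7946 0000"}]
print(flag_sensitive_columns(sample))  # e.g. {'contact': {'email', 'phone'}}
print(mask("ada@example.com"))         # ad*************
```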
Yes.
Yeah. On your blog as well, you mentioned metadata strategies and master data management. And I guess that's an
interesting topic to me, because coming from, I suppose, the old corporate ETL world, that
was very much top of the table: you'd be talking about that in enterprise architecture meetings. But
in the new world I work in, where it's all startups and so on, you know, getting anyone to listen and talk about metadata is hard. But it's actually important. So where do you guys
think that's important, and where do you think you might be able to contribute to that a little bit?
So it's funny. People at Trifacta, I think a lot of them were ex-Informatica employees.
And it's really interesting to hear.
Everyone's an ex there, aren't they?
Yeah, it's interesting.
It's really interesting to hear their takes on master data management and this single source of the truth.
They actually probably have a lot of horror stories, and are less believers than you might think.
At the same time, you know, metadata is obviously critical.
You know, understanding the context for the different data you're looking at is incredibly important.
And a lot of the work that is done in Trifacta is actually defining metadata.
So if you have a raw JSON file and you're having to define
rows and columns out of that and what those rows and columns mean, a lot of the wrangling process is generating metadata related to different attributes within a data source.
So we have a lot of features within the product to allow users to recognize, hey, this is a time-based element.
This is a geographic element.
These are different data types. And then, with the integration with data catalogs I
talked about earlier, you can understand business context for how that data is used or what is the
makeup of a single data set. So I think we're less concerned around having a single source of truth
or master data management and having data dictionaries in that sense, not really our focus,
but making sure that users are able to define metadata related to their data
and be able to publish that so that other users,
other applications can read that and understand that is really critical.
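As a small sketch of that "defining rows and columns out of raw JSON" step, in pandas, with an invented feed structure:

```python
import pandas as pd

# A hypothetical raw JSON feed: nested objects, no predefined schema.
records = [
    {"user": {"id": 1, "city": "Leeds"}, "ts": "2018-01-05T08:30:00", "value": 71.2},
    {"user": {"id": 2, "city": "Berlin"}, "ts": "2018-01-05T09:10:00", "value": 68.4},
]

# Flatten the nested objects into columns; this is where the metadata
# (what each row and column means) starts to get defined.
df = pd.json_normalize(records, sep="_")

# Declare types: a time-based element, a geographic/categorical element.
df["ts"] = pd.to_datetime(df["ts"])
df["user_city"] = df["user_city"].astype("category")

print(df.dtypes)
```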
Okay, excellent.
Well, look, we're almost out of time now.
So just where would anybody, where would people find out about Cloud Dataprep and also your products?
Where on the web is there material online, that sort of thing?
Yeah.
So for Trifacta,
you know,
we have website trifacta.com.
We have a pretty big presence on LinkedIn,
on Twitter.
So,
and I think Facebook too.
So feel free to,
to,
to,
you know,
join us in those different social media outlets
and you'll get the latest and greatest
of what we're doing.
And also interesting articles or blogs
from different people within the organization
that you might find valuable.
Cloud Dataprep, we do talk about it on our website,
but it's also on the Google Cloud website.
I think it's, what, Cloud Dataprep,
or googleclouddataprep.com.
So yeah, once again, it's in public beta. So you can go sign up and use that product if you're a Google Cloud customer.
And then if you're interested in using Trifacta too, we have a free downloadable desktop version
of our product that can handle up to a hundred megabytes and that's free for as long as you'd
like. So there's no time-based limit.
You can go to trifacta.com,
download Wrangler, and get going in a
matter of minutes. That's great. Well, Will,
thank you very much for coming on the show. It's been great to speak to you.
Have a nice rest of the day, and
yeah, it's been good to speak to you.
Yeah, Mark, it was a pleasure. Thanks for having me on.
Thank you. Cheers. Thank you.