The Data Stack Show - 63: The ETL - ELT Flip With Ciaran Dynes of Matillion
Episode Date: November 24, 2021
On this week's episode of The Data Stack Show, Eric and Kostas have a conversation with Ciaran Dynes, the Chief Product Officer at Matillion, a powerful, easy-to-use, completely cloud-capable ETL/ELT solution.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
We have a really exciting episode coming up.
And what's most exciting is we're going to live stream it.
The topic is the modern data stack.
And we're going to talk about what that means.
It's December 15th and you'll want to register for the live stream.
Now, Kostas, it's really exciting because we have some amazing leaders from some amazing companies.
So tell us who's going to be there.
Yeah, amazing leaders and also an amazing topic. I think we have mentioned the modern data stack so many times on this show. I think it's time to get all the different vendors who have contributed to creating this new category of products to define the modern data stack and discuss what makes it so special.
So we are going to have people like Databricks, dbt, and Fivetran, and companies that are implementing state-of-the-art technologies in their data stack, like Hinge.
And we are also going to have VCs and hear their opinions about the modern data stack.
So, in a sense, the VCs are also going to be there.
And yeah, it's going to be super
exciting and super interesting. So we invite everyone to our first live streaming.
Yeah, we're super excited. The date is December 15th. It's going to be at 4pm Eastern time and
you can register at rudderstack.com slash live. So that's just rudderstack dot com slash live. And we'll send you a link
to watch the live stream. We can't wait to see you there. Welcome back to the Data Stack Show.
We are really excited to talk to Ciaran from Matillion. He leads product there, and he has a really long history of working in data.
Kostas, I'm really interested to ask him about Matillion specifically. And we'll probably talk about lots of things related to data in general. But there are a lot of ETL or, as we'll talk about, ELT tools out there. And I'm really interested to know how Matillion does things differently. I mean, they're a really successful company, raised a huge round.
And so I'm excited just to learn more.
How about you?
Yeah, absolutely.
I think they have raised like a quarter of a billion so far.
And they're one of the leaders in this ELT space.
So I think it's going to be very interesting to hear from him both about like,
first of all, like we'll chat with him about like ETL versus ELT, right? Like that's one of the
things that we need to ask him about. And yeah, I mean, Matillion has a great exposure to so many
companies out there. So I'm sure he will have some great insights to share
with us about where the industry goes, what the companies are looking for, how the data is used.
And yeah, I think we are going to enjoy our conversation today.
All right, let's dive in. Ciaran, thank you so much for joining us on the Data Stack Show.
We're really excited to learn about you,
your background, and what you're doing at Matillion. Hey, thanks for having me, Eric.
Nice to see you. All right. So you've been working in the data space for well over a decade. Do you
want to give us just a quick background on where you came from and what you've done throughout
your career? Yeah, happy to give a quick intro. I've always been involved in integration software.
Back in the day, I started with a software company in Ireland that was very much about integrating different applications. I don't know if your listeners know about object request brokers, or ORBs.
They were kind of the precursor to web services. And then I kind of worked my career up in web
services and enterprise service bus, mostly on the messaging side, how applications and processes
got integrated. A bit of BPM, business process management along the way, kind of ended up then
doing a lot of work on API and so on. And a few friends of mine joined a software company called
Talend, and they invited me to join. And it was a bit of a breath of fresh air. I always found it kind of strange sometimes to explain what an API or an ESB is to friends and family. They were probably bored senseless listening to me talk about it.
But I actually found data so much easier to explain, because you could explain any kind of interesting analytics project, and there are so many of them. And then, yeah, I worked my way along with Talend. We went through an IPO. And then more recently, I've joined Matillion,
kind of very much looking at how analytics, cloud analytics and data basically behaves in the cloud.
But yeah, I've always been involved in integration software, as I say. I think data came along or
data integration came along and certainly lowered the barrier for me to explain to friends and family what I do and made it mildly interesting, purely because I think people are actually interested in some of the big data projects we operate on. Yeah. No, I'm laughing
because working in data, and I'm sure Kostas has had the same experience, you're at a holiday party with family, and now: what does your company do? And you pause for a minute to try to think about, okay, how do I package this in a way that's digestible? So quickly, could you just explain what Talend did and what Matillion does? Just in case any of our listeners
aren't familiar with either of those tools, I think most of them are, but just to kind of set
the table for the conversation would be great. Yeah. So the type of area that Matillion operates
in is in the area of data analytics.
I think a lot of people are familiar with data integration as a kind of a general term.
But data integration means a lot of different types of things.
It can be anything from data loading, people like Fivetran, Matillion, Stitch Data, Talend, and Informatica do those things, and a whole bunch of open source projects out there do that as well. And the simple act of loading the data into a data lake or an S3 bucket or blob storage, that's certainly one aspect of what we do in data integration.
But it starts to get a little bit more than that.
I think a lot of what Matillion really focuses in on is how data behaves within data analytics and data warehousing.
So it's very much about data in a data warehouse,
how do you merge, how do you curate, how do you get a 360 view of a given data asset?
And a lot of that information then ends up in Tableau reports, Qlik reports, it's very much
about BI and analytics. But data integration itself is a bit broader. There's also streaming
and those kind of areas that have little or nothing to do in some respects to analytics.
They can simply just be about moving data from one application to another and maybe even just moving the data back again.
You can imagine like, hey, your ERP, every time a new customer makes a purchase on a website,
ERP basically then is responsible for changing the inventory, doing the order, doing the whole cash flow process.
It's not really analytics per se, but it certainly has a lot to do with data integration.
So it's a pretty broad, all-encompassing term.
Most of what we focus on within Matillion, though, is really about making analytics-ready data.
So people can actually do the Tableau thing, do the Qlik thing, build some reports.
But then going beyond the report, it's about can we take a 360 view of a customer, patient,
employee, and start to basically connect that back into operational systems, be they applications, customer experiences on websites, or operational databases, simply just to scale a business.
So a lot of different things, pretty broad, but as I say, most of what we focus on is in the area of analytics.
Very cool. And I want to start out with a question, which has been just an interesting topic
in the data space in general, but I think, and I hope that you and Matillion have strong opinions. So ELT versus ETL,
there's a spectrum of opinions on this. And I think some strong opinions on sort of
which one is better, that varies by use case, but what's your take and how does that look
from a sort of actual product perspective at Matillion?
Yeah, I think I had a phrase recently that said ETL should never have existed.
Which somebody said, that's a pretty strong opinion for a company that does ETL.
And I said, yeah, probably is a strong opinion for somebody who does ETL.
But the question is, why did ETL basically get created?
It is a process after all. It's just a way of taking data from a number of different source systems, merging it together, and making a table.
That's literally all it does.
But I think the ETL versus ELT, you've got to look at how data warehouses were used, I think, back in the day.
And even if you were only to go back pre-Snowflake, pre-Big Data,
perhaps let's say go back 10 years, there were these kind of precious systems, and people had a fear of the admin who owned them. And God forbid you went and asked them to run something ad hoc, that's not even a word you could use with a Teradata admin.
It's just like, what are you referring to? Go back to where you came from.
So it was very much about business critical,
financial critical workloads,
which makes a lot of sense, right?
That you're paying a lot of money
for some very, very highly optimized,
amazing software.
Therefore, like the ad hoc kind of analytics
that maybe we'd run today,
or even just the scale analytics,
it just would have broken the bank. You wouldn't have been able to fund an analytics project. So in that light,
I think ETL got created. So sure, it makes the data warehouse run faster in a sense that it can
extract data, load data, curate data. But actually, it took a lot of the processing,
the more ad hoc analytics processing outside of the data warehouse.
And therefore, you end up with these kind of dual parallel systems, your most important analytics
happening in the data warehouse and everything else, just whatever it is outside in this ETL
process with its own specialized software. It had its own engines. In some ways, it had its own
horizontal scalable engines, clustering, all that type of stuff exists in ETL products.
Whereas if you fast forward to today and you look at Snowflake, Databricks, Redshift, any of them, they don't even describe themselves as data warehouses anymore.
They'll describe themselves as a cloud data platform and all that kind of stuff.
But when you peel it back, you kind of say, well, actually, they're a utility. The barrier to go and run a process,
ad hoc or otherwise, even the most important, it's like $2 a credit. So you can go in as a team,
just start to do your own analytics, just purely in isolation, maybe from centralized IT and get
on with it. And in that world, you kind of go and say, well, where does the processing now belong? It's like, well, this utility, the snowflake, this incredible kind of linear scalable capability I've
got with all of my data at my fingertips, surely the better thing would be to leverage it and not
have a separate parallel system. So why did I say that ETL shouldn't exist? It's because
it does exist for certain types of use case, but the balance of processing, whereas previously it may have been like 80-20, you might've had a lot of processing
in ETL and a certain portion of high value stuff in data warehousing. I think that's reversed
completely now when it comes to data analytics. When I'm thinking about the cleaning of the data,
the preparation of the data, it all just lives and belongs inside the data warehouse. It's faster, cheaper, better, more secure. It's kind of like the Olympics. It's that.
Therefore, I think the real architecture design pattern for a lot of what we do when it comes to
cloud data warehousing, it just belongs inside the hyperscalers. And therefore, you should just use
it. Where ETL makes a little bit of sense
still is the loading or the extraction. There are some periphery use cases that make a lot of sense
not to be done in a data warehouse. But back to our original definition of data integration,
I think those things are kind of like either on the load or on the extract, or when you're kind
of doing like app to app kind of data stuff.
But what we use ETL for is always about pushing down into data warehousing.
I think that belongs in the data warehouse.
And that's why I think there is a fundamental shift that's happened where people really are now using an ELT architecture.
Maybe some people don't even recognize it as such.
They go, no, it's ETL, but that's the product category.
I think the architecture is really an ELT architecture.
So no strong opinions at all at Matillion.
Not in the least.
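[Editor's note: a minimal sketch of the ELT pattern described above, load the raw data first, then run the transformation as SQL inside the warehouse so its compute does the work. The table and column names are hypothetical, and `warehouse_conn` stands in for any Python DB-API connection (Snowflake, Redshift, and similar drivers follow this interface).]

```python
# ELT in miniature: E + L land raw rows untouched; T runs inside the warehouse.
# `warehouse_conn` is any DB-API 2.0 connection; names here are hypothetical.

def load_raw(warehouse_conn, rows):
    # Extract + Load: land source records untransformed in a staging table.
    cur = warehouse_conn.cursor()
    cur.executemany(
        "INSERT INTO raw_orders (id, customer_id, amount, created_at) "
        "VALUES (%s, %s, %s, %s)",
        rows,
    )
    warehouse_conn.commit()

def transform_in_warehouse(warehouse_conn):
    # Transform: plain SQL the warehouse executes itself, so the "separate
    # parallel system" of classic ETL engines never enters the picture.
    cur = warehouse_conn.cursor()
    cur.execute("""
        CREATE OR REPLACE TABLE orders_curated AS
        SELECT customer_id,
               SUM(amount)     AS lifetime_value,
               MIN(created_at) AS first_order_at
        FROM raw_orders
        GROUP BY customer_id
    """)
```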
Do you see any kind of current or future use cases where ETL might still be relevant?
I think the one that we see is certainly ingestion,
although some companies would describe it as just ingestion.
The act of having a SaaS application that just does the load,
it makes sense there, right?
Because it's highly optimized.
You can do streaming.
You can do a whole bunch of other types of use cases.
And the data warehouse is not,
well, either they're highly protected, you don't want them connected to the internet that way,
or they're not yet optimized for that use case. I can see Snowflake and others basically heading in that direction where they're adding more streaming ingestion capabilities. But simply
the act of loading ingestion, I think, it's kind of ETL-like.
The other part I think is interesting is the last mile of analytics is where you have something highly curated, 360 view of a customer, and you want to synchronize that back into an operational application, operational database.
I think that is ETL as well.
It's a separate process that sits outside the data warehouse.
Data warehouses themselves are not optimized yet to run a lot of services,
even some of the work that some of the data warehousing companies are doing today.
It tends to be, how would I say it, kind of limited in some of the things it can do. They're not fully formed services in the way a SOA architecture would think about a service; even a container or a microservice tends to be extremely tightly bound to do functional things, where state and history and other things basically don't apply.
So I think at the edges, it makes a lot of sense, at IoT. But again, are we still doing ETL at that point? Or is it more like a streaming use case?
Is it Kafka? Is it Confluent?
I think there's other technology out there
that does those really effectively.
But I think ELT as an architecture
makes the most sense today
in terms of how people are using
data in the enterprise.
Cool.
So if I'm thinking about how someone is doing ETL with something like Spark, for example: you have the extraction part, and you will write some code for the transformation.
So in transit, the data is going to get transformed, and then it's going to get loaded into the destination, which in our case, let's say, is a data warehouse or a data lake. How does this transformation part, which naturally in ETL is a piece of code that we write, happen in ELT? Because this part is pushed into the data warehouse, and the data warehouse is a technology that has primarily been developed to ask questions and get replies to those questions, right? So how do you see this implemented, and how is Matillion doing it, if there are multiple flavors out there of how to do that?
Yeah, it's a very interesting question.
So if you look at, we spent a lot of time working with Databricks on the way their architecture works, and a lot of my background is with Spark technology from my previous employer. Arguably, what they do is that separation of compute and storage.
So their compute and their separation of storage, it's not in any way dissimilar to what a cloud data platform does. It's just different technology. Now, they have different smarts and different schedulers, and they have different histories, but basically what they're both doing is separating the storage. They have a way of clustering the compute.
There's a scheduler.
They break the task down.
They do a whole kind of MapReduce kind of behavior.
Like if I look at that long and hard, the fact that one uses Spark, one uses Python,
the other uses SQL, that's the modern architecture in my mind.
That's what it is.
That is the ELT.
I think ELT of yesteryear is synonymous with SQL only, and it's only working with cloud data warehousing or even just data warehousing.
But if you look at data like the Lakehouse architecture from Databricks and the way their SQL analytics platform behaves, you can push SQL, Python, PySpark into that engine, it'll look after how the scheduling and the splitting
of the task works.
But for all intents and purposes, it's still an ELT architecture in lots of ways in that
there's a logic that's sitting directly on top of that data, virtualized.
If I was to take the exact same problem and move it over to Snowflake, I'd probably get it all to work with the same behavior.
It might use different technology.
It might be SQL-based, but pretty much it's the same thing.
You've got access to essentially all the data storage, and you can spin up the compute as you need it.
It's not like it's a completely separate thing.
In that respect, I think that's how we would consider it.
And we, as Matillion, we just generate SQL for Databricks. It's highly optimized
for their platform. If we take the same design and we shoot it over to Redshift or Snowflake,
internally, we will just generate different SQL to leverage that platform because they have some
specializations and variants between each of them. But to you as an end user, you just see a design.
But under the covers, we are basically leveraging that ELT architecture. Maybe I couldn't convince Databricks
to call what they do an ELT architecture. But at the end of the day, that separation of compute
and storage, it's that I think is the modern data architecture that people are looking to leverage.
And the fact that the storage basically is like literally just infinitely
scalable and so ubiquitous that you just can create materialized views in the data and use it for
multiple different things. I think that's the big game changer that we basically are witnessing.
And how does it work with Matillion? What's the experience that someone has when using Matillion?
So our experience, I guess, borrows a lot from a no-code, low-code IDE, drag and drop,
where you are designing a logical flow.
So things like, you take a data set, you tend to almost get a table view of your data; it tends to try to flatten everything into a table.
We think that tables are, I guess, easier for most human beings to kind of mentally construct. And we're dealing
with analytics people. So ultimately something, if it makes it into a table, it's easy to sort,
easy to filter, and it's easy to basically pivot. That's the moral of the story. But actually under
the covers, it isn't all normalized. It isn't all flattened. It's a highly structured internal data model that
we have. It's just that the visual cue that you see on top of that is just to make it easy for
you to use it. But it is very much a drag and drop metaphor. It has a lot of if then else logic
that you typically see in ETL style products. And from that, we kind of create a visual logical
documentation of the analytics that the end user, the developer, is trying to come up with.
And then when they go to run the product,
I think this is where the real kind of smarts of Matillion kicks in.
We start to do a lot of live sampling with the data.
So you kind of construct a piece of logic under the covers.
We're creating SQL, interacting directly live with Snowflake.
We're validating that SQL is valid.
And then we're producing a sample data set so you can actually see, at that point in the design of your structure or your flow, ah, okay, up until now, I've got my Salesforce data.
It's looking kind of correct.
Maybe I've normalized the US and the European dates, because it tends to be the case that
in Salesforce, you run into that issue quite a lot.
OK, great.
What's the next thing I want to do?
I want to bring in my Pardot data.
I want to kind of merge those based on a particular primary key.
And the visual cue basically helps you continuously just iterate, iterate, iterate.
By the time you get to deploying that data into Snowflake or whatever your data warehouse is of choice, you're pretty much certain that the table structure and the logic is correct.
The only thing that potentially goes wrong is just that as you went through the sampling,
you didn't realize that that sample set wasn't representative of the global underlying data set.
That can happen. You tend to only see a couple of hundred rows, but maybe the underlying data
set is a billion rows. And that's why when you flush all this through into the data
warehouse, you can then go and check it and visually check to see if you've actually corrected
all the errors in the data. But it's very much a visual metaphor. We try to get as much as we can
to a no code or even low code, but we have a million extension points where people who want to plug in SQL, things like Python,
you could even plug in R code, and things like dbt are all fair game for us.
You can plug in those capabilities and we simply just orchestrate
across all of them.
We're trying to get a visual document representation of your analytics.
And last I checked, I talked to a couple of customers last week, like seven different data sets is kind of the norm for anything that's moderately close to what we'd call an insight.
But we've got customers at 26, 27 different data sources to produce a marketing lead score.
Trying to hand code that, and you can, right? It's just the maintenance, iteration, upgradeability of that flow.
That's where we think that the visual look and feel of the product really starts to come
into its own, as well as that sampling capability, which we think is really just a killer capability
that as you design, you see live data and you see the logic of what you've designed.
It's those things that basically are the powerful capabilities that Matillion offers.
And again, it's all an ELT architecture.
So we're directly operating on top of your data warehouse.
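[Editor's note: a sketch of the sampling idea described above, run the generated SQL with a small LIMIT so the designer sees live data at each step. The SQL is Snowflake-flavored, and the Salesforce date cleanup is a hypothetical stand-in for the kind of step described.]

```python
# Design-time sampling: validate the generated SQL against the warehouse and
# pull back a couple of hundred rows for visual inspection. Note the caveat
# from above: a sample may not represent the full billion-row data set.

SAMPLE_LIMIT = 200

def sample_step(warehouse_conn, generated_sql: str):
    cur = warehouse_conn.cursor()
    # If the generated SQL is invalid, the warehouse rejects it right here,
    # which is cheap feedback long before the full pipeline runs.
    cur.execute(f"SELECT * FROM ({generated_sql}) s LIMIT {SAMPLE_LIMIT}")
    return cur.fetchall()

# Hypothetical step: normalize US (MM/DD/YYYY) and European (DD/MM/YYYY)
# dates into a proper DATE column, in Snowflake-flavored SQL.
step_sql = """
SELECT id,
       COALESCE(TRY_TO_DATE(close_date, 'MM/DD/YYYY'),
                TRY_TO_DATE(close_date, 'DD/MM/YYYY')) AS close_date_clean
FROM raw_salesforce_opportunities
"""
```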
Yeah, that's super interesting.
And who is the user of Matillion?
The user for us is the data engineer.
But the problem with that term is that means a lot of different things.
So, ETL engineer, 100%, it's just that person who's used to the ETL design paradigm. Data engineer, I think, is a broader term. I think data engineer for us is anybody that could be doing things like Airflow orchestration. They could be hand coding. But you
look at it long and hard enough, it's just a different tool set or different stack for them.
So we try to blend both of those in where people who are more used to that kind of
engineering background, which is CICD, inversion of control, hey, they probably grew up writing
Java code using Spring Framework for all I know, but that's just me. But that type of person is now
coming into the data world. The reason being, I think, is that people are recognizing the resilience of the data pipeline, it's a phrase you hear a lot, like the downtime of your data.
And I think engineers have been really good, certainly SRE, cloud ops engineers have been really good in terms of figuring that problem out.
And I think that is influencing, that data ops thing is strongly influencing or has influenced Matillion in terms of how we look
at orchestrating those pipelines together. So we have these ETL people. They're very much looking
at business logic, business data, and their job is to take what your CFO wants to see in terms of
revenue forecasting and that type of thing. But there's a whole bunch of other people around it
who are kind of building all the periphery, the connecting and the loading of the data into the bronze sort of storage.
That engineer is also part of what we do.
But I think they're different skill sets, but they're complementary in nature.
I think we tend to separate that there's like almost like a mini SRE team, which are data engineers that surround these ETL engineers. And the ETL engineers are really looking at the actual design logic creation of this master
record of something.
So that tends to be the two groups that tend to use our product.
Yeah, it makes total sense.
And it's a great point that you are making here because many times you hear people asking
like, okay, what is a data engineer?
Like, why do we need another discipline in engineering, right?
And actually, I think that the best definition that I can personally give is that data engineering
is like a hybrid between operations, SREs, as you said, and actual software engineering, because you also have to do both.
So pretty much, to be a successful data engineer, you need to have knowledge from both: you need to build your pipelines, but at the same time you have to monitor your pipelines, care about SLAs, and keep them up and running, all these things.
And I have a question, which is actually something that I find interesting in general,
not just for data products: how does this visual metaphor that you described fit into the workflows that engineers and developers have, all this CICD, versioning, all the standard tools and methodologies that engineers use to support the quality of their work?
How does it work?
Very interesting question.
I spent a lot of years basically looking at CICD version control.
A good number of years ago, this actually must be, hold on, I want to go back 18 years.
I was at one point a ClearCase admin.
So I spent a bunch of time being an engineering manager, and I had to be the ClearCase admin because there was just nobody else to do it. So I kind of grew up in that whole strong version control that IBM
Rational products had. And then other types of products have come along. I think these days,
everybody uses Git or Bitbucket and those types of things. But the whole notion of versioning and branching and merging and those types of capabilities,
I just don't think it's, not that it's not natural, but it's not in the kind of the purview,
I think, of the ETL engineer.
It certainly hasn't been, but it's definitely something that engineers are just going to
go, well, that's how you do it.
So what we've tended to see is the capabilities that are kind of the Git-like thing with version control and branching and merging, they're becoming commonplace in the data products, the ETL stack.
We may not use the same labeling and the way it's visually shown to the end user as the way an engineer would be comfortable with, but it's the same thing.
And actually, under the covers, we're using Git, for that matter. That's how we do our version control, and it's very strong version control, very strong branching and merging. But I haven't yet exposed that terminology to the ETL engineer. I don't want to scare them. But I think they like the fact that they can roll back and they can share and they can do all those things.
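[Editor's note: a generic sketch of what Git-under-the-covers versioning of a pipeline definition can look like, serialize the design, commit it, and lean on Git for rollback, branching, and merging. This is the general shape of the idea, not Matillion's implementation.]

```python
import json
import subprocess

def save_version(repo_dir: str, pipeline: dict, message: str):
    # Serialize the pipeline design deterministically, then commit it.
    with open(f"{repo_dir}/pipeline.json", "w") as f:
        json.dump(pipeline, f, indent=2, sort_keys=True)
    subprocess.run(["git", "-C", repo_dir, "add", "pipeline.json"], check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m", message], check=True)

def roll_back(repo_dir: str, commit: str):
    # "Roll back" for the ETL engineer is just restoring a prior commit.
    subprocess.run(
        ["git", "-C", repo_dir, "checkout", commit, "--", "pipeline.json"],
        check=True,
    )
```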
And then it goes further than that, right? Because non-repudiation of a version is becoming a really
important thing in our world because we operate so quickly at some point when something breaks.
Now, breaking could be just that a pipeline doesn't run, could be a security issue,
could be something else, could be something more nefarious, right? That a bunch of records basically appeared on the internet and lo and
behold, we didn't mask something properly. Somebody's got to go check out why. And maybe
there was a misconfiguration of a rule inside one of the ETL pipelines or one of those particular
products. If that's not versioned and controlled and checked in, you have no idea. And a lot of
ETL down through the
years was just not that. It's almost like we got an analytics project. Great. How does the data
work? It does this thing. Okay. How do we know we're being successful? Because the head of sales
basically hasn't given out to me this week. That was the testing, right? It was that. And then you
come along and you kind of upgrade or migrate that. So we go, how do we retest? Well, we check
to see if the head of sales is giving out to us again.
And then we know the report looks like it's correct.
But that's not good enough, I think, clearly in modern enterprises.
So I think the CICD is here to stay.
It's just that we don't necessarily expose those features the way we would to an engineer,
but we're actually still using those under the cover.
So that's how we experience it. But it is strong versioning for a lot of good reasons,
but a lot of it comes back to, we just simply think it makes the data boat go faster
because upgrades and migrations and all those things that happen all of the time now
and reuse is really well supported by those principles. Yeah, that's super interesting.
And there are two terms that we hear a lot lately, and many companies are getting funding to build products around them, which is anything around data governance and data quality.
What's your opinion on these?
And how do you see these kinds of functionalities playing together with an ETL or ELT tool like Matillion?
Very interesting one.
I've spent a lot of time over the last number of years building data cataloging technology
and data governance technology.
And I've kind of seen it grow up and then during COVID, I wouldn't say it waned, but it has basically found maybe some of its
place and position.
So it's a case of going, I think cataloging capabilities can really dramatically improve
analytics.
They really promote very strong reuse.
If you can extract a lot of the semantic meaning of data, you can do really cool things.
You can start to automatically infer if the data is good or bad or if it's standard or not
standard. And that stuff comes, I think, a lot from the principles of what data governance teams
and product can do. They're very good at looking at metadata. They're very good at looking at relationships. And if you can put that stuff to use, you can ultimately solve the big problem,
which is the data quality problem.
So I think a lot of what the governance products can do is provide really good semantic understanding of data that could be used not just for the purposes of governance,
but actually, more importantly, used for the purpose of data quality and fixing data or automatically detecting and indicating there's something wrong with the data.
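[Editor's note: a minimal sketch of using cataloged metadata to flag bad data automatically, as described above. The metadata format is hypothetical; real catalogs expose far richer semantics.]

```python
# Use per-column metadata (expected type, null policy, value range) to
# detect and indicate that something is wrong with the data.

COLUMN_METADATA = {
    "age":   {"type": int, "nullable": False, "min": 0, "max": 130},
    "email": {"type": str, "nullable": False},
}

def find_quality_issues(row: dict) -> list:
    issues = []
    for col, meta in COLUMN_METADATA.items():
        value = row.get(col)
        if value is None:
            if not meta["nullable"]:
                issues.append(f"{col}: unexpected NULL")
            continue
        if not isinstance(value, meta["type"]):
            issues.append(f"{col}: expected {meta['type'].__name__}")
        elif meta["type"] is int and not meta["min"] <= value <= meta["max"]:
            issues.append(f"{col}: {value} outside [{meta['min']}, {meta['max']}]")
    return issues

print(find_quality_issues({"age": 214, "email": "a@b.com"}))
# ['age: 214 outside [0, 130]']
```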
A lot of the governance products, and we're really good friends with Collibra, a lot of them basically exist in some ways at a different level.
They're kind of like a ticketing system, whereby there are approvals and data custodians, and people who own the data have to basically approve it as sanctioned for use.
But I think they're only really at the beginning of that industry. I think it's like, yeah, we've seen massive innovation there in the
last couple of years. But I think it's going to be more interesting if you look at what Snowflake's
doing around the data cloud, this idea that there are these massively curated sets of reference data
sets. It becomes really interesting that if you start to blend some
of the principles of the catalogs and the governance in terms of where did that data go
and how does anybody know after it's been released in the data cloud? So I think governance is
interesting that it has a whole new innovation area that I think it'll eventually end up in.
But I think primarily right now is I'm fascinated by the use of the metadata that governance tools have, but to actually go and fix the quality problem.
I think that is actually a problem we should go fix.
I think it's not even just practical.
It's like we have to solve that problem.
And I think governance is kind of like an interesting secondary issue that a lot of organizations have.
But everybody has a data quality problem.
Everybody has that problem. So I think for ETL and us, we use the metadata to go fix it.
And then we partner with the best in the business, the likes of Alation and Collibra, to help their
customers do what they want and what they do in terms of approvals and all those types of things.
But to me, I'm more interested in the use of the
metadata to go fix the quality issue. Super interesting. Okay, so ETL, or ELT as we call it today, I mean, it's something that has existed pretty much since we created databases, right? So we might keep reinventing it, but as a process, it has existed forever.
What's the future?
What does it look like?
How do you see it based on your experience with Matillion?
What is next, both for Matillion and also for this category of products?
I think you're right.
I think every once in a while, a blog will come out, usually by a data integration vendor,
that ETL is dead, just to kind of reskin it and say it's not quite dead.
It takes on a new life of its own. What we look at right now is this, what we call a definition
of the modern analytics, which is a combination of BI and data science and operational analytics.
So in that respect, what I think is that you're right, ETL is here to stay, but I think the future
of ETL is back to what we talked about in terms of
the operations. It's really about not just automating much more, it's about much more
resilience in those pipelines. How can you detect that something is going to fail before it fails?
I think we can really solve that problem today. How can you do things like get a job to optimize itself?
Those types of things are definitely starting to become real, the things that we can actually go do,
because we've learned a lot more about the relationships of the data, and the query optimizers inside the data warehouses are becoming a little bit more accessible in terms of how the APIs work. But I think that's where a lot of the ETL has got to go,
is that can we detect
errors before they happen? Can we alert people? But then can we auto detect that something could
be better optimized by automatically tweaking the configuration? And the only way we can do
those things is A, we've got APIs. We have the ability to inject variables. So again,
good engineering principles. And then it's actually about leveraging the APIs
of those underlying platforms
where they have really smart, intelligent things built in.
And we can basically promote different attributes,
different ways of configuring the optimizers.
And those optimizers then help the actual job run better.
So there's an ecosystem, a kind of sense that if you can bring together all those capabilities, the ETL becomes smarter, more resilient, more optimized
in the future. But I do think it comes back to that is that we're trying to solve the problem
of BI data science and operational analytics. And ultimately, that's going to be about making
the pipelines run faster with more resilience and then using the data,
curating it much more and reusing the insight that we actually generate and curate. That's what I
think the future is. And that's exactly what we're building at Matillion. We call it the data
operating system. We think that companies need to run their data as an operating system. And
an operating system by definition is modular, smarter, more resilient,
more scalable than the way we used to look at ETL, let's say last year or the year before.
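[Editor's note: one crude way to "detect that something is going to fail before it fails" is to watch run metrics against recent history and alert on drift. This is a simplified stand-in, with arbitrary thresholds, for the kind of self-monitoring described above.]

```python
import statistics

def is_anomalous(recent_durations, current, sigmas=3.0):
    # Flag a run whose duration drifts far outside recent history.
    mean = statistics.mean(recent_durations)
    stdev = statistics.stdev(recent_durations)
    return stdev > 0 and abs(current - mean) > sigmas * stdev

history = [61.2, 59.8, 63.1, 60.4, 62.0]  # last five runs, in seconds
if is_anomalous(history, current=184.0):
    print("alert: run time drifted sharply; investigate before downstream jobs fail")
```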
Nice. One last question from me and then I'll give the stage to Eric. Okay. About the destinations,
I think the set of possible destinations, it's pretty limited.
We know it's all the data warehouses that are out there.
There are not that many anyway.
But about the sources, and basically your experience as Matillion with all the different companies that you have interacted with:
What are the most, let's say, common ones?
And also, can we break them down into some categories of sources that are distinct in some way?
It's a great question.
I think it's a real bugbear of all software vendors right now that everybody ends up basically becoming a connector company in the integration world.
And a lot of it's down to the customers, and I'm not sure it's unwillingness, because you get why they want it.
Every connector has to be supported.
So then everybody basically does the same thing over and over again.
We all end up with hundreds of connectors.
And then lo and behold, AWS will change its security profile,
come out with some new IAM service,
and you've got to go and iterate through 100 connectors.
And we all do it, right?
Every single one of us.
I mean, it doesn't matter.
You're going to rock up to your next big $1 million customer come January. And they're like, hey, do you guys
support some API from some new CRM that you haven't heard of before? So to break the back of that
problem, we've been kind of looking at, hey, we'll give you a no-code toolkit. You point it at the
API, and we will automatically construct a Matillion connector under the covers to try to
alleviate some of that need for the vendor always to be building out the connector. Connectors for
us largely fall into really just two very simple categories. At Matillion, we tend to broadly look
at batch-orientated APIs, batch-orientated data warehousing like JDBC connectivity. But now
increasingly, we look a lot more at CDC and
streaming APIs. So there's a lot of work that we've been putting in. We're going to announce
it at re:Invent in a couple of weeks in the area of change data capture and streaming.
We tend to look at those APIs subtly differently because the nature of the queuing capabilities
and the queuing technology, and there's just a whole other kind of service lifecycle that you
have to obey and observe that's quite different with APIs in a sense of internet APIs, REST APIs versus something
like a queuing technology where you read it once at most once delivery, all those types of things
are very, very different. So I tend to look at them, those are two broad categories, but ultimately
I think it comes back to the vertical categories that customers are interested in.
Do you have a set of capabilities in finance?
Are you guys really good with billing applications?
Like, do you support Recurly and all the rest of it, like the whole list of things like
NetSuite?
But I think for us, it really comes down to that.
The ingestion capabilities are broadly bifurcated into REST APIs, databases, and increasingly now streaming
APIs.
Okay.
That's interesting.
And why is CDC important? I mean, recently Fivetran acquired a company that specializes in CDC.
We have seen CDC being mentioned a lot, especially in big corporations. We had someone from Netflix, and they have done a lot of work there.
Why is CDC a thing? Because the technology it is based on is the replication logs of the databases, right? It was built for something completely different. So, yeah.
For every time I've heard ETL is dead, I've heard CDC is dead.
I think a lot of it is to do with organizations right now are doing cloud migration.
And they're trying to digitize as fast as they can.
And at the end of the day, they don't have the ability to always change all of their on-prem software at the same time. But they have the need
to basically get that data, the change data sets into their cloud analytics platform. So I think a
lot of it is for me is that they've selected a cloud data warehouse. They've bought in very
strongly to the vision of what that analytics can deliver. I mean, it's true, right? I've seen it
for myself. I can see what those platforms can deliver. But some of those changes in their business are so important, and they have to
happen at a faster rate than basically a daily or an hourly batch load, that it's like, hey,
if we could just use the CDC style of use case, that would basically help our analytics. And I
think it's that state of affairs that we're in right now.
I do believe, though, that there will be another messaging technology.
It could be Kafka.
It could be a new variant that comes along that's so ubiquitous and widely deployed within
the cloud infrastructure.
And it overcomes a bit of the kind of the complexity of the admin that we could just
see a replacement of some of that CDC style of use case, which,
as you said, is the redo-log kind of style.
And it becomes much more of a messaging kind of push to a queue with topics, basically
multiple readers.
Right now, I think it's just one of practicality.
I think we're used to basically doing the logging.
We can't change those operational databases, even if we
wished, because it would just impact the business so catastrophically. It would be just too risky. So why change it? Why change what works, I think, is what I'm observing. Like one in every
four of our customers right now is like, what are you guys doing with CDC? And how do you get it
into Snowflake? So it's not just a, it's like, can you get it into Snowflake
in a highly resilient way?
And I actually think, I was, I guess, proven wrong a lot by the likes of the Fivetran guys saying, hey, data ingestion is not just ETL.
And I was like, it is.
And they're like, well, no, it's different, because what we've done is we've just said we're going to solve the problem of loading data to the cloud.
And after that, do what you want with it.
And I think CDC is set for a similar kind of rethink.
It's just get it out of the log file and stick it into S3.
Do what you want after it.
And we'll have it ordered.
We will have a high fidelity.
We'll have a metadata log.
We'll have all of the information
that you need to go and do whatever analytics you want and as much of it as you want. And I think
that's the redo on CDC that's coming. It's optimized for the way we can do analytics in the
cloud. I think that's the evolution that's coming rapidly in CDC. Super interesting.
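[Editor's note: the minimal shape of "get it out of the log file and stick it into S3", consume ordered change events and land them with enough metadata (offset, operation, timestamp) to replay downstream. The `batches` input stands in for whatever log reader is used (Debezium, a native connector); the S3 write uses boto3.]

```python
import json
import boto3

s3 = boto3.client("s3")

def land_changes(batches, bucket: str, table: str):
    # `batches` yields lists of change events from a hypothetical log reader.
    for batch_no, batch in enumerate(batches):
        payload = [
            # Keep ordering and metadata so consumers can "do what they
            # want after it": offset, operation, timestamp, full row.
            {"offset": e["offset"], "op": e["op"], "ts": e["ts"], "row": e["row"]}
            for e in batch
        ]
        # One ordered object per batch of changes.
        s3.put_object(
            Bucket=bucket,
            Key=f"cdc/{table}/batch-{batch_no:08d}.json",
            Body=json.dumps(payload).encode("utf-8"),
        )
```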
Eric, all yours.
I know you can keep going.
I think we're getting close to the end here.
Ciaran, I wanted to rewind just a little bit.
You talked about modern analytics as being BI, data science, and operational analytics.
And I'd love to drill in on that. And one thing that we've
seen repeatedly on the show is that there are a lot of terms that I think a lot of people,
including myself, think are easy to define. Oh, analytics, right? But then if someone said,
hey, could you give me a really good, concise, articulate definition of analytics? I may have to stop and think about that
because it can be very complex and wide ranging, but it just really struck me that you sort of
included three pretty traditionally separate components in a single definition of analytics.
So can you dig into that? Yeah, I think for me, if I go back to where we kind
of started here, we talked about ELT and the kind of the benefits of cloud storage technology and
the separation of compute. Like, business intelligence, I'd love to know who came up with the term,
by the way. I think it's fascinating, right? Because I think middleware companies and
integration companies always have a drive for how do we tell people our business value? We never
really cracked the code. But the analytics guys basically cracked it 20 years ago. It's business intelligence. It
just sounds amazing. And really what a lot of it is, is just to basically, as you know,
is providing reporting on data. And there's a lot of stuff that goes into making that happen.
But when I look at data science, I don't think it's the same type of analytics in general. I think a lot of stuff we do in data science can be, but I think it's sometimes about
that the answers that we get from data science are not always deterministic. That's always the
classic one. Sometimes they can be range-based. So within a particular range, the answer is
somewhere here and it's a different type of thing. And you've got a whole bunch of techniques and algorithms and stuff that
people have built up. So I won't even go into all that,
but I still think it's analytics, but it's a different type,
serves a different purpose and whether you're bought in on the volume and
scale and all that type of stuff. Yeah, maybe, but,
but I think it's more to do with that.
The answer is not always a single deterministic value.
It's in a kind of a range of values.
But the last thing, operational analytics, I think that's different only because it distinctly
says we basically want to operationalize something that we've learned. And all it basically says is,
hey, the last mile of analytics is not a visual dashboard. It certainly is a great way to create
a conversation with an executive team.
But ultimately, like I always talk about leading and lagging indicators.
We're big believers in this at Matillion.
There's a framework called the Four Disciplines of Execution.
And it's like you define a wildly important goal.
As a team, you set this notion of what's a lagging indicator, which might be revenue or something like that, or customer count.
But nobody in a sales team can do revenue on a Monday because, hey, the deal might not even close for six months from now, but they can do how many customers they've talked to
this week.
How many demos have they done?
How many trials have they got in a queue?
How many SQLs have they cleared?
They're leading indicators of revenue.
So you start breaking things down that kind of way.
You start to get into this kind of
like, okay, the visual dashboards that we use in lots of organizations are really around those
things. But what do you do with them after you've learned some sort of an insight? You've learned
that there's a correlation between this and this. It makes so much more sense to take that insight
and take it out of the Tableau dashboard and give it back to
that salesperson who every day is making those cold calls. You'd say, hey, not for nothing.
The list of 100 calls we've created for you as the marketing team, we're going to stack rank them in
the best order we think is possible for you to call those customers because we think there's a
propensity model here that you need to know about. Propensity model comes from maybe data science.
The underlying data sets comes from BI. BI basically created a dashboard for everybody
to go and say, oh, that looks interesting. But operational analytics was to take that insight
and actually do something with it in the day-to-day of the salesperson. So that's why I
separate them into three things. It's because I think they deliver different value. And I think they actually are subtly different use cases, even though they all can be additive to each other.
And that's what we define as modern analytics. And when you're doing at least all of those three,
we think you have the right to basically say, hey, we're a digital leader. And it's Peet's Coffee, it's Slack, it's Juniper Networks, it's those companies that work with us.
And that's what they're doing. So when I started off by saying, hey, these guys are using these output connectors from Matillion, those are the companies that are driving us for those. They want those insights that they've generated from BI and data science, and they want to get those marketing lead score algorithms they've developed back to their marketeers.
That's why we call that modern analytics.
We think that really defines a data-driven company.
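[Editor's note: the operational-analytics step in miniature, the insight leaves the dashboard and reorders the salesperson's actual call list. The scores here are made up; in practice they would come from the propensity model.]

```python
# Stack-rank the marketing call list by propensity to buy, so the insight
# lands in the salesperson's day-to-day instead of staying in a dashboard.

call_list = [
    {"name": "Acme Corp", "propensity": 0.31},
    {"name": "Globex",    "propensity": 0.87},
    {"name": "Initech",   "propensity": 0.55},
]

ranked = sorted(call_list, key=lambda lead: lead["propensity"], reverse=True)
for rank, lead in enumerate(ranked, start=1):
    print(f"{rank}. {lead['name']} (score {lead['propensity']:.2f})")
```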
How many companies?
So a lot of times I think about the journey that a company has to go on
to become data-driven, call it digital transformation, whatever, right?
First, you need to collect all of your data and to your point, fix your data quality issues so that
whatever insights and however you're deriving them on top of all this collected data in your
warehouse are good insights. Number one, dashboarding, I would say is part of that,
but a lot of times it's sort of another step, right? Actually sort of building good dashboards. And that goes from executives sort of down to functional teams. How many,
like what is the fall off of companies who sort of collect data, do the quality thing,
have good dashboards, and then how many companies are actually doing operational analytics well?
Because my sense is that it's probably not many. I mean, you mentioned companies
that we all want to emulate, but I wonder what penetration is with operational analytics.
We did a bunch of surveys actually this year on that. I'm happy to share the data with you,
if you wish. We saw that BI analytics, yeah, no surprise, right? With like Matillion, it's like
in the 80th percentile, right? And surprisingly, it wasn't 100 percent. I just kind of scratched my head on that one. The data science guys were in
like the 46 to 52, somewhere around there, because we ran this survey over multiple days with
different webinars that we ran. And the operational analytics are anywhere between 8 and 14 percent,
right? So it's kind of down there. It's basically not done as much as maybe the
other types of analytics. I guess that's to be expected. People are maybe only now waking up to
do some of these things. I've got to believe that COVID has fundamentally changed the way we use
data forever. I think a lot of what we're doing right now in terms of digitizing the business
is never going to change.
We're never going to go back to some of the things we used to do.
We've had to basically put a lot more reporting in front of people in Zoom calls, in things that were in Slack and ClickUp.
So we've created, well, we don't print out stuff anymore, right?
We don't basically show up to the exec meeting with, here's the printout, the exec pack and the rest of it.
The exec pack is basically a Google doc.
And the Google doc links into the Tableau report and all those things.
And then you started getting into like,
what's a hack day in an organization?
It's basically, so the cultural shift is rapidly happening.
And I think it's like, hey, it's winners and losers, right?
A lot of businesses will struggle to make it through this pandemic
and the ones who basically come out the other side of it,
I think they're the ones who are going like,
hey, what is everybody else doing?
And I think if they've at least made the journey
to cloud analytics and they have a cloud platform,
they have a fighting chance of taking some of those insights
they probably have already created.
They've just got to basically get them out of the data warehouse and put it in some operational
system, database, e-commerce website, something, propensity to buy model.
What's the next product that somebody basically wants? All those things.
A few weeks ago, I met one of our customers.
It's an online webinar.
I kind of jokingly said to him, like it wasn't really insulting, but he's in insurance.
I said, hey, insurance must be the most boring data industry to work in, right?
And I said, so what are you doing in data science?
Probably nothing, just to provoke a reaction.
And he kind of laughed and said, well, let me tell you what we do.
He said, you may have heard in America that sometimes we struggle with things like climate
change and agreeing whether it's happening or not.
I said, might have read that in the news.
Yeah, controversial to kind of dig into that with the customer on a live webinar. But he said, but we as an insurance
company have to take a position on climate change because we insure a lot of properties in the state
of Washington. You looked at the state of Washington recently, a lot of forest fires and
things like that. So they're using a lot of GPS location data, weather reporting data,
a lot of predictive models to incorporate that into the contract insurance information that
they have to say, what is the likelihood of a forest fire wiping out town A, B, and C next year?
Now that is climate change, but it's really interesting that as an insurance company,
they have to take a position on that because that is basically the future or
not the future of an insurance company.
If they are not on the right side of that predictive model,
like therein is the operational analytics that I'm talking about.
They're doing BI,
they've moved to data science and now they're basically building alerting into
their entire system.
Like, there you go.
In the most boring industry, insurance, they are doing incredible things.
And then I thought it was just hilarious.
Not hilarious, kind of big and important for me.
But I thought it was interesting that he chose to say that they had to take a position on
climate change.
So that's the other thing I think is fascinating during COVID.
We are all more exposed to data and data analytics and projects like never before, right? You think
about it. I used to watch the news in America the last 20 years. What is on the news that is to do
with data? It's all financial Wall Street stuff and baseball, right? Two things, the only two things. And now we have COVID, right? COVID is the third thing, where every day it's stats, reports, all those things.
That's really interesting to me in terms of that our culture is becoming more data savvy,
more analytics aware beyond the two popular ones, as I would have said, of finance and sports.
We're now basically looking at another dimension that is more scientific related,
but look at climate change reports and how they're denied. And basically some people believe and
some people don't believe those things. I think they're going to generate a culture of people
that have a greater, I hope, awareness of the importance of using data to prove or disprove a theory.
Well, we could not have picked a better way to end the show. I think that was incredibly insightful. I am with you. I hope that our societies do become more data-driven and become
more analytical because I think that's really healthy in many facets of life, not just if you work in data integration.
This has been a really great show and we'd love to have you back on as you settle into the saddle even more at Matillion and continue to build some amazing things for your customers.
Hey, great. Thanks for having me guys. Really great to chat with you today and love to do it again.
Okay.
That was a really interesting show.
And the takeaway for me is more of just a funny one. It was an anecdotal observation that Ciaran made, but he talked about the head of sales
giving you hell when something's not working or you have a data quality issue.
I think we've probably both experienced that throughout our careers in one way or another.
And I just got a kick out of the idea that the head of sales is sort of the most high impact data QA engineer that there is.
Yeah, I think the question, why is this lead not in Salesforce, is something that...
How many times have you heard that?
Yeah, it's like an amazing early detection mechanism for data quality issues.
Yeah, who needs a propensity model?
Absolutely. Yeah, yeah. I think everyone can relate to that.
Yeah, it was an amazing conversation.
I mean, Ciaran is a person who has huge, huge experience.
He has experience with ETL or ELT or data integration, whatever we want to call it, in many different phases of the industry, with Talend, and with Matillion now.
And yeah, he shared some amazing thoughts and experiences with us. And I'm really looking forward to having him back on the show. Well, thanks for joining us again. And we will
catch you on the next episode of the Data Stack Show. We hope you enjoyed this episode of the Data Stack Show. Be
sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack,
the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.