The Data Stack Show - 63: The ETL - ELT Flip With Ciaran Dynes of Matillion
Episode Date: November 24, 2021
On this week's episode of The Data Stack Show, Eric and Kostas have a conversation with Ciaran Dynes, the Chief Product Officer at Matillion, a powerful, easy-to-use, completely cloud-capable ETL/ELT solution.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
We have a really exciting episode coming up.
And what's most exciting is we're going to live stream it.
The topic is the modern data stack.
And we're going to talk about what that means.
It's December 15th and you'll want to register for the live stream.
Now, Kostas, it's really exciting because we have some amazing leaders from some amazing companies.
So tell us who's going to be there.
Yeah, amazing leaders and also an amazing topic. I think we have mentioned the modern data stack so many times on this show. I think it's time to get all the different vendors who have contributed to creating this new category of products to define the modern data stack and discuss what makes it so special.
So we are going to have people like Databricks, dbt, and Fivetran, and companies that are implementing state-of-the-art technologies in their data stack, like Hinge.
And we are also going to have VCs and hear their opinions about the modern data stack.
So, in a sense, the VCs are also going to be there.
And yeah, it's going to be super
exciting and super interesting. So we invite everyone to our first live streaming.
Yeah, we're super excited. The date is December 15th. It's going to be at 4pm Eastern time and
you can register at rudderstack.com slash live. So that's just rudderstack dot com slash live. And we'll send you a link
to watch the live stream. We can't wait to see you there. Welcome back to the Data Stack Show.
We are really excited to talk to Ciaran from Matillion. He leads product there, and he has a really long history of working in data.
Kostas, I'm really interested to ask him about Matillion specifically. And we'll probably talk about lots of things related to data in general. But there are a lot of ETL or, as we'll talk about, ELT tools out there. And I'm really interested to know how Matillion does things differently. I mean, they're a really successful company, raised a huge round.
And so I'm excited just to learn more.
How about you?
Yeah, absolutely.
I think they have raised like a quarter of a billion so far.
And they're one of the leaders in this ELT space.
So I think it's going to be very interesting to hear from him both about like,
first of all, like we'll chat with him about like ETL versus ELT, right? Like that's one of the
things that we need to ask him about. And yeah, I mean, Matillion has a great exposure to so many
companies out there. So I'm sure he will have some great insights to share
with us about where the industry goes, what the companies are looking for, how the data is used.
And yeah, I think we are going to enjoy our conversation today.
All right, let's dive in. Ciaran, thank you so much for joining us on the Data Stack Show.
We're really excited to learn about you,
your background, and what you're doing at Matillion. Hey, thanks for having me, Eric.
Nice to see you. All right. So you've been working in the data space for well over a decade. Do you
want to give us just a quick background on where you came from and what you've done throughout
your career? Yeah, happy to give a quick intro. I've always been involved in integration software.
Back in the day, I started with a software company in Ireland that was very much about integrating different applications. I don't know if your listeners know about object request brokers, or ORBs.
They were kind of the precursor to web services. And then I kind of worked my career up in web
services and enterprise service bus, mostly on the messaging side, how applications and processes
got integrated. A bit of BPM, business process management along the way, kind of ended up then
doing a lot of work on API and so on. And a few friends of mine joined a software company called
Talend, and they invited me to join. And it was a bit of a breath of fresh air. I always found it kind of strange sometimes to explain what an API or an ESB is to friends and family. They were probably bored senseless listening to me talk about it.
But I actually found data so much easier to explain, because you could explain any kind of interesting analytics project, and there are so many of them. And then, yeah, I worked my way along with Talend. We went through an IPO. And then more recently, I've joined Matillion,
kind of very much looking at how analytics, cloud analytics and data basically behaves in the cloud.
But yeah, I've always been involved in integration software, as I say. I think data came along or
data integration came along and certainly lowered the barrier for me to explain to friends and family what I do and made it mildly interesting, purely because I think people are actually interested in some of the big data projects we operate on. Yeah. No, I'm laughing
because working in data, and I'm sure Kostas has had the same experience, you're at a holiday party with family, and now: what does your company do? And you pause for a minute to try to think about, okay, how do I package this in a way that's digestible? So quickly, could you just explain what Talend did and what Matillion does? Just in case any of our listeners
aren't familiar with either of those tools, I think most of them are, but just to kind of set
the table for the conversation would be great. Yeah. So the type of area that Matillion operates
in is in the area of data analytics.
I think a lot of people are familiar with data integration as a kind of a general term.
But data integration means a lot of different types of things.
It can be anything from data loading, people like Fivetran, Matillion, Stitch Data, Talend, and Informatica do those things, and a whole bunch of open source projects out there do that as well. And the simple act of loading the data into a data lake or an S3 bucket or blob storage, that's certainly one aspect of what we do in data integration.
But it starts to get a little bit more than that.
I think a lot of what Matillion really focuses in on is how data behaves within data analytics and data warehousing.
So it's very much about data in a data warehouse,
how do you merge, how do you curate, how do you get a 360 view of a given data asset?
And a lot of that information then ends up in Tableau reports, Qlik reports, it's very much
about BI and analytics. But data integration itself is a bit broader. There's also streaming
and those kind of areas that have little or nothing to do in some respects to analytics.
They can simply just be about moving data from one application to another and maybe even just moving the data back again.
You can imagine like, hey, your ERP, every time a new customer makes a purchase on a website,
ERP basically then is responsible for changing the inventory, doing the order, doing the whole cash flow process.
It's not really analytics per se, but it certainly has a lot to do with data integration.
So it's a pretty broad, all-encompassing term.
Most of what we focus on within Matillion, though, is really about making analytics-ready data.
So people can actually do the Tableau thing, do the Qlik thing, build some reports.
But then going beyond the report, it's about can we take a 360 view of a customer, patient,
employee, and start to basically connect that back into operational systems, be they applications, customer experiences on websites, or operational databases, simply just to scale a business.
So a lot of different things, pretty broad, but as I say, most of what we focus on is in the area of analytics.
Very cool. And I want to start out with a question, which has been just an interesting topic
in the data space in general, but I think, and I hope that you and Matillion have strong opinions. So ELT versus ETL,
there's a spectrum of opinions on this. And I think some strong opinions on sort of
which one is better, that varies by use case, but what's your take and how does that look
from a sort of actual product perspective at Matillion?
Yeah, I think I had a phrase recently that said ETL should never have existed.
Which somebody said, that's a pretty strong opinion for a company that does ETL.
And I said, yeah, probably is a strong opinion for somebody who does ETL.
But the question is, why did ETL basically get created?
It is a process after all. It's just a way of taking data from a number of different source systems, merging it together, and making a table.
That's literally all it does.
But I think the ETL versus ELT, you've got to look at how data warehouses were used, I think, back in the day.
And even if you were only to go back pre-Snowflake, pre-Big Data,
perhaps let's say go back 10 years, there were these kind of precious systems, and people had a fear of the admin who owned them. And God forbid you went and asked them to run something ad hoc, that's not even a word you could use with a Teradata admin.
It's just like, what are you referring to? Go back to where you came from.
So it was very much about business critical,
financial critical workloads,
which makes a lot of sense, right?
That you're paying a lot of money
for some very, very highly optimized,
amazing software.
Therefore, like the ad hoc kind of analytics
that maybe we'd run today,
or even just the scale analytics,
it just would have broken the bank. You wouldn't have been able to fund an analytics project. So in that light,
I think ETL got created. So sure, it makes the data warehouse run faster in a sense that it can
extract data, load data, curate data. But actually, it took a lot of the processing,
the more ad hoc analytics processing outside of the data warehouse.
And therefore, you end up with these kind of dual parallel systems, your most important analytics
happening in the data warehouse and everything else, just whatever it is outside in this ETL
process with its own specialized software. It had its own engines. In some ways, it had its own
horizontal scalable engines, clustering, all that type of stuff exists in ETL products.
Whereas if you fast forward to today and you look at Snowflake, Databricks, Redshift, any of them, they don't even describe themselves as data warehouses anymore.
They'll describe themselves as a cloud data platform and all that kind of stuff.
But when you peel it back, you kind of say, well, actually, they're a utility. The barrier to go and run a process,
ad hoc or otherwise, even the most important, it's like $2 a credit. So you can go in as a team,
just start to do your own analytics, just purely in isolation, maybe from centralized IT and get
on with it. And in that world, you kind of go and say, well, where does the processing now belong? It's like, well, this utility, the snowflake, this incredible kind of linear scalable capability I've
got with all of my data at my fingertips, surely the better thing would be to leverage it and not
have a separate parallel system. So why did I say that ETL shouldn't exist? It's because
it does exist for certain types of use case, but the balance of processing, whereas previously it may have been like 80-20, you might've had a lot of processing
in ETL and a certain portion of high value stuff in data warehousing. I think that's reversed
completely now when it comes to data analytics. When I'm thinking about the cleaning of the data,
the preparation of the data, it all just lives and belongs inside the data warehouse. It's faster, cheaper, better, more secure. It's kind of like the Olympics. It's that.
Therefore, I think the real architecture design pattern for a lot of what we do when it comes to
cloud data warehousing, it just belongs inside the hyperscalers. And therefore, you should just use
it. Where ETL makes a little bit of sense
still is the loading or the extraction. There are some periphery use cases that make a lot of sense
not to be done in a data warehouse. But back to our original definition of data integration,
I think those things are kind of like either on the load or on the extract, or when you're kind
of doing like app to app kind of data stuff.
But what we use ETL for is always about pushing down into data warehousing.
I think that belongs in the data warehouse.
And that's why I think there is a fundamental shift that's happened where people really are now using an ELT architecture.
Maybe some people don't even recognize it as such.
They go, no, it's ETL, but that's the product category.
I think the architecture is really an ELT architecture.
So no strong opinions at all at Matillion.
Not in the least.
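[Editor's note: a minimal sketch of the ELT pattern described above, load the raw data first, then run the transformation as SQL inside the warehouse so its compute does the work. The table and column names are hypothetical, and `warehouse_conn` stands in for any Python DB-API connection (Snowflake, Redshift, and similar drivers follow this interface).]

```python
# ELT in miniature: E + L land raw rows untouched; T runs inside the warehouse.
# `warehouse_conn` is any DB-API 2.0 connection; names here are hypothetical.

def load_raw(warehouse_conn, rows):
    # Extract + Load: land source records untransformed in a staging table.
    cur = warehouse_conn.cursor()
    cur.executemany(
        "INSERT INTO raw_orders (id, customer_id, amount, created_at) "
        "VALUES (%s, %s, %s, %s)",
        rows,
    )
    warehouse_conn.commit()

def transform_in_warehouse(warehouse_conn):
    # Transform: plain SQL the warehouse executes itself, so the "separate
    # parallel system" of classic ETL engines never enters the picture.
    cur = warehouse_conn.cursor()
    cur.execute("""
        CREATE OR REPLACE TABLE orders_curated AS
        SELECT customer_id,
               SUM(amount)     AS lifetime_value,
               MIN(created_at) AS first_order_at
        FROM raw_orders
        GROUP BY customer_id
    """)
```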
Do you see any kind of current or future use cases where ETL might still be relevant?
I think the one that we see is certainly ingestion,
although some companies would describe it as just ingestion.
The act of having a SaaS application that just does the load,
it makes sense there, right?
Because it's highly optimized.
You can do streaming.
You can do a whole bunch of other types of use cases.
And the data warehouse is not,
well, either they're highly protected, you don't want them connected to the internet that way,
or they're not yet optimized for that use case. I can see Snowflake and others basically heading in that direction where they're adding more streaming ingestion capabilities. But simply
the act of loading ingestion, I think, it's kind of ETL-like.
The other part I think is interesting is the last mile of analytics is where you have something highly curated, 360 view of a customer, and you want to synchronize that back into an operational application, operational database.
I think that is ETL as well.
It's a separate process that sits outside the data warehouse.
Data warehouses themselves are not optimized yet to run a lot of services,
even some of the work that some of the data warehousing companies are doing today.
It tends to be, how would I say it, kind of limited in some of the things it can do. They're not fully formed services in the way a SOA architecture would think about a service; even a container or a microservice tends to be extremely tightly bound to do functional things, where state and history and other things basically don't apply.
So I think at the edges, it makes a lot of sense, at IoT. But again, are we still doing ETL at that point? Or is it more like a streaming use case?
Is it Kafka? Is it Confluent?
I think there's other technology out there
that does those really effectively.
But I think ELT as an architecture
makes the most sense today
in terms of how people are using
data in the enterprise.
Cool.
So if I'm thinking about how someone is doing ETL with something like Spark, for example: you have the extraction part, and you will write some code for the transformation.
So in transit, the data is going to get transformed, and then it's going to get loaded into the destination, which in our case, let's say, is a data warehouse or a data lake. How does this transformation part, which naturally in ETL is a piece of code that we write, happen in ELT? Because this part is pushed into the data warehouse, and the data warehouse is a technology that has primarily been developed to ask questions and get replies to those questions, right? So how do you see this implemented, and how is Matillion doing it, if there are multiple flavors out there of how to do that?
Yeah, it's a very interesting question.
So if you look at, we spent a lot of time working with Databricks on the way their architecture works, and a lot of my background is with Spark technology from my previous employer. Arguably, what they do is that separation of compute and storage.
So their compute and their separation of storage, it's not in any way dissimilar to what a cloud data platform does. It's just different technology. Now, they have different smarts and different schedulers, and they have different histories, but basically what they're both doing is separating the storage. They have a way of clustering the compute.
There's a scheduler.
They break the task down.
They do a whole kind of MapReduce kind of behavior.
Like if I look at that long and hard, the fact that one uses Spark, one uses Python,
the other uses SQL, that's the modern architecture in my mind.
That's what it is.
That is the ELT.
I think ELT of yesteryear is synonymous with SQL only, and it's only working with cloud data warehousing or even just data warehousing.
But if you look at data like the Lakehouse architecture from Databricks and the way their SQL analytics platform behaves, you can push SQL, Python, PySpark into that engine, it'll look after how the scheduling and the splitting
of the task works.
But for all intents and purposes, it's still an ELT architecture in lots of ways in that
there's a logic that's sitting directly on top of that data, virtualized.
If I was to take the exact same problem and move it over to Snowflake, I'd probably get it all to work with the same behavior.
It might use different technology.
It might be SQL-based, but pretty much it's the same thing.
You've got access to essentially all the data storage, and you can spin up the compute as you need it.
It's not like it's a completely separate thing.
In that respect, I think that's how we would consider it.
And we, as Matillion, we just generate SQL for Databricks. It's highly optimized
for their platform. If we take the same design and we shoot it over to Redshift or Snowflake,
internally, we will just generate different SQL to leverage that platform because they have some
specializations and variants between each of them. But to you as an end user, you just see a design.
But under the covers, we are basically leveraging that ELT architecture. Maybe I couldn't convince Databricks
to call what they do an ELT architecture. But at the end of the day, that separation of compute
and storage, it's that I think is the modern data architecture that people are looking to leverage.
And the fact that the storage basically is like literally just infinitely
scalable and so ubiquitous that you just can create materialized views in the data and use it for
multiple different things. I think that's the big game changer that we basically are witnessing.
And how does it work with Matillion? What's the experience that someone has when using Matillion?
So our experience, I guess, borrows a lot from a no-code, low-code IDE, drag and drop,
where you are designing a logical flow.
So things like, you take a data set, you tend to almost get a table view of your data; it tends to try to flatten everything into a table.
We think that tables are, I guess, easier for most human beings to kind of mentally construct. And we're dealing
with analytics people. So ultimately something, if it makes it into a table, it's easy to sort,
easy to filter, and it's easy to basically pivot. That's the moral of the story. But actually under
the covers, it isn't all normalized. It isn't all flattened. It's a highly structured internal data model that
we have. It's just that the visual cue that you see on top of that is just to make it easy for
you to use it. But it is very much a drag and drop metaphor. It has a lot of if then else logic
that you typically see in ETL style products. And from that, we kind of create a visual logical
documentation of the analytics that the end user, the developer, is trying to come up with.
And then when they go to run the product,
I think this is where the real kind of smarts of Matillion kicks in.
We start to do a lot of live sampling with the data.
So you kind of construct a piece of logic under the covers.
We're creating SQL, interacting directly live with Snowflake.
We're validating that SQL is valid.
And then we're producing a sample data set so you can actually see, at that point in the design of your structure or your flow, ah, okay, up until now, I've got my Salesforce data.
It's looking kind of correct.
Maybe I've normalized the US and the European dates, because it tends to be the case that
in Salesforce, you run into that issue quite a lot.
OK, great.
What's the next thing I want to do?
I want to bring in my Pardot data.
I want to kind of merge those based on a particular primary key.
And the visual cue basically helps you continuously just iterate, iterate, iterate.
By the time you get to deploying that data into Snowflake or whatever your data warehouse is of choice, you're pretty much certain that the table structure and the logic is correct.
The only thing that potentially goes wrong is just that as you went through the sampling,
you didn't realize that that sample set wasn't representative of the global underlying data set.
That can happen. You tend to only see a couple of hundred rows, but maybe the underlying data
set is a billion rows. And that's why when you flush all this through into the data
warehouse, you can then go and check it and visually check to see if you've actually corrected
all the errors in the data. But it's very much a visual metaphor. We try to get as much as we can
to a no code or even low code, but we have a million extension points where people who want to plug in SQL, things like Python,
you could even plug in R code, and things like dbt are all fair game for us.
You can plug in those capabilities and we simply just orchestrate
across all of them.
We're trying to get a visual document representation of your analytics.
And last I checked, I talked to a couple of customers last week, like seven different data sets is kind of the norm for anything that's moderately close to what we'd call an insight.
But we've got customers at 26, 27 different data sources to produce a marketing lead score.
Trying to hand code that, and you can, right? It's just the maintenance, iteration, upgradeability of that flow.
That's where we think that the visual look and feel of the product really starts to come
into its own, as well as that sampling capability, which we think is really just a killer capability
that as you design, you see live data and you see the logic of what you've designed.
It's those things that basically are the powerful capabilities that Matillion offers.
And again, it's all an ELT architecture.
So we're directly operating on top of your data warehouse.
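[Editor's note: a sketch of the sampling idea described above, run the generated SQL with a small LIMIT so the designer sees live data at each step. The SQL is Snowflake-flavored, and the Salesforce date cleanup is a hypothetical stand-in for the kind of step described.]

```python
# Design-time sampling: validate the generated SQL against the warehouse and
# pull back a couple of hundred rows for visual inspection. Note the caveat
# from above: a sample may not represent the full billion-row data set.

SAMPLE_LIMIT = 200

def sample_step(warehouse_conn, generated_sql: str):
    cur = warehouse_conn.cursor()
    # If the generated SQL is invalid, the warehouse rejects it right here,
    # which is cheap feedback long before the full pipeline runs.
    cur.execute(f"SELECT * FROM ({generated_sql}) s LIMIT {SAMPLE_LIMIT}")
    return cur.fetchall()

# Hypothetical step: normalize US (MM/DD/YYYY) and European (DD/MM/YYYY)
# dates into a proper DATE column, in Snowflake-flavored SQL.
step_sql = """
SELECT id,
       COALESCE(TRY_TO_DATE(close_date, 'MM/DD/YYYY'),
                TRY_TO_DATE(close_date, 'DD/MM/YYYY')) AS close_date_clean
FROM raw_salesforce_opportunities
"""
```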
Yeah, that's super interesting.
And who is the user of Matillion?
The user for us is the data engineer.
But the problem with that term is that means a lot of different things.
So, ETL engineer, 100%, it's just that person who's used to the ETL design paradigm. Data engineer, I think, is a broader term. I think data engineer for us is anybody that could be doing things like Airflow orchestration. They could be hand coding. But you
look at it long and hard enough, it's just a different tool set or different stack for them.
So we try to blend both of those in where people who are more used to that kind of
engineering background, which is CICD, inversion of control, hey, they probably grew up writing
Java code using Spring Framework for all I know, but that's just me. But that type of person is now
coming into the data world. The reason being, I think, is that people are recognizing the resilience of the data pipeline, it's a phrase you hear a lot, like the downtime of your data.
And I think engineers have been really good, certainly SRE, cloud ops engineers have been really good in terms of figuring that problem out.
And I think that is influencing, that data ops thing is strongly influencing or has influenced Matillion in terms of how we look
at orchestrating those pipelines together. So we have these ETL people. They're very much looking
at business logic, business data, and their job is to take what your CFO wants to see in terms of
revenue forecasting and that type of thing. But there's a whole bunch of other people around it
who are kind of building all the periphery, the connecting and the loading of the data into the bronze sort of storage.
That engineer is also part of what we do.
But I think they're different skill sets, but they're complementary in nature.
I think we tend to separate that there's like almost like a mini SRE team, which are data engineers that surround these ETL engineers. And the ETL engineers are really looking at the actual design logic creation of this master
record of something.
So that tends to be the two groups that tend to use our product.
Yeah, it makes total sense.
And it's a great point that you are making here because many times you hear people asking
like, okay, what is a data engineer?
Like, why do we need another discipline in engineering, right?
And actually, I think that the best definition that I can personally give is that data engineering
is like a hybrid between operations, SREs, as you said, and actual software engineering, because you also have to do both.
So pretty much, to be a successful data engineer, you need to have knowledge from both: you need to build your pipelines, but at the same time you have to monitor your pipelines, care about SLAs, and keep them up and running, all these things.
And I have a question, which is actually something that I find interesting in general,
not just for data products: how does this visual metaphor that you described fit into the workflows that engineers and developers have, all this CICD, versioning, all the standard tools and methodologies that engineers use to support the quality of their work?
How does it work?
Very interesting question.
I spent a lot of years basically looking at CICD version control.
A good number of years ago, this actually must be, hold on, I want to go back 18 years.
I was at one point a ClearCase admin.
So I spent a bunch of time being an engineering manager, and I had to be the ClearCase admin because there was just nobody else to do it. So I kind of grew up in that whole strong version control that IBM
Rational products had. And then other types of products have come along. I think these days,
everybody uses Git or Bitbucket and those types of things. But the whole notion of versioning and branching and merging and those types of capabilities,
I just don't think it's, not that it's not natural, but it's not in the kind of the purview,
I think, of the ETL engineer.
It certainly hasn't been, but it's definitely something that engineers are just going to
go, well, that's how you do it.
So what we've tended to see is the capabilities that are kind of the Git-like thing with version control and branching and merging, they're becoming commonplace in the data products, the ETL stack.
We may not use the same labeling and the way it's visually shown to the end user as the way an engineer would be comfortable with, but it's the same thing.
And actually, under the covers, we're using Git, for that matter. That's how we do our version control, and it's very strong version control, very strong branching and merging. But I haven't yet exposed that terminology to the ETL engineer. I don't want to scare them. But I think they like the fact that they can roll back and they can share and they can do all those things.
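[Editor's note: a generic sketch of what Git-under-the-covers versioning of a pipeline definition can look like, serialize the design, commit it, and lean on Git for rollback, branching, and merging. This is the general shape of the idea, not Matillion's implementation.]

```python
import json
import subprocess

def save_version(repo_dir: str, pipeline: dict, message: str):
    # Serialize the pipeline design deterministically, then commit it.
    with open(f"{repo_dir}/pipeline.json", "w") as f:
        json.dump(pipeline, f, indent=2, sort_keys=True)
    subprocess.run(["git", "-C", repo_dir, "add", "pipeline.json"], check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m", message], check=True)

def roll_back(repo_dir: str, commit: str):
    # "Roll back" for the ETL engineer is just restoring a prior commit.
    subprocess.run(
        ["git", "-C", repo_dir, "checkout", commit, "--", "pipeline.json"],
        check=True,
    )
```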
And then it goes further than that, right? Because non-repudiation of a version is becoming a really
important thing in our world because we operate so quickly at some point when something breaks.
Now, breaking could be just that a pipeline doesn't run, could be a security issue,
could be something else, could be something more nefarious, right? That a bunch of records basically appeared on the internet and lo and
behold, we didn't mask something properly. Somebody's got to go check out why. And maybe
there was a misconfiguration of a rule inside one of the ETL pipelines or one of those particular
products. If that's not versioned and controlled and checked in, you have no idea. And a lot of
ETL down through the
years was just not that. It's almost like we got an analytics project. Great. How does the data
work? It does this thing. Okay. How do we know we're being successful? Because the head of sales
basically hasn't given out to me this week. That was the testing, right? It was that. And then you
come along and you kind of upgrade or migrate that. So we go, how do we retest? Well, we check
to see if the head of sales is giving out to us again.
And then we know the report looks like it's correct.
But that's not good enough, I think, clearly in modern enterprises.
So I think the CICD is here to stay.
It's just that we don't necessarily expose those features the way we would to an engineer,
but we're actually still using those under the cover.
So that's how we experience it. But it is strong versioning for a lot of good reasons,
but a lot of it comes back to, we just simply think it makes the data boat go faster
because upgrades and migrations and all those things that happen all of the time now
and reuse is really well supported by those principles. Yeah, that's super interesting.
And there are two terms that we hear a lot lately, and many companies are getting funding to build products around them, which is anything around data governance and data quality.
What's your opinion on these?
And how do you see these kinds of functionalities playing together with an ETL or ELT tool like Matillion?
Very interesting one.
I've spent a lot of time over the last number of years building data cataloging technology
and data governance technology.
And I've kind of seen it grow up and then during COVID, I wouldn't say it waned, but it has basically found maybe some of its
place and position.
So it's a case of going, I think cataloging capabilities can really dramatically improve
analytics.
They really promote very strong reuse.
If you can extract a lot of the semantic meaning of data, you can do really cool things.
You can start to automatically infer if the data is good or bad or if it's standard or not
standard. And that stuff comes, I think, a lot from the principles of what data governance teams
and product can do. They're very good at looking at metadata. They're very good at looking at relationships. And if you can put that stuff to use, you can ultimately solve the big problem,
which is the data quality problem.
So I think a lot of what the governance products can do is provide really good semantic understanding of data that could be used not just for the purposes of governance,
but actually, more importantly, used for the purpose of data quality and fixing data or automatically detecting and indicating there's something wrong with the data.
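[Editor's note: a minimal sketch of using cataloged metadata to flag bad data automatically, as described above. The metadata format is hypothetical; real catalogs expose far richer semantics.]

```python
# Use per-column metadata (expected type, null policy, value range) to
# detect and indicate that something is wrong with the data.

COLUMN_METADATA = {
    "age":   {"type": int, "nullable": False, "min": 0, "max": 130},
    "email": {"type": str, "nullable": False},
}

def find_quality_issues(row: dict) -> list:
    issues = []
    for col, meta in COLUMN_METADATA.items():
        value = row.get(col)
        if value is None:
            if not meta["nullable"]:
                issues.append(f"{col}: unexpected NULL")
            continue
        if not isinstance(value, meta["type"]):
            issues.append(f"{col}: expected {meta['type'].__name__}")
        elif meta["type"] is int and not meta["min"] <= value <= meta["max"]:
            issues.append(f"{col}: {value} outside [{meta['min']}, {meta['max']}]")
    return issues

print(find_quality_issues({"age": 214, "email": "a@b.com"}))
# ['age: 214 outside [0, 130]']
```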
A lot of the governance products, and we're really good friends with Collibra, a lot of them basically exist in some ways at a different level.
They're kind of like a ticketing system, whereby there are approvals and data custodians, and people who own the data have to basically approve it as sanctioned for use.
But I think they're only really at the beginning of that industry. I think it's like, yeah, we've seen massive innovation there in the
last couple of years. But I think it's going to be more interesting if you look at what Snowflake's
doing around the data cloud, this idea that there are these massively curated sets of reference data
sets. It becomes really interesting that if you start to blend some
of the principles of the catalogs and the governance in terms of where did that data go
and how does anybody know after it's been released in the data cloud? So I think governance is
interesting that it has a whole new innovation area that I think it'll eventually end up in.
But I think primarily right now is I'm fascinated by the use of the metadata that governance tools have, but to actually go and fix the quality problem.
I think that is actually a problem we should go fix.
I think it's not even just practical.
It's like we have to solve that problem.
And I think governance is kind of like an interesting secondary issue that a lot of organizations have.
But everybody has a data quality problem.
Everybody has that problem. So I think for ETL and us, we use the metadata to go fix it.
And then we partner with the best in the business, the likes of Alation and Collibra, to help their
customers do what they want and what they do in terms of approvals and all those types of things.
But to me, I'm more interested in the use of the
metadata to go fix the quality issue. Super interesting. Okay, so ETL, or ELT as we call it today, I mean, it's something that has existed pretty much since we created databases, right? So we might keep reinventing it, but as a process, it has existed forever.
What's the future?
What does it look like?
How do you see it based on your experience with Matillion?
What is next, both for Matillion and also for this category of products?
I think you're right.
I think every once in a while, a blog will come out, usually by a data integration vendor,
that ETL is dead, just to kind of reskin it and say it's not quite dead.
It takes on a new life of its own. What we look at right now is this, what we call a definition
of the modern analytics, which is a combination of BI and data science and operational analytics.
So in that respect, what I think is that you're right, ETL is here to stay, but I think the future
of ETL is back to what we talked about in terms of
the operations. It's really about not just automating much more, it's about much more
resilience in those pipelines. How can you detect that something is going to fail before it fails?
I think we can really solve that problem today. How can you do things like get a job to optimize itself?
Those types of things are definitely starting to become real, the things that we can actually go do,
because we've learned a lot more about the relationships of the data, and the query optimizers inside the data warehouses are becoming a little bit more accessible in terms of how the APIs work. But I think that's where a lot of the ETL has got to go,
is that can we detect
errors before they happen? Can we alert people? But then can we auto detect that something could
be better optimized by automatically tweaking the configuration? And the only way we can do
those things is A, we've got APIs. We have the ability to inject variables. So again,
good engineering principles. And then it's actually about leveraging the APIs
of those underlying platforms
where they have really smart, intelligent things built in.
And we can basically promote different attributes,
different ways of configuring the optimizers.
And those optimizers then help the actual job run better.
So there's an ecosystem, a kind of sense that if you can bring together all those capabilities, the ETL becomes smarter, more resilient, more optimized
in the future. But I do think it comes back to that is that we're trying to solve the problem
of BI data science and operational analytics. And ultimately, that's going to be about making
the pipelines run faster with more resilience and then using the data,
curating it much more and reusing the insight that we actually generate and curate. That's what I
think the future is. And that's exactly what we're building at Matillion. We call it the data
operating system. We think that companies need to run their data as an operating system. And
an operating system by definition is modular, smarter, more resilient,
more scalable than the way we used to look at ETL, let's say last year or the year before.
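[Editor's note: one crude way to "detect that something is going to fail before it fails" is to watch run metrics against recent history and alert on drift. This is a simplified stand-in, with arbitrary thresholds, for the kind of self-monitoring described above.]

```python
import statistics

def is_anomalous(recent_durations, current, sigmas=3.0):
    # Flag a run whose duration drifts far outside recent history.
    mean = statistics.mean(recent_durations)
    stdev = statistics.stdev(recent_durations)
    return stdev > 0 and abs(current - mean) > sigmas * stdev

history = [61.2, 59.8, 63.1, 60.4, 62.0]  # last five runs, in seconds
if is_anomalous(history, current=184.0):
    print("alert: run time drifted sharply; investigate before downstream jobs fail")
```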
Nice. One last question from me and then I'll give the stage to Eric. Okay. About the destinations,
I think the set of possible destinations, it's pretty limited.
We know it's all the data warehouses that are out there.
There are not that many anyway.
But about the sources, and basically your experience as Matillion with all the different companies that you have interacted with:
What are the most, let's say, common ones?
And also, can we break them down into some categories of sources that are distinct in some way?
It's a great question.
I think it's a real bugbear of all software vendors right now that everybody ends up basically becoming a connector company in the integration world.
And a lot of it's down to the customers, and I'm not sure it's unwillingness, because you get why they want it.
Every connector has to be supported.
So then everybody basically does the same thing over and over again.
We all end up with hundreds of connectors.
And then lo and behold, AWS will change its security profile,
come out with some new IAM service,
and you've got to go and iterate through 100 connectors.
And we all do it, right?
Every single one of us.
I mean, it doesn't matter.
You're going to rock up to your next big $1 million customer come January. And they're like, hey, do you guys
support some API from some new CRM that you haven't heard of before? So to break the back of that
problem, we've been kind of looking at, hey, we'll give you a no-code toolkit. You point it at the
API, and we will automatically construct a Matillion connector under the covers to try to
alleviate some of that need for the vendor always to be building out the connector. Connectors for
us largely fall into really just two very simple categories. At Matillion, we tend to broadly look
at batch-orientated APIs, batch-orientated data warehousing like JDBC connectivity. But now
increasingly, we look a lot more at CDC and
streaming APIs. So there's a lot of work that we've been putting in. We're going to announce
it at re:Invent in a couple of weeks in the area of change data capture and streaming.
We tend to look at those APIs subtly differently because the nature of the queuing capabilities
and the queuing technology, and there's just a whole other kind of service lifecycle that you
have to obey and observe that's quite different with APIs in a sense of internet APIs, REST APIs versus something
like a queuing technology where you read it once at most once delivery, all those types of things
are very, very different. So I tend to look at them, those are two broad categories, but ultimately
I think it comes back to the vertical categories that customers are interested in.
Do you have a set of capabilities in finance?
Are you guys really good with billing applications?
Like, do you support Recurly and all the rest of it, like the whole list of things like
NetSuite?
But I think for us, it really comes down to that.
The ingestion capabilities are broadly bifurcated into REST APIs, databases, and increasingly now streaming
APIs.
Okay.
That's interesting.
And why is CDC important? I mean, recently Fivetran acquired a company that specializes in CDC.
We have seen CDC being mentioned a lot, especially in big corporations. We had someone from Netflix, and they have done a lot of work there.
Why is CDC a thing? Because the technology it is based on is the replication logs of the databases, right? It was built for something completely different. So, yeah.
For every time I've heard ETL is dead, I've heard CDC is dead.
I think a lot of it is to do with organizations right now are doing cloud migration.
And they're trying to digitize as fast as they can.
And at the end of the day, they don't have the ability to always change all of their on-prem software at the same time. But they have the need
to basically get that data, the change data sets into their cloud analytics platform. So I think a
lot of it is for me is that they've selected a cloud data warehouse. They've bought in very
strongly to the vision of what that analytics can deliver. I mean, it's true, right? I've seen it
for myself. I can see what those platforms can deliver. But some of those changes in their business are so important, and they have to
happen at a faster rate than basically a daily or an hourly batch load, that it's like, hey,
if we could just use the CDC style of use case, that would basically help our analytics. And I
think it's that state of affairs that we're in right now.
I do believe, though, that there will be another messaging technology.
It could be Kafka.
It could be a new variant that comes along that's so ubiquitous and widely deployed within
the cloud infrastructure.
And it overcomes a bit of the kind of the complexity of the admin that we could just
see a replacement of some of that CDC style of use case, which,
as you said, is the redo-log kind of style.
And it becomes much more of a messaging kind of push to a queue with topics, basically
multiple readers.
Right now, I think it's just one of practicality.
I think we're used to basically doing the logging.
We can't change those operational databases, even if we
wished, because it would just impact the business so catastrophically. It would be just too risky. So why change it? Why change what works, I think, is what I'm observing. Like one in every
four of our customers right now is like, what are you guys doing with CDC? And how do you get it
into Snowflake? So it's not just a, it's like, can you get it into Snowflake
in a highly resilient way?
And I actually think, I was, I guess, proven wrong a lot by the likes of the Fivetran guys saying, hey, data ingestion is not just ETL.
And I was like, it is.
And they're like, well, no, it's different, because what we've done is we've just said we're going to solve the problem of loading data to the cloud.
And after that, do what you want with it.
And I think CDC is set for a similar kind of rethink.
It's just get it out of the log file and stick it into S3.
Do what you want after it.
And we'll have it ordered.
We will have a high fidelity.
We'll have a metadata log.
We'll have all of the information
that you need to go and do whatever analytics you want and as much of it as you want. And I think
that's the redo on CDC that's coming. It's optimized for the way we can do analytics in the
cloud. I think that's the evolution that's coming rapidly in CDC. Super interesting.
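[Editor's note: the minimal shape of "get it out of the log file and stick it into S3", consume ordered change events and land them with enough metadata (offset, operation, timestamp) to replay downstream. The `batches` input stands in for whatever log reader is used (Debezium, a native connector); the S3 write uses boto3.]

```python
import json
import boto3

s3 = boto3.client("s3")

def land_changes(batches, bucket: str, table: str):
    # `batches` yields lists of change events from a hypothetical log reader.
    for batch_no, batch in enumerate(batches):
        payload = [
            # Keep ordering and metadata so consumers can "do what they
            # want after it": offset, operation, timestamp, full row.
            {"offset": e["offset"], "op": e["op"], "ts": e["ts"], "row": e["row"]}
            for e in batch
        ]
        # One ordered object per batch of changes.
        s3.put_object(
            Bucket=bucket,
            Key=f"cdc/{table}/batch-{batch_no:08d}.json",
            Body=json.dumps(payload).encode("utf-8"),
        )
```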
Eric, all yours.
I know you can keep going.
I think we're getting close to the end here.
Ciaran, I wanted to rewind just a little bit.
You talked about modern analytics as being BI, data science, and operational analytics.
And I'd love to drill in on that. And one thing that we've
seen repeatedly on the show is that there are a lot of terms that I think a lot of people,
including myself, think are easy to define. Oh, analytics, right? But then if someone said,
hey, could you give me a really good, concise, articulate definition of analytics? I may have to stop and think about that
because it can be very complex and wide ranging, but it just really struck me that you sort of
included three pretty traditionally separate components in a single definition of analytics.
So can you dig into that? Yeah, I think for me, if I go back to where we kind
of started here, we talked about ELT and the kind of the benefits of cloud storage technology and
the separation of compute. Like, business intelligence, I'd love to know who came up with the term,
by the way. I think it's fascinating, right? Because I think middleware companies and
integration companies always have a drive for how do we tell people our business value? We never
really cracked the code. But the analytics guys basically cracked it 20 years ago. It's business intelligence. It
just sounds amazing. And really what a lot of it is, is just to basically, as you know,
is providing reporting on data. And there's a lot of stuff that goes into making that happen.
But when I look at data science, I don't think it's the same type of analytics in general. I think a lot of stuff we do in data science can be, but I think it's sometimes about
that the answers that we get from data science are not always deterministic. That's always the
classic one. Sometimes they can be range-based. So within a particular range, the answer is
somewhere here and it's a different type of thing. And you've got a whole bunch of techniques and algorithms and stuff that
people have built up. So I won't even go into all that,
but I still think it's analytics, but it's a different type,
serves a different purpose and whether you're bought in on the volume and
scale and all that type of stuff. Yeah, maybe, but,
but I think it's more to do with that.
The answer is not always a single deterministic value.
It's in a kind of a range of values.
But the last thing, operational analytics, I think that's different only because it distinctly
says we basically want to operationalize something that we've learned. And all it basically says is,
hey, the last mile of analytics is not a visual dashboard. It certainly is a great way to create
a conversation with an executive team.
But ultimately, like I always talk about leading and lagging indicators.
We're big believers in this at Matillion.
There's a framework called the Four Disciplines of Execution.
And it's like you define a wildly important goal.
As a team, you set this notion of what's a lagging indicator, which might be revenue or something like that, or customer count.
But nobody in a sales team can do revenue on a Monday because, hey, the deal might not even close for six months from now, but they can do how many customers they've talked to
this week.
How many demos have they done?
How many trials have they got in a queue?
How many SQLs have they cleared?
They're leading indicators of revenue.
So you start breaking things down that kind of way.
You start to get into this kind of
like, okay, the visual dashboards that we use in lots of organizations are really around those
things. But what do you do with them after you've learned some sort of an insight? You've learned
that there's a correlation between this and this. It makes so much more sense to take that insight
and take it out of the Tableau dashboard and give it back to
that salesperson who every day is making those cold calls. You'd say, hey, not for nothing.
The list of 100 calls we've created for you as the marketing team, we're going to stack rank them in
the best order we think is possible for you to call those customers because we think there's a
propensity model here that you need to know about. Propensity model comes from maybe data science.
The underlying data sets comes from BI. BI basically created a dashboard for everybody
to go and say, oh, that looks interesting. But operational analytics was to take that insight
and actually do something with it in the day-to-day of the salesperson. So that's why I
separate them into three things. It's because I think they deliver different value. And I think they actually are subtly different use cases, even though they all can be additive to each other.
And that's what we define as modern analytics. And when you're doing at least all of those three,
we think you have the right to basically say, hey, we're a digital leader. And it's Peet's Coffee, it's Slack, it's Juniper Networks, it's those companies that work with us.
And that's what they're doing. So when I started off by saying, hey, these guys are using these output connectors from Matillion, those are the companies that are driving us for those. They want those insights that they've generated from BI and data science, and they want to get those marketing lead score algorithms they've developed back to their marketeers.
That's why we call that modern analytics.
We think that really defines a data-driven company.
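[Editor's note: the operational-analytics step in miniature, the insight leaves the dashboard and reorders the salesperson's actual call list. The scores here are made up; in practice they would come from the propensity model.]

```python
# Stack-rank the marketing call list by propensity to buy, so the insight
# lands in the salesperson's day-to-day instead of staying in a dashboard.

call_list = [
    {"name": "Acme Corp", "propensity": 0.31},
    {"name": "Globex",    "propensity": 0.87},
    {"name": "Initech",   "propensity": 0.55},
]

ranked = sorted(call_list, key=lambda lead: lead["propensity"], reverse=True)
for rank, lead in enumerate(ranked, start=1):
    print(f"{rank}. {lead['name']} (score {lead['propensity']:.2f})")
```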
How many companies?
So a lot of times I think about the journey that a company has to go on
to become data-driven, call it digital transformation, whatever, right?
First, you need to collect all of your data and to your point, fix your data quality issues so that
whatever insights and however you're deriving them on top of all this collected data in your
warehouse are good insights. Number one, dashboarding, I would say is part of that,
but a lot of times it's sort of another step, right? Actually sort of building good dashboards. And that goes from executives sort of down to functional teams. How many,
like what is the fall off of companies who sort of collect data, do the quality thing,
have good dashboards, and then how many companies are actually doing operational analytics well?
Because my sense is that it's probably not many. I mean, you mentioned companies
that we all want to emulate, but I wonder what penetration is with operational analytics.
We did a bunch of surveys actually this year on that. I'm happy to share the data with you,
if you wish. We saw that BI analytics, yeah, no surprise, right? With like Matillion, it's like
in the 80th percentile, right? And surprisingly, it wasn't 100 percent. I just kind of scratched my head on that one. The data science guys were in
like the 46 to 52, somewhere around there, because we ran this survey over multiple days with
different webinars that we ran. And the operational analytics are anywhere between 8 and 14 percent,
right? So it's kind of down there. It's basically not done as much as maybe the
other types of analytics. I guess that's to be expected. People are maybe only now waking up to
do some of these things. I've got to believe that COVID has fundamentally changed the way we use
data forever. I think a lot of what we're doing right now in terms of digitizing the business
is never going to change.
We're never going to go back to some of the things we used to do.
We've had to basically put a lot more reporting in front of people in Zoom calls, in things that were in Slack and ClickUp.
So we've created, well, we don't print out stuff anymore, right?
We don't basically show up to the exec meeting with, here's the printout, the exec pack and the rest of it.
The exec pack is basically a Google doc.
And the Google doc links into the Tableau report and all those things.
And then you started getting into like,
what's a hack day in an organization?
It's basically, so the cultural shift is rapidly happening.
And I think it's like, hey, it's winners and losers, right?
A lot of businesses will struggle to make it through this pandemic
and the ones who basically come out the other side of it,
I think they're the ones who are going like,
hey, what is everybody else doing?
And I think if they've at least made the journey
to cloud analytics and they have a cloud platform,
they have a fighting chance of taking some of those insights
they probably have already created.
They've just got to basically get them out of the data warehouse and put it in some operational
system, database, e-commerce website, something, propensity to buy model.
What's the next product that somebody basically wants? All those things.
A few weeks ago, I met one of our customers.
It's an online webinar.
I kind of jokingly said to him, like it wasn't really insulting, but he's in insurance.
I said, hey, insurance must be the most boring data industry to work in, right?
And I said, so what are you doing in data science?
Probably nothing, just to provoke a reaction.
And he kind of laughed and said, well, let me tell you what we do.
He said, you may have heard in America that sometimes we struggle with things like climate
change and agreeing whether it's happening or not.
I said, might have read that in the news.
Yeah, controversial to kind of dig into that with the customer on a live webinar. But he said, but we as an insurance
company have to take a position on climate change because we insure a lot of properties in the state
of Washington. You looked at the state of Washington recently, a lot of forest fires and
things like that. So they're using a lot of GPS location data, weather reporting data,
a lot of predictive models to incorporate that into the contract insurance information that
they have to say, what is the likelihood of a forest fire wiping out town A, B, and C next year?
Now that is climate change, but it's really interesting that as an insurance company,
they have to take a position on that because that is basically the future or
not the future of an insurance company.
If they are not on the right side of that predictive model,
like therein is the operational analytics that I'm talking about.
They're doing BI,
they've moved to data science and now they're basically building alerting into
their entire system.
Like, there you go.
In the most boring industry, insurance, they are doing incredible things.
And then I thought it was just hilarious.
Not hilarious, kind of big and important for me.
But I thought it was interesting that he chose to say that they had to take a position on
climate change.
So that's the other thing I think is fascinating during COVID.
We are all more exposed to data and data analytics and projects like never before, right? You think
about it. I used to watch the news in America the last 20 years. What is on the news that is to do
with data? It's all financial Wall Street stuff and baseball, right? Two things, the only two things. And now we have COVID, right? COVID is the third thing, where every day it's stats, reports, all those things.
That's really interesting to me in terms of that our culture is becoming more data savvy,
more analytics aware beyond the two popular ones, as I would have said, of finance and sports.
We're now basically looking at another dimension that is more scientific related,
but look at climate change reports and how they're denied. And basically some people believe and
some people don't believe those things. I think they're going to generate a culture of people
that have a greater, I hope, awareness of the importance of using data to prove or disprove a theory.
Well, we could not have picked a better way to end the show. I think that was incredibly insightful. I am with you. I hope that our societies do become more data-driven and become
more analytical because I think that's really healthy in many facets of life, not just if you work in data integration.
This has been a really great show and we'd love to have you back on as you settle into the saddle even more at Matillion and continue to build some amazing things for your customers.
Hey, great. Thanks for having me guys. Really great to chat with you today and love to do it again.
Okay.
That was a really interesting show.
And the takeaway for me is more of just a funny one. It was an anecdotal observation that Ciaran made, but he talked about the head of sales
giving you hell when something's not working or you have a data quality issue.
I think we've probably both experienced that throughout our careers in one way or another.
And I just got a kick out of the idea that the head of sales is sort of the most high impact data QA engineer that there is.
Yeah, I think the question, why is this lead not in Salesforce, is something that...
How many times have you heard that?
Yeah, it's like an amazing early detection mechanism for data quality issues.
Yeah, who needs a propensity model?
Absolutely. Yeah, yeah. I think everyone can relate to that.
Yeah, it was an amazing conversation.
I mean, Ciaran is a person who has huge, huge experience.
He has experience with ETL or ELT or data integration, whatever we want to call it, in many different phases of the industry, with Talend, and with Matillion now.
And yeah, he shared some amazing thoughts and experiences with us. And I'm really looking forward to having him back on the show. Well, thanks for joining us again. And we will
catch you on the next episode of the Data Stack Show. We hope you enjoyed this episode of the Data Stack Show. Be
sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack,
the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.