Drill to Detail Ep.117 ‘How DataCoves Operationalises the Modern Data Stack’ featuring Special Guest Noel Gomez

Episode Date: December 20, 2024

Join Mark Rittman in this special end-of-year episode as he speaks with Noel Gomez, co-founder of Datacoves, about the challenges and opportunities of orchestrating dbt and other tools within the open-source Modern Data Stack, navigating the evolving semantic layer landscape and the future of modular, vendor-agnostic data solutions.

Links mentioned in the show notes:
Datacoves Platform Overview
Build vs Buy Analytics Platform: Hosting Open-Source Tools
Scale the benefits of Core with dbt Cloud
Dagster vs. Airflow

Transcript
Starting point is 00:00:00 So hello and welcome to the last Drill to Detail episode of 2024, and I'm very pleased to be joined today by Noel Gomez, co-founder of Datacoves. So welcome to the show, Noel. Hi, nice to meet you, thanks for having me on the call, I really appreciate it, and let's get going. Great, brilliant. Okay, so Noel, we met, we actually met at the Data Renegades event at Coalesce
Starting point is 00:00:37 a few weeks ago, actually, when I was over in the States and it was great to meet you there. And I said at the time that I was very keen to find out more, you know, about Datacoves and what you're doing. And yeah, just understand the story really of you and the company. So maybe Noel, just start off by explaining who you are and a little kind of elevator pitch really for Datacoves.
Starting point is 00:00:57 Yeah, so I'm Noel Gomez, a co-founder of Datacoves. I've been in technology, I would say, modernization for a long time. I've worked in large enterprises and with large enterprises. I've done some software development, worked with people like that. So I always saw a need to improve how we handle analytics in a way that's more repeatable. And what I saw is that there were great tools like dbt and Snowflake and Databricks, et cetera, but it was very hard for people to figure all the pieces out.
Starting point is 00:01:34 There's a lot of discrete components that are doing different pieces. And the modern data stack gives us the possibility of connecting these great open source tools, but a lot of the work is left to the person implementing. And so Datacoves came about as a way to simplify all the infrastructure, all the platforming, as well as helping people figure out all the best practices. Because when you're starting out, you don't even know what you don't know.
Starting point is 00:02:02 Okay, okay. So let's drill into that really as a topic first of all then. So you talk about dbt, and there's other tools around dbt that we tend to use like Airflow and I suppose IDEs and so on. So maybe just talk a little bit about the difficulty, I suppose, the complexities of running something like dbt in production. Why is that complicated, and why is it more than just running, say, dbt Core in a batch script or something? Yeah, so dbt is solving one part of the ELT or ETL process, so it is only handling the T, the transformation part of that, and it solves a lot of great problems. So it helps organizations start thinking about treating analytics as code.
Starting point is 00:02:52 So things like version control, having a repeatable process, having data quality, documentation, lineage, all those things are solved by dbt. But as I said, it starts with the transformation. So the first thing that you need to figure out is where are you actually storing this data? So there's a part of this where you're figuring out which data warehouse are you using? Are you using Snowflake or Redshift or BigQuery? And then you're likely using a tool to move data, because you're bringing together data that comes from multiple tools.
Starting point is 00:03:29 So this could be your CRM. It could be your web analytics platform. And so you need to think about, like, how do I get that data out of those systems and into my data warehouse, into my BigQuery or into my Snowflake? So there's a step there of extracting and loading the data. And then dbt comes along. And now we get into the orchestration. So the sort of basic approach, when people get started and don't really think about scalability, is that
Starting point is 00:03:59 they may be scheduling the extracting and loading at a certain time. Let's say at 8am, I will extract my data and load it. And then at 9 a.m. I will perform my transformations. And if these two tools are scheduled independent of one another, what may happen is that the first step fails or somehow it gets delayed. The second step will still run because it doesn't know that the data wasn't ready. And so that's when an orchestration tool comes in. So it's very common for organizations to need an orchestration tool,
Starting point is 00:04:32 whether it's Airflow or Dagster or something like that, which is connecting the process of extracting, loading, and transforming. And the process doesn't end there. So you may need to then refresh or extract data into your reporting tool or send some reports out to people. So there's additional steps. And so what a tool like Airflow is helping you do is stitch together all those discrete pieces.
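To make the dependency point concrete, here is a minimal sketch of the kind of Airflow DAG Noel is describing. The task names and the extract-and-load callable are hypothetical placeholders, not Datacoves code, and it assumes dbt Core is installed on the worker; the point is simply that the transform and publish steps only run when the step before them succeeds, instead of each tool being scheduled independently at its own time.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_and_load():
    # Placeholder: trigger Airbyte, Fivetran, dlt, or a custom script here.
    ...


def publish_reports():
    # Placeholder: refresh BI extracts or send reports once the data is ready.
    ...


with DAG(
    dag_id="daily_analytics",
    start_date=datetime(2024, 1, 1),
    schedule="0 8 * * *",  # one schedule for the whole pipeline, not one per tool
    catchup=False,
) as dag:
    load = PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)

    # Assumes dbt Core is on the PATH and the project is in the worker's working directory.
    transform = BashOperator(task_id="dbt_build", bash_command="dbt build")

    publish = PythonOperator(task_id="publish_reports", python_callable=publish_reports)

    # transform only runs if the load succeeds; publish only runs after transform succeeds
    load >> transform >> publish
```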
Starting point is 00:04:59 Okay, okay. So tools like Airflow have existed for a while. And are you talking about running these things yourselves, as in sort of customer-managed? Or are you talking about it being complex even when these things are provided to you as a service? Yeah, so a lot of people, you know, the good thing about open source is that
Starting point is 00:05:20 you can go ahead and run it on your own. And a lot of people will start down that road and go like, oh, it's quote unquote free, I will just install these things. And we've worked with people like that. So they may get a VM, like an EC2 on AWS, install Airflow, and they're ready to go. Now, the problem with that is they may not be thinking
Starting point is 00:05:41 about having a reproducible installation process. So they may not be using Docker images. They may not even know what Docker is. And running it on a single EC2 means that you're constrained by the CPU and the RAM of that system. So a more sophisticated way of running Airflow is to run it on something like Kubernetes, which will scale. So it'll bring up containers and bring them down when they're not needed. But running Airflow in Kubernetes now requires that people understand what Kubernetes is. And there's a lot of complexity that comes along with that. And so when people run into these issues, they'll go to a managed service.
Starting point is 00:06:24 So Amazon has MWAA as an example. So that's their managed Airflow offering. And that's fine, there's no inherent problem with that. But now you're using a very generic service. So it's not intended for dbt per se. I guess I should step back and say Airflow can be used for any number of tasks. People even use it to do the data transformation without dbt. So it's a generic tool that's, like you said, been around for many years. And so when you get these managed services,
Starting point is 00:06:57 now they're going to be generic as well. So you can use it for dbt, but you could also use it for other stuff. Where Datacoves comes in is to say, here's how you run dbt with Airflow without you worrying about any of that stuff. So it's not just the management of the platform, but how do you integrate dbt so that you're able to diagnose things a lot more simply and quickly, so that the customer gets up and running quickly, because we are very focused on how to run dbt and Airflow together, and so they're not discrete services from different providers. Okay, okay. So it's probably worth making a distinction here. When we talk about dbt, dbt can mean many things. And in fact, actually at Coalesce, there was the initiative from dbt Labs to talk about dbt One and so on.
Starting point is 00:07:54 So what are the different versions of dbt? And what are we talking about specifically here in terms of versions? Yeah, so it's interesting because dbt started out as an open source project. And what most people know as dbt is pretty much the same whether you're using dbt Cloud or Datacoves. So dbt initially, I mean, there's been a desire to move away from this meaning, but dbt stands for data build tool. So it is a tool for data transformation. So think about joining, think about transforming certain fields.
Starting point is 00:08:38 You know, if you get one and two and you want to make that equal to male and female or something. So you're using dbt for that. And so what was happening was it was realized early on that with the infrastructure, or just getting started with dbt, there's a learning curve. So think about the typical dbt user: initially they were business analysts, people who knew SQL but didn't necessarily know Git or didn't necessarily understand software development best practices. So what dbt Labs did was they created dbt Cloud, which is still the simplest way to get going with dbt. So you go in there, there's a simple way to onboard, a simple way to get going, but the focus has always been on just the transformation.
Starting point is 00:09:36 dbt One, I can't really speak for them, but it is an initiative essentially to integrate all of the different pieces that are going on in the data stack, all built by dbt Labs. And so what they have realized, which we realized from the beginning, is that you need more than dbt. As I mentioned earlier, you need to load data, you need to orchestrate, you may need a data catalog, and you need visualization and all of these kinds of things. So dbt Labs isn't solving for all of that, but there are pieces that they're solving for, and they're wrapping it all under that umbrella, which is being sold as part of dbt Cloud. Okay, so why might an organization want to orchestrate dbt Core along with sort of Airflow, as opposed to using dbt Cloud? So the dbt Cloud orchestration, like I said, has traditionally been focused on just the dbt part of that. So just to give you an example, when we started, we integrated a tool called
Starting point is 00:10:46 Airbyte. And that's a very simple point and click type of tool where you can extract and load data. And there's nothing wrong with Airbyte. Airbyte also offer their own service for that. Another tool that people use in this space is Fivetran. And so what you will see is that at most organizations of any size, pretty much, you quickly get to the point where you're not using a single tool for loading data. So you may be using Airbyte and Fivetran, or a custom Python framework, or there's a newer entrant in this space called dlt, which stands for data load tool. And so a solution like dbt Cloud is not going to help you orchestrate something that isn't part of dbt Cloud. Okay, so these ingestion tools, they're going to be independent. What does exist are essentially triggers. So you can load data with Fivetran, but if I'm loading data with Airbyte and dlt as well, now I can't do, you know, this type of orchestration
Starting point is 00:12:13 that I can with Airflow, because essentially what Airflow is allowing me to do is sit on top of all of that and trigger each of those pieces. And when they all succeed, go on to the next step. Okay, okay. So as well as, I suppose, the data transformation part and the orchestration part, what other open source technologies and, I suppose, components of a data stack do you typically see customers want to use, and that maybe, you know, Datacoves covers as well? Yeah, so the interesting thing about this is that what we do with Datacoves is give you a starting point, but we find that people want to do additional things. So a good example of this is, by default, we don't have a data quality observability tool, but we have people who use Elementary. So Elementary is another open source package, and they have their own service. So you can use it either way, with the service or
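That fan-in pattern, several loaders all having to succeed before dbt runs, might look roughly like this in Airflow. The three loader callables below are placeholders standing in for a Fivetran trigger, an Airbyte sync and a dlt pipeline rather than real provider integrations, and the DAG name is made up for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def trigger_fivetran_sync():
    # Placeholder: call the Fivetran API for the CRM connector.
    ...


def trigger_airbyte_sync():
    # Placeholder: trigger the Airbyte connection that loads web analytics data.
    ...


def run_dlt_pipeline():
    # Placeholder: run a dlt pipeline for a custom REST API source.
    ...


with DAG(
    dag_id="elt_fan_in",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    loaders = [
        PythonOperator(task_id="fivetran_crm", python_callable=trigger_fivetran_sync),
        PythonOperator(task_id="airbyte_web_analytics", python_callable=trigger_airbyte_sync),
        PythonOperator(task_id="dlt_rest_api", python_callable=run_dlt_pipeline),
    ]
    dbt_build = BashOperator(task_id="dbt_build", bash_command="dbt build")

    # dbt only runs once every loader has succeeded
    loaders >> dbt_build
```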
Starting point is 00:13:12 independently. And there are tools like Cube, cube.dev, it's a semantic layer, same thing, where you can use that along with Datacoves, but it isn't bundled in Datacoves. In Datacoves, we do bundle other tools. So we have created our own Python library called dbt-coves. And that's a tool that helps you do things like generate source YAML files and staging models and things like that without having to type a lot of code. So it just generates a lot of stuff for you. We also use a tool that we also took over called dbt-checkpoint. It's for CI/CD.
Starting point is 00:13:59 And so what we're doing constantly is evaluating what are new tools out there, what are things that make sense to really accelerate the maturity of how people use these tools. So we took those two over. There are other tools. Actually, you mentioned how we met at the Renegades happy hour. There was another sponsor there called Recce, and it's spelled R-E-C-C-E. They also have a data quality tool that can be used during the CI/CD process to actually understand how data is going to change. So what it'll do is it'll actually compare a table or a view, the results that it generates in your current branch versus how it looks in production. So that you can see, before you've applied the change, how is your data going to change? Are we going to lose a column or lose certain rows, or how do the values change? And so
Starting point is 00:14:57 it makes it more, you know, realistic before you actually apply the change in production. So there's a lot of that. Our approach, because I came from enterprise, our approach from the beginning was to make a tool that's very flexible. We cannot assume that we know everything, what everybody will want to use, what new things are going to come on the horizon. And so we wanted to make it so that you could use any of these things. And as an example, dlt did not exist when we had MVPs and early versions of Datacoves. But now we have customers who use it because it's out there. It's a great tool.
Starting point is 00:15:34 And so we can integrate it easily. Okay. So if I understand you correctly, just to summarize what Datacoves is, you're a sort of a replacement for a tool like dbt Cloud, but you extend into other areas of the kind of the ELT process, so the actual extraction and so on. But you're a fully managed service that takes dbt Core, provides it as a service, and extends the functionality beyond just the kind of transformation part. So what about the IDE part of dbt Cloud? What do you find people want to use there, and what does Datacoves do around that
Starting point is 00:16:11 really? Yeah, so when I started using dbt, and as I mentioned, I think dbt Cloud is still very simple to get started with, but I felt like I graduated from the dbt Cloud IDE. And by that, I mean, like, I wanted to use Python libraries, I wanted to use VS Code extensions. What's great about VS Code is that it's multimodal, you're not just using it for SQL, you may be using it for writing Python scripts and things like that. And so I felt like I wanted more power. And so what we did, instead of creating a custom IDE, we went with VS Code. So in-browser VS Code, which then allows us to offer any number of things. You have a full terminal. So you can run any commands, you want to do a cp command to copy a file or a folder.
Starting point is 00:17:10 It's very easy. You want to install something, you can. And so what we do is we give you VS Code in the browser with a default image. And so what that means is, if you're using Snowflake, it'll have the Snowflake extension pre-installed. So you don't have to do anything, and dbt-coves or any of these kinds of tools, they're all pre-installed. Now, the advantage that we also have with this model is that when we work with companies, they say, you know, in my process, I also need the Azure CLI. So that's not in our default image. But what we can do is we can say, we'll add that to your image so that in your custom account, let's say, all of your users will have the Azure CLI pre-installed. So a new user comes in and in 10 minutes, they're ready to go. There's nothing for them to install. It's just configuring their credentials to Git and to their data warehouse. Okay. Okay. So I guess another reason I've seen people use dbt Cloud, for example, is because there are features in dbt that are only available through that particular version.
Starting point is 00:18:10 I think one of the most obvious ones is the kind of dbt Mesh, or building up projects where you've got cross-project references and so on. So what's your view on that? And what does Datacoves do to enable, say, mesh deployments really? Yeah, yeah, and it's really interesting because we've come across those things, and like I said, I've worked with enterprises, and we have some customers that have very big dbt projects, so many thousands of models. They started with us before there was a dbt Mesh. But when that came around, it completely made sense to me. You start seeing the limitations, or the parsing time and certain things that happen when you have a very large project. So our approach was to say, how do we offer something like dbt Mesh? And so we created Datacoves Mesh, same
Starting point is 00:19:07 idea. And so what we can do in Datacoves is you can have multiple projects, we will store the upstream project's manifest, and then in the downstream project, you can make references. And so then we think about, like, how does that manifest get refreshed, whether you're using Git, like, let's say, GitHub Actions, or Airflow? How do we update that? And so we've taken care of that whole process so that it's very, very seamless. And then the other advantage that we have, once we save that manifest, we could also use it for deferral. So when you're in the upstream project, let's say I'm in development, I don't want to necessarily build all the models because I'm changing one of them.
Starting point is 00:19:52 And so deferral is a feature in dbt that allows you to say, all the upstream models that are not in my current database, get those from production. Well, we use the same manifest that we're using for dbt Mesh to do that deferral. And we can also do cross-project, column-level lineage. Now, our approach to that was, we could have created a custom viewer.
Starting point is 00:20:18 So we know that dbt-docs, the open-source dbt-docs, does not support cross-project lineage or column-level lineage. And so dbt Cloud offers dbt Explorer to solve that. We could have gone down that road, but we felt that what people really need is a real catalog. You want something that is helping you govern your data. And while dbt is a core component, there's other pieces of your stack that aren't dbt. So this could be metadata coming from
Starting point is 00:20:52 Tableau or metadata coming from even Airflow or your data load tool, etc. And so what we decided to go with was actually a tool called DataHub. So DataHub allows us to extract metadata from any number of tools, including Looker, etc. And what's really good is that even when you're putting it against the same data warehouse, let's say you're using Snowflake, there may be things in Snowflake that were not created by dbt, and DataHub allows you to extract that metadata. It integrates really well with dbt, it gives us the column-level lineage, the cross-project lineage, and it allows the organization
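For a sense of what that metadata extraction looks like in practice, here is a rough sketch using DataHub's Python ingestion pipeline; the recipe below, including the dbt artifact paths and the server URL, is an illustrative assumption rather than Datacoves' actual configuration.

```python
from datahub.ingestion.run.pipeline import Pipeline

# Ingest dbt artifacts so DataHub can build lineage from the dbt project;
# a separate recipe pointed at Snowflake, Looker or Airflow would pick up
# objects that dbt did not create.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "dbt",
            "config": {
                "manifest_path": "target/manifest.json",  # hypothetical artifact paths
                "catalog_path": "target/catalog.json",
                "target_platform": "snowflake",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # hypothetical DataHub endpoint
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```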
Starting point is 00:21:33 to do the governance of the data, so add additional things like data domains, or identify fields that have PII data, etc. So it's the best of both worlds. Okay. Okay. So you mentioned Airflow earlier on. And when I heard that, I kind of thought to myself, 2016 called and wanted its orchestrator back at that point, because that's obviously, to some extent, a fairly mature technology. And actually, we tend to use Dagster a lot on projects
Starting point is 00:21:59 in our consultancy. So why the choice of Airflow? And any thoughts on, I suppose, Airflow versus Dagster, Prefect, you know, those kind of tools really? Yeah, so actually Dagster is something that we've been interested in since the very first MVPs of Datacoves. When we were actually creating MVPs of Datacoves back in 2021, we actually had Dagster. And, you know, the community wasn't quite there at the time, it was still, you know, a relatively young project; that has all changed since that time. But what we found was that, or the decision at the
Starting point is 00:22:40 time was, let's go with something that has been around forever. So whenever you have a question, you can easily find answers in tools like Stack Overflow. And what we found was actually that in many organizations, Airflow is the de facto orchestration tool. So many, many people already know Airflow. You'll find people with the right skills. And so this is a challenge for any new tool, including Datacoves, that when you have an incumbent, it's very hard to get through the noise. And so what we have found is that in a large organization, they don't even know what Dagster is. That's the reality. And so our approach has always been to just, you know,
Starting point is 00:23:28 when there's enough demand, then we will add it. But until that point, it doesn't make sense. Now, that being said, one of the other features that we have in Datacoves is we can actually run web servers. And so, for example, I can run Streamlit in my VS Code that's running in the browser. So a few weeks back, I ran Dagster just to educate myself on the tool and the recent developments. I ran a Dagster development environment right within Datacoves. So we can do that, but it isn't the same as a managed Airflow.
Starting point is 00:24:06 We don't have that yet, but it is something that we could definitely add in the future. Okay. So I'm curious now, who do you see as your target customer? If you have a sales inquiry come through, how do you qualify them? How do you work out if they're the kind of customer that gets value from what you do? What's your thoughts on that? It's really interesting because we initially thought that the type of customer that would work for us would be a smaller company, a shorter sales cycle, etc. But what we found is that those types of companies don't always have the
Starting point is 00:24:47 challenges that Datacoves is solving for. So for example, I mentioned how if you're running Fivetran and then you need to run dbt, then you could do it today. You don't need Datacoves for that. And so when does somebody need something like Airflow? Well, when you have multiple tools loading data. And so it isn't until you reach that level of maturity that you need that. Or I mentioned how with VS Code, you can install any Python library. Well, that requires that you have a need to install these Python libraries and extensions. And so there's a certain level of maturity for an organization to see the need for Datacoves. So today we have customers that are small, medium, and large. What's common across all of those isn't the size of the organization, it's the maturity, or the desire to get to that maturity, that makes sense. So we may have a three-person
Starting point is 00:25:46 team, but they understand the value of CI/CD and a connected pipeline and using different tools for loading data. So I tell people to focus not on the tool, but on the pattern. So you have data that's moving from S3 to Snowflake, or a database to Snowflake, or an API to Snowflake. Each of those patterns may require a different tool. And so that's when they start becoming a better fit for us. Or they may have machine learning use cases that they want to build directly in VS Code because it's just Python, et cetera. And so those are the kinds of users that tend to be a better fit than just a small company that doesn't have a lot of complexity, has just a couple of users, et cetera. Yeah. Okay. You mentioned tools there. You
Starting point is 00:26:36 mentioned dlt, right? So that's something I've increasingly, a tool I've increasingly been hearing about recently. So what is dlt? I know it's not your product, but what is dlt, and what problems does it solve, and what's it a substitute for and so on? So dlt is another data loading or data ingestion tool. I don't know exactly how long it's been around, it's only been a few years. And, let me explain it another way. When you're loading data, you either go for a very simple tool like Fivetran or Airbyte. They're very simple. They have pre-configured connectors. And so you're just giving it credentials, you give it a destination, and you're ready to go. What organizations find over time is that maybe the tool doesn't have the connector that they need,
Starting point is 00:27:33 or it doesn't work as expected, scalability, et cetera. And so they go to the other extreme. They'll say, well, I will build my own framework, and many companies have their own data ingestion framework written in Python, but it's something that they maintain and have to keep up to date, et cetera. I see dlt as the sweet spot in between those two extremes. What dlt allows people to do is really focus on the extraction of the data. Where's the source of the data coming from? Where is it going to land?
Starting point is 00:28:07 But the framework itself, it's handling things like creating the destination table for you. So you don't have to write any DDL. It manages the schema evolution. It manages the change data capture. So the framework itself is simplifying a lot of that stuff so that you don't have to build that in addition to figuring out how you're going to extract the data, et cetera. And then what's happened is over time, they've created very generic connectors. So let's say I need to move data from a database.
Starting point is 00:28:40 Well, anything that you can connect to using SQLAlchemy, the Python library, you can use as a source in dlt. They have a REST API pattern, a connector. And so you're not having to build everything and figure out rate limiting and all of this kind of stuff. So they're simplifying a lot of stuff, but you're still writing code. At the end of the day, it is some Python code that you're writing. They've just decreased it significantly. Okay, interesting.
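As a small, hedged illustration of the kind of code that is left to write, here is a minimal dlt pipeline; the resource, pipeline name and dataset name are made-up examples, and a real source would page through a REST API or query a database rather than yield hard-coded rows.

```python
import dlt


@dlt.resource(name="orders", write_disposition="merge", primary_key="id")
def orders():
    # Placeholder rows; in practice this would pull from an API or database cursor.
    yield [
        {"id": 1, "status": "shipped", "amount": 120.0},
        {"id": 2, "status": "pending", "amount": 45.5},
    ]


# dlt creates the destination tables, evolves the schema, and handles the merge logic,
# so the code stays focused on where the data comes from and where it lands.
pipeline = dlt.pipeline(
    pipeline_name="crm_to_snowflake",
    destination="snowflake",
    dataset_name="raw_crm",
)

load_info = pipeline.run(orders())
print(load_info)
```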
Starting point is 00:29:11 So that's interesting. So what about, so I met you at the Cube event, the Cube sort of happy hour. And I suppose semantic layers, they've always been, I suppose, the next big thing. They've always been sort of seen as important. What's your view on that market at the moment, where it's going, and how does Datacoves relate to that as well? Yeah, so I agree, I think that there's definitely a place for the semantic layer, and we saw tools like Looker really innovate in that space. Over time, that switched to something that was decoupled from the
Starting point is 00:29:47 BI layer. So that's where you had Cube come in. So the headless BI. And then you have newer tools like the dbt Cloud semantic layer. There are others out there, like Honeydew and AtScale. There's multiple of these things. And that's both good and bad. The good is that there is more emphasis in this space. The bad is that it's still in flux, in my opinion, that there's still things that are being figured out, things that are maturing. So there's differences in how you write these semantic models between the different tools. There are tools that are still coupled to the BI layer. And so our approach has been to wait and see to some extent. And what I tell people is, when you're starting out, the first problem that you need to solve for
Starting point is 00:30:45 is not the semantic layer. So I would say, figure out all the foundational things. There's a lot of stuff to be done. Let this stuff settle down a little bit and then make a decision. That being said, if it is something that you need,
Starting point is 00:30:58 we usually tell people, look at very mature tools like Cube. So Cube is a great tool. They're one of the more mature partners in this space. And they have things that other tools do not have. So the way that they handle caching, the way that they handle security, and the tools that they integrate with is a
Starting point is 00:31:20 lot more robust than most other people out there. So what we tell people is, we don't have it directly in Datacoves, but if you need a tool, look at these other options. Honeydew, which I also mentioned, they're Snowflake-specific, but it might work. And so depending on when you need to implement it, I would say consider these options. And our approach in Datacoves has always been to find the best tool for the given task, but don't lock yourself into that one tool forever, because there may be another tool in the future. Like I said, we had Airbyte in the beginning, then dlt comes along and it's doing a good job. So you choose. You can use both.
Starting point is 00:32:05 You can use either one. There's no lock-in that you must use A or B. Okay. So I've got a few questions now, I suppose, about the commercials, really. So around how you price yourselves, how people are onboarded, and what does a typical kind of deployment and implementation look like? And also, from a kind of self-interest perspective, is this something that replaces consultants, or is it something that is taking
Starting point is 00:32:27 away a task that is not really value-adding? But let's start first of all with how do you make money? How does the product get charged for? And do you see yourself as being a premium product or a sort of cheap product? Yeah, definitely not a cheap product. I always feel that in order to offer a good quality product, you have to stand behind it with support and the type of developers that you hire, etc. So it isn't about just getting a bunch of features out there, but making sure that people are well supported and that we can be
Starting point is 00:33:11 very responsive. So I would say, you know, we're somewhere in the middle. We're not the cheapest game in town, but we're also not the most expensive. How we make money: we have two ways of deploying the product. So we have a SaaS version, just like dbt Cloud, you come in and in 10 minutes you're ready to go. And when we're talking about SaaS, we have three components to the pricing. There's the developer seat. So how many developers do you have? How many people need to access that Visual Studio Code IDE? The second component is those other tools that we have, whether it's Airbyte or Superset or DataHub or Airflow, which of those tools do you need?
Starting point is 00:33:56 And how many instances of those tools do you need? So for example, Airflow is one where you may have like a development instance and a production instance. And then the last piece of that is these tools are running all the time. So they may have a web server behind them, a database, et cetera. So there's a fixed cost to running these tools, but then there's also a variable cost. So I may have a customer that runs an Airflow job that takes 10 minutes and another one that runs an Airflow job that
Starting point is 00:34:25 takes an hour. So we have worker minutes. How long does the job actually take to run? And so that's how the pricing is. It's the developer seat, the tools, and the worker minutes. As far as when we go to an on-prem deployment, so a private cloud deployment. And by the way, in SaaS, you could get Datacoves for a month and cancel after a month, et cetera. There's no lock-in period.
Starting point is 00:34:53 I mean, we may have better offers if you do lock in for a year or two, but that's month to month. Essentially, when we do a private deployment, we have a minimum of one month, one year, sorry. So yearly licenses. And the pricing structure is very different because at that point,
Starting point is 00:35:11 they're paying for the compute, right? It's in their cloud account. And so, you know, we would figure out something that works, but it tends to scale very well with the larger implementations. We'll sell a 200-user license,
Starting point is 00:35:24 a 150-user license, a 100-user license, and so it scales very well at that point for them. Okay, so let's say, for example, how much can you customize the components that are deployed for a customer? And do you ever integrate the kind of cloud versions of those, the paid versions, or is a sale that's made through you a sale lost to those partners? Is that of interest?
Starting point is 00:35:54 How does that work? It depends what it is. So I would say that the core that everybody uses is VS Code and the Airflow piece. Everybody needs that. We're just using the core and the open-source versions of these things. The money's
Starting point is 00:36:14 coming to us and we are deploying it and supporting it, etc. There are other tools that I would say it's not the minimum but it is what people tend to choose. There are some people that may use Airbyte, for example. Again, we have an agreement with Airbyte.
Starting point is 00:36:37 And so we're deploying and maintaining and upgrading, et cetera, Airbyte. So we take care of all of that. But it doesn't preclude anyone from using the paid version. So you could use Airbyte Cloud with Datacoves, just like we have people that use Fivetran. So we don't support Fivetran. That's an independent service. So they can do that.
Starting point is 00:36:58 Like I mentioned, Cube is another example of something that we don't even, that one, we don't have it integrated at all. So they could use Cube Cloud as an example there. And even Dagster, I mean, you mentioned how you prefer Dagster. There's nothing precluding you from using Datacoves only for the IDE and then using Dagster Cloud for the orchestration.
Starting point is 00:37:23 That's perfectly fine. Okay. Okay. Okay. So as a business then, so are you VC funded? Are you bootstrapped? What's your philosophy around the business and growth and how that is driving decisions you make really around that business? Yes.
Starting point is 00:37:38 So when we started out, we consciously wanted to do this as a bootstrapped business. We were doing some consulting already and we focused early on on profitability over growth. And that came from a philosophy of, like, I followed many people in this space, people like the people who created Ruby on Rails, the web framework, and other friends that I have in this space.
Starting point is 00:38:09 And sort of like the idea that we don't need to be, I don't know if you know the U.S. chain, I don't know what they call it, but it's a casual Italian restaurant called the Olive Garden. Oh, yeah, definitely. So we're like, we don't need to be the Olive Garden. We want to be a good Italian restaurant, right? So for us, it's perfectly fine to be a smaller profitable business. And so that's always been our philosophy.
Starting point is 00:38:40 Our idea wasn't to just say, we want to sell something and good luck. We wanted to make sure we sold something and that people were seeing results. Because if they're seeing results, if they're having success with dbt and all of these tools, they'll renew with us. And so our goal is sustainability, keep things going, make sure that people are happy and keep renewing with us. And so far, we've been very successful with that. And so there is no VC in our future. The only money we've ever taken was from a startup accelerator. And that was more for the mentorship and knowledge sharing, because for me, you know, sales and marketing were not my strong suits. I didn't come from that background.
Starting point is 00:39:25 So having other people to bounce ideas off of, et cetera, that's why we went in that direction. So Noel, you said you had a background in enterprise, sort of like in organizations and deployments and so on. So maybe stepping slightly away from Datacoves, but I imagine this is probably through your involvement with that. If you're going into a large organization and you're helping them sort out the things that Datacoves addresses, what are the kind of lessons you might have learned, or
Starting point is 00:39:51 what are some of the observations or, I suppose, you know, keys to success, really, in kind of getting involved in a data project in a large organization, about the sort of things that you do? Yeah. So I have been involved, at the enterprise that I worked with, in the digital transformation. So that is implementing Databricks, implementing Airflow, implementing our own framework. So this is pre-dbt. We also, you know, were moving out of legacy tools. And so I had been down that road and seen that journey firsthand. And one of the things that I initially didn't have an appreciation for, but I do now, and I
Starting point is 00:40:33 really tell people to focus on, is actually the change management of all of this. So the tool is only a tool; you're only going to be as successful as how you change your processes. So I will see organizations say they want to use dbt because, you know, at a high level, it seems like it's going to solve a lot of problems, but if you don't change how you're doing things, it's not going to necessarily make you more agile.
Starting point is 00:41:03 It's not going to help you have better documentation or better data quality, because it requires you to change how these things are being done. So you mentioned earlier, like, is a tool like Datacoves taking work away from you? I would say not necessarily. Datacoves is helping you focus on those things. So we're not doing the consulting to help people through that change management and establish those best practices, etc. We give you some guidance, but there's somebody that needs to be holding their hand. And so the less time that you're focusing on the platforming, the more time you can spend doing that kind of stuff. And that's really where the value is. And I find that that is the most challenging part in any organization: it is really that inertia to change, to really fundamentally change how you're doing things, because that's what's required.
Starting point is 00:41:56 It isn't just putting in Snowflake, putting in dbt, and everything's going to get better. Okay, it must be challenging for you though, because a lot of the success of your product is based on things outside your control, really. How do you as a vendor influence that, or try and ensure the success of your product in those kind of environments? Yeah, I think you're kind of hitting on something there; it is definitely a challenge. And, you know, when I talked to people and we were
Starting point is 00:42:27 talking about who is that ideal customer, it is the customer who realizes that they need to change other pieces. So it isn't just, I'm going to put in this tool and, you know, it's a silver bullet that's going to solve everything. We really are most successful when we partner with people who realize that there is more to be done. And people who don't value that may not be the ideal customer. Or sometimes it's a combination. It's like, maybe if they realize where they want to be in two years, then we can help them take the necessary steps to get there. They may not be doing it all at one shot, but they understand the journey. And so you do need those people who are going to advocate for those, you know, bigger fundamental changes in how they do things.
Starting point is 00:43:17 Like I said, you may want better data quality, but who's defining which fields are important and what are the rules that need to do that? That's still a human in the loop. The same thing happens with data contracts or any of these kinds of things. When I worked in enterprise, I did data quality reports across the organization, different functions, different regions, et cetera. And it wasn't detecting the problem that was the issue. It's actually getting it fixed
Starting point is 00:43:49 and preventing the problem from happening. So if you have master data that you're collecting and the tool where you're collecting that master data does no validation, you're going to keep seeing the same error happening over and over and over again. So yes, you have a nice detection thing, but my background is in industrial engineering, and part of that is saying you
Starting point is 00:44:11 need to prevent the error, right? What do you put, what fixture, what validations do you put in that source system, maybe it's your CRM or something else, to make sure that you prevent the error from occurring in the first place? And that's what's really hard. Okay, so to kind of wrap things up a little bit, we're getting towards the end of the year, perhaps it is the end of the year now, and so it's sort of tradition to think forward to the next 12 months. What are you going to see in the industry, and what do you kind of hope to see, what do you think you'll see, and so on? So, corny question, but where do you see the modern data stack going in the next 12 months, and your part of the industry in general? I think, you know, all the focus, all the buzz, all the investment is happening in Gen AI.
Starting point is 00:44:57 That's where all the excitement is. But I think anyone who's been around this industry long enough understands that there's cycles, that there's a lot of focus on something; you know, when I was going through this transformation in enterprise years ago, the focus was on machine learning: oh, we need data science, and all of this kind of stuff, advanced analytics. But it all keeps coming back down to the fundamentals. And so how good is Gen AI if you don't have your data pipelines in order, and all of these fundamental, foundational things? And so, I don't know if it's going to happen, but my desire is that people understand that
Starting point is 00:45:40 there's value in this stuff. I was talking to someone earlier today, and I was saying, imagine making a neighborhood. You have a new development. You have a lot of land, and you're going to set up all these houses and all of that stuff. Imagine doing that without having the utilities figured out. How are you going to get electricity, internet, gas, wastewater removal? If you don't figure that stuff out, you may have the prettiest houses in the world, but they're not going to be so good. And so that city planning, these foundational things that we take for granted, we go to a wall and we hit a switch and light turns on. Nobody thinks about that until it doesn't work, right? When it doesn't work, now we're like, oh, wow, what happened to the electricity? And we need data to work that way. We need data to just flow, to be sort of the second thought.
Starting point is 00:46:30 And so if I do that well, I know that when I go to use that data, it's in good shape, it's good quality, I can trust it. Now I can focus on those shiny objects. I can focus on the Gen AI, I can focus on the machine learning, without having to keep going back to the well to get my water every time. Yeah, yeah, okay. And hearing what you're saying there, how is that informing what you're doing with the roadmap for Datacoves? I mean, what are you looking to get in the product over the next kind of year, and how is what you're saying informing the decisions you make over investment and direction and so on?
Starting point is 00:47:06 Yeah, so what we are investing in, and have always invested in, is how do we make these things simpler to use, and how do we put it into the context of how the user uses it? So I'll give you an example. We are working on integrating Gen AI into the product. But how do we do it in a way that it's giving a developer a superpower, and it isn't removing the developer from the loop?
Starting point is 00:47:31 So there's some people that feel like, I will just hit a button and magic is just going to happen. I remember years ago, I think Amazon had something called, like, zero ETL. Like, what do you mean, zero ETL? Right? It was a lot of marketing hype, there was a lot of stuff there.
Starting point is 00:47:45 Okay, I don't know where that is today, but you still need a human in the loop. And so what we want to make sure is we bring these tools in in a way that is just simplifying things for the user, but not removing the user. It isn't no-code, it isn't about drag and drop and magic is going to happen. It's about making people more productive. And the other part of that is, how do we do it in such a way that it is configurable, that it is enterprise grade? So this may mean not using the public OpenAI endpoint. You may want to use Azure OpenAI or Claude or your own custom model running in Snowflake. That's our focus: to get these tools to work in a way that is more configurable and adapts to the needs of a large enterprise. Fantastic.
Starting point is 00:48:47 So Noel, how do people find out more about Datacoves, and how do they maybe get a trial, or speak to you to find out whether it's the right tool for them? Yeah, so you can just visit datacoves.com, and there's a contact form right there. We will, you know, assess whether you're a right fit. And by the way, we talk to people sometimes and say, yeah, maybe Datacoves isn't right for you, or not at this time. And there's nothing wrong with that.
Starting point is 00:49:11 I think that we want to make sure that people are successful regardless of what tool they end up choosing. We do have trials. So we will make sure that we help people set all these pieces up and kind of hold their hand and make sure that things are working as expected. And we'll even go to the point where we will, in a large organization, because there's so many components, we will do an MVP showing them everything. Everything from how you handle security in Snowflake and all of that. Sometimes we'll outsource it to companies like you where you'd handle all of that,
Starting point is 00:49:46 but the customer's hand is held essentially to see the potential of all of these pieces coming together. Fantastic. Okay, well, fantastic. It was great speaking to you, Noel. Thank you very much for coming on the show. Best of luck for the future. And yeah, very interesting conversation.
Starting point is 00:50:01 Thanks for inviting me and I appreciate the time. Thank you. And happy Christmas and New Year as well for later in the month. Thank you.
