Orchestrate all the Things - More than words: Shedding light on the data terminology mess. Featuring Soda Founders Maarten Masschelein and Tom Baeyens
Episode Date: June 22, 2021. It's a data terminology mess out there. Let's try and untangle it, because there's more to words than lingo. Hopefully technology investment decisions in your organization are made based on more than hype. But as technology is evolving faster than ever, it's hard to keep up with all the terminology that describes it. Some people see terminology as an obfuscation layer meant to glorify the ones who come up with it, hype products, and make people who throw terms around appear smart. There may be some truth in this, but that does not mean terminology is useless. Terminology is there to address a real need, which is to describe emerging concepts in a fast moving domain. Ideally, a shared vocabulary should facilitate understanding of different concepts, market segments, and products. Case in point - data and metadata management. Have you heard the terms data management, data observability, data fabric, data mesh, DataOps, MLOps and AIOps before? Do you know what each of them means, exactly, and how they are all related? Here's your chance to find out, getting definitions right from the source - seasoned experts working in the field. Article published on ZDNet
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
It's a data terminology mess out there.
Let's try and untangle it because there's more to words than lingo.
Hopefully, technology investment decisions in your organization
are made based on more than hype.
But as technology is evolving faster than ever,
it's hard to keep up with all the terminology
that describes it.
Some people see terminology as an obfuscation layer meant to glorify the ones who come up
with it, hype products, and make people who throw terms around appear smart.
There may be some truth in this, but that does not mean terminology is useless.
Terminology is there to address a real need, which is to describe emerging concepts
in a fast-moving domain. Ideally, a shared vocabulary should facilitate understanding
of different concepts, market segments, and products. Case in point, data and metadata
management. Have you heard the terms data management, data observability, data fabric,
data mesh, DataOps, MLOps, and AIOps before?
Do you know what each of them means exactly and how they're all related?
Here's your chance to find out, getting definitions right from the source, seasoned experts working in the field.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
So I'm Maarten, one of the co-founders and CEO.
I've been in the data space for a good 10 years now, so data management.
For the earlier part of that, I was an early employee, employee number five, at a company called Collibra, who were ultimately the first selling, or positioning, software to chief data officers, before the CDO role really came into existence. I think back when they started, around 2008 or 2009, you had the first CDO at Yahoo. But then it gradually built out: data governance became a thing, metadata management became much more important as companies were doing more with data.
So I enjoyed that journey for six years
and then branched off, co-founded with Tom.
Tom and I, ultimately, didn't really know each other that well prior to starting, but we did have a first conversation a couple of years ago, actually before we started,
because I was on a train together with a colleague of mine to London from Brussels.
So we're in the Eurostar.
And the colleague of mine says, I know this guy.
I know that this is Tom, from the workflow engine that we use at Collibra.
Collibra has a core capability around data collaboration,
which is workflow-engine driven.
jBPM was kind of at the foundation of that.
And that's Tom's open source project.
So my colleague knew him.
We started chatting and ultimately that's where we first met
and the conversation got started.
It's a good segue to make the transition.
I got started in open source workflow engines.
I created two subsequent workflow engines in the developer ecosystem, with significant communities, and the second one was the biggest one. That's indeed the one that got adopted by a lot of companies like Collibra. There are now a handful of companies working on the actual technology itself, and more like hundreds or more baking it into their products. That's where I really saw that the developer ecosystem, open source, is quite a different beast from normal, traditional development.
And I really like that.
It's like an awesome environment to be in, to work in.
After those endeavors, I also did a SaaS startup.
And it's after that SaaS startup, which was in the BPM space as well, workflow and collaboration, but more the simplified version, with a UI towards the business, rather than the enterprise version.
And so that's kind of how I segued into data when I met Maarten,
because open source and workflow are key components
in the product vision of Soda as well.
So that's where the connection comes from.
We have, in the meantime, established an open source trajectory, launched it earlier this year.
And workflow and collaboration is key in our platform as well.
So that's the link and where we met up.
That's how we got started.
Okay, interesting.
So if I got it right, Tom, you used to work around jBPM.
Is that correct?
Yeah, I'm the original founder of jBPM.
Okay.
And of Activiti later on.
Yeah, I mean, like many others, I guess.
I have used jBPM at some point.
And yeah, I even met some people in Berlin.
Unfortunately, I forget the name of the company now,
but since you're into BPM...
JBoss, Red Hat, maybe?
No, they started a new...
Ah, Camunda.
Yes, Camunda.
Yeah.
I met them at some point.
And yeah, we actually discussed quite a bit about, you know,
the inner workings of business process management and workflow engines
and open source and all of that.
That's a great crew.
Each time when I go to Berlin, I try to meet up with them.
Some nice friends there.
Yeah, cool.
Okay.
So, yeah, thanks.
Thank you both for the introduction to yourselves and how you met up. And I should also mention that it's kind of a nice coincidence, nice timing, because James, who organized the discussion, was kind enough to let me know that you just got an award as a Cool Vendor from Gartner.
And I was looking at the brief in which they described what this award is about.
And they also mentioned a few words about the vendors that got the award.
And what struck me about it was that there seems to be quite a wealth of terminology thrown around.
So you have data management and data observability
and data fabric and data mesh and data ops as well
and AI ops and ML ops and all of those things.
And well, even though it's mostly analyst lingo in a way,
and I count myself as an analyst as well.
Many people say, okay, so this is analyst speak, you know, to, I don't know, to confuse
people or to invent product categories or whatnot.
But I think there's actually, there should be at least some value in those terms.
Like people are basically trying to describe like emergent, let's say, emergent trends in the market, emergent products.
And so they kind of have to come up with lots of terms, I guess.
But I think it's also interesting if we try to clarify,
let's say, a little bit around those.
And in the process of doing so,
I think it will be good for people who may be listening as well,
because then you will
also be able to position yourselves around those terms and how do you self-identify,
let's say, as a company.
Yes.
Tom, if that's okay, I'll give this a first stab and then you can fill in some of the
blanks, the things that I've missed.
Let's see where we start. I think it's best that we start with both concepts
of data mesh and data fabric.
So data mesh and data fabric, ultimately,
it's all about kind of a framework
for organizing around data for scale, ultimately.
It's kind of where fabric focuses more
on this kind of concept in the technology
domain of like the data platform, one unified way for us to access data using shared services,
kind of abstracting away all of the underlying complexities of technology, whether data resides somewhere in a legacy Teradata or what have you, or more in the cloud in Snowflake.
Data Fabric is all about technology
and understanding the organizational setup for speed and growth
and doing more things with data.
Data Mesh is a similar concept,
however, subtly different
in the sense that it focuses more on the organizational
aspects of it.
It focuses more, and I would almost call it like the modernized version of data governance
principles that are applicable for broader data teams to kind of structure and organize
in a good way, removing some of the bottlenecks of the past.
The bottlenecks of the past were predominantly around, for example, your data warehouse team
that was extremely kind of like a funnel that you had to go through, which was not scalable.
So with data mesh, it's fundamentally about building data products and data services.
So it's data product thinking: instead of, like in data governance, where we talk about managing data as an asset,
In the data mesh concept,
we talk about managing data as a product,
which is more specific ultimately.
And it's this notion that, yeah,
we should have kind of core platform services.
But then on top of that,
we need to structure ourselves around data domains,
areas of certain like business expertise and knowledge
and enabling them to be self-serve.
I think that's also the key.
In the past, we had a lot of bottlenecks.
Again, now we try to make data technology much more self-serve, and introduce concepts in the data mesh, like data ownership, much more clearly, and kind of what your expectations are. One of the roles, or kind of the responsibilities, of people who own data is also to manage the quality of it. And that's maybe a nice segue into data management. Data management is a term that has already existed for multiple decades, originally very heavily described by the Data Management Association, where they did a lot of work
around how we should ultimately manage data. A part of that was metadata management, which spun out into data cataloging software, which in turn spun out data lineage capabilities. Another part of it was data quality management, and that is kind of where you can place the terms that are more associated with data monitoring, data observability and data testing, as specialized areas underneath quality management within the broader framework of data management. So that's kind of my first take on that. So with data management,
I talk more about capabilities. And with data fabric and data mesh, we talk more about like
organization. And then the last one is data ops. Well, that's more about process.
Where in data ops, it is about,
now that we have these capabilities,
now that we've organized,
what are best practice processes
for us to deliver data products at an increasing velocity
with an increased reliability as we do that.
And this goes really into the nitty-gritty sometimes.
Like, for example, today,
when an analytics team delivers new data pipeline code, like new data transformation code, let's say in dbt, they want to have processes in place, checks and balances: for example, the engineering manager reviews the PR, the pull request with the new code, and can see, okay, is this going to have a structural impact on the data set, and can we rely on this new code base or this new code change. That's an example of a process, and it's all such small processes that need to be put in place and standardized for us to work much better with data, similar to what we've done with DevOps in software engineering.
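To make that kind of check concrete, here is a minimal, hypothetical sketch of a schema guard that could run in CI on a pull request: it compares the columns a transformation produces against an expected contract and fails the build on a structural change. The table, column names and file layout are illustrative assumptions, not Soda's actual tooling.

```python
# Hypothetical CI-time schema guard; table, column names and types are
# illustrative, not tied to any specific product or warehouse.
import sys

def check_schema(actual: dict, expected: dict) -> list:
    """Return human-readable findings describing structural differences."""
    findings = []
    for column, col_type in expected.items():
        if column not in actual:
            findings.append(f"missing column: {column}")
        elif actual[column] != col_type:
            findings.append(f"type change on {column}: {col_type} -> {actual[column]}")
    for column in actual:
        if column not in expected:
            findings.append(f"new column not in contract: {column}")
    return findings

if __name__ == "__main__":
    # In practice the expected schema would live as a contract file in the repo,
    # and the actual schema would be read from the warehouse or from dbt artifacts.
    expected = {"order_id": "int", "amount": "float", "currency": "text"}
    actual = {"order_id": "int", "amount": "text"}  # type change plus a dropped column
    problems = check_schema(actual, expected)
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # fail the CI job so the reviewer sees the structural impact
```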
So it was quite a lengthy one, but there are a lot of terms to go into.
I don't know if that makes sense, raises questions, or Tom, if you have things that you'd love
to add there.
No, I just wanted to add a bit of context
around that data observability space.
I think for those that are not familiar with the space,
I think the way you can understand this
is that there's like engineers building data pipelines.
They're preparing the data to be used in data products.
Data products, for example, are machine learning models.
So anything, any algorithm that's using data on a continuous basis
or where the data then automatically gets updated from source system,
transformed over data pipelines,
and then being prepared for usage in the algorithms or use case.
If you look at that landscape, there are a bunch of engineers developing new products regularly. Once those products get into production, that's the context, that's where the observability starts.
That's where the data could actually go bad
and the software algorithms using the data,
they keep on working, they don't notice that the data is bad.
And this leads to all sorts of very costly dangers, or costly consequences, that you want to protect yourself against. You don't want your clients to find out that your website is showing wrong information. You don't want your hotel room price calculation algorithm, for instance, to use wrong data, because
then your revenue is directly impacted. So checking and continuously monitoring that data as your
use cases and your data products are in operations, that's where you need observability.
That's what you want to protect yourselves against.
Yeah, yeah, great.
Thank you.
Thank you both.
And I think that it's just, indeed,
these are lots of terms and it makes sense that, well,
it takes some time to go through all of them.
And actually, I think you did a very good job of,
both of you, of kind of describing and aligning them, let's say.
There's just a couple of terms that we left out, MLOps and AIOps, which, in my understanding at least, I would say are kind of a specialization of DataOps. And to my mind they're pretty much the same, even though, you know, in theory, machine learning
is definitely not the same as AI.
I think in practice, those terms, at least in the MLOps versus AIOps context, I think
they're probably used interchangeably.
I don't know what your view is on that.
I think, yes.
So there's, I think they rely on each other. I think MLOps relies on a good DataOps foundation ultimately,
but it's more specialized.
Like in DataOps, we won't be monitoring our prediction accuracy,
for example. That's specific to the data product.
And that's also specific to the lifecycle of the data product.
So I like to think of it more from like a lifecycle perspective.
And then for me, those are two separate things because the lifecycle of a data set is not
directly tightly coupled to the lifecycle of a machine learning or a data product ultimately.
So there are also different people doing that. When it comes to managing data and data ops,
we have data producers, which can be external to the organization.
You could have a Bloomberg feed.
You can have all sorts of data coming in that you either buy or collect.
You have internally generated data.
So there's much more of an organization around it,
typically also in the business that takes ownership of it. So I would see it as a separate thing, however,
with quite some commonalities. Another way of looking at it is kind of looking at the
tooling landscape. And if you look at the monitoring and observability software in the entire stack, and with the stack, I mean, like infrastructure at the bottom of it, then our applications that we write.
And then nowadays, these applications, we use data and machine learning as two kind of new layers.
And in those two layers, we're just getting started with software and platforms to help you monitor
that. That's relatively new, whereas the other ones have existed for much longer. So I think there are a lot of analogies across those layers of the stack, but there are some intricacies about each one of them.
Yeah, I would actually consider data observability and checking your data more of a fundamental layer, on top of which you have MLOps and AIOps, in the sense that MLOps has specific workflows
around how you deal with these machine learning models. You have to figure out, like, if the data changes, what the impact is on the actual result.
Or if the model changes, what's the impact?
And then this versioning, throughout the versioning,
you can sometimes trace back,
like where was the problem originally?
So those are specific to machine learning. And similarly for AIOps, there are different flavors of the workflows on top,
but fundamentally underneath all of the use cases with data,
they need correct data to start with.
So in that sense, we're more like the base layer on top of which
the more specific workflows are being developed.
Yeah, thank you.
And yeah, I think it's a task of quite amazing complexity, actually. I mean, just managing DevOps, before we even start talking about DataOps and all of that, and on top of that models and versioning, it just kind of explodes.
So that kind of goes to show the need for solutions like what you're building, I guess.
It's a nice segue to actually talk about what you're building.
So it seems like the message that you put across is focused around
four areas. So monitoring, testing, data fitness and collaboration. And I wonder if you'd like to
just say a few words about the platform in general and more specifically those four areas and what
you offer in each of those. No, totally.
I'll try to keep that kind of short and crisp.
Ultimately, the first capability is really like a capability for the data platform team.
And it's all about automatically monitoring data sets
in our environment for issues.
No configuration, ultimately.
So that means that we try to figure out if there's something abnormal about the data sets that land in your environment.
For example, how many records did you process this time around? Is that abnormal compared to the same day last week? Or, using some machine learning, comparing that while factoring in seasonality, to figure out if that's off or not.
Or things like when your data updates.
Data freshness is always a key consideration, a key problem that companies are looking to solve.
Because sometimes your data providers will change something that you didn't foresee,
and then all of a sudden data is stale or becomes stale,
and that has a direct downstream impact into your data products.
So it's about automating that as much as possible,
and no business logic needed really.
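As a rough illustration of the automated monitoring described here, the sketch below compares today's row count against the same weekday in previous weeks and flags stale data. The thresholds, history values and statistics are invented for the example and are not Soda's actual detection logic.

```python
# Minimal sketch of automated dataset monitoring: compare today's row count
# with the same weekday in previous weeks, and flag stale data.
# Thresholds, history and timestamps are invented for the example.
from datetime import datetime, timedelta
from statistics import mean, stdev

def row_count_anomalous(today, same_weekday_history, z=3.0):
    """Flag today's row count if it deviates strongly from the weekday's history."""
    if len(same_weekday_history) < 3:
        return False  # not enough history to judge
    mu, sigma = mean(same_weekday_history), stdev(same_weekday_history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z

def data_is_stale(last_loaded_at, max_age_hours=24):
    """Freshness check: has the dataset been updated recently enough?"""
    return datetime.utcnow() - last_loaded_at > timedelta(hours=max_age_hours)

history = [10_250, 9_980, 10_410, 10_120]   # row counts from previous Tuesdays
print(row_count_anomalous(4_300, history))  # True: volume dropped sharply
print(data_is_stale(datetime.utcnow() - timedelta(hours=30)))  # True: feed is stale
```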
And that covers a part of the discovery of data issues in your organization,
but it doesn't cover all of them.
It actually only covers a small percentage of the types of data issues that you can have.
So that's why data testing and data validation is kind of the next step.
This is where you enable both the data engineer on the one hand and the data subject matter expert, so a business counterpart, on the other hand to write more descriptive, declarative validations on data: things that need to hold each time new data arrives. Like, we can only have x percent of missing data in this column, for example, or this needs to be unique, or this needs referential integrity, or it needs to be within an allowable set of values.
For the engineer, because the pipeline might break.
For the analyst or the data subject matter experts, because their data products will potentially break or there's a business process that needs to be triggered.
So those fall squarely into the discovery of data issues, as kind of step one.
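To illustrate what such declarative validations amount to, here is a small, hypothetical sketch that expresses a few of the checks mentioned above (missing percentage, uniqueness, allowed values, referential integrity) as data and evaluates them against a batch of rows. The rule format and column names are invented for the example and are not Soda's syntax.

```python
# Hypothetical declarative checks evaluated against a batch of rows; the rule
# format and column names are illustrative, not any product's actual syntax.
rows = [
    {"id": 1, "country": "BE", "customer_id": 42},
    {"id": 2, "country": "XX", "customer_id": None},
    {"id": 2, "country": "NL", "customer_id": 7},
]
known_customer_ids = {7, 42}  # stand-in for the referenced table

def missing_pct(rows, column):
    return 100.0 * sum(r[column] is None for r in rows) / len(rows)

def is_unique(rows, column):
    values = [r[column] for r in rows if r[column] is not None]
    return len(values) == len(set(values))

def in_allowed_set(rows, column, allowed):
    return all(r[column] in allowed for r in rows if r[column] is not None)

def referential_integrity(rows, column, referenced):
    return all(r[column] in referenced for r in rows if r[column] is not None)

checks = [
    ("customer_id missing < 5%", missing_pct(rows, "customer_id") < 5),
    ("id is unique", is_unique(rows, "id")),
    ("country in allowed set", in_allowed_set(rows, "country", {"BE", "NL", "FR"})),
    ("customer_id references customers", referential_integrity(rows, "customer_id", known_customer_ids)),
]
for name, passed in checks:
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```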
But that's not where it ends because if you have a system
for discovery of data issues, it will create a lot of alerts.
But how do you handle the alerts?
What is the business process that you then go through?
And that is, I think, very key.
That's where we enable the data owners, for example.
And that's kind of the analysis and prioritize phase. And that's where we have things like data fitness dashboards,
which is more broadly about SLA tracking, about giving data owners a view of all the expectations
on data across the organization so they can improve the quality and know where to prioritize,
as well as kind of the workflow around the resolution of the problem,
all the way to creating tickets in Jira or ServiceNow
to then further handle the data incidents.
That's ultimately kind of the higher level of capabilities,
whereas collaboration fits in across all of these areas ultimately.
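For a sense of what SLA tracking over check results can look like, here is an illustrative sketch that aggregates pass rates per dataset and surfaces the ones falling below an agreed threshold; the datasets, results and targets are invented for the example and do not reflect Soda's actual dashboards.

```python
# Illustrative sketch of SLA tracking over check results: aggregate pass rates
# per dataset and surface the ones falling below their agreed threshold.
from collections import defaultdict

check_results = [
    ("orders",    True), ("orders",    True), ("orders",    False),
    ("customers", True), ("customers", True), ("customers", True),
]
sla_targets = {"orders": 0.95, "customers": 0.90}  # required pass rate per dataset

pass_counts, totals = defaultdict(int), defaultdict(int)
for dataset, passed in check_results:
    totals[dataset] += 1
    pass_counts[dataset] += passed

for dataset, target in sla_targets.items():
    rate = pass_counts[dataset] / totals[dataset]
    status = "OK" if rate >= target else "BELOW SLA -> open incident"
    print(f"{dataset}: {rate:.0%} of checks passing (target {target:.0%}) {status}")
```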
But predominantly, collaboration is also there in the analysis phase,
as you can easily bring people with different knowledge about the problem,
like the data engineer on the one hand, the analytics engineer,
as well as the business person.
They often have tacit knowledge that's not documented
that is needed to resolve the problem.
Maybe you can add a few notes and comments to it.
So, those are very distinct kinds of capabilities, but I'd like to also give an overview perspective of how we approach this space. So if you look at it from the engineering perspective,
you're mostly involved in, for instance, data testing.
You want to make sure that your pipeline runs smooth
and that your pipeline is stopped
if you detect bad data so that it doesn't flow downstream.
But there's a lot more people involved
into the whole data ecosystem in a company.
There's the analysts like yourself, like using the data, building interesting use cases and
products with it. There are the subject matter experts that have intimate domain knowledge of all the details: what a particular field looks like, what kind of data is normal and abnormal, and what the specialties of it are. And now, you can't really prevent data issues from happening.
So, there needs to be a process in place that actually monitors, finds those issues, and
then resolve them.
And if you then look at it from the head of analytics or the CDO, the chief data officer's perspective
at your organization, you are responsible for making sure that your issues are discovered
and resolved.
And of course, there's a bunch of steps in between, but that's the business process that
you're responsible for.
And you need to make sure that that flywheel is continuously running.
So in summary, that's what Soda focuses on: making sure that that flywheel of finding and resolving issues, which is an operational concern, is dealt with.
Okay.
Yeah, it seems like, you know, there's a kind of logical progression, at least among the three first areas.
So monitoring, testing and data fitness, especially in the way that you described them.
So it kind of all ends up, you know, simplistically,
very, very simplistically put, it kind of all ends up in a dashboard
in a way where the person responsible for overseeing the process,
let's say, can see how it's all going.
And then you have a cross-cutting concern, which is collaboration
that kind of
facilitates everything.
Yes, correct. And the one point I also wanted to add there is that this collaboration, and the workflows behind it, might look like a bit of overkill in smaller to medium-sized organizations, where you have smaller teams and usually there are one or two data engineers and they handle everything.
But I think you need to have the awareness that as your use cases for data grow towards the future, these roles will become more specialized, and then you need to have more of this collaboration in place and managed in order to keep it all running.
Otherwise, your data engineers will get overloaded.
Yeah, it makes sense.
And actually, I wanted to ask you a little bit more about some more details about two
of those areas.
So let's start with collaboration, actually, because you mentioned it as a kind of way to elicit implicit knowledge. I'm wondering if you have a specific way of doing that. I guess the obvious one would be logging all conversations and exchanges, but I wonder if you also do any analysis on those, and whether you have some results that you distill out of that.
Ultimately, with collaboration there are a couple of things unique about it. One of them is that it works roles-based. So for example, as a data engineer, I've written this data transformation code, or I ingest this data set. So I'm kind of attached to that with my role of data engineer. So when we have technical problems, that will be the person we default to, to involve and inform, as a way to also reduce noise and make alerting highly specific,
making them go to the right people.
But the same thing is when we have issues around,
like, for example, a lack of data or completeness of data,
validity of data that's downgrading,
that's something more of a concern
to the person who knows the data inside out, right?
So again, there we can work based on roles to route alerts to the right people.
You've also touched upon kind of the analysis phase.
So very often, like when there's issues or incidents, we triage them first.
We see which ones we're going to work on.
Maybe we group some incidents together because there's an underlying root cause. In that analysis, we leverage a multitude of tools, from data lineage capabilities to diagnostics data that we analyze. For example, we have a set of data passing a certain validation and a set of failing data, and if we analyze those and see what's different about them, we can already figure out that maybe it's this device type, because all the records in the failing sample pointed to one device type, being Android, for example. So all of that is in the broader space
of helping flow from prioritizing, analyzing to resolving,
pointing them to the right tools,
giving all of the users information
that can help in their decision process
and the ability to tag people,
a bit like in Google Docs, right?
Or you'll be tagging somebody in to help
out or to give some more insights into why something might have happened. That's kind of
how we envision that.
If I can add one more thing: collaboration doesn't always mean the
typical and trivial collaboration features like commenting and sending things to the right people.
Collaboration for us goes broader because we also want to make sure that this domain knowledge
of the subject matter experts is captured.
Because normally this domain knowledge that they have is bottlenecked by the availability
of the data engineers to actually implement the rules that they have, because this is not something that you can tackle with AI.
The domain knowledge needs to be captured in rules.
And so there we've invested a lot to make sure that analysts can actually do self-service
monitor creation so that they can actually, without the involvement of the data engineers,
manage this domain knowledge
themselves. And that way scale a lot more of these rules, because that used to be the problem
that the rules can't scale well enough. And there's two reasons for it. First of all,
now there's more data. That's one thing. But the other part is that it used to be a technical
solution. And there we now go to a self-service mode, where people can do this together, and therefore a lot more of that domain knowledge can be covered.
Okay, cool, thanks. So I guess that's also a good way to
talk a little bit about the underpinnings of what you've built.
And I was wondering what kind of technology you could have possibly used
to build the individual modules, let's say, functionality,
as well as to glue them all together.
But actually, I'm going to make a guess here.
And having heard from you, Tom, that you were deeply involved in workflow engines, I'm guessing that maybe that has something to do with it.
Yeah, definitely.
That's a good way of looking at the technical underpinnings, because we split that up. On one side there are all the developer tools that we have to make connectivity with the data.
We have SodaSQL, SodaSpark is almost there, and then we have SodaStreaming as well in the pipeline.
So these give like full coverage of the complete data landscape or the data stack,
which is important because in large organizations, your data is all,
you want to monitor the data, not only in your warehouse,
but also in these other places.
So you don't want to be stuck to only a warehouse, for instance.
So there we realized that connectivity, a lot of the times has to do with the data engineers in the team.
They want to work with configuration files, for instance, like YAML files.
They want to work on SQL level.
They want to check these things into their Git repository.
So we spent like a lot of time making sure that from the engineering perspective,
this is like a seamless thing.
This is something they love, rather than a tool they're forced to implement. So those are the underpinnings there. And then the connectivity
leads to a set of metrics being computed on a scheduled basis. And then each time when the
metrics are computed on a certain data set, they're sent to our cloud product. Our cloud product will actually collect these and
store them, so that you get a time series for each metric. That way we can see over the history how this metric behaves, and we can apply anomaly detection on it to check whether it is normal or abnormal. And it's also in that platform where we then allow the analysts to start building their own monitors on top of the metrics that come in as well.
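As a rough sketch of that flow, the example below appends each scan's metric value to a per-metric time series and runs a deliberately simple anomaly check over the history. A production system would model trend and seasonality; none of the names here reflect Soda's actual implementation.

```python
# Rough sketch of the described flow: each scan produces metric values, they
# are appended to a per-metric time series, and an anomaly check runs over the
# history. Storage and detection here are deliberately simplistic.
from collections import defaultdict
from statistics import median

class MetricStore:
    """Keeps a time series of values per (dataset, metric) key."""
    def __init__(self):
        self.series = defaultdict(list)

    def record(self, dataset, metric, value):
        """Append a new observation and report whether it looks anomalous."""
        history = self.series[(dataset, metric)]
        anomalous = self._is_anomalous(value, history)
        history.append(value)
        return anomalous

    @staticmethod
    def _is_anomalous(value, history, tolerance=0.5):
        # Flag values that deviate from the historical median by more than 50%;
        # a real system would account for trend and seasonality.
        if len(history) < 5:
            return False
        m = median(history)
        return m > 0 and abs(value - m) / m > tolerance

store = MetricStore()
for value in [1000, 1020, 990, 1015, 1005, 480]:   # last scan drops sharply
    flagged = store.record("orders", "row_count", value)
    print(value, "anomalous" if flagged else "ok")
```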
So that's in a nutshell.
That's also in the cloud, by the way; that's where the workflow engine is part of it as well, to drive the resolution.
That's like a rough outline.
I don't know if maybe Maarten has something to add to that.
You're on mute.
Yes.
Sentence of the year.
So, no, I think one way I sometimes like to look at it is to compare it with other areas in data management.
We've talked about data cataloging, for example, a while ago,
which is more focused on kind of the process
of finding data in an organization and sharing data definitions, sharing understanding, so you can more quickly access that data and start using it. That's kind of an adjacent process to ours, which is: of the data that you use, how do we make sure it remains fit for purpose?
So that's kind of where we focus.
And because in data, everything is connected, right? We have data sets that are at the source, and then we make copies, transformations,
and then it goes to maybe another part of the organization where they use it for another purpose.
And before you know it, you have a large graph of connected data sets.
And I think the data cataloging space is predominantly focused on graph-based systems,
finding connections between data, understanding where data comes from, how it transforms, et cetera. I think that's one component that is ultimately needed predominantly in the analysis phase
of a problem, a data incident.
For us, we focus much more on metrics, on like how you can compare metrics, how we can
find problems through metrics. So it's more of a time-series-based system. We lean much more toward that operational, day-to-day nature, and a bit less around the graph of things. So it's maybe another way of looking at the underlying technology
focus and technology choices. Okay, thank you. And yeah, interesting that
you mentioned graphs and such because that was also something I meant to ask
you about again in relation to your inclusion in Gartner's report and in
that report which also deals with data fabrics,
they have a kind of stack there,
which includes things such as data sources and metadata
and so on and so forth.
And they also include knowledge graphs,
which kind of struck me.
It makes sense.
And there's lots of products in the data cataloging,
mostly, space that are based on that.
And I was going to ask you how
much of this stack. Well, first of all, whether you think the stack makes sense, and then, second part, how much of this stack would you say that your own solution touches. So we have our data sources, then there's a layer of data and metadata. Then there's an augmented data catalog.
And then there's a knowledge graph with semantics.
So I think what they're really describing there, the bottom layer, is ultimately the data cataloging space.
Where the primary focus of what we do is we ingest metadata into a centralized system.
And we manage the lifecycle of that metadata.
And as part of that lifecycle, one of the things that we typically do is understand what data do we have in here.
Does this column connect to a high-level concept in the business? You can imagine that customer or customer address data
is stored in many, many different physical tables,
files, or what have you.
And that's ultimately the domain of kind of the catalog.
The knowledge graph and the semantics
are about representing the intricacies
about the business at hand,
the company that is building that knowledge graph,
to kind of ultimately better manage their data.
Because if you have a semantic layer,
you can start defining policies more on that level.
You can say, well, for all customer data,
we do X, Y, Z in terms of our data management process.
And for us, that is not a space that we're in.
We have an integration strategy there,
and we've already integrated successfully
and have that running at some customers
with some of the most commonly used data catalogs.
So we're ultimately, if they say augmented data catalog, well, you could say that
we're part of the augmentation. Because when somebody searches for data in their organization,
they can immediately see if that data set is properly maintained. If we've had issues in the
past, how quickly do we react to those issues? And that is really valuable information
as part of your kind of enterprise repository
of your knowledge graph, your data catalog,
your business metadata, ultimately.
So I think up to that level,
we are involved in that predominantly
from an augmenting the data catalogs perspective.
I think where you then go higher up in the stack,
I think they predominantly focus or see us within orchestration and data ops.
As for the metadata activation part, I have to be honest, I'd have to look up what exactly they mean with that.
But then, of course, in DataOps and your processes around how you manage data on a day-to-day basis, how we resolve and identify and prioritize issues, well, that's 100% also in our wheelhouse.
Yeah, thank you.
And yeah, to be honest with you, yeah, that's part of the reason why I asked, you know,
whether that layer diagram makes sense to you.
Because, again, making the connection to what we opened the discussion with about terminology, it's a bit dense, let's say, and the differences can be quite subtle.
Yeah, no, no, I agree. It's a bit of the Wild West out there when it comes to terminology today.
What we feel like is that ultimately data mesh is a very interesting concept
that we see a lot of discussion around because it's organizational.
It's a cultural and organizational change.
And that is, I think, a key one. And then observability as well,
because people who today work and rely on data
don't really always know what's going on.
They're not close to it.
And that sometimes causes a lot of sleepless nights
because you're automating with it
and you don't know what's going on with it always.
And those are two very clear pain points, if you kind of want to forget about everything else, right?
Those are two things that are super hot today and that a lot of companies are working on
thinking about and are actively implementing.
Okay. So I had in mind, as we're kind of wrapping up the discussion, to ask you about where you're headed next, basically, with your product development. And I wanted to
bring into that what you briefly touched upon earlier, so the open source aspect. So
looking a little bit around as I did some background research on
your product, I realized that you also have an open source layer. So I was wondering if it works,
it kind of looked to me like it works in a typical way for commercial open source as a sort of
onboarding layer, let's say that people can start using and experimenting with. And then as their needs grow, they can move on to the other offerings.
I was wondering if that's indeed the case
and what kind of traction you're seeing around the open source version,
but the different product offerings that you have
and where do you want to go with that?
And one last part to an already long question.
If you wanted to say a few words about time series anomaly detection, which I think you
briefly mentioned earlier, and it's something you're going to be releasing and announcing soon
as well.
Maybe I can start on the first part, and then, Maarten, you can start from the anomaly detection.
Yeah.
And add stuff.
So open source, I think, and how is the open source versus the commercial offering split?
I think here in our landscape, we have something very interesting going on because there's like different personas involved. As I tried to sketch earlier, the chief data officer, head of analytics, that's really who we target from a company and from a product perspective.
And that's the use case we want to solve for them.
Now, one of the parts or key personas in this whole cycle and in this whole trajectory is the data engineer.
And data engineers, they build pipelines.
They have also a very specific requirement,
which is they want to stop the pipeline if they detect bad data.
And so that's where we have been able to craft something, because there was actually a gap. We analyzed the market and we saw a clear need for a very simple-to-use tool, based on YAML and SQL, that the developers have under control: they can write configurations, check them into their Git repositories, and make sure that this fits with the workflow of their development cycle. And that's actually quite different from a typical cloud or SaaS product. They don't want to work with a full-blown SaaS product if they don't need to. They just want to run a command-line tool, a command-line interface, for instance, when they make a change. So there we used the opportunity that in this whole space there was not an easy
solution focused on SQL and YAML.
And that's where we started SodaSQL.
In terms of uptake, we've been pleasantly surprised because I've done a number of open
source projects in the past.
And then after two or three months, I was like, yay, someone asks a question.
And I was like super happy.
And then another month passed by.
So here we see like immediately from the week that we launched,
there were like several people.
And like two, three weeks in, the people started talking to each other,
which is like always like a great milestone
that you're not driving the community on your own,
but you get like interaction between the people there as well.
And that's where very recently afterwards,
there were like even code contributions.
Normally, the biggest contribution of an open source community is just complaining.
People don't realize that.
But if you just complain, that's actually a good contribution because that prioritizes
for us very well.
Then we can see the trends.
If a lot of people do that, then we can see the trends of where the biggest gaps are in
our offering.
But now we actually saw people like extending it and tweaking it to their use case.
So that uptake in the first five to six months that we see, like that really went beyond
our expectations.
And maybe the licensing, as you mentioned, also has something to do with it, because we chose the most liberal license, which is quite free. That's because we can actually completely cater to the data engineering use case without having to ship crippleware or anything else toward that usage.
That's where we have the cloud product to go for that. And maybe, Maarten, you can take the rest of the question, or complete this answer, and then anomaly detection and so on.
Sure. And George, you have the same tendency that I have: I always ask seven questions at once, but what happens sometimes is that we have to go back to some of the...
That's fine.
I think I remember most of it.
Adding a couple of things.
You said something about, do you see that as a way for people to evaluate?
I think partially, yes. But I think most importantly,
one of the things we realized
is that there were very few
kind of open source projects out there
that had the way of working
or kind of had thought about
how exactly do things need to tie together?
How does it need to work?
How can we make that super simple? So we focus a lot on building a developer experience for people to get started with very quickly, because the need is there. So many data teams are implementing it; I would like to say every day, but it's a couple of times a week that we learn about new production implementations of our open source software. It's for example used across many countries for COVID data. Little did we know; we only recently started finding that out, and that's one of the things with open source. And I think the nice thing about it is that within those communities, or within those teams, it is one of the tools that they use.
They don't even have to get in touch with us
and they can get value from it.
But how we see it from a commercial perspective,
we see that value is only a very small part
of the value offering that we have ultimately.
And we're not that much interested
in monetizing on those things.
From a monetization perspective, it's much more focused on the process.
Like if you're a larger company
and if you have to bring these stakeholders together
into a process, into a flow,
and you want to manage that through us,
that's one of the key ways of providing value.
Another one is the automated piece.
Like through open source developer tools,
you cannot necessarily get full-fledged machine learning models
that will automatically identify issues for you.
So the layer of intelligence is also something,
or at least a part of the layer of intelligence
is also something that the cloud offers.
So we see it more as progressive.
You can start using SodaSQL, and we foresee that some companies will just be using SodaSQL alone for quite some time possibly, and that's totally fine. So it's also a great
way to get to know the technology and the context to get a feel of how well we document things and
spend time on how easy it is to set up in terms of the user experience. So that will be the response there, but I'm sure I'm missing out on some parts of the remainder of the question.
Yeah, actually, it ties pretty well to what you're describing, because I was wondering about, well, future directions of growth and development for the platform.
And I'm kind of guessing again that what you just described, so extracting automations
and insights, basically what you've just done with the time series anomaly detection,
you may actually expand that to other areas.
Yeah, indeed. It's actually a really great feature, because you don't have to configure anything anymore, right? The open source today is predominantly rules-based and more declarative in nature. Here it's intelligence. It's always on, for all of your data sets, which is more of an enterprise feature anyway.
So that's kind of how we see it.
And the time series anomaly detection is really cool
because on the dataset level,
it covers a couple of data quality dimensions automatically,
no configuration needed.
And then you could ad hoc enable it for more granular
or kind of column level or feature level metrics.
Like one of the things we always by default do is we calculate the number of kind of missing values in any given column.
Or look at the distribution or look at validity.
When we look at the data itself, can we figure out the semantic type like an email?
Okay, if so, then we can automatically apply email validity rules. So those are some of the things we do there.
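As a small illustration of that idea, the sketch below infers whether a column looks like email addresses from sample values and, if so, applies a validity rule. The regex and threshold are illustrative choices, not Soda's actual semantic type detection.

```python
# Small sketch of semantic type inference: decide from sample values whether a
# column looks like emails and, if so, report how many values pass a validity
# rule. The regex and the 80% threshold are illustrative assumptions.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email_column(sample_values, threshold=0.8):
    values = [v for v in sample_values if v]
    if not values:
        return False
    hits = sum(bool(EMAIL_RE.match(str(v))) for v in values)
    return hits / len(values) >= threshold

def email_validity_pct(values):
    non_null = [v for v in values if v]
    valid = sum(bool(EMAIL_RE.match(str(v))) for v in non_null)
    return 100.0 * valid / len(non_null) if non_null else 100.0

column = ["a@example.com", "b@example.org", "not-an-email", "c@example.net", "d@example.com"]
if looks_like_email_column(column):
    print(f"validity: {email_validity_pct(column):.0f}% of values are valid emails")
```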
Anomaly detection is just for us step one into the intelligence roadmap with a lot more,
I think, extremely cool things to come.
But time series anomaly detection is a really low-hanging fruit ultimately.
So we first focused with SodaSQL
on the creation of the metrics
in a scalable, efficient way,
enabling data testing.
And now we've started leveraging more,
okay, what automations, what insights can we derive
from all of those metrics that we've calculated.
Okay, I see.
And, yeah, one
closing brief comment because we're
almost out of time, I think. I
also realized
that you seem to have gotten
some funding recently and so I'm guessing
in terms of company growth
you probably are going to be expanding the team and go-to-market strategy and this kind of thing.
Yes, yes, totally. So I think we completed, about six or seven months apart, both our seed and Series A funding.
That's just simply because of the markets being there,
the product finding a good fit in the markets.
That kind of was a trigger for that.
We completed a total of, I think, 17 or 18 million euros in terms of funding raised, which is quite sizable.
And that gives us plenty of runway.
We're a team of about 25 today.
So the goal there is to gradually expand further now, for the first time also really investing in more of the commercial, go-to-market side of it for the enterprise.
So yeah, we're very excited about that, because the core team and product, all of that is there. We're now more in a mode of scaling that go-to-market motion, setting up customer success and all of the other aspects of a modern-day sales business.
Okay. Great.
Thank you both for
a very interesting discussion
and well, glad you
managed with my very, very long questions.
It was
our pleasure. Thanks for taking the time today.
Thanks for having us.
I hope you enjoyed the podcast.
If you like my work,
you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.