Orchestrate all the Things - More than words: Shedding light on the data terminology mess. Featuring Soda Founders Maarten Masschelein and Tom Baeyens
Episode Date: June 22, 2021. It's a data terminology mess out there. Let's try and untangle it, because there's more to words than lingo. Hopefully technology investment decisions in your organization are made based on more than hype. But as technology is evolving faster than ever, it's hard to keep up with all the terminology that describes it. Some people see terminology as an obfuscation layer meant to glorify the ones who come up with it, hype products, and make people who throw terms around appear smart. There may be some truth in this, but that does not mean terminology is useless. Terminology is there to address a real need, which is to describe emerging concepts in a fast moving domain. Ideally, a shared vocabulary should facilitate understanding of different concepts, market segments, and products. Case in point - data and metadata management. Have you heard the terms data management, data observability, data fabric, data mesh, DataOps, MLOps and AIOps before? Do you know what each of them means, exactly, and how they are all related? Here's your chance to find out, getting definitions right from the source - seasoned experts working in the field. Article published on ZDNet
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
It's a data terminology mess out there.
Let's try and untangle it because there's more to words than lingo.
Hopefully, technology investment decisions in your organization
are made based on more than hype.
But as technology is evolving faster than ever,
it's hard to keep up with all the terminology
that describes it.
Some people see terminology as an obfuscation layer meant to glorify the ones who come up
with it, hype products, and make people who throw terms around appear smart.
There may be some truth in this, but that does not mean terminology is useless.
Terminology is there to address a real need, which is to describe emerging concepts
in a fast-moving domain. Ideally, a shared vocabulary should facilitate understanding
of different concepts, market segments, and products. Case in point, data and metadata
management. Have you heard the terms data management, data observability, data fabric,
data mesh, DataOps, MLOps, and AIOps before?
Do you know what each of them means exactly and how they're all related?
Here's your chance to find out, getting definitions right from the source, seasoned experts working in the field.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
So I'm Maarten, one of the co-founders and CEO.
I've been in the data space for a good 10 years now, so data management.
For the earlier part of that, I was an early employee, employee number five, at a company called Collibra, who were ultimately the first selling, or positioning, software to chief data officers, before the CDO role really came into existence. I think back when they started, around 2008 or 2009, you had the first CDO at Yahoo. But then it gradually built out: data governance became a thing, metadata management became much more important as companies were doing more with data.
So I enjoyed that journey for six years
and then branched off, co-founded with Tom.
Tom and I, ultimately, didn't really know each other that well prior to starting, but we did have a first conversation a couple of years ago, actually before we started,
because I was on a train together with a colleague of mine to London from Brussels.
So we're in the Eurostar.
And the colleague of mine says, I know this guy.
I know that this is Tom, from the workflow engine that we use at Collibra.
Collibra has a core capability around data collaboration,
which is workflow-engine driven.
jBPM was kind of at the foundation of that.
And that's Tom's open source project.
So my colleague knew him.
We started chatting and ultimately that's where we first met
and the conversation got started.
It's a good segue to make the transition.
I got started in open source workflow engines.
I created two subsequent workflow engines in the developer ecosystem, with significant communities, and the second one was the biggest one. That's indeed the one that got adopted by a lot of companies like Collibra. There are now a handful of companies working on the actual technology itself, and more like hundreds or more baking it into their products. That's where I really saw that the developer ecosystem, open source, is quite a different beast from normal, traditional development.
And I really like that.
It's like an awesome environment to be in, to work in.
After those endeavors, I also did a SaaS startup.
And it's after that SaaS startup, which was in the BPM space as well, workflow and collaboration, but more the simplified version, with a UI towards the business, rather than the enterprise version.
And so that's kind of how I segued into data when I met Maarten,
because open source and workflow are key components
in the product vision of Soda as well.
So that's where the connection comes from.
We have, in the meantime, established an open source trajectory, launched it earlier this year.
And workflow and collaboration is key in our platform as well.
So that's the link and where we met up.
That's how we got started.
Okay, interesting.
So if I got it right, Tom, you used to work around jBPM.
Is that correct?
Yeah, I'm the original founder of jBPM.
Okay.
And of Activiti later on.
Yeah, I mean, like many others, I guess.
I have used jBPM at some point.
And yeah, I even met some people in Berlin.
Unfortunately, I forget the name of the company now,
but since you're into BPM...
JBoss, Red Hat, maybe?
No, they started a new...
Ah, Camunda.
Yes, Camunda.
Yeah.
I met them at some point.
And yeah, we actually discussed quite a bit about, you know,
the inner workings of business process management and workflow engines
and open source and all of that.
That's a great crew.
Each time when I go to Berlin, I try to meet up with them.
Some nice friends there.
Yeah, cool.
Okay.
So, yeah, thanks.
Thank you both for the introduction to yourselves and how you met up. And I should also mention that it's kind of a nice coincidence, nice timing, because James, who organized the discussion, was kind enough to let me know that you just got an award as a Cool Vendor from Gartner.
And I was looking at the brief in which they described what this award is about.
And they also mentioned a few words about the vendors that got the award.
And what struck me about it was that there seems to be quite a wealth of terminology thrown around.
So you have data management and data observability
and data fabric and data mesh and data ops as well
and AI ops and ML ops and all of those things.
And well, even though it's mostly analyst lingo in a way,
and I count myself as an analyst as well.
Many people say, okay, so this is analyst speak, you know, to, I don't know, to confuse
people or to invent product categories or whatnot.
But I think there's actually, there should be at least some value in those terms.
Like people are basically trying to describe like emergent, let's say, emergent trends in the market, emergent products.
And so they kind of have to come up with lots of terms, I guess.
But I think it's also interesting if we try to clarify,
let's say, a little bit around those.
And in the process of doing so,
I think it will be good for people who may be listening as well,
because then you will
also be able to position yourselves around those terms and how do you self-identify,
let's say, as a company.
Yes.
Tom, if that's okay, I'll give this a first stab and then you can fill in some of the
blanks, the things that I've missed.
Let's see where we start. I think it's best that we start with both concepts
of data mesh and data fabric.
So data mesh and data fabric, ultimately,
it's all about kind of a framework
for organizing around data for scale, ultimately.
It's kind of where fabric focuses more
on this kind of concept in the technology
domain of like the data platform, one unified way for us to access data using shared services,
kind of abstracting away all of the underlying complexities of technology, whether data resides somewhere in a legacy Teradata or what have you, or more in the cloud in Snowflake.
Data Fabric is all about technology
and understanding the organizational setup for speed and growth
and doing more things with data.
Data Mesh is a similar concept,
however, subtly different
in the sense that it focuses more on the organizational
aspects of it.
It focuses more, and I would almost call it like the modernized version of data governance
principles that are applicable for broader data teams to kind of structure and organize
in a good way, removing some of the bottlenecks of the past.
The bottlenecks of the past were predominantly around, for example, your data warehouse team
that was extremely kind of like a funnel that you had to go through, which was not scalable.
So with data mesh, it's fundamentally about building data products and data services.
So it's data product thinking: instead of, like in data governance, where we talk about managing data as an asset,
In the data mesh concept,
we talk about managing data as a product,
which is more specific ultimately.
And it's this notion that, yeah,
we should have kind of core platform services.
But then on top of that,
we need to structure ourselves around data domains,
areas of certain like business expertise and knowledge
and enabling them to be self-serve.
I think that's also the key.
In the past, we had a lot of bottlenecks.
Again, now we try to make data technology much more self-serve, and introduce concepts in the data mesh, like data ownership, much more clearly, and kind of what your expectations are. One of the roles, or kind of the responsibilities, of people who own data is also to manage the quality of it. And that's maybe a nice segue into data management. Data management is a term that has already existed for multiple decades, originally very heavily described by the Data Management Association, where they did a lot of work
around how we should ultimately manage data. A part of that was metadata management, which spun out into data cataloging software, which in turn spun out data lineage capabilities. Another part of it was data quality management, and that is kind of where you can place the terms that are more associated with data monitoring, data observability and data testing, as specialized areas underneath quality management within the broader framework of data management. So that's kind of my first take on that. So with data management,
I talk more about capabilities. And with data fabric and data mesh, we talk more about like
organization. And then the last one is data ops. Well, that's more about process.
Where in data ops, it is about,
now that we have these capabilities,
now that we've organized,
what are best practice processes
for us to deliver data products at an increasing velocity
with an increased reliability as we do that.
And this goes really into the nitty-gritty sometimes.
Like, for example, today,
when an analytics team delivers new data pipeline code, like new data transformation code, let's say in dbt, they want to have processes in place, checks and balances: for example, the engineering manager reviews the PR, the pull request with the new code, and can see, okay, is this going to have a structural impact on the data set, and can we rely on this new code base or this new code change. That's an example of a process, and it's all such small processes that need to be put in place and standardized for us to work much better with data, similar to what we've done with DevOps in software engineering.
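To make that kind of check concrete, here is a minimal, hypothetical sketch of a schema guard that could run in CI on a pull request: it compares the columns a transformation produces against an expected contract and fails the build on a structural change. The table, column names and file layout are illustrative assumptions, not Soda's actual tooling.

```python
# Hypothetical CI-time schema guard; table, column names and types are
# illustrative, not tied to any specific product or warehouse.
import sys

def check_schema(actual: dict, expected: dict) -> list:
    """Return human-readable findings describing structural differences."""
    findings = []
    for column, col_type in expected.items():
        if column not in actual:
            findings.append(f"missing column: {column}")
        elif actual[column] != col_type:
            findings.append(f"type change on {column}: {col_type} -> {actual[column]}")
    for column in actual:
        if column not in expected:
            findings.append(f"new column not in contract: {column}")
    return findings

if __name__ == "__main__":
    # In practice the expected schema would live as a contract file in the repo,
    # and the actual schema would be read from the warehouse or from dbt artifacts.
    expected = {"order_id": "int", "amount": "float", "currency": "text"}
    actual = {"order_id": "int", "amount": "text"}  # type change plus a dropped column
    problems = check_schema(actual, expected)
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # fail the CI job so the reviewer sees the structural impact
```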
So it was quite a lengthy one, but there are a lot of terms to go into.
I don't know if that makes sense, raises questions, or Tom, if you have things that you'd love
to add there.
No, I just wanted to add a bit of context
around that data observability space.
I think for those that are not familiar with the space,
I think the way you can understand this
is that there's like engineers building data pipelines.
They're preparing the data to be used in data products.
Data products, for example, are machine learning models.
So anything, any algorithm that's using data on a continuous basis
or where the data then automatically gets updated from source system,
transformed over data pipelines,
and then being prepared for usage in the algorithms or use case.
If you look at that landscape, there are a bunch of engineers developing new products regularly. Once those products get into production, that's the context, that's where the observability starts.
That's where the data could actually go bad
and the software algorithms using the data,
they keep on working, they don't notice that the data is bad.
And this leads to all sorts of very costly dangers, or costly consequences, that you want to protect yourself against. You don't want your clients to find out that your website is showing wrong information. You don't want your hotel room price calculation algorithm, for instance, to use wrong data, because
then your revenue is directly impacted. So checking and continuously monitoring that data as your
use cases and your data products are in operations, that's where you need observability.
That's what you want to protect yourselves against.
Yeah, yeah, great.
Thank you.
Thank you both.
And I think that it's just, indeed,
these are lots of terms and it makes sense that, well,
it takes some time to go through all of them.
And actually, I think you did a very good job of,
both of you, of kind of describing and aligning them, let's say.
There's just a couple of terms that we left out, MLOps and AIOps, which, in my understanding at least, I would say are kind of a specialization of DataOps. And to my mind they're pretty much the same, even though, you know, in theory, machine learning
is definitely not the same as AI.
I think in practice, those terms, at least in the MLOps versus AIOps context, I think
they're probably used interchangeably.
I don't know what your view is on that.
I think, yes.
So there's, I think they rely on each other. I think MLOps relies on a good DataOps foundation ultimately,
but it's more specialized.
Like in DataOps, we won't be monitoring our prediction accuracy,
for example. That's specific to the data product.
And that's also specific to the lifecycle of the data product.
So I like to think of it more from like a lifecycle perspective.
And then for me, those are two separate things because the lifecycle of a data set is not
directly tightly coupled to the lifecycle of a machine learning or a data product ultimately.
So there are also different people doing that. When it comes to managing data and data ops,
we have data producers, which can be external to the organization.
You could have a Bloomberg feed.
You can have all sorts of data coming in that you either buy or collect.
You have internally generated data.
So there's much more of an organization around it,
typically also in the business that takes ownership of it. So I would see it as a separate thing, however,
with quite some commonalities. Another way of looking at it is kind of looking at the
tooling landscape. And if you look at the monitoring and observability software in the entire stack, and with the stack, I mean, like infrastructure at the bottom of it, then our applications that we write.
And then nowadays, these applications, we use data and machine learning as two kind of new layers.
And in those two layers, we're just getting started with software and platforms to help you monitor
that. That's relatively new, whereas the other ones have existed for much longer. So I think there are a lot of analogies across those layers of the stack, but there are some intricacies about each one of them.
Yeah, I would actually consider data observability and checking your data more of a fundamental layer, on top of which you have MLOps and AIOps, in the sense that MLOps has specific workflows
around how you deal with these machine learning models. You have to figure out, like, if the data changes, what the impact is on the actual result.
Or if the model changes, what's the impact?
And then this versioning, throughout the versioning,
you can sometimes trace back,
like where was the problem originally?
So those are specific to machine learning. And similarly for AIOps, there are different flavors of the workflows on top,
but fundamentally underneath all of the use cases with data,
they need correct data to start with.
So in that sense, we're more like the base layer on top of which
the more specific workflows are being developed.
Yeah, thank you.
And yeah, I think it's a task of quite amazing complexity, actually. I mean, just managing DevOps, before we even start talking about DataOps and all of that, and on top of that models and versioning, it just kind of explodes.
So that kind of goes to show the need for solutions like what you're building, I guess.
It's a nice segue to actually talk about what you're building.
So it seems like the message that you put across is focused around
four areas. So monitoring, testing, data fitness and collaboration. And I wonder if you'd like to
just say a few words about the platform in general and more specifically those four areas and what
you offer in each of those. No, totally.
I'll try to keep that kind of short and crisp.
Ultimately, the first capability is really like a capability for the data platform team.
And it's all about automatically monitoring data sets
in our environment for issues.
No configuration, ultimately.
So that means that we try to figure out if there's something abnormal about the data sets that land in your environment.
For example, how many records did you process this time around? Is that abnormal compared to the same day last week? Or, using some machine learning, comparing that while factoring in seasonality, to figure out if that's off or not.
Or things like when your data updates.
Data freshness is always a key consideration, a key problem that companies are looking to solve.
Because sometimes your data providers will change something that you didn't foresee,
and then all of a sudden data is stale or becomes stale,
and that has a direct downstream impact into your data products.
So it's about automating that as much as possible,
and no business logic needed really.
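As a rough illustration of the automated monitoring described here, the sketch below compares today's row count against the same weekday in previous weeks and flags stale data. The thresholds, history values and statistics are invented for the example and are not Soda's actual detection logic.

```python
# Minimal sketch of automated dataset monitoring: compare today's row count
# with the same weekday in previous weeks, and flag stale data.
# Thresholds, history and timestamps are invented for the example.
from datetime import datetime, timedelta
from statistics import mean, stdev

def row_count_anomalous(today, same_weekday_history, z=3.0):
    """Flag today's row count if it deviates strongly from the weekday's history."""
    if len(same_weekday_history) < 3:
        return False  # not enough history to judge
    mu, sigma = mean(same_weekday_history), stdev(same_weekday_history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z

def data_is_stale(last_loaded_at, max_age_hours=24):
    """Freshness check: has the dataset been updated recently enough?"""
    return datetime.utcnow() - last_loaded_at > timedelta(hours=max_age_hours)

history = [10_250, 9_980, 10_410, 10_120]   # row counts from previous Tuesdays
print(row_count_anomalous(4_300, history))  # True: volume dropped sharply
print(data_is_stale(datetime.utcnow() - timedelta(hours=30)))  # True: feed is stale
```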
And that covers a part of the discovery of data issues in your organization,
but it doesn't cover all of them.
It actually only covers a small percentage of the types of data issues that you can have.
So that's why data testing and data validation is kind of the next step.
This is where you enable both the data engineer on the one hand and the data subject matter expert, so a business counterpart, on the other hand to write more descriptive, declarative validations on data: things that need to hold each time new data arrives. Like, we can only have x percent of missing data in this column, for example, or this needs to be unique, or this needs referential integrity, or it needs to be within an allowable set of values.
For the engineer, because the pipeline might break.
For the analyst or the data subject matter experts, because their data products will potentially break or there's a business process that needs to be triggered.
So those fall squarely into the discovery of data issues, as kind of step one.
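To illustrate what such declarative validations amount to, here is a small, hypothetical sketch that expresses a few of the checks mentioned above (missing percentage, uniqueness, allowed values, referential integrity) as data and evaluates them against a batch of rows. The rule format and column names are invented for the example and are not Soda's syntax.

```python
# Hypothetical declarative checks evaluated against a batch of rows; the rule
# format and column names are illustrative, not any product's actual syntax.
rows = [
    {"id": 1, "country": "BE", "customer_id": 42},
    {"id": 2, "country": "XX", "customer_id": None},
    {"id": 2, "country": "NL", "customer_id": 7},
]
known_customer_ids = {7, 42}  # stand-in for the referenced table

def missing_pct(rows, column):
    return 100.0 * sum(r[column] is None for r in rows) / len(rows)

def is_unique(rows, column):
    values = [r[column] for r in rows if r[column] is not None]
    return len(values) == len(set(values))

def in_allowed_set(rows, column, allowed):
    return all(r[column] in allowed for r in rows if r[column] is not None)

def referential_integrity(rows, column, referenced):
    return all(r[column] in referenced for r in rows if r[column] is not None)

checks = [
    ("customer_id missing < 5%", missing_pct(rows, "customer_id") < 5),
    ("id is unique", is_unique(rows, "id")),
    ("country in allowed set", in_allowed_set(rows, "country", {"BE", "NL", "FR"})),
    ("customer_id references customers", referential_integrity(rows, "customer_id", known_customer_ids)),
]
for name, passed in checks:
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```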
But that's not where it ends because if you have a system
for discovery of data issues, it will create a lot of alerts.
But how do you handle the alerts?
What is the business process that you then go through?
And that is, I think, very key.
That's where we enable the data owners, for example.
And that's kind of the analysis and prioritize phase. And that's where we have things like data fitness dashboards,
which is more broadly about SLA tracking, about giving data owners a view of all the expectations
on data across the organization so they can improve the quality and know where to prioritize,
as well as kind of the workflow around the resolution of the problem,
all the way to creating tickets in Jira or ServiceNow
to then further handle the data incidents.
That's ultimately kind of the higher level of capabilities,
whereas collaboration fits in across all of these areas ultimately.
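For a sense of what SLA tracking over check results can look like, here is an illustrative sketch that aggregates pass rates per dataset and surfaces the ones falling below an agreed threshold; the datasets, results and targets are invented for the example and do not reflect Soda's actual dashboards.

```python
# Illustrative sketch of SLA tracking over check results: aggregate pass rates
# per dataset and surface the ones falling below their agreed threshold.
from collections import defaultdict

check_results = [
    ("orders",    True), ("orders",    True), ("orders",    False),
    ("customers", True), ("customers", True), ("customers", True),
]
sla_targets = {"orders": 0.95, "customers": 0.90}  # required pass rate per dataset

pass_counts, totals = defaultdict(int), defaultdict(int)
for dataset, passed in check_results:
    totals[dataset] += 1
    pass_counts[dataset] += passed

for dataset, target in sla_targets.items():
    rate = pass_counts[dataset] / totals[dataset]
    status = "OK" if rate >= target else "BELOW SLA -> open incident"
    print(f"{dataset}: {rate:.0%} of checks passing (target {target:.0%}) {status}")
```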
But predominantly, collaboration is also there in the analysis phase,
as you can easily bring people with different knowledge about the problem,
like the data engineer on the one hand, the analytics engineer,
as well as the business person.
They often have tacit knowledge that's not documented
that is needed to resolve the problem.
Maybe you can add a few notes and comments to it.
So, those are very distinct kinds of capabilities, but I'd like to also give an overview perspective of how we approach this space. So if you look at it from the engineering perspective,
you're mostly involved in, for instance, data testing.
You want to make sure that your pipeline runs smooth
and that your pipeline is stopped
if you detect bad data so that it doesn't flow downstream.
But there's a lot more people involved
into the whole data ecosystem in a company.
There's the analysts like yourself, like using the data, building interesting use cases and
products with it. There are the subject matter experts that have intimate domain knowledge of all the details: what a particular field looks like, what kind of data is normal and abnormal, and what the specialties of it are. And now, you can't really prevent data issues from happening.
So, there needs to be a process in place that actually monitors, finds those issues, and
then resolve them.
And if you then look at it from the head of analytics or the CDO, the chief data officer's perspective
at your organization, you are responsible for making sure that your issues are discovered
and resolved.
And of course, there's a bunch of steps in between, but that's the business process that
you're responsible for.
And you need to make sure that that flywheel is continuously running.
So in summary, that's what Soda focuses on: making sure that that flywheel of finding and resolving issues, which is an operational concern, is dealt with.
Okay.
Yeah, it seems like, you know, there's a kind of logical progression, at least among the three first areas.
So monitoring, testing and data fitness, especially in the way that you described them.
So it kind of all ends up, you know, simplistically,
very, very simplistically put, it kind of all ends up in a dashboard
in a way where the person responsible for overseeing the process,
let's say, can see how it's all going.
And then you have a cross-cutting concern, which is collaboration
that kind of
facilitates everything.
Yes, correct. And the one point I also wanted to add there is that this collaboration, and the workflows behind it, might look like a bit of overkill in smaller to medium-sized organizations, where you have smaller teams and usually there are one or two data engineers and they handle everything.
But I think you need to have the awareness that as your use cases for data grow towards the future, these roles will become more specialized, and then you need to have more of this collaboration in place and managed in order to keep it all running.
Otherwise, your data engineers will get overloaded.
Yeah, it makes sense.
And actually, I wanted to ask you a little bit more about some more details about two
of those areas.
So let's start with collaboration, actually, because you mentioned it as a kind of way to elicit implicit knowledge. I'm wondering if you have a specific way of doing that. I guess the obvious one would be logging all conversations and exchanges, but I wonder if you also do any analysis on those, and whether you have some results that you distill out of that.
Ultimately, with collaboration there are a couple of things unique about it. One of them is that it works roles-based. So for example, as a data engineer, I've written this data transformation code, or I ingest this data set. So I'm kind of attached to that with my role of data engineer. So when we have technical problems, that will be the person we default to, to involve and inform, as a way to also reduce noise and make alerting highly specific,
making them go to the right people.
But the same thing is when we have issues around,
like, for example, a lack of data or completeness of data,
validity of data that's downgrading,
that's something more of a concern
to the person who knows the data inside out, right?
So again, there we can work based on roles to route alerts to the right people.
You've also touched upon kind of the analysis phase.
So very often, like when there's issues or incidents, we triage them first.
We see which ones we're going to work on.
Maybe we group some incidents together because there's an underlying root cause. In that analysis, we leverage a multitude of tools, from data lineage capabilities to diagnostics data that we analyze. For example, we have a set of data passing a certain validation and a set of failing data, and if we analyze those and see what's different about them, we can already figure out that maybe it's this device type, because all the records in the failing sample pointed to one device type, being Android, for example. So all of that is in the broader space
of helping flow from prioritizing, analyzing to resolving,
pointing them to the right tools,
giving all of the users information
that can help in their decision process
and the ability to tag people,
a bit like in Google Docs, right?
Or you'll be tagging somebody in to help
out or to give some more insights into why something might have happened. That's kind of
how we envision that.
If I can add one more thing: collaboration doesn't always mean the
typical and trivial collaboration features like commenting and sending things to the right people.
Collaboration for us goes broader because we also want to make sure that this domain knowledge
of the subject matter experts is captured.
Because normally this domain knowledge that they have is bottlenecked by the availability
of the data engineers to actually implement the rules that they have, because this is not something that you can tackle with AI.
The domain knowledge needs to be captured in rules.
And so there we've invested a lot to make sure that analysts can actually do self-service
monitor creation so that they can actually, without the involvement of the data engineers,
manage this domain knowledge
themselves. And that way scale a lot more of these rules, because that used to be the problem
that the rules can't scale well enough. And there's two reasons for it. First of all,
now there's more data. That's one thing. But the other part is that it used to be a technical
solution. And there we now go to a self-service mode, where people can do this together, and therefore a lot more of that domain knowledge can be covered.
Okay, cool, thanks. So I guess that's also a good way to
talk a little bit about the underpinnings of what you've built.
And I was wondering what kind of technology you could have possibly used
to build the individual modules, let's say, functionality,
as well as to glue them all together.
But actually, I'm going to make a guess here.
And having heard from you, Tom, that you were deeply involved in workflow engines, I'm guessing that maybe that has something to do with it.
Yeah, definitely.
That's a good way of looking at the technical underpinnings, because we split that up. On one side there are all the developer tools that we have to make connectivity with the data.
We have SodaSQL, SodaSpark is almost there, and then we have SodaStreaming as well in the pipeline.
So these give like full coverage of the complete data landscape or the data stack,
which is important because in large organizations, your data is all,
you want to monitor the data, not only in your warehouse,
but also in these other places.
So you don't want to be stuck to only a warehouse, for instance.
So there we realized that connectivity, a lot of the times has to do with the data engineers in the team.
They want to work with configuration files, for instance, like YAML files.
They want to work on SQL level.
They want to check these things into their Git repository.
So we spent like a lot of time making sure that from the engineering perspective,
this is like a seamless thing.
This is something they love, rather than a tool they're forced to implement. So those are the underpinnings there. And then the connectivity
leads to a set of metrics being computed on a scheduled basis. And then each time when the
metrics are computed on a certain data set, they're sent to our cloud product. Our cloud product will actually collect these and
store them, so that you get a time series for each metric. That way we can see over the history how this metric behaves, and we can apply anomaly detection on it to check whether it is normal or abnormal. And it's also in that platform where we then allow the analysts to start building their own monitors on top of the metrics that come in as well.
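As a rough sketch of that flow, the example below appends each scan's metric value to a per-metric time series and runs a deliberately simple anomaly check over the history. A production system would model trend and seasonality; none of the names here reflect Soda's actual implementation.

```python
# Rough sketch of the described flow: each scan produces metric values, they
# are appended to a per-metric time series, and an anomaly check runs over the
# history. Storage and detection here are deliberately simplistic.
from collections import defaultdict
from statistics import median

class MetricStore:
    """Keeps a time series of values per (dataset, metric) key."""
    def __init__(self):
        self.series = defaultdict(list)

    def record(self, dataset, metric, value):
        """Append a new observation and report whether it looks anomalous."""
        history = self.series[(dataset, metric)]
        anomalous = self._is_anomalous(value, history)
        history.append(value)
        return anomalous

    @staticmethod
    def _is_anomalous(value, history, tolerance=0.5):
        # Flag values that deviate from the historical median by more than 50%;
        # a real system would account for trend and seasonality.
        if len(history) < 5:
            return False
        m = median(history)
        return m > 0 and abs(value - m) / m > tolerance

store = MetricStore()
for value in [1000, 1020, 990, 1015, 1005, 480]:   # last scan drops sharply
    flagged = store.record("orders", "row_count", value)
    print(value, "anomalous" if flagged else "ok")
```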
So that's in a nutshell.
That's also in the cloud, by the way; that's where the workflow engine is part of it as well, to drive the resolution.
That's like a rough outline.
I don't know if maybe Maarten has something to add to that.
You're on mute.
Yes.
Sentence of the year.
So, no, I think one way I sometimes like to look at it is to compare it with other areas in data management.
We've talked about data cataloging, for example, a while ago,
which is more focused on kind of the process
of finding data in an organization and sharing data definitions, sharing understanding, so you can more quickly access that data and start using it. That's kind of an adjacent process to ours, which is: of the data that you use, how do we make sure it remains fit for purpose?
So that's kind of where we focus.
And because in data, everything is connected, right? We have data sets that are at the source, and then we make copies, transformations,
and then it goes to maybe another part of the organization where they use it for another purpose.
And before you know it, you have a large graph of connected data sets.
And I think the data cataloging space is predominantly focused on graph-based systems,
finding connections between data, understanding where data comes from, how it transforms, et cetera. I think that's one component that is ultimately needed predominantly in the analysis phase
of a problem, a data incident.
For us, we focus much more on metrics, on like how you can compare metrics, how we can
find problems through metrics. So it's more of a time-series-based system. We lean much more toward that operational, day-to-day nature, and a bit less around the graph of things. So it's maybe another way of looking at the underlying technology
focus and technology choices. Okay, thank you. And yeah, interesting that
you mentioned graphs and such because that was also something I meant to ask
you about again in relation to your inclusion in Gartner's report and in
that report which also deals with data fabrics,
they have a kind of stack there,
which includes things such as data sources and metadata
and so on and so forth.
And they also include knowledge graphs,
which kind of struck me.
It makes sense.
And there's lots of products in the data cataloging,
mostly, space that are based on that.
And I was going to ask you how
much of this stack. Well, first of all, whether you think the stack makes sense, and then, second part, how much of this stack would you say that your own solution touches. So we have our data sources, then there's a layer of data and metadata. Then there's an augmented data catalog.
And then there's a knowledge graph with semantics.
So I think what they're really describing there, the bottom layer, is ultimately the data cataloging space.
Where the primary focus of what we do is we ingest metadata into a centralized system.
And we manage the lifecycle of that metadata.
And as part of that lifecycle, one of the things that we typically do is understand what data do we have in here.
Does this column connect to a high-level concept in the business? You can imagine that customer or customer address data
is stored in many, many different physical tables,
files, or what have you.
And that's ultimately the domain of kind of the catalog.
The knowledge graph and the semantics
are about representing the intricacies
about the business at hand,
the company that is building that knowledge graph,
to kind of ultimately better manage their data.
Because if you have a semantic layer,
you can start defining policies more on that level.
You can say, well, for all customer data,
we do X, Y, Z in terms of our data management process.
And for us, that is not a space that we're in.
We have an integration strategy there,
and we've already integrated successfully
and have that running at some customers
with some of the most commonly used data catalogs.
So we're ultimately, if they say augmented data catalog, well, you could say that
we're part of the augmentation. Because when somebody searches for data in their organization,
they can immediately see if that data set is properly maintained. If we've had issues in the
past, how quickly do we react to those issues? And that is really valuable information
as part of your kind of enterprise repository
of your knowledge graph, your data catalog,
your business metadata, ultimately.
So I think up to that level,
we are involved in that predominantly
from an augmenting the data catalogs perspective.
I think where you then go higher up in the stack,
I think they predominantly focus or see us within orchestration and data ops.
As for the metadata activation part, I have to be honest, I'd have to look up what exactly they mean with that.
But then, of course, in DataOps and your processes around how you manage data on a day-to-day basis, how we resolve and identify and prioritize issues, well, that's 100% also in our wheelhouse.
Yeah, thank you.
And yeah, to be honest with you, yeah, that's part of the reason why I asked, you know,
whether that layer diagram makes sense to you.
Because, again, making the connection to what we opened the discussion with about terminology, it's a bit dense, let's say, and the differences can be quite subtle.
Yeah, no, no, I agree. It's a bit of the Wild West out there when it comes to terminology today.
What we feel like is that ultimately data mesh is a very interesting concept
that we see a lot of discussion around because it's organizational.
It's a cultural and organizational change.
And that is, I think, a key one. And then observability as well,
because people who today work and rely on data
don't really always know what's going on.
They're not close to it.
And that sometimes causes a lot of sleepless nights
because you're automating with it
and you don't know what's going on with it always.
And those are two very clear pain points, if you kind of want to forget about everything else, right?
Those are two things that are super hot today and that a lot of companies are working on
thinking about and are actively implementing.
Okay. So I had in mind, as we're kind of wrapping up the discussion, to ask you about where you're headed next, basically, with your product development. And I wanted to
bring into that what you briefly touched upon earlier, so the open source aspect. So
looking a little bit around as I did some background research on
your product, I realized that you also have an open source layer. So I was wondering if it works,
it kind of looked to me like it works in a typical way for commercial open source as a sort of
onboarding layer, let's say that people can start using and experimenting with. And then as their needs grow, they can move on to the other offerings.
I was wondering if that's indeed the case
and what kind of traction you're seeing around the open source version,
but the different product offerings that you have
and where do you want to go with that?
And one last part to an already long question.
If you wanted to say a few words about time series anomaly detection, which I think you
briefly mentioned earlier, and it's something you're going to be releasing and announcing soon
as well.
Maybe I can start on the first part, and then, Maarten, you can start from the anomaly detection.
Yeah.
And add stuff.
So open source, I think, and how is the open source versus the commercial offering split?
I think here in our landscape, we have something very interesting going on because there's like different personas involved. As I tried to sketch earlier, the chief data officer, head of analytics, that's really who we target from a company and from a product perspective.
And that's the use case we want to solve for them.
Now, one of the parts or key personas in this whole cycle and in this whole trajectory is the data engineer.
And data engineers, they build pipelines.
They have also a very specific requirement,
which is they want to stop the pipeline if they detect bad data.
And so that's where we have been able to craft something, because there was actually a gap. We analyzed the market and we saw a clear need for a very simple-to-use tool, based on YAML and SQL, that the developers have under control: they can write configurations, check them into their Git repositories, and make sure that this fits with the workflow of their development cycle. And that's actually quite different from a typical cloud or SaaS product. They don't want to work with a full-blown SaaS product if they don't need to. They just want to run a command-line tool, a command-line interface, for instance, when they make a change. So there we used the opportunity that in this whole space there was not an easy
solution focused on SQL and YAML.
And that's where we started SodaSQL.
In terms of uptake, we've been pleasantly surprised because I've done a number of open
source projects in the past.
And then after two or three months, I was like, yay, someone asks a question.
And I was like super happy.
And then another month passed by.
So here we see like immediately from the week that we launched,
there were like several people.
And like two, three weeks in, the people started talking to each other,
which is like always like a great milestone
that you're not driving the community on your own,
but you get like interaction between the people there as well.
And that's where very recently afterwards,
there were like even code contributions.
Normally, the biggest contribution of an open source community is just complaining.
People don't realize that.
But if you just complain, that's actually a good contribution because that prioritizes
for us very well.
Then we can see the trends.
If a lot of people do that, then we can see the trends of where the biggest gaps are in
our offering.
But now we actually saw people like extending it and tweaking it to their use case.
So that uptake in the first five to six months that we see, like that really went beyond
our expectations.
And maybe the licensing, as you mentioned, also has something to do with it, because we chose the most liberal license, which is quite free. That's because we can actually completely cater to the data engineering use case without having to ship crippleware or anything else toward that usage.
That's where we have the cloud product to go for that. And maybe, Maarten, you can take the rest of the question, or complete this answer, and then anomaly detection and so on.
Sure. And George, you have the same tendency that I have: I always ask seven questions at once, but what happens sometimes is that we have to go back to some of the...
That's fine.
I think I remember most of it.
Adding a couple of things.
You said something about, do you see that as a way for people to evaluate?
I think partially, yes. But I think most importantly,
one of the things we realized
is that there were very few
kind of open source projects out there
that had the way of working
or kind of had thought about
how exactly do things need to tie together?
How does it need to work?
How can we make that super simple? So we focus a lot on building a developer experience for people to get started with very quickly, because the need is there. So many data teams are implementing it; I would like to say every day, but it's a couple of times a week that we learn about new production implementations of our open source software. It's for example used across many countries for COVID data. Little did we know; we only recently started finding that out, and that's one of the things with open source. And I think the nice thing about it is that within those communities, or within those teams, it is one of the tools that they use.
They don't even have to get in touch with us
and they can get value from it.
But how we see it from a commercial perspective,
we see that value is only a very small part
of the value offering that we have ultimately.
And we're not that much interested
in monetizing on those things.
From a monetization perspective, it's much more focused on the process.
Like if you're a larger company
and if you have to bring these stakeholders together
into a process, into a flow,
and you want to manage that through us,
that's one of the key ways of providing value.
Another one is the automated piece.
Like through open source developer tools,
you cannot necessarily get full-fledged machine learning models
that will automatically identify issues for you.
So the layer of intelligence is also something,
or at least a part of the layer of intelligence
is also something that the cloud offers.
So we see it more as progressive.
You can start using SodaSQL, and we foresee that some companies will just be using SodaSQL alone for quite some time possibly, and that's totally fine. So it's also a great
way to get to know the technology and the context to get a feel of how well we document things and
spend time on how easy it is to set up in terms of the user experience. So that will be the response there, but I'm sure I'm missing out on some parts of the remainder of the question.
Yeah, actually, it ties pretty well to what you're describing, because I was wondering about, well, future directions of growth and development for the platform.
And I'm kind of guessing again that what you just described, so extracting automations
and insights, basically what you've just done with the time series anomaly detection,
you may actually expand that to other areas.
Yeah, indeed. It's actually a really great feature, because you don't have to configure anything anymore, right? The open source today is predominantly rules-based and more declarative in nature. Here it's intelligence. It's always on, for all of your data sets, which is more of an enterprise feature anyway.
So that's kind of how we see it.
And the time series anomaly detection is really cool
because on the dataset level,
it covers a couple of data quality dimensions automatically,
no configuration needed.
And then you could ad hoc enable it for more granular
or kind of column level or feature level metrics.
Like one of the things we always by default do is we calculate the number of kind of missing values in any given column.
Or look at the distribution or look at validity.
When we look at the data itself, can we figure out the semantic type like an email?
Okay, if so, then we can automatically apply email validity rules. So those are some of the things we do there.
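As a small illustration of that idea, the sketch below infers whether a column looks like email addresses from sample values and, if so, applies a validity rule. The regex and threshold are illustrative choices, not Soda's actual semantic type detection.

```python
# Small sketch of semantic type inference: decide from sample values whether a
# column looks like emails and, if so, report how many values pass a validity
# rule. The regex and the 80% threshold are illustrative assumptions.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email_column(sample_values, threshold=0.8):
    values = [v for v in sample_values if v]
    if not values:
        return False
    hits = sum(bool(EMAIL_RE.match(str(v))) for v in values)
    return hits / len(values) >= threshold

def email_validity_pct(values):
    non_null = [v for v in values if v]
    valid = sum(bool(EMAIL_RE.match(str(v))) for v in non_null)
    return 100.0 * valid / len(non_null) if non_null else 100.0

column = ["a@example.com", "b@example.org", "not-an-email", "c@example.net", "d@example.com"]
if looks_like_email_column(column):
    print(f"validity: {email_validity_pct(column):.0f}% of values are valid emails")
```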
Anomaly detection is just for us step one into the intelligence roadmap with a lot more,
I think, extremely cool things to come.
But time series anomaly detection is a really low-hanging fruit ultimately.
So we first focused with SodaSQL
on the creation of the metrics
in a scalable, efficient way,
enabling data testing.
And now we've started leveraging more,
okay, what automations, what insights can we derive
from all of those metrics that we've calculated.
Okay, I see.
And, yeah, one
closing brief comment because we're
almost out of time, I think. I
also realized
that you seem to have gotten
some funding recently and so I'm guessing
in terms of company growth
you probably are going to be expanding the team and go-to-market strategy and this kind of thing.
Yes, yes, totally. So I think we completed, about six or seven months apart, both our seed and Series A funding.
That's just simply because of the markets being there,
the product finding a good fit in the markets.
That kind of was a trigger for that.
We completed a total of, I think, 17 or 18 million euros in terms of funding raised, which is quite sizable.
And that gives us plenty of runway.
We're a team of about 25 today.
So the goal there is to gradually expand further now, for the first time also really investing in more of the commercial, go-to-market side of it for the enterprise.
So yeah, we're very excited about that, because the core team and product, all of that is there. We're now more in a mode of scaling that go-to-market motion, setting up customer success and all of the other aspects of a modern-day sales business.
Okay. Great.
Thank you both for
a very interesting discussion
and well, glad you
managed with my very, very long questions.
It was
our pleasure. Thanks for taking the time today.
Thanks for having us.
I hope you enjoyed the podcast.
If you like my work,
you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.