The Data Stack Show - 141: A Journey From Backend Engineer to Data Engineer with Ioannis Foukarakis of Mattermost
Episode Date: June 7, 2023
Highlights from this week's conversation include:
Ioannis' background and journey in data (2:42)
RudderStack's transformations feature and examples of its application (4:20)
Winning the transformations contest at RudderStack (7:21)
How Ioannis' transformation project works for data governance (9:40)
Memories from college for Ioannis and Kostas (12:30)
Getting into the world of software development (17:27)
The changes in data and engineering over the years (20:29)
Bridging Java with Python (23:15)
Dealing with ML workloads in the past vs. workflows of today (26:30)
Data engineers and ML engineers (33:12)
Dealing with data in the early stages to ensure reliability later on (38:39)
What creates problems with data quality? (42:11)
Exciting developments in data engineering (46:48)
Final thoughts and takeaways (51:12)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Costas, fun episode. So Rudderstack,
the company that helps us put on the show, recently ran a competition around transforming data.
And we are going to talk to the winner of that competition. His name is Yanni and he works at
a company called Mattermost, but you actually know Yanni from your days in the university.
So I have a feeling this is going to be an extremely fun conversation. I'm going to ask
the obvious question, what did he build for this competition? Little preview. It's a pretty cool
data governance flavored feature that relies on the concepts of data contracts, but it kind of runs in transit in the pipeline.
So pretty interesting approach.
So I want to dig into that with him
because I think it was a pretty creative effort.
But you obviously know a lot about Yanni.
So what are you going to ask?
Yeah, I think it would be great to go through his journey
because he, just like me, has been
around for a while.
And he has an interesting journey from graduating to doing a PhD, going into the industry, doing
backend engineering to ML engineering to data engineering.
So I think he has a lot to share about this journey and in a way how the industry has evolved.
And then I think it would be great also to spend some time with him
and learn from his experience about data engineering,
ML engineering, the boundaries between the two, and what it takes to make sure
that both functions operate correctly. So let's do that and chat with him. And I'm sure there
are going to be some fun moments remembering the past there. So let's see. Let's do it.
Yanni, welcome to the Data Stack Show
and congratulations on winning
RudderStack's Transformations Challenge.
It was really cool to see all the submissions
and you won.
Thank you.
First of all, thanks for having me.
It's great to talk with all of you.
Thank you for your kind words about the submission.
I think I was pretty lucky because there were a lot of great submissions out there.
Cool.
We'll talk about that challenge and we want to hear what you built because it actually
relates to data quality, data contracts, data governance, lots of topics that we've covered
on the show that are super relevant.
But first, give us your background. You actually have a connection to Costas in your past, which I want to dig into a little bit later.
But yeah, give us your background and tell us what you do for work today.
Yeah. So I'm a data engineer at Mattermost. I received my PhD in electrical engineering a few years ago.
That's where I actually know Kostas from.
After receiving the PhD, I started working as an adjunct lecturer,
teaching object-oriented programming with Java, database systems, and software engineering.
Then I moved to the industry,
initially as a Java backend engineer,
and then later as a machine learning engineer.
But, you know, these things are kind of connected
and I gradually moved to the latest field,
which is data engineering.
Love it.
And just give us a quick overview.
You work at Mattermost.
What does Mattermost do? Just give us a quick overview.
Mattermost is an open-source collaboration platform for technical teams that can be self-hosted, so organizations can use it while meeting nation-state-level security and compliance requirements. They have this really nice tool and a lot of customers that range from the US Air Force to Bank of America, Tesla Motors, Meta (Facebook),
and all these great companies.
Wow. Incredible.
Okay, well, let's talk about the transformations challenge really quick.
So RudderStack, and of course I work for RudderStack, so I'm familiar with this, but we want to hear it in your words.
Our customers love our transformations feature.
First of all, explain transformations to us.
What is RudderStack's transformations feature?
And maybe what are some of the ways that you use it at Mattermost?
So, transformations are a way to modify or filter incoming events before they reach the final destination. As soon as the client fires an event and it's received by RudderStack, RudderStack runs the transformation. The transformation applies its logic and then stores the result. You can think of it as changing the order in which load and transformation happen. So it's up to you to decide whether you do a transformation, and which transformation, after the data is loaded into the database or before.
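For readers who want to picture it, here is a minimal sketch of what such a transformation can look like. It assumes the transformEvent(event, metadata) hook that RudderStack's user transformations expose; the event name and property names are made up for illustration, not Mattermost's actual setup.

```python
# Minimal sketch of a user transformation: filter noisy events and patch
# payloads from older clients before they reach the destination.
# The "page_ping" event name and the property names are hypothetical.

def transformEvent(event, metadata):
    # Filter: drop events we consider noise so they never land in the warehouse.
    if event.get("event") == "page_ping":
        return None  # returning nothing drops the event

    # Modify: older clients send user_id inside properties; newer ones use userId.
    props = event.get("properties", {})
    if "userId" not in event and "user_id" in props:
        event["userId"] = props.pop("user_id")

    return event
```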
Got it. And what are some of the ways that you use transformations at Mattermost?
Because you stream data from multiple iOS, Android, web, etc.
Yeah, exactly. So we don't currently use transformations, we're investigating them. We have a lot of data coming from clients, and we were thinking about modifying the organization of the data and how it eventually lands in the data warehouse. So one of the things we were thinking about is whether we can filter out some events that were coming in as noise. But there are also bugs that might happen, and these bugs might exist on servers that run an older version of the code.
And you can't wait for the customer,
you can't force the customer to upgrade something
that's installed on-prem.
So we can use a transformation to correct for these bugs that we might identify.
Oh, interesting, right.
So it's like someone's running an older version of iOS,
so they have a previous version of your instrumentation,
but then you update
the instrumentation on newer versions, and so you need to fix the payload to sort of
align with the new schema.
Yeah, that's one way.
The other way is that Mattermost has this server component that you can install
it on-prem, and this server component, the maintenance of this component is something that might
be outside of our control.
But the data that we receive is something that we can modify using the transformations.
Got it.
Yeah.
Okay.
So yeah.
So someone installs it on-prem, but you need to modify the data to sort of align it so you can do cross-customer analytics.
Super interesting.
Okay.
Well, tell us about the transformation
you built. What was the original problem
you were thinking about when
you saw the competition and wanted
to build something?
It's not something that was already out there. It's something that existed in my mind as an idea, and I was planning to experiment with it.
And the challenge is what pushed me to actually go on and implement it.
So the idea is that when you receive events from various sources, from
various teams in the company, you have to agree on the payload so that the
data engineers know what to expect, what are the expected fields, properties
that end up as columns in the tables and so
on.
So there are various ways to try to enforce these contracts that are agreed
between the product teams and the data engineering team.
And one option is to have these contracts in the form of version-controlled files, like
schemas.
And the transformation is
checking whether the events are
adhering to these schemas that you
have specified so far.
Yep, so
you have an event coming in, and
let's say one of the challenges
is either maybe a
versioning challenge like we talked about before, where someone's running an old version of the app and so the schema is different.
You need to modify that.
Like that could be one way that it doesn't align.
Or the developers implement something that maybe isn't quite accurate or they change something.
And so, you know, as a data engineering team, it's a way for you to flag that in transit to make sure that nothing breaks downstream.
Yeah, exactly.
So let's say that you have an event called add to cart, and you agree that the properties are going to be A, B, and C.
But then for some reason, some miscommunication, because it's different teams, somebody goes ahead and adds an additional property called D.
So by taking the
schema and depending on how strict
we want to be, we can either
discard the event or we can
send a notification that we noticed
a new event with a different schema
and we need to take action.
All right.
Well, give us just a brief overview of how
this works in RudderStack transformations.
How did you wire it up?
So I used the JavaScript flavor of transformations. It's great that RudderStack offers both Python and JavaScript transformations. I went for JavaScript because it was the part I wasn't that confident in. I'm more familiar with Python, so I wanted that kind of challenge. I wanted to focus both on offering a solution and on investigating how you can apply good engineering practices when writing transformations.
So there's already a public repo for this; the link is in the submission. The code there uses a library for parsing the schema. And in the transformation, what you do is you define the schemas, you map the schemas to the event names, and then for each event the transformation checks which schema corresponds to that event, runs the validation, and you can decide whether to discard the event or just log the error message.
In the repo there's also some additional code about testing the transformation, how to set up test events, CI/CD for the transformation, and all these best practices.
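To make the shape of that idea concrete, here is a rough sketch in Python. Ioannis' actual submission was written in JavaScript, so this is not his code; the jsonschema library, the add_to_cart contract, the STRICT flag, and the transformEvent hook are illustrative assumptions.

```python
# Sketch only: map event names to JSON Schemas, validate each incoming event,
# and either drop the event or log the violation.
from jsonschema import validate, ValidationError

SCHEMAS = {
    "add_to_cart": {
        "type": "object",
        "properties": {
            "product_id": {"type": "string"},
            "quantity": {"type": "integer"},
            "price": {"type": "number"},
        },
        "required": ["product_id", "quantity"],
        "additionalProperties": False,  # an unexpected property "d" fails validation
    }
}

STRICT = False  # strict mode drops violating events; lenient mode keeps them


def transformEvent(event, metadata):
    schema = SCHEMAS.get(event.get("event"))
    if schema is None:
        return event  # no contract agreed for this event, pass it through

    try:
        validate(instance=event.get("properties", {}), schema=schema)
    except ValidationError as err:
        # In a real pipeline this could trigger a notification instead of printing.
        print(f"Contract violation for {event.get('event')}: {err.message}")
        return None if STRICT else event  # returning None discards the event

    return event
```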
Love it.
And we'll make sure to include that in the show notes.
Very cool.
What a creative way to sort of explore an implementation of data governance with RudderStack transformations.
I love it.
What was the most enjoyable part of building
the transformation that you built for the competition?
Seeing it work.
This dopamine rush.
Definitely.
But I think it went really smooth.
I didn't spend a lot of time writing the code.
So I really liked that it was a really fast prototype and then
I feel like I spent more time
in writing tests and setting
up the project structure rather than
actually writing the transformation.
So this thing was really nice
and RudderStack's user interface for testing the transformation is also really helpful.
Great. Well, congrats again. Super cool project.
Okay, I want to ask you another question because your background is really interesting. So you
studied electrical engineering, then you got into sort of software development, specifically
backend development, and then you got into data engineering.
That's a super interesting story.
But at the beginning, you were in school with Costas.
And so I want to hear maybe like your best and worst memory of Costas
when you were in school with him.
I think it's the same thing.
I think me and Costas were going to the same lab in order to get free internet.
Ah, free internet in the lab.
So, I won't say our age, but back then we had dial-up modems, so we didn't have a lot of internet available in our home.
So we used to go to specific labs, or run some errands for the lab assistants, so that we could get access to the lab and be able to stay there and code, or search the internet, program, or talk on IRC.
Yeah, IRC. Is that where you met? Doing that?
It was this... We were in the same semester.
So, what's that?
Yeah, we were in the same class.
I mean, it's been a while.
Yeah.
We don't want to disclose our age, but it's been a while, so...
Okay.
I do have to ask a question here.
Back then, wait, back then
at the university,
there were a couple of very specific
spots where you could
meet with people, right? One was
the coffee shop at the school, where you would end up there meeting with people
and drinking coffee.
And then it was the labs where we would do something like what Yanis was describing.
Because keep in mind that back then, having access to good internet connection was pretty much non-existent in Greece.
So, okay, that was one of the benefits of being at the school of electrical and computer engineering in Greece.
You had access to a very fat pipe for that time, right?
Yeah, it was one of the main reasons I started my PhD.
That's... okay, I do have to ask though: surely at night you weren't just working on schoolwork. Of course you played games in the lab, right, with other people from school, with the internet connection?
Yeah, so, okay, now you're getting into the interesting parts of life.
The problem is that the more questions you ask,
the easier it's going to be for people to figure out our age.
That's the problem here.
I didn't mention any names of games.
I'm just saying, based on my own experience.
Yeah, but you have to do that at the end.
We have to talk about it.
I think two main things.
One was Quake Arena.
We had a server at the university, and I think it was hosted in the CS lab, if I remember correctly. Actually, I don't remember where it was hosted. The person hosting it was Jorgo Skalas, who was hosting it on his own personal server. Anyway, and then there was a
lot of...
People were getting together,
especially, I think, in Shoplamp
and playing StarCraft.
Yes, I think something like that.
But, I mean, one of the finest memories I have is with Quake Arena, where we were attending a class, let's say, and everybody was logged into the server. We used the names of the professors as nicknames. And it was funny because these were, you know, old CRT screens. And whenever the professor who was teaching at that moment started walking towards the back, you could hear Alt-Tab and the click on the screens, so it was like a wave moving towards the back. It's one of the funniest memories I have.
Unbelievable. I love that. I love that this was happening in the context of a PhD. That's just so great.
No, that was before the PhD.
Ah, okay.
You matured, yes.
Yeah, yeah.
Okay, so electrical engineering, Quake Arena.
Yanni, why did you get into the world of software?
I went to this school because of software. I had liked computers since I was young, and I wanted to study something related to software. So, how did it feel? I mean, all the pieces fell into place and I started down that path. So even though I was in an electrical engineering department (well, practically it was electrical and computer engineering), I focused mostly on the software part because I liked it the most.
Then I tried academia a bit, because it felt, you know, like the next step to try after a PhD in Greece. There were a variety of reasons, but I always also wanted to, you know... I didn't want to be only the guy who teaches software. I also wanted to write software. Partly because of that, and partly because of the economic crisis back in Greece at the time, I moved completely to industry at that point. And I've been enjoying it since then.
Yeah, and something that...
We need to clarify something here.
The school we attended was the School of Electrical and Computer Engineering, so the two were never separated, at our technical university at least. So if you wanted to go and study computer engineering, you had to torture yourself with electrical engineering for a while.
Together with a couple of other things, too.
Actually,
I have to be
honest with myself here. Although at the
beginning, I didn't enjoy it that much.
We had all this variety of different
stuff to learn and go through. In the end, it was a very interesting experience to learn all these different things and have a much more, let's say, complete engineering training, ranging from classical electrical engineering to telecommunications to electronics to software.
Game theory. There was even theoretical stuff in there.
Yeah, it was pretty theoretical, but anyway, it was good in the end. We suffered a little bit, but in the end I think it paid off. So, Yanni, let's talk a little bit about this journey, right?
Because, okay, we've been around
for a while. Software
and the industry was obviously
completely different back then when
we graduated or even when we
entered the school.
Today, as you
said, you have the title of data
engineer. Let's talk about
this journey a little bit
and your experience, right?
How you have experienced the change in the industry.
And let's focus on some things that you,
at least from your perspective,
you find interesting to share and maybe surprising also.
So as I said earlier, I started as a Java backend engineer. Java was the hot thing back then. It was slow, relatively slow compared to other programming languages, but it was building up at the moment. And there was a great community back in Greece at that time. I tried it, liked it. And we're talking about, you know, the early days of Spring, which moved me away from servers and servlets and all this stuff. Gosh, I forgot the name. Then I had an opportunity to start working remotely, around 2012, something like that.
Then I started working for a data science team.
So initially as a Java backend engineer, who was responsible for integrating machine learning algorithms with the rest of the systems.
So the interesting thing there was that it was the first time I started working with
machine learning and data science.
It was still, you know, kind of the early days of the revolution we see nowadays. And the feeling I had when I left university was that there were things related to machine learning, data science and so on, but it was a bit romantic; it wasn't easy to apply them in industry while we were studying. But I joined that company exactly at the point of this renaissance, let's say, that started with scikit-learn and NumPy and all these tools. Those were really interesting times because it wasn't as easy as it is now. In order to run scikit-learn back then, you had to compile the whole thing from scratch.
So it was challenging even to get things done. We're talking about really zero-dot-something
versions. But what I really liked and what really surprised me back then was how if you
have a business objective and the proper data and you store the data,
you can use algorithms to make estimates and make guesses or to help improve or optimize
your objective. And this was really interesting to see in action.
Yeah, that's cool. By the way, I mean, we have, let's say traditionally,
when we were talking about like ML and data science,
we always have like Python in our mind, right?
Like that's like, let's say the most common like language
and ecosystem that is used.
But you mentioned that like you were doing like backend stuff
like in Java, right?
So how did this work?
Like how do you, let's say, bridge Java with the world of Python?
Initially, we started implementing some of the algorithms in Java back then.
So it was basic. It was rather simple algorithms like Apriori or FPGrowth or similar.
But then at some point you needed to work with logistic regression or some other things
there.
You needed to work with Python because there were a lot of libraries. So there was a layer of integration that was responsible for gathering the data
and sending them to an inference endpoint.
So the Java part was gathering the data,
doing all the aggregation and preparation, and then sending them to the Python code.
Okay. So Java is doing more of the data engineering part of the work, right?
Pretty much.
But this evolved over time at this company. It was Upwork, oDesk back then.
So we actually built some tooling that allowed us to have models that were versioned, that
we could deploy and allow to work asynchronously and independently.
So you could use this tool to run training of models and to keep a log of your experiments
and then the Java code would only need to point to the proper model.
Yeah, you were doing like MLOps work
back when MLOps was not a term, right?
Yes, exactly.
But it's not only this.
The other part that's really important
is about making sure that you have the data.
So that's definitely important.
So for example, you might need a customer profile. You need daily snapshots, because it's hard to go back historically and recalculate the profile for every day. And you also need to store these so that you have historical data, so that you can train your model without having recent data creeping in as past data, and all these kinds of problems that you can get in ML. So yeah, that's definitely also part of it. It was part of the work, and I think it's still one of the most interesting parts.
Yeah, absolutely.
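As a small illustration of that snapshotting idea, here is a sketch under assumed column names (user_id, event_time, order_id, amount); the point is that each day's profile is computed only from data known at that date and stored, so later training joins point-in-time features instead of today's values.

```python
# Hypothetical daily profile snapshot: aggregate only events up to the snapshot
# date and append the result to a history table keyed by (user_id, snapshot_date).
import pandas as pd

def daily_profile_snapshot(events: pd.DataFrame, snapshot_date: str) -> pd.DataFrame:
    cutoff = pd.Timestamp(snapshot_date)
    past = events[events["event_time"] < cutoff]  # no future data leaks in
    profile = (
        past.groupby("user_id")
            .agg(orders=("order_id", "nunique"), total_spent=("amount", "sum"))
            .reset_index()
    )
    profile["snapshot_date"] = cutoff
    return profile

# Appending one snapshot per day means a model trained on labels from 2023-01-15
# only ever sees the profile as it looked on 2023-01-15.
```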
So let's talk a little bit more about that because actually it's interesting.
So you mentioned a few of the challenges that you had back then, like having these ML workloads.
How did you deal with them back then
and how you would deal with them today
so we can see how these 10 years
have changed the way that we are doing things
in data engineering?
Yeah, so back then, it's funny
because it looks like a full circle.
So one thing that we had back then was capturing the data. We were capturing the data, storing them in a file system or S3, and then moving them to a data warehouse. Then we used SQL queries for doing the transformation, and the output of the transformation was the training data for the model. And it was something similar for the prediction, although you might need to call some APIs in order to get more recent data, because it might not be available yet in the data warehouse.
So that was one thing.
These things changed over time, you know, with the tools that were made available with the advent of cloud computing and all these nice tools. So it's still pretty common, when you have data, to just dump them to an S3 bucket, for example, so you have them available and then you decide what to do with them.
But then you also need to load them to somewhere to perform the transformation.
So for the transformation part, you can either use something like Spark
or the different variations that you have out there.
You can use SQL, using something like Presto or Athena, or you can load the data into a data warehouse. So there are a lot of options, and then all the other things around that. And it also always depends on the use case.
So in some cases you just need some offline computation, so you can just create a batch job that runs every night, let's say, and calculates some results. Then you store these results in a database so that it's faster to query them. Or you might need streaming, so you might need to consume a stream like Kafka or whatever, and for each item that's coming out of this stream, perform a prediction. So it really depends on the use case
and what you want to achieve.
It's like everything in software.
You have to understand what's your objective
and then start working towards
what are the best technologies to use.
Yeah.
So let's say someone comes to you and says, I'm considering getting into data engineering. They're a software engineer, but they haven't worked in data engineering before, and they ask you: what are the most common use cases, the most common things that you see as a data engineer? What would these be? What's the first thing that comes to your mind? Let's say the three or four most common use cases that pretty much every organization out there deals with when it comes to data engineering.
The first one is data collection, or ingestion. You have various sources and you want to load them into your systems, or at least store them in a temporary place so that you can use them downstream.
And this can be either from
databases
or
other systems. It can be from user actions and events, and you might need
this for product analytics and so on.
And the second part is some transformation in order to build some end results that, you
know, you gather the data from the various sources and you
want to combine them in order to build a story or to try to understand what's happening.
So this is another common case.
You definitely need at some points to send the data to some other systems like Salesforce
or HR systems or whatever.
So kind of reverse ETL, so that it's available to sales, to do this integration.
And there's also the data science machine learning part.
So these are the most common things.
I think I might be forgetting something, but yeah.
Why are DS and ML different from, let's say, the rest of the stuff that you're doing with data?
So in ML, there's a lot of exploration. So ML is about optimizing things most of the time.
And actually, this is one of the most important things
when working with data and especially with ML.
The first thing you need to understand is what your business objective is and what you want to achieve.
One of the most common reasons things might not go as planned is that there is no clear objective. So usually your objective is not to achieve a specific precision and recall, for example; your objective is to improve sales, or to improve lifetime prediction, or a certain prediction, and so on.
And then you use the models and actually that's why they are called model because
we're trying to model the problem in order to provide an estimation and so on. And these are proxy
metrics that you can use to work towards your goal. So this is the most important thing to remember.
Yeah. And how does it work between the data engineer and the ML engineer? Because as the data engineer, let's say you are responsible for making sure that the data is available, that there are pipelines that prepare the data, and all that stuff. And then you have the ML engineer who, as you very well said, is all about experimentation, right? It's all about being scrappy in a way. There's no order, right? You have to get in front of a bunch of data and try to do something.
So how have you seen successfully, and if you also have seen some unsuccessful attempts,
that would be also great to hear from you, working together as data engineers and ML
engineers?
I think for the ML engineers, the most important
part is to have ease of access to the data and the data being easy to use. So usually data
scientists and machine learning engineers are fluent enough in SQL or in other languages so
that they can build some transformation in order to be able
to use the models.
What might be challenging is the whole integration with other systems.
Although, you know, it's a blurry line there.
Where is the border of ML engineering and data engineering?
So let's say that you have a monolith.
Let's say that your company's architecture is a monolith,
and you want to get the data in order to work with this data.
The ML engineer can't go directly to the production database
and use the data from there,
because they might run heavy queries,
which is really common,
so they might need a replica,
and they might need to combine it with data coming from
CDP or from something
external.
So they need to have enough freedom
in order to be able to achieve
their goals.
So how would you define the boundaries between data engineering and ML engineering? Where do you think these boundaries should be set?
It's really hard to answer this, I think. I mean, these terms have been continuously evolving over the past years, and, you know, quite often the title in one company means something different in another company. So I think there's a lot of overlap. I'd say that the data engineer is the person who is closer to ingestion and loading the data and taking care of data quality and all of those things. The ML engineer is responsible mostly for making sure that the data are in a good enough format so that the data science models can use them. But again, it's a blurry line. There's a lot of overlap there.
Yeah, I'm excited. So let me ask the question in a bit of a different way.
So what is something that you have to do as part of an ML task that you hate doing as a data
engineer, that you wouldn't like to do? Like, in an ideal world, you wouldn't have to deal with it.
I love software.
So I've got all these hats
and it's hard to say I hate.
So I like challenges.
And so I think, yeah,
what most people hate is cleaning data, and they expect that the data engineer has clean data, but that is really hard. In most of the cases, cleaning the data is 80% of your time or even more.
I wouldn't want as an ML engineer to have to write ingestion pipelines for multiple sources.
So for example, I would prefer that this is a solved problem when it comes to, you
know, to clean data so that data are gathered in a way so that I can process them
all together.
I don't have to build custom logic to load everything.
Can you elaborate a little bit more on that? You mentioned JSON.
What's the hard part
or let's say the annoying part
of dealing with that data?
Well, let's say, formats. If I am to say I don't like something, it's CSV. So for example, CSV has a lot of standards. CSV is not a single format, but it's also often misused as if it were a single format.
But you need to define the separators, escape characters, what you do with escape characters,
special characters.
And then you have all these peculiarities that some tools have.
For example, Redshift has its own peculiarities
about handling CSV and stuff like that.
So I don't know if this is what you're asking.
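To show why "CSV is not a single format", here is a tiny Python sketch; the sample rows and dialect choices are invented for the example.

```python
# Two files carrying the same record can use completely different CSV dialects.
# You have to pin down delimiter, quoting, and escaping before loading, or guess
# them with csv.Sniffer and accept that the guess can be wrong.
import csv
import io

semicolon_style = 'user_id;note\n42;"said ""hi""; then left"\n'
tab_style = "user_id\tnote\n42\thello world\n"

# Explicit dialect: delimiter and quote handling spelled out.
rows = list(csv.reader(io.StringIO(semicolon_style), delimiter=";", quotechar='"'))
print(rows[1])  # ['42', 'said "hi"; then left']

# Sniffed dialect: csv.Sniffer guesses the delimiter from a sample.
dialect = csv.Sniffer().sniff(tab_style)
rows = list(csv.reader(io.StringIO(tab_style), dialect=dialect))
print(rows[1])  # ['42', 'hello world']
```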
Actually, it's a great topic. I have more questions here.
So let's go through a little bit of the flow of the work there
until the data gets to the ML engineer.
So the data comes from various sources and obviously in different formats,
different serializations.
And even in the same, let's say serialization,
you might have different schemas, right? And going back, for example, what you submitted and won with in the contest was about checking the schema of events, right? So, this first part of dealing with the data: you can have data coming in Avro, data coming in Protobuf, CSV, JSON, and I don't know what else. How big a part of the work that the engineer has to do is dealing with all these different formats and making sure that they don't get in the way of whatever happens later on, right?
Yeah.
So you need to think about the layers, let's say, of the data, or the zones, as they are sometimes called. You have to have something like a landing zone where all this data lands on your system, and you need to start processing and adding checks, if possible, to make sure that if something changes, you either identify it fast enough or you raise an error. If something breaks, you can figure it out as soon as possible. So yeah, luckily nowadays it's easy to ingest most of these formats, and it's pretty common knowledge how to handle most of them. Still, there is a need for you to know the specifics of each format.
I mean, because I think the biggest problem is the representation of the data, not the
format of the data, the representation.
So by this I mean, how would you model something that is optional?
Would you consider null as a valid value or something as a missing value?
And let's say that you have a JSON document.
And what does it mean that the property is missing on a specific row?
Does it mean that it's unknown or that the user didn't define it?
So this is a bit of the annoying part
because it requires a lot of back and forth with the source.
And sometimes you don't have access to the team that creates this data.
But yeah, so you definitely need this first layer to clean the data
and to have them in a format that's pretty solid. Not super strict, it's still flexible, it doesn't stray far from the original source, but it does the basic cleaning, renaming, applies your basic conventions, and so on.
So if we were to talk about data quality,
like what are the parameters of data quality?
We talked about the semantics of how data is represented
in the different formats and all these things.
What else creates problems with data quality?
That's a great question, and I don't have one answer. So
I think that each organization defines data quality in a different way. There are various
dimensions of data quality that you can discuss about, but depending on your use cases and what you want to achieve, you might want to focus on some of them.
So you can think about consistency,
like having multiple sources of truth for the same data,
and whether these sources are consistent,
whether you have duplicate values, etc.
You can think about completeness,
whether you have missing data, which is also important.
You can think about accuracy, how well the data represent reality. Whether the data are in the expected format: so let's say that you have a date, you need to know that it's in the proper format so that it does not get misinterpreted. Whether the data are fresh; freshness is another one I can think of off the top of my head.
And there's also two more that are sometimes overlooked.
So one is accessibility.
So how easy is it to access data?
So does it take a long time for some member of the team to get access to the data?
Do they have to wait because of some, I don't know, technical or business reason?
And finally, how easy it is to use the data.
So if you just give someone
an S3 bucket with all the files,
it might not be easy for them to use.
But if you've done the ingestion
and you have proper naming
in the columns, etc.,
it would be way easier for them to work with it.
Again, there might be way more,
and it definitely depends on the use case.
For example, if you are working on open-source datasets,
some of these things might be more important than the rest.
Or you might want to also have versioning as part of the data quality.
So, yeah, definitely a lot of things.
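As a toy illustration of a few of those dimensions, here is a sketch of the kind of checks a pipeline might run; the column names, the tiny sample frame, and the returned dictionary of issues are all invented for the example, and real setups usually lean on dedicated tools.

```python
# Toy checks for a few of the dimensions above: completeness (missing values),
# consistency (duplicate business keys), format (unparseable dates), freshness.
import pandas as pd

def run_checks(df: pd.DataFrame) -> dict:
    issues = {}

    # Completeness: share of missing values per column.
    missing = df.isna().mean()
    issues["missing_ratio"] = missing[missing > 0].to_dict()

    # Consistency: duplicate rows for the same business key.
    issues["duplicate_order_ids"] = int(df.duplicated(subset=["order_id"]).sum())

    # Format/accuracy: dates that do not parse are flagged instead of silently misread.
    parsed = pd.to_datetime(df["created_at"], errors="coerce", utc=True)
    issues["unparseable_dates"] = int(parsed.isna().sum()) - int(df["created_at"].isna().sum())

    # Freshness: hours since the newest successfully parsed record.
    issues["hours_since_last_record"] = (
        (pd.Timestamp.now(tz="UTC") - parsed.max()).total_seconds() / 3600
    )
    return issues

df = pd.DataFrame({
    "order_id": [1, 1, 2],
    "created_at": ["2023-06-01T10:00:00Z", "2023-06-01T10:00:00Z", "not a date"],
    "amount": [10.0, 10.0, None],
})
print(run_checks(df))
```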
Yeah.
So, okay, dealing with data quality pretty much, I guess, on a daily basis,
what do you think is missing right now in terms of tooling out there
to make your life easier?
I think there's a lot of tools out there right now.
I think
that
they're trying to...
You have a lot of freedom
with most of these tools.
Actually, especially for modern projects, this is one of the main challenges: there is a lot of freedom in how you structure your project.
So there are emerging practices right now. So some of these things have been solved in the past,
but you know, we have to adapt them to the new tooling, etc.
So, solid definitions of data quality and solid examples on how to measure it is one thing.
And then, the other challenge I see is that
most of the tools focus on specific parts of data quality.
So, for example, you might have a tool
that focuses on identifying missing values.
But you might not be able to reuse this tool
in order to find whether the distribution of the values
changes over time.
So you will need a different tool for that.
So it's becoming challenging.
It's a lot of tools to achieve the same goal.
Yeah, makes sense.
That's interesting.
I feel like if you think about it,
data quality itself requires a lot of processing on its own, right?
There's a lot of analytics that you need to do on the data
just to measure these things.
It's interesting. So, okay, one last question from me and then I'll hand the microphone back to Eric.
What is one thing that has happened in the past couple of months or a year or whatever that in your space, in data engineering,
really got you excited for the future. And you can't include RudderStack in your...
Yeah, yeah. Okay. Yeah.
So it can be a tool, it can be a new technology, it can be
like a practice, whatever. So I really like how dbt is maturing over time.
So that's one thing.
What I really liked was the RAPIDS AI ecosystem, with the dataframes and how you can use them. I haven't used it in production,
just for personal experimentation,
but this sounds like a really interesting approach.
Cool, that's awesome.
Eric, all yours again.
All right, well, I'm actually going to conclude
on a question for both of you,
and that is, are there any games
that you still play?
Either on the PC or
with a console or
even on your phone. Candy Crush doesn't
count.
Okay, Yanni, do you want to go
first?
My kid
owns my consoles.
So
I don't have a lot of time for games,
but usually it's me helping him on some of the games.
So we really enjoy playing games together.
We have a Nintendo Switch, so we have this Mario Party
and Super Mario Kart and all these things.
But lately, he's been really excited about an older game called Subnautica.
It's a survival game, and he likes exploring the world there.
Very cool.
And me, unfortunately, I'm not allowed to get close to computer games.
Is that because of consequences you've experienced in the recent past?
Yeah.
I don't know.
I hope in the future I'll be able to play again, to be honest.
By the way, one of the things that I noticed at some point is,
okay, we used to like to play like Quake Arena, for example, right?
Back then, when we were in our early 20s, or late teens or whatever,
we were doing pretty amazing stuff.
I remember, especially, some folks that were playing with us, I mean, it was so hard to beat them, how fast they were and all that stuff. And then I remember trying to play one of these games again after a couple of years, and I felt so old. There's zero chance of being able to compete.
You lost your edge.
Yeah.
I remember I had a friend,
another guy who was
the same age.
They'd come back from work and get on Xbox, a gang of old dudes.
They get on one of these
first-person shooters online.
They know that it's going to be a massacre.
They are all going to die.
They are not going to enjoy.
But they figured out a way to enjoy
not enjoying the game
by just being all together,
making fun, having a beer,
and getting on the game
and getting massacred by kids.
So, I don't know.
I see myself probably being one of these guys one day.
But we'll see.
Love it.
Well, thank you for sharing stories about Quake Arena
and naming your characters after your professors.
Yanni, incredible story.
Thank you so much for sharing.
We learned a ton,
especially about data engineering, ML,
and the influence of software development
on data engineering.
So thank you so much,
and congrats again on winning
the RudderStack Transformations Challenge.
Thanks for having me.
Costas, what an awesome episode with Yanni.
I mean, it's clear that the big takeaway is that if you neglect your Quake Arena practice,
those skills will atrophy over time and will cause regrets for you.
It actually made me think about Duke Nukem.
You remember Duke Nukem?
Yeah, I do.
That was, again, like you had those friends
who were just like, how did you get so good at this?
It's amazing.
It's interesting. I mean, if you think about it, because we had this conversation with Yannis, I started remembering how we were, you know, playing games and stuff like that back then. And there were a couple of things in Quake Arena that... okay, first of all, it was crazy to see, with a railgun, the aim that some people had and how they could do headshots. That was crazy.
I mean, I don't know what kind of reflexes... I never managed to get to that level. But there were people that, when they entered the arena, you would just leave because it didn't make sense.
It was almost like cheating, you know?
And they were not cheating.
Yep.
And usually this was the result
of spending way too many hours
playing instead of studying.
Oh, 100%.
Like an effect on your...
Oh, yeah.
I mean, you're talking about people who would take the mouse apart and clean the ball and clean the mouse pad before the game, you know, because they had...
The ball, the ball, something that doesn't exist anymore.
Okay, yes, exactly. Yeah. But super important, because, you know, once you got good, you could tell if the ball got dirty.
Like, it wasn't.
Yeah, 100%.
And, yeah, measuring the ping to the server, like, because.
Yeah.
That was.
Oh, yeah.
So good.
The other thing that I think is a testament to human creativity here is that there was this thing, the rocket jump, right? Which, technically, with the default settings, you couldn't do, because you were actually exploding yourself, right? But we were changing the settings so you could use the rocket jump, and that completely changed the way that you were playing, right? So actually, it's very interesting to see how people were not just playing but also innovating on top of the game to make it a new game, right?
100%. I think that's actually a really good point. You know, it was
really fun to talk about that when we think about the episode and talking with Yanni, you know, who now works as a data engineer at Mattermost, you know, who does really interesting work around super high security team collaboration for the Air Force and for, you know, Bank of America and other huge companies. He's a systems thinker, right? He breaks down systems. I mean,
he studied electrical engineering and we got a really interesting view of sort of his arc, going
from electrical engineering, backend software development, ML engineering, and then now data
engineering. And hearing about that story was absolutely fascinating. But it's true.
I mean, it sounds funny, but the way that you talked with him about trying to break
down the Quake Arena game and like execute that, you know, during class and other things
like that, it was a bunch of really smart, creative people like solving a systems problem.
Right.
And so it's really, really cool to me to hear his story. And I think for anyone who's interested in sort of transitioning between disciplines and taking the best of each discipline with you to the next one, this is a really great episode.
Oh yeah, 100%. Yannis gave, I think, a very pragmatic description of how the fundamentals, in the end, do not change.
I think he also mentioned a couple of times how we go around in circles, in a way.
And things that we were doing in the past,
we do again today and all these things.
And that's not actually a bad thing.
It's a good thing.
Innovation doesn't mean throwing away
completely what was happening in the past
and bringing a completely different paradigm.
It's much more, let's say, iterative in a way.
And there are fundamentals that remain there,
no matter what.
Some things cannot change.
Like, the fundamentals are there.
And so investing time in, like, learning these fundamentals
and enjoying working with these fundamentals,
I think it's probably, like, the most important thing
that, like, someone can do in their career.
And it doesn't matter.
Like, if you have them, you can go through software engineering,
backend engineering, frontend engineering,
ML to data engineering, and whatever is next.
So I think it's a great episode for anyone who wants to learn about that.
I agree.
Well, thank you for joining us.
Definitely subscribe if you haven't.
Tell a friend.
Give us feedback.
Head to the website, fill out the form. Send us an email. Actually, send an email to brooks at datastackshow.com. He'll respond faster than Eric or Kostas.
And we will catch you
on the next one.
We hope you enjoyed
this episode of
The Data Stack Show.
Be sure to subscribe
on your favorite podcast app
to get notified
about new episodes
every week.
We'd also love
your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.