The Data Stack Show - 141: A Journey From Backend Engineer to Data Engineer with Ioannis Foukarakis of Mattermost

Episode Date: June 7, 2023

Highlights from this week's conversation include:
- Ioannis' background and journey in data (2:42)
- RudderStack's transformations feature and examples of its application (4:20)
- Winning the transformations contest at RudderStack (7:21)
- How Ioannis' transformation project works for data governance (9:40)
- Memories from college for Ioannis and Kostas (12:30)
- Getting into the world of software development (17:27)
- The changes in data and engineering over the years (20:29)
- Bridging Java with Python (23:15)
- Dealing with ML workloads in the past vs. workflows of today (26:30)
- Data engineers and ML engineers (33:12)
- Dealing with data in the early stages to ensure reliability later on (38:39)
- What creates problems with data quality? (42:11)
- Exciting developments in data engineering (46:48)
- Final thoughts and takeaways (51:12)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Kostas, fun episode. So RudderStack, the company that helps us put on the show, recently ran a competition around transforming data. And we are going to talk to the winner of that competition. His name is Yanni and he works at
Starting point is 00:00:40 a company called Mattermost, but you actually know Yanni from your days in the university. So I have a feeling this is going to be an extremely fun conversation. I'm going to ask the obvious question, what did he build for this competition? Little preview. It's a pretty cool data governance flavored feature that relies on the concepts of data contracts, but it kind of runs in transit in the pipeline. So pretty interesting approach. So I want to dig into that with him because I think it was a pretty creative effort. But you obviously know a lot about Yanni.
Starting point is 00:01:16 So what are you going to ask? Yeah, I think it would be great to go through his journey because he, just like me, has been around for a while. And he has an interesting journey from graduating to doing a PhD, going into the industry, doing backend engineering to ML engineering to data engineering. So I think he has a lot to share about this journey and in a way how the industry has evolved. And then I think it would be great also to spend some time with him
Starting point is 00:01:56 and learn from his experience about data engineering, ML engineering, the boundaries between the two, and what it takes to make sure that both functions operate correctly. So let's do that and chat with him. And I'm sure there are going to be some fun moments remembering the past there. So let's see. Let's do it. Yanni, welcome to the Data Stack Show and congratulations on winning Rudder Stack's Transformations Challenge. It was really cool to see all the submissions
Starting point is 00:02:32 and you won. Thank you. First of all, thanks for having me. It's great to talk with all of you. Thank you for your words for the submission. I think I was pretty lucky because there were a lot of great submissions out there. Cool. We'll talk about that challenge and we want to hear what you built because it actually
Starting point is 00:02:54 relates to data quality, data contracts, data governance, lots of topics that we've covered on the show that are super relevant. But first, give us your background. You actually have a connection to Costas in your past, which I want to dig into a little bit later. But yeah, give us your background and tell us what you do for work today. Yeah. So I'm a data engineer at Matter at Most. I received my PhD in electrical engineering a few years ago. That's where I actually know Kostas from. After receiving the PhD, I started working as an adjunct lecturer, teaching object-oriented programming with Java, database systems, and software engineering.
Starting point is 00:03:43 Then I moved to the industry, initially as a Java backend engineer, and then later as a machine learning engineer. But, you know, these things are kind of connected and I gradually moved to the latest field, which is data engineering. Love it. And just give us a quick overview.
Starting point is 00:04:03 You work at Mattermost. What does Mattermost do? Just give us a quick... while meeting nation-state level security and compliance requirements. They have this really nice tool and a lot of customers that range from US Air Force to Bank of America, Tesla Motors, Meta, Facebook, and all these great companies. Wow. Incredible. Okay, well, let's talk about the transformations challenge really quick. So Rudders Act, and of course I work for Rudders Act, so I'm familiar with this, but we want to hear it in your words. Our customers love our transformations feature.
Starting point is 00:04:53 First of all, tell us, you explain transformations to us. What is RutterSack's transformations feature? And maybe what are some of the ways that you use it at Mattermost? So, transformations is a way to modify incoming events or filter events before they reach your, the final destination. So as soon as the client fires an event and it's detected by RutterStack, RutterStack runs this transformation. The transformation follows its logic and then stores the result. You can think of it as changing the order where the load and transformation happen. So, it's up to you to decide whether you do the transformation or which transformation
Starting point is 00:05:45 you do after the data are loading the database or before. Got it. And what are some of the ways that you use transformations that matter most? Because you stream data from multiple iOS, Android, web, etc. Yeah, exactly. So we don't currently use transformations, we're investigating and we have a lot of data coming from clients and
Starting point is 00:06:13 we were thinking about modifying the organization of the data and how eventually starting the data warehouse. So one of the things that we were thinking is whether we can filter some events that were coming as noise. But there's also some bugs that might happen.
Starting point is 00:06:35 And these bugs might exist in servers that have an older version of the code. And you can't wait for the customer, you can't force the customer to upgrade something that's installed on-prem. So we can use the transformation in order to reconcile for these bugs that we might identify. Oh, interesting, right. So it's like someone's running an older version of iOS,
Starting point is 00:07:01 so they have a previous version of your instrumentation, but then you update the instrumentation on newer versions, and so you need to fix the payload to sort of align with the new schema. Yeah, that's one way. The other way is that Mattermont Solutions has this server component where you can install it on-prem, and this server component, the maintenance of this component is something that might be outside of our control.
Starting point is 00:07:28 But the data that we receive is something that we can modify using the transformations. Got it. Yeah. Okay. So yeah. So someone installs it on-prem, but you need to modify the data to sort of align it so you cross customer analytics. Super interesting.
Starting point is 00:07:42 Okay. Well, tell us about the transformation you built. What was the original problem you were thinking about when you saw the competition and wanted to build something? It's not something that it's not out there.
Starting point is 00:07:58 It's something that exists in my mind as an idea, and I was planning to experiment with that. And the challenge is what pushed me to actually go on and implement it. So the idea is that when you receive events from various sources, from various teams in the company, you have to agree on the payload so that the data engineers know what to expect, what are the expected fields, properties that end up as columns in the tables and so
Starting point is 00:08:26 on. So one way to, there are various ways to try to enforce these contracts that are agreed between the product teams and the data engineering team. And one option is to have these contracts in the form of version-controlled files, like schemas. And the transformation is checking whether the events are adhering to these schemas that you
Starting point is 00:08:54 have specified so far. Yep, so you have an event coming in, and let's say one of the challenges is either maybe a versioning challenge like we talked about before, where someone's running an old version of the app and so the scheme is different. You need to modify that. Like that could be one way that it doesn't align.
Starting point is 00:09:13 Or the developers implement something that maybe isn't quite accurate or they change something. And so, you know, as a data engineering team, it's a way for you to flag that in transit to make sure that nothing breaks downstream. Yeah, exactly. So let's say that you have anything called that to cart, and you agree that the properties are going to be A, B, and C. But then for some reason, for some news communication, it's different teams, somebody goes ahead and adds an additional property called d. So by taking the schema and depending on how strict we want to be, we can either
Starting point is 00:09:51 discard the event or we can send a notification that we noticed a new event with a different schema and we need to take action. All right. Well, give us just a brief overview of how this works in RutterStack transformations. How did you wire it up?
Starting point is 00:10:09 So I used the transformations in the JavaScript part of the transformations. It's great that RutterStack offers both Python and JavaScript transformations. I went for the JavaScript part because it was the part I wasn't being that confident. I'm more familiar with Python, so I wanted this kind of challenge. So I want to focus both on offering a solution, but also investigate how you can apply good engineering practices in writing transformations. So there's already a repo that's public for this, the link is in the submission. The code there is, it uses a library for passing the schema.
Starting point is 00:11:06 And in the transformation, what you do is you define the schemas, you map the schemas to the event names, then the transformation checks for each event which schema corresponds to this event. event, that's the validation and you can decide whether to discard or just look the error message. There's also in the repo, there's also some additional code about testing the transformation, how to set up test events, CI, CD on the transformation and all these personal practices. Love it. And we'll make sure to include that in the show notes.
Starting point is 00:11:48 Very cool. What a creative way to sort of explore and implementation of data governance with RedRESAC transformations. I love it. What was the most enjoyable part of building the transformation that you built for the competition? Seeing it work.
Starting point is 00:12:10 This dopamine rush. Definitely. But I think it went really smooth. I didn't spend a lot of time writing the code. So I really liked that it was a really fast prototype and then I feel like I spent more time in writing tests and setting up the project structure rather than
Starting point is 00:12:32 actually writing the transformation. So this thing was really nice and the user interface of RouterStack for testing on RouterStack the transformation is also really helpful. Preston Pyshko Great. Well, congrats again. Super cool project.
Starting point is 00:12:51 Okay, I want to ask you another question because your background is really interesting. So you studied electrical engineering, then you got into sort of software development, specifically backend development, and then you got into data engineering. That's a super interesting story. But at the beginning, you were in school with Costas. And so I want to hear maybe like your best and worst memory of Costas when you were in school with him. I think it's the same thing.
Starting point is 00:13:21 I think me and Costas were going to the same lab in order to get free internet. Ah, free internet in the lab. So, I won't say our age, but back then we had dial-up modems, so we didn't have a lot of internet available in our home. So we used to go to specific labs or run some errands for some assistants to the lab so that we can get access to the lab and be able to stay there and code or search the internet program or talk to IRC. Yeah, IRC. Is that where you met? Is doing that? It was this... We were in the same semester.
Starting point is 00:14:15 So, what's that? Yeah, we were in the same class. I mean, it's been a while. Yeah. We don't want to disclose our age, but it's been a while. Yeah. We don't want to disclose our age, but it's been a while, so... Okay. I do have to ask a question here.
Starting point is 00:14:34 Back then, wait, back then at the university, there were a couple of very specific spots where you could meet with people, right? One was the coffee shop at the school, where you would end up there meeting with people and drinking coffee. And then it was the labs where we would do something like what Yanis was describing.
Starting point is 00:14:57 Because keep in mind that back then, having access to good internet connection was pretty much non-existent in Greece. So, okay, that was one of the benefits of being at the school of electrical and computer engineering in Greece. You had access to a very fat pipe for that time, right? Yeah, it was one of the main reasons or i shot my pxd that's that okay i do have to ask though surely at night you weren't like just working on school work like of course you played games in the lab, right, with other people from school, with the internet connection. Yeah, so, okay, now you're getting into the interesting parts of life. The problem is that the more questions you ask,
Starting point is 00:15:56 the easier it's going to be for people to figure out our age. That's the problem here. I didn't mention any names of games. I'm just saying, based on my own experience. Yeah, but you have to do that at the end. We have to talk about it. I think two main things. One was Quake Arena.
Starting point is 00:16:19 We had a server at the university and we had a home with that and I think that was hosted in CS Lab if I remember correctly. I don't remember where it was hosted. I started... The person was hosting it. It was Jorgo Skalas who was hosting that on his own personal server. Anyway, and then there was a lot of... People were getting together, especially, I think, in Shoplamp and playing StarCraft. Yes, I think something like that.
Starting point is 00:16:53 But, I mean, one of the finest memories I have is with SquidCalinga, where we were attending a class, let's say, and everybody logged into the server. We used the names of the professors as nicknames. And it was funny because it was, you know, old CRT screens. And whenever the professor who was teaching at that moment started working towards the back back you could hear alt up and the click
Starting point is 00:17:25 on the screen so it was like a wave coming to the back to each one of the funniest memories i have unbelievable i love i love that that sounds like you know i love that this is happening in the context of a phd that's just so great. No, that was before the PhD. Ah, okay. You matured, yes. Yeah, yeah. Okay, so electrical engineering, Quake Arena. Yanni, why did you get into the world of software?
Starting point is 00:17:59 I went to this school because of software. I liked computers since I was a young one. I got to study something related of software. I liked computers since I was a young one. I wanted to study something related to software. So, how it felt? I mean, all the pieces fell into place and I started that. So, even though I was in an electrical engineering department. Well, practically it was electrical and computer engineering. I focused mostly on the software part because I liked it most. Then I tried a bit academia because it felt, you know, something like the next step to try after a PhD in Greece.
Starting point is 00:18:42 A variety of reasons were there, but I always wanted to also, you know, I didn't want to be only the guy who teaches software. I also wanted to write software. And part of that, part of the economic crisis back in Greece, back then I moved completely to the industry at that point. And I've been enjoying it since then. Yeah. industry at that point. And I've been enjoying it since then. Yeah, and something that's... We need to clarify something here.
Starting point is 00:19:09 The school we attended was like the School of Electrical and Computer Engineering, so the schools were never separated. In the technical university, we weren't, at least. So if you wanted to go and study computer engineering, you had to go and study computer engineering,
Starting point is 00:19:26 you had to torture yourself with electrical engineering for a while. Together with a couple of other things, too. Actually, I have to be honest with myself here. Although at the beginning, I didn't enjoy it that much.
Starting point is 00:19:41 We had all this variety of different stuff to learn and go through. At end it was like a very interesting experience to learn all these different things and have like a much more let's say complete like engineering training ranging from like classical electrical engineering to stuff with telecommunications to electronics to software. Game theory. It was even theoretical stuff at the moment. Yeah, it was pretty theoretical, but anyway, it was good at the end. We suffered a little bit, but at the end I think it paid off. So, Yanni, let's talk a little bit about this journey, right? Because, okay, we've been around
Starting point is 00:20:27 for a while. Software and the industry was obviously completely different back then when we graduated or even when we entered the school. Today, as you said, you have the title of data engineer. Let's talk about
Starting point is 00:20:44 this journey a little bit and your experience, right? How you have experienced the change in the industry. And let's focus on some things that you, at least from your perspective, you find interesting to share and maybe surprising also. So as I said earlier, I started as a Java backend engineer. So, Java was the hot thing back then.
Starting point is 00:21:11 So, it was slow. We were relatively slow when compared to other programming languages. But it was building up at the moment. And there was a great community back in Greece at that time. I tried that, liked it. And we're talking about, you know, early days of Spring and, you know, just moved me away from servers and servlets and all this stuff. Gosh, I forgot the name. Then I had an opportunity to start working remotely. It remotely around 2012, something like that. Then I started working for a data science team.
Starting point is 00:21:55 So initially as a Java backend engineer, who was responsible for integrating machine learning algorithms with the rest of the systems. So the interesting thing there was that it was the first time I started working with machine learning and data science. It was still, you know, kind of the early days of this revolution nowadays. And the feeling I had when I left university is that there are things related to machine learning, data science and so on, but it was a bit of romantic. It wasn't easy to apply them in the industry while we were studying. But I joined that company exactly at the point of this renaissance, let's say, that started with Scikit and NumPy and all these tools.
Starting point is 00:22:56 It was really interesting times because it's not as easy as it is now. So in order to run a Scikit background, you had to compile the whole thing from scratch. So it was challenging even to get things done. We're talking about really zero-dot-something versions. But what I really liked and what really surprised me back then was how if you have a business objective and the proper data and you store the data, you can use algorithms to make estimates and make guesses or to help improve or optimize your objective. And this was really interesting to see in action. Yeah, that's cool. By the way, I mean, we have, let's say traditionally,
Starting point is 00:23:44 when we were talking about like ML and data science, we always have like Python in our mind, right? Like that's like, let's say the most common like language and ecosystem that is used. But you mentioned that like you were doing like backend stuff like in Java, right? So how did this work? Like how do you, let's say, bridge bridge Java with the world of Python?
Starting point is 00:24:09 Initially, we started implementing some of the algorithms in Java back then. So it was basic. It was rather simple algorithms like Apriori or FPGrowth or similar. But then at some point you needed to work with logistic regression or some other things there. You needed to work with Python because there were a lot of libraries. So there was a layer of integration that was responsible for gathering the data and sending them to an inference endpoint. So the Java part was gathering the data, doing all the aggregation and preparation, and then sending them to the Python code.
Starting point is 00:25:01 Okay. So Java is doing more of the data engineering part of the work, right? Pretty much. But this evolved back at this company. It was Upwork at the U.S. main desk back then. So we actually built some tooling that allowed us to have models that were versioned, that we could deploy and allow to work asynchronously and independently. So you could use this tool to give training of models and to keep a log of your experiments and then the Java code would only need to point to the proper model.
Starting point is 00:25:49 Yeah, you were doing like MLOps work back when MLOps was not a term, right? Yes, exactly. But it's not only this. The other part that's really important is about making sure that you have the data. So that's definitely important. So for example, you might need a custom profile.
Starting point is 00:26:12 So you need the daily snaps, because it's hard to go back historically every day and calculate the profile. And you also need to store this so that you have historical data so that you can train your model without having recent data creeping in as past data and all these kind of problems that you can get in ML. So yeah, that's definitely also part of it. It was part of the work and I think it's still one of the most interesting parts. Yeah, absolutely. So let's talk a little bit more about that because actually it's interesting.
Starting point is 00:26:55 So you mentioned a few of the challenges that you had back then, like having these ML workloads. How did you deal with them back then and how you would deal with them today so we can see how these 10 years have changed the way that we are doing things in data engineering? Yeah, so back then, it's funny because it looks like a full circle.
Starting point is 00:27:24 So one thing that we had back then is capturing the data. So we were capturing the data, storing them into a file system or a tree, and then moving them to a data warehouse. And then we used SQL queries for doing the transformation. And the output of the transformation was the same data for the model. And something similar of the transformation was the training data for the model. And something similar for the prediction, although you might need to call some APIs in order to get more recent data, because they might not be yet available in the data
Starting point is 00:27:57 warehouse. So that was one thing. This things over time, you know, we found these tools that were made available with the advent of cloud computing and all these nice tools. So it's still pretty common to, when you have data, to just dump them to an S3 bucket, for example, but you have them available and then you decide what to just dump them to an S3 bucket, for example, but you have them available and then you decide what to do with them. But then you also need to load them to somewhere to perform the transformation.
Starting point is 00:28:34 So for the transformation part, you can either use something like Spark or the different variations that you have out there. You can use SQL, you can use SQL using something like Presto or Athena, or you can use a data warehouse to load the data to the warehouse. So there are a lot of options. The other things, how to open, all these things. And then it's also always depends, it also always depends on the use case. So in some cases where you just need some offline computation, so you can just create a batch of that runs every night, let's say, and calculate some results. And then you cast these results into a database so that it's faster to query them. Or you might need streaming queries, so you might need to
Starting point is 00:29:31 offer a stream like Kafka or whatever, and for each item that's coming out of this stream to perform a prediction. So it really depends on the use case and what you want to achieve. It's like everything in software. You have to understand what's your objective and then start working towards what are the best technologies to use. Yeah.
Starting point is 00:30:01 So if you had to, let's say, someone comes to you and is's like i'm considering like getting into like data engineering like it's a software engineer but they haven't like worked like in data engineering before and they ask you like okay what are like the most common like use cases right like what are the most common things that like as a data engineer, you see there, right? What would these be? What's the first thing that comes in your mind? Let's say three, four most common use cases that pretty much every organization out there when it comes to data engineering deals with. The first one is data collection. So you have various
Starting point is 00:30:45 sources or ingestion. So you have various sources and you want to load them to your systems or to at least store them in a temporary place so that you can use them downstream. And this can be either from databases
Starting point is 00:31:02 or other systems. It can be from user actions and events, and you might need this for product analytics and so on. And the second part is some transformation in order to build some end results that, you know, you gather the data from the various sources and you want to combine them in order to build a story or to try to understand what's happening. So this is another common case. You definitely need at some points to send the data to some other systems like Salesforce
Starting point is 00:31:41 or HR systems or whatever. So kind of reverse CTL so that it's available to sales to do this disintegration. And there's also the data science machine learning part. So these are the most common things. I think I might be forgetting something, but yeah. Why DS and ML is different than, let's say, the rest of the stuff that you're doing with data? So in ML, there's a lot of exploration. So ML is about optimizing things most of the time.
Starting point is 00:32:31 And actually, this is one of the most important things when working with data and especially with ML. The first thing you need to understand is what is your business objective and what you need to understand is what is your business objective and what you want to achieve. One of the most common cases that you might not go as planned is that there is no clear objective.
Starting point is 00:32:57 So usually your objective is not to achieve specific precision and recall, for example, but your objective is to improve sales, for example, or to improve the lifetime prediction or to improve certain prediction and so on. And then you use the models and actually that's why they are called model because we're trying to model the problem in order to provide an estimation and so on. And these are proxy metrics that you can use to work towards your goal. So this is the most important thing to remember. Yeah. And how does it work between the data engineer and the ML engineer? Because the data engineer, let's say you are responsible for making sure that the data engineer and the ML engineer, right? Because the data engineer, let's say you
Starting point is 00:33:45 are responsible for making sure that the data is available, there are pipelines that they prepare the data, blah, blah, blah, all that stuff. And then you have the ML engineer who, as you very well said, it's all about experimentation, right? It's all about being scrappy in a way. There's no order, right? You have to get in front of a bunch of data and trying to do something. So how have you seen successfully, and if you also have seen some unsuccessful attempts, that would be also great to hear from you, working together as data engineers and ML engineers? I think for the ML engineers, the most important
Starting point is 00:34:26 part is to have ease of access to the data and the data being easy to use. So usually data scientists and machine learning engineers are fluent enough in SQL or in other languages so that they can build some transformation in order to be able to use the models. What might be challenging is the whole integration with other systems. Although, you know, it's a blurry line there. Where is the border of ML engineering and data engineering? So let's say that you have a monolith.
Starting point is 00:35:04 Let's say that your company's architecture is a monolith, and you want to get the data in order to work with this data. The ML engineer can't go directly to the production database and use the data from there, because they might run heavy queries, which is really common, so they might need a replica, and they might need to combine it with data coming from
Starting point is 00:35:27 CDP or from something external. So they need to have enough freedom in order to be able to achieve their goals. So how would you define the boundaries between data engineering and ML engineering?
Starting point is 00:35:44 Where do you think that these boundaries should be? So how would you define the boundaries between data engineering and ML engineering? Where do you think that these boundaries should be set? It's really hard to answer this, I think. I mean, these terms are continuously evolving over the past years. I mean, and you know, quite often the type in one company is something different in another company. So I think there's a lot of overlap. that the data engineer is the person who is closer to the ingestion and loading the data and taking the data quality and all of the things. The ML engineer is responsible mostly for making sure that the data are in good enough format so that the data science models can use them. But again, it's a blurry line. It's a lot of overlap there.
Starting point is 00:36:47 Yeah, I'm excited. So let me ask the question in a bit of a different way. So what is something that you have to do as part of an ML task that you hate doing as a data engineer, that you wouldn't like to do? Like in an ideal world, you wouldn't have liked to deal with that. I love software. So I've got all these hats and it's hard to say I hate. So I like challenges.
Starting point is 00:37:14 And so I think, yeah, what most people hate is cleaning data and they expect that the data engineer has clean data, but that really is hard. In most of the cases, cleaning the data engineer has clean data, but that really happens. In most of the cases, cleaning the data is 80% of your time or even more. And that's what's helping. I wouldn't want as an ML engineer to have to write ingestion pipelines for multiple sources. So for example, I would prefer that this is a solved problem when it comes to, you
Starting point is 00:37:48 know, to clean data so that data are gathered in a way so that I can process them all together. I don't have to build custom logic to load everything. Can you elaborate a little bit more on that? You mentioned JSON. What's the hard part or let's say the annoying part of dealing with that data? Word list formats.
Starting point is 00:38:22 If I am to say I don't like something, it's CSV. So for example, CSV has a lot of standards. CSV is not a single format and it's also when misuse does a single format. But you need to define the separators, escape characters, what you do with escape characters, special characters. And then you have all these peculiarities that some tools have. For example, Redshift has its own peculiarities
Starting point is 00:38:51 about handling CSV and stuff like that. So I don't know if this is what you're asking. Actually, it's a very great topic. I have more questions here. So let's go through a little bit of the, let's a very great topic. I have more questions here. So let's go through a little bit of the flow of the work there until the data gets to the ML engineer. So the data comes from various sources and obviously in different formats, different serializations.
Starting point is 00:39:22 And even in the same, let's say serialization, you might have like different schemas, right? So, and going back, like, for example, the way, the reason, like what you submitted and won, like in the content was about like taking the schema of some events, right? So, this first part of dealing with the data, right? It's like you can have data coming in Avro, data coming in Protobuf, CSV, JSON, and I don't know, what else? How big part of work that the engineer has to do
Starting point is 00:40:01 is to deal with all these different formats and making sure that they don't get into the way of whatever happens later on, right? Yeah. So you need to think about the layers, let's say, of the data or the zones that are sometimes called. So you have to have something like a landing zone where all this data land on your system
Starting point is 00:40:25 and you need to start processing and adding checks if possible to make sure that if something changes, you either identify it fast enough or you raise an error. If something breaks, you can figure this out as soon as possible. So yeah, luckily nowadays it's easy to ingest most of these formats, and most of them are pretty common on how to handle them. them, there is a need for you to know the specifics of each format. I mean, because I think the biggest problem is the representation of the data, not the format of the data, the representation. So by this I mean, how would you model something that is optional?
Starting point is 00:41:25 Would you consider null as a valid value or something as a missing value? And let's say that you have a JSON document. And what does it mean that the property is missing on a specific row? Does it mean that it's unknown or that the user didn't define it? So this is a bit of the annoying part because it requires a lot of back and forth with the source. And sometimes you don't have access to the team that creates this data. But yeah, so you definitely need this first layer to clean the data
Starting point is 00:42:01 and to have them in a format that it's pretty solid. Not super solid, it's still flexible, it doesn't vary from the original source, but it does the basic cleaning, renaming, applies your basic conventions and so on. So if we were to talk about data quality, like what are the parameters of data quality? We talked about the semantics of how data is represented in the different formats and all these things. What else creates problems with data quality? That's a great question, and I don't have one answer. So
Starting point is 00:42:48 I think that each organization defines data quality in a different way. There are various dimensions of data quality that you can discuss about, but depending on your use cases and what you want to achieve, you might want to focus on some of them. So you can think about consistency, like having multiple sources of truth for the same data, and whether these sources are consistent, whether you have duplicate values, etc. You can think about completeness, whether you have missing data, which is also important.
Starting point is 00:43:28 You can think about accuracy, how you present the data to reality. Whether the data are in the expected format. So let's say that you have a date. You need to know that it's in the expected format, so let's say that you have a date, you need to know that it's in the proper format so that it does not get misinterpreted. Whether the data are fresh, presence is another one I can think from the top of my mind. And there's also two more that are sometimes overlooked. So one is accessibility.
Starting point is 00:44:07 So how easy is it to access data? So does it take a long time for some member of the team to get access to the data? Do they have to wait for some, I don't know, either technical or business reason. And finally, how easy it is to use the data. So if you just give someone an S3 packet with all the files, it might not be easy for them to use. But if you've done the ingestion
Starting point is 00:44:39 and you have proper naming in the columns, etc., it would be way more easy for them to work on that. Again, there might be way more, and it definitely depends on the use case. For example, if you are working on open-source datasets, some of these things might be more important than the rest. Or you might want to also have versioning as part of the data quality.
Starting point is 00:45:09 So, yeah, definitely a lot of things. Yeah. So, okay, dealing with data quality pretty much, I guess, on a daily basis, what do you think is missing right now in terms of tooling out there to make your life easier? I think there's a lot of tools out there right now in terms of tooling out there to make your life easier? I think there's a lot of tools out there right now. I think
Starting point is 00:45:30 that they're trying to... You have a lot of freedom with most of these tools. Actually, this is especially for modern projects, this is one of the main challenges. There is a lot of freedom on how you structure your project. So there are emerging practices right now. So some of these things have been solved in the past,
Starting point is 00:45:59 but you know, we have to adapt them to the new tooling, etc. So, solid definitions of data quality and solid examples on how to measure it is one thing. And then, the other challenge I see is that most of the tools focus on specific parts of data quality. So, for example, you might have a tool that focuses on identifying missing values. But you might not be able to reuse this tool in order to find whether the distribution of the values
Starting point is 00:46:41 changes over time. So you will need a different tool for that. So it's becoming challenging. It's a lot of tools to achieve the same goal. Yeah, makes sense. That's interesting. I feel like if you think about it, data quality itself requires a lot of processing on its own, right?
Starting point is 00:47:00 There's a lot of analytics that you need to do on the data just to measure these things. It's interesting. So, okay, one last question from me and then I'll hand the microphone back to Eric. What is one thing that has happened in the past couple of months or a year or whatever that in your space, in data engineering, really got you excited for the future. And you can't include Rutter Stack in your... Yeah, yeah. Okay. Yeah. So it can be a tool, it can be a new technology, it can be like a practice, whatever. So I really like how TPT is maturing over time.
Starting point is 00:47:50 So that's one thing. What I really liked was this rapid AI ecosystem with the data frames and how you can use them. I haven't used it in production, just for personal experimentation, but this sounds like a really interesting approach. Cool, that's awesome. Eric, all yours again. All right, well, I'm actually going to conclude on a question for both of you,
Starting point is 00:48:23 and that is, are there any games that you still play? Either on the PC or with a console or even on your phone. Candy Crush doesn't count. Okay, Jan, do you want to go first?
Starting point is 00:48:39 My kid owns my consoles. So I don't have a lot of time for games, but usually it's me helping him on some of the games. So we really enjoy playing games together. We have a Nintendo Switch, so we have this Mario Party and Super Mario Kart and all these things.
Starting point is 00:49:05 But lately, he's been really excited about an older game called Subnautica. It's a survival game, and he likes exploring the world there. Very cool. And me, unfortunately, I'm not allowed to get close to computer games. Is that because of consequences you've experienced in the recent past? Yeah. I don't know. I hope in the future I'll be able to play again, to be honest.
Starting point is 00:49:39 By the way, one of the things that I noticed at some point is, okay, we used to like to play like Quake Arena, for example, right? Back then, when we were in our early 20s, or late teens or whatever, we were doing pretty amazing stuff. I remember, especially some folks that were playing with us i mean it was like so hard like to beat them like the how fast they were like all that stuff and then i remember like trying to play one of these games again like after like a couple of years and i felt so old like my like like you can't like like there's like zero chances of like being able like to felt so old.
Starting point is 00:50:27 There's zero chances of being able to compete. You lost your edge. Yeah. I remember I had a friend, another guy who was the same age. They come back from work and they get on Xbox a gang
Starting point is 00:50:44 of old dudes. They get on one of these first-person shooters online. They know that it's going to be a massacre. They are all going to die. They are not going to enjoy. But they figured out a way to enjoy not enjoying the game
Starting point is 00:50:59 by just being all together, making fun, having a beer, and getting on the game and getting massacred by kids. So, I don't know. I see myself probably being one of these guys one day. But we'll see. Love it.
Starting point is 00:51:16 Well, thank you for sharing stories about Quake Arena and naming your characters after your professors. Yanni, incredible story. Thank you so much for sharing. We learned a ton, especially about data engineering, ML, and the influence of software development on data engineering.
Starting point is 00:51:34 So thank you so much, and congrats again on winning the Ruddersack Transformations Challenge. Thanks for having me. Costas, what an awesome episode with Yanni. I mean, it's clear that the big takeaway is that if you neglect your Quake Arena practice, those skills will atrophy over time and will cause regrets for you. I actually made me think about Duke Nukem.
Starting point is 00:52:07 You remember Duke Nukem? Yeah, I do. That was, again, like you had those friends who were just like, how did you get so good at this? It's amazing. It's interesting how, I mean, if you think about, because we had this this conversation with james and like i started like remembering like how we were you know like playing games and stuff like that back then and so there were like a couple of things like in quake arena that
Starting point is 00:52:39 okay you've had like first of all like it was crazy to see with a railgun, like, the aim that some people had and, like, how they could do, like, headshots. That was, like, crazy. I mean, I don't know what kind of, like, reflexes, like, I never, like, managed to get to that level. But, like, there were people that, like, when they entered the arena, like,
Starting point is 00:53:02 you would just leave because it didn't make sense. It was almost like cheating, you know? And they were not cheating. Yep. And usually this was the result of spending way too many hours playing instead of studying. Oh, 100%.
Starting point is 00:53:18 Like an effect on your... Oh, yeah. I mean, you're talking about people who would take the mouse apart and like clean the ball and like clean the mouse pad before the game you know because they had like a very ball the ball the ball like something that doesn't exist anymore okay yes exactly yeah but super important because like you know once you got good, you could tell if the ball got dirty. Like, it wasn't.
Starting point is 00:53:47 Yeah, 100%. And, yeah, measuring the ping to the server, like, because. Yeah. That was. Oh, yeah. So good. The other thing that I think, like, it's a testament of, like, the human creativity here is that there was, like,
Starting point is 00:54:01 this thing going, like, the rocket jump, right? Which, technically, right? Which technically, but with the default settings, you couldn't do it because you were actually exploding yourself, right? But we were changing the settings so you could use the rocket jumping
Starting point is 00:54:17 and that completely changed the way that you were playing, right? So actually, it's very interesting to see how people were not just like playing but also how to like innovating on top of like the game to make it like a new game right 100 i think that's actually a really good you know that was really fun to talk about that when we think about the episode and talking with Yanni, you know, who now works as a data engineer at Mattermost, you know, who does really interesting work around super high security team collaboration for the Air Force and for, you know, Bank of America and other huge companies. He's a systems thinker, right? He breaks down systems. I mean, he studied electrical engineering and we got a really interesting view at sort of his art going from electrical engineering, backend software development, ML engineering, and then now data
Starting point is 00:55:20 engineering. And hearing about that story was absolutely fascinating. But it's true. I mean, it sounds funny, but the way that you talked with him about trying to break down the Quake Arena game and like execute that, you know, during class and other things like that, it was a bunch of really smart, creative people like solving a systems problem. Right. And so that's really, really cool to me to hear his story and i think anyone who's interested in sort of transitioning from different disciplines and taking the best of that discipline with you to the next one this is a really great episode
Starting point is 00:55:58 oh yeah 100 like young's like give like i think, a very pragmatic description of how the fundamentals at the end do not change. I think he mentioned also a couple of times of how we go back in circles in a way. And things that we were doing in the past, we do again today and all these things. And that's not actually a bad thing. It's a good thing.
Starting point is 00:56:26 Innovation doesn't mean throwing away completely what was happening in the past and bringing a completely different paradigm. It's much more, let's say, iterative in a way. And there are fundamentals that remain there, no matter what. Some things cannot change. Like, the fundamentals are there.
Starting point is 00:56:46 And so investing time in, like, learning these fundamentals and enjoying working with these fundamentals, I think it's probably, like, the most important thing that, like, someone can do in their career. And it doesn't matter. Like, if you have them, you can go through software engineering, backend engineering, frontend engineering, ML to data engineering, and whatever is next.
Starting point is 00:57:12 So I think it's a great episode for anyone who wants to learn about that. I agree. Well, thank you for joining us. Definitely subscribe if you haven't. Tell a friend. Give us feedback. Head to the website, fill out the form. Send us an email. Actually, send an email to the website. Fill out the form. Send us an email.
Starting point is 00:57:26 Actually, send an email to brooks at datastackshow.com. He'll respond faster than the air costs us. And we will catch you on the next one. We hope you enjoyed this episode of
Starting point is 00:57:36 The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback.
Starting point is 00:57:44 You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
