The Changelog: Software Development, Open Source - Productionising real-world ML data pipelines (Interview)

Episode Date: February 14, 2020

Yetunde Dada from QuantumBlack joins Jerod for a deep dive on Kedro, a workflow tool that helps structure reproducible, scalable, deployable, robust, and versioned data pipelines. They discuss what Kedro's all about and how it's "changing the landscape of data pipelines in Python", the ins/outs of open sourcing Kedro, and how they found early success by sweating the details. Finally, Jerod asks Yetunde about her passion project: a virtual reality film which debuted at the Sundance Film Festival in January.

Transcript
Starting point is 00:00:00 Bandwidth for ChangeLog is provided by Fastly. Learn more at Fastly.com. We move fast and fix things here at ChangeLog because of Rollbar. Check them out at Rollbar.com. And we're hosted on Linode cloud servers. Head to Linode.com slash ChangeLog. This episode is brought to you by DigitalOcean. DigitalOcean's developer cloud makes it simple to launch in the cloud and scale up as you grow. They have an intuitive control panel, predictable pricing, team accounts, worldwide availability with a 99.99 uptime SLA, and 24-7, 365 world-class support to back that up. DigitalOcean makes it easy to deploy, scale, store, secure, and monitor your cloud environments.
Starting point is 00:00:45 Head to do.co/changelog to get started with a $100 credit. Again, do.co/changelog. What's up? Welcome back, everyone. This is the Changelog, a podcast featuring the hackers, the leaders, and the innovators in the world of software development. I'm Adam Stachowiak, Editor-in-Chief here at Changelog. On today's show, we're talking with Yetunde Dada from QuantumBlack for a deep dive on Kedro. Kedro is a workflow tool that helps structure reproducible, scalable, deployable, robust, and versioned data pipelines. We cover what Kedro is all about and how it's changing the landscape of data pipelines in Python, the ins and the outs of open sourcing Kedro, and how they found early success by sweating the details.
Starting point is 00:01:29 We also talked with Yetunde about her passion project, a virtual reality film which debuted at the Sundance Film Festival in January. So Yetunde, we're here to talk about Kedro, which is a Python library dealing with data pipelines. First of all, thanks so much for joining us on the show. Thank you so much for having me. And we should give a thanks to Waylon Walker, who requested this episode just recently. Actually, sometimes our requests take months, maybe even years for us to put the show together.
Starting point is 00:01:58 This one was relatively recently. I should mention to our listeners that we do take requests at changelog.com slash request. If you have a topic or a guest or anything you'd like to hear on the show, just head over there and let it be known. We're happy to make shows that y'all want to listen to. So we have Yutunda here, who is the product manager at Quantum Black and is on this Kedro project. Tell us about Quantum Black and what y'all are up to. Sure. So Quantum Black is an advanced analytics company that was acquired by McKinsey a few years ago.
Starting point is 00:02:34 So we're in the consulting gig. And what we basically think of ourselves of is we're the black ops teams that go out to different companies around the world and across many different industries. And what we do is effectively deliver functional machine learning code in production. So kind of like our success model looks at not only do we solve very difficult use cases that have very challenging design problems, but we also have clients that are able to maintain their own code bases when we go. So that's what we do.
Starting point is 00:03:06 And Kedro is your first open source product that I read that's the case. Tell me about the experience. Why is it open source and all that stuff? It's actually very interesting that you've asked that. Kedro, yes, is the first open source product that we've ever had coming out of McKinsey and Quantum Black. It was quite an experience open sourcing it because it's very difficult to get corporate open source rolling, especially in a place where it hasn't happened before.
Starting point is 00:03:33 We open sourced Kedro purely for client need. So what would happen effectively was that our data science teams would go out and use Kedro on their engagements wherever they were going for their consulting work. And they would encounter, obviously work with the client's data scientists and data engineers and find that everyone enjoyed, like really enjoys like using Kedro. They suddenly have questions around what do we do to keep on using the tool after we leave? And the answer to us became, let's actually open source this tool so that our clients have access to it. And we'll be able to access upgrades, we'll be able to access an open source community so that they can help further engage with them as they use the tool. So that became the primary lever for us open sourcing. But we're very excited about open
Starting point is 00:04:22 source in Quantum Black. We just recently open sourced another tool called CausalNix. CausalNix is a causality data science library, which helps somewhat tackle that problem between causality and correlation when you're busy assessing different data sets. We think of it as kind of like a way to really give back to the community and really make a stake in some of the thought leadership pieces that we maintain at Quantum Black. We have a very exciting data science R&D function that is quite active with trying to solve issues around explainability, around fairness, around live model performance tracking. And really being able to share snippets of that knowledge, I think, is very exciting for us. That is very exciting. So I should mention that Waylon, when he requested the show,
Starting point is 00:05:08 he did say that Kedro is changing the landscape of data pipelines in Python. I'm curious if you agree with that. And then secondly, if you think, I mean, it seems like to me, if it wasn't open source, that could not be the case. Maybe it could be anyways. But I'm wondering your thoughts
Starting point is 00:05:20 on the open sourcing of Kedro and how that has helped it to perhaps change this landscape, as he says? I'm so excited that Waylon finds that Kedro is changing the landscape for data science. So maybe I should actually describe what Kedro does. Please do. So we think of Kedro as your way to apply software engineering best practice to your data science and data engineering code. So it follows this whole principle that if we get everyone operating at a higher standard for software engineering
Starting point is 00:05:49 best practice, we make code that is reproducible, it's deployable, it's versioned, and it makes it easier to put that code into production when it's time to go live on models. It kind of fits into this whole problem space where for the longest time, data scientists have not necessarily, no, they've used code to solve the problem as opposed to seeing code as the end goal, which is what is required for that code to be functional in the machine learning practice
Starting point is 00:06:19 and to have value for business. So Kedra essentially says, you have your standard way of working, we're just going to make some slight modifications to it so that you have more robust code. And in the end, you get like this production ready code. So I'm really excited that he says that we're changing the basis of like, how data science should be done. Because we believe in like, getting to the heart of users and really just having empathy for your workflow and being able to enforce like software engineering best practice is easy using Kedro. Other exciting things that are happening because teams are using Kedro from what we've seen from like the open source community
Starting point is 00:06:54 is that people are I think following the trend of creating reusable analytics code as well because there's a certain workflow consistency that happens when you have well structured code. And they're now able to build their own reusable analytics libraries on top of Kedro, which has been quite exciting to see. So I was digging through Kedro docs as I was trying to figure out exactly what is this thing that I'm looking at, which is a fun endeavor for many open source projects. It's like, okay, I see what's written on the label. What exactly is going on here? First of all, the documentation is really good, which is indicative of the Python community. I'm not sure about the data science community, but y'all are killing it there. And the Hello World is a very nice example of knowing exactly what Kedro starts you off with.
Starting point is 00:07:39 And as I was reading it, it was more akin to me as a software developer as a conventional framework. A thing that establishes norms, gives you buckets, very much kind of a convention over configuration concept of, here's where this goes, here's where you need to put your pipelines here, put your data here, et cetera, et cetera. And if we all kind of follow these norms, life is better, life is smoother, it's more production ready. Is that kind of the idea? That's actually completely the idea that it's based on. You'll note that, I don't know if you're familiar with a tool called Cookie Cutter.
Starting point is 00:08:14 Cookie Cutter follows this whole methodology. It's another open source project that if we work in a standard way, so if we have the same setup in terms of our directories and where we place certain files, it guarantees workflow consistency that makes it easier for you to work with yourself in future if you have to look back at your code
Starting point is 00:08:33 and also easier for other people to work with you because in a sense, your code becomes self-documenting because you know where things are stored. There's a derivative of cookie cutter called cookie cutter data science and Kedron is derivative of cookie cutter called cookie cutter data science. And Kedron is actually built on top of cookie cutter data science,
Starting point is 00:08:48 but takes into account how teams also need to construct data and ML pipelines, as well as also thinking about things like how do we do data abstraction when trying to load data? Because there's many different ways of loading data using different Python libraries. And also things like how do we do versioning of these data sets or models themselves so that we can actually reproduce runs. Because the complexity of like reproducing code, well, if you were building a software application and reproducing a previous version of the code is simply, it might just be having a look at the logs to see what happened when a failure happened and then having a look at the logs to see what happened when a failure happened, and then having a look at the code that created that, and any other inputs that you have. In machine learning,
Starting point is 00:09:29 that becomes a bit more complex because of the different variables that you would have had that would have made that machine learning pipeline run, for instance. And Kendra kind of tries to, I think, I did say it's basically software engineering best practice. It does try to implement those good sense principles in the data science world. So that's awesome and definitely something that is needed in the data science world. As has been reported to me, I'm not living there, so just going based on what you're saying and what people on Practical AI have also said our other show about, machine learning. Now, I'm coming from the other direction, so I have a software background and less of a machine learning background. And I'm not going to speak for our audience, but I'm sure there's plenty of them out there who are in similar place as me. So when I look at
Starting point is 00:10:12 Kedro, I see the conventions, I see the best practices in place. And I'm like, yes, this makes total sense. But then I get to the other bits that I'm less familiar with. And I start to wonder like, what are these things and how do they fit together for me to implement something in production and make it useful? So when we start talking about data pipelines, nodes, etc, etc, can you explain some of the jargon, perhaps, and some of the things involved in starting a Kedro project and maybe taking one into a production application? Okay, so getting started with Kedro means after you've pip installed the library, simply running a CLI command for Kedro new will have you creating your project template. It's actually where you find the Hello World example is actually based.
Starting point is 00:10:59 So you can get started in like less than 30 seconds with your first Kedro project. If you execute another command after you've created your project template, Kedro run, you would have made your first Kedro pipeline run. So when I break down the components that are needed to actually understand Kedro, there are five, you'll understand that we have the project template, which is basically our series of files and folders that are generated by the cookie cutter data science
Starting point is 00:11:25 template. And it's kind of like best practice for where do I store bits and parts of my code. Although there are quite a few directories that will focus on even where you put your Jupyter notebooks to where you store your results. The two most important folders in that directory are where you put your configuration, which basically looks at, if you wanted, in simply explained terms, how do we remove hard-coded variables from our machine learning code? And there are two types you typically find,
Starting point is 00:11:54 file paths and parameters. So how do we remove that from our machine learning code so that it's more reproducible? So you'll put your configuration in a specific place. We also talk about our data catalog, which is one of the library components of Kedro, which is basically a series of extendable data connectors. If you extend our abstract dataset class, you can actually customize and create your own datasets for things that we don't typically support. But we support most file formats, so CSV all the way to Hadoop files and Spark tables as well.
Starting point is 00:12:28 Then we talk about the source folder being the next most important directory for you. And in the source folder, you'll actually find the concepts of nodes and the pipeline. Your nodes are just pure Python or PySpark functions that accept an input and have an output. You use these nodes to actually construct a pipeline because it's basically a series of nodes which have their inputs and outputs mapped to each other. So the pipeline can actually essentially work out where your dependencies are when you're building the pipeline. And that is the basis essentially of a Kedra pipeline. And that is the basis essentially of a Kedro pipeline. And those are the primary concepts. They're obviously additional features built on top of all of these features, including the data catalog, allowing you to version your
Starting point is 00:13:18 machine learning models and your data sets. And there's all sorts of ways that you can extend Kedro however you feel. Okay, I think I'm getting it. So you have your data catalog, which is your input data, whether it be CSVs or JSON or SQLite, right? The data catalog manages both loading and the saving of your data. So if you specify where the output should be saved, then Kedro will handle that too. Oh, okay. So it's on both ends. Then in the middle, then Kedrel will handle that too. Oh, okay. So it's on both ends. Then in the middle, then,
Starting point is 00:13:50 you have your pipeline, which is stitching together these nodes. Is that correct? Like you're saying, the nodes are pure functions, so maybe it trains a model, maybe it does something else. I don't know what predicts something.
Starting point is 00:14:02 And then you have your pipeline, which is basically saying call function a then b then c then d or whatever is there more to a pipeline than just like here's the order of operations i know that's basically it there are all sorts of things that you can do with the pipeline so there are additional features that you can do with it but you've basically nailed down why kitra is so easy to learn okay because it's basically specifying which function do i need what is its input what is this output and then let me put them all in this like pipeline format the pipeline syntax is very easy very cool so then the last bit is like the output then you say this goes back to
Starting point is 00:14:36 the data catalog but what is the results of these pipelines producing is it on a case-by-case basis based on what you're trying to gather from your model? Or is it always the same kind of thing that comes out on the other end? What are our results look like? It depends on what you're using Kedro for. So an output could be a machine learning model. So even just a pickle file, what you would then use to then test future predictions with. Or it could be, I don't know, maybe a table that you need to be loaded into some sort of like database. Okay.
Starting point is 00:15:11 So it depends on like what you actually need the output to be because that's what you would set up your pipeline to do. I think I'm getting it now. So when we go back to the Hello World, because it's a simple one for us all to wrap our head around, it's the Iris dataset, which is a well-known data set of the different pictures of flowers, right?
Starting point is 00:15:29 These three different types of iris plants. And the goal is to classify, right? So you give it, do you give it an image and it says what kind of an iris it is? Or do you give it a, maybe just give her some measurements of the petals? Talk us through the Hello World. We can use that as a reference point for conversation cool um so the way that this pipeline is set up it has four nodes so one which will actually take
Starting point is 00:15:55 in an iris data set so the actual values and it will split the data into train and test samples and also do some sort of like data pre-processing as well to clean it up so it's in a format that can be used. Then the next three nodes will train a model, then create the prediction model for you. And then how this pipeline ends, it ends slightly differently. So this is why it always links back to what problem were you trying to solve?
Starting point is 00:16:21 Because it allows you to report accuracy. So you eventually in the last node can feed in a value and then report accuracy on which based on which values which flower which iris flower are we looking at i see so it's different than like you said it's all based on what you're trying to accomplish and in this case that's what it's trying to output and so there you have it the accuracy is important yes and that's why it's trying to output. And so there you have it. The accuracy is important. Yes. And that's why it will actually just tell you what the accuracy is at the end of the pipeline.
Starting point is 00:17:13 So if you did a, like I mentioned, Kedra new and then Kedra run, once you've changed into the project directory that's created for you, then it'll actually just tell you what the accuracy of this model is. How often do you think about internal tooling? I'm talking about the back office apps, the tool the customer service team uses to access your databases, the S3 uploader you built last year for the marketing team, that quick Firebase admin panel that lets you monitor key KPIs, and maybe even the tool that your data science team had together so they can provide custom ad spend insights.
Starting point is 00:17:35 Literally every line of business relies upon internal tooling, but if I'm being honest, I don't know many engineers out there who enjoy building internal tools, let alone getting them excited about maintaining or even supporting them. And this is where Retool comes in. Companies like DoorDash, Brex, Plaid, and even Amazon, they use Retool to build internal tooling super fast. Retool gives you a point, click, drag and drop interface that makes it super simple to build these types of interfaces in hours, not days. Retool connects to any database or API.
Starting point is 00:18:08 For example, to pull data from Postgres, just write a SQL query and drag and drop a table onto the canvas. And if you want to search across those fields, add a search input bar and update your query, save it, share it. It's too easy. Learn more and try it free at retool.com slash changelog. Again, retool.com slash changelog. So your first open source project at Quantum Black and so far a success, perhaps changing the landscape of data pipelines. I like that. I like that line. And with any open source project, there's always big wins and there's often big fails, struggles along the way. You mentioned the reasoning behind it was your clients needed some sort of a sustainability plan for these tools that you were working on for them or for their models to continue on after a contract was over.
Starting point is 00:19:11 What have been some of the struggles or the things you had to consider as you took Kedro open source? Any insights you can share with the community? One of the most surprising ones was actually our name. So we were known by something else internally. And when we were like, well, we want to go open source, we're going to go public, and everyone was on board, we kind of heard from legal that, hey, we actually need to check out your name to check that it's not infringing on anyone's trademarks. So I think this is a bit unusual for open source projects, where it's just like a personal project. You would never think that, you know, I need to check my name for trademark infringement and it needs to be kind of like
Starting point is 00:19:49 unique and still Kidder is a bit of an abstract name for what the tool does but we stuck by it and it works but I've actually come to discover that a lot of corporate open source projects including some friends at Uber have spoken about doing the same thing. So I was like, well, at least it's not unique to us that we had to undertake this. Another challenge that we had was really thinking about how do we support our users when they're using Kedro? Because our initial request for a public Slack or a public Gitter was initially paused and they told us we need to do some risk assessment on those platforms to work out, for instance, how do we enforce a code of conduct for users' behavior on the platforms as well so that we have free and fair communications on them. So we knew that beyond GitHub issues and Stack Overflow, both of those channels that we do watch, we needed to do something else.
Starting point is 00:20:45 So you actually see that we spent a lot of time on our documentation. So I'm really glad that you're enjoying the documentation because we knew that when we have users coming to Kedro, they want to get started very quickly. And if they run into issues, they need to be able to go to the docs and be able to troubleshoot their way through at the very least if they're not going to talk to us on GitHub issues, which is sometimes a big thing for people to post questions
Starting point is 00:21:08 or even stack overflow. But you'll be excited to know that we will be getting public communication channels soon. So that's in the works for us to eventually release to the users. And then the third thing was that we didn't expect the community to pick up this quickly. That was something that we were not prepared for on the team. So Kendra was maintained by nine people. So it's me as product manager, we've got
Starting point is 00:21:30 Ivan Danov as the Kedro technical lead, and then an amazing group of like software engineers, machine learning engineers, visual designers, and a front-end engineer who maintains the data pipeline visualization tool called Kedra Viz. And we were not ready for, I think, how many GitHub issues were created, pull requests were created because people wanted to contribute code back to Kedra and make it better with us, and how many questions we get on Stack Overflow. So that was a bit overwhelming. And I think we're still finding different strategies to manage it. The one that we do do on the team, because we're inevitably the ones that know the most about Kedro
Starting point is 00:22:11 and how to fix different issues, is that we have a rotating role on the team called the wizard. And it's basically we have a wizard and then the rest of the group are the warriors. And if you're the wizard for the week, your job is to basically field all user queries, both in our internal channels as well as our external channels as well,
Starting point is 00:22:31 to try and make sure that users get like quick and speedy responses to their questions or to any issues or feature requests that they've raised. So that was one of the things that made, one of the strategies that have made things a bit better. But we are looking at ways to even scale support for our open source users in the future. So we're looking into new roles that McKinsey and Quantum Black
Starting point is 00:22:53 have never hired before, because we've never open sourced anything. But developer advocates or DevRel, I think it's sometimes called developer relations, to come on to the team and help us really scale out what that model will look like for us. So yeah, those are some of the unexpected, I think three of the unexpected things that we didn't realize when we were open sourcing would happen. Awesome. Very good. Let's go back to number one. Let's go back to the name. What does it mean, Kedro? It means the center in Greek. So we kind of look at it, Kedro as being the center of your data pipeline
Starting point is 00:23:25 because of the way that it forms kind of like infrastructure-stash foundation of your analytics project or ETL pipeline or whatever you need to build. So Kedro means the center. The center. And so after you came up with the name, that's when the, was it a trademark search went out?
Starting point is 00:23:44 I'm sure there was other searches as well did you search on github and twitter and domains were these also things that you took into account before picking the name yes it was actually a lot more involved naming and quantum black has been quite exciting because ketra is one of three main products and we'll be expanding like the range of products that we have so naming is um it's always an issue in labs where we sit because we sit within quantum black labs which does all the product engineering and yeah the naming exercise was a stakeholder management exercise because we had to have happy branding because i was helping with the product marketing and our head of a global head of marketing Catherine Shenton
Starting point is 00:24:25 but we also had to make sure that the team is also happy to represent the name as well so beyond having I think agreed consistency on a few names they also had to go through the social media check they also had to go through reference meaning checks or even checks in other languages as well because Quantum Black is a global organization and then they had to go through reference meaning checks or even checks in other languages as well, because Quantum Black is a global organization. And then they had to go through the legal check to check wherever there were trademarks for this name in the many jurisdictions that McKinsey operates in before we came to Kedro. So Kedro was marked as the least risky name. What were some of the riskier names? Can you share them or are they just on the cutting room floor, never to be mentioned again?
Starting point is 00:25:04 Some of the more interesting names were, I think when there was no agreement on names, consisted of names like Burano, Pumlico, which is a plumbing company in London. And the list of five names that included Kedro in the end that went for final legal verification included Braze, which kind of worked because it referred to welding or stitching together pieces, which is what the pipeline does. We had Knittic as well, which is kind of with a K, which was trying to also playing with that whole thing of knitting things together. Knitting. Yeah, Knittic. You see, it doesn't work when you try and say it. So that's one of the reasons why
Starting point is 00:25:40 this name failed. And then Spindle, because we're trying to reference a whole thing of many threads joining together. I then Spindle, because we're trying to reference a whole thing of many threads joining together. I like Spindle. Yeah. But all of those names failed the legal verification of that figure in the end. There's no worse feeling than when you come up with the perfect name, and then you go do all your due diligence,
Starting point is 00:26:00 and somebody else is using it. And you're like, no! Yeah. The checks also included like github repositories as well because we knew we'd have to survive um and that's it so it was really just it was an adventure to find the name and i'm really glad that we settled on kedra it was the one that fit some good things about the name kedra so i think i either wrote this up at one point or i need to write up the anatomy of a good name and i'll say two syllables, great. The fact that you can spell it
Starting point is 00:26:25 easily after hearing the word, great. It's not ambiguous in that way. But it is intriguing because when you hear it, you don't know what it means. There's no immediate attachment, at least in my mind, maybe to native Greek speakers that there are, but to an American mind, there's no immediate attachment to anything. So it kind of stands on its own. That's true. That's true. We never thought of it, and thank you. I will give those reasons to the team.
Starting point is 00:26:52 You get the seal of approval. Let's move on to the point, too, that you mentioned, kind of the community side and the documentation and how to figure out kind of where the community meets, the code of conduct, all of these different things. I'm curious because you seem to have checked all the boxes. We look at a lot of open source projects. I look at Kedro and thought, okay, Apache license, they got a license, they got a code
Starting point is 00:27:12 of conduct, they have a documentation. As I said, I've been impressed with the documentation. So it seems like there was at least knowledge of we need to do this correctly. And then also, here is our effort at correctly. So I'm curious curious was there research gone into how to open source a project or had people on the team done it before how did you guys know like what boxes to check and which things you really needed to address before you could open source it so the motivation for great documentation actually came from our technical lead for a while
Starting point is 00:27:43 he'd been passionate about the idea of creating an end-to-end tutorial. So you'll see Chapter 3 in the documentation. It takes you through this amazing space flights tutorial. It takes about two hours to run through everything and gets you acquainted from beginner all the way to intermediate functionality in Kedro. I think it was Ivan's passion for the users
Starting point is 00:28:04 and them being able to learn and understand the tool because Kedro. I think it was Ivan's passion for the users and them being able to like learn and understand the tool because Kedro documentation before that had literally just been kind of like the API docs and the user guide where we describe kind of like how the individual parts of the library like the pipeline, the nodes and the data catalog work. So really kudos to Ivan thinking and pioneering that we need to do better in terms of like how we explain these things and also using this as a solution for the fact that we couldn't host a Slack channel
Starting point is 00:28:32 when we open sourced. But in terms of how we set up the code of conduct, and how we thought about best practice for setting up a GitHub repository: I used to spend weekends going through people talking about how to do open source community management and what does best
Starting point is 00:28:50 practice look like for running a community. So it was important for me that a code of conduct went in, so that we'd have a way to enforce things if something were to go wrong on the repository (luckily, nothing has), a way to communicate with the users by referring them back to the code of conduct, and then also a way to take action to resolve things. So yeah, it was about having an empathetic view of how someone will perceive a product, how they get on the best side of that, and trying to put yourself in the user's, or the first viewer's, shoes. On this note, additionally, I do know that open source does have diversity issues as well. So I'm really excited to, you
Starting point is 00:29:31 know, eventually see hopefully PRs from like women or women identifying people on Kedro. So yeah, at all times, like we do also have a style guide for how we communicate with our users as well. That was also something that we set up. So how do we say, you know, thank you for contributions? How do we respond to questions as well then? And that is a team standard, because we never want to create an environment where someone is offended and doesn't want to come back and interact with us on our different channels because of maybe a comment or whatever the case might be. So those are things that we did look at. Very cool. So I will say to your third point, which was you were a bit surprised by the level of success or people glomming on and using the project, I would say probably has to do
Starting point is 00:30:17 with the stuff that we just talked about and the thoughtfulness and the TLC that you put in around open sourcing and doing things well. That being said, I'm curious if you had some sort of a launch plan or like, were there ways that you said, okay, Kedro is out there. We want people to use this. We want people to try it. Were there press releases? Was there blog posts? Or was this just a thing that grew organically after open sourcing it? That is a fantastic question. And there was a huge release.
Starting point is 00:30:47 I mean, it was McKinsey's first open source product. So everyone was very excited. The other thing that we realized in hindsight is that I'm really passionate about product marketing, because you can treat it as a user problem: how do I make sure that you're getting the right information so that I don't waste your time?
Starting point is 00:31:04 And you also learn what value something might have for you, because you consumed it in a short time and in a form that you needed it in. So I approach product marketing the way I approach product management: how do we optimize for the time that you have? So I was able to construct a massive marketing campaign with the Kedro team and with our head of marketing and with McKinsey branding in order to actually deliver what was our massive open source release in June of last year. So press releases went out. Articles were released on Towards Data Science. There was social media.
Starting point is 00:31:41 There were some webinars done. And yeah, we just wilded out a little bit on being able to, I think, basically make history for the firm and release its first open source tool, at a firm where McKinsey is recognized as having this amazing intellectual property that it would never open source. So it was really just exciting for us. So that's why we went big on that launch. Do you think the success of Kedro will lead to McKinsey open sourcing more things, or is this more of a one-off because of the client need? Good question. I can actually show you an example of where that's not the case, because, it must be going on two weeks ago now,
Starting point is 00:32:21 we released our second open source project, CausalNex, which is essentially a causality Python library that helps data scientists address the question of causality versus correlation in your data sets. And we released this one purely because it was an exercise in how do we showcase some of the R&D work that we have done on client projects
Starting point is 00:32:45 and that eventually made its way into more formal products, because we've been able to try and test those methods. The really cool thing about CausalNex is that the research that it's built on was actually presented at NeurIPS, not at the last NeurIPS, but the NeurIPS before. And we had quite a few data scientists who were intrigued by this whole NOTEARS approach for assessing causality, and now this year we're finally releasing the open source library that we were able to build using that theory. So yeah, we're still working out what our open source strategy looks like for the firm.
Starting point is 00:33:21 There are still many interesting questions about how we tackle it. You know, what do we decide to open source? What is our process, finally nailed down? Because I still think there are places where we could optimize the process a bit better. But it's just an exciting place that we find ourselves in. We hope that it inspires many more people within the firm to consider open sourcing. I do get emails with people asking how did you open source Kedro, in hopes that they can do the same. So maybe it was just the first of many, I really do hope. For those curious about CausalNex, I have scooped it up and we'll include a link in our show notes, so you can click there and check it out. Hey, what's up? Adam Stachowiak here. I got a question for you. Have you heard of our newest podcast yet, called Brain Science? I'm not going to be offended if you haven't. It's okay. But
Starting point is 00:34:15 here's how you find out more about the show. Go to changelog.com slash brain science. We have 10 episodes on the podcast feed. So have fun, go back and listen to all 10 and subscribe to get more. I actually get to co-host this podcast with a doctor. That's what makes this podcast legitimate. If it was just me, it would not be as cool, but I get to team up with Muriel Reese. She is a practicing clinical psychologist, and it's so much fun digging into deep subjects around being human. Here's a preview of episode number 10. We're talking about shame. We haven't talked about this as it's relevant to creativity. And if you can see that when we are
Starting point is 00:34:55 trying to navigate shame, this sense of inadequacy, do you think you're going to be more apt to be creative or less apt to be creative? I would guess less apt because I'm trying to focus on fight, flight, or freeze in those moments, and I've got no time to be creative. I got to be the most necessary Adam possible to get through, right, rather than be creative. Yeah. So let me tell you the dynamic. Adam, I need you to be remarkably creative so you can come up with the best, most user-friendly way for this to work. Except you suck, you didn't do it enough, and you need to do better. Don't ever tell me that again, Mariel. That is not nice. But I can understand how in that kind of moment. So if you're leading teams out there, don't lead with shame. Okay. Well, it's really recognizing the way that you have to write.
Starting point is 00:35:50 If you can see shift your mind into seeing this through a way of management, like I need to manage how I interface with other people, especially around creative endeavors. Yeah. That I need to be deliberate about identifying what they're doing well and even saying, like, create clarity, like what you want them to approve upon. All right. To keep listening, head to changelog.com slash brain science slash 10. And that will take you to the episode titled Shame on You, where we examine the hustle of not enough, how shame relates to imposter syndrome, our fight, flight, or freeze lizard brain response to threats, and so much more.
Starting point is 00:36:32 Again, changelog.com slash brainscience slash 10, or search for Brain Science in your favorite podcast app and subscribe. We'd love to have you as a listener. Switching gears a bit, let's talk about fun stuff. You are involved in something that looks very cool. The development of what you call a social impact virtual reality film as a Sundance New Frontier Lab fellow. Tell me about that. Sounds intriguing. Sounds like you've been digging around on the Internet, which is cool. Maybe just a little bit.
Starting point is 00:37:24 I was a Sundance New Frontier Lab fellow along with a co-creator, one of my best friends, Sharifa Ali. We've been best friends since like high school, grade 10 in 2018. And the reason we were Sundance New Frontier Lab fellows was because we've been working on, at that point it had been for two years, one year, two years, two years, we've been working on a film, a virtual reality film called Otomu. So Otomu is based on this Kenyan myth that if you walk around a sacred tree, seven times you change gender. So go from male to female. And it's kind of like this interesting comment on gender fluidity ideals, which are kind of seen as un-African. So obviously it now tackles this whole thing of why is modern day Kenya so adverse to gender fluidity ideals and homosexuality as well.
Starting point is 00:38:08 So there are a lot of issues that we cannot dig up. Obviously you look at like religion and colonialism as factors that would impact that. Moving on from there, we were excited. It's also two weeks ago to actually head to Sundance 2020 and actually premiere a Tomu. So we've been iterating on the piece since then, and we were able to actually showcase the piece at Sundance. So yeah, that was a really, really cool experience,
Starting point is 00:38:36 and there's still more work to do, because that was essentially, Otomu is in its fourth iteration right now, and there still will be a fifth, which really focuses on how do we distribute the film to different partners in the US, Europe, and also back home for us. Sharifa is from Kenya, from South Africa.
Starting point is 00:38:58 So how do we do that in an accessible way? So this was Sundance Film Festival 2020 just happened, man, a couple of weeks ago. So a fresh thing that just happened. Curious how you even film a virtual reality film. And when you have an iterating, are you you're filming more things? Or tell me about the process? Sure, the way that Atomo is set up, it's built in Unity, essentially, what you do is you get it's also a dance piece so we had access to two dancers who would essentially do motion capture to actually capture how they're moving you can build an avatar from that and place it in any environment in our case we placed you in the
Starting point is 00:39:37 Karura Forest, where mugumo trees, which is the sacred tree, are typically found, and that's essentially how the environment is built. There was also some thought around how we do different user experiences as well, because we did optimize the piece for the Oculus Quest. We specifically waited for the Oculus Quest to come out, because we knew it was a higher-resolution virtual reality headset. So higher than the Go, for instance,
Starting point is 00:40:02 but still more accessible in terms of price than the Rift was, the Oculus Rift. So we specifically designed it for that. And we were also fascinated by the multiplayer experience, which we've had some moderate success with. I think the next iteration will fully nail down what the multi-user experience is supposed to look like. But that's essentially how you do it.
Starting point is 00:40:22 So you can decide. I think there are many different ways of filming virtual reality experiences. You have seen the 360 video, kind of documentary pieces, which are still released. But a lot of pieces will focus on using motion capture to build avatars and then eventually constructing environments around the different avatars that you will have. So in terms of presenting that at a film festival, is it just a room with a bunch of these Oculus Quests that you go in and use? Because it seems like an individual experience versus a shared experience. That's a good question.
Starting point is 00:40:55 So how our piece was constructed is basically that. So when we were presenting at the Sundance New Frontier Space within this year's festival, we were in the biodigital theater, which was a specific area set aside for what they called multi-user virtual reality experiences. In Atomu, we're able to get seven users in at the same time. So that means you each put on your headset, each have your own set of controllers, and you can each affect the piece based on how you're interacting with it. In the piece, you are ancestors who are basically saying that this way of life is right,
Starting point is 00:41:33 and you're following a character called Waikiki as they make, not a gender-based transformation around the tree, but what we call the journey to be the most honest version of themselves. So you help this character along as they dance and struggle around the tree on their journey. So it was important to us to have the multi-user experience. It was a design point for us, because the artistic intention is that everyone understands what it's like to not be your true self.
Starting point is 00:42:06 It's a human experience of sometimes being unsure of yourself because, I don't know, someone's made a comment or doubting yourself or even hating parts of yourself. And you being able to support someone on that journey means we would love for you to be able to know that you can support other people in their lives on their pursuit to be the most honest versions of themselves. And you also recognize that you're not alone in that experience as well,
Starting point is 00:42:31 because there are other people going through the same thing. So that was why we opted for the multi-user experience. We still had some technical challenges with it. The system that we're using currently is quite expensive, which goes against one of our design principles of making this piece accessible. But we're looking at different ways to actually try and code it,
Starting point is 00:42:53 kind of like hack the Oculus Quest, if they don't help us in the end, to do the multi-user experience without having additional gear. Because that is really, really important for us in terms of distribution. Wow, so how was it received? Very well received.
Starting point is 00:43:07 Sharif and I also made certain choices to whether or not we went explicit on the way that we described some things versus like very abstract. And we kind of like strayed towards the still kind of like abstract focus, but people got the intent. People understood why they were in the piece, why they were the ancestors, and why it was a multi-user experience. They also understood how the piece was constructed
Starting point is 00:43:31 in terms of like Wakiki's journey and understood everything that happened there. So we have received like really fantastic feedback. We have obviously opened it up and told people that, hey, we're still iterating on this piece, so any critical feedback that you have, do let us know, we're still iterating on this piece. So any critical feedback that you have, do let us know so that we can build it into the piece. So yeah, it's been very well received. And we're very happy. This has been a long, long time project. But yeah,
Starting point is 00:43:58 I'm really excited for, I guess, 2020 and finally completing Otomu and seeing the impact that it was supposed to create. Yeah, I was just going to ask if it seems like this is like a living project or if it's just one that's still being formulated and will eventually come to its natural conclusion. It sounds like there will be a completion step, like when you brush off your hands and say we're finished. I don't know. Virtual reality experiences, at least the model that currently exists, means that, I guess because the tech is moving, but it's not really moving that fast, means that they live longer lives than I think mainstream films or mainstream media would. really old oculus rift um can still watch movies from today for instance so in terms of like overall end of tomo i'm not sure that there is because we are considering different models of distributing
Starting point is 00:44:53 the piece in museums and even schools as well as working with like non-profits as well because the intent of the piece obviously is the journey to be the most honest version of yourself but we can bring someone to understand that it's actually about how do we deal with LGBTQ issues and accepting people that have different sexuality preferences so really being able to use it with non-profits is also important for us too so we hope that the piece lives on because the intent and the meaning behind why it was made still remains relevant. My ideal world, I think, not that I know a tomu could achieve this. It would be wild if it could. It would be that the piece actually wouldn't need to exist because everyone was just comfortable in their skin and was accepted.
Starting point is 00:45:41 But we have to do our part. That's why we hope the piece lives on. Very cool, Yutunde. How do people get in touch with you out there on the internet? was accepted but we have to do our part that's why we hope the peace lives on very cool you today how do people get in touch with you out there on the internet where can people reach you on the internet i'm quite active on twitter um so you will find me tweeting away i'm at yetu data so y-e-t-u-d-a-d-a i'm on instagrams as well but i don't really use that one anymore and you can yeah if you if you find me on LinkedIn, I don't, I know you only then will accept you.
Starting point is 00:46:08 So Twitter is probably your best bet. Twitter is probably your best bet. Yeah, that's the easiest place to reach me. Otherwise, you'll find me on the Kedro GitHub repository or on Stack Overflow, especially when I'm the wizard answering questions. So yeah, those would be the two channels to find me. Very good. Well, thank you so much for joining us today. This has been
Starting point is 00:46:29 a very interesting dive into Kedro and also into Otomu and the work you're doing at the Sundance Film Festival and beyond. Good luck to you on both endeavors. Any final words from you before we say goodbye? Quantum Black Labs specifically, so the product engineering part of Quantum Black,
Starting point is 00:46:47 we're hiring. We need help finding amazing people that can do software engineering. So from full stack engineering all the way to even just being like Python focused devs, we're looking for you. We're also looking for product designers as well. If you spike on UX, you will be loved. If you spike on the visual side, you're looking for you. We're also looking for product designers as well. If you spike on UX, you will
Starting point is 00:47:06 be loved. If you spike on the visual side, you will also be loved. And product managers as well. So we're really rolling out the team. And as mentioned, specifically on Kedra, we're looking for developer advocate and dev relations people. And if you can do Python as well as being one of those people, then we love
Starting point is 00:47:22 you more. So yeah, if you're interested, I think the best place is to head through to actually the McKinsey website, which would host our job offers. But otherwise you can just like ping me on Twitter and send me your CV and I put it straight through to the recruiters. It makes your life so much easier.
Starting point is 00:47:40 There you go. Hit her up on Twitter and get that ball rolling. Well, thanks to Tunde. This has been awesome. And to everybody else, we'll talk to you next week. All right. Thank you for tuning into the Change Log. If you're not subscribed yet to our weekly email,
Starting point is 00:47:55 you are missing out on what's moving and shaking in the world of software and why it's important. And, of course, it's 100% free. Fight your FOMO at changelog.com slash weekly. When we need music, we summon the Beat Freak Breakmaster Cylinder. Our sponsors are awesome. Support them. They support us.
Starting point is 00:48:12 We got Fastly on bandwidth, Linode on hosting, and Rollbar on error tracking. Thanks again for tuning in. We'll see you next week.
