Disseminate: The Computer Science Research Podcast - Andra Ionescu | Topio: The Geodata Marketplace | #31

Episode Date: April 25, 2023

Summary: The increasing need for data trading across businesses nowadays has created a demand for data marketplaces. However, despite the intentions of both data providers and consumers, today's data marketplaces remain mere data catalogs. In this episode, Andra tells us about her vision for marketplaces of the future, which require a set of value-added services, such as advanced search and discovery. She also tells us about her and her team's effort to engineer and develop an open-source modular data market platform to enable both entrepreneurs and researchers to set up and experiment with data marketplaces. Tune in to learn more about Topio, a real-world web platform for trading geospatial data that is currently in a beta phase.

Links: Topio Marketplace, Andra's Homepage, Andra's Twitter

Hosted on Acast. See acast.com/privacy for more information.

Transcript
[00:00:00] Hello and welcome to Disseminate, the Computer Science Research Podcast. I'm your host, Jack Waudby. I'm delighted to say that I'm joined today by Andra Ionescu, who will be talking about her work on the Topio Marketplace. This is work that will feature at ICWE, and Andra also had a demo on it at EDBT that won the Best Demo Award. So congratulations for that, Andra. So yeah, more about Andra.
[00:00:46] She's a PhD student at the Technical University of Delft in the Netherlands, and her research interests are data marketplaces and dataset augmentation. So Andra, welcome to the show.
Thank you. Thank you. I'm happy to be here.
Great. So let's jump straight in then.
[00:01:03] So I've given you a brief introduction there, but maybe you can tell us a little bit more about yourself and how you became interested in data management research.
All right. Hello, everyone. My name is Andra, and I'm a fourth-year PhD student at TU Delft. Well, fourth year means that this is my last year of my PhD journey, and I'm close to graduation, hopefully. How did I get into data management? Well, I got into data management
Starting point is 00:01:34 since my master's. So my master was in data science also at TU Delft and for my master's thesis I had the opportunity to collaborate with Christos Koutras who was my thesis supervisor back then and now he's actually my colleague. And we worked together on data integration, we created Valentine which is a benchmarking tool for data integration. We did this together with more colleagues, of course. And this was a challenging experience as a master student. And I worked on this big project with tools and topics new to me. So I loved it.
Starting point is 00:02:17 I really liked it. And I expressed my intent to pursue a PhD career, PhD trajectory. And luckily, Asterios Katsifodimos, who is my supervisor now, had the position open. And this is how I got into data management. So it's basically thanks to my master's. Oh, amazing. It's quite a similar sort of experience I had in mind. I had a really nice master's experience that kind of encouraged me to then pursue a PhD.
Starting point is 00:02:49 So that's fantastic. So today we're going to be talking a little bit about data marketplaces, right? So maybe for the new initiated, you can kind of start off and give us some background and explain to us what they are. Well the data marketplace is actually a marketplace but for data for our data sets so data is treated as a commodity and it's traded between providers and consumers so someone has data that they want to to trade, to share with others in exchange for maybe some money. And other people are interested in actually getting more data. And it's not necessarily about, you can take an example, us as researchers, because we need data,
Starting point is 00:03:41 but also companies who are working on a million things, they also need data. So data market platforms trade these data sets and can generate revenue by providing external extra services to help both providers and consumers. Awesome. Yeah, I'm forever reading about what we've had over the past five, ten years. Data is the new oil, right? So they're always saying, right?
Starting point is 00:04:10 So it's only fitting that it has a marketplace to buy and sell the commodity, right? So yeah, that's really cool. So maybe you could tell us a little bit more about what existing platforms are there out there and what are the problems with them, essentially? All right. Well, there are a lot of platforms out there and I can answer this question from different angles because you can look at the landscape and say that there are too many platforms and why not have one and sell everything on that single platform?
Starting point is 00:04:43 But this can become messy and hard to maintain. On the other hand, there are a lot of marketplaces which are specialized for different data types or businesses or fields. For example, geospatial data as we developed the platform. And there's also the data management perspective. And we can look at the struggles regarding traditional data management challenges, such as profiling, integration, metadata curation, enrichment, data search recommendation. And I think some of the platforms that are out there handled and solved one of these problems, multiple problems, to some extent or completely.
Starting point is 00:05:35 We don't know because they are businesses in the end. So in the end, I think it's all about the gain and the benefit. Are there any sort of names that I'll be aware of in the data marketplace? I guess the big players maybe have some platforms that they have, but I'm not really aware of any of them, really. I mean, I've never used them. I've never looked for them. So, I mean, that could be my problem, right? So I'm never going to buy data, so I'll sell it.
Starting point is 00:06:02 So we have AWS Data Exchange, DataRate. These are more industry-focused based. Right, yeah. And for GeoData, there's Carto and here. To some extent, they are data market platforms, and they do other things as well. So I don't think there's one that has only one focus I see I see okay so I guess kind of building on that then she can give us maybe the yellow
Starting point is 00:06:30 bitch elevator picture top you then so how do you go go about addressing some of these problems you've mentioned and maybe as well can you maybe ask folks on why geospatial data specifically well we wanted to to focus on geospatial data specifically? Well we wanted to focus on geospatial data as open source data as well because geodata is well there are a lot of companies and businesses around geodata, but there's not one place for geodata also in the context of Europe. They are more focused on the global expansion and so on. So I wanted to make it more focused, more specific, also because it's open. So you need to go a bit focused and specific. And yeah, I would say that Topio is an instance of an open source market platform. It's designed with openness, reusability in mind. We have a lot of usable libraries,
Starting point is 00:07:35 which you can use not necessarily to build a marketplace, but also independently. And I see Topio as a joint effort to build an open source platform. Awesome. Cool. So when you were going about sort of designing, it's obviously you've surveyed the existing marketplace and the existing kind of the landscape of data marketplaces. But you also, I know in your upcoming ICWE paper, you did a survey, right, where you kind of asked both the people who consume data and those who produce data to kind of motivate the design and make it usable, like you said a moment ago. So can you maybe tell us a little bit more about this study and what you found from it?
Starting point is 00:08:22 Sure. So we conducted this service with the goal to understand the needs, the requirements, the preferences of both providers and consumers from very diverse backgrounds. So we have participants from geography, information technology, marketing, with different roles in the organization and different business fields. And to summarize the findings, we observed a high interest into being part of a data market platform and selling data that way. But there are a lot of challenges. So the consumers have challenges regarding standardization of pricing, of contracts, of payment, fees, commissions, this sort of business things. And the providers, on the other hand, expect easy access to data, transparent terms and
Starting point is 00:09:28 conditions, transparent costs. So there are a lot of bureaucratic and financial aspects which probably they are an impediment with the other platforms who make a business out of it. And aside from these parts, both consumers and providers want the same things, which is perfect. So they want the same data formats. They want to use the same services. They want high data quality and possible templates for licenses, contracts.
Starting point is 00:10:05 So the good part of the surveys is that they turn out to be aligned in terms of requirements, which is perfect, I would say. Yeah, that's a great find, right? Because if you find sort of a disconnect between the producers and the consumers, you then need to do something to sort of kind of realign them, which makes the challenge even harder, right? So it's good that they both want to be on the same page because it's just now giving them the platform to do that. So let's dig into, on that, let's dig into Tokyo a little bit more then. So can you tell us, given what you found from your
Starting point is 00:10:38 study, can you maybe describe the architecture of the marketplace a little bit more and the various components that make up the marketplace. All right. Well, this is going to be like a 30-minute talk. Go for it. Yeah, yeah. Just go for it, yeah. Probably like the ICW presentation. Well, it is a good practice.
Starting point is 00:11:01 Join it. All right. Back to the question. So the platform, as we described it in our paper, has five major components. Ingestion of the data, search and discovery, recommendation, profiling, and delivery. Of course, around these components, we also have other components regarding the workflow of the platform, the legal aspects, like I mentioned before, the contracts, licensing, things that we intentionally left out of the papers, because this is not really our domain.
Starting point is 00:11:48 So we were merely developers for this component. And of course, there's also the UI part. That's a different story because you have to make the platform look nice and also be functional for the users. So I will focus on the backend side. So more specifically, also the research side. So the more specifically also the research side. So first the assets are ingested and stored in Topio. So a data asset is upload, it's versioned, it's curated and then stored.
Starting point is 00:12:19 From there, the asset is directly delivered to consumers in their preferred format. So data assets lifecycle includes publishing, purchasing, delivery. Then we developed value added services, such as the data discovery, the recommender system, profiler, because we wanted to increase the benefits for the consumers. So these benefits are twofold. We want them to better understand the value of an asset based on the metadata that we compute through the profiling service
Starting point is 00:12:54 because we want them to understand what the data set can be used for, what is the value of it, and then make an informed purchase. And the second benefit is also easier access and discovery, personalized recommendations of related and complementary data. We understand how difficult it is to find data, so having an engine that does that for you, based on your searches and behavior in the platform i think that's valuable for the consumers and of course also for the providers because we they can sell their assets in the end and make them discoverable and not buried down somewhere in the corner of the the platform so to say awesome cool so yeah there's two things that kind of jump out that i want to
Starting point is 00:13:43 dig into in a little bit more detail. How the heck you go about Working out the value of a piece of data like and how do you go about pricing assets in? Pricey keys indeed a hot and debated topic and now especially in data management as well We investigated the possibility of making our own pricing strategy and deriving pricing from selling subsets of data sets, for example, or views. But this became very challenging. I mean, I think you can do a whole PhD on this topic. So TopView prices datasets in two ways in the end. So first is paper dataset, which means that this is the simplest form of pricing.
Starting point is 00:14:32 And which means that we don't really do the pricing as a platform, but we let the provider offer datasets to the consumers for a fixed price and provide discounts or price per bundle or they do their pricing and the second way is pay per api call on a value added service so when consumers read data from a value added service API, providers can set the price per API call. These calls are logged and then you charge on a per call basis. Similar to cloud platforms, right? You pay for as much as you use. But this is it as far as pricing is concerned from Topio
Starting point is 00:15:22 side. We wanted to let the providers do their own estimation and assessment. Sure, cool. Obviously, I guess there's a whole, you hinted there you could do a whole PhD on the pricing of data. Is there any other sorts of potential strategies you explored initially that then you decided to go, you know what, it's going to simplify proceedings and just make it a lot easier to build this marketplace if we kind of shift that sort of pricing either to the provider to set the price or go for like a pair API sort of model., actively explore other methods because for that we already needed to deploy the platform basically and have the users who will act normally.
Starting point is 00:16:14 Yeah, it will not work if we just pretend to know what we are doing because the the providers um from our surveys at least are people who know how to sell data so they sold data before through other means not necessarily using a data market platform but they know how this business is going cool so yeah let's let's let's then dive into a little bit more maybe about the discovery, the value-added service there. So you mentioned a few different value-added services that you've incorporated in. So let's talk about discovery first. Can you maybe tell us how you go about achieving that essentially in the marketplace? Well, we start, so imagine you just deploy to the platform.
Starting point is 00:17:01 You have some data sets, either open data sets that are free to purchase, by the way, but we just make it discoverable, or providers who already uploaded some data sets. So we first create our own data representation structure, so to say, but we use a graph where we map all the datasets. And then from there, we have two strategies
Starting point is 00:17:31 of making datasets discoverable. So first, we have the joinable, unionable case, where we just look at the directly connected nodes to one given node. In the case of Marketplace marketplace that will be the data set that you're currently inspecting, viewing, so you already found something and you want to to find more to augment it maybe. And there's another component which is linking data sets. So let's say you have some assets which are favorite
Starting point is 00:18:08 and you mark them in your favorite list and then you are browsing other assets. So we can look now on how to connect the assets that you're currently viewing to the assets that you marked as favorites. And we are doing this by looking at transitive paths, so traversing the graph basically from a source to a target. And we implemented this in a Jupyter Notebook environment, which is also provided by Topio.
Starting point is 00:18:41 I guess I want to touch a little bit more on the you mentioned something earlier on about combining data sets and like doing views over them is that something that you currently that functionality that exists or did you decide not to do that or is it basically is the the boundary of each asset essentially like you're not doing any sort of pre-processing in topio to kind of combine two probably similar data sets together into one not doing any sort of pre-processing in Topia to kind of combine two similar data sets together into one bundle? Is it sort of very, everything's very siloed? Is that like, the producer puts it on
Starting point is 00:19:12 and that's what gets sold, essentially? No, we didn't do anything in this direction. I think this would be the data discovery limitations, in a way. Okay. Because we did not explore the fully capabilities of working with geodata. Okay.
Starting point is 00:19:33 And I think this part can, yeah, this can look very nice in the context of geodata, but it's not yet explored. Okay. So it's another PhD explored okay i see so it's a it's another phd's worth of work no i think that you can do this in a one quarter no one quarter of your phd so that would be one year yeah yeah cool awesome so you said that a second ago as well about kind of when when a consumer is inspecting a data set and they can obviously then follow through to see other data sets and the discovery sort of engine within Topia will recommend to them,
Starting point is 00:20:12 will show them things that are similar. But when I'm looking at inspecting a single data set, what sort of stats does a consumer get presented with? Like what's the interface there? Like what are the things that I would potentially be interested i would potentially be interested in and like how do you characterize data assets i guess is my question all right we have a lot okay about that so first is the data the metadata that the providers um input when they upload the data set things particular to to the data set such as the format,
Starting point is 00:20:47 language, and so on. Things that a human can input easily without any problem. Then we have the automated metadata which comes from our profiling service. That's something that we developed. There's also a paper about it from Athena Research Center. They develop Big Data Buoyant, it's called. And it's a profiler specifically for geodata, geospatial data.
Starting point is 00:21:18 And we provide a lot of statistics around the column, distributions, also small graphs for each column. Then there's the map section where you can actually visualize the data set and see which area it belongs to. You have different samples. I think there's four samples for each data set so you can actually scroll through it and see the the values um and i think that that's about it yeah okay nice so you get you as a consumer and you get presented with a whole host a whole a wealth of statistics there to make your decision whether you want to you want to buy buy said data set that's really cool i'm i guess um guess you mentioned a second ago, again, about the implementation of Topio. You mentioned that this is expressed in terms of a Jupyter notebook,
Starting point is 00:22:14 and that's kind of how a consumer would interact with the marketplace. Can you maybe tell us a little bit more about the back end? Like you said, the assets come in. Where are the assets stored in the sense of is there a database in the background there that's storing the information or how is everything sort of hosted i guess it's all in a three buckets and then you build on top of that or yeah how does it look like well um unfortunately i don't have any details about that because um um my co-authors and the partners from Greece, from Athena, they handled all these technicalities. And I know that all the services are deployed in Kubernetes.
Starting point is 00:22:51 That I can tell you. And it was quite complicated at some point. Or at least in my opinion. But there are, I mean, I know for recommender system and for discovery service, they use the graph database. One of them is Neo4j. They also use for the metadata Postgres, but I don't really know about other technical details.
Starting point is 00:23:19 Okay, cool. Yeah, indexing the metadata and searching for the assets that's Elasticsearch based. It's kind of a collection of existing solutions that are kind of together as one, I've been deployed as one sort of whole marketplace together, like a wrapper around it, kind of coordinating all these different services and products and data. So that's the thing, because we have multiple components, which are open source.
Starting point is 00:23:53 You have the GitHub repo and each component, each library basically describes what they are using. So in the end, it was just a matter of putting all the services together so that's why you use kubernetes because it's easily plug and play kind of a microservice architecture i see i see yeah awesome that's really cool i'm yes i guess whoever kind of stitched a lot together using kubernetes had some fun trying to get all that to work yeah fun i would not describe that this fun um cool let's talk a little bit more about the um the usability of this is obviously kind of a big
Starting point is 00:24:35 design principle in designing this marketplace and i know you in your icWE paper you have an initial sort of study on the usability and the performance of the marketplace. So can you share your initial findings on that, please? Of course. So we used the beta version to assess the data lifecycle, basically, in the platform. So we measured the time that the users spend on publishing and purchasing because ultimately this is the most important thing for the data market platforms can you publish your assets and do you have any problems publishing your assets and then once you found something that you want to buy can you actually buy the assets do you have any difficulties in buying assets so we evaluated novice and expert suppliers um the positive
Starting point is 00:25:35 outcome was that most suppliers actually added more data metadata so this means that they understand the need for sharing metadata and to make it as explicit as possible because in the end this makes it very easy for for the user to discover and purchase their assets another interesting observation and this is regarding the pricing so when the suppliers uploaded the data they also have the option to create services. And they spent a lot of time on this process because they didn't know how to price the services. So they didn't know how to price their data that was easy peasy, lemon squeezy, but when it was about services, which is basically a new market activity, they needed more time. So more consideration was needed to allocate the right price for the service.
Starting point is 00:26:34 And one expected finding was that the consumers actually didn't have any problem buying the asset. And we are very happy with this because in the end, we wanted to go for a e-shop experience. And people nowadays, I think they are used with finding an item, put it in a cart, buy it, done. That's about it.
Starting point is 00:26:59 Yeah, click, click, click, done, right? Yeah, it's great. It's deadly that the amount of clothes I buy online because of that, they make it too easy to buy things right but anyway that's really cool so i mean the findings are obviously very very positive and kind of motivate you continuing this this this line of research and but can you give us some sort of idea in terms of the sort of volume of data that sort of flows through the marketplace. Do you have any sort of numbers on that?
Starting point is 00:27:25 And maybe how much money, I guess, transfers through Topio at any point? I don't have this data point. I know that for the time when I had a demo, there were hundreds of datasets, some open source, some private. But in terms of transactions, I don't have these numbers okay no no problems i guess though i guess there must be like quite a bit of activity in there for you to sort of get feedback from like producers and consumers right so i guess that's a good indicator that um it's it's useful right if people are using it so um that's really cool um yeah so that obviously
Starting point is 00:28:03 paints um uh top here in a very in a very good positive light and shows that it's kind of going in the right direction but are there any sort of things that are probably suboptimal with Topio at the moment or kind of what are the general limitations of the marketplace at the moment today? We definitely have limitations so one thing that I can think about, it's also the discovery service, because I see that it's performing a bit slow. So when we will have even more data or consumers, it will become quite slow.
Starting point is 00:28:40 So we need more optimizations in that part. Definitely more research on data versioning, provenance, watermarking, even segmentations of the assets, right? To create different smaller views, but based on the geo coordinates. One thing that we got feedback actually at EDB team, and it was a very interesting point that I haven't thought about it before. To you as an user, you have a data set.
Starting point is 00:29:14 And scenario will be that you can use the marketplace to find related assets to the one that you have. So that would mean that you can upload your assets without actually becoming a consumer. Because for now, we have this workflow that if you want to do anything in the platform, so sell data, you have to become a provider, use the notebooks, and so on, you have to become a consumer. It's not difficult to do this, but there's an extra step, right, that we didn't take into account in the beginning. So I would say this is a limitation. This is a nice segue into my next question is, how do we go about addressing these limitations?
Starting point is 00:29:58 And where do we go next in general with Topio? It seems like it's a very big project, right? There's a lot of different people involved. So I guess what's the big picture sort of view? And then we can maybe go into your specific next steps. Well, for now that we made the beta version, there are still some services in the alpha phase that are not yet deployed in the platform. For example, the recommender system is still not deployed because we were waiting for more users, more data.
Starting point is 00:30:33 With the recommender system, you need activity in the platform in order to actually build something that is working in your advantage. But for now, we are taking a break because it's been a very intense effort. And yeah, as I said, I'm towards my last year. So I have other things to worry about as well. Yeah, that's from my side. I don't know what my partners have planned
Starting point is 00:31:06 because, as I said, we are in a bit of a holiday mood now. Shutting down for summer, right? Yeah. Awesome. So there's no sort of plan initially or the immediate future to integrate chat GPT into the marketplace, right? Well, this is so new that we couldn't the marketplace, right? Well, this is so
Starting point is 00:31:26 new that we, I mean, we couldn't foresee it, right? Oh yeah, I would enjoy it. That's cool. Awesome. Maybe chat GPT will become a data market platform. Maybe so, right? Hey chat, can you find a data set for me to
Starting point is 00:31:43 buy? You can see it happening, right? As a sort of software developer, as a kind of data engineer, how can I go about leveraging the things in your research and top it up? I guess the answer to that is I should just go and use it, right? And have a play around with it. But yeah, I guess bigger question, what impact do you think it can have? So aside from actually using the platform um i think the impact that we have is that we open source the libraries so you can
Starting point is 00:32:14 have a look at them you can use the services that we developed um i can think for example the profiler you can use that for other purposes or just have a look at how we implemented the entire library. Maybe there's some thing there that is useful for your work, your research. I can think of also data discovery, augmentation. There are a lot of components there that can
Starting point is 00:32:43 be used for other scenarios. So I think the goal was to show that it's possible to have small, well, relatively small components that they can combine them and have this instance of an open source platform. Amazing. Just a random thought that's popped into my mind um whilst we were talking there was that so you know the data sets they get so the producers of the data here maybe this information you might not be aware but is it sort of individual level like so is it me selling my personal data
Starting point is 00:33:16 or is it more sort of business level sort of people say hey we've got this customer data set that we want to sell like what's the granularity on the data set there? I think there's no limits. Okay. But thinking about geo data especially, that can be business level because we're thinking now about personal data or customer data, but with geo data, you actually have data about your surroundings. And other companies use this data to actually do the research on where to place the next store,
Starting point is 00:33:56 for example, is the infrastructure good, and so on. So looking a bit further from our own data, our customers' data, there is a lot of data out there that can be just used for other purposes. And it's not personal. It's just about our surroundings. Yeah, I was just thinking that maybe I could sell my Google Maps history. And then if I was feeling a bit, needed some I don't know some beer money or something I could say I'll sell all of my tweet data and the locations where I sent these tweets from so I don't know yeah I don't think anyone would pay much for it maybe I don't know I'm not sure if you're allowed to do that by the way but I don't
Starting point is 00:34:41 know have you checked the terms and conditions actually the point and the GDPR will protect me and I'm allowed to have the right to my own data I don't know actually have you checked the terms and conditions? Actually, good point. The Cholo GDPR will protect me and I'm allowed to have the right to my own data. I don't know, actually, that's a very interesting question. Well, you can have your data, but I'm not sure if you are allowed to sell it, so to profit from it. Interesting. That's interesting. Anyway, yes.
Starting point is 00:35:01 Right. Where were we? Yes, so my next question is, across the time sort of working on this project, what's maybe the most interesting lesson that you've learned while working? I think, so because it's such a big project, I think the number one lesson will be plan ahead
Starting point is 00:35:23 and double the time that you think it it takes because not once we we rushed to ship something because we underestimated or because coding you know i mean i think it works but it doesn't and then i have a bug and i'm stuck one week in a bug or something like this. And the usual say, it works and I don't know why. I don't know why it works. It doesn't work and I don't know why. So in software development, there's a lot of uncertainty. I know you cannot plan for it, but at least take it into account. And besides the time management, there's also a lot of people management. Right.
Starting point is 00:36:08 Because you have to communicate your requirements, your needs, your expectations, so that others can help you, right? Because in the end, it's about being a team. Yeah, yeah, it's funny. The estimation for software, like how long it'll take to deliver this thing is it's basically like a guessing game right because i can say i can say yeah i'll have
Starting point is 00:36:29 this done in in a week or two and guarantee you it'll be at least a month well basically the rule of thumb whatever i say double it and that's the minimum right and that's even that's even if with a good backwind yeah probably will be longer but anyway yeah that's that's a really interesting point you raised there yeah so my next question is uh from sort of the initial conception of the idea obviously i don't know what point in the project's life cycle you joined in at but whether sort of maybe up until this the icwe paper were there like things that you tried along the way that failed that maybe the listener would find interesting? Oh, definitely, definitely. So for the data discovery service, for example,
Starting point is 00:37:13 I think we changed the data representation model three times until we were confident that this is the right way. And that took a lot of work. And I also think that at some point we just circled around but yeah i mean that's research right uh you have to try all the options until you're sure that it's the right one it's very annoying very time consuming i guess there's a lot of um wasted implementation effort there um changing the data representation i guess like you did it once and then it was like you want to do it again really and then over and over again
Starting point is 00:37:51 yeah i'm not sure if it's wasted but it's definitely not something that you want yes yeah i agree yeah it's not wasted right because it was part of the journey to get into the end goal yeah work so it was it was a good thing that you tried it and it didn't work but yeah i guess there is at times that can be i guess frustrating right yes yes so um right so yes it is obviously most of your research has been within the topio project is there any other research that the list that you've done over the course of your PhD that the listener would find interesting? Yes, so I mentioned data discovery multiple times
Starting point is 00:38:33 because this is actually my main focus. So my research is in DBML, I would say, so databases for machine learning. I'm working on data augmentation, on feature discovery to improve machine learning performance. I'm currently working on my next submission. I don't want to give too many details, just that my research is in data augmentation. Stay tuned.
Starting point is 00:39:01 Maybe we'll do another podcast about it. You'll have to do another one when when it comes out for sure yeah great stuff so yeah kind of getting on from that then how do you go about sort of now this is one of my favorite questions i love asking this question and seeing what um answers i get to it from different people because it's always different it's all right how do you go about generating ideas and then selecting what to work on? So what's your creative process? Oh, OK.
Starting point is 00:39:30 Now I have to go back in the beginning. So I started from something that I knew and then I did a lot of reading. And I come with tens of ideas from conferences, but lately, because I'm the pandemic generation, so I didn't attend anything in my first years. But because I have too many ideas, I'm lucky to have my supervisors who kind of tone me down and make sure that I'm on the right path.
Starting point is 00:40:00 Because every time I want to just go and start working on a different project which is shinier and nicer and so on. So I think I'm very lucky to have in total three supervisors, including my promoter, and they help tremendously with feedback. I think I also circle around an idea a lot until I start the development, so the coding, for example. But it's still progress.
Starting point is 00:40:30 I mean, it's still movement. And I also think it's very important to like what you're doing. Because you can jump on an idea and start working on it, and everybody loves it, but you don't like it and then you're gonna have a horrible time right i think what's very important is to to like what you're doing and to have the right team to support you technically on a technical level sorry and also on an emotional level also a good mentor helps so besides your supervision team a mentor is is very good because he she can look at the the problem from a different perspective since he's not
Starting point is 00:41:13 actually involved in the problem yeah so how do you go about sort of sourcing out a mentor then or was it sort of something that the university that Delve has in place for to match you with a mentor or was it something you seeked out yourself? I know that the university has some programs to match you with a mentor and especially in the beginning when you start but yeah because of the pandemic I think we I was not matched with with the mentor, that I'm sure. So I found my mentor. I found my mentor. Yeah. I mean, work events, conferences or other events in the country,
Starting point is 00:41:58 the local events. And, yeah, we just got friends and that was about it. Yeah, I think that's a fantastic answer to that question. I mean, having a mentor is great, right? And if you find someone that can, if you find a good mentor, it's like invaluable, right? I think I totally agree with that. That's a great answer to that question.
Starting point is 00:42:19 We've just got two more questions now. The first is, what do you think is the biggest challenge in data management research now? I'm smiling and laughing at the same time. The biggest challenge, I would say, is reproducibility. People are reluctant to share resources, data, especially code. And I think this hinders the progress. I mean, you know, it's, why should I reinvent the wheel when the wheel is there?
Starting point is 00:42:54 Well, because I don't have access to the wheel, so I have to reinvent it. So then there's no way to actually advance it because we'll just keep on doing the same thing. You know, there are efforts on improving this and other communities are doing much better from this perspective. Maybe they have other issues as well but reproducibility I think it's or sharing the resources because people have different definitions of reproducibility. But I think this is one of the biggest challenges in data management.
Starting point is 00:43:31 Yeah, for sure. And that is something we should as a community strive more for, right? Totally agree with that, for sure. And also about data, because many times people just want, well, people, and I say for like a mean reviewers wants to see want to see experiments on real data but nobody's sharing anything how can we actually work on real data yeah we need then partnerships right with industry and then you can share anything because the the businesses don't want to share anything so it's like a vicious circle so yeah we definitely need
Starting point is 00:44:12 to be more open more reproducibility completely completely agree with you on that one andrew and cool yeah so last last word now it's the last question and it's what's the one thing you want the listeners to take away from this podcast episode today? Well, definitely have a look at Topio. It's beta.topio.market. I think I forgot already. We'll put a link to it in the show notes. Don't worry about it.
Starting point is 00:44:39 Yeah, we'll put it on all the socials so the listener can go and figure it, find it and play around with it. But I think on a general level, let's say, I think it's very important to be aware that it's a joint group effort to create something big and impactful. So for the PhD students, no matter how lonely your PhD trajectory is, just find your team, attend whatever, everything, anything to find your team. It will make a very big difference. And with this opportunity, I like to thank my students, my research engineer, collaborators, industry partners, mentors, supervisors. You see, it actually takes a village.
Starting point is 00:45:29 That's one takeaway. It takes a village to graduate. That is a great message to end it on. That is awesome. Well, great. Yeah, so let's wrap it up there. Thanks so much, Andrew, for coming on the show. It's been great to talk to you.
Starting point is 00:45:43 And if the listeners are interested to learn more about Andrew's work, we'll put links to everything at the top here in the show notes. And if you enjoy listening to the show, please consider supporting the podcast through buying me a coffee. It really helps cover all of our hosting costs, et cetera. So, yeah, please do that if you enjoy the show. And we'll see you all next time for some more awesome computer science research.
