Disseminate: The Computer Science Research Podcast - Madelon Hulsebos | GitTables: A Large-Scale Corpus of Relational Tables | #36

Episode Date: July 17, 2023

Summary: The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. In this episode, Madelon Hulsebos tells us all about such a resource! Tune in to learn more about GitTables!

Links: Madelon's website, GitTables homepage, SIGMOD'23 paper, Buy Me A Coffee!

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby. A reminder that if you enjoy the show, please do consider supporting us through Buy Me A Coffee. It really helps us keep making the podcast. It's with great pleasure that I'm joined today by Madelon Hulsebos, who will be telling us everything we need to know about GitTables, a large-scale corpus of relational tables. So Madelon is a PhD student at the Intelligent Data Engineering Lab, or the INDElab, should I say, at the University
Starting point is 00:00:52 of Amsterdam. Madelon, welcome to the show. Hi Jack, it's a pleasure to be here. Thanks for the invitation. Fantastic. So let's jump straight in then. So can you maybe tell us a little bit more about yourself and how you became interested in data management or data engineering research? Yeah, absolutely. So my name is Madelon Hulsebos indeed, and I'm from the Netherlands, actually, so I'm Dutch. And I actually started with a bachelor's in policy analysis, a very different field, but also exciting, doing lots of simulations on data, before I transitioned into computer science and really fell in love with that field, at first during my master's at TU Delft. Well, it was really when the hype around data science and machine learning got started. I think this was back in 2016.
Starting point is 00:01:45 And I decided to really focus on that. So after I graduated, I became a data scientist. Well, actually, not quite. I first did some research at the MIT Media Lab for half a year or something, which was really a great opportunity where I actually developed SHERLOCK, which is a machine learning model
Starting point is 00:02:04 for semantic type detection in tables. And that's actually where my interest in this field started. But I thought, OK, I want to really make tools that are used in practice. So I thought, OK, what is a good opportunity? I really like data science, so let's see if I can really do some kind of research job, but then in industry as a data scientist. So I became a data scientist, and then I realized that actually most of my time was spent on building data validation pipelines, data preparation pipelines, and so on. And in the meantime, I really saw my
Starting point is 00:02:47 work on semantic type detection actually get a lot of impact in practice as well. And people were very interested in this work. So that pulled me back into research. And that's where I started to focus more on data management research. And I think actually there's so much potential in the intersection of AI and data management. And I think we see the signs of that actually, well, now since a year or something with the whole generative AI hype, of course. But there is so much potential when you apply this kind of technology to tables and databases in general. So I'm really excited about it and to continue my research career. Fantastic. That's a great backstory there. You also see, I don't know, the shift over the last five years.
Starting point is 00:03:39 And even in the conference proceedings, you sort of see ML and AI starting to make their way into data management, and all the possible opportunities that are there for this sort of intersection of the two fields, which is great and cool. So let's talk about GitTables then. Give us the elevator pitch for it: what is it, and why do we need it? Yeah, so that ties into, of course, the story that I just shared. So I think it is really important to unlock the value of the data that resides in databases through machine learning. And one thing that you need to train and use machine learning models is, of course, data. And that is actually what motivated the development of GitTables. So GitTables is a corpus of tables extracted from GitHub, in particular, CSV files from GitHub,
Starting point is 00:04:32 because you can find basically anything on there. Yeah, so we, of course, now have only a subset of tables extracted, because there is a long, long, you know, pipeline to go through. But GitHub really, I think it now stores 90 million CSV files, which is huge. So our objective is, of course, to get them all out and make this a fruitful resource for machine learning in data management applications, but also data analysis, for example. So huge potential that we'll probably get into later on. Yeah, for sure. And I kind of had a few questions fall out of that for me there. It's only CSV files, isn't it? But we could maybe touch
Starting point is 00:05:19 on this later on maybe, but what about other types of files that are stored in GitHub? Is that something that you're kind of looking at bringing in as well? For now, I think, my main focus is on CSV files, also because there are so many stored on GitHub. Of course, I checked also, for example, real spreadsheets, Excel files and so on, but the scale is really smaller, and I want to get a corpus that is as big as possible to really make these machine learning models powerful. So that's why we now focus on CSV files. But you can find anything, and I think that's really, you know, the potential that we show as a starting point, but it really depends on the interest of applications there. Sure, sure. So yeah, kind of on that then, how does GitTables compare with
Starting point is 00:06:14 what's already out there, or maybe differ from what's already out there? Yeah, so our problem is that, as I mentioned before, we were working on this model, Sherlock, for semantic type detection on tables, basically mapping a column to a real-world concept. And that actually motivated, well, many people in use cases, so for example, people from Microsoft that wanted to integrate this model into their tools. And one thing we noticed from the feedback is that people were clearly having different data.
Starting point is 00:06:49 So what we see in databases is very different from the data sets that were around and that we use to train Sherlock. This data was actually extracted. So these tables that we trained these models on were actually extracted from the web. So basically web pages and then tables presented on there. But you can imagine that these tables are much smaller and are very, very different from the kinds of tables that we find in databases.
Starting point is 00:07:18 So a few aspects that make these data sets very different are: one, they are much smaller. So tables on the web are much, much smaller. But two, the content is also very different. So tables in CSV files or in other applications, they're typically very messy and they contain way more numeric data. So I think those are a few of the, like, selling points of GitTables, let's say, in the context of other table corpora.
Starting point is 00:07:53 And I think the semantics of what these tables really contain, so the meaning of this data, is also very different. And that's also what we show in the paper, for example. Yeah, so an example of that is that the most common attribute in tables is the ID type, as we call it, while in web tables this is really not one of the most common types around. So I think that clearly demonstrates the difference and the complementary value of GitTables in relation to other data sets. Okay, cool then. So obviously, when you were going through this process of collecting all of these tables out of GitHub, how did you approach them? What were the table design principles you were looking for when you were going about designing GitTables? What was your guiding sort of philosophy with it? Yeah, we had some very, you know, very clear criteria
Starting point is 00:08:50 that we had in mind from the observations that we had on other corpora. So one was we needed many tables to fuel machine learning models, so we needed scale. And second, we needed relevant semantics, so the type of data that you find in databases; we needed kind of coverage there. And then we also needed the semantics as in enriching these tables with metadata that we can use to actually train machine learning models in a supervised way. So we wanted to have kind of annotations on columns to, for example, enable type detection models. So just on the first principle there, the scale, you obviously want a lot of this
Starting point is 00:09:37 so it's useful to machine learning models. What is that tipping point? When does it become useful? How much data do you need before these things actually, I guess, become useful? Yeah, that's a good question. I haven't really run an analysis of that, but what I went with for semantic type detection, for example, is that I wanted to have at least a thousand columns per type. But it really depends on the application, and also on, for example, the fact that now we have all kinds of pre-trained models, right? So they might get far with only, you know, a small data set of tables. And I think therefore, with the million tables that we now have with GitTables, we might actually facilitate fine-tuning of pre-trained models as well,
Starting point is 00:10:26 which have been trained on way more tables, perhaps from the web. So I think that's a good opportunity as well. But yeah, of course, we are really keen to get most of these tables out of there. But it will be a hard task, because GitHub can be really restrictive on the API and the load that it allows. That's a nice little segue there into the next question. So how did you actually go about creating this? Walk us through the construction pipeline. Yeah, so actually it's pretty basic. So we just extract. So our first goal is to extract as many CSV files as possible.
Starting point is 00:11:00 And because GitHub has all kinds of rate limit restrictions, we had to segment our queries. And we do so by adding a filter on the file size. So we only extract files for a certain keyword. We always need to have a keyword, of course, to search GitHub for CSV files. And then, depending on whether a keyword appears in a CSV file, you get the results. But as we also show in the paper, if you look for CSV files with the term ID,
Starting point is 00:11:43 you get 60 million CSV files already. And of course, you cannot just extract them all in one go. So I think GitHub only allows you to go through 1,000 items per query. So we then segmented our queries based on the file size. So we first extracted, you know, files between 50 and 100 kilobytes, for example. So that was the first step. And then, well, when we have all those CSV files, which takes basically most of the time of the entire construction pipeline, from those CSV files we then have to parse them into tables. And that sounds pretty straightforward, but these CSV files are so messy and deviate so much from the CSV standard that you have many comments, for example,
Starting point is 00:12:34 on the first few lines, which is not as we intended, right? So we implemented some heuristics to filter such cases out, but this will still be a challenge, an open challenge. And actually, GitTables is used now to also build better CSV parsers. So I'm really excited about that. But yeah, we then parse these CSV files to tables with a basic parser from Pandas. And then we also curate these tables
Starting point is 00:13:06 based on whether they have PII data, so personally identifiable information. So for example, if we know that the table contains personal data, then we fake some of these values. So for example, we fake the names or the addresses in a given table. And we also filter out tables that do not come from GitHub repositories with a license. So when we first released GitTables, well, I think we had slightly bad timing with the release of Copilot. And Copilot was trained on all code on GitHub, also code without permissive licenses.
Starting point is 00:13:58 But there was a lot of ethical concern around that, which is rightful, right? I think it's really good that we take these considerations into account. But we had some ethics review, actually, that kind of, well, informed us to also filter out tables that didn't come from repositories with a proper license. So that reduced the size of the corpus from 1.6 or 1.7 million to one million. And yeah, so that's one limit, or well, one rule that we applied. And then we had our collection of, well, final tables, let's say. And from there, we also annotated these tables, as I suggested, because we are interested in having column types. And we employed very basic type annotation methods, basically checking the column name, the similarity of the column name
Starting point is 00:14:53 with the types in our ontology, so within our interest. And if there's a syntactic match, then we annotated the column name with the type from schema.org or DBpedia. Yeah. And we also had an embedding-based approach, where we embedded the column name and the types and then just calculated the cosine similarity, and based on that we decided whether there was a match. So yeah, I think that concluded the construction pipeline.
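For readers who want a concrete picture of that last step, here is a rough sketch of column-type annotation by name matching, along the lines Madelon describes: a syntactic check against a small ontology of semantic types, and an embedding-based check using cosine similarity. This is illustrative only, not the actual GitTables code; the type list, the similarity threshold, and the use of a fastText-style embedding function are assumptions.

```python
# Illustrative sketch of column-type annotation by name matching.
# Not the actual GitTables implementation; the type list, threshold,
# and embedding function are assumptions.
import numpy as np

SEMANTIC_TYPES = ["id", "name", "address", "age", "city"]  # assumed subset of an ontology


def syntactic_match(column_name, types=SEMANTIC_TYPES):
    """Return a type if the (normalised) column name matches it exactly."""
    name = column_name.strip().lower()
    return name if name in types else None


def embedding_match(column_name, embed, types=SEMANTIC_TYPES, threshold=0.8):
    """Return the most similar type by cosine similarity, or None below the threshold.

    `embed` is any callable mapping a string to a vector, for example a fastText
    model's word-vector lookup; the actual model and threshold used for
    GitTables are not specified here.
    """
    col_vec = embed(column_name.strip().lower())
    best_type, best_sim = None, threshold
    for t in types:
        t_vec = embed(t)
        sim = np.dot(col_vec, t_vec) / (
            np.linalg.norm(col_vec) * np.linalg.norm(t_vec) + 1e-9
        )
        if sim > best_sim:
            best_type, best_sim = t, sim
    return best_type
```

In a real pipeline this would run over every column of every parsed table, and the resulting annotations would be stored as metadata alongside the tables so that type-detection models can be trained in a supervised way.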
Starting point is 00:15:16 Nice. So how long did it take end-to-end to run this? If I just, I don't know, ran the full thing today, how long are we talking? Yeah, so it's actually months. Yeah, it's very hard to get them all out. And this is because you can only run a small number of queries per hour. Right, okay. So based on our segmentation, because our objective is to get as many CSV files out of there as possible, we just have a very high number of queries. I'm not sure what the number of queries is in total, but we have so many queries, it just takes months to get them out. Wow. It's very topical at the minute, given Elon Musk's recent activity on Twitter, but with rate limits, right? So it's at the forefront of everyone's mind at the minute. In the meantime, I think Microsoft actually bought GitHub, further restricting its rate limits. So I think, yeah, it will take some time before we get to 10 million tables, for example, but this is clearly our objective. So it's still running away today? It's still just churning away in the background? It's currently paused, but I need to resume the extraction pipeline. There was some issue that we are now, like, we're getting mainly the smaller CSV files out of GitHub. So I need to redo the segmentation a little bit and then we can continue, because we also want to have larger tables. And although the average number of columns and rows is already way higher than the average number of columns and rows in web tables, for example, so higher than the web-based table corpora, there are still many more files on GitHub that are much larger than we now have.
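To make the extraction step a bit more concrete, a heavily simplified sketch of one segmented GitHub code-search query, roughly along the lines described above (a keyword, the csv extension, and a file-size window, since each query returns at most about 1,000 results), might look like the following. This is not the actual GitTables pipeline; the keyword, size window, and crude rate-limit handling are assumptions.

```python
# Simplified sketch: segmented GitHub code search for CSV files.
# Assumes a personal access token; the real pipeline uses many keywords
# and finer size windows, plus proper rate-limit handling.
import time
import requests

TOKEN = "ghp_..."  # placeholder personal access token
HEADERS = {"Authorization": f"token {TOKEN}", "Accept": "application/vnd.github+json"}


def search_csv_files(keyword, min_kb, max_kb, pages=10):
    """Yield web URLs of CSV files matching `keyword` within a size window (in KB)."""
    query = f"{keyword} extension:csv size:{min_kb * 1000}..{max_kb * 1000}"
    for page in range(1, pages + 1):  # the search API caps results at roughly 1,000 items
        resp = requests.get(
            "https://api.github.com/search/code",
            headers=HEADERS,
            params={"q": query, "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            break
        for item in items:
            yield item["html_url"]
        time.sleep(10)  # crude pause to stay under the (strict) code-search rate limit


# Example: one size segment for the keyword "id"
# for url in search_csv_files("id", min_kb=50, max_kb=100):
#     print(url)
```

Sweeping the size window (say, 0 to 10 KB, 10 to 50 KB, 50 to 100 KB, and so on) is what lets a pipeline like this get around the per-query result cap, at the cost of many more requests, which is exactly why the full extraction takes months.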
Starting point is 00:17:23 Yeah, on that, what's the frequency with which new tables enter GitHub as well? Like, I guess obviously that's growing over time as well. Are you kind of keeping up with that? Or, I don't know, how fast is new data being deposited in GitHub? Yeah, that's a good question.
Starting point is 00:17:42 So I think so many things on GitHub change every day. It's very hard to keep track of that. And what we also need to do is figure out how we can get rid of duplication, for example. One thing the GitHub search does is that it doesn't return forks, which is good, but you still might have some copies, you know, across different repositories. So that's something that we need to figure out, but yeah, that's for later work, I guess, and for now people just have to deduplicate themselves when they use GitTables. So this whole pipeline, from sort of the parsing to the annotations, it's all, none of it's manual, right? You never have to go in and say, like, okay... It's all automatic? There's no... How would that go with, like, sort of working out,
Starting point is 00:18:33 okay, these first two lines are just text, and to get rid of those? Like, that must have been quite an iterative process to sort of finally finish on something where you can just let it run and figure out all these edge cases, because the state space is huge there. Yeah, so I went through this iteratively, as you said,
Starting point is 00:18:51 just checking what the errors were, what kinds of files couldn't be parsed, and then adjusting the parsing configuration based on that, so that we could still maximize the number of CSV files that we could parse. But yeah, eventually this is a fully automated pipeline, so it runs end-to-end automatically. So that's great. Yeah, it was definitely an iterative process to come to a full pipeline that we were happy with. Yeah, yeah, I can imagine. Because, I mean, people do some crazy stuff, right?
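That iterative tuning of the parsing configuration might, very roughly, look like trying a few increasingly forgiving pandas settings until one of them yields a sensible table. The sketch below is only an illustration of that idea, not the actual GitTables parser; the specific heuristics (delimiter sniffing, skipping comment lines, dropping bad rows) are assumptions based on the conversation.

```python
# Illustrative sketch: parse a messy CSV string with progressively more
# forgiving pandas configurations. Not the real GitTables parser.
import csv
import io
from typing import Optional

import pandas as pd


def sniff_delimiter(text, default=","):
    """Guess the delimiter from the first few KB; fall back to a comma."""
    try:
        return csv.Sniffer().sniff(text[:4096], delimiters=",;|\t").delimiter
    except csv.Error:
        return default


def parse_csv(raw: str) -> Optional[pd.DataFrame]:
    """Try a few read_csv configurations; return None if nothing works."""
    sep = sniff_delimiter(raw)
    configs = [
        dict(sep=sep),                                      # strict
        dict(sep=sep, comment="#", skip_blank_lines=True),  # skip comment lines
        dict(sep=sep, comment="#", on_bad_lines="skip"),    # drop malformed rows
    ]
    for cfg in configs:
        try:
            df = pd.read_csv(io.StringIO(raw), **cfg)
        except Exception:
            continue
        if df.shape[1] > 1:  # crude heuristic: a real table has more than one column
            return df
    return None
```

Each failure case observed in the wild would add another configuration or filter to a list like this, which is the iterative loop described above.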
Starting point is 00:19:27 There's loads of mad stuff out there. I've seen that. Yeah, I mean, this has been quite a while ago, actually, that we first published this data set. But I've seen very interesting things, indeed. I just don't know how some people make CSV files or produce them. It's very hard. But yeah, so as I said, I'm just really glad that some people now also use this data set because we also publish the raw CSV files and they actually use it to build better CSV parsers.
Starting point is 00:20:00 And I think that's really nice, because we need them desperately. And I was actually surprised that I couldn't find a parser that, you know, could just figure out the structure of these CSV files automatically. It's amazing to have that feedback loop. I mean, it's like the most rewarding thing when you're doing research, right, when people actually go and use it. It's the best thing about it, right? It makes it all worthwhile. Absolutely. Just another quick question on the annotation method you used in that kind of step of the pipeline: how computationally
Starting point is 00:20:30 intensive is that? I think, so, we really used basic methods that were very fast. So we used, I think, fastText, which is an embedding model that's, well, very efficient. So I don't think that takes up much time, to be honest. The main time-consuming thing is really extracting CSV files from GitHub through the API. Yeah, yeah. Is there a way you can, I don't know, is there like a payment scheme where you can pay more and you can get a better rate, or is it all just basically, this is it, this is all you're getting and this is the rate? I think actually that's a good point. I think enterprise users might have a more convenient rate limit, to be honest. But I mean, we're at the university, so yeah, I did it from my personal
Starting point is 00:21:21 accounts, for example, with my personal token. So no enterprise budgets there. Yeah, no, you need somebody on the inside at GitHub so they can open up the taps there and you can get it all out faster. I actually asked them, but they said, well, if we want data from GitHub, we also need to go through the API. Oh, really? Oh, man. That sounds quite efficient. Yeah, that's surprising. But I think that's good, actually. Yeah. Cool. Right, so you performed some analysis, which you talk about in your paper, of what you found in kind of the v1 version of GitTables, this 1 million tables. So yeah, what were your findings? The findings... well, first, I was surprised by the diversity of tables that I found. Indeed, as you said, there might be, like, CSV files for school projects, but I also found many, like, database snapshots on, I don't know, NBA players, but also much more biological data and so on, medical data. So I was surprised by the diversity of the semantic coverage there. Another finding about these tables is that, despite expecting more numeric data, it is actually, I think, 58% numeric. And that's something interesting, I think, for future work: to create subsets of the data based on, for example, the distribution of atomic data types, like numeric or string data, but also the semantic distribution, so that we have domain-specific data sets, for example.
Starting point is 00:23:13 But that was something that I found very interesting. Still, the number of numeric columns is larger than we find in tables on the web. But yeah, this was an interesting finding. As I mentioned, I think in the introduction, the top type that we found in tables was the ID type, which I think makes a lot of sense, but that was an interesting finding. Nice, nice. So yeah, you've also sort of taken this and then, to demonstrate the utility and show how it's better than the kind of stuff you can get off web tables or whatever, you've used it in three applications. So can you tell us about these applications, what they were, and kind of what the additional value that GitTables delivered
Starting point is 00:23:59 was? Yeah, so we of course built GitTables to address a need in semantic column type detection, because people needed to retrain their classifiers, and because the data wasn't representative and the types weren't relevant. So what we did is use GitTables for semantic type detection. And as you can see in the paper as well, you can use GitTables very well to train a classifier for a given number of semantic types. And we compared it with VizNet. VizNet is basically a collection of all existing corpora, so tables from the web, tables from open data portals and whatnot. And what I found most interesting about this comparison is that we also trained a semantic type detection model on VizNet and then evaluated it on GitTables.
Starting point is 00:24:56 And there you see that the performance really drops from 0.77 to 0.66. And I think this illustrated to me that indeed all these existing corpora that we find out there don't really generalize to tables that we cannot easily find on the web. So there is a clear data distribution gap
Starting point is 00:25:23 between these existing corpora and GitTables. So I think this was, for me, the most interesting takeaway from this experiment, although we of course also show that you can use GitTables to, you know, train a classifier, training it on GitTables and evaluating it on GitTables as well. But I think this gap was very interesting to me. Yeah, for sure, that's fascinating. I mean, it just goes to show you that there was some degree of sampling bias in the web tables, right? And this sort of shows that the distribution is different. It's not bias, right? It's not like... but I guess maybe it is bias, I'm not sure, I'm not sure on the correct terminology. I know there's sampling bias for sure, but
Starting point is 00:26:04 yeah, anyway, it's been a long time since I did statistics and machine learning and all those sorts of things. So yeah. Yeah, I can imagine. Yeah, I mean, I also found very different results across different sets, but I think what always remained constant is this gap in generalizability from models trained on VizNet to GitTables. So I think that really illustrates the complementary value of GitTables. And I think that's pretty cool. And that actually ties into the second application that we considered, which is actually benchmarking. So with GitTables, you can extract many subsets based on the application need that you have.
Starting point is 00:26:46 So you might find very large tables in there. You might find smaller tables. You might find different atomic data type distributions and just filter based on that. That's something I'm involved in, another project where we do that. Or you might filter down on domains. So I think that's cool. But we integrated GitTables in the SemTab challenge, which stands for the Semantic Table to Knowledge Graph Matching Challenge,
Starting point is 00:27:17 where we try to, well, enhance knowledge graphs based on data found in tables. So this is a challenge that runs at ISWC. And there we've always been using tables from the web, which are more easily linked to knowledge graphs. But what was very interesting, what we found there, is that when you try to do this for GitTables, you don't have this one-to-one match between, for example, column cell values, or cell values and, you know, entities on Wikidata or DBpedia or something like that. So
Starting point is 00:28:00 I think what we saw there is that the performance of these matching-based systems really dropped tremendously when we evaluated them on GitTables. And in the second year, actually, when we ran the same competition with GitTables, we saw that the systems are now actually better able to generalize to GitTables as well, so they don't lean as much on just matching strings to each other, which is very straightforward and obvious for tables on the web. Nice, nice. Yeah, I mean, it seems like a really good contribution to the area. It's delivering a lot of value on so many different fronts.
Starting point is 00:28:48 I can see it being very popular for many applications for many years to come. And I guess, where do you go next with it now then, in addressing its existing limitations to deliver more value? Yeah, so I think for GitTables, what lies ahead for me is to just get all the CSV files, right? So we want to have an even larger corpus. So I think that's the main future work for GitTables, although I really invite people to contribute, for example, or let me know if they have a better parser so that we can
Starting point is 00:29:24 redo the parsing, for example. I think I'm interested in creating different subsets of GitTables as well. As we said, perhaps some domain-specific subsets. But I think there's a lot of interesting potential in the applications of GitTables. So, for example, you can think of SQL recommendation given certain tables, right? What kind of analysis can you do on them? But on the data management side, I think you can also inform, for example, query optimizers if you know the semantics of these tables. So I think there are many applications still to explore
Starting point is 00:30:07 And I think GitTables can be a useful resource to do so. Nice. Yeah, I guess, I mean, here's a question for you: do you have an estimated deadline? Say we resume the pipeline today, do you know when you'll get them all? Is there a future date where it's like, I'm going to have it, assuming that you're keeping up with the pace of new data coming in, obviously? Is it like, okay, I don't know, 2025, September the 6th, that's the day? Ah, okay. I think actually I'm a little bit delayed, because we actually aimed to have them already in 2023. I think we won't make it with the current rate limitations, but I expect and I hope to have at least another version of GitTables, a much larger one, in 2024,
Starting point is 00:30:57 probably near the end. That's the release date to look out for then. This discussion is actually a great motivator to, you know, resume the pipeline and get back to it. Get things going again. Awesome. Cool. So I know this has obviously had a very big impact so far, and there are people using it to kind of improve CSV parsers and things like that.
Starting point is 00:31:20 So, I mean, kind of bigger picture, what more impact do you think your work can have? We touched on it a little bit, but also, how can people in their day-to-day working lives leverage the things you found and use GitTables? Yeah, so I think actually across the entire analysis pipeline there are so many applications to explore, because many of these tasks, from, you know, data exploration to data storage, all the way to data analysis, data visualization, and so on, they all operate on tables. And I think that, you know, so many tasks are part of this pipeline that can benefit
Starting point is 00:32:01 from learned models over tables. So that's something that I am really trying to push a little bit, to start exploring more applications across this pipeline. I'm also organizing, by the way, a workshop on this at NeurIPS, on table representation learning. And I think it's really interesting to see applications such as question answering. And I think that can also be very interesting to try out for GitTables. But yeah, I think there's just huge potential in trying different applications.
Starting point is 00:32:37 And for example, data validation is another one that I'm really interested in. So for example, can we predict relevant data validation rules from the contents, right? So if we see different configurations of data validation pipelines for given data sources, then we might be able to infer reasonable rules for new data sources. So I think that's something that I'm interested in as well. And I think actually that would be one of the examples
Starting point is 00:33:10 that have major impact in practice as well. And I think actually, you know, my research has been really driven by practice. So what actually drove me back into my PhD from being a data scientist is the feedback that I got from people in practice that were using semantic type detection models. And I think there's just so much potential, given that the entire data landscape is dominated by tables, right? So I think there are just, you know, so many practical applications possible if we use this data source. Right, I mean, I think it's nice as well that you kind of
Starting point is 00:33:52 you've been out there in the wild, in industry, and seen there's a need for it, and then said, okay, I'm going to come and address this. I think that when you've got that sort of bigger-picture view, it makes the day-to-day grind of the PhD so much more, sort of, I don't know, tolerable in a way, because, you know, I'm working towards something that's going to be of use to people, right? Absolutely, it's a great motivator. And it was really helpful to have already done a little bit of research before I started, to inform my proposal, basically, for my PhD research. I think that's been a really good decision there. Yeah, awesome. As a user, how do I go about using GitTables?
Starting point is 00:34:33 Where's it hosted? How can I go and get the data, basically? So we currently host this data set on Zenodo, and they will make sure that this data persists over time. It's publicly accessible. They have an API. I'm not sure how stable it is, but it should be very easy to get this data out of Zenodo, actually. We publish it in subsets. So as I expressed, we have these topics that we use to query GitHub, and we publish the tables per query topic, let's say. You can find it.
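For listeners who want to play with it, loading a downloaded subset might look roughly like the sketch below. This assumes you have already fetched and unzipped one of the topic archives from Zenodo and that the tables inside are stored as parquet files; check the documentation linked from gittables.github.io for the actual layout and file names.

```python
# Minimal sketch: load a downloaded GitTables subset into pandas DataFrames.
# Assumes an unzipped topic archive containing parquet files; verify the
# actual layout against the GitTables documentation.
from pathlib import Path

import pandas as pd

subset_dir = Path("gittables_subset")  # wherever you unzipped the archive

tables = {}
for path in sorted(subset_dir.glob("*.parquet")):
    tables[path.stem] = pd.read_parquet(path)  # needs pyarrow or fastparquet installed

print(f"Loaded {len(tables)} tables")
for name, df in list(tables.items())[:3]:
    print(name, df.shape, list(df.columns)[:5])
```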
Starting point is 00:35:12 I'm not sure, so there is some code, but there is also a website, gittables.github.io. And from there you can basically find the paper, some analysis, some documentation, but also the links to the data set. Awesome, we'll be sure to link that in the show notes as well, so the interested listener can go and find it and have a play around with it. Perfect. Yeah, so on this journey you've been on with GitTables, what's probably the most interesting thing you've learned while working on it? I think what was interesting to me was the range of applications that we could serve.
Starting point is 00:35:51 So I really started with the intention, or the purpose, of using this data set for semantic type detection, for table understanding. But along the way, and during the rest of my PhD, I figured that there are so many other applications that can be fueled by GitTables and by models over GitTables. And I think that was one of the key lessons, right? So I got a lot of visibility into all the application potential. For example, the CSV parsing project, which I think really opened my eyes, right? So I think that was pretty cool. Awesome. Yeah, I want some war stories off you now. So, kind of,
Starting point is 00:36:34 across this journey, what were the things that you tried that failed? What were the dead ends? What can you, I don't know, share to stop people going down the same kind of wrong path, maybe? Yeah, what were the war stories? Yeah, that's a great question. So I think this project lasted over seven to eight months in total, and during the first two weeks I already discovered the great value of GitHub, after a week of thinking, like, okay, where can we find relevant data? And then I started exploring different data sources. And then I found, okay, we can actually use GitHub. It's like this pot of gold sitting there. But instead of starting to extract all of these tables, we first explored the direction of trying to replicate the semantics in these tables with data from Wikidata.
Starting point is 00:37:35 And that's, of course... I mean, it made sense back then. Okay, let's synthesize tables that look like tables from GitHub but aren't. It made sense back in the day, but now I think, okay, that was a completely bad idea, and that actually took most of the time of this project, trying to figure out how we can replicate the tables that we found on GitHub. One reason why we did this is because if you try to synthesize these tables, you have the ground-truth metadata. So we could then use the kind of structure that we had in Wikidata, for example, to make
Starting point is 00:38:20 sure that we know that all the data that we would then put in, like cell values, for example, actually resembled or were associated with the types. And now we just annotate the tables that we extracted from CSV files on GitHub, which is a bit more noisy, but yeah, that was why we actually took that other direction initially. So of that seven-to-eight-month sort of journey, how far through was this, did you say? Did the lion's share of the time go before you changed, basically? Yeah, I think... wow, I cannot really remember the exact time spent on that alternative direction, but I think a couple of months, probably two to three, maybe four months, that I spent on that
Starting point is 00:39:05 and then started extracting all these tables. Yeah, it's always hard, right, when you've gone so far down, when you've spent so much time, you kind of just want to force that thing to work. But sometimes you've just got to roll it back, right, and say, okay, it's going in a different direction. But yeah, it's hard to do that sometimes, for sure. Yeah. Cool. So obviously you do a lot of other things other than just GitTables, so can you maybe tell the listener a little bit more about the other research you're working on and other things you've got going on? Yeah, absolutely. So I'm generally interested in learning from tables, and of course now I've been focusing on table understanding.
Starting point is 00:39:45 So, for example, semantic type detection: one low-hanging-fruit project that we had there, which was actually also driven by feedback from industry, was, okay, how can we adapt these models to custom types? So for example, if we want to have semantic type detection in Power BI, for example, or Excel, then how can we allow users of these tools to add their custom types? So that's something that I'm working on now. But that's, like I said, low-hanging fruit, although very impactful in practice. Another project is more analysis-focused.
Starting point is 00:40:27 So in the meantime, we've seen quite some pre-trained models over tables. So really representation learning for tables, for example, for question answering. And it is still very unknown how these models actually work. And I think that's generally the case with many of these representation learning models or generative AI. And I think that's something that is worth exploring as well. So that's what I'm working on as well. And then going forward, I think I will first finish my PhD, hopefully this year. And then I'm very keen on exploring more, you know, more applications of table representation learning.
Starting point is 00:41:18 Fantastic. Yeah, the explainable AI sort of stuff is fascinating, right? I mean, it reminds me of a book I read a while back, the Weapons of Math Destruction book, which I guess is sort of a little bit in that area. I don't know if you've ever read it; it's a really good read if you're interested, I'd recommend it. I will, yeah. It's interesting, yeah. A lot of it tackles fairness and then explainable AI, but it's about working out what these black-box models are actually doing and being able to give a reason. But yeah, it's cool. But yeah, so you're going to finish the PhD, and what's next after that? Are you going to stick around in research, or go back to industry, or a hybrid role, I don't know? What's the dream? When I started the PhD, I thought, okay, I will become a research scientist in industry, but actually I will probably stay around in academia. So you will hear from me.
Starting point is 00:42:05 Fantastic. That's great stuff. Yeah, I think, so I think there's just such a potential of representation learning, machine learning over tables in this whole, you know, analysis pipeline, for example, that it's too early to quit.
Starting point is 00:42:23 Yeah, fantastic. Cool. And I guess, going on to this then. So this next question, by the way, is my favorite question. I love hearing people's answers to it. It's all about your creative process: how you go about generating ideas and then selecting what things to work on, and then obviously, maybe as well, knowing when to pull back from an idea like you did with this project. So yeah, tell me all about that. How do you approach it? That's interesting. Yeah, so I don't really have a structured approach to generating ideas. I just take time to think. So what motivates me in research is that I love to think and solve problems, find the important questions to answer. So I think I do have some kind of prioritization approach, which comes down to, okay, what is impactful in practice? What do people really need? I think there's a societal aspect to that as well. And yeah, so that's also, you know, how I use the feedback that I get from people using the products that I build in practice. I use that feedback to inform me on the interesting or the hardest challenges that they have. So that's something that inspires me. And then, I think, yeah, I just take a lot of time to think about how to address a certain idea. Because, of course, I started my PhD on the idea, okay, we need data sets, so I was very sure of that, but then, yeah, just taking the time to think really well through, you know, what the proper data source would be and so on. I think that's worth thinking about very well. Yeah, and as I said, I really like that, the thinking about what impact it can have. What problem am I going to solve for somebody? And having that as kind of a key cornerstone of your thinking is, yeah, I really like that angle of it.

Cool. That's great. So, another answer to that question, I love it. I've got a massive collection of them all. It's great to hear how everyone's different; everyone has a different answer to that question. Well, yeah, so it's time for the last word now. So what's the one takeaway you want the listener to get from this podcast today? I hope that people understand the potential impact of learning over tables. Because databases and the whole data landscape are really dominated by tables, and we should stop learning about images and videos, even plain text maybe, and start learning over tables. Fantastic, well, let's end it there. Thanks so much, Madelon, for coming on; it's been a pleasure to talk to you. Thank you, Jack. If the listener is interested to know more about Madelon's work, we'll put the links and everything in the show notes so you can go and find those. And again, if you do enjoy the show, please do consider supporting us through Buy Me A Coffee, and we'll see you all next time for some more awesome computer science research. Thank you.
