Disseminate: The Computer Science Research Podcast - Madelon Hulsebos | GitTables: A Large-Scale Corpus of Relational Tables | #36
Episode Date: July 17, 2023
Summary: The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. In this episode, Madelon Hulsebos tells us all about such a resource! Tune in to learn more about GitTables!
Links: Madelon's website, GitTables homepage, SIGMOD'23 paper, Buy Me A Coffee
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate the Computer Science Research Podcast.
I'm your host, Jack Wardby.
A reminder that if you enjoy the show, please do consider supporting us through Buy Me A Coffee.
It really helps us keep making the podcast.
It is with great pleasure that I'm joined today by Madelon Hulsebos,
who will be telling us everything we need to know about Git tables,
a large-scale corpus of relational tables. So Madelon is a
PhD student at the Intelligent Data Engineering Lab, or INDElab, should I say, at the University
of Amsterdam. Madelon, welcome to the show. Hi Jack, it's a pleasure to be here. Thanks for
the invitation. Fantastic. So let's jump straight in then. So can you maybe tell us a little bit
more about yourself and how you became interested in data management or data engineering research? Yeah, absolutely. Yeah, so my name is
Madelon Hulsebos indeed. And I'm from the Netherlands, actually, so I'm Dutch. And I
started actually with a bachelor in policy analysis, a very different field, but also exciting, doing lots of simulations on data, before I transitioned into computer science and really fell in love with that field, at first during my master's at TU Delft. Well, it was really when the hype around data science and machine learning really got started. I think this was back in 2016.
And I decided to really focus on that.
So after I graduated, I became a data scientist.
Well, actually, not quite.
I first did some research at the MIT Media Lab
for half a year or something,
which was really a great opportunity
where I actually developed SHERLOCK,
which is a machine learning model
for semantic type detection in tables.
And that's actually where my interest in this field started.
But I thought, OK, I want to really make tools that are used in practice.
So I thought, OK, what is a good opportunity?
I really like data science, so let's see if I can really do
some kind of research job, but then in industry as a data scientist. So I became a data scientist,
and then I realized that actually most of my time was spent on building data validation pipelines,
data preparation pipelines, and so on. And in the meantime, I really saw my
work on semantic type detection actually get a lot of impact in practice as well. And people
were very interested in this work. So that pulled me back into research. And that's where I started
to focus more on data management research.
And I think actually there's so much potential in the intersection of AI and data management.
And I think we see the signs of that actually, well, now since a year or something with the whole generative AI hype, of course.
But there is so much potential when you apply this kind of technology to tables and databases in general.
So I'm really excited about it and to continue my research career.
Fantastic. That's a great backstory there. You also see, I don't know, the shift over the last five years.
And even in the conference proceedings, it's sort of like ML and AI starting to make their way into data management, and all the possible opportunities that are there for this sort of intersection of the two fields, which is great and cool.
So let's talk about GitTables. Give us the elevator pitch for it. What is it? Why do we need it?
Yeah, so that ties into, of course, the story that I just told. So I think it is really important to unlock the value of the
data that resides in databases through machine learning. And one thing that you need to train
and use machine learning models is, of course, data. And that is actually what motivated the development of GitTables. So GitTables is a corpus of tables extracted from GitHub,
in particular, CSV files from GitHub,
because you can find basically anything on there.
Yeah, so we, of course, now have only a subset of tables extracted, because there is a long, you know, pipeline to go through. But GitHub really, I think it stores now 90 million CSV files, which is huge. So our objective is, of course, to get them all out and make this a fruitful resource for machine learning in data management applications, but also data analysis, for example. So
huge potential that we'll probably get into later on.
Yeah, for sure. And I kind of had a few questions fall out of that for me there. It's only CSV files, is it? We could maybe touch on this later on, but what about other types of files that are stored in GitHub? Is that something that you're kind of looking at bringing in as well?
For now, my main focus is on CSV files, also because there are so many stored on GitHub. Of course, I checked also, for example, real spreadsheets, Excel files, and so on, but the scale is really smaller. And I want to get a corpus that is as big as possible to really make these machine learning models powerful. So that's why for now we focus on CSV files. But you can find anything, and I think that's really, you know, the potential that we show as a starting point. But it really depends on the interest of applications there.
Sure, sure. So, kind of on that then, how does GitTables compare with what's already out there, or maybe differ from what's already out there?
Yeah, so our problem was that, as I mentioned before, we were working on this model, Sherlock, for semantic type detection on tables, basically mapping a column to a real-world concept.
And that actually motivated, well, many people in use cases,
so for example, people from Microsoft
that wanted to integrate this model into their tools.
And one thing we noticed from the feedback
is that people were clearly having different data.
So what we see in databases is very different
from the data sets that were around
and that we use to train Sherlock.
This data was actually extracted.
So these tables that we trained these models on
were actually extracted from the web.
So basically web pages and then tables presented on there. But you can imagine that these tables
are much smaller and are very, very different from the kinds of tables that we find in databases.
So a few aspects that make these data sets very different is one,
they are much smaller.
So tables on the web are much, much smaller,
but two also the content is very different.
So tables in CSV files or in other applications,
they're typically very messy
and they contain way more numeric data.
So I think those are a few of the, like, selling points of GitTables, let's say, in the context of other table corpora.
And I think the semantics of what these tables really contain, so the meaning of this data is also very different.
And that's also what we show in the paper, for example. Yeah, so an example of that is that the most common attribute in GitTables is the ID type, as we call it.
While in web tables, this is really not one of the most common types around.
So I think that clearly demonstrates the difference and complementary value of GitTables in relation to other data sets.
Okay, cool then. So obviously, when you were going through this process of collecting all of these tables out of GitHub, how did you approach them? What were the table design principles you were looking for when you were going about designing GitTables? What was your guiding sort of philosophy with it?
Yeah, we had some very, you know, very clear criteria that we had in mind from the observations that we had on other corpora. So one was we needed many tables to fuel machine learning models, so we needed scale. And second, we needed relevant semantics.
So the type of data that you find in databases.
And we also needed kind of coverage. And then we also needed the semantics, as in enriching these tables with metadata that we can use to actually train machine learning models in a supervised way. So we wanted to have kind of annotations on columns to, for example, enable type detection models.
So just on the first principle there, the scale: you obviously want a lot of this so it's useful to machine learning models. What is that tipping point? When does it become useful? How much data do you need before these things actually, I guess, yeah, become useful?
Yeah, that's a good question. I haven't really run an analysis of that, but what I went with for semantic type detection, for example, is that I wanted to have at least a thousand columns per type. But it really depends on the application, but also on, for example,
now we have all kinds of like pre-trained models, right? So they might get far with only,
you know, a small data set of tables. And I think therefore with these million tables that we now
have with Git tables, we might actually facilitate fine tuning of pre-trained models as well,
which have been trained on way more tables, perhaps from the web. So I think that's a good
opportunity as well. But yeah, of course, we are really keen to get most of these tables out of
there. But it will be a hard task because GitHub can be really restrictive on the API
and the load that it allows.
That's a nice little segue there into the next question.
So how did you actually go about creating this and walk us through the construction pipeline?
Yeah, so actually it's pretty basic.
So we just extract.
So our first goal is to extract as many CSV files as possible. And because GitHub has all kinds of rate limit restrictions, we had to segment our queries.
And we do so by adding a filter on the file size.
So we only extract files for a certain keyword.
We always need to have a keyword, of course, to search GitHub for CSV files.
And then depending on whether a keyword appears
in a CSV file, you get the results.
But as we also show in the paper,
if you look for CSV files with the term ID,
you get 60 million CSV files already. And of course, you cannot just extract them all in one go. So I think GitHub only allows you to go through 1,000 items per query. So we then segmented our queries based on the file size. So we first extracted, you know, files between 50 and 100 kilobytes,
for example. So that was the first step. And then, well, when we have all those CSV files,
which takes basically most of the time of the entire construction pipeline, from those CSV
files, we then have to parse them to tables. And that sounds pretty straightforward, but these CSV files are so messy and deviate so much from the CSV standard that you have many comments, for example, on the first few lines, which is not as we intended, right?
So we implemented some heuristics to filter such cases out,
but this will still be a challenge, an open challenge.
And actually, GitTables is used now to also build better CSV parsers.
So I'm really excited about that.
But yeah, we then parse these CSV files to tables
with a basic parser from Pandas.
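A minimal sketch of that lenient parsing step might look like the following; the comment-skipping heuristic and the pandas options here are illustrative assumptions, not the exact rules used to build GitTables:

```python
import io
from typing import Optional

import pandas as pd

def parse_messy_csv(text: str) -> Optional[pd.DataFrame]:
    """Parse a CSV string into a table, tolerating leading comment lines."""
    lines = text.splitlines()
    # Heuristic (assumed, for illustration): skip leading lines that look
    # like comments or free text, e.g. starting with '#' or with no delimiter.
    start = 0
    for line in lines:
        if line.startswith("#") or ("," not in line and ";" not in line):
            start += 1
        else:
            break
    try:
        return pd.read_csv(
            io.StringIO("\n".join(lines[start:])),
            sep=None,              # let pandas sniff the delimiter
            engine="python",       # required for sep=None
            on_bad_lines="skip",   # drop rows that don't fit the schema
        )
    except Exception:
        return None  # unparseable file: drop it from the corpus

raw = "# exported 2021\n# source: demo\nid,name\n1,alice\n2,bob\n"
df = parse_messy_csv(raw)
# df has columns ['id', 'name'] and 2 rows
```

The point is simply that the parser must fail softly: any file it cannot make sense of is dropped rather than crashing the pipeline.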
And then we also curate these tables
based on whether they have PII,
that is, personally identifiable information.
So for example, if we know that the table contains personal data,
then we fake some of these values.
So for example, we fake the names or the addresses in a given table.
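That faking step could be sketched roughly like this; the PII column names and the tiny replacement pools are hypothetical stand-ins for a proper PII detector and a dedicated faking library:

```python
import random

# Columns treated as PII, with stand-in value pools. A real pipeline would
# use a proper detector and a faking library instead of these small lists.
FAKE_POOLS = {
    "name": ["Alex Smith", "Sam Jones", "Kim Lee"],
    "address": ["1 Main St", "2 Oak Ave", "3 Elm Rd"],
}

def mask_pii(rows, seed=0):
    """Replace values in PII-named columns with fake ones, keeping the schema."""
    rng = random.Random(seed)
    masked = []
    for row in rows:
        new_row = dict(row)  # copy, so the original table is untouched
        for col in row:
            if col.lower() in FAKE_POOLS:
                new_row[col] = rng.choice(FAKE_POOLS[col.lower()])
        masked.append(new_row)
    return masked

table = [{"name": "Jane Doe", "city": "Utrecht"},
         {"name": "John Roe", "city": "Delft"}]
safe = mask_pii(table)
# Non-PII columns are untouched; the 'name' column now holds fake values.
```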
And we also filter out tables that do not come from GitHub repositories with a license.
So when we first released GitTables, we were actually, well, I think we had slightly bad timing with the release of Copilot. And Copilot was trained on all code on GitHub, also code without permissive licenses. But there was a lot of ethical concern around that, which is rightful, right? I think it's really good that we take these considerations into account. But we had an ethics review, actually, that kind of, well, informed us to also filter out tables that didn't come from repositories with a proper license. So that reduced the size of the corpus from 1.6 or 1.7 million to one million. And yeah, so that's one rule that we applied. And then we had our collection of, well, final tables, let's say.
And from there, we also annotated these tables, as I suggested, because we are interested in
having column types. And we employed very basic type annotation methods, basically checking the similarity of the column name with the types in our ontology. So, within our types of interest, if there's a syntactic match, then we annotated the column name with the type from Schema.org or DBpedia. Yeah. And we also had an embedding-based approach, where we embedded the column name and the types and then just calculated the cosine similarity. And based on that, we decided whether there was a match.
So yeah, I think that concluded the construction pipeline.
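The syntactic side of that annotation step could be sketched like this; the toy ontology and the threshold are assumptions, and the stdlib string ratio here stands in for the actual combination of syntactic matching and FastText embeddings with cosine similarity:

```python
from difflib import SequenceMatcher

# Toy ontology standing in for the Schema.org/DBpedia types used for GitTables.
ONTOLOGY = ["id", "name", "address", "date", "price"]

def annotate_column(column_name, threshold=0.8):
    """Return the ontology type whose name is most similar to the column name,
    or None if no candidate clears the threshold."""
    cleaned = column_name.strip().lower().replace("_", " ")
    best_type, best_score = None, 0.0
    for t in ONTOLOGY:
        score = SequenceMatcher(None, cleaned, t).ratio()
        if score > best_score:
            best_type, best_score = t, score
    return best_type if best_score >= threshold else None

annotate_column("Address")   # matches 'address' after normalisation
annotate_column("foo_bar")   # nothing similar enough, so no annotation
```

The embedding-based variant would replace `SequenceMatcher` with cosine similarity between, say, FastText vectors of the column name and each type name, but the thresholding logic stays the same.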
Nice.
So how long did it, how long did it take end-to-end to run this?
If I just, I don't know, run the full thing today,
like how long are we talking?
Yeah, so it's actually months.
Yeah, it's very hard to get them all out.
And this is, so you can only run a small number of queries per hour.
Right, okay.
So based on our segmentation, because our objective is to get as many CSV files out there as possible, we just have a very high number of queries. I'm not sure what the number of queries is in total, but we have so many queries, it just takes months to get them out.
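The size-based query segmentation described above can be sketched as follows; the exact `size:` qualifier syntax and the byte boundaries are illustrative, so check GitHub's code-search documentation before relying on them:

```python
def segment_queries(keyword, max_bytes=1_000_000, step=50_000):
    """Split one code search into many size-bounded queries, so each one
    stays under the per-query result cap (roughly 1,000 items on GitHub)."""
    queries = []
    low = 0
    while low < max_bytes:
        high = low + step
        # e.g. 'id extension:csv size:0..50000' (illustrative syntax)
        queries.append(f"{keyword} extension:csv size:{low}..{high}")
        low = high
    return queries

qs = segment_queries("id", max_bytes=150_000)
# -> ['id extension:csv size:0..50000',
#     'id extension:csv size:50000..100000',
#     'id extension:csv size:100000..150000']
```

Each generated query is then paged through separately, which is why the extraction is bounded by the rate limit rather than by bandwidth.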
Wow. It's very topical at the minute, given Elon Musk's recent activity on Twitter, but with rate limits, right?
So it's at the forefront of everyone's mind at the minute.
I think Microsoft actually bought GitHub in the meantime, further restricting its rate limits.
So I think, yeah, it will take some time before we get to 10 million tables, for example, but this is clearly our objective.
So is it still running away today, still just churning away in the background?
It's currently paused, but I need to resume the extraction pipeline. There was some issue where we were mainly getting the smaller CSV files out of GitHub. So I need to redo the segmentation a little bit and then we can continue, because we also want to have the larger tables.
Although the average number of columns and rows is already way higher than in web tables, for example, so already higher than the web-based table corpora, there are still many more files out on GitHub that are much larger than the ones we now have.
Yeah, on that, so what's the frequency at which new tables enter GitHub as well? Like, I guess obviously that's growing over time as well. Are you kind of keeping up with that? Or, like, I don't know how fast new data is being deposited in GitHub.
Yeah, that's a good question.
So I think so many things on GitHub change every day, it's very hard to keep track of that. And what we also need to do is figure out how we can get rid of duplication, for example. What the GitHub API does is that it doesn't return forks, which is good, but you still might have some copies, you know, across different repositories. So that's something that we need to figure out. But yeah, that's for later work, I guess, and for now people just have to deduplicate themselves when they use GitTables.
So this whole pipeline, from the parsing to the annotations, none of it's manual, right?
You never have to go in and say like, okay,
it's all automatic.
There's no, how would that go with like sort of working out,
okay, these first two lines are just text
and to get rid of those.
Like that must've been quite an iterative process to sort of finally finish on something where you can just let it run and figure out all these edge cases,
because the state space is huge there.
Yeah, so I went through this iteratively, as you said, just checking out what the errors were, what kind of files couldn't be parsed, and then adjusting the parsing configuration based on that, so that we could still maximize the number of CSV files that we could parse. But yeah, eventually this is a fully automated pipeline, so it runs end-to-end automatically.
So that's great. Yeah, it was definitely an iterative process to come to a full pipeline that we were happy with.
Yeah, yeah, I can imagine.
Because, I mean, people do some crazy stuff, right?
There's loads of mad stuff out there.
I've seen that.
Yeah, I mean, this has been quite a while ago, actually,
that we first published this data set.
But I've seen very interesting things, indeed.
I just don't know how some people make CSV files or produce them. It's very hard.
But yeah, so as I said, I'm just really glad that some people now also use this data set
because we also publish the raw CSV files and they actually use it to build better CSV parsers.
And I think that's really nice because we need them desperately. And I was actually surprised that I couldn't find a parser that, you know, just could figure out the structure of these CSV files automatically.
It's amazing to have that feedback loop.
I mean, it's like the most rewarding thing when you're doing research, right, is actually people go and use it.
It's the best thing about it, right?
It makes it all worthwhile.
Absolutely.
Just another quick question on the annotation method you used in that step of the pipeline: how computationally intensive is that?
I think, so, we really used basic methods that were very fast. So we used, I think, fastText, which is an embedding model that's, well, very efficient. So I don't think that takes up much time, to be honest. The main time-consuming thing is really extracting CSV files from GitHub through the API.
Yeah, yeah. Is there a way you can, I don't know, is there like a payment scheme where you can pay more and get a better rate? Or is it all just basically, this is it, this is all you're getting, this is the rate?
I think actually that's a good point. I think enterprise users might have a more convenient rate limit, to be honest. But I mean, we're at the university, so yeah, I did it from my personal account, for example, with my personal token. So no enterprise budgets there.
Yeah, no, you need somebody on the inside at GitHub so they can open up the taps and you can get it all out faster.
I actually asked them, but they said,
well, if we want data from GitHub,
we also need to go through the API.
Oh, really?
Oh, man.
That sounds quite efficient! Yeah, that's surprising. Cool, but I think that's good. Yeah, cool.
Right, so you performed some analysis, which you talk about in your paper, of what you found in the v1 version of GitTables, this one million tables. So what were your findings?
Well, first, I was surprised by the diversity of tables that I found. Indeed, as you said, there might be, like, CSV files for school projects, but I found also many, like, database snapshots on, I don't know, NBA players, or also much more biological data and so on, medical data. So I was surprised by the diversity of the semantic coverage there. Another interesting finding about these tables is that, despite expecting more numeric data, it is actually, I think, 58% that is numeric.
And that's something interesting, I think,
for future work to create subsets of data based on,
for example, the distribution of like atomic data types,
like numeric or string data,
but also semantic distribution, so that we have domain-specific data sets, for example.
But that was something that I found very interesting. Still, the number of numeric columns is larger than what we find in tables on the web. But yeah, this was an interesting finding. As I mentioned in the introduction, the top type that we found in tables was the ID type, which I think makes a lot of sense, but that was an interesting finding.
Nice, nice. So you've also sort of taken this, and then, to demonstrate the utility and how it's better than the kind of stuff you can get off web tables or whatever, you've used it in three applications. So can you tell us about these applications, what they were, and what the additional value that GitTables delivered was?
Yeah, so we of course built GitTables to address a need in semantic column type detection, because people needed to retrain their classifiers, because the data wasn't representative and the types weren't relevant. So what we did is use GitTables for semantic type detection. And as you can see in the paper as well, you can use GitTables very well to train a classifier for a given number of semantic types. And we compared it with VizNet. VizNet is a collection, basically, of all existing corpora: tables from the web, tables from open data portals, and whatnot. And what I found most interesting about this comparison
is that we also trained a semantic type detection model on VizNet and then evaluated it on GitTables.
And there you see that the performance really drops
from 0.77 to 0.66.
And I think this illustrated to me
that indeed all these existing corpora
that we find out there
don't really generalize to tables
that we cannot easily find on the web.
So there is a clear data distribution gap between these existing corpora and GitTables. So I think this was, for me, the most interesting takeaway from this experiment, although we of course also show that you can use GitTables to, you know, train a classifier, training it on GitTables and evaluating it on GitTables as well. But I think this gap was very interesting to me.
Yeah, for sure, that's fascinating. I mean, it just goes to show you that there was some degree of sampling bias in the other corpora, in the web tables, right? And this sort of finds out the distribution is, well, it's not bias, right? Or I guess maybe it is bias. I'm not sure of the correct terminology; I know there's sampling bias for sure. But yeah, anyway, it's been a long time since I did statistics and machine learning and all those sorts of things.
Yeah, I can imagine. Yeah, I mean, I found also very different results across different sets, but I think what always remained constant is this gap in generalizability from models trained on VizNet to GitTables.
So I think that really illustrates the complementary value of Git tables. And I think
that's pretty cool. And that actually ties into the second application that we considered,
which is actually benchmarking. So I think Git tables, you can extract many subsets
based on the application need that you have.
So you might find very large tables in there.
You might find smaller tables.
You might find different like atomic data type distributions and just filter based on that.
So that's what I'm involved in another project where we do that.
Or you might filter down on domains.
So I think that's cool.
But we integrated GitTables in the SemTab challenge, which stands for the Semantic Table to Knowledge Graph Matching Challenge, where we try to, well, enhance knowledge graphs based on data found in tables. So this is a challenge that runs at ISWC. And there we've been always using tables from the web,
which more easily are linked to knowledge graphs.
But what was very interesting, what we found there
is that when you try to do
this for GitTables, you don't have this one-to-one match between, for example, cell values and, you know, entities in Wikidata or DBpedia or something like that. So
I think what we saw there is that the performance of these matching-based systems really dropped tremendously when we evaluated them on GitTables. And in the second year, actually, when we ran the same competition with GitTables, we see that now the systems are actually better able to generalize to GitTables as well, so they don't lean as much on just matching strings to each other, which is very straightforward and obvious for tables on the web.
Nice, nice. Yeah, I mean, it seems like a kind of good contribution to the area.
It's delivering a lot of value on so many different fronts.
I can see it being very popular for many applications for many years to come.
And I guess, where do you go next with it now then?
And addressing the existing limitations of it to deliver more value?
Yeah, so I think for GitTables, what lies ahead for me is to just get all the CSV files,
right?
So we want to have an even larger corpus.
So I think that's the main future work for GitTables, although I really invite people
to contribute, for example, or let me know if they have a better parser so that we can
redo the parsing, for example.
I think I'm interested in creating different subsets of Git tables.
Also, well, as we said, perhaps some domain-specific subsets.
But I think there's a lot of interesting potential in the applications of GitTables.
So, for example, you can think of SQL recommendation
given certain tables, right? What kind of analysis can you do on them? But I think on the data
management side, I think you can also inform, for example, query optimizers if you know the
semantics of these tables. So I think there are many applications to still explore
and I think GitTables can be a useful resource to do so.
Nice. Yeah, I mean, here's a question for you: do you have an estimated deadline? Say we resume the pipeline today, do you know when you'll get them all? Like, is there a future date where it's like, I'm going to have it, assuming that you kind of are keeping pace with new data coming in, obviously? Is it like, okay, I don't know, 2025, September the 6th, that's the day?
Ah, okay. I think actually I'm a little bit delayed, because we actually aimed to have them already in 2023. I think we won't make it with the current rate limitations,
but I expect and I hope to have at least another version
of GitTables, a much larger one, in 2024,
probably near the end.
That's the release date to look out for then.
This discussion is actually a great motivator to, you know,
resume the pipeline and get back to it.
Get things going again. Awesome.
Cool. So I know this has obviously had a very big impact so far, and there are people using it to kind of improve the CSV parsers and things like that. So, I mean, kind of bigger picture, what more impact do you think your work can have? We touched on it a little bit, but also, kind of, how can people in their day-to-day working lives leverage the things you found and use GitTables?
Yeah, so I think actually, across the entire analysis pipeline, there are so many applications to explore, because many of these tasks, from, you know,
data exploration to data storage, all the way to data analysis, data visualization,
and so on, they all operate on tables.
And I think that, you know, so many tasks are part of this pipeline that can benefit
from learned models over tables. So that's something that I am really trying to push a little bit
to start exploring more applications across this pipeline.
Also organizing, by the way, a workshop on this at NeurIPS
on table representation learning.
And I think it's really interesting to see applications
such as question answering.
And I think that can also be very interesting to try out for Git tables.
But yeah, I think there's just huge potential in trying different applications.
And for example, data validation is another one that I'm really interested in. So for example, can we predict relevant data validation rules
from the contents, right?
So if we see different configurations
of data validation pipelines for given data sources,
then we might be able to infer reasonable rules
for new data sources.
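A toy illustration of that idea, inferring simple range and null-rate rules from observed column values (purely hypothetical, not something from the paper):

```python
def infer_rules(values):
    """Infer simple validation rules from a sample of column values."""
    non_null = [v for v in values if v is not None]
    # Rule 1: tolerate at most the observed fraction of missing values.
    rules = {"max_null_fraction": 1 - len(non_null) / len(values)}
    # Rule 2: for purely numeric columns, bound new values by the observed range.
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        rules["min"], rules["max"] = min(non_null), max(non_null)
    return rules

rules = infer_rules([3, 7, None, 5])
# -> {'max_null_fraction': 0.25, 'min': 3, 'max': 7}
```

A learned model over many tables could go further, proposing rules based on what similar columns in other data sources were validated against, which is the interesting part of the idea.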
So I think that's something that I'm interested in as well.
And I think actually that would be one of the examples
that have major impact in practice as well.
And I think actually, you know,
my research has been really driven by practice.
So what actually drove me back into my PhD
from being a data scientist is the feedback that I got from people in practice that were using semantic type detection models.
And I think there's just much potential given that the entire data landscape is dominated by tables, right?
So I think there are just, you know, so many practical applications possible if we use this data source.
Right, I mean, I think it's nice as well that you've been out there in the wild, in industry, and seen there's a need for it, and then said, okay, I'm going to come and address this. I think when you've got that sort of bigger-picture view, it makes the day-to-day grind of the PhD so much more, sort of, I don't know, tolerable in a way, because, like, you know, I'm working towards something that's going to be of use to people, right?
Absolutely, it's a great motivator. And it was really helpful to have done already a little bit of research before I started, to inform my proposal, basically, for my PhD research. I think that's been a really good decision there.
Yeah, awesome.
As a user, how do I go about using GitTables?
Where's it hosted?
How can I go and get the data, basically?
So we currently host this data set on Zenodo,
and they will make sure that this data persists over time. It's publicly accessible.
They have an API. I'm not sure how stable it is, but it should be very easy to get this data out
of Zenodo, actually. We publish it in subsets. So as I expressed, we have these topics that we use to query GitHub, and we publish the tables per query topic, let's say. You can find it, I'm not sure, so there is some code, but there is also a website, gittables.github.io. And from there you can basically find the paper, some analysis, some documentation, but also the links to the data set.
Awesome, we'll be sure to link that in the show notes as well, so the interested listener can go and find it and have a play around with it. Perfect.
Yeah, so on this journey you've been on with GitTables, what's probably the most interesting thing you've learned while working on it?
I think what was interesting to me
was the range of applications that we could serve.
So I really started with the intention or the purpose of using this
data set for semantic type detection for table understanding.
But along the way and during the rest of my
PhD, I figured that there are so many other applications that can be fueled by GitTables and models over GitTables. And I think that was one of the key lessons, right? So I got really much visibility into all the application potential. For example, the CSV parsing project, which I think really opened my eyes, right? So I think that was pretty cool.
Awesome. Yeah, I want some war stories off you now. So, kind of across this journey, what were the things that you tried that failed? What were the dead ends? What can you share to, I don't know, avoid people going down the same kind of wrong path, maybe? Yeah, what were the war stories?
Yeah, that's a great question. So I think this project lasted over seven to eight months in total. And during that first month, I think, or the first two weeks, I already discovered the great value of GitHub, after a week of thinking like, okay, where can we find relevant data?
And then I started exploring different data sources.
And then I found, okay, we can actually use GitHub.
It's like this pot of gold sitting there. But instead of starting to extract all of these tables, we first explored the direction of trying to replicate the semantics in these tables with data from Wikidata.
And, of course, it made sense back then: okay, let's synthesize tables that look like tables from GitHub but aren't.
It made sense back in the day, but now I think, okay, that was a completely bad idea.
And that actually took most of the time of this project, trying to figure out how we could replicate the tables that we found on GitHub.
One reason why we did this is that if you synthesize these tables, you have the
ground truth metadata.
So we could then use the kind of structure that we had in Wikidata, for example, to make sure that all the data we would put in as cell values actually resembled, or was associated with, the right types.
Now we just annotate the tables that we extracted from CSV files on GitHub, which is a bit more noisy, but that was why we actually took that other direction.
So of that seven-to-eight-month journey, how far through were you when you changed, basically? Did the lion's share of the time come before that?
Yeah, wow, I cannot really remember the exact time spent on that alternative direction, but I think a couple of months, probably two to three, maybe four months, that I spent on that
and then started extracting all these tables.
Yeah, it's always hard, right? When you've gone so far down, and you've spent so much time, you kind of just want to force that thing to work.
But sometimes you've just got to roll it back and say, okay, it's going in a different direction.
It's hard to do that sometimes. For sure. Cool, so obviously you do a lot of other things
other than just GitTables, so can you maybe tell the listener a little bit more about the other research you're working on and the other things you've got going on?
Yeah, absolutely. So I'm generally interested in learning from tables, and of course now I've
been focusing on table understanding.
So, for example, semantic type detection. One low-hanging-fruit project that we had there, actually also driven by feedback from industry, was: okay, how can we adapt these models to custom types?
For example, if we want to have semantic type detection in Power BI or Excel,
how can we allow users of these tools to add their custom types?
So that's something that I'm working on now.
But that's, like I said, low-hanging fruit, although very impactful in practice.
Another project is more analysis focused.
So in the meantime, we've seen quite a few pre-trained models over tables,
really representation learning for tables, for example for question answering.
And it is still largely unknown how these models actually work.
And I think that's generally the case with many of these representation learning models or generative AI.
And I think that's something that is worth exploring as well.
So that's what I'm working on as well.
And then going forward, I think I will first finish my PhD, hopefully this year.
And then I'm very keen on exploring more, you know, more applications of table representation learning.
Fantastic. Yeah, the explainable AI sort of stuff is fascinating, right?
It reminds me of a book I read a while back, Weapons of Math Destruction, which I guess is a little bit in that sort of area.
I don't know if you've ever read it; it's a really good read if you're interested, I'd recommend it. I will, yeah.
It's interesting; a lot of it tackles fairness and then explainable AI, working out what these black-box models are actually doing and being able to give a reason.
But yeah, it's cool. So you're going to finish the PhD, and what's next after that? Are you going to stick around in research, or go back to industry, or a hybrid
role? I don't know. What's the dream?
When I started the PhD, I thought, okay, I will become a research scientist in industry, but actually I will probably stay around in academia. So you will hear from me.
Fantastic.
That's great stuff.
Yeah, I think there's just such potential for representation learning, machine learning over tables, in this whole analysis pipeline, that it's too early to quit.
Yeah, fantastic.
Cool.
And I guess, kind of going on from this then.
So this next question, by the way, is my favorite question.
I love hearing people's answers to this question.
So it's kind of all about your creative process
and how you go about generating ideas
and then selecting what things to work on.
And then obviously maybe as well,
knowing when to pull back from an idea
like you did with obviously this project.
So yeah, tell me all about that.
How do you approach this? That's interesting.
Yeah, so I don't really have a structured approach to generating ideas; I just take time to think.
What motivates me in research is that I love to think and solve problems, to find the important questions to answer.
So I think I do have some kind of prioritization approach,
which comes down to, okay, what is impactful in practice?
What do people really need?
I think there's a societal aspect to that as well.
And that's also, you know, where I use the feedback that I get from people using the products that I build in practice.
I use that feedback to inform me about the most interesting or hardest challenges that they have.
So that's something that inspires me. And then I just take a lot of time to think about how to address a certain idea.
Of course, I started my PhD on the idea, okay, we need datasets, so I was very sure of that.
But then just taking the time to think really well about what the proper data source would be, and so on; I think that's worth thinking about very carefully.
Yeah, and like I said, I really like that: thinking about what impact can it have?
What problem am I going to solve for somebody?
And having that as a key cornerstone of your thinking.
Yeah, I really like that angle of it.
Cool.
That's great.
So, another answer to that question. I love it.
I've got a massive collection of them all.
It's great to hear how everyone's different; everyone has a different answer to that question.
Well,
yeah,
so it's time for the last word now.
So what's the one takeaway you want the listener to get from this podcast today?
I hope that people understand the potential impact of learning over tables.
Because databases and the whole data landscape are really dominated by tables,
and we should stop learning only over images, videos, maybe even plain text, and start learning over tables.
Fantastic, well, let's end it there. Thanks so much, Madelon, for coming; it's been a pleasure to talk to you. Thank you, Jack.
If the listener is interested to know more about Madelon's work, we'll put the links and everything in the show notes, so you can go and find those.
And again, if you do enjoy the show, please do consider supporting us through Buy Me A Coffee, and we'll see you all next time for some more awesome computer science research. Thank you.