Disseminate: The Computer Science Research Podcast - George Konstantinidis | Enabling Personal Consent in Databases | #14
Episode Date: December 5, 2022. Summary: Users have the right to consent to the use of their data, but current methods are limited to very coarse-grained expressions of consent, such as “opt-in/opt-out” choices for certain uses. In this episode, George talks about how he and his group identified the need for fine-grained consent management and how they formalized how to express and manage user consent and personal contracts of data usage in relational databases. Their approach enables data owners to express the intended data usage in formal specifications, called consent constraints, and enables a service provider that wants to honor these constraints to automatically do so by filtering query results that violate consent, rather than both sides relying on “terms of use” agreements written in natural language. He talks about the implementation of their framework in an open source RDBMS, and the evaluation against the most relevant privacy approach using the TPC-H benchmark and a real dataset of ICU data. [Summary adapted from George's VLDB paper] Links: VLDB paper, GitHub repo, Homepage, George's LinkedIn. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Wardby.
I'm delighted to say I'm joined today by George Konstantinidis, who will be talking about his VLDB22 paper, Enabling Personal Consent in Databases.
George is an assistant professor at the University of Southampton, and he's also a Turing Fellow with the Alan Turing Institute in London.
His research interests are databases, data integration, data sharing, and data knowledge graphs.
Welcome to the show, George.
Hi, Jack. Thanks for inviting me. Nice to be here.
Pleasure to have you. So let's dive straight in.
So can you tell us a little bit about your journey?
How did you get interested in researching databases?
And obviously, specifically, the topic of the show today, personal consent.
Yeah. So after I graduated from my undergraduate bachelor's degree at the University of Crete in Greece,
I started developing an interest in artificial intelligence and data management at the same time.
And I did my first master's there at the University of Crete
and worked for a bit at the Foundation for Research and Technology, Hellas.
That's called FORTH, the acronym. So during that time,
I think I started shifting more towards databases because I found it to be more principled.
Back then, AI was mostly working with the model of let's find a problem
and throw everything we have at it.
It was not so principled to my eyes.
Then later, you know, with all the deep learning and machine learning evolution, it became much more principled, and maybe not to its benefit, I'm not sure.
But in addition to the more principled fields
that I could see in databases,
I liked more the conferences, the community.
So gradually, eventually, I found myself working
in the field of data management.
Let's get into the meat of the show today.
So can you tell us what is collaborative privacy?
And it's a really important kind of topic in your paper.
So can you give the listener an overview of kind of what it is?
Yes, sure.
So this was born with the idea, I mean,
back around the time when GDPR was about to come out,
we started thinking about how we could support, technologically support, these ideas of GDPR, of protecting personal consent and personal privacy in a machine-processable way.
So collaborative privacy is the technology that allows collaborative
parties to automatically capture, update and implement a data privacy contract.
So when you have data sharing between entities, they usually agree on several terms
and collaborative privacy is there
to provide automation to that process
of agreement and enforcement implementation
of those terms.
It's a new concept that we're trying to push
and it's related to data privacy, but it's
not exactly data privacy because in the collaborative privacy concept, you don't have an adversary.
So the idea is that you are trying to enforce your privacy preferences in coordination with
the service provider, not hiding something, not encrypting something
against the service provider.
Okay, so it's like a different trust model that you have.
Okay.
So what's an example?
Can you give us an example of how someone
in their day-to-day life would be involved
in such an agreement?
What's a typical example?
So you have typical examples from, I mean, it's everywhere in all our interactions with services
on the web today. So even on social media, you commit your data to a social media provider
and you trust them. You have agreed with them on the use of your data
using these terms and conditions documents,
maybe doing some opt-in, opt-out choices, right?
And then that's your contract.
But you completely trust them with your data, right?
Another very simple example is when you go to buy something
from an e-commerce website and you put your email address.
You don't encrypt it.
You don't secure it there, right?
But you trust the website to use it
per your agreed preferences, right?
To use it according to your agreed preferences.
So use it to send you update emails about your purchase,
but not advertisement, let's say.
So that's on one end, on the consumer end.
But you can go all the way to businesses
merging their databases or federated
learning of different datasets where you
have privacy concerns between those datasets. Or when
you have a company buying another company
and they want to merge the data.
Or in several other data consortia, let's say,
where you have a data sharing scenario,
you have these agreements that are put in place
in order for some preferences to be respected.
It's there that we envision technology playing a major role in the future.
So what are the problems with the current way
that collaborative privacy is implemented?
I mean, I know the one thing from my experience is the terms and conditions.
I mean, I very rarely read them, right?
And I saw, I remember seeing this art show once
where someone had printed off basically all the terms and conditions
from the popular social media companies at the time
and they were so long.
There's no way a normal person would read them.
I'm sure if you added them all up, there wouldn't be enough years in your life to read them all, given the amount you agree to these days.
But anyway, yeah.
Exactly.
So this is one aspect of the problem.
The fact that no one reads those terms and conditions.
They are written in natural, in legal language, right?
They're hard to read and they are much more hard to enforce.
So usually these terms are written in a top-down way.
So you don't put the terms in the contract.
The service provider does, right?
So they are top-down, they are imposed on you.
Usually they are an accept all or nothing kind of agreement.
So you want to accept the terms, you get the service.
You don't want to accept the terms, you don't get the service. There is no fine-grained say for you in those terms, right?
So these terms are very coarse-grained.
The only amount of automation that happens
is very coarse-grained.
It's predefined opt-in or opt-out options
for a particular set of scenarios, right?
At the same time,
this is a problem for the service provider as well
because they do this agreement with you
and then they have to hire an in-house engineer
and tell them to implement into code
whatever they have agreed with Jack or George or whoever.
And because this implementation is ad hoc for an agreement, you cannot have these agreements varying a lot between users right now, because you have to implement each of them. For each different agreement, you would have to implement something different, right?
So the idea is that if you do this somehow automatically
or semi-automatically,
you could give more power to the user to
have a say over their personal preferences in a more bottom-up way.
So the user will co-construct the contract, which will be in a machine-processable way.
Of course, to go there, it's not immediate because you have to have users that are data literate, right?
Or you have to have agents that act on behalf of users.
They have some defaults and they put some kind of constraints in the system in a machine processable form.
And then this is the contract and the contract gets automatically respected, automatically implemented.
So this is the idea and we have done some initial work on this.
Cool. So just to kind of, I guess, jump back: you mentioned earlier on,
you kind of said how it's different from data privacy, but are there any techniques that exist
in the sort of data privacy space that could be applied to help address this? Or is it just
totally kind of not relevant?
No, it's a very good question.
And that's the first thing that we looked into.
Okay, first we said that collaborative privacy
is not substituting data privacy.
It comes after.
It's complementary, right?
So data privacy comes before you encrypt,
you secure whatever you want to protect.
Then you have to commit some data.
You have to give some data,
but then you still have privacy preferences
or concerns, right?
So in that sense, it's not data privacy, but at the same time,
we could look into particular data privacy techniques.
There is, for example, one technique that we originally started from,
which is called controlled query evaluation,
where you have a query to be executed against a database,
but you want to do this in a controlled way,
in a way that respects some requirements
written down in a machine-processable way
or some specifications, some data privacy concerns.
But this is very rigid in data privacy.
You are afraid of revealing too much information, so you are very, very strict. In our setting, you trust the other party, so you can reveal more information. So, deep down, in the first instance, the technology that we started developing is related to data privacy technologies, but with a different spin and different semantics.
Okay, cool. So I guess let's dig into your solution. So this is called consent constraints. Can you tell us what these are and how they work?
So I spoke about the vision that we have, right? But as a first step, we wanted to start from software and keep it simple. And we said, okay, let's start from relational databases and see what we can do within relational databases. How can we capture some kind of consent, or preferences, on the processing that can happen inside the relational database?
What kind of processing can happen?
What is the most common processing in relational databases?
It is query answering.
So, okay, let's try to encode some constraints
that will impose some restrictions on query answering.
Again, looking into data privacy,
there has been some work that deals with
what is called denial constraints.
Denial constraints are queries, are negative queries,
are queries that you don't want to be answered.
So at first we imagined the setting where by default, the user would allow queries to be answered. So at first we imagine the setting where by default the user would allow queries to
be answered unless they explicitly want some kind of join, some kind of projection or some kind of
selection not taking place. In that case they would write a negative query that in the face of
a query from a service provider or some kind of processing from the
service provider will affect how the service provider's query will get answered. So this is
the idea. Again, it's not completely protecting against never finding out the answers of the
negative query. It's more like explicitly protecting against certain operations happening on the data at the time of the query answering.
And then trusting the service provider that for a future query, they will go through the system again,
rather than try, let's say, to infer something that they shouldn't.
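(To make this a little more concrete, here is a minimal sketch of what such a consent constraint could look like when written as a negative query. The schema, names and constants are purely hypothetical; the paper itself writes constraints in a Datalog-style notation rather than SQL.)

```sql
-- Hypothetical schema: Patients(patient_id, name, diagnosis).
-- A consent constraint is a negative query: any answer that matches it must
-- never be returned. Here, patient 42 does not consent to their name being
-- returned together with their diagnosis.
SELECT p.name, p.diagnosis
FROM Patients p
WHERE p.patient_id = 42;

-- A service provider's query such as
--   SELECT name, diagnosis FROM Patients;
-- would then be answered with patient 42's row filtered out,
-- while every other row is returned as usual.
```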
Okay, so when you said that it's not completely protected: it's basically preventing specific operations from happening, but you could, in some other roundabout way, deduce something that the constraints don't explicitly cover?
So the way classic data privacy works is
it is afraid of collusions.
It is afraid to answer query A and then query B in a privacy-compliant way, but then somehow, through the answers of query A and query B, the adversary can combine those answers and find something out which is not explicitly violated by query A or query B in isolation.
In our setting, we're not afraid, in some sense, of this collusion happening.
Unless something is explicitly violated, we return the answers.
So we give more answers, essentially, to the service provider to play with,
at their disposal.
And we trust the service provider that when they want to do some combined processing,
they again will go through the system and explicitly filter. And so they will not try to do collusion and obtain something that we don't
want them to obtain. Again, this is not about protecting, it's about enabling the service
provider to do the most processing that they want to do in a consent abiding way.
Okay, great. So you've obviously taken this idea then,
and you've designed an algorithm,
a system that will allow a service provider to go and honour these consent constraints.
Can you tell us more about how the algorithm actually works
and how the system works, how that kind of was all done?
So there are two algorithms.
Initially, we had to try to find semantics for query answering.
So we wrote queries and constraints on the blackboard and we scratched our heads and said,
okay, what does this mean?
What answers do I want here?
And slowly we ended up in a semantics definition
which is based on provenance.
So the idea is that in an imaginary world,
you tag every tuple and every cell of your data,
you annotate it with some label,
and then both the consent constraints,
these negative queries,
and the service provider's query,
they both get answered on these annotated databases.
And then the answers carry with them some annotations,
some provenance along the way that describes how these answers got obtained, right?
These answer tuples.
So you do that both for the query, the input query,
and for your, let's say,
consent contract, and you try to do some kind of difference there. So you give back to the
query issuer only those answers that are not labeled with, let's say, data that you don't
want to be given back. But these labels are more complex than traditional labels in the sense that you can now annotate
joints rather than just simple cells.
With some mechanisms that we have, you can allow, let's say, two labels in isolation to be given to the service provider. Let's say my disease table, right? Or the rows of the disease table that belong to me, right? But then when you join the disease table, let's say with the insurance table, it's then that I don't want my information to be used.
So I can describe things like that.
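(As a rough sketch of that kind of join-level constraint; tables, columns and the constant are invented for illustration, and the real constraints are written in a Datalog-style notation.)

```sql
-- Hypothetical schema: Disease(patient_id, disease), Insurance(patient_id, premium).
-- Each table may be used on its own for my rows, but I do not consent to my
-- rows being joined across the two tables. The constraint is again a negative
-- query: any answer derived from this join for patient 42 is filtered out.
SELECT d.disease, i.premium
FROM Disease d
JOIN Insurance i ON i.patient_id = d.patient_id
WHERE d.patient_id = 42;
```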
That was the first algorithm.
And that was mostly to give theoretical foundation to the work.
And then based on that algorithm, we proved some complexity results.
We proved some formal connections to data privacy.
And then we went on and we devised a second algorithm,
which does not need to touch the data.
It's data agnostic in a sense.
So you have your consent contract
and you have your input query.
And then what we do is query rewriting.
We rewrite the input query into a new query
that no matter the database,
when executed will abide by the consent contract.
And of course, this query depends
on how large your consent contract is,
how many consent constraints you have.
It could be a very large query.
So it's not 100% clear that this is the best approach to go about,
although our experiments show that this is a better approach
than the provenance-based mechanism.
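(To give a flavour of what such a rewriting can look like: a hand-written sketch, not the actual output of the framework, using the hypothetical Patients example from earlier.)

```sql
-- Provider's original query:
--   SELECT name, diagnosis FROM Patients;
-- Consent constraint (illustrative): do not return name together with
-- diagnosis for patient 42. One possible consent-abiding rewriting simply
-- filters out the answers covered by the constraint:
SELECT p.name, p.diagnosis
FROM Patients p
WHERE NOT EXISTS (
    SELECT 1
    FROM Patients c
    WHERE c.patient_id = 42            -- the data the constraint talks about
      AND c.patient_id = p.patient_id  -- ...when it contributed to this answer
);
-- No matter what the database contains, executing the rewritten query
-- returns only consent-abiding answers.
```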
Yeah, I was going to say, because the provenance-based mechanism sounds like it's quite invasive, right? It sounds like it could have quite a high cost, kind of bloat in the storage layer, maybe, if you're annotating everything.
But yeah, I guess then the trade-off of the query rewriting
is that there's a cost associated with that as well.
Exactly. It depends how large your query rewriting becomes.
Yeah, there's a trade-off space there, I guess.
And there is still work there to investigate what is the best approach for different scenarios. So we are currently in the process of that.
Amazing. So what were the challenges you had to overcome in this sort of process of
starting off with algorithm one and going on to algorithm two?
Okay, so first I would like to point out that one of the major challenges in this line of work is cultural, a cultural obstacle. So initially, when we were trying to publish this work, the comments that we were getting back were: how can you be sure that you will enforce the contract, right? So there is still this cultural change that needs to happen
to understand that today we do give data to service providers
with no mechanistic, no algorithmic guarantee
of privacy enforcement, right?
The privacy enforcement that you now trust is all legal or extra algorithmic, right?
So that was the first obstacle that we had. Of course, when something like this happens, you always come back and you say,
I should have described the work better.
I should have described the motivation better.
But still, I see this when talking to people, it's a hard-coded way of thinking that we have about privacy that does not allow us to easily switch to this new model.
And we always go back to think, how am I going to enforce this?
How am I going to enforce that the service provider is not going to violate this? And the answer is,
you're doing that for the service provider as well, because the problem starts with,
they don't want to violate your preferences. They have business incentives not to violate
your preferences. You remember how much, let's say, a big social media company changed policies
after certain scandals, like the Cambridge Analytica scandal.
They have business incentives to do that.
They have legal incentives to do that.
And because they don't want to, they attract more customers, if you will, if they are more transparent, right?
So they want to have this automated means to not violate your preferences, right? So they want to find this, they want to have this automated means to not violate your preferences, right? So this was the cultural challenge that we had.
Of course, then we had technical challenges, right? So technical challenges is, again,
it had to do with: what does a consent constraint mean exactly? How do we encode these kinds of joins? Why queries?
This seems unnatural. Why?
I mean, you have some privacy preferences.
Can you encode all your privacy
preferences in queries?
No, you cannot, right? Unless you have a
table for anything, for purposes.
You must have a table that describes purposes, right?
Otherwise, you cannot.
And by the way, previous data privacy work
tries to do that by encoding
all the language of purposes inside the database itself, right? So you cannot, but you have to start from somewhere. And we asked ourselves, okay, so let's start from selections, projections and joins. These are the main operations that you want to talk about in your contract, right? And the other challenge that we had is, okay, the average user is not really familiar
with selection projection and joins
to the extent that they can play with them
and write consent contracts.
So there is more research to happen there
on the automation of these preferences.
How do you go from a friendly UI
to these consent constraints? And that's another
challenge that we are still working on. And of course, other challenges for the particular paper
was what to compare against, because we don't have another approach which is similar. We have
terms and conditions, or we have classic data privacy approach. So we did a mix of comparing against both,
mostly against the most relevant data privacy approaches.
And the last challenge had to do with obtaining data
and to run our experiments because this is a new idea.
There are no consent constraints documented.
How are you going to do this?
And again, we created data generators
and consent generators and we did experiments on synthetic data but we also obtained real data,
anonymized data from clinical trials and from patients in ICU units and wrote some
constraints of our own on top of this real data.
Can we maybe just dig in a little bit to how you actually took the algorithms and whatnot and how you implemented these in some framework? Can you maybe tell us more about the framework that you used to then evaluate your approach?
Yes.
Okay.
So first, with respect to the classic data privacy, there is a work, a series of works, actually, known as Hippocratic databases.
And the privacy ideas that they had implemented there were reminiscent of what we were trying to do here. So they had some opt-in, opt-out choices, but of course, from a classic data privacy perspective, when you opted out of sharing a particular row, you could not even use that in joins anywhere. It was blanked out, essentially. And in our case, we implemented
an opt-in, opt-out approach using our consent constraints, which are much more powerful.
But just to compare,
we use them only as hiding projections.
That is opt-ins or opt-outs, essentially, right?
Projecting out columns
from particular rows of your queries.
But in our semantics,
you can even still do joins with hidden attributes
as long as you don't want these attributes to be returned.
So it returns more data.
So when we compared, we compared how fast we are, but also how much more data we return against this other data privacy approach.
So this is the connection, the technical connection, let's say, to this other approach that is out there.
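(To illustrate that difference with a small hypothetical example; these are not the benchmark policies, and the names are invented.)

```sql
-- Consent constraint: do not return my email address
-- (an "opt-out" of projecting the email column for customer 7).
SELECT c.email
FROM Customers c
WHERE c.customer_id = 7;

-- A classic opt-out would blank the value out, so it could not be used at all.
-- Under the consent-constraint semantics, a provider query may still join on
-- the hidden attribute, as long as the email itself is not in the answer:
SELECT c.name, m.campaign
FROM Customers c
JOIN Mailing m ON m.email = c.email   -- email used only as a join key
WHERE c.customer_id = 7;              -- returns (name, campaign), not the email
```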
So in terms of the framework, right now it's a prototypical implementation.
In the input, we have these consent constraints.
In terms of notation, we use the Datalog notation to write rules. And
we have implemented everything in Java. You get the input
query in SQL
and you get these constraints and then you
create a SQL rewriting in the output
of your Java program that is
good for execution on top of your
databases, on your database, and we
executed that in Postgres,
in the Postgres database. So this framework right now, once you know what you want to do with it, it's ready to go. It's open source; the GitHub link is in our paper. But of course, you would need all the technology around it to use it right now. So you would need to go from the preferences to the actual consent constraints.
We are doing active research on that.
And you would need to encode that inside your system somehow.
But the framework can be run autonomously.
We mostly support the query rewriting approach.
The provenance-based approach was mostly implemented for comparison
and is not, let's say, production-ready in some sense.
So we do have that in the GitHub as well,
but it's probably more buggy than the query rewriting approach.
What was your approach to evaluating the framework? What were the questions you were trying to answer, and how did that look?
The good thing with this technology, and with the vision that we have for this technology (I can go into that more later, I mean what comes in the future), is that you can encode both personal constraints.
So bottom-up, let's say, creating a contract of data sharing bottom-up
or data processing bottom-up, but also top-down.
So you can encode also institutional policies, right?
So you can encode things that talk about large portions of your data
with these queries.
Simply, you know, an atomic query that mentions the table Employees, and let's say forbids the projection of the first attribute, is talking about the entire set of your employees, right?
So this is like an institutional policy.
Do not do this at the institutional level.
But you can also fix your personal ID in your query. So these negative queries, which are the way you talk about your data, will then only be particular to your data, the tuples that contain this particular ID, because you fixed it in the query.
So you can go all the way from institutional policies
to personal constraints.
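(As a small sketch of that contrast, again with invented table and attribute names.)

```sql
-- Institutional-level policy: forbid projecting a particular attribute
-- (say, salary) of the Employees table, for every employee.
SELECT e.salary
FROM Employees e;

-- Personal constraint: the same shape of negative query, but with my own ID
-- fixed as a constant, so it only talks about the rows that belong to me.
SELECT e.salary
FROM Employees e
WHERE e.employee_id = 42;
```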
And of course, a natural question is,
how do you perform?
What can be supported?
What are the limits here, right?
Interestingly, we found out that personal consent constraints are more easily supported at a bigger scale, right? We can scale to thousands, tens of thousands of personal consent constraints in our experiments, more easily than we can scale these large institutional policies, right?
So, interestingly, we found that
a major factor that affects query performance here is how much data does your negative consent constraint touch?
How much data does the negative consent constraint talk about?
If it talks about a little bit of data, even if you have hundreds or thousands of them, so individual user policies, normally you would expect the data for an individual user within a database to be small, to be localized, right?
So in that scenario, you perform fast.
Essentially, there were cases where the consent-abiding query execution was as fast as the original query executed with no consent enforced.
So the consent overhead is not much.
Worst case, we found that it is linear in our experiments.
In theory, it could be worse because the problem we proved is an NP-hard problem.
So in theory, it could be worse.
But in practical settings, in all our generated settings, in the ICU, the real data that we used, the overhead of enforcing your privacy
is kind of linear. When you go to global policies, this changes radically depending on what kind of
policy you want to enforce. You have policies that are very complex and they slow
down the query very much. And you have policies that are easier to enforce and they're not so
complex. So regarding this interplay, and how many bottom-up versus top-down contract clauses or requirements you can enforce, we're still looking at this and investigating this interplay.
Can you tell us a little more about the experimental setup? You mentioned earlier on how you got data from some ICU, and I know you used TPC-H as well, so can you tell us a little bit more about the experiments themselves?
So for the experiments we used the TPC-H benchmark. It comes with a set of queries and it comes with a number of different scales; you can scale from a few rows in your tables up to millions, let's say, of tuples in your tables.
And we did all this range of data generation.
But of course, it doesn't come with constraints,
with consent constraints.
So we looked at the queries that it comes with
and we tried to change them to make them more like,
to generate, let's say, more like constraints.
One thing that we did is we looked back in this Hippocratic databases world that had
policies and data privacy policies that had to do with patients.
And we tried to mimic those constraints, which again were hardcore, strict data privacy constraints, but still they were useful to us. We tried to mimic them on top of our TPC-H benchmark, which means if there was a constraint that talked about a patient,
we transformed it to a constraint that talks about a customer. So a customer does not
want to share their address, similarly to a patient does not want to share their disease
or something like that. So this was a kind of inspiration that we took to create realistic
constraints on top of the TPC-H benchmark. And then, using this, we created hundreds of constraints or
thousands of constraints by simply talking about more and more customers. And then we created
constraints, more complex constraints, by joining the customer relation with the orders relation and putting some constraints on the orders. And so we really drilled down into the TPC-H benchmark in a clever way.
We didn't simply try to brute force create constraints that do not have any sense.
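(For a flavour of what such generated constraints might look like: an illustrative reconstruction over the standard TPC-H schema, not the exact constraints used in the experiments.)

```sql
-- "A customer does not want to share their address", over TPC-H CUSTOMER
-- (the key is an arbitrary example value):
SELECT c.c_address
FROM customer c
WHERE c.c_custkey = 12345;

-- A more complex constraint joining CUSTOMER with ORDERS and restricting the
-- orders involved, e.g. do not expose this customer's address together with
-- their high-value orders:
SELECT c.c_address, o.o_totalprice
FROM customer c
JOIN orders o ON o.o_custkey = c.c_custkey
WHERE c.c_custkey = 12345
  AND o.o_totalprice > 100000;
```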
Similarly, we did with the ICU dataset.
Again, looking inside the dataset, we found a way to automatically generate constraints, because we needed large numbers of them, that have to do with real concerns one might have: the length of the stay in the ICU, the particular treatment they got while they stayed there, the disease, the diagnosis, and so on.
So this was the setting, and we scaled to different numbers of constraints.
We did experiments with constraints that touch a little bit of data,
lots of data, with small databases, large databases.
Awesome. So I guess, what were the key results then?
So if you were to kind of summarize them and put some sort of numbers to them. I know you said before that the smaller the amount of data the negative query touches, the faster you are. But what are the other key results from your experiments?
So the key result is that we are on par with or even faster than data privacy approaches, while giving back on average 30% more answers
to the service provider,
because again, of this premise that the answers
are not explicitly sensitive against the query,
and therefore we're not afraid to give those answers back,
in contrast to data privacy approaches.
The slowdown of the queries was linear, as I said,
and we found out that compared to classic data privacy approaches
like controlled query evaluation, there is a version of our framework
that can also be used as that.
We did that both theoretically, by proving it, and we did an experiment as well where we used our framework as a strict data privacy framework. But for us, this can only happen with particular kinds of constraints. When you restrict yourself to, let's say, what we call Boolean constraints, constraints that do not talk about projections, then the
enforcement of these constraints will be privacy abiding as well, in some sense.
Okay, cool. So are there any situations where your approach's performance is suboptimal? I guess what I'm trying to get at here is: what are the limitations of this?
Yeah, I think something I also forgot to mention earlier is that we did some experiments with aggregation as well.
But for this, there is much more research to be done because
when the user's, the service provider's, queries have aggregation, what we did is we took the aggregation out of the query, we executed the query in a consent-abiding way, and we put it back.
That's one semantics
to go about it. But is it the correct one?
Is it what you want?
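(As a rough sketch of that one possible semantics, with hypothetical tables and constraint, continuing the earlier Patients example.)

```sql
-- Provider's original aggregate query (illustrative):
--   SELECT COUNT(*) FROM Patients WHERE diagnosis = 'flu';
-- The aggregation is lifted out, the inner query is answered in a
-- consent-abiding way (here excluding patient 42, who withheld consent),
-- and the aggregate is reapplied on top of the filtered result:
SELECT COUNT(*)
FROM (
    SELECT p.patient_id
    FROM Patients p
    WHERE p.diagnosis = 'flu'
      AND p.patient_id <> 42   -- answers covered by the consent constraint removed
) AS consent_abiding;
```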
And more interesting
is how you evolve your consent language to contain aggregations, and this is not clear at all. We are currently looking into that, but we did some initial experiments, with the disclaimer that the results might not be consistent because there is no developed semantics right now. But these are the numbers; this is how it would look.
The limitations in terms of experiments are those that I mentioned. In terms of performance, the limitations are that when you use constraints that touch lots of data, you don't scale as much; the slowdown becomes large. However, it depends on the application how much slowdown you can actually tolerate. Because if you're doing this overnight, you might be able to do even large policies. If you're doing this every second, it's a different story. So it really depends on the application what the actual limitations are there.
Other limitations of the work in general are, again, as I mentioned, how to capture the users' preferences and make them into these consent constraints, and we are actively working on that. But maybe the major limitation, which I don't see as a limitation but as an opportunity to invest in and look into,
is the consent language right now.
The consent language, yes, it's more expressive
than what you could do with opt-in, opt-out,
and we also experimentally verified that
by giving back more answers.
But it's still a language that has selections,
projections, and joins.
That's it. Those are the queries that we can write as constraints, conjunctive queries, essentially.
What happens when you extend this language
and how you can extend this language?
How do you talk about purposes?
How do you talk about other kinds of processing
on top of relational data?
Queries are not the only kind of processing that can happen, right?
Learning, let's say, analytics,
other kinds of processing that can happen
on top of your data.
So currently you cannot do this with our initial work,
but this is one major direction for us
for future research.
Cool.
So you mentioned earlier on that the framework
is available on GitHub,
but, and maybe there's some ongoing work in this area, how easy would it be for a service provider to take what's there at the minute, even though it's in a kind of prototype state, and integrate that into their own system?
Yes, yes. Again, I know there are approaches in industry; for example, I think Snowflake and Amazon have approaches that are called data clean rooms now
that try to do some kind of data privacy enforcement
in the face of queries and privacy preferences.
They do some kind of cleaning of the query answer or something like that.
I'm not sure what's happening, but I know the work is relevant. So for
something like that, you could take the framework with appropriate
licenses. That is, it's open source, but we do want the framework, and extensions to it, to remain open source. So you can take the framework, and as long as you have some kind of preferences, you can always translate them into this kind of query language that we use.
You can simply take the framework and run it.
You take the input query,
you put it in a middleware between the user's query and the
database, and you just run it.
Now, the question is, how do you obtain the constraint?
I want to claim that even for industry right now or for large entities, to use our framework
is easier than what they are doing now. Because they would still need to
hire somebody who would
look at the user preferences, even the terms
and conditions that are being
written right now, and translate those
to
these negative conjunctive queries that we use, these negative constraints. Some of those, at least.
Of course, you cannot
substitute the entire agreement,
arbitrarily legal sentences, legal clauses, right? But when it
comes to query processing preferences, you can encode some
of those, even manually, to our constraints. Then instead of
going all the way from the contract to implementation code,
you go from the contract to a declarative language and then you execute that
in an automatic way. So I hope this will find applications, but of course we are not staying quiet; we are still working on aspects of this. Actually, I've recently been awarded two Horizon Europe projects.
One is RAISE.
It's called RAISE.
The other one is called Upcast.
RAISE has already started.
RAISE stands for Research Analysis and Identifier System.
And the RAISE project is about moving, instead of moving data to algorithms, moving algorithms to data.
And do this in a privacy-preserving way.
So we are doing the privacy aspect, let's say the privacy-respecting, collaborative privacy aspect, of the project.
And the Upcast is about to start in January.
Again, it's a very large consortium, a very large project,
where it's mostly dealing with data marketplaces.
And these ideas are central to that project.
How do you negotiate contracts in data sharing
and how do you put them in place using an automated,
let's say, algorithmic way?
So this is slowly getting more and more traction and it enables us to extend this in many more directions in the future.
So what's been the most interesting and maybe unexpected lesson that you have learned while working on this project and on personal consent?
So the most interesting lesson is that we can always reuse and always connect
different areas in ways, in new ways that we didn't imagine before, right? And this is not so
much about the work that I have done in this paper.
This is about additional work that we are doing,
where we are actually reusing ideas from data integration and knowledge graphs
to support this kind of automated contracts.
And then suddenly all these technologies become relevant again,
which were hot in the early 2000s.
Let's say query answering using views
or other kinds of problems.
They were very relevant in the early 2000s,
then kind of died out a little bit.
Also AI techniques that are put aside now
for the sake of deep learning, right,
of approximate things.
But right now you realize that
you cannot approximate the enforcement of a contract; you have to do it in a strict way. So again, we are going back to using some strict knowledge representation and reasoning techniques, right, to do some discrete, some exact reasoning,
because you want to do some kind of exact reasoning
on top of these contracts and say what is allowed,
what is not allowed.
And you cannot simply learn this.
Of course, machine learning and deep learning
still has its place, especially in the front end of this work,
because you have to learn the preferences of the user.
But as for how to enforce them, you cannot enforce them in an approximate way.
So one of the interesting things that comes out of this is that I can see old things becoming new again, and I can see that I can reuse and I can connect dots that I didn't
know that were there to be connected.
In terms of culture, there are some surprises because people always discuss when
new research happens that a cultural change is difficult, but it's always interesting to see this
from the front row, let's say, and trying to advocate for this and trying to see that,
you know, a cultural change is really happening here. In terms of results, of technical results,
they were not really surprises
because we really believed
that this would be viable and feasible.
So there were very positive results that came out,
but I wouldn't call them surprises per se.
Yeah, it's very interesting
how things often come full circle, right?
Exactly. I think in computer science, that's a characteristic of our field,
that it works like a pendulum.
Things go from one end to the other and back.
Let's see, for example, what's happening three or four times already
with distributed against centralized computing.
So I guess the next question, because I tend to ask this in most of my interviews, is: research obviously is non-linear, right?
It's a bumpy process.
So from the conception of the idea for this to the actual final publication,
what were the things that you tried on the way that maybe the listeners
could benefit from knowing about?
The way I worked here is that I didn't abandon the idea when it was first rejected, let's say,
and I put more into it.
So I gambled in some sense.
So I hired more people and started working more into it.
So I prepared papers for down the line and extensions to the system. We already
have a very cool extension to the system that is not published. It's about to be published,
which is the following. Consider some data processing that happens. And with these consent
constraints, you answer a query and you get back a data output, let's say, the answer of your query.
And this goes to further live on within the data ecosystem.
We can automatically generate consent constraints, so a new contract for the query output based on the old contract.
We call this consent propagation.
So you have a consent contract, you do some kind of processing, you generate an output, and you automatically get a new contract for that output. This was one thing, for example, that we did while the paper got rejected.
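(Purely as a hypothetical illustration of the intuition; this is not the propagation algorithm from the upcoming work, just a sketch with invented names.)

```sql
-- Original consent constraint on the base data: do not return
-- patient 42's diagnosis.
SELECT p.diagnosis
FROM Patients p
WHERE p.patient_id = 42;

-- Suppose a permitted query materialises an output that lives on downstream:
--   CREATE TABLE Cohort AS
--   SELECT patient_id, age, diagnosis FROM Patients WHERE age > 65;
-- Consent propagation would automatically derive a corresponding constraint
-- on that output, so the contract follows the derived data:
SELECT c.diagnosis
FROM Cohort c
WHERE c.patient_id = 42;
```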
And instead of, I mean, we still
kept on working on the paper.
The paper got rejected for
cultural reasons. How are you going to
enforce this again? We
didn't get disappointed
and we
continued working on this, extending
this. So the way I work is I like brainstorming, using the whiteboard, with the students, with the postdocs, with colleagues, right, and then taking some homework for everyone, then trying this out and coming back and brainstorming again. It's very, very important to communicate with other people.
And I should say when you put extra authors in your papers,
the work gets multiplied, right?
Even at an exponential rate: two people will not just do twice the work, they will do more than twice the work.
So collaboration is essential. Talking to people about your ideas is essential. Being generous with people's participation and attribution of contributions, let's say, is important. This way, I think, you generate robust work that can stay and generate more potential.
So we've touched on it at many points across the course of the chat,
but what's next for this line of work?
I guess we could maybe just summarise the ones that you've mentioned so far.
If there's any additional ones, then...
No, I think the main things that we want to work on going forward are, yes, the front end: how do you encode the consent constraints? Then consent propagation: how does the consent, or this privacy preferences contract, get generated and live in the data ecosystem? But most importantly, investigating richer languages that stand between the user and the service provider, the data owner and the service provider, richer languages to describe privacy preferences and consent.
In this problem, I see knowledge graphs playing a big role.
There are already W3C vocabularies being developed
for GDPR and privacy and consent,
and we can reuse and extend those.
And moreover, these data integration ideas where you have a vocabulary in the middle
and then you have different sources mapping to those vocabularies,
and then you have algorithms in data integration of how you use those settings, right?
This can be reused here because the vocabulary could be like a global vocabulary of contracts
or your global contract, let's say.
And then you have new purposes that come along
in the future, right?
And express themselves as parts of the original vocabulary.
And in that way, you can give consent for future processing without having even thought about it. Or you can withdraw consent.
Let's say you don't want face recognition to be done on your pictures if you look too old.
This is an arbitrary constraint.
You might have it.
It's your data.
I'm not the one to judge about your privacy preferences on your data. And then you have a new purpose in the future, which is, let's say, emotion recognition.
But emotion recognition is a type of face recognition.
So if you have done a negative constraint on face recognition, it is implied, it is reasoned automatically by the system that unless the system comes back to you and you allow it,
by default, through reasoning,
you can see that you disallow emotion recognition as well.
So we are working on all these kinds of rich languages in between
that allow for reasoning, that allow for richer purposes,
that allow for withdrawal of consent, which is very important.
What happens in the future if I want to withdraw my data
from a particular service provider, right?
That has already obtained my data, my data set,
and run away with it, right?
If we all point to some centralized knowledge graph,
I might be able to express that and propagate that
all the way down to everybody who's got a hold of my data, even if I don't know who.
Nice. I just wanted to pull on that thread a little bit more, of how you approach idea generation and your whole process. You like to go brainstorming and kind of collaborate with students, throw some ideas around and then kind of iterate like that.
How do you then go about selecting which projects or which ideas to pursue?
Can you tell me a little bit more about your process there?
Okay, so I'm not doing this in a very strategic way in the sense that,
oh, this idea is going to get me lots of papers.
This is important as well, but I want to feel that the idea itself is important
and will provide contributions to
our field, to science and to
society in general. And like this idea, it can be as small as looking at it within relational databases,
or it could be as large as expressing your privacy preferences
over your personal data on the web, in the wild, right?
Which is like a contribution to society at the end of the day.
So I like to look at ideas that have potential for greater good,
but also, of course, be relevant in what we are living today.
Right. Then once I get an idea and I think it's cool, I usually go after it. Sometimes this doesn't pan out; some ideas are better than others, right? But once I get an idea in my head and I think it's cool, I talk to my students and my postdocs, or the idea might come from them sometimes, right?
And we brainstorm, we get excited, we spend hours in front of the whiteboard, right?
Sometimes I still work like a student, in the sense that I don't go home for dinner. I just stay and, you know, order a takeaway, and let's stay and work on this.
This is so exciting.
The important thing for me is to have fun.
If you have fun doing that, then you should keep on doing that, right?
If you are working on an idea and you hate it, but it's hot and you believe that you should work on it, don't do it. It usually does not produce. So yes, have fun. At the end of the day, have fun.
What do you think is the biggest challenge in this research area now?
My answer will be very short, in that it's a cultural challenge. As soon as people get the idea of collaborative privacy, I think they are going to jump on it, and I'm happy with that. If the listeners like it, go and take it and play with it and produce things, right? I think as soon as we realize, oh, it's 2022, almost 2023, and we've been writing contracts in natural language, terms and conditions, these huge ugly documents in natural language, instead of having some natural language default and then doing so much automation afterwards; we can do so many things automatically in between. So this is a cultural change that needs to happen. Once this idea starts spreading, I think this will become a very hot area.
What's the one key thing you want the listeners to take away from this show, this episode?
Have fun. I think the one key thing is: have fun. When you have fun, you produce nice, good research.
Be meticulous, of course, and do things properly.
Have integrity, but also have fun.
That's my takeaway message.
That's a brilliant message.
We'll end it on that.
Thanks so much, George.
If you're interested in knowing more about George's work, we'll put links to all of the things we mentioned across the show in the show notes.
And we will see you next time
for some more awesome computer science research.
Thank you, Jack.
Thank you.