Disseminate: The Computer Science Research Podcast - George Konstantinidis | Enabling Personal Consent in Databases | #14

Episode Date: December 5, 2022

Summary: Users have the right to consent to the use of their data, but current methods are limited to very coarse-grained expressions of consent, such as “opt-in/opt-out” choices for certain uses. In this episode, George talks about how he and his group identified the need for fine-grained consent management and how they formalized how to express and manage user consent and personal contracts of data usage in relational databases. Their approach enables data owners to express the intended data usage in formal specifications, called consent constraints, and enables a service provider that wants to honor these constraints to automatically do so by filtering query results that violate consent, rather than both sides relying on “terms of use” agreements written in natural language. He talks about the implementation of their framework in an open source RDBMS, and the evaluation against the most relevant privacy approach using the TPC-H benchmark and a real dataset of ICU data. [Summary adapted from George's VLDB paper]
Links: VLDB paper | GitHub repo | Homepage | George's LinkedIn
Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Wardby. I'm delighted to say I'm joined today by George Konstantinidis, who will be talking about his VLDB22 paper, Enabling Personal Consent in Databases. George is an assistant professor at the University of Southampton, and he's also a Turing Fellow with the Alan Turing Institute in London. His research interests are databases, data integration, data sharing, and knowledge graphs. Welcome to the show, George. Hi, Jack. Thanks for inviting me. Nice to be here. Pleasure to have you. So let's dive straight in. So can you tell us a little bit about your journey?
Starting point is 00:01:05 How did you get interested in researching databases? And obviously, specifically, the topic of the show today, personal consent. Yeah. So after I graduated from my undergraduate bachelor's degree at the University of Crete in Greece, I started developing an interest in artificial intelligence and data management at the same time. And I did my first master's there at the University of Crete and worked for a bit at the Foundation for Research and Technology, Hellas, known by the acronym FORTH. So during that time,
Starting point is 00:01:50 I think I started shifting more towards databases because I found it to be more principled. Back then, AI was mostly working with the model of let's find a problem and throw everything we have at it. It was not so principled to my eyes. Then later, you know, with all the deep learning and machine learning evolution, it became much more principled and maybe to its, not to its benefit, I'm not sure. But in addition to the more principled fields
Starting point is 00:02:32 that I could see in databases, I liked more the conferences, the community. So gradually, eventually, I found myself working in the field of data management. Let's get into the meat of the show today. So can you tell us what is collaborative privacy? And it's a really important kind of topic in your paper. So can you give the listener an overview of kind of what it is?
Starting point is 00:02:59 Yes, sure. So this was born with the idea, I mean, back around the time when GDPR was about to come out, we started thinking about how could we support, technologically support, this kind of ideas of GDPR, of protecting personal consent and personal privacy in a machine processable way. So collaborative privacy is the technology that allows collaborative parties to automatically capture, update and implement a data privacy contract. So when you have data sharing between entities, they usually agree on several terms
Starting point is 00:03:46 and collaborative privacy is there to provide automation to that process of agreement and enforcement implementation of those terms. It's a new concept that we're trying to push and it's related to data privacy, but it's not exactly data privacy because in the collaborative privacy concept, you don't have an adversary. So the idea is that you are trying to enforce your privacy preferences in coordination with
Starting point is 00:04:22 the service provider, not hiding something, not encrypting something against the service provider. Okay, because it's not like a different trust model that you have. Okay. So what's an example? Can you give us an example of how someone in their day-to-day life would be involved
Starting point is 00:04:40 in such an agreement? What's a typical example? So you have typical examples from, I mean, it's everywhere in all our interactions with services on the web today. So even on social media, you commit your data to a social media provider and you trust them. You have agreed with them on the use of your data using these terms and conditions documents, maybe doing some opt-in, opt-out choices, right? And then that's your contract.
Starting point is 00:05:15 But you completely trust them with your data, right? Another very simple example is when you go to buy something from an e-commerce website and you put your email address. You don't encrypt it. You don't secure it there, right? But you trust the website to use it per your agreed preferences, right? To use it according to your agreed preferences.
Starting point is 00:05:37 So use it to send you update emails about your purchase, but not advertisement, let's say. So that's on one end, on the consumer end. But you can go all the way to businesses merging their databases or federated learning of different datasets where you have privacy concerns between those datasets. Or when you have a company buying another company
Starting point is 00:06:07 and they want to merge the data. Or in several other data consortia, let's say, where you have a data sharing scenario, you have these agreements that are put in place in order for some preferences to be respected. It's there that we envision technology playing a major role in the future. So what are the problems with the current way that collaborative privacy is implemented?
Starting point is 00:06:36 I mean, I know the one thing from my experience is the terms and conditions. I mean, I very rarely read them, right? And I saw, I remember seeing this art show once where someone had printed off basically all the terms and conditions from the popular social media companies at the time and they were so long. There's no way a normal person would read them. I'm sure if you added them all up,
Starting point is 00:06:55 there'd be not enough years in your life to read them all, the amount you agree to these days. But anyway, yeah. Exactly. So this is one aspect of the problem. The fact that no one reads those terms and conditions. They are written in natural, in legal language, right? They're hard to read and they are much more hard to enforce.
Starting point is 00:07:13 So usually these terms are written in a top-down way. So you don't put the terms in the contract. The service provider does, right? So they are top-down, they are imposed on you. Usually they are an accept all or nothing kind of agreement. So you want to accept the terms, you get the service. You don't want to accept the terms, you don't get the service. There is no fine-grained saying for you in those terms, right? So these terms are very coarse-grained.
Starting point is 00:07:46 The only amount of automation that happens is very coarse-grained. It's predefined opt-in or opt-out options for a particular set of scenarios, right? At the same time, this is a problem for the service provider as well because they do this agreement with you and then they have to hire an in-house engineer
Starting point is 00:08:09 and tell them to implement into code whatever they have agreed with Jack or George or whoever. And because this implementation is ad hoc for an agreement, you cannot have these agreements varying a lot between users right now because you have to implement them. So you have to implement a different agreement, right? So the idea is that if you do this somehow automatically
Starting point is 00:08:42 or semi-automatically, you could give more power to the user to have a say over their personal preferences in a more bottom-up way. So the user will co-construct the contract, which will be in a machine-processable way. Of course, to go there, it's not immediate because you have to have users that are data literate, right? Or you have to have agents that act on behalf of users. They have some defaults and they put some kind of constraints in the system in a machine processable form. And then this is the contract and the contract gets automatically respected, automatically implemented.
Starting point is 00:09:21 So this is the idea and we have done some initial work on this. Cool. So just to kind of, I guess, go jump back, you mentioned earlier on in that, you kind of said how it's different from data privacy, but are there any techniques that exist in the sort of data privacy space that could be applied to help address this? Or is it just totally kind of not relevant? No, it's a very good question. And that's the first thing that we looked into. Okay, first we said that collaborative privacy
Starting point is 00:09:51 is not substituting data privacy. It comes after. It's complementary, right? So data privacy comes before you encrypt, you secure whatever you want to protect. Then you have to commit some data. You have to give some data, but then you still have privacy preferences
Starting point is 00:10:06 or concerns, right? So in that sense, it's not data privacy, but at the same time, we could look into particular data privacy techniques. There is, for example, a technique that we started looking, we originated from looking into that technique, which is called controlled query evaluation, where you have a query to be executed against a database, but you want to do this in a controlled way,
Starting point is 00:10:30 in a way that respects some requirements written down in a machine-processable way or some specifications, some data privacy concerns. But this is very rigid in data privacy. You are afraid of of revealing too much information so you are you're very very strict in in our setting you you trust the other party so you can reveal more information so so deep down the the in the at the first instance the technology that we started developing is related to data privacy technologies, but from a different spin with different semantics.
Starting point is 00:11:10 Okay, cool. So I guess let's dig into your solution. So this is called consent constraints. Can you tell us what these are and how they work? So I spoke about the vision that we have, right? But as a first step, we wanted to start from software and keep it simple. And we said, okay, let's start from relational databases and see what can we do within relational databases. How can we capture some kind of consent of preferences on the processing that can happen inside the relational database. What kind of processing can happen? What is the most common processing in relational databases? It is query answering. So, okay, let's try to encode some constraints
Starting point is 00:11:59 that will impose some restrictions on query answering. Again, looking into data privacy, there has been some work that deals with what is called denial constraints. Denial constraints are queries, are negative queries, are queries that you don't want to be answered. So at first we imagined the setting where by default, the user would allow queries to be answered. So at first we imagine the setting where by default the user would allow queries to be answered unless they explicitly want some kind of join, some kind of projection or some kind of
Starting point is 00:12:34 selection not taking place. In that case they would write a negative query that in the face of a query from a service provider or some kind of processing from the service provider will affect how the service provider's query will get answered. So this is the idea. Again, it's not completely protecting against never finding out the answers of the negative query. It's more like explicitly protecting against certain operations happening on the data at the time of the query answering. And then trusting the service provider that for a future query, they will go through the system again, rather than try, let's say, to infer something that they shouldn't. Okay, so there when you said that it's not completely protected,
Starting point is 00:13:22 that you can, it's basically, it's preventing specific operations happening, but you can, in some other roundabout way, deduce something that you would have necessarily kind of not been concerned about. So the way classic data privacy works is it is afraid of collusions. It is afraid to answer query A and then query B in a privacy compliant way. But then somehow through the answers of the query A and then query B in a privacy compliant way
Starting point is 00:13:46 but then somehow through the answers of the query A and query B the adversary can combine those answers and find something out which is not explicitly violated in query A or query B in isolation. In our setting we don't
Starting point is 00:14:02 we're not afraid in some sense of this collusion happening. Unless something is explicitly violated, we return the answers. So we give more answers, essentially, to the service provider to play with, at their disposal. And we trust the service provider that when they want to do some combined processing, they again will go through the system and explicitly filter. And so they will not try to do collusion and obtain something that we don't want them to obtain. Again, this is not about protecting, it's about enabling the service
Starting point is 00:14:38 provider to do the most processing that they want to do in a consent abiding way. Okay, great. So you've obviously taken this idea then, and you've designed an algorithm, a system that will allow a service provider to go and honour these consent constraints. Can you tell us more about how the algorithm actually works and how the system works, how that kind of was all done? So there are two algorithms. Initially, we had to try to find semantics for query answering.
Starting point is 00:15:04 So we wrote queries and constraints on the blackboard, There are two algorithms. Initially, we had to try to find semantics for query answering. So we wrote queries and constraints on the blackboard and we scratched our heads and said, okay, what does this mean? What answers do I want here? And slowly we ended up in a semantics definition which is based on provenance. So the idea is that in an imaginary world, you tag every tuple and every cell of your data,
Starting point is 00:15:30 you annotate it with some label, and then both the consent constraints, these negative queries, and the service provider's query, they both get answered on these annotated databases. And then the answers carry with them some annotations, some provenance on the way that describes the way of how these answers got obtained, right?
Starting point is 00:15:59 These answered apples. So you do that both for the query, the input query, and for your, let's say, consent contract, and you try to do some kind of difference there. So you give back to the query issuer only those answers that are not labeled with, let's say, data that you don't want to be given back. But these labels are more complex than traditional labels in the sense that you can now annotate joints rather than just simple cells. With some mechanisms that we have, you can allow, let's say, two labels in isolation
Starting point is 00:16:41 to be given to the service provider, but let's say my disease table, right? Or the rows of the disease table that belong to me, right? But then when you join the disease table, let's say with the insurance table, it's then when I don't want my information to be used. So I can describe things like that. That was the first algorithm. That was mostly to give theoretical foundation to be used. So I can describe things like that. That was the first algorithm. And that was mostly to give theoretical foundation to the work.
Starting point is 00:17:09 And then based on that algorithm, we proved some complexity results. We proved some formal connections to data privacy. And then we went on and we devised a second algorithm, which does not need to touch the data. It's data agnostic in a sense. So you have your consent contract and you have your input query. And then what we do is query writing.
Starting point is 00:17:30 We rewrite the input query into a new query that no matter the database, when executed will abide by the consent contract. And of course, this query depends on how large your consent contract is, how many consent constraints you have. It could be a very large query. So it's not 100% clear that this is the best approach to go about,
Starting point is 00:17:58 although our experiments show that this is a better approach than the provenance-based mechanism. Yeah, I was going to say, than the provenance-based mechanism. Yeah, I was going to say, because the provenance-based mechanism sounds like it has quite, it's quite invasive, right? It sounds like it has quite a potential high cost on the, kind of bloat in the storage layer, maybe, if you're annotating everything. But yeah, I guess then the trade-off of the query writing is that there's a cost associated with that as well.
Starting point is 00:18:20 Exactly. It depends how large your query writing becomes. Yeah, there's a trade-off space there, I guess. And there is work there to still work there to investigate what is the best approach for different scenarios. So we are currently in the process of that. Amazing. So what were the challenges you had to overcome in this sort of process of starting off with algorithm one and going on to algorithm two? Okay, so first I would like to point out one of the major challenges in this line of work is cultural a cultural obstacle okay so initially when when we were um uh trying to publish this work uh the comments that we were getting back is how how can you be sure that you will enforce the contract, right? So there is this still cultural change that needs to happen
Starting point is 00:19:07 to understand that today we do give data to service providers with no mechanistic, no algorithmic guarantee of privacy enforcement, right? The privacy enforcement that you now trust is all legal or extra algorithmic, right? So the first obstacle that we have to, of course, when something like this happens, you always come back and you say, I should have described the work better. I should have described the motivation better.
Starting point is 00:19:40 But still, I see this when talking to people, it's a hard-coded way of thinking that we have about privacy that does not allow us to easily switch to this new model. And we always go back to think, how am I going to enforce this? How am I going to enforce that the service provider is not going to violate this? And the answer is, you're doing that for the service provider as well, because the problem starts with, they don't want to violate your preferences. They have business incentives not to violate your preferences. You remember how much, let's say, a big social media company changed policies after certain scandals, like the Cambridge Analytical scandal. They have business incentives to do that.
Starting point is 00:20:30 They have legal incentives to do that. And because they don't want to, they attract more customers, if you will, if they are more transparent, right? So they want to have this automated means to not violate your preferences, right? So they want to find this, they want to have this automated means to not violate your preferences, right? So this was the cultural challenge that we had. Of course, then we had technical challenges, right? So technical challenges is, again, it had to do with what does a consent constraint mean exactly? How do we encode this kind of joins? Why queries? This seems unnatural. Why? I mean, you have some privacy preferences. Can you encode all your privacy
Starting point is 00:21:11 preferences in queries? No, you cannot, right? Unless you have a table for anything, for purposes. You must have a table that describes purposes, right? Otherwise, you cannot. And by the way, previous data privacy work tries to do that by encoding all the language of of purposes inside inside the database itself right so you cannot but you have
Starting point is 00:21:31 to start from somewhere and and we asked ourselves okay so let's start from selection projection and joins these are the main operations that you want to talk about in your in your contract right and the other challenge that we have is okay okay, the average user is not really familiar with selection projection and joins to the extent that they can play with them and write consent contracts. So there is more research to happen there on the automation of these preferences.
Starting point is 00:22:00 How do you go from a friendly UI to these consent constraints? And that's another challenge that we are still working on. And of course, other challenges for the particular paper was what to compare against, because we don't have another approach which is similar. We have terms and conditions, or we have classic data privacy approach. So we did a mix of comparing against both, mostly against the most relevant data privacy approaches. And the last challenge had to do with obtaining data and to run our experiments because this is a new idea.
Starting point is 00:22:39 There are no consent constraints documented. How you are going to do this? And again, we created data generators and consent generators and we did experiments on synthetic data but we also obtained real data, anonymized data from clinical trials and from patients in ICU units and wrote some constraints of our own on top of this real data. We can maybe just dig into a little bit and how you actually took the algorithms and whatnot and how you implemented these in some framework.
Starting point is 00:23:15 Can you maybe tell us more about the framework that you use to then evaluate your approach? Yes. Okay. So first, with respect to the classic data privacy, there is a work, a series of works, actually, known as Hippocratic databases. And where the ideas of the privacy ideas that they had implemented there, they were reminiscent of what we were trying to do here. So they had some opt-in, opt-out choices, but of course, from a classic data privacy perspective, when you opt it out to share a particular row, you could not even use that in joints anywhere. It was blanked out, essentially. And in our case, we implemented
Starting point is 00:24:01 an opt-in, opt-out approach using our consent constraints, which are much more powerful. But just to compare, we use them only as hiding projections. That is opt-ins or opt-outs, essentially, right? Projecting out columns from particular rows of your queries. But in our semantics, you can even still do joins with hidden attributes
Starting point is 00:24:24 as long as you don't want these attributes to be returned. So it returns more data. So when we compared, we compared how fast we are, but also how much more data we return against this other data privacy approach. So this is the connection, the technical connection, let's say, to this other approach that is out there. So in terms of the framework, right now it's a prototypical implementation. In the input, we have these consent constraints. In some notation, we use the data log notation to write rules. And we have implemented this in
Starting point is 00:25:07 everything in Java. You get the input query in SQL and you get these constraints and then you create a SQL rewriting in the output of your Java program that is good for execution on top of your databases, on your database, and we executed that in Postgres,
Starting point is 00:25:24 in the Postgres database so this framework right now once you know what what you want to do with it it's ready to to go it's it's open source the link is in our paper the github link is in our paper but of course you would need all the technology around that to use it right now. So you would need to go from the preferences to the actual consent constraints. We are doing active research on that. And you would need to encode that inside your system somehow. But the framework can be run autonomously. We mostly support the query rewriting approach.
Starting point is 00:26:07 The provenance-based approach was mostly implemented for comparison and is not, let's say, production-ready in some sense. So we do have that in the GitHub as well, but it's probably more buggy than the query rewriting approach. What was your approach to evaluate a new framework and kind of the questions you were trying to answer and how did that look like the good thing with this technology is that and with the vision that we have for this technology and i can go into that more later i mean what comes in the future. But the vision is that you can encode both personal constraints.
Starting point is 00:26:48 So bottom-up, let's say, creating a contract of data sharing bottom-up or data processing bottom-up, but also top-down. So you can encode also institutional policies, right? So you can encode things that talk about large portions of your data with these queries. Simply, you know, an atomic query that mentions stable employees simply, and let's say forbids the projection of the first attribute, is talking about the entire set of your employees, right?
Starting point is 00:27:20 So this is like an institutional policy. Do not do this at the institutional level. But you can also fix in your query your personal ID and talk about... So these negative query answers, so the way you talk about your data is through these negative queries, will only be particular to your data that contain this particular ID because you fixed it in the query. So you can go all the way from institutional policies to personal constraints. And of course, a natural question is, how do you perform?
Starting point is 00:27:55 What can be supported? What are the limits here, right? Indistinguishably, we found out that personal consent constraints are easier supported to a bigger scale, right? We can scale to thousands, tens of thousands of personal consent constraints in our experiments than scaling these large institutional policies, right?
Starting point is 00:28:20 This, so interestingly, we found that a major factor that affects query performance here is how much data does your negative consent constraint touch? How much data does the negative consent constraint talk about? If it talks about a little bit of data, even if you have hundreds or thousands of them, so individual user policies, normally you would expect the data for an individual user within a database to be small, to be localized, right? So in that scenario, you perform fast. You go, essentially, there were cases where the consent-abiding query execution was as fast as the original query executed with no consent enforced. So the consent overhead is not much. Worst case, we found that it is linear in our experiments.
Starting point is 00:29:16 In theory, it could be worse because the problem we proved is an NP-hard problem. So in theory, it could be worse. But in practical settings, in all our generated settings, in the ICU, the real data that we used, the overhead of enforcing your privacy is kind of linear. When you go to global policies, this changes radically depending on what kind of policy you want to enforce. You have policies that are very complex and they slow down the query very much. And you have policies that are easier to enforce and they're not so complex. So regarding this interplay and how much bottom-up against top-down contracts closes or requirements you can enforce, we're still looking at this and
Starting point is 00:30:06 investigating into this interplay can you tell us a little more about the the experimental setup and you mentioned earlier on and how you got data from some icu and you know i know you used their tcph as well so can you tell us a little bit more about this the experimental itself so uh for experiments we use the tpch benchmark but of course this um does not have uh it comes with a set of queries and it comes with a different number of scales you can scale from from a few rows in your tables until uh until millions of let's say, of tuples in your tables. And we did all this range of data generation. But of course, it doesn't come with constraints, with consent constraints.
Starting point is 00:30:54 So we looked at the queries that it comes with and we tried to change them to make them more like, to generate, let's say, more like constraints. One thing that we did is we looked back in this Hippocratic databases world that had policies and data privacy policies that had to do with patients. And we tried to mimic those constraints, which again, they were hardcore, hard data, strict data privacy constraints, but still they were useful to us. We try to mimic those constraints on top of our TPC-8 benchmark, which means if there was a constraint that talked about a patient,
Starting point is 00:31:41 we transformed it to a constraint that talks about a customer. So a customer does not want to share their address, similarly to a patient does not want to share their disease or something like that. So this was a kind of inspiration that we took to create realistic constraints on top of the TPC-8 benchmark. And then we created, using this, we created hundreds of constraints or thousands of constraints by simply talking about more and more customers. And then we created constraints, more complex constraints by joining them with the customer relation with orders relation and putting some constraints on the orders. And so we really drilled down into the TPCH benchmark in a clever way. We didn't simply try to brute force create constraints that do not have any sense.
Starting point is 00:32:36 Similarly, we did with the ICU dataset. Again, looking inside the dataset, we created constraints that had to do with, we found a way to automatically generate constraints because we need large numbers of them that have to do with real concerns that one might have on the length of the stay in the ICU, on the particular treatment that they got while they stayed there, on the disease, on the diagnosis, and so on. So this was the setting, and we scaled to different numbers of constraints. We did experiments with constraints that touch a little bit of data, lots of data, with small databases, large databases.
Starting point is 00:33:20 Awesome. So I guess, what were the key results then? So if you were to kind of summarize them and put some sort of numbers, I know you said before that the smaller the amount of data, the negative query touches, the faster you are. But what are the other key results from your experiments? on par or even faster than data privacy approaches, while giving back on average 30% more answers to the service provider, because again, of this premise that the answers are not explicitly sensitive against the query, and therefore we're not afraid to give those answers back, in contrast to data privacy approaches. The slowdown of the queries was linear, as I said,
Starting point is 00:34:13 and we found out that compared to classic data privacy approaches like controlled query evaluation, there is a version of our framework that can also be used as that. So, and we did that both theoretically by proving that and we found, we did an experiment as well where we use our framework
Starting point is 00:34:36 as strict data privacy, as a strict data privacy framework. So in that case, but this for us, this can only happen with particular kinds of constraints. When you restrict yourself in, let's say, in what we call Boolean constraints, constraints that do not talk about projections, then the enforcement of these constraints will be privacy abiding as well, in some sense.
Starting point is 00:35:03 Okay, cool. So are there any situations in when your approach is performance is suboptimal? I'm guessing I'm trying to get here. What are the limitations of this? Yeah, I think also what I forgot to mention earlier, that we did some experiments with aggregation as well. But for this, there is much more research to be done because when the
Starting point is 00:35:27 user, the service provider queries have aggregation, what we did we took it out of the query, we executed the query in a consent-abided way and we put it back. That's one semantics to go about it. But is it the correct one? Is it what you want? And more interesting
Starting point is 00:35:43 is how do you evolve your your consent language to contain aggregations and this is not clear at all we are currently looking into that but we did some initial experiments with with the disclaimer that the results might not be consistent because there is no developed semantics right now but these are the numbers this is how it would look the limitations in the terms of experiments are those that i mentioned in terms of performance the limitations are when you use constraints that touch lots of data you don't scale as much the slowdown is becomes becomes large however it depends on the application of how much slowdown you can actually tolerate. Because if you're doing this
Starting point is 00:36:28 overnight, you might be able to do even large policies. If you're doing this every second, it's a different story. So it depends on really the application what are the actual limitations there. Other limitations of the work in general are again the as i mentioned are again um how to capture the users preferences and and make them into this consent constraints and we are actively working on that uh but the the most maybe the major limitation which is which i don't see a limitation, but I see as an opportunity to invest and look into,
Starting point is 00:37:07 is the consent language right now. The consent language, yes, it's more expressive than what you could do with opt-in, opt-out, and we also experimentally verified that by giving back more answers. But it's still a language that has selections, projections, and joins. That's it. That's the queries that we can write, conductive queries, projections, and joins. That's it.
Starting point is 00:37:25 That's the queries that we can write, conductive queries, essentially, that we can write as constraints. What happens when you extend this language and how you can extend this language? How do you talk about purposes? How do you talk about other kinds of processing on top of relational data?
Starting point is 00:37:41 The queries is not the only kind of processing that can happen, right? Learning, let's say, analytics, other kinds of processing that can happen on top of your data. So currently you cannot do this with our initial work, but this is one major direction for us for future research. Cool.
Starting point is 00:38:00 So you mentioned earlier on that the framework is available on GitHub, but how has, maybe there's some ongoing work in this area but how is a sort of service provider can how easy would it be I guess just a question to take what's there at the minute even though it's in a kind of a yeah production ready and then integrate that into my sort of um yes yes again I know I know there are approaches in industry that, for example, I think Snowflake and Amazon have approaches that are called data clean rooms now that try to do some kind of data privacy enforcement in the face of queries and privacy preferences.
Starting point is 00:38:39 They do some kind of cleaning of the query answer or something like that. I'm not sure what's happening, but I know the work is relevant. So for something like that, you could take the framework with appropriate licenses. That is, it's open source, but we have, we do want the framework to remain and extensions to remain open source. So you can take the framework
Starting point is 00:39:09 and as long as you have some kind of preferences, you can always translate them in a data translatable to this kind of query language that we use. You can simply take the framework and run it. You take the input query, you put it in a middleware between the user's query and the database, and you just run it. Now, the question is, how do you obtain the constraint?
Starting point is 00:39:35 I want to claim that even for industry right now or for large entities, to use our framework is easier than what they are doing now. Because they would still need to hire somebody who would look at the user preferences, even the terms and conditions that are being written right now, and translate those to these negative constructive queries that we
Starting point is 00:39:58 use, negative constraints. Some of those. Of course, you cannot substitute the entire agreement, arbitrarily legal sentences, legal clauses, right? But when it comes to query processing preferences, you can encode some of those, even manually, to our constraints. Then instead of going all the way from the contract to implementation code, you go from the contract to a declarative language and then you execute that
Starting point is 00:40:27 in an automatic way so i hope this this will find applications but of course we are not being quiet we are still working on aspects of this actually i've recently been awarded a two horizon projects horizon Horizon Europe projects. One is RAISE. It's called RAISE. The other one is called Upcast. RAISE has already started. RAISE stands for Research Analysis and Identifier System.
Starting point is 00:40:56 And the RAISE project is about moving, instead of moving data to algorithms, moving algorithms to data. And do this in a privacy-preserving way. So we are doing the privacy aspect of the, let's say, the privacy-respecting aspect of collaborating privacy aspect of the project. And the Upcast is about to start in January. Again, it's a very large consortium, a very large project, where it's mostly dealing with data marketplaces.
Starting point is 00:41:25 And these ideas are central to that project. How do you negotiate contracts in data sharing and how do you put them in place using an automated, let's say, algorithmic way? This slowly, so it's getting more and more traction and enables us to extend this in many more directions in the future. So what's been the most interesting and maybe unexpected lesson that you have learned while working on this project and on personal concern. So the most interesting lesson is that we can always reuse and always connect different areas in ways, in new ways that we didn't imagine before, right? And this is not so
Starting point is 00:42:22 much about the work that I have done in this paper. This is about additional work that we are doing, where we are actually reusing ideas from data integration and knowledge graphs to support this kind of automated contracts. And then suddenly all these technologies become relevant again, which were hot in early 2000. Let's say query answering using views or other kinds of problems.
Starting point is 00:42:49 They were very relevant in the early 2000s, then kind of died out a little bit. Also AI techniques that are put aside now for the sake of deep learning, right, of approximate things. But right now you realize that you cannot approximate the enforcement of a contract you have to do it in a strict way so again you have you have to use we are we are going back to using some strict knowledge
Starting point is 00:43:19 representation and reasoning techniques right but do some discrete, some exact reasoning, because you want to do some kind of exact reasoning on top of these contracts and say what is allowed, what is not allowed. And you cannot simply learn this. Of course, machine learning and deep learning still has its place, especially in the front end of this work, because you have to learn the preferences of the user.
Starting point is 00:43:43 But how to enforce them, you cannot enforce them in an approximate way. So one of the interesting things that come out of this work because you have to learn the preferences of the user. But how to enforce them, you cannot enforce them in an approximate way. So one of the interesting things that come out of this is that I can see that all things becoming new again, and I can see that I can reuse and I can connect dots that I didn't know that were there to be connected. In terms of culture, there are some surprises because people always discuss when new research happens that a cultural change is difficult, but it's always interesting to see this from a first row, let's say, and trying to advocate for this and trying to see that, you know, a cultural change is really happening here. In terms of results, of technical results,
Starting point is 00:44:27 they were not really surprises because we really believed that this would be viable and feasible. So there were very positive results that came out, but I wouldn't call them surprises per se. Yeah, it's very interesting how things often come full circle, right? Exactly. I think in computer science, that's a characteristic of our field,
Starting point is 00:44:52 that it works like a pendulum. Things go from one end to the other and back. Let's see, for example, what's happening three or four times already with distributed against centralized computing. So I guess the knick-knacks, because I tend to ask this to most of my interviews, is research obviously is non-linear, right? It's a bumpy process. So from the conception of the idea for this to the actual final publication, what were the things that you tried on the way that maybe the listeners
Starting point is 00:45:23 could benefit from knowing about? The way I worked here is that I didn't abandon the idea when it was first rejected, let's say, and I put more into it. So I gambled in some sense. So I hired more people and started working more into it. So I prepared papers for down the line and extensions to the system. We already have a very cool extension to the system that is not published. It's about to be published, which is the following. Consider some data processing that happens. And with this consent
Starting point is 00:45:58 constraints, you answer a query and you get back a data output, let's say, the answer of your query. And this goes to further live on within the data ecosystem. We can automatically generate consent constraints, so a new contract for the query output based on the old contract. We call this consent propagation. So you have a consent contract, you do some kind of processing, you generate an output, a this was one thing, for example, that we did while the paper got rejected. And instead of, I mean, we still kept on working on the paper.
Starting point is 00:46:52 The paper got rejected for cultural reasons. How are you going to enforce this again? We didn't get disappointed and we continued working on this, extending this. So the way i work is i i like brainstorming on using the whiteboard with the students with the postdocs with colleagues right
Starting point is 00:47:15 and then taking some homework for everyone then trying trying this out and coming back and brainstorming again it's very very important very important to communicate with other people. And I should say when you put extra authors in your papers, the work gets multiplied, right? So two people and even at an exponential rate. So two people will not just do twice the work. They will do more than twice the work. So collaboration is essential
Starting point is 00:47:46 talking to people about your ideas is essential being generous with people's participation and attribution of contributions let's say is important so this way i think you generate robust and work that that can stay and generate more potential So we've touched on it at many points across the course of the chat, but what's next for this line of work? I guess we could maybe just summarise the ones that you've mentioned so far. If there's any additional ones, then... No, I think the main thing that we want to work with is going forward is, yes, the front end, how do you encode the consent constraints the consent
Starting point is 00:48:26 propagation how does the consent live in the data ecosystem or this privacy preferences contract how the how do these get generated and live in the data ecosystem but most importantly investigate into more rich language that stand between the user and the service provider, the data owner and the service provider, richer languages to describe privacy preferences and consent. In this problem, I see knowledge graphs playing a big role. There are already W3C vocabularies being developed for GDPR and privacy and consent,
Starting point is 00:49:06 and we can reuse and extend those. And moreover, these data integration ideas where you have a vocabulary in the middle and then you have different sources mapping to those vocabularies, and then you have algorithms in data integration of how you use those settings, right? This can be reused here because the vocabulary could be like a global vocabulary of contracts or your global contract, let's say. And then you have new purposes that come along in the future, right?
Starting point is 00:49:35 And express themselves as parts of the original vocabulary. So let's say, and in that way, you can give consent for future processing without having even thought about it. Let's say, or you can withdraw consent. Let's say you don't want face recognition to be done on your pictures if you look too old. This is an arbitrary constraint. You might have it. It's your data.
Starting point is 00:49:59 I'm not the one to judge about your privacy preferences on your data. And then you have a new purpose in the future, which is, let's say, emotion recognition. But emotion recognition is a type of face recognition. So if you have done a negative constraint on face recognition, it is implied, it is reasoned automatically by the system that unless the system comes back to you and you allow it, by default, through reasoning, you can see that you disallow emotional recognition as well. So working in all these kind of rich languages in between that allow for reasoning, that allow for richer purposes, that allow for withdrawal of consent, which is very important.
Starting point is 00:50:45 What happens in the future if I want to withdraw my data from a particular service provider, right? That has already obtained my data, my data set, and run away with it, right? If we all point to some centralized knowledge graph, I might be able to express that and propagate that all the way down to everybody who's got a hold of my data, even if I don't know who. Nice. I just wanted to kind of pull up on the thread a little bit more of how you approach idea generation and your whole process.
Starting point is 00:51:17 You like to go brainstorming and kind of collaborate with students. You can throw some ideas on and then kind of iterate like that. How do you then go about selecting which projects or which ideas to pursue? Can you tell me a little bit more about your process there? Okay, so I'm not doing this in a very strategic way in the sense that, oh, this idea is going to get me lots of papers. This is important as well, but I want to feel that the idea itself is important and will provide contributions to
Starting point is 00:51:51 our field, to science and to society in general. And like this idea, it can be so small as looking at it within the relational databases, or it could be as large as expressing your privacy preferences over your personal data on the web, in the wild, right? Which is like a contribution to society at the end of the day. So I like to look at ideas that have potential for greater good,
Starting point is 00:52:23 but also, of course, be relevant in what we are living today. Right. The once I get an idea and I think it's cool, I usually go after it. I usually go after it. So sometimes this doesn't pan out. So some ideas are better than others, right? But I think once I get an idea in my head and I think it's cool, I talk to my students and my boss docs or the idea might come from them sometimes, right? And we brainstorm, we get excited, we spend hours in front of the whiteboard, right? That is sometimes we, I still, sometimes I still work as a student in the sense that I don't go home for dinner. I just stay and, you know, order and take away and let's stay and work on this. This is so exciting.
Starting point is 00:53:14 The important thing for me is to have fun. If you have fun doing that, then you should keep on doing that, right? If you are working on an idea and you hate it but it's hot and you you believe that you should work on it don't do it it usually does not produce so yes have fun at the end of the day have fun what do you think is the biggest challenge in this research area now i can't be my answer will be very short in that it will be it's a cultural challenge as long as as people will get the idea of collaborative privacy i think they are going to jump on it and i'm happy with it go and if the listeners like it
Starting point is 00:53:51 go and take it and and play with it and and produce things right i think as as soon as we realize oh we've been doing it's 2022 23 almost and we've been writing contracts in natural language and terms and conditions these huge ugly documents in natural language instead of having some natural language default and then doing so so much automation afterwards like we can do so many things automatically in between so this is a cultural change that that needs to happen once this this idea starts spreading i think this this will become a a very hot area what's the one key thing you want the listeners to take away from from this this show this episode have fun i think the one key thing is have fun with uh when you have fun, you produce nice, good research.
Starting point is 00:54:47 Be meticulous, of course, and do things properly. Have integrity, but also have fun. That's my takeaway message. That's a brilliant message. We'll end it on that. Thanks so much, George. If you're more interested in knowing more about George's work, put links to all of the things you mentioned
Starting point is 00:55:07 across the show in the show notes. And we will see you next time for some more awesome computer science research. Thank you, Jack. Thank you.
