Disseminate: The Computer Science Research Podcast - Lessons Learned from Five Years of Artifact Evaluations at EuroSys | #64

Episode Date: July 30, 2025

In this episode we are joined by Thaleia Doudali, Miguel Matos, and Anjo Vahldiek-Oberwagner to delve into five years of experience managing artifact evaluation at the EuroSys conference. They explain the goals and mechanics of artifact evaluation, a voluntary process that encourages reproducibility and reusability in computer systems research by assessing the supporting code, data, and documentation of accepted papers. The conversation outlines the three-tiered badge system, the multi-phase review process, and the importance of open-source practices. The guests present data showing increasing participation, sustained artifact availability, and varying levels of community engagement, underscoring the growing relevance of artifacts in validating and extending research.

The discussion also highlights recurring challenges such as tight timelines between paper acceptance and camera-ready deadlines, disparities in expectations between main program and artifact committees, difficulties with specialized hardware requirements, and a lack of institutional continuity among evaluators. To address these, the guests propose early artifact preparation, stronger integration across committees, formalization of evaluation guidelines, and possibly making artifact submission mandatory. They advocate for broader standardization across CS subfields and suggest introducing a "Test of Time" award for artifacts. Looking to the future, they envision a more scalable, consistent, and impactful artifact evaluation process, but caution that continued growth in paper volume will demand innovation to maintain quality and reviewer sustainability.

Links:
- Lessons Learned from Five Years of Artifact Evaluations at EuroSys [DOI]
- Thaleia's Homepage
- Anjo's Homepage
- Miguel's Homepage

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Disseminate: The Computer Science Research Podcast. Hello and welcome to Disseminate, the computer science research podcast. Today's episode is going to be slightly different to the usual episode: we're going to be exploring a topic that's really important to how we build trust in scientific results and the scientific process, and that is artifact evaluation. Specifically, we'll be talking about a recent paper titled Lessons Learned from Five Years of Artifact Evaluations at EuroSys.
Starting point is 00:00:29 And I'm sure a lot of our listeners know what EuroSys is as a conference, but for those who don't, EuroSys is a leading European conference that covers a whole wide range of aspects of computer systems, from operating systems all the way up to embedded systems, databases, networks and storage, a whole range of topics. And yeah, so the paper was co-authored by several of the guests I've got on the show today, who were the artifact evaluation co-chairs from 2021 to 2025. And they're going to be talking today about their collective experience, reflecting on what worked for them and what hasn't over that time period, and how, looking forward, the process
Starting point is 00:01:12 can be improved and how we can evolve things to better support artifact evaluation in systems research. So welcome, guys. I'll let you introduce yourselves, going around one by one. Thaleia, do you want to kick things off? Yeah, absolutely. Hi, my name is Thaleia Doudali. I am an assistant professor at the IMDEA Software Institute in Spain, and I was one of the artifact evaluation co-chairs in 2025. Miguel, do you want to jump in now? Yes, sure. Hi everyone, thanks for having us, Jack. So I am Miguel Matos. I'm an associate professor at IST Lisbon and a researcher at INESC-ID, and I was the artifact evaluation co-chair in 2024. And Anjo, you're up.
Starting point is 00:02:00 Hi, I'm Anjo. I'm a researcher at Intel Labs and I was the co-chair in 2022. Fantastic. Well, I just want to say again, thank you to all three of you for taking time out of your busy days to come and talk about artifact evaluation. So I'm going to kick things off with a nice softball and ask you to explain what artifact evaluation is
Starting point is 00:02:19 and why it's so important. And then I'm going to get out of the way and let you three discuss the topic. So, yeah, what is artifact evaluation? Why is it important? Awesome. I will take that. So artifact evaluation is a voluntary process
Starting point is 00:02:32 that promotes the reproducibility and reusability of scientific work. So the process is that we submit papers to conferences, like EuroSys. And alongside the paper, we also submit the artifact, the software that was built to create the scientific results produced in the paper. So the software, the data, the documentation: how to run it, how to execute the code, how to replicate some of the results in the paper. And this artifact evaluation
Starting point is 00:03:04 process greatly benefits the scientific community, because it really encourages the reproducibility of the results and the open sourcing of the software, and it enables researchers to build upon, compare against, and extend prior work. And this is something extremely useful in systems research. And for authors in particular, it increases the visibility and the impact of their contribution by making it easier to reuse and validate the scientific results in their paper.
Starting point is 00:03:34 Yes, exactly. So we have been working on refining this process over the different editions. And currently, this is a voluntary process. Now, authors are highly encouraged to participate, but it's completely voluntary. And the process essentially goes in three phases. So after the authors know that the paper is accepted at the conference,
Starting point is 00:04:02 they can apply to the artifact evaluation. So they submit their artifact, which is typically the source code, plus maybe benchmarks and datasets if relevant. They also submit the main paper that has already been approved by the program committee of the main conference. And then they also add an appendix that explains how to reproduce the main claims of the paper. So artifact evaluation does not necessarily entail having to reproduce every single result, because sometimes, for many different reasons that we can go on and discuss a little bit,
Starting point is 00:04:44 it's not possible. But essentially the authors identify the main claims, the key results that should be reproduced, and they submit this to the artifact evaluation process. And then there is an intermediate phase which we call the kick-the-tires phase. And this is essentially a warm-up phase where the reviewers can check the artifact for basic functionality. So is this properly documented? Is there a README? Does this compile? Can I run a very simple demonstration case? And this allows us to flag early issues
Starting point is 00:05:19 and interact with the authors, if necessary, to ask them: this is not running, please fix it before we go further. And then the final phase is the evaluation itself, where the reviewers are essentially going to attempt to verify, to assess the artifacts for completeness, for documentation, the building process, and the ability to reproduce the main claims of the paper. Usually artifacts are reviewed by three or four evaluators, and then this goes as in a usual conference, right? There is a discussion phase, and then, based on the discussion and the reviewers' feedback and comments, we will award the relevant badges. And I will leave the floor to Anjo to explain those badges. Yes, so the entire goal of this process is to eventually award badges. There are various standards for badges
Starting point is 00:06:23 because EuroSys is an ACM conference, so we follow their policies and guidelines. And early on, the systems community sort of agreed on three badges for now. That's the artifacts available badge, the artifacts functional badge, and then there is the results reproduced badge. And they sort of stack in
Starting point is 00:06:43 sort of complexity of evaluation after each badge. The available badge is basically that there is a DOI, so a permanent pointer for the artifact, and that it's actually publicly released. It just helps make it easy to find, I guess. But it does not say much about the functionality. That's what the functional badge is about.
Starting point is 00:07:09 So there we're checking for, or the evaluators are checking for, the completeness, whether it's well documented, and whether it's actually working, right? So it's the first badge for which the artifact is actually executed. And then the most complex badge to receive is the reproduced badge, where basically the authors have to describe what the main claims of the paper are,
Starting point is 00:07:34 and the evaluators then have to perform the same or similar experiments to identify whether the results in the paper are actually able to be reproduced. It might be on a different system. It might be with support from the authors on the same system. It depends on how the evaluation goes. Yeah. And based on the evaluation, the respective badges would be awarded.
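To make the stacking of the three badges concrete, here is a minimal illustrative sketch in Python. This is our own illustration, not the chairs' tooling and not ACM's official wording; the function name and boolean inputs are hypothetical, and the badge labels are shorthand for the ACM badge names.

```python
# Hypothetical sketch of the stacked badge levels described above.
# Each level presupposes the one before it: an artifact must be publicly
# available before it can be judged functional, and functional before
# its main results can be reproduced.

def badges_awarded(publicly_available: bool,
                   documented_and_working: bool,
                   main_claims_reproduced: bool) -> list[str]:
    """Return the badges an artifact would earn, in stacking order."""
    badges = []
    if publicly_available:
        badges.append("Artifacts Available")
        if documented_and_working:
            badges.append("Artifacts Functional")
            if main_claims_reproduced:
                badges.append("Results Reproduced")
    return badges

# An artifact that is public and runs, but whose key results could not
# be reproduced by the evaluators, earns the first two badges only.
print(badges_awarded(True, True, False))
```

The nesting mirrors the "stacking complexity" Anjo describes: a reproduced-but-unavailable artifact is not a state the process recognizes.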
Starting point is 00:08:04 And then they're typically displayed on the conference site, in the ACM Digital Library, on various author websites, and in the paper, to demonstrate that this paper has been awarded those badges, right? So it's a really nice byproduct of the paper as well. Cool. So that really sets the background well for what artifact evaluation is and all the badges, and we've learned about all the various phases
Starting point is 00:08:31 and how we go and earn those badges. So as I said at the very top of the show, you guys have been working on this for around five years, between 2021 and 2025. So tell us about your insights then. I'll come to you first on this, Miguel. Give us an overview of the trends you've seen over the last five years. Sure.
Starting point is 00:08:55 So this essentially involves all of the co-chairs of these past five editions. So it was a lot of work. And the way we approached this was to be as systematic as possible, given the data we have available, to support our analysis and the main conclusions that we have in the paper, and that all of us will be happy to discuss here. So we collected the data from multiple sources over these past five editions. This includes the official conference proceedings, to get which badges were awarded, and the sysartifacts.github.io website, which is the official site for systems artifact evaluation,
Starting point is 00:09:40 not only for EuroSys, but also for other systems conferences. So it is probably interesting to link to this site in the show notes as well for the interested listeners. And then we also developed some internal tooling to crunch, collect, and scrape all of this data. And we did things like check on GitHub what the usage metrics are: the stars, how many forks and pull requests, and those kinds of things. GitHub has been used for source code, but typically we also strive to have a permanent place, a digital repository, for this.
Starting point is 00:10:26 Zenodo is a good example of such a repository. And we also use these data sources to understand how these artifacts are being used and shared with the community, how many downloads we have over the years, and so on. So the data set is not tons and tons of data, but we already have some reasonable numbers that allow us to support our conclusions and to identify, and this is one of the things
Starting point is 00:10:58 that I find interesting about this work that we did together, the trends and the recurring patterns. So to give you just a very brief overview of the numbers we are talking about: roughly 60% of the papers that have been accepted at the conference
Starting point is 00:11:17 participated in the artifact evaluation process. So this is a number we want to increase, but let's discuss this later. And of these papers that applied, again roughly 60% of all the papers that have been accepted,
Starting point is 00:11:34 we have awarded 161 artifacts available badges. So this means that the code, data sets, and so on are available for the community to try. Out of these 161, 136 were deemed functional, so they got the functional badge, as Anjo explained before. And 75 artifacts were awarded the results reproduced badge, which means that we, or rather the reviewers, were able to reproduce the claims made in the paper. So this is lots of work. It's very challenging, because we have to deal with many different things. And maybe Thaleia, you want to talk a little bit about that?
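As a rough picture of the kind of tallying the internal tooling Miguel mentions might do, here is a hypothetical Python sketch. The record format and the records themselves are made-up examples for illustration, not the real EuroSys data (the actual totals he cites are 161 available, 136 functional, 75 reproduced).

```python
# Hypothetical sketch of counting awarded badges across a set of artifact
# records, in the spirit of the internal tooling described above.
# The records below are invented examples, not the actual EuroSys data.
from collections import Counter

artifacts = [
    {"year": 2024, "badges": {"available", "functional", "reproduced"}},
    {"year": 2024, "badges": {"available", "functional"}},
    {"year": 2025, "badges": {"available"}},
]

# Flatten every badge from every record and count occurrences per badge type.
counts = Counter(badge for record in artifacts for badge in record["badges"])
print(counts["available"], counts["functional"], counts["reproduced"])  # 3 2 1
```

The same pattern extends naturally to per-edition breakdowns (group by `year` first) or to metrics scraped from GitHub or Zenodo.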
Starting point is 00:12:18 Yeah, so it's interesting to see that we have had increasing interest in people participating in the artifact evaluation committee. So every year there is a call for participation, and we encourage self-nominations, primarily from PhD students, junior and senior ones, to participate in the committee of evaluators. And over the years, we've seen increasing interest, especially for 2025, this year, although, you know, the preparation started last year.
Starting point is 00:13:03 We had double the size of the committee. So the committee consisted of 98 members, which is great to see given the increasing interest, but it's also challenging to handle this large number of committee members. But regardless, it is a testament to the increased interest from researchers wanting to participate in the evaluation. We had reviewers from institutions across the world, Europe, the USA, Asia, even though the conference is, you know, the European conference on computer systems. It's a testament to how much of a leading conference it is in systems research. So this shows that the number of evaluators is large,
Starting point is 00:13:57 and the interest in participating as evaluators has increased. Yeah, and then we had some interesting facts. Anjo, do you want to go ahead and talk about those? Yeah, so I was mainly interested in finding out more about the artifacts themselves and how, and if, they're being used. So for that, we sort of looked at all the submitted artifact URLs, which initially were GitHub-only repositories, and later on GitHub plus DOIs, typically from Zenodo or Figshare. Just to find out: are those artifacts, for example, still available?
Starting point is 00:14:40 And we basically found out that nearly all of them, in one form or another, are still available. So even five years down the road, people are not deleting their repositories or removing their entries. We found one removed entry, but it's a single instance out of 161, so it's relatively minor, I would say. And then in terms of
Starting point is 00:15:08 usage, we looked at GitHub stars and forks; Zenodo has downloads and, I think, views as the other category. And it really varies, right? You have artifacts that are barely accessed, but then some of them
Starting point is 00:15:24 are downloaded hundreds of times, have hundreds of forks and really quite a bit of traffic, and are still maintained. So they still receive updates, also something that we have been looking at. So it's quite interesting to see that kind of data and how research artifacts continue to be used. Of course, the older the artifact, the more likely it is that there are actually more views, more forks, more stars on GitHub. So generally, I think
Starting point is 00:15:59 it's sort of a good study of the impact of these artifacts. Yeah, definitely. I'm going to say then that when people come around to determining the most impactful paper, a test-of-time sort of award for a given paper, for me this would be a good,
Starting point is 00:16:14 useful input into that, right? It's still being used 10 years later, right? And it still works. So that's quite a good indicator that it's had some impact, right? So I don't think that we will have an impact on the test-of-time paper award, because that's typically based on research value, not necessarily the artifact. But something that at least among us we have been discussing is whether we need a test-of-time award for an artifact in five years' time, right? It might even incentivize people to keep maintaining it as well, right? But yeah, cool. I guess given that, let's talk about challenges, and maybe some proposals for the future.
Starting point is 00:16:35 That's a nice segue into the next section of the conversation. So, Thaleia, I'll let you take the lead on this section. Yeah, it was interesting that most of the chairs, we had this discussion about what was the most challenging part of our service. And there were things that were coming up every year, the same challenges again and again.
Starting point is 00:17:13 And the primary one is that we have a very, very tight timeline to work with. And this is for many reasons. So essentially, the whole process of submitting and evaluating those artifacts happens between the paper acceptance, when the decisions are out, and the camera-ready deadline, because the final version of the artifact and the artifact appendix need to be part of the final version of the paper. So everything needs to be done between those two dates. And this is essentially like two, three weeks of time. So it's very, very tight. And it leaves very little room for, you know, things to
Starting point is 00:17:49 be done properly. But regardless, this is particularly the case because EuroSys as a conference recently switched to a dual submission deadline. So essentially, during one year there are two submission deadlines, one in the fall and one in the spring. So for EuroSys 2025, we had a spring deadline in 2024 and a fall deadline, again in 2024, for the papers to appear in 2025. So essentially, this is the timeline. It's hard to extend it because of all the other deadlines, you know, the submission deadline, the program committee reviewing the papers.
Starting point is 00:18:36 So we have to work with that tight timeline. So what we propose, because this is a consistent challenge across the years, is that the authors essentially start preparing their artifact early on, even at submission time. Essentially, we would like authors to have their artifact ready when the decisions are out, whether those are negative or positive. Even in the case of a negative decision, like a rejection, this process of preparing the artifact is very useful, because the paper will be submitted to another venue and the artifact will be ready regardless.
Starting point is 00:19:16 So short term, we really want to motivate the authors to have their artifacts ready, because there is no time to properly prepare an artifact during this very tight timeline. There is only time to communicate with the reviewers if things are not working properly and there are small fixes that need to be done; in one week you cannot properly prepare an artifact. And then long term, there is discussion among the chairs, and this is something that we want to discuss further with the steering committee and the community, about making the process mandatory, so that every author knows: if I submit a paper, I should also submit an artifact
Starting point is 00:19:58 to reproduce the claims. And this is something that would be extremely beneficial. And of course, it's a very complicated decision. So this is something that we propose to the community to discuss. Yes. And another challenge that we identified, picking up this issue of communication that Thaleia raises, relates also to the communication between the committee of the main conference and the artifact evaluation committee. So as we said before, these are separate processes, and the committees are built independently. On the main conference side there are usually very senior researchers. On the artifact evaluation side there are junior researchers; sometimes, I would say, a big chunk of those reviewers are PhD students, and
Starting point is 00:21:00 therefore there are all sorts of differences in experience and expectations about what the artifact evaluation should be about. And this is, in fact, a challenge that we identified over the past few editions in a quite consistent manner: there are several mismatched expectations about what results should be reproduced. For instance, the authors might think that certain results are the key results, the PC members of the main conference might think there are some differences there, and then the reviewers of the artifact evaluation can also have a different understanding of this. And this is something we found out. And then this creates all sorts of friction that we also want to address over the coming years.
Starting point is 00:21:55 So in the short term, what we want to propose is to introduce an informal communication channel between those two committees, which, again, have different experience, different seniorities, and different timelines and periods in which they work: to have the reviewers of the main conference, for instance, flag very clearly which claims should be reproduced and at which level of detail, and then this information could be passed down to the reviewers of the artifact evaluation. So this is something that we believe should be achievable in the short term without creating too much overhead on either side of the process. Over the long term, our goal is to kind of formalize this connection and require authors early on, at submission time say, to declare which
Starting point is 00:22:55 claims are the main claims and which claims they plan to support experimentally with the artifact. And this would allow the reviewers of the main conference to validate that list. And then, of course, this is always an iterative process, and maybe they propose adjustments, so that the authors can react accordingly when preparing their artifact. And the goal here is essentially for us to have a good match between the expectations of both committees, because we believe this is good for the authors, who would know beforehand what they should target in terms of reproducibility of the results, and also for the reviewers on the artifact evaluation, because they have a clear list of things to be
Starting point is 00:23:46 checked against. Let me take the next point. So based on the numbers that Miguel mentioned earlier, only half of the submissions actually have their results reproduced.
Starting point is 00:24:02 One of the major drivers of that is actually the use of specialized hardware in our community. So you may need a special server with certain capabilities, but it can be much, much worse, right? You may need an Android phone. It needs to be physically present.
Starting point is 00:24:19 And you can imagine all sorts of craziness going on in systems, just because we are building new things. And I think it gets even worse if hardware is involved, because then sometimes export control issues also arise, depending on which country the students are coming from. So all of this makes it very hard to scale reproducibility. And we've seen this over and over again. I think the current solution is sort of a short-term fix,
Starting point is 00:24:56 where we try to ask authors what the requirements are, and whether they can share the hardware through SSH. But it's very limiting to then assign only the subset of reviewers that may have the same hardware available. So it's a bit problematic. In the long term, we need to find, I think,
Starting point is 00:25:20 better systems to share and be able to reproduce those results, which is certainly not easy. Absolutely. And moving on to the next challenge that we identified is the fact that the
Starting point is 00:25:39 artifact evaluation chairs and the reviewers change every year. And this is not the case, for example, for the technical program committee: you know, in the program committee the chairs change every year, but reviewers repeat; especially senior reviewers typically repeat in the committee for many years. But this is not the case for artifact evaluation. The chairs change every year, and also the reviewers, usually PhD students, may participate in those committees one or two times, but then they don't. So we have to, you know, redo a lot of things and relearn a lot of knowledge. So what we propose is to kind of mimic the program committee and establish some sort of steering committee. That could consist of people that have served as artifact evaluation chairs in the past.
Starting point is 00:26:40 This could also potentially be done across systems conferences, to help transfer knowledge, because essentially it's the same process across the different systems conferences. Another proposal would be to prolong the service duration, for example having chairs serve for more than one year, but of course that is complicated and puts more load on them. But essentially, we need some sort of solution to be able to transfer knowledge, to maintain best practices, and to keep working on the challenges that we mentioned so far, so that we actually get solutions. So we need some sort of steering committee, or some group of people, to transfer the knowledge across years.
Starting point is 00:27:29 And again, with the committee, it is important that the chairs create evaluation committees with both junior and more senior members: people that have expertise in reviewing artifacts, while also allowing, of course, younger researchers to enter the service. Exactly. So another thing that we also identified, and it was honestly quite a surprise to, I guess, all of us, is that even if we have the artifacts on the DOI-backed platforms, which are designed to be long-term, ensuring that the artifacts remain there over the long term is not trivial. And in fact, we found a few cases where some
Starting point is 00:28:25 artifacts were removed after they got the badges, which is, of course, something we don't want to happen, right, because the badges should mean something very, very precise. So, and this is more a technical aspect, but over the short term, what we plan to propose is to require that all artifacts are indeed stored on these DOI-backed platforms such as Zenodo or Figshare, but also to restrict deletions without previous approval from either the steering committee that Thaleia mentioned, or the committee that is active during that year, or the original committee from when the award was given. So the details are not very clear to us yet, but
Starting point is 00:29:24 this is definitely something we want to make sure of, that artifacts remain available. And this gives credence and credibility to the badges, right? And of course, over the long term, we also want to refine this. When authors submit the paper, they would also declare explicitly, and this is related to a point we already discussed earlier, which results are reproducible when the paper is submitted, and then this would get into the camera-ready as well. So this would allow us better quality control,
Starting point is 00:30:07 and making sure that if there is a badge, it's meaningful when the badge was awarded, which is the case nowadays, but that it is also meaningful over the long term. The last two points that Thaleia and Miguel raised, I think, make the point that I'm going to raise even worse, right? So we have the short-term stewardship of AE chairs and even evaluators, and then some of the long-term guarantees around the artifacts are not quite ideal yet.
Starting point is 00:30:46 But in combination with that, for the badge definitions themselves, the language is a bit imprecise. So it needs quite a bit of guidance, especially for new evaluators, and especially for the reproducibility badge or also the functional badge, to understand
Starting point is 00:31:07 what they are supposed to check. And ideally, you want evaluations of two artifacts to be equally harsh on them, right? Because otherwise one paper may receive a badge
Starting point is 00:31:22 and the other doesn't, but was not properly vetted. So neither the authors of the paper and artifact nor the evaluators have a positive experience in that case. Yes, exactly. And one thing that we can also discuss is the fact that all of these imprecise definitions hurt, but on the bright side, I think we can say that artifact evaluation has become kind of standard practice in systems research. So on the sysartifacts site that we mentioned before, besides EuroSys, we also have, for instance, SOSP and OSDI, which also participate in this process, with the differences that are relevant for each community, of course.
Starting point is 00:32:16 But what we also identified and discussed in our internal meetings, when thinking about this and building these results, is that this practice still varies widely across different computer science subfields. So some domains prioritize availability over reproducibility, and there are certain merits to that. Others, like machine learning and HPC, have separate reproducibility challenges, so it's kind of a different way to do this. What we have tried to do over the past few years is to develop, in EuroSys, kind of a blueprint to allow the systems-oriented community
Starting point is 00:33:12 to improve this artifact evaluation process, because all of us believe that it is very relevant and critical for science in general, and also for industry as well. And so my feeling, and I think my colleagues share this feeling as well, is that this is a step on the way, and there is still lots of work to do, as we discussed earlier. And our end goal, I would say, is to make this more inclusive, more consistent, and more aligned with the realities of modern research requirements, which bring different challenges. And I think we will discuss this, but when we go deeper and
Starting point is 00:33:59 think about this, there are many challenges that are still present nowadays, to which we want to make our small contribution. Yeah, definitely. From what you guys are saying there, I mean, there are a lot of challenges outstanding still, but the direction of travel is going in the right place. Things are going in the right direction. Even from afar, you can see that things are moving towards a better future. And a lot of the proposals you mentioned there are really appealing as well, and you can see how they would help. And things like standardisation across different subfields of CS definitely would help as well. So yeah, I think for the next section of the podcast, let's do some reflecting, a broader reflection on your
Starting point is 00:34:41 time working with artifact evaluation. So let's focus on what your advice would be for future authors, reviewers, chairs, or people involved with artifact evaluation in the community. So Thaleia, yeah, you kick us off. For authors, maybe, what advice have you got for them? Absolutely. I think for authors, my advice is to start preparing the artifact early, ideally before you even know if the paper is accepted or not, because this is a process that will be useful regardless of the outcome. And think of your artifact as part of the contribution, not just an afterthought. So write clean code and ideally open source it, after the acceptance, of course, if needed.
Starting point is 00:35:27 Try to write clear documentation, automate scripts that will help the evaluators run seamlessly the code. A well-structured appendix really, really helps and make sure that your work is reusable and reproducible. And really this effort put into preparing the artifact will pay off tremendously after that. Yeah, definitely. I echo that sentiment. Miguel, how about reviewers? What advice have you got for them? Yeah, so even if the artifact is very well-prepared, reviewing is a lot of work. But I think it's very valuable work. So my advice would be for reviewers willing to participate in this process in the future editions
Starting point is 00:36:12 to approach these with a collaborative mindset. So the goal is not just to put a check on the CV that I did these. And checking boxes, this plot is exactly the same in the paper. This is not what we aim to do, right, but also be helpful to outforce and help them improve the quality and the impact of the research, because today you are a reviewer, you will learn in the process, and next year, hopefully you will be an author going under this process. So this is something we also try to encourage reviewers to think about, be constructive, to communicate clearly, and sometimes remembering that small fixes, small suggestions in the artifact
Starting point is 00:36:58 can make a big difference towards our goal of having better reproducibility. yeah definitely mindset is very important and Joe yeah for future chairs what would you say to them I mean get in touch with us that's one thing I think having continuity is sort of
Starting point is 00:37:21 growing right the knowledge that that is instilled in this process is important similar to how people have learned how to run over hundreds of years, right? I think generally, over the past few years, we have invested quite a bit of time in templates
Starting point is 00:37:42 for the artifact appendix, in guides for authors, for reviewers, chat lists, just to sort of help make this process as easy and as similar as possible. But of course, those documents aren't perfect, right? We are trying to improve them over time and I hope future chairs will sort of pick up on that work and continue to evolve the process, but also the documentation to, yeah, make it make it even easier for the evaluating community at the end.
Starting point is 00:38:16 Yeah, definitely keep iterating right towards a better future. Cool. So yeah, speaking of the future then, that's a nice segue into the next question I've got. And that's right. We get our crystal ball, aren't we'll want to look five years in the future. What does artifacts evaluation look like, failure? Yeah, it's interesting because already five years have passed. So I'm very curious to see in the next five years what will happen. But I really believe that the artifact evaluation process will be more standardized and more of the authors will be incentivized to participate.
Starting point is 00:38:50 Because right now, to be honest, the percentage of participation is quite low. Only half of the accepted papers, around 60% of the accepted papers, participate in the process of artifact evaluation, and even less, even half of them actually get the reproducible badge. So I would really, really like this number to increase, and I really hope it will do. And it's more of like a mentality change. It's more of a mentality where we write code, we document it well, we make it runable, we make it reproducible, we ideally open source the code and the community can build upon and extend that work. And this is very, very important in systems research because we, as program committee members, we always ask the reviewers to compare their system with all the other prior systems that existed
Starting point is 00:39:43 and show that it's best. So we need code to do that. We need open source code and runable code to properly evaluate the new systems that we propose. So I hope that we will get to that point in the next five years. Yes, so this is my wish as well and something I would really like to see happening to go from this roughly 60% to a big, big number. So of course, there are always very valid and legitimate reasons that some papers or some systems cannot go undergo this process. So there are industry-related restrictions that is this amazing paper. and the authors want to do a startup with it.
Starting point is 00:40:30 So this is a perfectly valid reason as well. But I believe that increasing substantially the number of papers that go through this process would benefit not only the authors themselves today, because they will have better code, better documented and so on, that they can themselves build upon a better foundation, but also picking the point that Thalia was mentioning. So this makes it very, or not very easy, but easier for other people to reuse the work and build upon it.
Starting point is 00:41:07 And one of the pains that, well, I love working in systems, right? That's why I'm here in part. But one of the pains, and I think this is shared with every one that I mention is how hard sometimes is to make someone else's system run in a consistent and fair way. And I think this is a big challenge in our community. And having most of the artifacts available, I think would be very, very available. To be clear, to go through this process and get the badges that they merit, would very much help this effort and would benefit everyone, in my opinion.
Starting point is 00:41:52 So from my side, I think I'm a bit more pessimistic. So I think both of your points are very wed at them. I hope that it happens in this way. But I think especially in the last two years, Sirius has seen an explosion of 50, 60% of more submissions and acceptance has gone up in a similar range for papers. So this leads sort of to me worrying about how we skate. Talia was already mentioning
Starting point is 00:42:21 that we have an evaluation committee of around a hundred evaluators — a huge size — and every one of them is spending, to reproduce an artifact, probably tens of hours —
Starting point is 00:42:35 20, 30 hours of their time to evaluate that artifact, and sometimes maybe even more. So I think what I would like to see in five years is
Starting point is 00:42:47 a way of sort of automating and scaling the artifact evaluation process, to be able to cope with EuroSys accepting 200 papers and us evaluating 200 papers, because I think the current system is not completely set up for that, and it leads to quite a bit of work on both the chair and the evaluator side to sort of steer that process. So ideally we find better ways to do it. Yeah, fantastic. Well, I think that brings our podcast to an end. We should probably reconvene in five years
Starting point is 00:43:23 and do another state of play and see where things are at and any of the predictions have come true. But yeah, thank you very much joining me today, folks. It's been a really insightful chat and I'm sure the listener will have absolutely love the conversation as well.
Starting point is 00:43:38 I'll drop links to everything in the show notes so you can go and check everything out. And yeah, thanks again, guys. I hope you enjoyed it. Thank you. Thanks. It was very fun. Bye-bye.
