Disseminate: The Computer Science Research Podcast - Konstantinos Kallas | Practically Correct, Just-in-Time Shell Script Parallelization | #20
Episode Date: January 30, 2023. Summary: Recent shell-script parallelization systems enjoy mostly automated speedups by parallelizing scripts ahead-of-time. Unfortunately, such static parallelization is hampered by dynamic behavior pervasive in shell scripts—e.g., variable expansion and command substitution—which often requires reasoning about the current state of the shell and filesystem. Tune in to hear how Konstantinos Kallas and his colleagues overcame this issue (and others) with PaSh-JIT, a just-in-time (JIT) shell-script compiler! Links: OSDI paper | Personal website | Twitter | LinkedIn | PaSh homepage (you can find all associated papers here). Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Wardby.
Today we have another installment from OSDI 2022, and I'm delighted to say I'm joined by Konstantinos Kallas, who will be talking about his paper, Practically Correct, Just-in-Time Shell Script Parallelization.
Konstantinos is a PhD student at the University of Pennsylvania, and he is on the Technical Steering Committee of PaSh, which we'll be learning more about today.
And his research interests lie in the intersection of programming languages, distributed systems and software verification.
Konstantinos, welcome to the show.
Hi, Jack. I'm glad to be here.
Great to have you. So let's dive straight in. So can you tell us a little bit more about yourself and maybe how you became interested
in researching at the intersection of these three areas?
Right.
So, yeah, I guess my background, or at least my undergraduate, I worked a lot on compilers.
And I particularly worked on a compiler for the language Erlang.
My advisor at the time, Kostis Sagonas, was one of the developers of the ahead-of-time compiler for Erlang,
and I worked on a just-in-time compiler for it.
And yeah, Erlang really instilled in me
some interest and love for distributed systems
because of how nice and clean it makes developing such systems.
Yeah, and then during my PhD, I slightly moved towards verification
and more theoretical programming languages.
But I'm now coming back home to the system side.
And I think that's where I probably lie, like in the system side,
but with programming language perspective and principles guiding what I'm doing.
That's really nice.
Yeah, you can kind of come full circle almost there.
Yeah.
Yeah.
That's ace.
So the paper today is all about the shell-script compiler PaSh-JIT, so maybe you can give us an elevator pitch for this work.
And also, obviously, this is part of a larger body of work as well, so maybe you can
give us an intro to the whole project too.
That's right, that's right. Yeah, I would probably give you a little bit of the bigger picture, because this paper is only part of this whole PaSh parallelizing compiler for the shell.
So I guess the pitch is that shell scripts are ubiquitous, I guess.
And yeah, something that I did find very interesting recently: in the GitHub yearly survey, the shell is the eighth
most popular language across GitHub. So, I mean, yeah, that just shows — and it has been consistently
in the top 10 for many years now, I guess since its inception. So it's
extremely ubiquitous, but there is no actual...
Most scripts are sequential by default.
You get this pipeline parallelism
by writing pipelines, but that's it.
And essentially what we do
is data parallelism for the shell.
But this was hard to do before,
and this is because of three main challenges,
we think at least, of the shell —
three of its characteristics that make it very hard to develop
principled solutions on top of it.
So these three issues are, first of all: shell programs and shell scripts compose arbitrary black-box commands
like grep, sort — arbitrary commands.
And all of these are written in different languages.
Essentially, what you have in your system is often just the binary of the command.
And that means it's extremely hard to do any form of static analysis on these commands
and then lift the result of this analysis to your whole script.
So that's one very, very hard problem that makes the shell extremely hard to deal with and develop tools for. Another thing is that its semantics is extremely complex. The
POSIX specification — the POSIX shell specification — is, I think, many, many
hundreds of pages of English that is also underspecified in many places and so on. So in the end, what we end up with is
more than 20,
I think — yeah, I think that's
a low estimate — shell
implementations. A lower bound? Yeah, exactly.
Yeah, yeah, yeah.
That's right. Yeah, so we end up with
many, yeah, a lot
of shell implementations, and
all of them diverge in
different parts of the spec. And so if one wants to develop some new research and they develop a new
shell to carry their research — like a parallelizing shell, for example — if we develop the parallelizing shell,
that would diverge from all other shells too, making it extremely hard for someone to actually use
it, because then they would have to break their scripts to adopt our research.
And the final thing is that the shell is extremely dynamic
as a language.
So you have these behaviors like a simple string
can contain a command substitution.
So the result, the value of a string could depend
on the files in your directory —
in some specific, indeterminate directory in your system.
So there is no way you can actually predict or analyze statically a shell script and be
able to have any form of precision with this analysis.
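[To make the dynamic-behavior point concrete, here is a small hand-written example — not from the paper — where the value of a string is only known at run time because of a command substitution:]

```sh
#!/bin/bash
# Illustrative only: $inputs depends on whatever files happen to exist in
# $dir when the script runs, so a purely static analysis cannot know which
# files the loop below will read.
dir="./logs"                     # hypothetical directory
inputs="$(ls "$dir"/*.txt)"      # command substitution, resolved only at run time
for f in $inputs; do
  grep -c "ERROR" "$f"           # count matching lines in each discovered file
done
```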
So these three characteristics of the shell make it extremely hard in our opinion to develop
any kind of solution on it.
And that's what we have essentially covered with our research until now.
So with respect to the commands,
we came up with a command specification framework
that we used to decouple essentially the analysis process
and allow for the development of tools
that know something about the behaviors of the commands
through these specifications.
And then it's the job of the command developer or some expert to write the specification for
the command and make sure that it's correct with respect to the command implementation.
And so this solves the first issue, or at least makes a dent towards solving it.
Then with respect to the semantics, we built our whole infrastructure as a shell-to-shell
transformation.
So essentially, we do not implement the shell,
we just produce a parallel shell script,
which we then run with the original interpreter,
allowing us to get extremely high compliance with an existing interpreter.
Now, our workhorse is Bash,
but yeah, translating the original script
to a new, better version of itself
and running it on the original interpreter allows us to get very high compliance.
And finally, for the dynamic aspect, we developed the Just-In-Time compilation infrastructure.
Essentially, this is not like traditional Just-In-Time compilers,
but it's more like a very tight orchestration framework that integrates with the shell. It
essentially reads a little bit of the input script and then decides at runtime whether it should give
that script to the shell to execute, or rather transform it into a better version of itself and
then give it to the shell. So this very tight integration and incremental, let's say, execution
of the script allows us to get compilation at the right time,
when all of the dynamic things have been resolved.
So when we see a string that contains an ls inside —
a command substitution with an ls —
we can actually ask the shell, the interpreter:
please expand this, using the just-in-time infrastructure,
and then say, okay, can we now do something
about this whole region of the script that we have expanded
and we know what it's supposed to do? I guess that's a very long elevator pitch.
No, that's great, that's great. So obviously we're going to
focus on the JIT stuff today, but is there a good reference to
go and check out the prior work on this? What would you direct the listener to?
I'm guessing you've got plenty of papers on this. So what would be the best place to start looking?
I think for the commands
and the shell-compilation infrastructure,
the foundation of it,
our EuroSys 2021 paper
that's called PaSh: Light-touch Data-Parallel Shell Processing —
shell parallelization,
shell compilation, or something like that —
should be, yeah, a good starting point.
Great stuff. We'll link that in the
show notes as well, so the interested
listener can go and follow up and look at that.
So let's talk more about
PaSh-JIT then.
You sort of outlined sort of like
four main sort of areas, four main
features to the
compiler.
Let's go through them one at a time, and we'll dig into each one.
You can tell us what it is, how it works,
and kind of what problem it's solving.
So to start, one of the sections is about dynamic interposition.
So what is this? Why do we need this? And what's the problem it's solving?
That's right. Okay. So the dynamic interposition is
essentially a solution
that deals with the dynamic
issues and it's this
integration part that I told you
about in the beginning. So essentially
how we achieve that is that
we have a preprocessor
of the script that
goes and
instruments the script at multiple places with calls to our
just-in-time compilation engine, let's say. And then at runtime, when the actual script runs,
at various places we enter the JIT engine loop.
That's essentially the integration part that I talked
about in the beginning.
Okay, cool. So when it goes through and instruments the script, what is it looking
for? What areas make it go, oh yeah, we can instrument this, so that when
it comes through at runtime, we can parallelize this area? What are the sort of hooks you hook
into there?
That's a good question. So we find areas that we call in
the paper data-flow regions. And these are areas that look like
they could benefit from parallelization. And intuitively,
what they are is compositions of commands with pipelines and the
background operator, the ampersand. So these are regions of the script that are supposed
to execute multiple commands concurrently, and we essentially
replace these regions with calls to this just-in-time engine. Do the annotations always work, or is there
a chance where you get some sort of false positive with the annotation or is it always going to work?
I guess maybe we can touch on that later on when we do the evaluation maybe or we talk about that.
But is there something where it's not 100% correct all the time?
Yeah, I mean, it might not be correct. So the preprocessing is essentially just an optimistic approach.
Okay, I see.
Let's replace all of these things with a call to the just-in-time engine. And then at runtime, what happens is that we actually go and check
whether we know enough things about this region
and whether it is actually parallelizable.
And sometimes it's likely that it's not the case.
If this region contains, for example, an RM command,
deleting some file, we don't want to touch it.
So what we do is fall back at runtime
to the original region.
So essentially, the preprocessing
just needs to be optimistic:
it over-approximates
the places where we could get parallelization
and just makes sure that it covers as many of them as possible.
And then at runtime,
we make a sound decision
of whether we can actually parallelize the region.
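[A rough sketch of what a data-flow region looks like, and the shape of the instrumented call the preprocessor might leave behind — the function name here is made up for illustration, not PaSh's actual API:]

```sh
# Original data-flow region: commands composed with pipes.
cat access.log | grep "GET" | sort | uniq -c > summary.txt

# Hypothetical instrumented form: the region's text is handed to a runtime
# routine that asks the JIT compiler whether it can be parallelized, and
# otherwise just runs the original fragment unchanged.
pash_runtime_maybe_parallelize \
  'cat access.log | grep "GET" | sort | uniq -c > summary.txt'
```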
Okay, cool, nice.
So let's talk a little bit more about the JIT engine then.
So obviously this is,
I guess, one of the main components
of this whole thing. So let's go into that.
Can you tell me how that sort of looks
and what the structure looks like and how it works?
Right, so I guess the JIT
engine at its core is
first of all it's a shell script.
At its core, what it does is that it hides from the perspective of the user and the shell
the fact that we are now entering a compilation mode.
So what it does is it saves the state of the shell at that point, and that includes all
of the configuration of the user — the ways the user could have configured the shell, like using the -e flag, which exits the shell on an error,
or the -v flag, which gives you verbose output. So it saves all that, to be able to revert it
later, and then sets the shell state to an internal PaSh state that is very hidden, in a way.
It doesn't expose any information to the shell of the user.
And then asks the compiler whether that particular region
that was there at the time can be paralyzed or not.
And the compiler responds something, a yes or no.
And then the JIT engine restores the state of the shell, the original one,
and then executes either the optimized parallel or the original fragment.
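[A minimal sketch of the save/restore idea in plain Bash — not PaSh's actual code, just the standard idiom for stashing and restoring shell options:]

```sh
# Remember the user's current option settings (e.g. set -e, set -v)...
saved_opts="$(set +o)"   # prints `set ±o ...` commands that recreate the current options

set +e +v                # ...switch to an internal mode: don't exit on error, no verbose echo

# ...here the engine would ask the compiler whether the current region can be
# parallelized, and then run either the optimized or the original fragment...

eval "$saved_opts"       # ...and restore the user's original option settings afterwards
```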
Okay, nice. So in the case where we can actually parallelize this,
the parallelizing compilation server kicks in here, right?
So at this point, how does this work then?
So once it works out, yeah, this can be actually optimized.
How does it go about doing this?
Right, so that's the part of the compiler.
So what the compiler does,
it has a whole pipeline of things that it does.
And the first thing that it does is it asks the interpreter
to expand all of the strings in that region.
All of these things that could contain arbitrary commands,
substitutions, whatever.
This could be even a whole,
there could be even a whole command,
a whole program inside the strings.
And it does that by simply asking the interpreter —
okay, well, I'm actually lying.
So the original implementation was asking the interpreter to do it for it,
but that ended up being very slow.
So what we ended up doing is re-implementing our own expansion on top of it and covering a big percentage of all the cases,
the common and safe expansion cases,
and doing that in our internal Python library.
But yeah, you can think of it,
this is the equivalent to just asking the interpreter
to do it for us, the expansion,
because essentially at that point,
we have all the information we need.
We can reflect on the file system,
execute a command, do whatever we want.
So after expanding the strings, we have a fully expanded fragment of the script, which
could be a pipeline.
Let's say it's a pipeline.
And then what PaSh tries to do is, for each command in this pipeline, it tries to find the command
specification and tries to see, first of all, whether we have it — this command specification — because we might not.
And if it has it, it tries to transform this command into a dataflow-node representation.
And the commands that we can do that for are the commands that are what we call pure in the paper, and that means that they only interact with
their environment by reading from a well-defined set of input files and writing to a well-defined
set of output files. So for a command where we know the inputs and
outputs, we can turn it into a dataflow node with well-defined inputs and outputs. And then it does this for every command in the script. And if all of them succeed,
we have all of these commands that can be composed in a big dataflow graph.
At any point, by the way, until here, if something fails, if some specifications are not found,
if something goes wrong, the compiler simply gives up,
falls back, and then the original fragment is executed. So this whole process
does not need to be complete, in a way. It works only if we have the specifications and so on.
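[As an illustration of what a command specification has to convey — this is a made-up sketch, not PaSh's actual annotation format — something like the following would be enough for the compiler to treat grep as a pure, line-by-line command with known inputs and outputs:]

```sh
# Hypothetical specification, written here as a JSON file for illustration only.
cat > grep.spec.json <<'EOF'
{
  "command": "grep",
  "inputs":  ["operand files, or stdin if none are given"],
  "outputs": ["stdout"],
  "class":   "stateless"
}
EOF
```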
So once we have the dataflow graph — this is a very clean abstraction — we have developed provably correct transformations
on top of this dataflow graph, which we then apply to transform it into a parallel data
flow graph.
Essentially, the transformations that we do are all divide and conquer, simply divide
and conquer parallelism. So if a command processes its different input lines independently,
then we can simply parallelize this command and give a subset of the lines
to each of the parallel instances and then concatenate the results.
So that's the main idea of how we get parallelization in the data flow graph.
And then we simply have a backwards pass that takes the flow graph and transforms it back to a parallel shell script
by instantiating all of these nodes in the parallel data flow graph
back to the commands.
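[The divide-and-conquer transformation, written out by hand for a single stateless command — an illustration of the idea, not PaSh's generated code:]

```sh
# Split the input on line boundaries, run the command on each piece in
# parallel, then combine the partial results.
split -n l/2 input.txt part_        # GNU split: two pieces, lines kept intact
grep "ERROR" part_aa > out_aa &     # each parallel instance gets its own chunk
grep "ERROR" part_ab > out_ab &
wait
cat out_aa out_ab > out.txt         # for grep, concatenating in order gives the same output
```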
That's really interesting.
I mean, in the general case,
do you find that,
I mean, obviously, again,
we'll probably touch on this more
when we're talking about the evaluation.
How likely is it that
I'm not just going to end up
with one big, long dependency chain
of you've got to do X, Y, and Z.
There's no actual opportunity
for parallelism there.
There's like a dependency chain
that you just,
you can't move things around.
Is it quite unlikely that it happens? Or is it pretty common that you can actually do things smarter?
Right, right. So I think one thing — so we do find and discover dependencies too, but I cannot talk
about this yet; that's later in the paper. So the parallelism I'm talking about here does not concern dependencies.
Even if I have a single command, let's say you have a sort, right?
Somewhere in your script, you call a sort and it writes to the file
and the next command in the sequence reads from that file.
So these are dependent clearly.
But even this sort on its own can be translated to a dataflow graph.
Okay, so within one command. Okay, right.
Yeah — split it up into multiple sorts that then merge their output with sort -m.
So essentially for many commands, we have this kind of aggregation that we can do after that and parallelization.
So even at the single-command level. And essentially, if you have the sort followed in a pipeline by another command, then we
would parallelize the sort and then parallelize the next one, and also make sure that the
combinations between all of these parallel instances are efficient,
in a way.
And so we get both pipeline and data parallelism.
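[For example, a single sort can be parallelized by sorting pieces independently and then merging them — a hand-written sketch of the idea, not PaSh's generated code:]

```sh
split -n l/2 input.txt part_              # split the input on line boundaries
sort part_aa > sorted_aa &                # sort each piece in parallel
sort part_ab > sorted_ab &
wait
sort -m sorted_aa sorted_ab > sorted.txt  # sort -m merges already-sorted inputs
```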
Yeah.
Fascinating.
Yeah.
Sorry, I was getting ahead of myself there.
Cool.
This is a common misunderstanding about this work, I think.
Everyone in the beginning assumes
that the main thing we do is figure out
these dependencies, but we actually do more,
and the original work does this
data parallelism.
So I guess, where does the commutativity awareness fit into this then?
Are we talking here within an operation?
Or is this again the level up?
It's within a command.
That's right.
So before, in our first paper, the PaSh paper,
we came up with transformations that do not require commutativity.
And these are pretty good.
They get you really far.
But for some commands you can actually do better with commutativity.
And these are commands that either process each line of their inputs independently, like
grep.
For example, with grep, you search for a pattern in each of the lines independently.
So as long as you make sure to reorder the lines
after the parallel greps correctly,
you get the same result.
So we call these commands stateless in the paper,
because they don't require any state across lines.
So with stateless commands, and commutative commands like sort —
because for sort, with all of its inputs,
it doesn't matter the order in which the lines arrive,
the end result will be the same —
we can actually do better
than just the transformations we had in the original PaSh.
And essentially, doing better means
doing a round-robin passing of lines,
which allows us to have better, finer-grained parallelism.
Because in the original PaSh, we just split the original input file into big chunks and then gave each chunk to its parallel instance.
But that requires running through the file once first. Right.
So you pay the cost of running once through the file, and then you can split it.
But with round-robin, you can pass 10 lines to each parallel instance
in a round-robin fashion and essentially stream them
towards the downstream parallel instances.
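[The transcript describes a round-robin distribution of lines implemented inside PaSh; as a rough stand-in, GNU split's round-robin mode illustrates the idea of dealing lines out without a full first pass over the file:]

```sh
split -n r/2 input.txt part_        # r/2: deal lines round-robin into part_aa and part_ab
grep "ERROR" part_aa > out_aa &     # each parallel instance consumes its own stream
grep "ERROR" part_ab > out_ab &
wait
# For a commutative consumer (e.g. feeding into sort or wc) the partial outputs can
# simply be combined; if output order matters, the lines must be interleaved back
# into their original order.
cat out_aa out_ab > out.txt
```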
So, I mean, I guess kind of how did you go about implementing this then?
So was it pretty straightforward?
So you said it's obviously a shell script itself.
Was it a big implementation effort or was it,
yeah, I guess talk us through how that went.
So the implementation —
all of the compilation stuff is in Python,
just for simplicity, I guess.
And a lot of the glue is in the shell,
because we need to have this tight integration
with the shell to do a lot of things.
We cannot avoid essentially being in the shell
for a big part of the time.
Yeah, I guess with the effort,
there's a lot of code in the compiler part,
and there's not a lot of code in the shell part of the implementation,
but coming up with this code was a very, very long process.
Because, yeah, I think we'll talk about this in the evaluation, I guess, if we go through it.
But yeah, we made sure that our tool passes the whole POSIX shell test suite, which is more than 29,000 lines of code.
And so, yeah, this took me about a year of engineering and figuring out why is this test
failing. All of the tests were big, first of all. And yeah, it was a very, very, very long process
to actually achieve this compliance but
Yeah, yeah, cool. I just wanted to ask you a question about the commutativity
awareness in it. Like, how did you decide which things commute or not? Was it kind of obvious, or
was it like a trial-and-error sort of process, or did you sit down and sort of prove that,
sorry, these two things definitely commute?
So we did not prove anything, but we did a thorough, if you can say that,
analysis of a lot of commands that we found out there and just added the
commutativity field in their specification.
That's what it took to annotate them, essentially. And yeah, we went through a bunch of commands and did our best to decide if they are commutative or not.
And yeah, I mean, in our evaluation,
we do a lot to check for correctness
and we run on very big inputs, a lot of scripts,
and all of them give equivalent outputs to bash.
But of course, that's not the proof.
So that's actually what we hope to do next.
We have some next steps with respect to checking the specification correctness.
Experimentally proven then, shall we say.
Practically correct.
Practically correct. There we go.
All right, yeah.
That's cool.
Great. So let's talk about the evaluation then.
So it sounds like it was a very, very extensive evaluation,
but let's take it back a step and talk about what questions you were trying to answer
and what that sort of experimental setup looked like
and what benchmarks you ran in addition to the POSIX test suite.
Right. So I guess we have two main questions that we want to answer. The first is correctness,
and I will explain maybe later what exactly that means — actually, we can talk about
correctness completely first and then go to performance.
But the two questions are correctness and performance, and performance has sub-questions.
But with respect to correctness, what we wanted to show and evaluate, I guess, is whether
pass on top of dash, it's very hard to differentiate the two with the sound only, but Bash with a P, R2. Yeah, yeah, yeah.
Whether it
has the same exact behavior with Bash,
the shell interpreter, in as many POSIX shell tests as possible.
So we don't want to bash all the tests,
but we rather want to bash and fail exactly the ones that Bash does too.
Because even Bash doesn't bash all of them. tests but we rather want to pass and fail exactly the ones that bash does too because because even
bash doesn't pass all of them and okay so yeah out of the 400 tests um yeah we i think there are 400
something all the tests that were applicable in our case uh bash versus bash they only diverge in two tests and and the divergence is actually not even
it's actually distinguishable and it's only so the differences that both fail
but the error code of bash is one and the error code of pass is 127. okay
There's a very — yeah, it's a very weird reason why this is happening, but it has to do with the way we are calling Bash.
We have to call it in a particular way, using the -c flag to set its name.
That's a very, very, very low level detail.
But essentially, this is what Bash gives us.
This is the error code it gives us, 127.
And therefore, it diverges with the original way of invoking Bash.
And we haven't found any way to actually resolve this.
We have also contacted the Bash developers,
and they do not intend to change that at any point.
So we're kind of stuck with this divergence.
Yeah. But yeah, practically, it's virtually
indistinguishable from Bash.
Yeah, and I guess we also ran about 80 scripts — the ones that we used in our performance evaluation too.
We ran them on multiple gigabytes of input, and all of the output is the same as with Bash.
So that's also another boost to our confidence that PaSh is correct.
I'm just looking through the paper now at the range of
where these scripts came from. So there's a wide, wide range of different sort of domains they come
from. Was there any reason why you settled on these specific benchmark sets? Or was it a case
of these are the ones that kind of everyone uses or what was the rationale behind it?
Yeah, that's a good question.
So there's not a lot of research on the shell, especially on the performance of the shell, before our work.
So in the original paper, we found a lot of pipelines.
And that's what we used for our
original paper, because we mostly dealt with pipelines at the time.
And many of those come from GitHub and other real sources.
And there are pipeline scripts that we found essentially in the wild that do some non-negligible
data processing.
I guess that was the main way of choosing them, because if they don't do any data processing, then we wouldn't be able to parallelize anything,
and it wouldn't make sense. And I guess the most interesting out of those, I think, of the pipeline
ones, is the COVID-MTS benchmark set. There, a professor from a Greek university, Diomidis Spinellis, actually wrote these scripts
during the pandemic
to help a journalist
measure whether buses in Athens
actually had increased frequencies — whether the frequencies of the buses were increased —
because that's what the Greek government said.
So that professor wrote his scripts and actually downloaded a
bunch of telemetry data from buses and processed it with his scripts to see if the
frequencies of the buses in Athens were increased, which was what the
government claimed. Yeah, so the government was claiming that, oh, we had more buses.
Because the reason for having more buses was so that people
are less crowded, I guess?
Ah, okay, okay. Yeah. And what was the conclusion?
That was... yeah, yeah. But, yeah, this was — I mean, we were really, we were very happy when we
found it, because, I mean, we were asking around, do you have
some scripts that do data processing, you know, and this professor came in and we were like, wow,
this is amazing.
Yeah, that's a great find. So let's talk performance anyway. So I guess give us some numbers — how well does it perform? Right, so let's see.
First of all, we ran about 80 shell scripts, as I said.
And I guess the main conclusion from our performance benchmarks
is that the just-in-time version of PaSh
is always faster than the ahead-of-time version of PaSh,
and it's almost always faster than Bash, the sequential version.
And with respect to the ahead-of-time
original PaSh, it also extends benefits to many, many scripts that the original
could not handle. So actually, any complex script that contains complex control
flow — like if statements or for loops, or dynamic behavior —
without the JIT compiler, you cannot handle it, because you cannot statically
determine what will actually happen, how these variables will expand.
So I think some numbers are:
the average speedup of PaSh-JIT on all of those scripts is about 5.86.
That's on a 64-core machine that we used.
And PaSh ahead-of-time is about 2.9.
And that was only on the scripts that it could actually handle.
We didn't even measure the rest, where it doesn't even apply.
And the only few scripts where we are slower than the sequential one are ones that perform tiny computations.
So there was one script that just essentially keeps the first line of its input, which took 8 milliseconds for Bash to execute.
But for PaSh-JIT it took about one second, because of all the initialization that we need to do, and so on. So I
think the only scripts where we were slower than the sequential version are ones that do
negligible computation, and so we pay the constant overheads up front. And that's one of
the slowdowns. Yeah, sure. Yeah, because I was kind of going to ask, leading into the next question,
of what the limitations are and the situations where the performance is sub-optimal.
But what are the characteristics of the scripts where — I mean, other than it being a
really tiny computation — are there any other characteristics that would maybe mean that Bash
would be the right option to go with?
Yeah, right. I think that if you have any script
that has tiny computation, or very, very tight loops with not a lot of computation,
then you wouldn't get a lot of benefits by using
PaSh. I guess also if you do not have any specification for any command in your script,
then PaSh-JIT also cannot
do anything for it.
So yeah, I guess a main assumption behind all of this work is that you have some specifications
for a few commands in your script, or you use some standard commands.
We have written about, I think, more than 70, something in that range, 70, 80 specifications
for commands.
And we cover a big bunch of the standard ones, but if you
use some completely third-party custom command and you don't have the specification, PaSh is just going to, yeah...
Yeah, you're on your own there, right? Yeah, exactly, exactly. Continue, yeah. We really tried to reduce, I guess, the overhead of our JIT engine
and these in-and-out calls, essentially, to the JIT engine.
And we also have a small part in the evaluation of the paper where we look at that.
And the best we could get in the end is, I think,
about 50 milliseconds per entrance to the just-in-time engine. So that's
the lowest we can get, and in the end we were happy with this. But I think that's one place
where one could potentially seek an improvement, to make this tool actually
have no negative impact if you ever want to use it. Yeah.
There is a trade-off that would need to be made there to achieve very low
latency, which is that you would probably need to integrate
into the original interpreter, so modify Bash itself.
Because we wanted to avoid that and essentially have this loose coupling with an existing interpreter — we ended up having this shell-to-shell
compiler infrastructure — that's the lowest we could get with respect to latency.
Yeah, in that sort of setup there's always going to be some slowdown. You're never going
to truly get rid of it all, right? But yeah, so I mean, obviously in the experiments you
compared against Bash and obviously kind of the ahead-of-time version of PaSh, but like
you mentioned, there's not really much other work in this area. What is the nearest sort of, for lack
of a better word, competitor, other than Bash sort of thing? Is there any other sort of research?
So, recently I have been seeing more and more interesting
research in this topic. I guess
there are three
papers by three different groups that I've seen
out there that look really promising.
They're not
exactly
doing the same
thing as what we do, but they're in a similar space of shell
research with respect to performance. So one of them is
a paper that was actually published at USENIX ATC this summer.
It got the best paper award also. It's called Riker, R-I-K-E-R. And what they do is they essentially get automatic incremental execution
for a cell script.
And they do it by tracing.
So essentially what you do is you write a build script,
your build script, for example, using just a cell script,
not a makefile.
But then when you run it, it automatically
finds all the dependencies between
the different build commands.
And then the next time you run it,
it only runs what actually changed,
what actually needs to be rerun.
So it essentially creates the makefile
in a fine-grained way, automatically
by tracing the execution
of your script.
That sounds interesting. I mean, we're kind of diverging a little bit here, I guess,
but does that perform better for different types of computation
that you want to do?
Yeah, I guess Riker is mostly targeting build scripts, where you have a lot of dependencies between commands.
Yeah.
And your commands in build scripts usually don't process their input
like in data processing, where they read their input line by line;
they read the whole source file, for example, and then they do something with it.
Yeah.
So yeah, it's more targeted towards this.
Let's say, yeah, build.
I think I would call this input-output computations, right?
Where you have a dependency graph.
That's one work that I found that is extremely interesting.
I mean, their work is very dynamic in the sense that there is no concept of specifications, so they discover everything at runtime.
And so we're actually interested in investigating the synergy
of Riker with our tools.
So if we can actually discover some of the specifications
by tracing, that's the idea.
Yes, I guess my next question is,
I'm guessing this work is publicly available, right?
It's on GitHub, yes.
It's under binpash.
That's the organization name.
B-I-N-P-A-S-H.
Cool, cool. We can link that.
Also, have you found that it's
had quite a
substantial impact?
Do you find a lot of people using it and have you got feedback
from people using it?
So actually we have found
that when we published
the paper,
we saw that some people
tried to use it
in some of their build scripts though.
And yeah,
PaSh is not that great
with build scripts.
So someone tried to run it
on the GNU lib build scripts, which are like a few thousand
lines of code, and of course PaSh broke in the beginning. But then we fixed these
errors and we ran the script, and they didn't get benefits, because it was
highly dependent — it was more in this build-script kind of area. But yeah, we saw that people are forking the repository. We hope that
someone uses it. We are actually trying to also keep an eye out for people that could potentially
use it, and I mean, our hope is that if we keep putting resources into it,
if we keep persisting, someone might use it.
But it's one of our hopes to keep it alive
and make sure that we can help anyone
that wants to use it.
Yeah, that's great.
Yeah, I think with these sort of things,
you've got to just kind of stay in the game, right?
Be consistent with it and keep it out there,
keep spreading it, kind of publicize it,
and hopefully it kind of takes off.
How easy would it be then?
So for me to say tomorrow to switch it out for,
or to start using it in my day-to-day sort of workflow,
what's the usability aspect like of it?
So I think you could — I think just if you're using Bash,
because that's where we can plug in now.
I mean, we have not tried to run it with zsh, for example, or any other shell.
But if you're using Bash and you're running
your scripts with Bash, you can just
directly run your scripts
with PaSh, just by downloading and installing
PaSh, and it should just run.
That's the theory.
Yeah. I mean, if you do it and you find a bug, please
send it to us.
Yeah, I'll raise a GitHub issue if I find a bug.
Yeah, but that's how it should be. So you just download it and run your script with
PaSh, and then it will do this compilation and run it with Bash as the underlying interpreter. So yeah,
and there you would get equivalence with what Bash would give you, based on the compilation and so on.
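[A sketch of the intended workflow; the entry-point name and flag below are taken from the PaSh repository's documentation as best remembered and should be treated as assumptions — check the README for the exact invocation:]

```sh
# Instead of:   bash my_script.sh
# run the same, unmodified script through PaSh, which parallelizes what it can
# (here with a hypothetical parallelism width of 8) and delegates the rest to Bash:
pa.sh --width 8 my_script.sh
```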
Yeah, nice, nice. So I guess this next
set of questions are kind of my stock questions, in a way,
and I like to see how people's answers to them diverge. So the first one is: what's the
most interesting and unexpected lesson
that you have learned while working on PaSh, and maybe specifically on the JIT stuff?
Right. So yeah, I think the most interesting idea, at least, that
crystallized in my mind while working on this
is that if we don't have this loose coupling — as in having a shell-to-shell compiler —
and instead we implement a shell, there's no way for anyone to adopt this. So I think that's what really crystallized in my mind while working on this, that
I spent one year to achieve compliance with Bash, and
still we are delegating all of the execution to Bash. So I cannot even imagine how much
time it would take for someone to try to achieve compliance with an existing shell by implementing
it. That would probably be a multi-year effort. And so, yeah, the main idea — the most interesting lesson learned,
I guess, is that for research on the shell to be easily adoptable,
and not have a huge barrier for someone to do it, I think everyone should aim for some loose
coupling and some form of shell-to-shell compilation, or something in that area,
so that they can just plug in existing shells and run with them, instead of
trying to compete with them. And yeah, I think that's the biggest lesson.
Yeah, that's
definitely a really interesting lesson. So I guess a question that falls out of
that is: when was the last time someone tried to compete with an existing shell?
I mean, obviously there are loads, right —
like you said earlier on, there's well over 20 — so how frequently are new ones
cropping up where people are thinking, oh, I can do the whole thing better?
I think it's very
common that I still, even now, see on GitHub often, like, oh, you might like this repo, and it's a new shell. And they all do
their own little tricks, their own little interesting features — like some of them handle
arrays, some of them have objects, you know, each shell does something better — but it feels
extremely hard to migrate all of your existing shell scripts to your new shell. So that's, I think, a very, very big challenge.
It's impossible to migrate all of these legacy scripts that we have when your shell does not have compliance. So it's very easy to make a better shell, but then how do you switch?
There's no plan, I guess, or it takes many, many years. So in the case of zsh, for example,
now I think it's probably, if I'm not mistaken,
it became the default shell on Apple computers — zsh.
But that took, I think it has been in existence for,
I don't know, like 10 years now, maybe even more.
And it has been heavily engineered continuously, and,
you know, pushing for its adoption — and even now, it's not
even that everyone has switched from Bash to zsh, but okay, Apple computers run this. But that took
so much time and so much energy.
Yeah, yeah. It's funny — like, even with databases,
right, people constantly think, oh, I can do better, I'm going to build my own database, right? It's just like, everyone thinks they can do it better, right, in a long weekend, normally, as well — give me a long weekend, I can create anything, yeah.
That's like the xkcd comic, yeah, yeah — all the standards are wrong.
Yeah, basically. Yeah, yeah.
cool
I guess let's talk about the journey
and any war stories
you had along the way. So this is quite a sizable implementation,
and trying to get compliance right
was a bit of a battle,
it sounded like. So from the
initial idea
for PaSh to now, to the
OSDI publication, what were the things that you tried along the way that failed?
And yeah, give us the war stories and talk about that journey from start to finish.
Right. Yeah, so I guess this started in the summer of 2019.
That was when we first...
Pre-COVID, yeah.
Yeah, it was a very long time ago.
Yeah, yeah.
And actually, yeah, it was one of the collaborators in this work,
Nikos Vasilakis, who is now a professor at Brown.
He was then a postdoc at Penn or transitioning to MIT as a postdoc.
He called me and he told me, I've got this crazy idea.
Let's distribute shell scripts.
And I was like, it's a super dumb idea.
Like, it will never work.
Why are you calling me?
It was like late in the evening or something.
And I was like, this is completely ridiculous.
Like, it's never going to work.
And then, yeah, somehow he convinced me that we should try it
out at least. And in the beginning we were thinking that we
would distribute shell scripts — so that was the idea. But actually the first iteration of this,
the first PaSh paper, said it's a distribution engine for shells, and it got massacred at some
conference — I don't remember which one, in 2019 or 2020, I think. And I think it
got extremely terrible reviews, like, this paper is terrible. And then
we were like, okay, distribution is hard, let's just do parallelization for now, and that's how we
lowered our expectations. And then I think we submitted this paper at OSDI 2021 or 2020, I don't remember.
And there we had the most surreal experience, because there were six reviews and four of them were accepts or strong accepts.
So like the weak accepts, accepts, or strong accepts — the ones that say I will champion this paper.
And the other two reviews were
rejects — like, not weak rejects, rejects.
And so there was a very huge dissonance
from the reviewers' perspective.
And two of them said, this is complete garbage.
This is obvious.
This is garbage research.
And then the four others were, it's good.
And then it didn't get in.
We were like super furious.
Yeah. And eventually it got into EuroSys, where it also got the best paper award, which was actually
funny, given how it came out. But yeah, after that, this
one has been going very smoothly, I think — but after this first one and a half years of
real hard, you know, pushing for it and so on.
Yeah, 18 months of grind, but like, worth it in the end, I guess.
Yeah, yeah, yeah, yeah.
Cool, so I guess what's next then? So where do you take it from here then? So what else is in
the pipeline? Are you going to go down the distribution route now, distributed version or?
That's a good question. Yeah, so we actually have a conditionally accepted distributed PaSh paper at NSDI 2023.
Congratulations.
Yeah, yeah.
It's only conditionally accepted yet, so we have to wait for the confirmation.
But yeah, after four years, I guess, the distributed version will be published somewhere.
So that's that.
But then we have a ton of ideas, actually.
I think there is a lot of interesting synergy that could be investigated with dynamic approaches like Riker.
There is a lot of work to do on parallelization optimality
because currently we just showed that we can do parallelization in the shell but
we haven't thought at all about how to do it well — like, we just parallelize, yeah, okay, but how much?
There are a lot of parameters that can be tuned, heuristics that can be
improved. So these are some directions. I think for me the biggest direction — I think the biggest problem that exists at the moment
with the shell is correctness and maintainability. So you have all of these legacy
scripts out there, and the state of the art, I think — I mean, I don't think people even test
these scripts. It's more like, they run, so let's not mess with them, because they're like
these very complex things.
And I think that's where I see the most interesting and maybe more
impactful research on the shell happening.
It would be great to see frameworks and tools and affordances for developers that would
help them make scripts, check that their scripts are correct, or maybe hold their hand while
implementing the script, not letting them do any kind of big goof, because you can very easily in
the shell delete your home directory and then lose your computer or something. So yeah,
some hand-holding — security hand-holding, correctness hand-holding, and maybe even maintainability, in the
sense of porting old legacy scripts to saner languages or, let's say, saner coding practices
or conventions. So this is where I see the future of our research. And I think that we have done
all of these pieces now, I guess. And I guess, for those challenges that I
talked to you about in the beginning — the shell challenges and the shell characteristics —
I think this is a vehicle where one could very easily build other things on top of it. So the just-in-time framework, the shell-to-shell compilation pipeline,
and the command specifications — all those three combined could, I think,
very easily at least lower the effort needed to do some research on the shell
and solve some other challenges.
There are plenty of very interesting directions there — I'm sure
that'll keep you busy for a long time.
Yeah, yeah, hopefully.
And so, I mean, other than receiving calls late in the evening with crazy ideas, how else do you go about idea generation and selecting what things to work on?
What's your process there?
I think I'm not that good at it, to be honest.
I select the ideas.
I think usually people
that I collaborate with come up with
good ideas.
I feel that
the way I pick, I guess, what to work
on is: I want
to both know that
it's potentially highly
impactful, and understand
that the problem is actually relevant
and that there is a clear application
that people are actually doing at the moment.
So that's one big benefit of this
shell research, that I know that people are writing
shell scripts.
So I think I really like problems
where
the impact
is clear and practical —
clear and visible.
So I don't like working on some theoretical
thing that maybe in 10 years someone
will figure out how to apply;
I want the research to be actually applicable.
And I guess the other thing, which is more the PL — programming languages —
flavor to things, is that
I really try to find cute and elegant subsets
of the problems that can be described
in a cute and clean model,
while actually also proving that some things are correct
in this model.
So I guess, yeah, the thing that I like working on is this combination of potentially practically
impactful and cute, elegant with respect to the models and representations.
Yeah.
So I guess my next question — and maybe I
know the answer already, given what we spoke about — is: what do you think is the biggest challenge
with shell scripts and with shells at the moment? I mean, maybe something like
the fact that migrating from one to the other is pretty challenging — I guess that feels to
me like it'd be one of the big challenges — but what do you think is the sort
of biggest challenge out there at the moment?
Yeah, I think that's right.
Porting, maintaining, and checking the correctness of shell scripts — shell scripts are
lagging behind other languages with respect to what
affordances they have on all those topics, like, for decades; they're lagging behind by decades.
So I think that's what we would like to push,
and what I think is a very relevant research focus: to try to bring a lot of the knowledge
and the affordances that we have in other languages to the shell, and actually make shell scripts more testable,
maintainable, better — make it easier to reason about what they will do, what their correctness is, and so on.
Why do you think it is that it's kind of almost been just left, no one's kind of tackled that problem compared to other programming languages?
Why do you think it's been left to get into this state? Because everyone uses these things every day, right?
Like it feels strange that such a fundamental sort of thing has kind of been left in that sort of state. That's true.
And we were also
not sure about that.
And our hypothesis is that it's because of these challenges
that it has — its characteristics, which make it great,
are also its challenges, right? Because the fact that you can compose
black-box commands allows you to write very, very succinct, powerful scripts.
But it's also this problem that you don't know what these commands are.
So then how can you develop a principled
solution that deals with many of them and that's realizable?
Or the fact that it's very dynamic.
It's great because you can introspect
and reflect on the state of your file system while you're executing your script.
So a for loop, its iterations could depend on what files you have in a specific directory,
for example.
And that's very powerful to use as a developer.
But then also, if you are like a programming-language framework designer or a systems
researcher, you cannot write an analysis for the shell, because
it's so dynamic. It's just impossible. So we think that these characteristics that
make the shell great also make it very hard to actually tackle and do research on principled
solutions in this space. And this is our hypothesis, of course, and we're a little bit biased.
That's what we think was the biggest roadblock.
So, yeah, our hope is that we're making a small dent in all these problems.
And ideally, by building on top of this, we can bring ideas from other languages.
We can make the shell a little bit saner.
Yeah, that's interesting. And so I guess
it's time for the last word now. So what's the one key thing you want the listeners
to take away from your work on PaSh?
Yeah, I guess — research on the shell is possible again, we think. Do more of it. You should
do research on the shell.
A call to arms.
Yeah, exactly, a call to arms. There's also other work, not just by us — like this Riker
paper, for example, and there's an offloading shell called POSH by a group at
Berkeley and Stanford, I think, that came out a few years ago. Yeah, I think we should do research on the shell.
I think it's possible again, and people should consider it again as a relevant place to do research.
And I guess something more specific is: download PaSh. It's very easy, it's compliant, and yeah, you will not break your scripts — and you might get parallelized.
Yeah.
Brilliant. Well, I guess it's a good message to end things on. So yeah, thank you so much, Konstantinos. If the listeners are interested in knowing more about your work, then we'll put all the links in the show notes, and we will see you all next time for some more awesome computer science research.