Disseminate: The Computer Science Research Podcast - Konstantinos Kallas | Practically Correct, Just-in-Time Shell Script Parallelization | #20
Episode Date: January 30, 2023. Summary: Recent shell-script parallelization systems enjoy mostly automated speedups by parallelizing scripts ahead-of-time. Unfortunately, such static parallelization is hampered by dynamic behavior pervasive in shell scripts—e.g., variable expansion and command substitution—which often requires reasoning about the current state of the shell and filesystem. Tune in to hear how Konstantinos Kallas and his colleagues overcame this issue (and others) with PaSh-JIT, a just-in-time (JIT) shell-script compiler! Links: OSDI paper | Personal website | Twitter | LinkedIn | PaSh homepage (you can find all associated papers here). Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast. I'm your host, Jack Wardby.
Today we have another installment from OSDI 2022, and I'm delighted to say I'm joined by Konstantinos Kallas, who will be talking about his paper, Practically Correct, Just-in-Time Shell Script Parallelization.
Konstantinos is a PhD student at the University of Pennsylvania, and he is on the Technical Steering Committee of PaSh, which we'll be learning more about today.
And his research interests lie in the intersection of programming languages, distributed systems and software verification.
Konstantinos, welcome to the show.
Hi, Jack. I'm glad to be here.
Great to have you. So let's dive straight in. So can you tell us a little bit more about yourself and maybe how you became interested
in researching at the intersection of these three areas?
Right.
So, yeah, I guess my background, or at least my undergraduate, I worked a lot on compilers.
And I particularly worked on a compiler for the language Erlang.
My advisor at the time, Kostis Sagonas, was one of the developers of the ahead-of-time compiler for Erlang,
and I worked on a just-in-time compiler for it.
And yeah, Erlang really instilled in me
some interest and love for distributed systems
because of how nice and clean it makes developing such systems.
Yeah, and then during my PhD, I slightly moved towards verification
and more theoretical programming languages.
But I'm now coming back home to the system side.
And I think that's where I probably lie, like in the system side,
but with programming language perspective and principles guiding what I'm doing.
That's really nice.
Yeah, you can kind of come full circle almost there.
Yeah.
Yeah.
That's ace.
So the paper today is all about the shell-script compiler PaSh-JIT, so maybe you can give us an elevator pitch for this work.
And also, obviously, this is part of a larger body of work as well, so maybe you can
give us an intro to the whole project too.
That's right, that's right. Yeah, I would probably give you a little bit of the bigger picture, because this paper is only part of this whole PaSh parallelizing compiler for the shell.
So I guess the pitch is that shell scripts are ubiquitous, I guess.
And yeah, something that I did find very interesting recently: in the GitHub yearly survey, the shell is the eighth
most popular language across GitHub. So, I mean, yeah, that just shows — and it has been consistently
in the top 10 for many years now, I guess since its inception. So it's
extremely ubiquitous, but there is no actual...
Most scripts are sequential by default.
You get this pipeline parallelism
by writing pipelines, but that's it.
And essentially what we do
is data parallelism for the shell.
But this was hard to do before,
and this is because of three main challenges,
we think at least, of the shell —
three of its characteristics that make it very hard to develop
principled solutions on top of it.
So these three issues are, first of all: shell programs and shell scripts compose arbitrary black-box commands
like grep, sort — arbitrary commands.
And all of these are written in different languages.
Essentially, what you have in your system is often just the binary of the command.
And that means it's extremely hard to do any form of static analysis on these commands
and then lift the result of this analysis to your whole script.
So that's one very, very hard problem that makes the shell extremely hard to deal with and develop tools for. Another thing is that its semantics is extremely complex. The
POSIX specification — the POSIX shell specification — is, I think, many, many
hundreds of pages of English that is also underspecified in many places and so on. So in the end, what we end up with is
more than 20,
I think — yeah, I think that's
a low estimate — shell
implementations. A lower bound? Yeah, exactly.
Yeah, yeah, yeah.
That's right. Yeah, so we end up with
many, yeah, a lot
of shell implementations, and
all of them diverge in
different parts of the spec. And so if one wants to develop some new research and they develop a new
shell to carry their research — like a parallelizing shell, for example — if we develop the parallelizing shell,
that would diverge from all other shells too, making it extremely hard for someone to actually use
it, because then they would have to break their scripts to adopt our research.
And the final thing is that the shell is extremely dynamic
as a language.
So you have these behaviors like a simple string
can contain a command substitution.
So the result, the value of a string could depend
on the files in your directory —
in some specific, indeterminate directory in your system.
So there is no way you can actually predict or analyze statically a shell script and be
able to have any form of precision with this analysis.
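[To make the dynamic-behavior point concrete, here is a small hand-written example — not from the paper — where the value of a string is only known at run time because of a command substitution:]

```sh
#!/bin/bash
# Illustrative only: $inputs depends on whatever files happen to exist in
# $dir when the script runs, so a purely static analysis cannot know which
# files the loop below will read.
dir="./logs"                     # hypothetical directory
inputs="$(ls "$dir"/*.txt)"      # command substitution, resolved only at run time
for f in $inputs; do
  grep -c "ERROR" "$f"           # count matching lines in each discovered file
done
```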
So these three characteristics of the shell make it extremely hard in our opinion to develop
any kind of solution on it.
And that's what we have essentially covered with our research until now.
So with respect to the commands,
we came up with a command specification framework
that we used to decouple essentially the analysis process
and allow for the development of tools
that know something about the behaviors of the commands
through these specifications.
And then it's the job of the command developer or some expert to write the specification for
the command and make sure that it's correct with respect to the command implementation.
And so this solves the first issue, or at least makes a dent towards solving it.
Then with respect to the semantics, we built our whole infrastructure as a shell-to-shell
transformation.
So essentially, we do not implement the shell,
we just produce a parallel shell script,
which we then run with the original interpreter,
allowing us to get extremely high compliance with an existing interpreter.
Now, our workhorse is Bash,
but yeah, translating the original script
to a new, better version of itself
and running it on the original interpreter allows us to get very high compliance.
And finally, for the dynamic aspect, we developed the Just-In-Time compilation infrastructure.
Essentially, this is not like traditional Just-In-Time compilers,
but it's more like a very tight orchestration framework that integrates with the shell. It
essentially reads a little bit of the input script and then decides at runtime whether it should give
that script to the shell to execute, or rather transform it into a better version of itself and
then give it to the shell. So this very tight integration and incremental, let's say, execution
of the script allows us to get compilation at the right time,
when all of the dynamic things have been resolved.
So when we see a string that contains an ls inside —
a command substitution with an ls —
we can actually ask the shell, the interpreter:
please expand this, using the just-in-time infrastructure,
and then say, okay, can we now do something
about this whole region of the script that we have expanded
and we know what it's supposed to do? I guess that's a very long elevator pitch.
No, that's great, that's great. So obviously we're going to
focus on the JIT stuff today, but is there a good reference to
go and check out the prior work on this? What would you direct the listener to?
I'm guessing you've got plenty of papers on this. So what would be the best place to start looking?
I think for the commands
and the shell-compilation infrastructure,
the foundation of it,
our EuroSys 2021 paper
that's called PaSh: Light-touch Data-Parallel Shell Processing —
shell parallelization,
shell compilation, or something like that —
should be, yeah, a good starting point.
Great stuff. We'll link that in the
show notes as well, so the interested
listener can go and follow up and look at that.
So let's talk more about
PaSh-JIT then.
You sort of outlined sort of like
four main sort of areas, four main
features to the
compiler.
Let's go through them one at a time, and we'll dig into each one.
You can tell us what it is, how it works,
and kind of what problem it's solving.
So to start, one of the sections is about dynamic interposition.
So what is this? Why do we need this? And what's the problem it's solving?
That's right. Okay. So the dynamic interposition is
essentially a solution
that deals with the dynamic
issues and it's this
integration part that I told you
about in the beginning. So essentially
how we achieve that is that
we have a preprocessor
of the script that
goes and
instruments the script at multiple places with calls to our
just-in-time compilation engine, let's say. And then at runtime, when the actual script runs,
at various places we enter the JIT engine loop.
That's essentially the integration part that I talked
about in the beginning.
Okay, cool. So when it goes through and instruments the script, what is it looking
for? What areas make it go, oh yeah, we can instrument this, so that when
it comes through at runtime, we can parallelize this area? What are the sort of hooks you hook
into there?
That's a good question. So we find areas that we call in
the paper data-flow regions. And these are areas that look like
they could benefit from parallelization. And intuitively,
what they are is compositions of commands with pipelines and the
background operator, the ampersand. So these are regions of the script that are supposed
to execute multiple commands concurrently, and we essentially
replace these regions with calls to this just-in-time engine. Do the annotations always work, or is there
a chance where you get some sort of false positive with the annotation or is it always going to work?
I guess maybe we can touch on that later on when we do the evaluation maybe or we talk about that.
But is there something where it's not 100% correct all the time?
Yeah, I mean, it might not be correct. So the preprocessing is essentially just an optimistic approach.
Okay, I see.
Let's replace all of these things with a call to the just-in-time engine. And then at runtime, what happens is that we actually go and check
whether we know enough things about this region
and whether it is actually parallelizable.
And sometimes it's likely that it's not the case.
If this region contains, for example, an RM command,
deleting some file, we don't want to touch it.
So what we do is fall back at runtime
to the original region.
So essentially, the preprocessing
just needs to be optimistic:
it over-approximates
the places where we could get parallelization
and just makes sure that it covers as many of them as possible.
And then at runtime,
we make a sound decision
of whether we can actually parallelize the region.
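[A rough sketch of what a data-flow region looks like, and the shape of the instrumented call the preprocessor might leave behind — the function name here is made up for illustration, not PaSh's actual API:]

```sh
# Original data-flow region: commands composed with pipes.
cat access.log | grep "GET" | sort | uniq -c > summary.txt

# Hypothetical instrumented form: the region's text is handed to a runtime
# routine that asks the JIT compiler whether it can be parallelized, and
# otherwise just runs the original fragment unchanged.
pash_runtime_maybe_parallelize \
  'cat access.log | grep "GET" | sort | uniq -c > summary.txt'
```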
Okay, cool, nice.
So let's talk a little bit more about the JIT engine then.
So obviously this is,
I guess, one of the main components
of this whole thing. So let's go into that.
Can you tell me how that sort of looks
and what the structure looks like and how it works?
Right, so I guess the JIT
engine at its core is
first of all it's a shell script.
At its core, what it does is that it hides from the perspective of the user and the shell
the fact that we are now entering a compilation mode.
So what it does is it saves the state of the shell at that point, and that includes all
of the configuration of the user — the ways the user could have configured the shell, like using the -e flag, which exits the shell on an error,
or the -v flag, which gives you verbose output. So it saves all that, to be able to revert it
later, and then sets the shell state to an internal PaSh state that is very hidden, in a way.
It doesn't expose any information to the shell of the user.
And then asks the compiler whether that particular region
that was there at the time can be paralyzed or not.
And the compiler responds something, a yes or no.
And then the JIT engine restores the state of the shell, the original one,
and then executes either the optimized parallel or the original fragment.
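[A minimal sketch of the save/restore idea in plain Bash — not PaSh's actual code, just the standard idiom for stashing and restoring shell options:]

```sh
# Remember the user's current option settings (e.g. set -e, set -v)...
saved_opts="$(set +o)"   # prints `set ±o ...` commands that recreate the current options

set +e +v                # ...switch to an internal mode: don't exit on error, no verbose echo

# ...here the engine would ask the compiler whether the current region can be
# parallelized, and then run either the optimized or the original fragment...

eval "$saved_opts"       # ...and restore the user's original option settings afterwards
```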
Okay, nice. So in the case where we can actually parallelize this,
the parallelizing compilation server kicks in here, right?
So at this point, how does this work then?
So once it works out, yeah, this can be actually optimized.
How does it go about doing this?
Right, so that's the part of the compiler.
So what the compiler does,
it has a whole pipeline of things that it does.
And the first thing that it does is it asks the interpreter
to expand all of the strings in that region.
All of these things that could contain arbitrary commands,
substitutions, whatever.
This could be even a whole,
there could be even a whole command,
a whole program inside the strings.
And it does that by simply asking the interpreter —
okay, well, I'm actually lying.
So the original implementation was asking the interpreter to do it for it,
but that ended up being very slow.
So what we ended up doing is re-implementing our own expansion on top of it and covering a big percentage of all the cases,
the common and safe expansion cases,
and doing that in our internal Python library.
But yeah, you can think of it,
this is the equivalent to just asking the interpreter
to do it for us, the expansion,
because essentially at that point,
we have all the information we need.
We can reflect on the file system,
execute a command, do whatever we want.
So after expanding the strings, we have a fully expanded fragment of the script, which
could be a pipeline.
Let's say it's a pipeline.
And then what PaSh tries to do is, for each command in this pipeline, it tries to find the command
specification and tries to see, first of all, whether we have it — this command specification — because we might not.
And if it has it, it tries to transform this command into a dataflow-node representation.
And the commands that we can do that for are the commands that are what we call pure in the paper, and that means that they only interact with
their environment by reading from a well-defined set of input files and writing to a well-defined
set of output files. So for a command where we know the inputs and
outputs, we can turn it into a dataflow node with well-defined inputs and outputs. And then it does this for every command in the script. And if all of them succeed,
we have all of these commands that can be composed in a big dataflow graph.
At any point, by the way, until here, if something fails, if some specifications are not found,
if something goes wrong, the compiler simply gives up,
falls back, and then the original fragment is executed. So this whole process
does not need to be complete, in a way. It works only if we have the specifications and so on.
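[As an illustration of what a command specification has to convey — this is a made-up sketch, not PaSh's actual annotation format — something like the following would be enough for the compiler to treat grep as a pure, line-by-line command with known inputs and outputs:]

```sh
# Hypothetical specification, written here as a JSON file for illustration only.
cat > grep.spec.json <<'EOF'
{
  "command": "grep",
  "inputs":  ["operand files, or stdin if none are given"],
  "outputs": ["stdout"],
  "class":   "stateless"
}
EOF
```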
So once we have the dataflow graph — this is a very clean abstraction — we have developed provably correct transformations
on top of this dataflow graph, which we then apply to transform it into a parallel data
flow graph.
Essentially, the transformations that we do are all divide and conquer, simply divide
and conquer parallelism. So if a command processes its different input lines independently,
then we can simply parallelize this command and give a subset of the lines
to each of the parallel instances and then concatenate the results.
So that's the main idea of how we get parallelization in the data flow graph.
And then we simply have a backwards pass that takes the flow graph and transforms it back to a parallel shell script
by instantiating all of these nodes in the parallel data flow graph
back to the commands.
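[The divide-and-conquer transformation, written out by hand for a single stateless command — an illustration of the idea, not PaSh's generated code:]

```sh
# Split the input on line boundaries, run the command on each piece in
# parallel, then combine the partial results.
split -n l/2 input.txt part_        # GNU split: two pieces, lines kept intact
grep "ERROR" part_aa > out_aa &     # each parallel instance gets its own chunk
grep "ERROR" part_ab > out_ab &
wait
cat out_aa out_ab > out.txt         # for grep, concatenating in order gives the same output
```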
That's really interesting.
I mean, in the general case,
do you find that,
I mean, obviously, again,
we'll probably touch on this more
when we're talking about the evaluation.
How likely is it that
I'm not just going to end up
with one big, long dependency chain
of you've got to do X, Y, and Z.
There's no actual opportunity
for parallelism there.
There's like a dependency chain
that you just,
you can't move things around.
Is it quite unlikely that it happens? Or is it pretty common that you can actually do things smarter?
Right, right. So I think one thing — so we do find and discover dependencies too, but I cannot talk
about this yet; that's later in the paper. So the parallelism I'm talking about here does not concern dependencies.
Even if I have a single command, let's say you have a sort, right?
Somewhere in your script, you call a sort and it writes to the file
and the next command in the sequence reads from that file.
So these are dependent clearly.
But even this sort on its own can be translated to a dataflow graph.
Okay, so within one command. Okay, right.
Yeah — split it up into multiple sorts that then merge their output with sort -m.
So essentially for many commands, we have this kind of aggregation that we can do after that and parallelization.
So even at the single-command level. And essentially, if you have the sort followed in a pipeline by another command, then we
would parallelize the sort and then parallelize the next one, and also make sure that the
combinations between all of these parallel instances are efficient,
in a way.
And so we get both pipeline and data parallelism.
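[For example, a single sort can be parallelized by sorting pieces independently and then merging them — a hand-written sketch of the idea, not PaSh's generated code:]

```sh
split -n l/2 input.txt part_              # split the input on line boundaries
sort part_aa > sorted_aa &                # sort each piece in parallel
sort part_ab > sorted_ab &
wait
sort -m sorted_aa sorted_ab > sorted.txt  # sort -m merges already-sorted inputs
```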
Yeah.
Fascinating.
Yeah.
Sorry, I was getting ahead of myself there.
Cool.
This is a common misunderstanding about this work, I think.
Everyone in the beginning assumes
that the main thing we do is figure out
these dependencies, but we actually do more,
and the original work does this
data parallelism.
So I guess, where does the commutativity awareness fit into this then?
Are we talking here within an operation?
Or is this again the level up?
It's within a command.
That's right.
So before, in our first paper, the PaSh paper,
we came up with transformations that do not require commutativity.
And these are pretty good.
They get you really far.
But for some commands you can actually do better with commutativity.
And these are commands that either process each line of their inputs independently, like
grep.
For example, with grep, you search for a pattern in each of the lines independently.
So as long as you make sure to reorder the lines
after the parallel greps correctly,
you get the same result.
So we call these commands stateless in the paper,
because they don't require any state across lines.
So with stateless commands, and commutative commands like sort —
because for sort, with all of its inputs,
it doesn't matter the order in which the lines arrive,
the end result will be the same —
we can actually do better
than just the transformations we had in the original PaSh.
And essentially, doing better means
doing a round-robin passing of lines,
which allows us to have better, finer-grained parallelism.
Because in the original PaSh, we just split the original input file into big chunks and then gave each chunk to its parallel instance.
But that requires running through the file once first. Right.
So you pay the cost of running once through the file, and then you can split it.
But with round-robin, you can pass 10 lines to each parallel instance
in a round-robin fashion and essentially stream them
towards the downstream parallel instances.
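[The transcript describes a round-robin distribution of lines implemented inside PaSh; as a rough stand-in, GNU split's round-robin mode illustrates the idea of dealing lines out without a full first pass over the file:]

```sh
split -n r/2 input.txt part_        # r/2: deal lines round-robin into part_aa and part_ab
grep "ERROR" part_aa > out_aa &     # each parallel instance consumes its own stream
grep "ERROR" part_ab > out_ab &
wait
# For a commutative consumer (e.g. feeding into sort or wc) the partial outputs can
# simply be combined; if output order matters, the lines must be interleaved back
# into their original order.
cat out_aa out_ab > out.txt
```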
So, I mean, I guess kind of how did you go about implementing this then?
So was it pretty straightforward?
So you said it's obviously a shell script itself.
Was it a big implementation effort or was it,
yeah, I guess talk us through how that went.
So the implementation —
all of the compilation stuff is in Python,
just for simplicity, I guess.
And a lot of the glue is in the shell,
because we need to have this tight integration
with the shell to do a lot of things.
We cannot avoid essentially being in the shell
for a big part of the time.
Yeah, I guess with the effort,
there's a lot of code in the compiler part,
and there's not a lot of code in the shell part of the implementation,
but coming up with this code was a very, very long process.
Because, yeah, I think we'll talk about this in the evaluation, I guess, if we go through it.
But yeah, we made sure that our tool passes the whole POSIX shell test suite, which is more than 29,000 lines of code.
And so, yeah, this took me about a year of engineering and figuring out why is this test
failing. All of the tests were big, first of all. And yeah, it was a very, very, very long process
to actually achieve this compliance but
Yeah, yeah, cool. I just wanted to ask you a question about the commutativity
awareness in it. Like, how did you decide which things commute or not? Was it kind of obvious, or
was it like a trial-and-error sort of process, or did you sit down and sort of prove that,
sorry, these two things definitely commute?
So we did not prove anything, but we did a thorough, if you can say that,
analysis of a lot of commands that we found out there and just added the
commutativity field in their specification.
That's what it took to annotate them, essentially. And yeah, we went through a bunch of commands and did our best to decide if they are commutative or not.
And yeah, I mean, in our evaluation,
we do a lot to check for correctness
and we run on very big inputs, a lot of scripts,
and all of them give equivalent outputs to bash.
But of course, that's not the proof.
So that's actually what we hope to do next.
We have some next steps with respect to checking the specification correctness.
Experimentally proven then, shall we say.
Practically correct.
Practically correct. There we go.
All right, yeah.
That's cool.
Great. So let's talk about the evaluation then.
So it sounds like it was a very, very extensive evaluation,
but let's take it back a step and talk about what questions you were trying to answer
and what that sort of experimental setup looked like
and what benchmarks you ran in addition to the POSIX test suite.
Right. So I guess we have two main questions that we want to answer. The first is correctness,
and I will explain maybe later what exactly that means — actually, we can talk about
correctness completely first and then go to performance.
But the two questions are correctness and performance, and performance has sub-questions.
But with respect to correctness, what we wanted to show and evaluate, I guess, is whether
pass on top of dash, it's very hard to differentiate the two with the sound only, but Bash with a P, R2. Yeah, yeah, yeah.
Whether it
has the same exact behavior with Bash,
the shell interpreter, in as many POSIX shell tests as possible.
So we don't want to bash all the tests,
but we rather want to bash and fail exactly the ones that Bash does too.
Because even Bash doesn't bash all of them. tests but we rather want to pass and fail exactly the ones that bash does too because because even
bash doesn't pass all of them and okay so yeah out of the 400 tests um yeah we i think there are 400
something all the tests that were applicable in our case uh bash versus bash they only diverge in two tests and and the divergence is actually not even
it's actually distinguishable and it's only so the differences that both fail
but the error code of bash is one and the error code of pass is 127. okay
There's a very — yeah, it's a very weird reason why this is happening, but it has to do with the way we are calling Bash.
We have to call it in a particular way, using the -c flag to set its name.
That's a very, very, very low level detail.
But essentially, this is what Bash gives us.
This is the error code it gives us, 127.
And therefore, it diverges with the original way of invoking Bash.
And we haven't found any way to actually resolve this.
We have also contacted the Bash developers,
and they do not intend to change that at any point.
So we're kind of stuck with this divergence.
Yeah. But yeah, practically, it's virtually
indistinguishable from Bash.
Yeah, and I guess we also ran about 80 scripts — the ones that we used in our performance evaluation too.
We ran them on multiple gigabytes of input, and all of the output is the same as with Bash.
So that's also another boost to our confidence that PaSh is correct.
I'm just looking through the paper now at the range of
where these scripts came from. So there's a wide, wide range of different sort of domains they come
from. Was there any reason why you settled on these specific benchmark sets? Or was it a case
of these are the ones that kind of everyone uses or what was the rationale behind it?
Yeah, that's a good question.
So there's not a lot of research on the shell, especially on the performance of the shell, before our work.
So in the original paper, we found a lot of pipelines.
And that's what we used for our
original paper, because we mostly dealt with pipelines at the time.
And many of those come from GitHub and other real sources.
And there are pipeline scripts that we found essentially in the wild that do some non-negligible
data processing.
I guess that was the main way of choosing them, because if they don't do any data processing, then we wouldn't be able to parallelize anything,
and it wouldn't make sense. And I guess the most interesting out of those, I think, of the pipeline
ones, is the COVID-MTS benchmark set. There, a professor from a Greek university, Diomidis Spinellis, actually wrote these scripts
during the pandemic
to help a journalist
measure whether buses in Athens
actually had increased frequencies — whether the frequencies of the buses were increased —
because that's what the Greek government said.
So that professor wrote his scripts and actually downloaded a
bunch of telemetry data from buses and processed it with his scripts to see if the
frequencies of the buses in Athens were increased, which was what the
government claimed. Yeah, so the government was claiming that, oh, we had more buses.
Because the reason for having more buses was so that people
are less crowded, I guess?
Ah, okay, okay. Yeah. And what was the conclusion?
That was... yeah, yeah. But, yeah, this was — I mean, we were really, we were very happy when we
found it, because, I mean, we were asking around, do you have
some scripts that do data processing, you know, and this professor came in and we were like, wow,
this is amazing.
Yeah, that's a great find. So let's talk performance anyway. So I guess give us some numbers — how well does it perform? Right, so let's see.
First of all, we ran about 80 shell scripts, as I said.
And I guess the main conclusion from our performance benchmarks
is that the just-in-time version of PaSh
is always faster than the ahead-of-time version of PaSh,
and it's almost always faster than Bash, the sequential version.
And with respect to the ahead-of-time
original PaSh, it also extends benefits to many, many scripts that the original
could not handle. So actually, any complex script that contains complex control
flow — like if statements or for loops, or dynamic behavior —
without the JIT compiler, you cannot handle it, because you cannot statically
determine what will actually happen, how these variables will expand.
So I think some numbers are:
the average speedup of PaSh-JIT on all of those scripts is about 5.86.
That's on a 64-core machine that we used.
And PaSh ahead-of-time is about 2.9.
And that was only on the scripts that it could actually handle.
We didn't even measure the rest, where it doesn't even apply.
And the only few scripts where we are slower than the sequential one are ones that perform tiny computations.
So there was one script that just essentially keeps the first line of its input, which took 8 milliseconds for Bash to execute.
But for PaSh-JIT it took about one second, because of all the initialization that we need to do, and so on. So I
think the only scripts where we were slower than the sequential version are ones that do
negligible computation, and so we pay the constant overheads up front. And that's one of
the slowdowns. Yeah, sure. Yeah, because I was kind of going to ask, leading into the next question,
of what the limitations are and the situations where the performance is sub-optimal.
But what are the characteristics of the scripts where — I mean, other than it being a
really tiny computation — are there any other characteristics that would maybe mean that Bash
would be the right option to go with?
Yeah, right. I think that if you have any script
that has tiny computation, or very, very tight loops with not a lot of computation,
then you wouldn't get a lot of benefits by using
PaSh. I guess also if you do not have any specification for any command in your script,
then PaSh-JIT also cannot
do anything for it.
So yeah, I guess a main assumption behind all of this work is that you have some specifications
for a few commands in your script, or you use some standard commands.
We have written about, I think, more than 70, something in that range, 70, 80 specifications
for commands.
And we cover a big bunch of the standard ones, but if you
use some completely third-party custom command and you don't have the specification, PaSh is just going to, yeah...
Yeah, you're on your own there, right? Yeah, exactly, exactly. Continue, yeah. We really tried to reduce, I guess, the overhead of our JIT engine
and these in-and-out calls, essentially, to the JIT engine.
And we also have a small part in the evaluation of the paper where we look at that.
And the best we could get in the end is, I think,
about 50 milliseconds per entrance to the just-in-time engine. So that's
the lowest we can get, and in the end we were happy with this. But I think that's one place
where one could potentially seek an improvement, to make this tool actually
have no negative impact if you ever want to use it. Yeah.
There is a trade-off that would need to be made there to achieve very low
latency, which is that you would probably need to integrate
into the original interpreter, so modify Bash itself.
Because we wanted to avoid that and essentially have this loose coupling with an existing interpreter — we ended up having this shell-to-shell
compiler infrastructure — that's the lowest we could get with respect to latency.
Yeah, in that sort of setup there's always going to be some slowdown. You're never going
to truly get rid of it all, right? But yeah, so I mean, obviously in the experiments you
compared against Bash and obviously kind of the ahead-of-time version of PaSh, but like
you mentioned, there's not really much other work in this area. What is the nearest sort of, for lack
of a better word, competitor, other than Bash sort of thing? Is there any other sort of research?
So, recently I have been seeing more and more interesting
research in this topic. I guess
there are three
papers by three different groups that I've seen
out there that look really promising.
They're not
exactly
doing the same
thing as what we do, but they're in a similar space of shell
research with respect to performance. So one of them is
a paper that was actually published at USENIX ATC this summer.
It got the best paper award also. It's called Riker, R-I-K-E-R. And what they do is they essentially get automatic incremental execution
for a cell script.
And they do it by tracing.
So essentially what you do is you write a build script,
your build script, for example, using just a cell script,
not a makefile.
But then when you run it, it automatically
finds all the dependencies between
the different build commands.
And then the next time you run it,
it only runs what actually changed,
what actually needs to be rerun.
So it essentially creates the makefile
in a fine-grained way, automatically
by tracing the execution
of your script.
That sounds interesting. I mean, we're kind of diverging a little bit here, I guess,
but does that perform better for different types of computation
that you want to do?
Yeah, I guess Riker is mostly targeting build scripts, where you have a lot of dependencies between commands.
Yeah.
And your commands in build scripts usually don't process their input
like in data processing, where they read their input line by line;
they read the whole source file, for example, and then they do something with it.
Yeah.
So yeah, it's more targeted towards this.
Let's say, yeah, build.
I think I would call this input-output computations, right?
Where you have a dependency graph.
That's one work that I found that is extremely interesting.
I mean, their work is very dynamic in the sense that there is no concept of specifications, so they discover everything at runtime.
And so we're actually interested in investigating the synergy
of Riker with our tools.
So if we can actually discover some of the specifications
by tracing, that's the idea.
Yes, I guess my next question is,
I'm guessing this work is publicly available, right?
It's on GitHub, yes.
It's under binpash.
That's the organization name.
B-I-N-P-A-S-H.
Cool, cool. We can link that.
Also, have you found that it's
had quite a
substantial impact?
Do you find a lot of people using it and have you got feedback
from people using it?
So actually we have found
that when we published
the paper,
we saw that some people
tried to use it
in some of their build scripts though.
And yeah,
PaSh is not that great
with build scripts.
So someone tried to run it
on the GNU lib build scripts, which are like a few thousand
lines of code, and of course PaSh broke in the beginning. But then we fixed these
errors and we ran the script, and they didn't get benefits, because it was
highly dependent — it was more in this build-script kind of area. But yeah, we saw that people are forking the repository. We hope that
someone uses it. We are actually trying to also keep an eye out for people that could potentially
use it, and I mean, our hope is that if we keep putting resources into it,
if we keep persisting, someone might use it.
But it's one of our hopes to keep it alive
and make sure that we can help anyone
that wants to use it.
Yeah, that's great.
Yeah, I think with these sort of things,
you've got to just kind of stay in the game, right?
Be consistent with it and keep it out there,
keep spreading it, kind of publicize it,
and hopefully it kind of takes off.
How easy would it be then?
So for me to say tomorrow to switch it out for,
or to start using it in my day-to-day sort of workflow,
what's the usability aspect like of it?
So I think you could — I think just if you're using Bash,
because that's where we can plug in now.
I mean, we have not tried to run it with zsh, for example, or any other shell.
But if you're using Bash and you're running
your scripts with Bash, you can just
directly run your scripts
with PaSh, just by downloading and installing
PaSh, and it should just run.
That's the theory.
Yeah. I mean, if you do it and you find a bug, please
send it to us.
Yeah, I'll raise a GitHub issue if I find a bug.
Yeah, but that's how it should be. So you just download it and run your script with
PaSh, and then it will do this compilation and run it with Bash as the underlying interpreter. So yeah,
and there you would get equivalence with what Bash would give you, based on the compilation and so on.
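[A sketch of the intended workflow; the entry-point name and flag below are taken from the PaSh repository's documentation as best remembered and should be treated as assumptions — check the README for the exact invocation:]

```sh
# Instead of:   bash my_script.sh
# run the same, unmodified script through PaSh, which parallelizes what it can
# (here with a hypothetical parallelism width of 8) and delegates the rest to Bash:
pa.sh --width 8 my_script.sh
```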
Yeah, nice, nice. So I guess this next
set of questions are kind of my stock questions, in a way,
and I like to see how people's answers to them diverge. So the first one is: what's the
most interesting and unexpected lesson
that you have learned while working on PaSh, and maybe specifically on the JIT stuff?
Right. So yeah, I think the most interesting idea, at least, that
crystallized in my mind while working on this
is that if we don't have this loose coupling — as in having a shell-to-shell compiler —
and instead we implement a shell, there's no way for anyone to adopt this. So I think that's what really crystallized in my mind while working on this, that
I spent one year to achieve compliance with Bash, and
still we are delegating all of the execution to Bash. So I cannot even imagine how much
time it would take for someone to try to achieve compliance with an existing shell by implementing
it. That would probably be a multi-year effort. And so, yeah, the main idea — the most interesting lesson learned,
I guess, is that for research on the shell to be easily adoptable,
and not have a huge barrier for someone to do it, I think everyone should aim for some loose
coupling and some form of shell-to-shell compilation, or something in that area,
so that they can just plug in existing shells and run with them, instead of
trying to compete with them. And yeah, I think that's the biggest lesson.
Yeah, that's
definitely a really interesting lesson. So I guess a question that falls out of
that is: when was the last time someone tried to compete with an existing shell?
I mean, obviously there are loads, right —
like you said earlier on, there's well over 20 — so how frequently are new ones
cropping up where people are thinking, oh, I can do the whole thing better?
I think it's very
common that I still, even now, see on GitHub often, like, oh, you might like this repo, and it's a new shell. And they all do
their own little tricks, their own little interesting features — like some of them handle
arrays, some of them have objects, you know, each shell does something better — but it feels
extremely hard to migrate all of your existing shell scripts to your new shell. So that's, I think, a very, very big challenge.
It's impossible to migrate all of these legacy scripts that we have when your shell does not have compliance. So it's very easy to make a better shell, but then how do you switch?
There's no plan, I guess, or it takes many, many years. So in the case of zsh, for example,
now I think it's probably, if I'm not mistaken,
it became the default shell on Apple computers — zsh.
But that took, I think it has been in existence for,
I don't know, like 10 years now, maybe even more.
And it has been heavily engineered continuously, and,
you know, pushing for its adoption — and even now, it's not
even that everyone has switched from Bash to zsh, but okay, Apple computers run this. But that took
so much time and so much energy.
Yeah, yeah. It's funny — like, even with databases,
right, people constantly think, oh, I can do better, I'm going to build my own database, right? It's just like, everyone thinks they can do it better, right, in a long weekend, normally, as well — give me a long weekend, I can create anything, yeah.
That's like the xkcd comic, yeah, yeah — all the standards are wrong.
Yeah, basically. Yeah, yeah.
cool
I guess let's talk about the journey
and any war stories
you had along the way. So this is quite a sizable implementation,
and trying to get compliance right
was a bit of a battle,
it sounded like. So from the
initial idea
for PaSh to now, to the
OSDI publication, what were the things that you tried along the way that failed?
And yeah, give us the war stories and talk about that journey from start to finish.
Right. Yeah, so I guess this started in the summer of 2019.
That was when we first...
Pre-COVID, yeah.
Yeah, it was a very long time ago.
Yeah, yeah.
And actually, yeah, it was one of the collaborators in this work,
Nikos Vasilakis, who is now a professor at Brown.
He was then a postdoc at Penn or transitioning to MIT as a postdoc.
He called me and he told me, I've got this crazy idea.
Let's distribute shell scripts.
And I was like, it's a super dumb idea.
Like, it will never work.
Why are you calling me?
It was like late in the evening or something.
And I was like, this is completely ridiculous.
Like, it's never going to work.
And then, yeah, somehow he convinced me that we should try it
out at least. And in the beginning we were thinking that we
would distribute shell scripts — so that was the idea. But actually the first iteration of this,
the first PaSh paper, said it's a distribution engine for shells, and it got massacred at some
conference — I don't remember which one, in 2019 or 2020, I think. And I think it
got extremely terrible reviews, like, this paper is terrible. And then
we were like, okay, distribution is hard, let's just do parallelization for now, and that's how we
lowered our expectations. And then I think we submitted this paper at OSDI 2021 or 2020, I don't remember.
And there we had the most surreal experience, because there were six reviews and four of them were accepts or strong accepts.
So like the weak accepts, accepts, or strong accepts — the ones that say I will champion this paper.
And the other two reviews were
rejects — like, not weak rejects, rejects.
And so there was a very huge dissonance
from the reviewers' perspective.
And two of them said, this is complete garbage.
This is obvious.
This is garbage research.
And then the four others were, it's good.
And then it didn't get in.
We were like super furious.
Yeah. And eventually it got into EuroSys, where it also got the best paper award, which was actually
funny, given how it came out. But yeah, after that, this
one has been going very smoothly, I think — but after this first one and a half years of
real hard, you know, pushing for it and so on.
Yeah, 18 months of grind, but like, worth it in the end, I guess.
Yeah, yeah, yeah, yeah.
Cool, so I guess what's next then? So where do you take it from here then? So what else is in
the pipeline? Are you going to go down the distribution route now, distributed version or?
That's a good question. Yeah, so we actually have a conditionally accepted distributed PaSh paper at NSDI 2023.
Congratulations.
Yeah, yeah.
It's only conditionally accepted yet, so we have to wait for the confirmation.
But yeah, after four years, I guess, the distributed version will be published somewhere.
So that's that.
But then we have a ton of ideas, actually.
I think there is a lot of interesting synergy that could be investigated with dynamic approaches like Riker.
There is a lot of work to do on parallelization optimality
because currently we just showed that we can do parallelization in the shell but
we haven't thought at all about how to do it well — like, we just parallelize, yeah, okay, but how much?
There are a lot of parameters that can be tuned, heuristics that can be
improved. So these are some directions. I think for me the biggest direction — I think the biggest problem that exists at the moment
with the shell is correctness and maintainability. So you have all of these legacy
scripts out there, and the state of the art, I think — I mean, I don't think people even test
these scripts. It's more like, they run, so let's not mess with them, because they're like
these very complex things.
And I think that's where I see the most interesting and maybe more
impactful research on the shell happening.
It would be great to see frameworks and tools and affordances for developers that would
help them make scripts, check that their scripts are correct, or maybe hold their hand while
implementing the script, not letting them do any kind of big goof, because you can very easily in
the shell delete your home directory and then lose your computer or something. So yeah,
some hand-holding — security hand-holding, correctness hand-holding, and maybe even maintainability, in the
sense of porting old legacy scripts to saner languages or, let's say, saner coding practices
or conventions. So this is where I see the future of our research. And I think that we have done
all of these pieces now, I guess. And I guess, for those challenges that I
talked to you about in the beginning — the shell challenges and the shell characteristics —
I think this is a vehicle where one could very easily build other things on top of it. So the just-in-time framework, the shell-to-shell compilation pipeline,
and the command specifications — all those three combined could, I think,
very easily at least lower the effort needed to do some research on the shell
and solve some other challenges.
There are plenty of very interesting directions there — I'm sure
that'll keep you busy for a long time.
Yeah, yeah, hopefully.
And so, I mean, other than receiving calls late in the evening with crazy ideas, how else do you go about idea generation and selecting what things to work on?
What's your process there?
I think I'm not that good at it, to be honest.
I select the ideas.
I think usually people
that I collaborate with come up with
good ideas.
I feel that
the way I pick, I guess, what to work
on is: I want
to both know that
it's potentially highly
impactful, and understand
that the problem is actually relevant
and that there is a clear application
that people are actually doing at the moment.
So that's one big benefit of this
shell research, that I know that people are writing
shell scripts.
So I think I really like problems
where
the impact
is clear and practical —
clear and visible.
So I don't like working on some theoretical
thing that maybe in 10 years someone
will figure out how to apply;
I want the research to be actually applicable.
And I guess the other thing, which is more the PL — programming languages —
flavor to things, is that
I really try to find cute and elegant subsets
of the problems that can be described
in a cute and clean model,
while actually also proving that some things are correct
in this model.
So I guess, yeah, the thing that I like working on is this combination of potentially practically
impactful and cute, elegant with respect to the models and representations.
Yeah.
So I guess my next question — and maybe I
know the answer already, given what we spoke about — is: what do you think is the biggest challenge
with shell scripts and with shells at the moment? I mean, maybe something like
the fact that migrating from one to the other is pretty challenging — I guess that feels to
me like it'd be one of the big challenges — but what do you think is the sort
of biggest challenge out there at the moment?
Yeah, I think that's right.
Porting, maintaining, and checking the correctness of shell scripts — shell scripts are
lagging behind other languages with respect to what
affordances they have on all those topics, like, for decades; they're lagging behind by decades.
So I think that's what we would like to push,
and what I think is a very relevant research focus: to try to bring a lot of the knowledge
and the affordances that we have in other languages to the shell, and actually make shell scripts more testable,
maintainable, better — make it easier to reason about what they will do, what their correctness is, and so on.
Why do you think it is that it's kind of almost been just left, no one's kind of tackled that problem compared to other programming languages?
Why do you think it's been left to get into this state? Because everyone uses these things every day, right?
Like it feels strange that such a fundamental sort of thing has kind of been left in that sort of state. That's true.
And we were also
not sure about that.
And our hypothesis is that it's because of these challenges
that it has — its characteristics, which make it great,
are also its challenges, right? Because the fact that you can compose
black-box commands allows you to write very, very succinct, powerful scripts.
But it's also this problem that you don't know what these commands are.
So then how can you develop a principled
solution that deals with many of them and that's realizable?
Or the fact that it's very dynamic.
It's great because you can introspect
and reflect on the state of your file system while you're executing your script.
So a for loop, its iterations could depend on what files you have in a specific directory,
for example.
And that's very powerful to use as a developer.
But then also, if you are like a programming-language framework designer or a systems
researcher, you cannot write an analysis for the shell, because
it's so dynamic. It's just impossible. So we think that these characteristics that
make the shell great also make it very hard to actually tackle and do research on principled
solutions in this space. And this is our hypothesis, of course, and we're a little bit biased.
That's what we think was the biggest roadblock.
So, yeah, our hope is that we're making a small dent in all these problems.
And ideally, by building on top of this, we can bring ideas from other languages.
We can make the shell a little bit saner.
Yeah, that's interesting. And so I guess
it's time for the last word now. So what's the one key thing you want the listeners
to take away from your work on PaSh?
Yeah, I guess — research on the shell is possible again, we think. Do more of it. You should
do research on the shell.
A call to arms.
Yeah, exactly, a call to arms. There's also other work, not just by us — like this Riker
paper, for example, and there's an offloading shell called POSH by a group at
Berkeley and Stanford, I think, that came out a few years ago. Yeah, I think we should do research on the shell.
I think it's possible again, and people should consider it again as a relevant place to do research.
And I guess something more specific is: download PaSh. It's very easy, it's compliant, and yeah, you will not break your scripts — and you might get parallelized.
Yeah.
Brilliant. Well, I guess it's a good message to end things on. So yeah, thank you so much, Konstantinos. If the listeners are interested in knowing more about your work, then we'll put all the links in the show notes, and we will see you all next time for some more awesome computer science research.