CoRecursive: Coding Stories - Tech Talk: Modern Systems Programming And Scala Native With Richard Whaling
Episode Date: February 22, 2019

Richard Whaling has an interesting perspective on software development. If you write software for the JVM, or if you are interested in low-level systems programming, or even in data-heavy or network-heavy IO programming, then you will find this interview interesting. We discuss how to build faster software in a modern fashion by using glibc and techniques from systems programming. This means using raw pointers and manual memory management, but from a modern language. Richard also shares some perspectives on better utilizing the underlying operating system and how we can build better software by depending on services rather than libraries.

Links:
Beej's Guide to C
Beej's Guide to Unix Interprocess Communication
Beej's Guide to Network Programming
Gary Bernhardt's Destroy All Software Screencasts (Web Server from Scratch, Malloc from Scratch, Shell from Scratch)
Stevens & Rago systems programming books: Advanced Programming in the UNIX Environment; UNIX Network Programming: Sockets; UNIX Network Programming: Interprocess Communication
Transcript
Welcome to CoRecursive, where we bring you discussions with thought leaders in the world of software development.
I am Adam, your host.
So it's not as much that I'm a fan of it. It's more that I feel that it's real and it's there.
And the more we try to abstract over it, the more our abstractions will
get messy and the underlying details will leak, because they inevitably do. And then we'll
end up working with a sort of unpredictable and not really performant abstraction that's probably
worse in every respect than working with the underlying, somewhat poorly designed
operating system interface in the first place.

That was Richard Whaling. He has an interesting
perspective on software development. If you write software for the JVM, or if you're interested in
low-level systems programming, then you'll find this interview super interesting.
I mean, I think lots of people will find it interesting
who don't fit into those categories.
I saw Richard present at a conference a couple years ago.
He was talking about building his own web server
and it was a super interesting talk.
So I'm excited.
I'm finally getting the chance
to have a nice long chat with him.
If you like the podcast, I recommend you subscribe to it and stay tuned to the end of the episode
for an update about the Slack channel.

All right, Richard Whaling, thank you for joining me on the podcast.

Yeah, thank you so much for having me, Adam.

So you are the author of Modern Systems Programming with Scala Native.

That's right.

I've read half the book. And this is actually good for me in this case, because it's not that I got lazy and didn't finish it. Actually, I think only half of the book has been released.

That's right. It's in Pragmatic's beta early access program. So there's five chapters available online now for
anyone who wants to get in on it early. And then I'm going to be uploading additional chapters,
probably about one a month or thereabouts until it's done, hopefully sometime this summer.
Awesome. So the very first interview I did was actually with Denys Shabalin,
the original creator of Scala Native.
And in it, we talked a lot about kind of the implementation details.
And I think it's great if people want to go check it out,
but I don't think it would be required for this episode.
So I wanted to ask you, as an end user,
what is Scala Native?
Sure.
You know, from my point of view,
and from someone who's not an expert
on implementation details either,
for me, Scala Native is an alternative implementation
of the Scala language. You can think of it sort of as being analogous to Scala.js,
but whereas Scala.js compiles Scala to JavaScript and regular Scala compiles Scala to JVM bytecode, Scala Native compiles Scala code to native, binary executable machine
instructions, the same way that a C compiler does or that a Rust compiler does. And it actually
uses the same compiler backend, the LLVM compiler framework, that Rust does, for example.
Who should be interested in Scala Native?
Oh, yeah, that's a really good question.
I think there's sort of two different angles.
First of all, I think really anyone working in Scala
should take some notice of Scala Native.
I think, especially in the last year or two,
I feel like there's a movement away from Scala
as a single platform language that's really closely tied to the Java ecosystem.
And there's more of an emphasis, first of all, on pure Scala libraries; things like sttp are a great example.
And then along with that, a move towards more explicitly embracing Scala as a multi-platform language specifically.
So Scala.js is a few years ahead of Scala Native in terms of sort of uptake and broad library support.
But I think we're moving towards a direction of really having three different platforms that can run Scala code.
You can run it on the JVM,
you can run it on the web,
or you can run it as native machine code.
And I think there's a place
for all of them in the ecosystem.
That makes sense.
Was that, I think you said
there was two classes of people.
Yeah, you're right, thank you.
And then the other side is
really people coming at it
from this like native or systems programming sort of angle.
Where, I mean, obviously Rust is a language that's made a huge amount of impact in that area, both people switching from writing C++ to writing Rust, and also bringing people in to writing code at a much lower level than they'd ever thought possible before.
Mostly because Rust provides modern ergonomics and language features,
and has a really inviting and supportive community in a way that C and C++ may have lacked.
I think similarly, Scala Native provides access to really low-level techniques in a way that's more friendly and
ergonomic than anything people have probably used before, while also having a pretty substantially
different spin than Rust, for what it's worth.

And do you find yourself in a specific camp? What brings you to Scala Native?

Yeah, that's a good question. I mean,
for background, I started writing Scala maybe about 10 years ago,
maybe nine. And it was from a background of writing C and writing Python and doing Python-C
interop, especially for like XML databases, information retrieval, stuff like that.
And I originally started with Scala because I thought its XML support looked really good.
Yeah.
It's been a...
And I'm not working on XML anymore, fortunately.
So that's not nearly as much of a problem.
But for me, having the ability to go down to a very low level and write C when necessary
to solve a problem is sort of how I think about how
to solve problems. And sort of losing that capability when I switched to mostly working
on the JVM was really hard for me. So when I started seeing Scala Native show up at conferences
and just saw the elegance of Scala Native's C interop,
it was just like nothing I'd ever seen before,
like compared to working with Python or even to languages like Lua.
It's just the cleanest, most straightforward C interop I think I've ever seen.
So I just sort of picked it up and started making like tiny PRs,
mostly to add C standard library function bindings
and sort of took it from there.
So when using Scala Native, I lose the Java libraries.
Is that right?
You lose at the very least the JDK implementations
of the Java libraries.
It's tricky for Scala.js and Scala Native because
the Scala language itself does rely on Java libraries at a lot of fundamental levels for
things like arrays, regular expressions, strings. Even when Scala provides its own Scala string and
Scala array classes, under the hood, those are depending on Java string and
array classes a lot of the time. So the approach that both Scala.js and Scala Native have taken
is that they provided alternative non-JVM implementations of these classes wherever
possible. And the idea is that the sort of the surface area of Java libraries that you need to cover
to get Scala the language running is not that large. It's a couple hundred classes, I think.
And those are all provided as part of the standard Scala Native distribution. The trouble is, right,
the line between a few hundred core Java classes versus the tens of thousands of classes in a full JDK, and then third-party Java libraries, right? And in some ways,
that's more painful than just the core JDK. So what you generally find in Scala Native
is to solve a new problem, a lot of times you'll look and find a C library that solves the problem well before you look for what you'd find
necessarily in the Scala ecosystem. That's changing more and more just because we're getting better
pure Scala implementations of so many core parts of the domain. We're just getting close to having a really good
pure Scala implementation of
HOCON, the Typesafe configuration
library that's really widely used.
Whereas before, that was a library that's incredibly widely used,
but a lot of it is implemented in Java, which would make it
unusable in Scala.js and Scala Native.
So moving towards having pure 100% Scala implementations of important libraries
is really important just for sort of allowing cross-platform development to work easily.
So it seems like it's a great time for it to emerge then, because as you were saying,
maybe using Java
libraries is less common now. It's still quite common, but I guess there's a lot of pure
Scala implementations in use, and that makes transposing things to native that much
easier.

Yeah. You know, I think it really does. It means you can pick up a JSON library, like
Argonaut off the shelf and it'll just sort of work, you know, versus having to find a C
library or something like that. It's also just a really good time because I feel like there's also
just sort of a cycling of the common libraries used in the Scala ecosystem. And I think there's
a lot of things that are getting redesigned or people are bringing in new libraries for things like config, for HTTP, a lot of areas like that.
So it's a really interesting time to be building and designing new libraries and making an impact
there, I guess.

So you mentioned interop with C. How does that work?

In Scala Native, it's really
simple. It's literally just a one-line function def, just like in a regular
Scala program. You define a singleton object, you annotate that it's an @extern object.
And then let's say you want to write a binding for a function like qsort. The C qsort takes, off the top of my head, four arguments.
It takes a void pointer to the array of data to sort. It takes the number of elements in the array,
which is just an integer. It takes the size of each element of the array in bytes.
And then the fourth argument is a function pointer
to the actual comparator that qsort uses to actually implement the pairwise comparisons.
And it's this sort of magical C library function that is both incredibly low level and
quite generic. And when you compare it to sorting routines in a high-level language like Java or Python, it's sort of ludicrously performant, like usually one or two orders of magnitude faster compared to sorting in a higher-level language with richer data structures. The way to use it in Scala Native is you write a Scala def and you give it the same name as the function
you want to bind and the same arguments. You have to translate the types a little bit. So
a void star, a void pointer in C, becomes a pointer of byte, Ptr[Byte], in Scala. So in Scala Native,
pointer is actually just a generic data type, which turns out to be incredibly elegant.
And it can actually make these function declarations and type definitions and stuff
like that quite a bit easier to read than C syntax, which can frankly feel like line noise
sometimes with function pointers and things like that. So it's actually really straightforward.
And if there's a function from the C standard
library that's missing that you need to use, one, you can provide it in your own code easily.
Again, just a one line function definition. But also like contributing those bindings to
Scala Native itself is a really easy, quick one, like small PR. And that was how I got involved in the first place,
was just sort of knocking these things out
because there was a bunch of string tokenization functions
that I wanted to use for some reason.
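
As a rough illustration of the four-argument call described above, here is a minimal sketch in plain C rather than Scala Native. The standard library function is qsort; the names compare_ints and sort_ints are made up for this example and do not come from the book or from Scala Native's bindings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparator handed to qsort: it receives two untyped pointers and has to
   cast them back to the real element type itself. */
int compare_ints(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);   /* negative, zero, or positive */
}

/* Hypothetical helper: sorts `count` ints in place, ascending. */
void sort_ints(int *values, size_t count) {
    if (count == 0) return;
    assert(values != NULL);
    qsort(values,            /* 1: pointer to the data, passed as void * */
          count,             /* 2: number of elements in the array */
          sizeof values[0],  /* 3: size of each element in bytes */
          compare_ints);     /* 4: function pointer to the comparator */
}
```

Calling sort_ints on an array holding 5, -1, 3 leaves it as -1, 3, 5. A Scala Native binding declares the same four parameters, with the types translated as described in the conversation.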
That's interesting.
It's hard to picture, I think.
So what the code looks like,
for instance, if you're using this qsort,
is standard Scala, right?
You're going to import this namespace that's like the C standard
library. And then you have a generic pointer type of the type of data that is in that, and
you have to do a lot of casting. I think that's the big difference, right?
Yeah. And that's the thing that probably feels the most different about working with Scala
Native, just because casting pointers and structs and arrays in Scala Native has similarly
unsafe semantics to how C works.
C is a typed language, but in many ways it's a weakly typed language, in that not only
will the compiler allow you to freely cast between
nominally unrelated data types, but a lot of really important APIs require you to do so
in a way that feels deeply unhygienic to people like me who have been working in strong,
safe type systems like Scala's for 10 years, right? So it can be a little
awkward to learn the patterns there if you never learned them from C. And maybe it's even harder
to trust yourself to do all of these sort of free, unsafe casts that C programmers do all the time.
But once you get the hang of it, it's actually incredibly powerful. And the
absence of sort of runtime overhead that you get from just doing this sort of pure compile time
casting and maintaining certain type disciplines oneself is kind of crazy, honestly. I think in my
Scala Days talks last year, I had some pretty good benchmarks on this. And I don't 100% remember all of the numbers off the top of my head. But the thing that was really impressive is that the difference in performance between, say, sorting a large Java array versus qsort in Scala Native on a large array of structs, you couldn't even say that Scala Native was X
percent faster because it was a superlinear improvement in performance as the size of the
array gets bigger, which was both surprising and really cool. And the reason that happens is
because if you have a large array in Java or in JVM Scala, right, you have your array,
but you're also storing a ton of objects on the heap. And the larger the heap gets, the more time
and the more resources your system is going to spend on garbage collection. Whereas with Scala
Native or with C, an array of structs is manually managed memory. It's never going to get garbage collected. You can have
regular garbage-collected on-heap objects also. But if you keep all of your bulk data, like
gigabytes of data off the main heap, it's basically free for the purposes of like runtime GC overhead.
And that's more than like, oh, this is 20, 30% faster. That's where you find the
difference between a program that can run and a program that can't. It makes things possible that
just aren't possible with vanilla Scala and large on-heap data structures in a lot of ways.
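
To make the "array of structs in manually managed memory" idea concrete, here is a small C sketch. The WordCount record and the helper names are assumptions invented for this illustration, not the layout used in the book. The point is that the whole data set lives in one malloc'd block that no garbage collector ever scans, and that the same qsort call sorts it in place.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type for this sketch: a word and its total count. */
typedef struct {
    char word[32];
    long count;
} WordCount;

/* Comparator: orders records by count, largest first. */
int by_count_desc(const void *a, const void *b) {
    const WordCount *x = a;
    const WordCount *y = b;
    return (x->count < y->count) - (x->count > y->count);
}

/* Allocates one contiguous block holding `n` records. Nothing here is
   tracked by a garbage collector; the caller releases it with free(). */
WordCount *make_records(const char *const *words, const long *counts, size_t n) {
    assert(n > 0);
    WordCount *records = malloc(n * sizeof *records);
    if (records == NULL) return NULL;
    for (size_t i = 0; i < n; i++) {
        /* Copy at most 31 characters and always terminate the string. */
        strncpy(records[i].word, words[i], sizeof records[i].word - 1);
        records[i].word[sizeof records[i].word - 1] = '\0';
        records[i].count = counts[i];
    }
    return records;
}

/* Sorts the block in place, without allocating. */
void sort_records(WordCount *records, size_t n) {
    qsort(records, n, sizeof *records, by_count_desc);
}
```

The trade-off is the one discussed in the conversation: the memory costs nothing at garbage collection time, but forgetting the final free() leaks it, and nothing checks the casts inside the comparator.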
Yeah, it's kind of crazy. So what you're actually saying is, okay, sorting things,
quicksort, let's say it's supposed to be like n log n complexity, I think.

Yeah, that might be right.

But you're saying that the complexity class is actually completely different
in a managed environment. Like you're not actually getting that performance
because of all this overhead that you're not counting. So when you go native, you're actually in a whole different class of performance. Is that right?

Yeah. Yeah. I think
that's a really good way to put it. I think native quicksort is going to perform like the algorithmic
lower bound of quicksort. I think quicksort might nominally be n squared, but just a very well
optimized n squared, but I'm fuzzy. But you're right. So the difference
is that with the JVM, you have this performance penalty of garbage collection on top of everything.
And the burden the garbage collector places on large data intensive operations is much larger than actually running computations on them in a lot of cases.
And we take these large, heavy, legacy, virtual machine-managed environments
for granted so much in every modern language.
I think we lose sight of how much performance we're giving up
to these runtimes in some ways.
I don't know.
Yeah.
In your book, you use Scala Native to manipulate Google Ngram data.
Yeah.
Why was that a good choice? Why was Scala Native a good choice there?
Why was Scala Native a good choice? Well, first of all, for all of the kinds of reasons I've
put out there already, when you're really trying to process bulk data. And I picked Google
Ngrams, right, because just the letter A file for this data set is like two gigabytes. And you can
get way, way more data than this, if you want to. Once you approach the size of around two plus
gigabytes of data, that's where a JVM heap is really going to start to have trouble.
And it turned out to be a really great way to exemplify the virtues of Scala Native for doing
this kind of bulk off-heap data processing on a real data set and with a simple but somewhat practical real world use case of taking two
gigabytes of data, aggregating it and sorting it, which I think is a pretty common and understandable
task with this kind of file.

Yeah, like the first example you do, you're just reading it line by
line, I think, right? And just finding the...

That's exactly it.
Just finding the max line value.
So to me, it feels like the JVM should do this very well.
Right.
And reading lines one at a time is something the JVM does well
and is pretty well optimized for.
But even then, when it's doing that,
it's going to be allocating data for a string and then freeing that object, with a file like this, tens of millions of times.
And thus, even for these really small object allocations, without having to actually persist a huge amount of data onto the heap, right? Like there's not a huge amount of data being retained
in this case. It's just a lot of data goes in and a lot of data goes out. Even in that use case,
the garbage collector is really imposing a lot of overhead. Whereas what you can do with Scala
Native is, instead of allocating data over and over to read lines in and then discard them, you can just allocate a static
buffer of like a kilobyte one time, and then you can read every line into the buffer, and then you
don't have to free anything. You just keep reading data into the same buffer and just process all the
data in place. And then there's no allocations at all, no de-allocations. So even where you don't actually
have the overhead of the large heap, but just high GC throughput, it's still possible to beat the JVM.
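
The single reusable buffer described above might look roughly like this in C. This is a sketch under stated assumptions: each line is taken to be tab-separated as word, year, count, and to fit in the buffer; the MaxEntry type and the find_max_count name are invented for the example and are not the book's code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type: the word with the largest count seen so far. */
typedef struct {
    char word[256];
    long count;
} MaxEntry;

/* Reads `in` line by line into one reusable stack buffer and keeps only the
   running maximum. Assumes lines of the form "word<TAB>year<TAB>count...",
   each shorter than the buffer. Returns the number of lines parsed. */
long find_max_count(FILE *in, MaxEntry *best) {
    assert(in != NULL && best != NULL);
    char line[1024];   /* allocated once, reused for every line */
    char word[256];
    int year;
    long count;
    long parsed = 0;

    best->word[0] = '\0';
    best->count = -1;
    while (fgets(line, sizeof line, in) != NULL) {
        if (sscanf(line, "%255s %d %ld", word, &year, &count) != 3) {
            continue;  /* skip lines that do not match the assumed format */
        }
        parsed++;
        if (count > best->count) {
            strcpy(best->word, word);  /* fits: word holds at most 255 chars */
            best->count = count;
        }
    }
    return parsed;
}
```

No memory is allocated or freed inside the loop, which is the property being contrasted with a JVM reader that creates a fresh string for every line.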
And I did try running this on my machine and yeah, it's definitely faster. And then you had
a second example, I think you were hinting at it, where you kind of aggregate this data.
So that was interesting because the result was not quite what I expected.
Yeah. I'm curious what you were expecting from doing like an aggregation and a sort.

Well, so the example I think was just grouping all the data because it's split by years. The
thing that was surprising to me was that the non-native version couldn't do it. Like it was just too much.

Yeah. So I was expecting the JVM to do a little better there, to be honest. And I mean, my code
isn't always perfect either. So it's certainly possible a high performance JVM specialist could
write a Scala or a Java program that can do this. I certainly know people who do that for a living. But I was a little surprised that the basic JVM array classes and string classes
really couldn't handle what to me was a somewhat intensive, but not outlandish ask, I guess.
Processing a two gigabyte file and sort of aggregating it into maybe a couple hundred megabytes of
on-heap storage seems to me something we should expect a reasonable language implementation to
be able to do. So to me, this is a really important way just to illustrate that the overhead
and just the cost of the JVM or other heavy runtimes really does affect what programs you
can run and what programs you can't.

I think, like, I remember Denys talking about this before,
like, if you look at Spark or something, they end up just doing manual memory management using some
tricks.

Yeah, I mean, Spark is such a fascinating use case. You know, I use Spark a ton and have
used it for years. I was doing Spark consulting and
struggling with Spark jobs and getting them to run a lot of times. I think in some ways,
it is really fascinating how much Spark still relies on on-heap storage, which has pros and
cons. But one of the cons of it in practice is that when a Spark job approaches
the maximum amount of memory and heap available, either on a single node or even worse
across all the nodes, really the whole system starts to fall apart. The classic thing that'll
happen in really any data intensive JVM program, right, is that once your heap gets really maxed
out, your garbage collector will start imposing long stop the world pauses. And while that's
happening, you'll miss some kind of heartbeat or some other kind of distributed systems timeout
that your cluster is using for coordination. And then once you start missing heartbeats,
the cluster starts breaking apart and thrashing. And then it's just all downhill from there.
Right? So it's like you can't even push the system to the limits of its capacity
without breaking all of the networking and distributed systems components.
Whereas I think using off-heap memory, whether you're in the JVM or not,
has a lot of benefits for data-intensive distributed systems.
But also, it's really, really hard to do that right.
The JVM really gets in your way over and over trying to do things like this.
And for me, what's really magical about Scala Native is that it's relatively straightforward. I can write a short sort of blog post length article or chapter and show people how to do it.
It's not like, oh, you have to study this in grad school for six months,
sort of advanced technique.
I feel
like it just makes it so much more approachable. And I'm hoping it makes it possible for us as a
community to build libraries that have these sort of more robust and elegant behaviors when handling
intense amounts of data, honestly.

It's interesting. Scala is a very unopinionated language, I guess some people would
say, except in one particular domain, which is that it has to be garbage collected. So that's
the one constraint that's been thrown away now.

Well, it's interesting because it's garbage
collected and it isn't. The thing that I think is subtle and that comes with practice with Scala
Native is being able to sort of keep these two universes in your head.
You have one universe of regular Scala objects
that are garbage collected.
You have all of the immutable data structures and for loops
and all of the niceties of real Scala, right?
But then you also have the option of using these manually managed pointers and off-heap data structures and structs and arrays, which the garbage collector doesn't touch.
And I think if it were all manual memory, Scala Native would be just as hard to program as C, frankly.
It would be a step backwards. For me, really, the magic is having garbage collection,
all the idioms of regular Scala, but then being able to manage manually a handful of large
custom data structures for the critical paths of a program. And sort of finding the balance between these two domains, I think is really where the
art is. And I think where I'm still evolving, certainly.
It's an interesting point that I guess we probably didn't make clear. So Scala Native can be used
just like Go or something, right? Go is native and garbage collected. And that's about it,
right? But also with Scala Native, you can bring in the
standard C library, and then you can start doing manual memory management, you know, programming
like it's 1995 or whatever.

Yeah, that's a really good distinction to make. Go is a really good
example of a native, relatively high level language that has a garbage collector and has
good ergonomics. And I think Scala Native is
competitive at a lot of the same things Go does. And Go might be a better direct
comparison for Scala Native than Rust in some ways. The other thing that I compare it to a lot,
which might be a little more obscure, would be Standard ML or OCaml, which are somewhat more academic functional programming
languages, but are also strict languages like Scala, have great immutable data structures like
Scala, have awesome garbage collectors like Scala Native, and so on. But in practice, I don't think
there's a lot of folks out there who've experienced doing systems programming in a modern, high-level, functional-capable language like this.
Scala Native is the first a lot of people have heard of this sort of style. And that's what my book is really about, trying to just show people this whole different way of writing both
native code that's much closer to the machine, but also with all of the quality of life and
niceties of just writing regular Scala too.

Yeah. So I could imagine
that just your sort of tight inner loop, performance-critical parts might be something you might want to
manually manage?

Yeah, exactly. And then you would use regular Scala collections and strings and
all those niceties for configuration, for networking. I mean, it's always going to be
a balance. And when you're at the point where you're writing a program which has specialized performance needs, every program like that is going to be unique in some ways.
So it's always the experience and judgment you have of, well, which is the tight inner loop I actually need to optimize?
In some cases, that's obvious.
If you're sorting four gigabytes of data, it's the sorting, you know?
But in real world programs, it's not always that clear cut, which is definitely a challenge for
this kind of programming, I'd say.

So your book is called Modern Systems Programming,
and it spends some time on the C standard library. Why should we be interested in it?

Yeah, you know, I called it modern systems
programming sort of to contrast it with the books I learned C systems programming from.
The one that's closest to my heart is the Stevens and Rago, Advanced Programming in the Unix
Environment, which is the size of a brick, right? And like a large brick, not a little brick.
But it's a great book, one of the best technical books I own. It's totally encyclopedic,
like everything you can do in C with a Unix OS kernel is in there, basically.
But it doesn't even cover networking. If you want to do anything involving networking,
there is not one but two additional brick-sized books, also by Stevens, on networking in the Unix environment in C, which, again, are encyclopedic and amazing.
But it's just so alien from the way we write code now, where everything is on the network.
You make REST calls at the drop of a hat, right?
Everything is a server.
It's not reasonable to say, well, if you want to write close to the
metal, you need to read these three brick-sized books. So the approach that my book takes is that
we introduce the fundamental C system calls you need to work with data, to allocate memory,
and to do basic TCP networking really early. We fly through some small, lightweight programs that exercise them,
and we really bootstrap to the point where you can write from scratch without any support libraries,
a simple HTTP client or a really simple but rugged HTTP server within the first 100 pages
of the book or so. And I guess
that's what I mean with the modern approach is that I'm not treating networking as an obscure,
advanced topic. I'm really just putting it out there in front because it's absolutely critical
to every program we write nowadays. Yeah. And then the second half of the book actually doubles down
on that. And that's all the stuff that I'm going to be releasing over the next couple of months. But essentially, what we're going to do in the second half of the book is we're going to bring in a C library called libuv, which is the event loop and networking library that Node.js also is based on, but it's a C library, not a JavaScript library, and sort of build up a fully asynchronous,
highly usable, well-designed Scala library
around asynchronous IO with C-level performance, basically.
Yeah, it's a different perspective.
So I'm a Scala developer.
At my work, there are people who are C programmers. I don't feel like
we always speak the same language. And then it's interesting, in your book you're like, let's build
a web server in Scala, so first let's look at how you, you know, listen on a socket using the standard
C library. Like, that is not the approach I would normally take if writing one.
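
For a sense of how little code "listen on a socket using the standard C library" involves, here is a minimal POSIX sketch. The helper names listen_on_loopback and bound_port are hypothetical, and this is an illustration of the system calls involved, not the server from the book. It stops at the listening socket and leaves out accept and the HTTP handling.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: opens a TCP socket listening on 127.0.0.1:`port`.
   Pass 0 to let the kernel choose a free port. Returns the socket file
   descriptor, or -1 on failure. */
int listen_on_loopback(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    /* Best effort: allow quick restarts on the same port. */
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    /* bind() takes a generic struct sockaddr *, so the cast is required:
       the same style of pointer casting discussed earlier. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Reports which port the socket ended up bound to, or 0 on failure. */
uint16_t bound_port(int fd) {
    assert(fd >= 0);
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) return 0;
    return ntohs(addr.sin_port);
}
```

From here a server would loop on accept(), read the request bytes into a buffer, and write a response back, which is roughly the path the conversation describes as fitting in a couple hundred lines.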
Yeah, I sometimes have some ambivalence about that kind of presentation, because, like, I think a lot of books would say, oh, just use this library, do, you know, HTTP server dot serve, port 80, or whatever, right?
And that's certainly the right way to build a web server at work.
You don't want to write a web server framework from scratch for your job,
and you don't want to have to maintain someone else's from scratch web server ever.
I think we've all been in situations where we've had to support sort of gratuitously DIY code that someone has done.
And I think we all know the downsides of that. For me, the reason to sort of embrace that
gratuitous DIY spirit in the book is, first of all, it's a way to get the reader really
intimately comfortable with exactly what the operating system does and
is responsible for, both to show how much there is, but also in some ways how little there is.
And then to demystify the libraries and frameworks we use for this every day,
right? Because it is somewhat insane to write an HTTP server from scratch.
But then you realize, oh, I can write an HTTP server that can handle a couple hundred or a couple thousand requests a second
in less than 200 lines of code.
And if you realize that's within reach for any Scala developer, really,
I feel like it opens up so many things.
It makes it a lot more believable that we could write new pure
Scala libraries that replace things from the JVM like Netty, right? That we're unlikely to ever be
able to port to C. So maybe that's the sort of oblique strategy of the DIY approach is that it really just opens up the possibility of building
this new ecosystem. And I hope a better and simpler ecosystem than the sort of JVM
environment that Scala sort of bootstrapped itself upon.
Yeah, I love the approach. I picture you playing around with Scala Native and you're like, okay, you want to use whatever, Netty, and you can't. And so you turn around and you have your
three giant brick books and you're like, well, I know how to open a socket. And you're just
coding away. So is this an attitude that's lacking in the high-level language world?

You know, I'm hesitant to even generalize about the high-level language
world in general. What I would say is that I think there's probably a crisis across software
development about dependencies and libraries. And a lot of the really serious examples of this come from Node.js and the NPM ecosystem,
right?
Where if one library gets taken down or compromised, like left-pad or event-stream, hundreds or
thousands of upstream libraries and any number of large, serious projects can get harmed. And I think it comes from this notion that
it's easier and faster to grab a library off the shelf for every possible need we might have.
And maybe this is me editorializing or getting a little bit cranky. In fact,
I'm sure this is me editorializing or getting a bit cranky.
Run with it.
Yeah, no, I'm not sure that many of the things
we use libraries for are that hard.
And maybe we should consider the possibility
of having programs that are a little more self-contained
and that don't necessarily rely on, you know,
a hundred or 200 dependencies to tick every possible box.
I'm a big fan of writing software that is simpler and more rugged or more performant
that can be relied on. And sometimes it means taking a different approach to that. It can also mean thinking more about infrastructure and how your code is going to be deployed
and figuring out how to rely on infrastructure to solve a problem.
And just thinking about this whole life cycle of your code and the environment it lives
in and what it does and what it doesn't need to do.
So what do you mean by that?
So a really good example of this would be something that I just put out on GitHub and
on Twitter: a Scala Native runtime for AWS Lambda, actually.
I don't know, your readers might not be so familiar with it, but a couple months ago,
AWS announced custom AWS Lambda runtimes.
Previously, there were only three or four languages that you could run a Lambda function in.
And it was, I think, JavaScript, Python, C#, and Java, right?
Yeah.
The idea is you would upload your code or a jar or some other archive with a runnable artifact.
And then AWS managed a runtime that ran it, something like that. Now you actually just
get a bare Linux VM with sort of a magical local HTTP endpoint that's provided by Firecracker,
basically by the new AWS virtual machine monitor, right? And any executable program in any language you can write, if it wants to serve Lambda functions,
it just has to hit this little local HTTP endpoint once to get a request in, and then it hits it
again with a response. And that's it. Like, there's no encryption, there's no request signing, because
it's all local. None of this is even going over the network between my code and the sort of virtual machine
boundary that's standing between my code and the rest of AWS and the rest of the internet.
So it's this really fascinating model where it allows you to write really simple code, you know, because I already had
an HTTP client in Scala Native, from the book, in fact. So I took that and I added about 20 lines
of additional Scala code to just hit the two endpoints basically in a loop. And that was it.
That's all it takes to provide an AWS Lambda runtime now.
Because there's this beautiful synergy between a really streamlined interface and really sophisticated and elegant infrastructure and API design.
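The "two endpoints in a loop" idea can be sketched roughly as follows. This is a Python illustration, not Richard's Scala Native runtime. It assumes the documented Lambda Runtime API shape: the `AWS_LAMBDA_RUNTIME_API` environment variable holding `host:port`, a GET on `/2018-06-01/runtime/invocation/next`, and a POST to `/2018-06-01/runtime/invocation/{id}/response`. The names `run_loop` and `_call` are hypothetical, and error reporting is left out.

```python
import http.client
import json
import os

API_PREFIX = "/2018-06-01/runtime/invocation"


def _call(host, method, path, body=None):
    """Make one plain HTTP call; return (request-id header, body bytes)."""
    conn = http.client.HTTPConnection(host)
    try:
        conn.request(method, path, body=body)
        resp = conn.getresponse()
        return resp.getheader("Lambda-Runtime-Aws-Request-Id"), resp.read()
    finally:
        conn.close()


def run_loop(handler, runtime_api=None, max_invocations=None):
    """Fetch an event, run the handler, post the result, repeat."""
    host = runtime_api or os.environ["AWS_LAMBDA_RUNTIME_API"]
    served = 0
    while max_invocations is None or served < max_invocations:
        # Endpoint 1: blocks until the platform has an invocation for us.
        request_id, payload = _call(host, "GET", f"{API_PREFIX}/next")
        result = json.dumps(handler(json.loads(payload))).encode("utf-8")
        # Endpoint 2: hand the result back for that specific request id.
        _call(host, "POST", f"{API_PREFIX}/{request_id}/response", body=result)
        served += 1
    return served
```

Inside a real custom runtime the loop would run forever (`max_invocations=None`); the optional limit and the explicit `runtime_api` argument exist only so the sketch can be exercised against a local stand-in server.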
So some of it is that our dependencies can be services rather than actual libraries. Is that...?
Yeah, yeah, that's a really great way to put it:
that instead of having dependencies on 15 libraries with 15 different interfaces,
if your dependency is a service, then you might only need to implement one protocol to work
with all of your different dependencies.
An interesting thing is containers. So the JVM sort of is an early
type of containerization in a way, right? It's a little VM that runs the same everywhere. And now,
like I regularly am shipping to production, you know, JVM processes that run inside of containers.
Yeah. Yeah. And it's really interesting how there's a bit of an impedance mismatch there, in that so much of the complexity of Java as an ecosystem is built around things like runtime class loading, right?
With the ability, presumably, to swap versions of an application in and out within a larger sort of mothership-grade Java-like super server
that's hosting lots of such applications.
And now with containers, we've accepted and I think pretty happily adopted
sort of shipping these immutable artifacts for all of our code.
But we sort of still are paying the memory overhead and the runtime complexity
of this virtual machine that's designed with,
honestly, much more sophisticated capacities that we just aren't using.
Yeah, you're a fan of the operating system.
I don't know. I mean, a fan? I guess I have an interest in the operating system. You know,
I can complain about the design of any particular operating system for days, right? Linux can be
a pain to work with. Linux is also amazing, but there are parts of it that can be a pain.
I mean, from writing this book and trying to explain even basic Unix socket APIs,
I could go on about how poorly designed some of the socket APIs are, right? And like how little sense some of it makes
if you try to write down how it works. And that was something I really struggled with.
So it's not as much that I'm a fan of it. It's more that I feel that it's real and it's there.
And the more we try to abstract over it, we'll both, one, our abstractions will get messy.
The underlying details will leak because they inevitably do.
And then it will end up working with a sort of unpredictable
and not really performant abstraction
that's probably worse in every respect
than working with the underlying,
somewhat poorly designed operating system interface in the first place.
Yeah, like I feel like, probably rightfully so,
you, maybe more than others, solve an architecture problem
by leaning on the resources that the operating system provides.
So recently I was using wget to scrape
some website, and it was slow. So I wanted to run multiple in parallel. I looked at
the documentation and it's like, just keep starting it up with the same arguments. Right.
Nice.
So I saw that as an example, right? They were coordinating just by, before it downloaded a page,
I think it creates the file. So then if there's multiple instances, they just move on. So this type of design maybe is less
common, where you are utilizing what the operating system provides.
Yeah, I think that's a really
great example, actually, just because like if you did that with a shell script, you could run that
pretty ridiculously fast, just because shell scripts are
great for spawning dozens of processes in parallel, whereas every modern, high-level,
pleasant-to-use programming language doesn't expose multi-process programming elegantly or at all,
in fact. We all are used to threads, right, where you have multiple
contexts of execution, potentially multiple CPUs going at once within the same sort of
memory address space. But actually having multiple separate processes running within a single logical
program in some sense is actually a little more alien. And it's something I do get
into precisely because it opens up things like spawning 15 wget or curl instances or something
like that.
Yeah. But where do we learn about this? Besides your book, do we just have to go to your
three tomes of Unix system programming?
I mean, those are great. Actually, there are a lot of really good resources online
for stuff like this.
Beej's Guide to C covers a huge amount of this stuff.
Unix sockets, memory management,
fork and multi-process programming.
The actual Linux man pages for this stuff
are also really, really good.
And I've spent a lot of quality time with them while trying to write a book about this.
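The coordination trick discussed a moment ago, several identical processes that each claim work by creating its output file first, can be sketched with `fork` and an exclusive file create. This is a Python illustration of the idea, not a description of how wget is actually implemented; `claim`, `worker`, and `run_workers` are hypothetical names, and the "download" is a stand-in write.

```python
import os


def claim(path):
    """Atomically create `path`; return a file descriptor, or None if taken."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so the kernel
        # guarantees exactly one process wins each file.
        return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return None


def worker(names, out_dir):
    """Walk the full list, doing only the items this process managed to claim."""
    done = 0
    for name in names:
        fd = claim(os.path.join(out_dir, name))
        if fd is None:
            continue  # another process got there first; just move on
        with os.fdopen(fd, "w") as f:
            f.write(f"fetched by pid {os.getpid()}\n")  # stand-in for a download
        done += 1
    return done


def run_workers(names, out_dir, n_workers):
    """Fork identical workers with identical arguments and wait for them all."""
    pids = []
    for _ in range(n_workers):
        pid = os.fork()
        if pid == 0:  # child process
            try:
                worker(names, out_dir)
            finally:
                os._exit(0)  # never return into the parent's code path
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    return sorted(os.listdir(out_dir))
```

No locks, queues, or shared state are needed: the file system is the only coordination mechanism, which is why simply starting the same command several times can work.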
There's a few somewhat older programming languages that have good support for a lot of these
multiprocess programming techniques.
And just for sort of Unix-style programming in general. I guess they're not even that
old. But Python and Ruby both have a
slightly closer affinity to C and to the operating system than, like, Java or Go do, in some sense.
So, like, Gary Bernhardt has a lot of really great screencasts where he does a pretty similar kind of low-level
systems programming to what I do, in Ruby, right? And just writes stuff from scratch
using basic C system calls. Because Ruby also has a pretty good, obviously untyped, C FFI.
And then Python actually can be pretty good for aggressive multiprocess and off-heap data
structures. You can do some pretty cool stuff with their multiprocess and numeric computing libraries, where you can have multiple processes with isolated state, mounting a shared memory map with a giant array in it or stuff like that.
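As a rough sketch of that last idea, here is one way several processes with otherwise isolated state can write into a single shared, off-heap array. It assumes a Unix system, where an anonymous `mmap` is shared with forked children; `parallel_squares` is a hypothetical name and the workload is deliberately trivial.

```python
import mmap
import os
import struct


def parallel_squares(n_workers):
    """Each forked child writes i*i into its own slot of one shared mapping."""
    slot = struct.calcsize("q")  # one signed 64-bit integer per worker
    # fileno -1 requests an anonymous mapping; on Unix it defaults to
    # MAP_SHARED, so writes made by children are visible to the parent.
    shared = mmap.mmap(-1, n_workers * slot)
    pids = []
    for i in range(n_workers):
        pid = os.fork()
        if pid == 0:  # child: isolated heap, but the mapping is shared
            try:
                struct.pack_into("q", shared, i * slot, i * i)
            finally:
                os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    values = [struct.unpack_from("q", shared, i * slot)[0] for i in range(n_workers)]
    shared.close()
    return values
```

Because each child writes only to its own slot, no locking is needed here; processes that update overlapping regions would need some form of synchronization.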
That's probably where I picked up a lot of these techniques originally, for the more aggressive process and off-heap stuff, actually.
Very cool. Yeah, I mean, the JVM, like, I guess we've been bagging on it,
but it's quite performant generally.
So it's not often where I have to reach.
Like I can see why Ruby,
maybe you're more likely to have to do some FFI.
Yeah, no, I mean,
and that's a really good thing to call out.
And I love the JVM.
The JVM is great.
The quality of the JVM's just-in-time compiler is, I mean,
it's a really phenomenal piece of human engineering. It's so good for so many things,
and I'm not hating on it. I think I just, it's more like I have qualms about its applicability
to the situation we find ourselves in, you know? Maybe it doesn't fit that well with containers.
Maybe it doesn't fit that well with really large on heap data structures.
Maybe it's not perfect for latency sensitive distributed systems.
You know, those are sort of the angles that I find myself sort of pushing against the
limits of the JVM, if that makes sense.
Yeah, no, it totally makes sense.
And yeah, I mean, I think containers for certain
are the way forward. So yeah, I don't know what that says long-term for the JVM. You see a lot of
new languages are native. One thing I noticed was that the standard C stuff
that you interop with, like some of it, by my maybe more modern taste, seems somewhat insane. I don't know.
To use the technical term.
No, it's really true. There's this particular technique called a type pun, where you have types that are genuinely unrelated to each other, have totally different
structures, different sizes, right? And in some cases, you'll have types like the sockaddr type,
which nominally exists and has a size of like 14 bytes, but you can't actually instantiate it,
because there is no generic instance of a socket address. Instead, you do things like you allocate
an IPv4 or an IPv6 address, which will be twice or four times as large. And then you just cast it down to this other much
smaller data type and pass it into a system call. And that's terrifying, right? Any rational person
doing that would think, oh my God, the system is going to chop off the last eight bytes of this
thing I just passed in when I cast it, or it's going to overflow or something else, right?
Because certainly anytime you're working with C, you're going to be scared of overflowing pointers.
And when you have these basic foundational APIs that require you to do unsafe casts that are
effectively overflowing buffers going into the kernel, it's sort of horrifying that this is how everything works.
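The cast being described can be reproduced with Python's `ctypes`, which makes the sizes visible. This is a sketch that assumes the Linux field layout (other systems arrange `sockaddr` slightly differently); the class and function names are hypothetical. With these definitions the generic struct is 16 bytes (a 2-byte family plus 14 bytes of data) and the IPv6 struct is 28. The reason the cast does not actually truncate anything is that calls such as `bind` and `connect` take the real length as a separate argument.

```python
import ctypes
import socket


class SockAddr(ctypes.Structure):
    """Generic `struct sockaddr` (Linux layout): a family tag plus opaque bytes."""
    _fields_ = [("sa_family", ctypes.c_ushort),
                ("sa_data", ctypes.c_char * 14)]


class SockAddrIn6(ctypes.Structure):
    """`struct sockaddr_in6` (Linux layout): larger than the generic type."""
    _fields_ = [("sin6_family", ctypes.c_ushort),
                ("sin6_port", ctypes.c_uint16),
                ("sin6_flowinfo", ctypes.c_uint32),
                ("sin6_addr", ctypes.c_ubyte * 16),
                ("sin6_scope_id", ctypes.c_uint32)]


def pun_ipv6_loopback(port):
    """Build an IPv6 loopback address and view it through the generic type."""
    addr = SockAddrIn6()
    addr.sin6_family = socket.AF_INET6
    addr.sin6_port = socket.htons(port)  # network byte order
    addr.sin6_addr[15] = 1               # ::1
    # The type pun: same memory, reinterpreted as a pointer to the smaller
    # struct. Nothing is copied or chopped off.
    generic = ctypes.cast(ctypes.pointer(addr), ctypes.POINTER(SockAddr))
    # A system call would receive `generic` together with this length, and
    # would use the family tag to decide how to read the remaining bytes.
    addrlen = ctypes.sizeof(addr)
    return addr, generic, addrlen
```

The function returns `addr` as well as the pointer so the underlying memory stays alive for as long as the caller holds the result.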
It's all of this. But a lot of that has historical reasons. The POSIX socket API,
I think, mostly got written and mostly implemented around 1981, 1982, which is a really long time
ago. It's close to 40 years now, right? And it actually doesn't quite predate
like modern ANSI C, the sort of standardized modern C we all know and love. But it sort of
came into being around the same time. And a lot of the implementations didn't have access
even to a full compliant standard C implementation. And what that meant was that some of the nicer features C has,
like C has union types, which are a cleaner and sort of more type-safe way to represent
related types that'll sort of slot into the same memory space. It's not quite as
elegant as a sealed trait in Scala or, like, subtyping in an OOP language.
But it fulfills a similar role in a modern C program.
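For comparison, here is roughly what a union-based design looks like, again sketched with Python's `ctypes` so the memory layout is explicit. `AddrBody`, `TaggedAddr`, and `describe` are hypothetical names for illustration; this is not how the POSIX socket API is actually defined.

```python
import ctypes
import socket


class AddrBody(ctypes.Union):
    """Both members occupy the same memory; the union is as big as the largest."""
    _fields_ = [("v4", ctypes.c_ubyte * 4),
                ("v6", ctypes.c_ubyte * 16)]


class TaggedAddr(ctypes.Structure):
    """A family tag says which member of the union is currently meaningful."""
    _fields_ = [("family", ctypes.c_ushort),
                ("body", AddrBody)]


def describe(addr):
    """Dispatch on the tag, a bit like matching on a sealed trait."""
    if addr.family == socket.AF_INET:
        return ".".join(str(b) for b in addr.body.v4)
    if addr.family == socket.AF_INET6:
        return socket.inet_ntop(socket.AF_INET6, bytes(addr.body.v6))
    raise ValueError("unknown address family")


# Both variants have the same declared type and size, so no cast is needed.
loopback4 = TaggedAddr(socket.AF_INET,
                       AddrBody(v4=(ctypes.c_ubyte * 4)(127, 0, 0, 1)))
loopback6 = TaggedAddr(socket.AF_INET6,
                       AddrBody(v6=(ctypes.c_ubyte * 16)(*([0] * 15 + [1]))))
```

The compiler (or here, `ctypes`) knows every variant up front, so the size is fixed and visible in the type, which is the property the cast-based API lacks.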
So this is something the C language has, and it just wasn't there or wasn't stable enough
at the time that a lot of these legacy APIs got written, which is both a little terrifying, but also
maybe it's encouraging. We have all these new systems programming languages coming up. People
are writing new virtual machine monitors and new operating systems in things like Rust and Go even.
And I think if people can write these things, and certainly
in a garbage-collected language like Go, the notion of using Scala Native for a sort of real,
deep and like low level implementations of some of these primitive facilities,
it starts to get really exciting. It makes you wonder what could we do if we tried to implement a fundamentally sane network protocol, for example, which might not be something we have yet.
Yeah. So it does seem like an opportunity. I can see what you're saying. Like, Scala
Native could just have a wrapper around whatever, string copy, and it could provide a little bit of sanity.
Yeah. And you might be able to find a sweet spot
where you provide more sanity than C does,
but you don't have the overhead and insanity
that the Java string API introduces.
I think what I'm finding from writing this book
is there is a sweet spot, in fact,
that is much closer to the metal,
but also much saner than, you know,
what someone, some really smart person designed in a hurry 30 or 40 years ago.
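A thin wrapper of the kind being discussed might look something like this. It is a Python `ctypes` sketch that calls the C library's `strncpy` directly but refuses inputs that would not fit; it assumes a Unix-like system where `ctypes.CDLL(None)` exposes libc, and `safe_copy` is a hypothetical name rather than anything Scala Native provides.

```python
import ctypes

# Assumption: on Linux and similar systems, CDLL(None) exposes libc symbols.
libc = ctypes.CDLL(None)
libc.strncpy.restype = ctypes.c_void_p
libc.strncpy.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_size_t]


def safe_copy(src: bytes, capacity: int) -> bytes:
    """Copy `src` into a fresh C buffer of `capacity` bytes, or refuse."""
    if capacity < 1:
        raise ValueError("capacity must leave room for the terminator")
    if b"\0" in src:
        raise ValueError("embedded NUL would silently truncate a C string")
    if len(src) >= capacity:
        raise ValueError("source does not fit, including the terminator")
    dest = ctypes.create_string_buffer(capacity)  # zero-filled C char array
    # The checks above guarantee the copy stays in bounds and that the
    # final byte of the buffer remains a NUL terminator.
    libc.strncpy(dest, src, capacity - 1)
    return dest.value
```

The underlying call is still the raw C function with no extra allocation or encoding layer; the wrapper only adds the bounds checks that C leaves to the caller.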
Yeah. And we have more expressive languages, better compilers now,
like a lot more validation can be done.
Yeah, it's very true.
I think that that's probably most of my questions. Let me take a look here.
Is there anything that we didn't cover that you'd like to talk about?
Not that I can think of off the top of my head. I really appreciate you giving me the chance to talk about all this stuff.
Yeah, it's interesting. Like, I'm not a systems programmer, and I like the approach of the book
because it's easier for me to take in than reading something in C, I suppose. Like it's just, I'm used to Scala, so.
Yeah, like this book, it's not a reference
in the way that one of the brick-sized tomes is,
but I hope it does open this up to more people
than some of these older, larger, scarier books do.
So that's really good to hear.
Awesome.
Thank you for your time, Richard.
It's been fun.
Yeah, Adam, this has been awesome. Thank you so much.
That was the show. Thank you for
listening to the CoRecursive Podcast. I'm Adam Gordon Bell, your host. If you liked the show,
tell a friend about it to help spread the word. Or you can join our Slack channel or spray paint our website on the back of a bus.
Or maybe not that one.
This month on the CoRecursive Slack channel, there's been some interesting talks.
User GraphBloodwurst, great name by the way, has been working on an interesting project, Scala Z Schema.
And he has volunteered to explain context-free algebras to our group.
I have no idea what those are, but hopefully I will soon.
I looked at some of his PRs and it's a bit above my head, so I always have more to learn, but I
promised I would give him a shout out in the podcast. Also, user Joe, nice short name, he's
been asking questions about learning functional programming. He's been learning in F# and now
he's dabbling a little in Scala and trying to get
the feel for what is the best language to kind of learn some functional programming.
So these are the types of discussions happening in the CoRecursive Slack channel.
Check it out.
Until next time.