The Changelog: Software Development, Open Source - Reproducible builds and secure software (Interview)

Episode Date: February 3, 2017

Chris Lamb joined the show to talk about his project Reproducible Builds, which is funded by The Linux Foundation's Core Infrastructure Initiative. We talked about the importance of having a verifiable path from source code to compiled binary, what this set of software development practices is all about, what it means to have Reproducible Builds, the challenges faced when implementing these development practices, and the inherent security you gain from them.

Transcript
Starting point is 00:00:00 Bandwidth for Changelog is provided by Fastly. Learn more at Fastly.com. I'm Chris Lamb, and you're listening to The Changelog. Welcome back, everyone. This is The Changelog, and I'm your host, Adam Stacoviak. This is episode 237, and today we're talking to Chris Lamb about reproducible builds and the importance of having a verifiable path from source code to compiled binary. We talked about all the details of the project, what it means to have reproducible builds, the challenges faced when implementing these best practices, and the inherent security you gain from using them. We've got three sponsors today, GoCD, Linode, and our friends at Flatiron School.
Starting point is 00:01:08 Our first sponsor of the show today is our friends at GoCD. Head to gocd.io slash changelog to learn more about this awesome open source continuous delivery server. GoCD lets you model complex workflows, promote trusted artifacts, see how your workflow really works, deploy any version, any time, run and grok your test, compare builds, take advantage of plugins and more. Once again, head to gocd.io slash changelog to learn more. And now onto the show. And we're back.
Starting point is 00:01:43 We've got Chris Lamb joining us today. Jared, this show is one of those shows you have to listen to if you care about software security, making sure your source code matches the thing you actually embedded into your device, or whatever binary you ship out there. This came from a ping repo issue, though, actually presented by Chris. What do you think? Yeah, reproducible builds is the topic of the day.
Starting point is 00:02:05 And this show was very much Chris's idea. So you and I can't take any credit or any blame if there is any to assign. We'll see how it goes. But Chris pitched this to us and interesting topic. And one, Chris, that you think more people should know about.
Starting point is 00:02:24 So first of all, thanks so much for joining us on The Changelog. No problem at all. Very nice of you to have me on here. So in the spirit of getting to know our guests a little bit before we hop into reproducible builds and why you believe they're so important, we'd like to get people's origin stories and kind of find out where they're coming from. So can you tell us how you got into software and how you got to where you are today? Well, my software journey starts pretty early, I guess. I was brought up in the UK and in primary school,
Starting point is 00:02:53 so I may have been around seven or eight. I started experimenting with programming on the school computers, but it wasn't until a friend of my mum's, a technical guy, was clearing out his sort of hacker shed of old equipment. He was going to take it down to the local, you know, where you could get rid of old computers and stuff like that. But on the way he unfortunately had a small car accident, and the computer that was on the passenger seat went right the way through the windscreen into the field.
Starting point is 00:03:26 Police came, et cetera, et cetera. And they brought all of the equipment back to his house. You know, they just put it all back in his car, you know, whatever. You know, he wasn't going to continue on to the tip, as we call it, for some reason. Anyway, he went back. And this computer that was meant to be broken and old, he was this kind of person that, you know what, it would be absolutely magical if this computer now works. So he plugged it in and flipped the on switch. And lo and behold, for some reason, this car crash had actually resurrected this computer from the dead.
Starting point is 00:04:00 And he took this as a sign, so he couldn't throw it out then. And eventually it was basically in the way, being a doorstop. So he managed to offload it via my mum onto me. And it was this old, extremely old 8088 IBM computer. It was dreadfully old, even for the time I got it. But it had no games or anything on it. It just had a copy of Turbo Pascal and every 10th reboot,
Starting point is 00:04:28 for some reason, it would revert to BASIC, the BASIC programming language. There was a ROM built into the motherboard that for some reason, if the main operating system didn't boot, it would revert to a BASIC environment. So I got some books out of the library
Starting point is 00:04:44 and started programming my own basic things like that. And then eventually on from there, really, just sort of steady stayed up. Moved into some parallel programming, I guess. And by university, I was programming Python a lot,
Starting point is 00:04:57 C, C++, and doing, you know, the usual Java, blah, as university courses go and things like that. After university, I joined a startup in London and did that for two years. We were acquired. And then because we seemed to work together quite well as a team,
Starting point is 00:05:14 we decided to stick together and we did Y Combinator. And I was with that company for four years. What was that? This is a company called Thread.com. It's still a going concern. Really great guys. I just thought I sort of had enough of London by then and wanted a new challenge.
Starting point is 00:05:35 And this sort of freelance digital nomad lifestyle was sort of calling out to me. So I sort of jumped two feet into that. And that's what I've been doing for the last couple of years: doing freelance projects, doing a lot of Debian work, as in the operating system Debian, and all sorts of really interesting, varied projects sort of all around the world, really. It's been really fun. Digital nomad, that's a lot of fun. So you're... It's a pretty pretentious title, but yeah, of course. I mean, it's the dream, right? To travel the world and write code and or seek out your personal hobbies and fun stuff like that in all places.
Starting point is 00:06:17 That's a lot of fun. Yeah, it is really rewarding. Yeah, I can recommend trying it at least one time in your life. In fact, you're calling in today from New Zealand, so quite a ways. That's right, calling from Auckland, New Zealand. I'm looking over a beautiful bay right now, and yeah, it's a little chilly here, but it'll warm up. It'll warm up. And going back to the origin story of yours, I can't help but notice that you mentioned that every 10th boot went back to BASIC. I was just thinking, Jared, how much fun that might be, to be like having a
Starting point is 00:06:47 computer roulette, so to speak. Like, what will I program today because of the computer, and what will it force me to do? Yeah, that's interesting. I mean, to some degree, I wonder if all computers did that, like if you boot your Mac today and it's not a Mac, it's Windows or something, I don't know. That'd be great. Or it goes the other way, where instead of reverting to BASIC, it says, no, sorry, today, chaps, you can only program in Haskell. So, origin story, that's a fun piece there. What got you into open source? Where did that happen for you? Somewhere along the line I came across a book about Slackware Linux, and it came with a CD and things like that. And this is before the internet was, you know, of any
Starting point is 00:07:32 reasonable speed, and so you pretty much had to send off for the Linux distributions. And all my computers were always very old, so I was never really playing, well, playing and getting distracted by gaming and things like that. So I played around with the Slackware thing, but even that was very old. So there was a company in the UK called the Linux Emporium, and if you sent them, you know, sort of five dollars' worth, they'd send you, you know, the latest Red Hat CDs on seven discs or something ridiculous. And I'd heard of Red Hat. Oh, you know, reputable, blah, blah, blah, I'll get that. So I sent off for that. They also said, oh, we could include some free extra CDs if you want. Yeah, sure, whatever. I'm 13, I have no money,
Starting point is 00:08:12 so whatever. Send me as much free stuff as you like. Anyway, I went to install this Red Hat, and it said, oh, I'm sorry, sir, you need a very powerful computer, you need at least 12 megabytes of RAM to install Red Hat. And I think I only had eight on this rather lackluster machine. So I got rather annoyed, and so reached for one of these free CDs, which again were old for the time. They were free because they were the previous releases, and one of them was a very old release of Debian. And the whole operating system there just completely clicked with me.
Starting point is 00:08:51 Installing stuff was pretty simple. Installing the operating system itself. And I ended up using that for many years just as a user, running my own little web server between me and my cupboard. Just like, oh, this is amazing. But I didn't have the internet, so, you know, I can, wow, I can, you know, type in HTTP 192.168.0.2. But what would be on your web server that you could possibly want to,
Starting point is 00:09:19 like, would you write up there and then read it later? Or like, what kind of stuff would you even access in your own house? I don't really know what I wrote then, because that was in my own house, you're quite right. But I think it was copies of software I'd seen on the internet at school, like Perl-based guest books. They were all the rage at the time. That might be a way of dating when it was. It was also the beginning time of those short URL redirectors. So this is when you had domain names like i.am, so you would basically have a free redirection service that was i.am, you know, your name, and it would redirect you. So I was writing sort of Perl versions of those in CGI scripts. The good old days. Yeah, good old days. I know you're quite a prolific open source-er
Starting point is 00:10:08 in terms of, well, in terms of what prolific means. You have lots of open source code. You've been working on Django quite a bit. You've been a Debian package maintainer, I believe, or at least involved in the Debian project since 2008. On your GitHub, you have 216 repositories, and 129 of those are source repositories,
Starting point is 00:10:29 so you actually began all of these. Yes. What's the deal? Do you just code all day and all night, or how do you get so many things going? Well, a lot of these things are sort of spinoffs from other projects or perhaps from freelance work as well. So a lot of the Django tools I've done have been like, well, I think this will be,
Starting point is 00:10:47 you know, in the code base, this should be modular anyway. And as it's a completely reusable component, let's just remove it out there. And then it can become more generic, abstract, other people can contribute to it, et cetera. And it's sort of good to share back because et cetera, et cetera. So that speaks to most of the Django ones.
Starting point is 00:11:06 The other projects, a lot of them are just scratching my own itch. Like I wanted something to, I think I was looking for a new bike. And there's a sort of Craigslist in London called Gumtree. And so I knew exactly what size and what sort of make I wanted, so I made a script to poll it every five minutes and to send me an email when something that matched my specifications arrived. And so I was ringing up these people within five minutes of their advert going live. Oh yeah, is the bike still available? I've just posted it, mate. How did you get it so quickly? So a lot of these are scratching my own itch.
Starting point is 00:11:46 Some of them people use, some of them people don't use. But I find putting the code out there keeps myself honest. It also makes me follow through on projects a little bit better because there's some sort of vague accountability if you're putting it on GitHub. Not much because no one's looking over your shoulder. That's the basic idea. I wonder if everybody has the same
Starting point is 00:12:08 I don't know how to describe it, but like the fact that you do some freelance work or you've done freelance work over your career and instead of simply writing it into that code base that you're writing into, you think in a modular way and you think about the community. I wonder if that's just like
Starting point is 00:12:23 a common thought amongst developers, if that's, if that's something that like they need to hear something like your story to think like, I should do that too. If I'm writing software for somebody, you know, if I could bake that into my,
Starting point is 00:12:37 you know, my contract with them, like, Hey, if there's an opportunity for me to open source a module, whatever, you know, obviously I'll disclose that to you or whatever,
Starting point is 00:12:44 but baking that into the ability to be a freelancer and actually give back to open source. I wonder if that's just common knowledge to do that or it's common things to do. I think some of it depends on maybe your attitude and your outlook. So from my own personal perspective, I've always felt like I should only like open source the things that I think are great or useful or polished.
Starting point is 00:13:07 And that always leads me to not open source anything. The imposter syndrome, basically. Yeah. It's not really imposter syndrome. It's more like values, non-valuable. It's not like I don't belong here. It's just like, who would ever want to use this? That's imposter syndrome.
Starting point is 00:13:23 I don't know if it is. It doesn't really feel like it. It's an edge case of it. It's not that I don't belong here. It's just like, you know, maybe I just code for myself. So just to compare with you, Chris, just the other day I was writing a little script. You had your bike script that would check for you every five minutes. Mine was very similar, only it's for a cigar bidding website. Anyways, I like cigars. And so I'm just writing this thing, you know, that's just helping me get cigars at good prices. And like, I never even thought once to open source that. Like, it'll probably never leave my hard drive.
Starting point is 00:13:55 But you, on the other hand, you're like, I'm going to put this up on GitHub. Yeah, I think it also immediately solves the where do I put this file as well. Good point. Do I lose it in my directory, like a random directory structure? But if it's on GitHub, then it's kind of a backup, right? If you squint, it's a backup.
Starting point is 00:14:15 Right. I hear you say, Jared, it's on my hard drive, whereas if your hard drive dies today, Chris's GitHub hard drive does not. And even if maybe somebody doesn't find it useful or even desire to watch it or fork it or whatever or contribute, you know, it's still there. It's like Chris said, there's a backup there. And worst case scenario, somebody else is like, hey, that's a really awesome idea.
Starting point is 00:14:38 I love cigars too. And now you've got a new buddy. Yeah, sure. And also, like, generally, you know, it uses the Mechanize library, and it logs in, it does a few things, where if you would like to automate some things on the web, you could look at that little script, that to me doesn't seem of much value, and you could say, here's how you might do that, and you could tweak it to your own uses. Similar to maybe I could take your bike script and apply it to tricycles or something. I don't know why I came up with that
Starting point is 00:15:04 example. So now I'm talking myself into, I should open source some more stuff, basically. We should all just be, but aren't we like, somehow we're just maybe like heaving crap out there for other people to sift through, you know, like adding more noise to the ecosystem.
Starting point is 00:15:19 I think there's levels to open source, right? There's like infrastructure open source, which is like, in quotes, important, you know, and useful. And then there's other things that are sort of like tinker tools that sort of just embrace the inner kid in us, the playful manner. And there's a side of that playful manner that helps you get into the state of flow and helps you go beyond just simply learning. And it's like, right now you actually, you know, absorb what you're doing. And so it kind of brings out these different attitudes in the developer behind the code and those who interact with it. So I think there's room for that. I don't think there's, I don't think
Starting point is 00:15:54 it should all just be so serious sure i think in fact uh shout out to cody peterson who was our designer on changelog.com. You front-ender. He has this idea, which I'm sure everybody's had this idea, but he brought it to my attention of GitHub should have tags, arbitrary tags that you can assign to your own repos in order to provide context. You could tag something satire if it's a joke, or you could say this is a one-off,
Starting point is 00:16:21 or you could have all these different tags that would basically say, look, this was me messing around. It's not a serious project, you know. Or you could tag it, like, you know, the problem with tags is they're so arbitrary. Point being is, like, if we could classify our repos a little bit better in public, it might help.
Starting point is 00:16:38 What do you guys think about that? I think that'd be really good because then I think a lot of people wouldn't be making these decisions about that gray area of, well, shall I put it up there? It's probably not going to be useful. They just put it up there by default, not having to think about it, but just shove one of your tags on it saying, yeah, this is a bit of a toy. You know, it doesn't even work. It's broken now.
Starting point is 00:17:00 But it certainly has like more value out there than being on your hard drive and then it'll eventually die and you'll get lost. Yeah, absolutely. Well, let's get into reproducible builds. So give us the, I don't want to call it an elevator pitch because it's not a business, but it's a concept, it's a best practice. It's something that, Chris,
Starting point is 00:17:21 you think people should know about and do. It's also something you've been giving sessions on. You spoke at linux.conf.au in Australia recently, which is kind of why you're in the New Zealand area. So give us real quickly an understanding of what reproducible builds is, and then we'll come back from the break and we'll dive into it. Sure. So reproducible builds are a set of practices and a philosophy, all designed to ensure that there's a verifiable path from the source code to the binaries that are being run on your machine. The basic problem is that whilst you can inspect the source code of free software, most Linux distributions, Android, et cetera, provide pre-compiled binary packages. And so you need a way of being able to correlate
Starting point is 00:18:10 the binary that's running on a machine with the original source code. And this is particularly important in the modern era because there are incentives to crack build infrastructure. If you want to, you can go after a lot of users by attacking the developers. And if you can get some malware onto a developer's machine, you can infect all of their users in one go.
Starting point is 00:18:33 I never really considered that part of it, Jared, when we were doing the pre-call. It was like the attack on the actual developers. Yeah. I was thinking just simply source code in the binary that gets put on whatever and runs and how that gets circumvented, not the developer's machine or themselves. Indeed. And there's a psychological angle to
Starting point is 00:18:49 that as well. I mean, you can, I could hack someone's developer's laptop, for example, without their knowledge, but also I could come around their house with a baseball bat. I mean, it's pretty crude, but, you know, please include this backdoor in your software or blackmail and things like that. So all of these things protect developers from that happening. So it'll be of no value to threaten a developer with such things because anything they would do would be caught by the rest of the community. Well, let's push the pause button real quick. And on the other side of the break, we'll talk more about reproducible builds, why they're important, who's working on them, and what Chris thinks everybody should know and take away. So we'll be right back. We're working closely with our friends at Flatiron School to promote their free
Starting point is 00:19:38 online courses. They've got Bootcamp Prep, Intro to Ruby, Intro to JavaScript, and also Intro to Swift and iOS. In this segment, I'm talking with Kaylee Gray, an alumni of Flatiron School, who started with their free Intro to Ruby course. Then she enrolled in their online web developer program. And now she's working full time at FBS Data Systems as a developer in Fargo, North Dakota. Take a listen to Kaylee's story. I studied math primarily in undergrad, but I was also a computer science minor. So I've had exposure to programming, but before Flatiron, I was pretty timid as far as programming goes.
Starting point is 00:20:18 I definitely didn't have much confidence in that arena. After Flatiron's Intro to Ruby course, I felt more confident in my ability to pursue programming as a full-time career. One of the things that I liked about Flatiron's Intro to Ruby course was that I was forced to use the terminal, which up to this point had been daunting to me. So it was really empowering to feel like I could go in and make these changes and program these things
Starting point is 00:20:52 that I didn't really know I could do. If you're like me and you're curious about programming, but you're feeling a little unsure that it's something that you can do, you can try Flatiron for free and see if it's right for you. And you'll probably like it. This is great. All right.
Starting point is 00:21:11 There's nothing I love more than a success story. And Kaylee is an awesome example. You can follow in her footsteps. Head to Flatiron500.com to learn more and enroll. These courses are totally free to enroll. The bootcamp prep course is only available to 500 students. So if you're considering this, do it today. Once again, head to flatiron500.com to learn more and enroll and tell them the Changelog sent you. All right, we are back with Chris
Starting point is 00:21:37 Lamb talking about reproducible builds. And Chris, we gave it a definition before the break. Like we said in the intro, you opened up this idea of saying more people need to understand this as something that's important for various reasons. Can you reiterate a little bit exactly what reproducible builds is and then again, why they're so important? And we'll kind of dive in from there. Sure. No problem. So this isn't about reliable builds or repeatable builds or anything along those kind of lines. It's really about ensuring that a user or developer can confirm that the binaries that they're running on their system correspond to the source code
Starting point is 00:22:18 they're expecting to be run on their computer. So if you kind of wind history back to Richard Stallman's early ideas about being able to run software on your own computer: whilst you can get the source code for a free software operating system, et cetera, most of these distributions are providing binary packages to you that are being compiled by someone else or on different build farms. And it's really important that no inadvertent, malevolent or accidental changes have been introduced along that path. There was an example given a few years ago of an OpenSSH binary that differed just by one bit of one byte,
Starting point is 00:23:02 which changed a greater-than-or-equal-to comparison to just a greater-than. And just that one bit meant that you could have a root exploit. So the difference is, I mean, if you ran them through a diff tool, you'd only see that one byte change, that one bit change. Yet one would be secure and one would be, well, hopefully secure. One would be hopelessly insecure with that root, with the backdooring, and one would be hopefully a little bit more secure. So reproducible builds prevent these changes being added behind your back as a user.
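To make the single-bit example concrete, here is a minimal Python sketch. It is not the actual OpenSSH case: the byte values and the helper name `differing_bits` are made up for illustration. It lists exactly which bits differ between two equal-length binaries and shows that their SHA-256 digests no longer match.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> list[tuple[int, int]]:
    """Return (byte_offset, bit_index) pairs where two equal-length blobs differ."""
    if len(a) != len(b):
        raise ValueError("blobs must have the same length")
    diffs = []
    for offset, (x, y) in enumerate(zip(a, b)):
        delta = x ^ y  # set bits mark the positions that differ
        for bit in range(8):
            if delta & (1 << bit):
                diffs.append((offset, bit))
    return diffs


# Hypothetical five-byte "binaries" that differ in the lowest bit of the last byte.
original = bytes([0x7F, 0x45, 0x4C, 0x46, 0x7D])
tampered = bytes([0x7F, 0x45, 0x4C, 0x46, 0x7C])

flipped = differing_bits(original, tampered)
same_digest = hashlib.sha256(original).digest() == hashlib.sha256(tampered).digest()
print(flipped, same_digest)
```

A byte-level diff would report only one changed byte here, which is the point of the anecdote: a tiny difference in the binary can still change behaviour, and only a comparison against an independently built binary reveals it.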
Starting point is 00:23:40 At what level does the reproducible build take place? Is it like, you know, you've got your list of who's involved, and it involves various levels of Linux, Bitcoin, things like that. Is it us trusting them to say they adhere to reproducible builds, and that's what gives us faith and trust? Or is it a different level? I think it's on a different level. It's sort of a community set of tools, practices and things like that. If you jump into the details, perhaps reproducible builds
Starting point is 00:24:08 can be quite a misleading term. I mean, code provenance might be a better way of phrasing it and things like that. The way we use the reproducibility is that we ensure that compilation of any piece of software always has identical results. So that means if you run GCC on a C file, you get an ELF binary at the end of it. And if you reran that compilation process,
Starting point is 00:24:35 you'd get the exact same ELF binary. The MD5 or SHA-1 checksum would be just identical. Then what happens is that you ask multiple other parties to do their own builds of this same source code, and then you get together, hopefully electronically, and compare your results. So if I got results, you know, one two three four, assuming that's the checksum, and you got one two three four, and everyone else got one two three four, we can pretty much agree that if you compile this source code, you should expect this binary. And if someone came along saying, oh, I get one two three five, you would have an inkling that something was different about his
Starting point is 00:25:15 build environment. He could have been hacked, he could have something breaking his compiler, and things like that. But basically there's just something fishy going on that warrants further investigation. So that's where the reproducibility comes from. So ensuring that everyone gets the same result is where the word reproduce comes in. So if someone can reproduce your build, that's where that verb gets added there.
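The comparison step described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the builder names, the fake build outputs and the helper `find_outliers` are hypothetical, and a simple majority vote stands in for however a real project would reconcile reports.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a build artifact held in memory."""
    return hashlib.sha256(data).hexdigest()


def find_outliers(reports: dict[str, str]) -> tuple[str, list[str]]:
    """Return the most common digest and the builders who reported something else."""
    counts = Counter(reports.values())
    consensus, _ = counts.most_common(1)[0]
    outliers = sorted(name for name, digest in reports.items() if digest != consensus)
    return consensus, outliers


# Hypothetical artifacts: three builders agree, one produced a different binary.
good = sha256_hex(b"binary built from source v1.0")
bad = sha256_hex(b"binary built from source v1.0 plus something extra")
reports = {"alice": good, "bob": good, "carol": good, "dave": bad}

consensus, outliers = find_outliers(reports)
print(consensus == good, outliers)
```

An outlier is not proof of compromise. As the conversation says, it only signals that something about that build environment differs and warrants further investigation.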
Starting point is 00:25:43 Hasn't it been for a very long time that when package managers or anybody who's pre-compiling binaries and releasing them publishes their checksums alongside the downloads so that you can download the file and then run your checksum
Starting point is 00:25:58 and make sure it matches theirs. Isn't that, is that basically what you're talking about? Or this is just another level of saying, okay, well, that was two computers. We're going to do it on thousands and make sure that it's always the same? Yes, pretty much.
Starting point is 00:26:12 I mean, I think when a developer has an extra checksum file, what they're trying to do, if it's just a SHA-1 checksum, for example, that's typically only to ensure that an end user can validate whether the download completed successfully. So for a very large ISO image, it's very useful to say, oh yes, it did download correctly. So I think that's a different intention there. But you're
Starting point is 00:26:37 right. I mean, if you had a hundred different checksums that people have provided, it is pretty much like that: I built this piece of software and I got this checksum, and then multiple people did the same thing. It doesn't provide any authenticity, so you would need to pair that checksum with, say, for example, a GPG or PGP signature, you know, to sign that binary, just to say that I, Chris Lamb, generated this binary. You see what I mean? So you need to be very wary about what these checksums are actually claiming about the source code.
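As a rough sketch of what a published checksum does and does not tell you, the Python below checks a download against digests from more than one algorithm. The payload, the `published` dictionary and the helper names are assumptions for illustration. Nothing here verifies who produced the file, which is the job of a signature.

```python
import hashlib
import io


def file_digests(stream, algorithms=("sha1", "sha256"), chunk_size=65536):
    """Hash a binary stream once, feeding every chunk to each algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_published(stream, published):
    """True only if every published digest matches the downloaded bytes."""
    return file_digests(stream, algorithms=tuple(published)) == published


# Hypothetical download and the digests its publisher listed next to it.
payload = b"pretend this is a large ISO image"
published = {
    "sha1": hashlib.sha1(payload).hexdigest(),
    "sha256": hashlib.sha256(payload).hexdigest(),
}

complete = matches_published(io.BytesIO(payload), published)
truncated = matches_published(io.BytesIO(payload[:-1]), published)
print(complete, truncated)
```

A match says the bytes arrived intact and correspond to what was listed. It says nothing about whether the listing itself came from the person you think it did.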
Starting point is 00:27:12 Yeah, and just to explain it, and you can help me if I don't have it correct, but I think I'll lay out in terms of the checksumming. A checksum is a one-way hash that's run on the binary. That's right, yeah. It'll always produce the exact same fingerprint on the other side. The problem with that, especially as cryptographic algorithms get torn down over time, is that while that exact same binary will always reproduce the exact same checksum, depending on your
Starting point is 00:27:39 algorithm, there are other binaries that can also produce that exact same checksum. And so we call them hash collisions. And so that's why it's not giving you the level of confidence that it's secure. It's simply a tool that you can use, like you said, to say, okay, I did get the file all the way downloaded. It's not corrupted or there's no issues. So while people think that those checksums are, like, giving us some sort of security confidence, they actually aren't. Is that fair? That would be fair, yes. You can immediately make them a little more secure by providing multiple checksums, particularly from different families of cryptographic algorithms. So I mean, the advice for years has been to stop using MD5, right, and things like that. And if you provide multiple SHA-1, SHA-256 checksums,
Starting point is 00:28:28 you can start to be pretty confident that your download completed successfully and things like that. So give us the doomsday scenario where we all go away from this conversation thinking, well, Chris has an interesting point about this reproducible builds thing, but I'm not convinced. I don't care.
Starting point is 00:28:48 And so us as a community, we don't care. We know that's not the case because we have the core infrastructure initiative is supporting this and lots of distributions. And a lot of people do care. But let's say that we just don't get it done and we don't have reproducible builds. What's the worst thing that could possibly happen in terms of hacks or security problems? Use your imagination if we can't have this guarantee. Well, one advantage is I don't have to use my imagination. Some of them have already happened, although in small isolation. So for example, someone used social engineering to offer a backdoored iOS software developer kit for download.
Starting point is 00:29:27 Maybe they bought a Google ad that looked like the download link, you know, that kind of thing. So a whole bunch of developers downloaded this, and it worked completely like the normal iOS SDK, except it would replace their adverts, as in any adverts that the developer added, with the attacker's adverts. The idea was just to make money, really.
Starting point is 00:29:51 And so these developers were happily making their iOS apps. Brilliant, you know, fine. And then they went to upload them to the Apple store and they signed them. And so the signing process was completely
Starting point is 00:30:06 accepted. You know, Apple said, yes, this is you. The signing stuff checks out absolutely fine. But because their software development kit was backdoored, at the very last moment, it would just simply replace the adverts with the ones from the attacker. And they only really noticed when they weren't really getting any ad revenue back. But you can quickly imagine what would happen if the code wasn't necessarily to just replace adverts, which sort of sounds a little bit harmless, should I say.
Starting point is 00:30:35 Ignoring the economic blah, right? But what if it was sending your address book or things like that? The original developer would swear blind they're doing nothing wrong. And from their point of view, they are innocent, apart from perhaps some rather lackadaisical security on their part. But you wouldn't really know who to point the finger at. And so that's pretty much the sort of worst case where you have no idea who these attackers are.
Starting point is 00:31:00 You have no idea where the software is going. It sort of seems to bypass a lot of security features that were put in place, like the signing, which is entirely designed to prevent arbitrary code being uploaded to these repositories and things like that. So yeah, that's pretty much the doomsday scenario. Another thing that makes reproducible builds quite salient in modern times is that some of the Snowden revelations refer to using backdoored compilers in a similar way in order to infect machines and things like that. This is something that the NSA have been, well, we know for certain they've been looking at it because of some documents released via Edward Snowden. So yes, I mean, the doomsday is sort of
Starting point is 00:31:48 here already in a sense, but we just don't really know how pervasive it is. Yeah. It is particularly insidious that you're not coercing. You don't have a bad actor. The developer doesn't have to be the bad actor if you can infect the developer tools or the developer, you know, pipeline in any way. And
Starting point is 00:32:10 then at the point that the attack is successful, like you said, it's very difficult to trace back to the original threat vector when the developer is ignorant of anything going wrong. Yeah, they're usually the one to blame. That's an interesting thought, that you can do that with the iOS SDK and do something, like you had said, so harmless, but it could have been something so harmful, like an address book or bigger exploits or whatever.
Starting point is 00:32:35 But that actually takes place. But we all day-to-day utilize some sort of software, whether it's open source or not, in a way where we just sort of like inherently trust it. I don't know how often either of you do MD5 hashes or any of this thing that you could do to sort of like determine if it's truly
Starting point is 00:32:56 what you should be using. How often do you do this, Jared? Is this something you do day to day, or how often do you check the software you're actually running? So I used to use the checksums, as Chris said, when I would download a large file. And I used to do it thinking it was a security thing. So a lot of people believe that, and this was a little bit of a misunderstanding, to think, oh, I'm more secure because I do this step, right? And by the
Starting point is 00:33:20 way, the other problem with that step of, I'm going to download a file from this web page and then I'm going to run the checksum to make sure it's the same, is that if a hacker actually hacked the web server insofar as they could replace the binary that you're downloading, right, they also could have very easily changed the checksum to match that binary. Right. So it's completely not a security thing, even though I used to think it was for a long time. And so that's more when I used to do it. But also back in the days where, you know, a 600 meg Debian ISO was like an all-day download. You wanted to make sure that it worked right.
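The integrity check described here, together with the earlier advice to publish several digests from different algorithm families, could be sketched in Python roughly as follows. This is only an illustration: the function names `file_digests` and `matches_published` are hypothetical, and, as the speakers point out, a matching checksum only says the download is intact, not that the file or the published checksum can be trusted.

```python
import hashlib


def file_digests(path, algorithms=("sha1", "sha256"), chunk_size=1 << 20):
    """Return {algorithm: hex digest} for the file at `path`, read in chunks."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Read the file once and feed every hasher, so large ISOs are not
        # loaded into memory and are not re-read per algorithm.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_published(path, published):
    """True only if at least one digest is published and all of them match."""
    if not published:
        return False
    local = file_digests(path, algorithms=tuple(published))
    return all(local[name] == value.lower() for name, value in published.items())
```

A caller would pass the digests copied from the download page, for example `matches_published("debian.iso", {"sha1": "...", "sha256": "..."})`, where the file name and digest values are placeholders.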
Starting point is 00:33:56 And so I would do it back then, but I don't do anything. I'm very security lax, sadly, in my current. Well, how much binary code are you running these days where you would check it? Like, how often are you either using anything that's binary where this plays into where you take a source code down to one file or something like that? Well, I mean, nowadays Homebrew has a lot of precompiled binaries. Right. And I assume Homebrew has some built in, you know, I know there's some certificate checks and stuff going on there.
Starting point is 00:34:28 Chris, you can probably talk to Debian's process since you're involved in the Debian package maintenance, but what kind of security is in place around package managers where people are pre-compiling binaries and then, you know, we're downloading them and using them? Debian and by extension Ubuntu and Linux Mint, et cetera, internally use GPG signing. So there's a known web of trust. So whilst it does validate the checksums of the files when it downloads them, that's simply for integrity. In other words, has the download completed successfully? But there's an additional step, which is documented. If you search for apt secure, there's a quite interesting wiki page on the Debian wiki about it. In a nutshell, it basically uses GPG signatures and a key ring of trusted keys to say, okay,
Starting point is 00:35:15 the checksum of this file is X, and we have a valid signature that's in the key ring. So therefore, we can trust this file to that degree. You see what I mean? Yeah. And you're completely right. That prevents your example of the attacker being able to get into one of the many Debian mirrors and replacing both the checksum and the binary. That would fail because they would not be able to forge the signature, the extra step. Right. So there's that. And then if we go into, like, the Apple side of
Starting point is 00:35:46 things with the iOS, you know, App Store and whatnot, those are all developer signed. You have to have your own certificate and sign your binaries before Apple will accept those in, and there's a web of trust that they create in there as well. So there are things that are in place, but what you're advocating for with reproducible builds is even more guarantees, not just that these binaries are trusted, but that we can verify their origin, the source code that originated them, in a reproducible way. That's right. That's a good summary, because whilst, using the
Starting point is 00:36:24 checksum and a signature, I know that I've got this binary from, say, you, and it's like, brilliant, I know exactly who I've got it from, there's no guarantee that that corresponds to any particular source code that you claim it belongs to. I have to take your word for it. You have to actually trust me. Yes, in that sense. Yeah, yeah. With the reproducible build setup, you could provide the source
Starting point is 00:36:46 code and that binary, and I could not only compile it myself to say, yeah, okay, it does check out. I could also ask multiple third parties to perform that same step. And then I can start to trust you and be saying, yes, this
Starting point is 00:37:01 checksum with this signature does correspond to this particular source code. So it's sort of extending that one level back. Right, so you gave it to us before, but now that we've, I feel like we've kind of wrapped our heads around it a little bit more, explain it to us again in terms of the process.
Starting point is 00:37:19 So reproducible builds is not like a feature that you check box. It's a set of practices that you can operate under as a development community, right, that gives you this verifiable path back to the original source code. Describe it to us again, the steps that get put in place before we can say, you know what,
Starting point is 00:37:39 this is verifiably reproducible. Sure, so the steps are, the first step is you ensure that your source code always produces the same result in a bit-for-bit identical way. So this is removing any timestamps, any variations based on
Starting point is 00:37:56 your time zone that you're in, any non-deterministic behavior, any randomness and things like that. Basically, so if anyone took the same source code and recompiled it themselves, they would get the exact same binary out that was completely bit-for-bit identical. Then you ask multiple parties or multiple build servers or distributed around the web, different isolated environments, perhaps, to compile that same source code. And if they get the same result,
Starting point is 00:38:28 if everyone gets the same result in that same binary that you got, then you can start to say that, oh yes, this binary here corresponds with the original source code. And therefore you can make this claim that, as it's very unlikely an attacker has infected everyone simultaneously, this really is the binary you get when you compile the source code.
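The two steps described here (make the build bit-for-bit deterministic, then have several independent parties rebuild and compare) could be sketched in Python along these lines. This is a toy under stated assumptions, not the tooling Debian or any other project uses: the "build" is just packing files into a tar archive, and `build_archive`, `consensus`, `fixed_mtime` and `minimum_agreeing` are hypothetical names chosen for illustration.

```python
import hashlib
import io
import tarfile
from collections import Counter


def build_archive(files, fixed_mtime=0):
    """Pack {name: bytes} into an uncompressed tar with normalized metadata."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        for name in sorted(files):          # fixed ordering, not dict/filesystem order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = fixed_mtime        # no wall-clock timestamps
            info.uid = info.gid = 0         # no builder-specific ownership
            info.uname = info.gname = ""
            info.mode = 0o644
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def checksum(artifact):
    """SHA-256 hex digest of a built artifact."""
    return hashlib.sha256(artifact).hexdigest()


def consensus(reported, minimum_agreeing):
    """Return the checksum reported by at least `minimum_agreeing` builders, else None."""
    if not reported:
        return None
    value, votes = Counter(reported.values()).most_common(1)[0]
    return value if votes >= minimum_agreeing else None
```

With this sketch, independent builders who start from the same files get the same checksum regardless of the order in which they add them, and a downloader would accept a binary only if its checksum equals the value returned by `consensus`.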
Starting point is 00:38:50 There isn't any nefarious goings on and nothing has been introduced along the development tool chain. So who all is involved in this process? We mentioned before, but you've been awarded a grant from the Core Infrastructure Initiative to work on reproducible builds. Is this something that the Free Software Foundation is working on? Is it the Linux Foundation? Is it Debian? Give us a lay of the land on what actors think this is important or actually putting efforts towards putting these systems in place for a lot of our underlying operating systems and other things. Well, it's quite a diverse group of projects, really.
Starting point is 00:39:29 I mean, you can find some old mailing list posts about people sort of attempting this in the mid-90s, but it wasn't really on anyone's radar as a security vector for a while. And after the Snowden revelations and the iOS, et cetera, a lot of people started getting interested in it. Debian was perhaps at the forefront of the distributions, certainly putting a lot of the initial activity into reproducible builds. But now we're a completely distribution and project agnostic initiative and endeavor.
Starting point is 00:40:06 It's the Linux Foundation that are very generously funding my time and others' time to work on this. But there's all sorts of distributions involved now. We have SUSE, Fedora, Guix, Arch Linux, and a bunch of BSDs as well, so it's not even Linux-only. But we also have projects such as Bitcoin and Tor. I mean, you can imagine the incentive to crack the binary of a Bitcoin wallet. If you could upload a backdoored Bitcoin wallet and replace the developer's version, then you would become rich fairly quickly. You see what I mean? That's my plan.
Starting point is 00:40:47 That's your plan. That's your retirement fund. I don't even know why you're saying this. I mean, you're telling everybody my plan. I'm just going to skim a little bit off of everybody's Bitcoin wallets and become rich. Yeah.
Starting point is 00:40:58 Just a fraction of a cent every transaction. Yes, sir. They'll never notice. But it's pretty scary. I mean, to go back to the psychology earlier, I mean, imagine being that developer. You're sitting at home,
Starting point is 00:41:08 you know, how much money could you make by adding a backdoor to that Bitcoin wallet? How much would it cost to hire a bunch of heavies to go around to this house?
Starting point is 00:41:18 I think the economics would work out. The moral economics perhaps wouldn't. But just in terms of money, just, yeah, very scary. So if I was that person, I wouldn't want to put myself in that situation. I wouldn't want to put myself at risk for being targeted in that way. And before we go into the break, I want to ask maybe a hypothetical.
Starting point is 00:41:39 Maybe I'm just being naive when asking this because I don't do this too often, but it sounds like reproducible builds is a philosophy and a set of best practices that enables you to verify this binary from a source. And often we have the option to pull down a compiled version or a pre-compiled version of whatever we're using. Why not just opt to compile yourself? And that essentially, if you're compiling from source, you're essentially doing the same thing as reproducible builds. That's right.
Starting point is 00:42:14 It's still a convenience factor. Right. People don't do that too often. No. Gentoo users would disagree, right? They would disagree, but they would disagree on many things. But I mean, for example, do you really expect your phone to compile the software before installation? I mean, I wouldn't want my phone to have to sit there and compile Chrome before it gets installed.
Starting point is 00:42:41 For example, that would be a little bit inconvenient. I'm thinking more at the developer level. Drain your battery, Adam. I'm not thinking at the end user level they're going to do that because that's just too much to ask any user to do. I'm thinking at the developer level. Maybe I'm closed-minded
Starting point is 00:42:57 and only thinking of this in one lens, but so far the concern here had been installing a Linux version or iOS SDK. I was just trying to play the devil's advocate of why would you just not compile yourself? I guess if it's a developer tool. Yeah, I think convenience is a huge aspect of that. I don't know when it was,
Starting point is 00:43:18 but coming from the Mac side, Adam, like I said, Homebrew now has pre-compiled binaries for lots of packages that you install often. And so if you have to compile Postgres from source every single time you have a point update, you know, depending on your machine, it could take 10 minutes, it could take 40 minutes. Who knows, right? And so if every single piece of software that you run, you're going to compile from source, how far does that go? Do I also have to configure it myself and make sure I've actually configured it correctly?
Starting point is 00:43:46 I think there's a huge inconvenience there. Well, it may be missing a bigger picture here in that the security affordances that reproducible builds provide should apply to everyone, really, to all users on any technical spectrum. So they don't want their to-do list app and things like that that they're just using as a thingy
Starting point is 00:44:06 to have any backdoors in. Well, I think the bigger picture there is just trying to figure out, I'm just thinking if I'm a developer and I'm going to use something that has a binary, why don't I just compile it myself? And on the arguments you've made there
Starting point is 00:44:20 about the conveniences and affordances, if on every point release of Postgres I'm going to recompile a new version of it, that's probably a big pain in the butt. I'm just trying to figure out, I guess, is reproducible builds, this philosophy, these best practices, enabling me as a developer to have the ability to reproduce it if I wanted to, and that's the security, or is it?
Starting point is 00:44:43 Okay. So if I wanted to take the convenience hit to actually compile it myself, I could to prove that what I've gotten is coming from the source. Indeed. And what our goal is in the reproducible builds project is that there are enough people out there already building the software that you can simply rely on those people to provide you with a checksum that has consensus across say 20 or 30 different people. And so you would never really have to rebuild anything yourself because all of these 30
Starting point is 00:45:17 other people agree that the binary should be X. You also have binary X. I'm happy with that. That's fine for me and things like that. And that also speaks to the end users as well. So they don't have to compile software themselves necessarily. They can if they want to, but if they just want to install a random app on their phone, some sort of to-do list, for example, they can trust that the 30 or 40 rebuilders, as we might call them, agree on a particular checksum. And as they've got that same checksum, they're happy installing that and saying,
Starting point is 00:45:52 okay, cool, the binary I've got corresponds with this source code. There isn't any nefarious, nasty stuff being added somewhere in the mix along the line. You've certainly given me a speckle of fear when it comes to installing potentially nefarious apps in the App Store, because there are times I want to use an application in a genre, like, for example, recently with music or something like that, and I'm just like, I don't know if I should trust any of these people, because I don't know
Starting point is 00:46:18 any of the brands. The design isn't that great. So there's some known trust factors you sort of apply to potentially trustworthy software. And that doesn't exactly define security or trust, like by its look, but it certainly helps if you care enough about its design that it's trustworthy. But, you know, just in general, you've given me this slight fear that somebody out there is using a hacked version of iOS that's replacing ads and/or stealing my data, and now I have complete fear. But let's not open that can of worms. My fear's out there. The world knows about it. Let's break. When we come back, we'll talk about other advantages beyond security, things like that. So we'll break here. When we come back, we'll go into that with Chris. We'll be right back. Linode is our cloud server of choice.
Starting point is 00:47:06 Everything we do runs on Linode servers, the most efficient SSD cloud servers on the market. And you can get your own Linode cloud server up and running in seconds with your choice of Linux distro, resources, and node location. They've got eight data centers all across the world, North America, Europe, Asia Pacific, and plans start at just $10 a month. You get full root access for more control, run VMs, run containers,
Starting point is 00:47:30 a private Git server, enjoy native SSD cloud storage, a 40 gigabit network, Intel E5 processors, super fast. Use the code CHANGELOG2017 for a $20 credit. Unlimited uses. Tell your friends. Once again, CHANGELOG2017.
Starting point is 00:47:47 Head to linode.com slash changelog. And now back to the show. All right, we're back with Chris Lamb talking about this awesome thing called reproducible builds. You need it to have secure software. And maybe it's just a pain in the butt to compile from source. As I learned today, you can't do that every day. But Chris, take us through some other advantages. I mean, obviously, you got some security advantages here.
Starting point is 00:48:10 Where else should we go for this to help explain to the community why this is so important? I think the biggest non-security advantage is, given that every time you rebuild the software, you should get the same result, it means that if you make a tiny change to the source code, you should expect a correspondingly tiny change in the resulting binary. And only those changes should be apparent in there. You want to do a new release of software, and you want to make sure that this new release only contains the changes that you want. Reproducible builds make it very easy to compare your new version with the previous version. And if you compare them, you should only see the changes that you expect.
Starting point is 00:48:53 We've even written a very good tool for this called diffoscope, which can recursively unpack binaries and things like that and look inside them and provide a human-readable view on a particular binary. For example, it will decompile Java files and things like that, and pretty-print Python source code and JavaScript source code and things like that, which makes it very easy to say, okay, I've released a bug fix release for this particular thing, and only the changes I expect are in this new release. So that's fine, I'm happy to push it out now. This is a massive boon for anyone doing security releases, for example. But it's also, if you just want to have really good quality assurance, you want to ship something to your users, you don't have any inadvertent changes like, oh yeah, it pulled in
Starting point is 00:49:40 this extra dependency, whoops, and now it's broken everything. Oh, sorry. If you just change one line of code, you should kind of just see that reflected in the resulting binary. The other advantage when things always build to the same result is that, just by design, you get a better cache hit ratio. When you speak to the guys at Google about this,
Starting point is 00:50:03 they're saying this is saving them thousands and thousands and thousands of dollars simply because when they compile a large piece of software, many parts of it haven't changed. And as they're reproducible, they will always produce the same result. And so therefore, it just says, well, there's nothing new to compile. So therefore, I don't need to do anything. So you can reuse the previous result. So this not only saves developer time, it's saving the company money, but also it's saving the environment in a sense, because you're not wasting CPU cycles and generating CO2 and things like that.
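The caching benefit described here can be illustrated with a small content-addressed build cache in Python. This is a hypothetical sketch, not a description of how Google's build infrastructure works: `BuildCache` and `compile_fn` are made-up names, and the key point is only that when a build is deterministic, identical inputs can safely reuse a stored output.

```python
import hashlib


class BuildCache:
    """Toy cache keyed by a hash of the build inputs (source bytes plus flags)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(source, flags):
        """Derive the cache key from everything that may influence the output."""
        hasher = hashlib.sha256()
        hasher.update(repr(tuple(flags)).encode())
        hasher.update(b"\0")
        hasher.update(source)
        return hasher.hexdigest()

    def build(self, source, flags, compile_fn):
        """Return the cached output if the inputs are unchanged, else compile and store."""
        key = self.key_for(source, flags)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        output = compile_fn(source, flags)
        self._store[key] = output
        return output
```

Reusing a cached result is only safe if `compile_fn` is reproducible. If it embedded the current time or a random value, the cached output would differ from what a fresh build would produce.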
Starting point is 00:50:41 Further technical advantages are, by design, it removes any unreliable or non-deterministic behavior in the development process. So if you really want to get the same result, your build can't rely on anything that's based on timing. So any quote unquote unit tests that do, for example, using time to check whether something should run in a particular speed or in a particular ratio of inputs to output time, algorithmic complexities, as it were, that becomes unreliable and therefore non-deterministic and therefore can't be part of a reproducible build
Starting point is 00:51:16 and things like that. It's a good way of finding bugs in uncommon time zones and locales. So the two or three times I've come across it, it's been Ruby libraries, so I'm not sure. But a few Ruby libraries that have been designed to manipulate dates and times, their test suite fails when you run it in, for example, UTC plus 14,
Starting point is 00:51:41 which should be a little worrying because this is a library that the developer might be using to say, okay, I know time zones are difficult and date processing is difficult. So I'm going to leave it to this library and the library doesn't actually work in these strange locales and things like that. How would the reproducible builds help you track down
Starting point is 00:52:02 those specific time zone based bugs then? So within the Debian project, we have a reproducible torture chamber test. So what we do is we build every piece of software in Debian, and there's 23,000 different source packages there, to give you a scale of what we're talking about. We build it twice, one after the other, the A build and the B build. And we try and vary as many things between those two builds as possible.
Starting point is 00:52:32 So for example, the second build will be on a different CPU type. It will be done a few years in the future. We just set a fake time and things like that. We change the shell, we change the path environment, we change all the environment variables we can possibly change, your username, anything we can think of, we change the file system. Basically anything you think of, we try and change. And this hopefully surfaces as many differences that would affect reproducibility.
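The A build / B build idea can be sketched in Python as a toy simulation. It does not reflect how Debian's test infrastructure is implemented: the environment is modeled as a plain dictionary with hypothetical keys (`user`, `now`, `utc_offset_hours`), and `naive_build`, `careful_build` and `same_output` are made-up names.

```python
import datetime
import hashlib


def naive_build(source, env):
    """Embeds the builder's name and local build time, so output depends on env."""
    zone = datetime.timezone(datetime.timedelta(hours=env["utc_offset_hours"]))
    stamp = datetime.datetime.fromtimestamp(env["now"], zone).strftime("%Y-%m-%d %H:%M")
    header = f"built by {env['user']} on {stamp}\n".encode()
    return header + source


def careful_build(source, env):
    """Ignores the environment entirely; output depends only on the source."""
    return b"built from source\n" + source


def same_output(build, source, env_a, env_b):
    """Run the build under two deliberately different environments and compare."""
    digest_a = hashlib.sha256(build(source, env_a)).hexdigest()
    digest_b = hashlib.sha256(build(source, env_b)).hexdigest()
    return digest_a == digest_b


# Two environments that differ in as many respects as this toy model allows:
# a different user, a clock set over a year ahead, and an unusual UTC+14 zone.
ENV_A = {"user": "builder1", "now": 1486080000, "utc_offset_hours": -12}
ENV_B = {"user": "builder2", "now": 1486080000 + 400 * 86400, "utc_offset_hours": 14}
```

Calling `same_output(naive_build, source, ENV_A, ENV_B)` flags the naive build as unreproducible, while `careful_build` passes. A real harness would vary the actual machine, clock, locale, shell and file system rather than a dictionary.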
Starting point is 00:53:03 So we want to make sure that any end user can compile a software on their own machine regardless of their own local environment and get the same result. And so this is a way of reducing the set of variations that would actually result in a variation in the end binary. And this also shows up some of these QA advantages as well. I mean, that would help us suss out compile time differences, but what about runtime? Like, I'm thinking
Starting point is 00:53:30 about things that are JIT compiled, like Ruby, for instance. How does it suss out those problems? If you're packaging Ruby things, you're quite right. As they're JIT compiled, what gets distributed then is the Ruby source code. So although you have to squint, the binary for a Ruby package is actually still text-based Ruby code and things like that. Saying that, it can still surface interesting things. And this one just happens to be a security one: there was a repository browsing tool that had an OpenID-based login system, and during the build process of it, it was generating an OpenID secret. You know how it's based on a secret that the server knows about, and it uses public key, Diffie-Hellman, et cetera, style cryptographic
Starting point is 00:54:29 algorithm to validate logins are secure. So during the build, it would generate a random number, and that would get put into the binary package. Unfortunately, this meant that every installation of this browsing tool would share the same secret. Yikes. Because, yes, this was surfaced in our QA torture environment
Starting point is 00:54:51 because in the A build, it would generate secret 1334 and in the B build, it would generate the secret, you know, 1943 or whatever. And we would flag that up as, oh, it's different
Starting point is 00:55:04 between the A and B build. What is different? It's some sort of secret key. Oh dear, this should not be the same for all the packages that get built. You guys have a reproducible build torture chamber. That sounds terrible. I like the name.
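The bug class in this story, a random secret generated at build time and therefore shipped identically to every installation, can be sketched in Python as follows. The function names and the `SECRET_FILE` path are hypothetical and do not describe the actual tool being discussed. The point is that a build which calls a random generator cannot be reproducible, and that the usual remedy is to generate the secret on each machine at install or first run.

```python
import hashlib
import secrets


def build_with_baked_in_secret(source):
    """Problematic: the secret is created once at build time and shipped to everyone."""
    secret = secrets.token_hex(16)
    return source + b"\nSECRET=" + secret.encode()


def build_without_secret(source):
    """The package only records where the per-installation secret will live."""
    return source + b"\nSECRET_FILE=/var/lib/example-tool/secret"


def generate_secret_at_install():
    """Run on each machine at install or first start, so every installation differs."""
    return secrets.token_hex(16)


def differs_between_builds(build, source):
    """Build twice from the same source; any difference marks the build unreproducible."""
    first = hashlib.sha256(build(source)).hexdigest()
    second = hashlib.sha256(build(source)).hexdigest()
    return first != second
```

Here `differs_between_builds(build_with_baked_in_secret, source)` is true, which is exactly the kind of A/B difference that would prompt someone to look at what changed and notice the embedded key.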
Starting point is 00:55:18 It definitely conjures thoughts of visualizations. Probably, it sounds like it's well-named if you're definitely changing so many things, you're putting it through, you know, torturous things. Thinking about how do you, like, so this is the Debian project. You guys have a great setup. You put the time and money into this.
Starting point is 00:55:38 How do other people do it? Like, there's a set of best practices. You've described the process, which seems relatively straightforward. There's a few steps. But tactically've described the process, which seems relatively straightforward. There's a few steps. But tactically, how do you go ahead and say, you know what? Our group is interested in reproducible builds. How do we get from where we are to where we want to be?
Starting point is 00:55:55 A lot of the work is being done with liaising with compilers and other toolchain-based utilities that are introducing non-reproducibility. So we speak a lot with the GCC developers. For example, C developers will know the __DATE__ and __TIME__ macros, the C preprocessor macros. And previously they embedded the current date and time directly into the code at compile time. This affects reproducibility because obviously every time you rebuilt it, it would put the current time in. And so therefore, every single time you would build, you get a different binary. So a lot of the time we are speaking to developers in those kind of areas rather than developers of, shall I call them, leaf packages,
Starting point is 00:56:46 you know, sort of ones that depend on other packages rather than where packages are depending on them. Documentation generators are another example of upstreams that we're speaking to quite a bit. In terms of just getting the word out about the potential problems of a world where we can't trust the binaries that are running on our own computers. That's a lot of what we do and talking about the problems and talking about the doomsday scenarios as we outlined before. So we've outlined a little bit who's involved and you mentioned all these different projects doing it, working hard on this, Arch Linux, Bitcoin, Debian, FreeBSD, NetBSD, so on and so forth, Tor.
Starting point is 00:57:26 Who's not involved that should be? You know, if you could get their ear and say, you know, you guys need to be doing reproducible builds and here's why. What are some groups or some people or some companies that should be, you know, doing these things? And as far as your knowledge is, they aren't. Well, one group we know aren't is basically people outside of the free software
Starting point is 00:57:48 space. I mean, for example, what made the recent Volkswagen emission scandal possible is software that has been designed to lie about the sensors in a lab environment. If you had the source code under public scrutiny, adding that sort of feature would have been sort of impossible. I mean, without reproducible builds, it's hard to confirm that the binary code installed in the car was actually made from the source code
Starting point is 00:58:13 that had been verified, if you know what I mean. Well, nobody has access to the source code anyways, right? Well, yes, that should be another... So anybody who has proprietary source code, they're not going to be doing reproducible builds as far as a public community has access to them because their source code is private anyway. That's true. But because of things like the emission scandal and things like that, we may see more legal-based requirements for these things to be. And then even if someone did provide the source code for a piece of software along with the binary, in other words
Starting point is 00:58:45 your car, you would still need to be able to verify that one came from the other. Well, you would think the EPA might step in there, Jared, where the general public may not have access to the source code, but maybe because of this emission scandal, maybe a new law or something is put into place where
Starting point is 00:59:01 the EPA has to have access to the original source code to produce a reproducible build to confirm that the software installed in the car matches the result they got from source. I mean, that's a possibility there rather than saying, you know, open the code up to the world because it is proprietary. Maybe certain security levels might have to be in place. That just means bigger government.
Starting point is 00:59:25 So different podcast altogether. Something I wanted to ask, maybe it's we're talking about who should be involved. But I was thinking, I was almost going to interrupt you, Jared. But maybe now that I have the chair here, I can ask it. Going back to the example originally with the iOS developer who got circumvented and pulled down the wrong version of the iOS SDK. What could that person have done differently to prevent them using a compromised iOS SDK? Well, one, they could have, I mean, the obvious things of ensuring that they download it from a reputable source. Assuming that Apple are not going to release the source code for the SDK,
Starting point is 01:00:02 which is probably a given, there's very little they can do, and that's basically the quote-unquote risk you take when you run proprietary software. What hopefully would have happened is that if they were in a free software world and they released the source code for their software, it would have been obvious very quickly that the binaries that they were producing did not match with their source code, and they would never have matched because of the way that their SDK was introducing the change of adverts and things like that.
Starting point is 01:00:35 So hopefully very, very quickly, when a third party recompiled their piece of software, they would say, that's interesting. You're distributing checksum A, B, C, D, but when I compile it myself, I get D, E, F, G. What's going on here? And questions would be raised very early and things like that. So in the case of this SDK, they pulled down iOS SDK, downloaded from a reputable source, which seems to be a logical first step.
Starting point is 01:01:05 But let's say they didn't. Since it's proprietary code, they can't essentially leverage the best practices of reproducible builds because they don't have access to the source code, and they can't confirm that. Indeed, yeah. So they're screwed, basically. They're forced to use this nefarious version
Starting point is 01:01:24 because they downloaded it from somebody's hijacked website, not apple.com slash developers or something like that. Back to who should be involved, because I had a thought in my head and I didn't want to go too far, but I do want to get back to who's involved. So we've got a list here. I think the URL is reproducible-builds.org, and you can go there and see everybody involved. And Jared's question was, specifically, who's not involved in this that should be involved in this? Well, I wouldn't want to embarrass anyone in particular.
Starting point is 01:01:56 Oh, come on. Well, maybe it's not an embarrassment. It's more like a call to arms. Get in the pool. Yeah, convince them. Don't embarrass them. Convince them. So someone who's not really represented here is
Starting point is 01:02:09 Ubuntu, and they do have a large installation base that would obviously be great leverage and would provide a lot of reassurance to a lot of users if Ubuntu got involved. We have actually spoken to them, and they are kind of waiting to see
Starting point is 01:02:26 whether the Debian tool chain, et cetera, kind of settles down because it's a little bit in flux at the moment. So they have no philosophical objection to it. It's just not on their radar right at this second. But hopefully that will change in the next three or four months. And we'll start to see
Starting point is 01:02:41 some of the Debian reproducible builds work trickle down into Ubuntu, like a lot of the other work that's shared between those two projects. So I think that would be the main one that's missing in terms of user leverage. In terms of people who don't really care about it, I guess anyone in proprietary software can't care about it, because it doesn't really work if you don't have the source at least. Yeah, indeed. So I can't really speak to them. I mean, it'd be nice if more Windows developers had that kind of mindset and things like that. But there's still a lot of free software that's being developed for Windows.
Starting point is 01:03:18 I mean, things like PuTTY and a whole bunch of browsers are free software and released on Windows operating systems. So perhaps more in that space and things like that. Microsoft themselves could definitely get involved when it comes to open source developer tools in their ecosystem. They have many: .NET Core and Visual Studio Code and many things that have been open source for a while now that they could at least,
Starting point is 01:03:43 and people are actually, developers are relying on them as their tool chain. So they could get involved. That's true. They could certainly help ensure, or certainly make it easier for developers to make their builds reproducible. Yes, that'd be very nice to see.
Starting point is 01:03:58 And again, there will be another great source of leverage there. It'd be one company getting involved and would help quite a lot of developers and users. What about at the individual level, you know, Jane developer, you know, Jane web developer and Joe game developer, you know, Linux users, people like Adam and I, our listeners out there that maybe, I mean, we take security seriously, maybe we can do better, but what can we do to help this initiative? So one thing is you can ensure that any source code you do release can be built reproducibly.
Starting point is 01:04:34 So this means removing any timestamps from the build and ensuring that it produces the same result in as many different environments as possible. It doesn't have any varying behavior, things along those lines. Most software would not require any of these things, but a lot of software likes, for example, to include a version which is based on the current date, or to include the name of the machine that it was compiled upon, and things like that. So removing those sorts of things is pretty much the first step. And for most software, the only step required to
Starting point is 01:05:05 make the build reproducible. The other thing you can do is to occasionally check whether the code that you're running does match the source code, and if it doesn't, you know, raise a red flag to whoever is producing that binary and the source code, saying: that's interesting, you're providing this binary and this source code. They don't seem to match. What gives? What gives? Would they compile their own version?
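The advice about removing timestamps and other varying inputs can be illustrated with a minimal Python sketch. It assumes the SOURCE_DATE_EPOCH environment variable convention published by the Reproducible Builds project for pinning build dates; the function names and file names here are hypothetical, chosen only for illustration, and this is not a description of how any particular project packages its software.

```python
import io
import os
import tarfile


def source_date_epoch(default: int = 0) -> int:
    """Read the pinned build date instead of asking the clock (assumed convention)."""
    return int(os.environ.get("SOURCE_DATE_EPOCH", default))


def build_archive(files: dict[str, bytes], epoch: int) -> bytes:
    """Pack files into a tar archive whose bytes depend only on the inputs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):        # fixed ordering, not dict or filesystem order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = epoch            # pinned timestamp, not "now"
            info.uid = info.gid = 0       # no builder-specific ownership
            info.uname = info.gname = ""  # no user or machine names
            info.mode = 0o644
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

Calling build_archive twice with the same files and the same epoch, on any machine and in any insertion order, should give byte-identical output, which is the property a third party needs in order to compare checksums.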
Starting point is 01:05:31 Would they use this torture chamber you're mentioning? What's the step they take to ensure that? I think right now they would compile it themselves. That would probably be the easiest way of getting a single checksum, rather than setting up a torture chamber, because that requires isolated environments, changing clocks, etc. So the first step would just be to recompile it on their own machine, see what they get, and compare it with the result that's being distributed by the original source code maintainer.
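The recompile-and-compare step can be sketched in a few lines of Python. This is only an illustration, assuming the distributor publishes a SHA-256 checksum as a hex string; the function names are hypothetical and the transcript does not say which hash algorithm any given project uses.

```python
import hashlib


def sha256_of(stream, chunk_size: int = 65536) -> str:
    """Hash a binary file object in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_published(stream, published_hex: str) -> bool:
    """Compare a locally rebuilt artifact against the distributor's published checksum."""
    return sha256_of(stream) == published_hex.strip().lower()
```

In practice one would open the locally built binary in "rb" mode and pass it in together with the checksum copied from the project's release page; a False result is the "what gives?" moment described above.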
Starting point is 01:05:59 I feel like that's still a hurdle where most people will just be like, whatever. Maybe I'm just being agnostic against it, but I feel like we're back to the same original problem, where we said you're just verifying a checksum, essentially. You're back to compiling, which is what you want because you want to reproduce a build, and which is the question I asked kind of in the middle there: why don't we just compile the software ourselves? And that's essentially what we're doing to confirm we have the right thing. So if they're not going to do it to use it, I'm just wondering if there's an easier hurdle to put in front of people to get over versus that one. I suppose one difference here is that checking the
Starting point is 01:06:37 checksum not only helps Jane developer. When they are validating a checksum, they're also checking it for the rest of the community as well, and they're also checking on behalf of the original distributor. It's not just helping them. And so if they built all their binaries themselves, it would be, it's too strong a word, but it would be sort of somewhat selfish to do that, because only they would reap the advantages of doing that recompilation. But if they recompile and check with the upstream's
Starting point is 01:07:10 version of that binary, what they're doing is helping the community at large saying, yeah, I've confirmed that this binary matches this source code at least. General notes one other go as well. So it's a bit more friendly in that sense.
Starting point is 01:07:29 Sorry, Jared, go ahead. I'm just trying to think of ways that we could actually get people to do something on this, because I agree that the ad hoc, you know, check a binary here, check one there type of thing probably doesn't have much legs with people. But it seems like there could be some community tooling built around some sort of reproducible build Chaos Monkey thing, sort of like what Netflix has on their internal networks, where you could just build a system that pulls a random GitHub repo. Maybe it has to be language specific or something, but, you know,
Starting point is 01:07:55 spins up an EC2 instance, runs the build, gets the checksum, checks it against, you know, the published one or whatever. And then, like, has a web page with,
Starting point is 01:08:04 you know, red X's and green checkboxes or something, where it's automated but accessible and, you know, a community effort. I don't know, Chris, is that... Yeah. Well, he said recompiler earlier. That's what I was going to ask before I was going to interrupt you. But Chris,
Starting point is 01:08:23 you mentioned something about recompilers, a farm or something like that, earlier in the call that we didn't go into? That's right, yes. I think I referred to them as rebuilders. Rebuilders. Yes. That sounds cool. Yeah, I think it's a pretty cool idea. It's quite interesting philosophically as well, because you would want as many different and diverse groups of people recompiling the software as possible. If you did have a community effort, whilst you've removed the original building of the binaries,
Starting point is 01:08:52 if you have a central community way of doing this, you've essentially then re-centralized the confirmation of all these checksums. So you want as diverse a group as possible building all the software all of the time. For example, you might have servers in Greenpeace's data center and the Department of Defense's: people with rather different views of the world. But if they can agree on the checksum,
Starting point is 01:09:20 the final checksum of a binary, and they have different motives in this world and things like that, then you can start to say, oh, I can trust this. Yeah, cool. It'd be kind of like SETI@home, only the results would be actually useful.
Starting point is 01:09:32 It would be very much like SETI@home. I never thought of it that way. Yeah. Rebuilder at home. There it is. You know, somebody go out there and build that thing and we'll all just dedicate CPU cycles, you know, at nighttime when we're sleeping
Starting point is 01:09:43 to making sure all of our software is secure. That'd be amazing. I'll get onto that. So that's what I was getting at was like this hurdle to do the thing that, you know, does what we're trying to do with this conversation and this entire initiative and this best practice is like that last step. As Jared said, is going to be less likely to be done by the general public if it's just sort of like, if I think Jared's doing it and he thinks I'm doing it, neither of us are doing it, right? And maybe a few, a small handful.
Starting point is 01:10:13 And those are the people getting burnt out. Those are the people giving talks. Those are the people running meetups. Those are the people getting pulled every which way. And those are the people getting burnt out. And, you know, it doesn't scale, it doesn't sustain. We have a Heartbleed issue again, or we have another issue, or we've got an emissions scandal going on because, you know, this rebuilder process is just too hard to put on the individuals
Starting point is 01:10:37 of the world. Indeed, yeah. And that's probably our weakest point at the moment: how we can really translate the reproducible builds effort down to the end users. So, for example, providing end-user tools to say, oh, you're about to install this particular binary, but we don't believe it's reproducible, so what do you want to do about it? We don't have tools for doing that yet. We don't have these sort of automated, or at least semi-automated, rebuilders yet. It's certainly in the pipeline. It's just that we are still not quite there as a project, in that we'd like to move the reproducibility effort on a little bit further before we attack those angles. And
Starting point is 01:11:23 there's a few unresolved questions as well. I mean, just for one example, say we had 10 different builders, you know, Greenpeace, Department of Defense, you, me, etc., all publishing their checksums for their binaries. What would be the algorithm the end-user tool used? Would it say, oh, all 10 have to agree? Do nine out of 10 agree? Is that okay? What if I'm a malicious actor and I upload 10,000 checksums that are all bogus? Would they outvote the others? There's these difficult questions that haven't really been resolved yet in terms of policy. Put it on the blockchain. Blockchain solves all problems.
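To make the policy question concrete, here is one possible rule sketched in Python. It is purely hypothetical and not the project's algorithm, which the discussion above describes as unresolved: the end user keeps their own list of rebuilders they trust, each trusted rebuilder gets exactly one vote, and a checksum is accepted only if enough of them agree. The names accepted_checksum, trusted and threshold are illustrative assumptions.

```python
from collections import Counter


def accepted_checksum(reports, trusted, threshold):
    """Return the checksum backed by at least `threshold` trusted rebuilders, else None.

    reports: iterable of (builder_id, checksum) pairs, possibly including junk.
    trusted: set of builder ids the end user has chosen to trust.
    """
    votes = {}
    for builder, checksum in reports:
        if builder in trusted:
            votes[builder] = checksum  # one vote per trusted builder; latest report wins
    if not votes:
        return None
    checksum, count = Counter(votes.values()).most_common(1)[0]
    return checksum if count >= threshold else None
```

Under this rule the 10,000 bogus uploads do not outvote anyone, because submissions from identities outside the trusted set are never counted. It leaves open who decides the trusted set and the threshold, which is exactly the policy question raised above.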
Starting point is 01:12:09 It's the new spade card or trump card, you know, just say blockchain and then that's the answer. Well, blockchain would be part of this thing to ensure that someone could not unpublish their checksums. So yes, we are actually, ironically, thinking of using blockchain-like technology. I'm full of good ideas today, I tell
Starting point is 01:12:26 you what. That's two in a row. I feel like you should join our project. Yeah, maybe I will. Well, Chris, in efforts of closing up here, what closing thoughts do you have to share? This is the last chance on the show to sort of get that final person who's like, you know, I really like this idea, what's the next step? What final closing thoughts do you have on this? I suppose if someone's vaguely interested in the project, they should totally check out our website. It's got a bunch of talks, a bit more background information, and some recent presentations with some more interesting gotchas about interesting things that we have surfaced QA-wise in the reproducible builds effort. We also have a mailing list and some
Starting point is 01:13:13 interesting tools: as I mentioned, the diffoscope tool. So one thing that everyone can try right now is a website called try.diffoscope.org, where you can upload two files and it will recursively unpack them. So if you give it two ISO files, it will unpack the ISO files and look for meaningful and human-readable differences between those two files. That software is also available on your desktop, but this is just a web-based interface to it. So those would be the next things to check out if someone's interested in the project. Good deal. Well, we'll certainly leave links in the show notes to reproducible-builds.org, which is the site that
Starting point is 01:13:56 Chris is referring to. The talks, resources, tools, events, even the news stream you have there is great. Great documentation. So I highly encourage those who are listening to this and interested to check that out. Check the show notes for that. And Chris, thanks so much for joining us on the show today, man. Really appreciate it. No problem at all. Thank you for having me on. Thanks again to our guest this week, Chris Lamb.
Starting point is 01:14:22 Also thanks to our sponsors, GoCD, Linode, and Flatiron School, as well as Fastly, our bandwidth partner. Check them out at fastly.com. Our theme music was created by Breakmaster Cylinder, and this episode was edited by Jonathan Youngblood. The best way to keep up with all things open source and software development is to subscribe to our weekly email,
Starting point is 01:14:43 Changelog Weekly. Head to changelog.com slash weekly to subscribe. Don't miss an issue. And thanks for listening. Bye.
