Python Bytes - #51 How to make your code 80 times faster

Episode Date: November 9, 2017

Topics covered in this episode: Exploring United States Policing Data with Python How to make your code 80 times faster Giving Open-Source Projects Life After a Developer's Death Solar Powered Inte...rnet Connected Lawn Sprinkler Project Talk MicroPython and Open Source Hardware at Adafruit: https://talkpython.fm/108 Some New Python Books Anaconda Distribution 5.0 released Extras Joke See the full show notes for this episode on the website at pythonbytes.fm/51

Transcript
Starting point is 00:00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is episode 51, recorded November 7th, 2017. I'm Michael Kennedy. And I'm Brian Okken. And we got a bunch of awesome Python news lined up for you, as always. Before we get to that, though, let's just say thank you to Datadog. Yeah, thanks, Datadog. Yeah, Datadog is sponsoring this episode.
Starting point is 00:00:23 They've got some really cool sort of whole platform monitoring tools. And you get a t-shirt if you do their little tutorial. So we'll talk more about that in a minute. But now I'd like to explore the United States with some data science. Yeah, so I ran across this article called Exploring United States Policing Data with Python. And since that's actually kind of probably a hot topic and a lot of parsing some of that data, I thought this would be a fun thing to walk through. So I walked through about half of the paper. Anyway, but it goes through using Jupyter and IPython and all of
Starting point is 00:01:02 those fun tools like Pandas and NumPy to grab some publicly available data. It's in a CSV file that's zipped, and you can just import that directly or read it directly with the appropriate tools with a Jupyter notebook and ask some questions like the race of people that get pulled over more often and things like that. I know it's a very political topic. I don't really want to get into that part of it. It is interesting, but mostly I think it's a very riveting example for walking through why it's important for more and more people to be able to examine public data and be able
Starting point is 00:01:42 to figure out what's going on. Yeah, I think it's really interesting that you bring this up, this example of working with public data. Like there's a ton of data that we could be asking and answering questions about, right? This policing data is an example. Was it fun for you to like play data scientists and play with Jupyter and Matplotlib and stuff like that? Yeah, it really was. And one of the things I really actually enjoyed was that the example goes through pretty quickly because it doesn't stop to tell you
Starting point is 00:02:10 exactly what all the code does. It just has the code snippet that you just plop into a Jupyter cell and hit shift enter, and it just runs it and plots things. And I know I can look that stuff up later. There were a few gotchas that I ran into that I'll put in the show notes. Since this was my first time playing with Jupyter,
Starting point is 00:02:33 I didn't really know that you had to hit shift enter, but I remember hearing that from somebody else to get it to run. Yeah, it's kind of its own world, but it's really nice. Yeah, but the example really does start with just from the beginning. If you've never run it before, you can walk through some of this. It's pretty cool. There was some really interesting stuff you can do with police data and data science. I don't remember exactly the details, so just take this as kind of a general idea. But I think on Partially Derivative, they had someone on talking about analyzing this kind of stuff. And it was something like when there was some kind of complaint or episode of violence involving police, it was
Starting point is 00:03:11 pretty frequent, like that the person, the policeman involved in that violence had previously somehow just come off of some other horrible thing, like policemen who went to like check on a tech to deal with a suicide and then pulled somebody over who was, you know, noncompliant was much more likely to have a violent interaction with that person at the traffic stop. So you can do things like say, well, let's change our policy so that people who just had some traumatic event get the rest of the day off so they can process that. Right. Right. I mean, these are really powerful and important things. One of the things I want to caution people for is, like, I'm clearly an amateur data scientist because I just did this one thing and just followed a tutorial. But now you have that title. Yeah, no.
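The workflow described above (open a zipped CSV of public data, then ask a simple counting question) can be sketched with the standard library alone. The article itself used pandas in a Jupyter notebook; the column names, the sample rows, and the helper names below are made up for illustration and are not taken from the real dataset.

```python
import csv
import io
import zipfile
from collections import Counter

# Tiny made-up sample standing in for the real dataset (columns are hypothetical).
SAMPLE_CSV = (
    "stop_date,driver_race,violation\n"
    "2015-01-01,White,Speeding\n"
    "2015-01-02,Black,Equipment\n"
    "2015-01-02,White,Speeding\n"
    "2015-01-03,Hispanic,Speeding\n"
)


def make_zipped_csv(text, member_name="stops.csv"):
    """Build an in-memory zip archive containing one CSV file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, text)
    buffer.seek(0)
    return buffer


def read_zipped_csv(zip_file, member_name="stops.csv"):
    """Read one CSV member of a zip archive into a list of dict rows."""
    with zipfile.ZipFile(zip_file) as archive:
        with archive.open(member_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


rows = read_zipped_csv(make_zipped_csv(SAMPLE_CSV))

# Count how many stops fall into each group.
stops_by_race = Counter(row["driver_race"] for row in rows)
print(stops_by_race.most_common())
```

With a real download, the path of the zip file on disk would replace the in-memory buffer; everything after that stays the same.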
Starting point is 00:04:00 But there is, be careful when you draw conclusions and plot things and show charts. Suddenly it looks more legitimate. And yes, there's good information you can find, but you've got to be careful as well. So at the beginning in this article, for instance, not all the data is filled in for all police stops. So you have to deal with, like, fields that are empty. What do you do with that? And this article deals with it by just throwing them away, like throwing away a bunch of, it picks one field and says, well, I can fill that one in, but the rest of them, rest of the rows, if there's any empties, just throw them away. I don't think that that's valid. And I
Starting point is 00:04:44 think there's probably a better solution. But I think that if you're going to publish something, you should discuss what you did to clean up the data also. One of the things we've talked a lot about here on the show is performance. Sometimes that performance comes in terms of asynchronous programming when you're waiting on the network and things like that, or other times maybe looking at things like PyPy. So most people know there's all these different implementations and runtimes for Python.
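Going back to the data-cleaning caution from the policing-data discussion: a small sketch of why the choice matters. It compares dropping every row that has any empty field with dropping only rows that lack the one field a question needs, and it reports what was missing so the cleanup can be disclosed. The rows and field names are invented for illustration.

```python
from collections import Counter

# Made-up rows; None marks a field that was left blank.
stops = [
    {"county": "A", "driver_age": 34, "search_conducted": False},
    {"county": None, "driver_age": 51, "search_conducted": False},
    {"county": "B", "driver_age": None, "search_conducted": True},
    {"county": "B", "driver_age": 29, "search_conducted": None},
    {"county": "A", "driver_age": 42, "search_conducted": False},
]


def missing_report(rows):
    """Count missing values per column, so the cleanup can be documented."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return dict(counts)


def drop_incomplete(rows, required=None):
    """Keep rows whose required fields are present (default: every field)."""
    kept = []
    for row in rows:
        fields = required if required is not None else list(row)
        if all(row[field] is not None for field in fields):
            kept.append(row)
    return kept


def search_rate(rows):
    """Fraction of stops in which a search was conducted."""
    return sum(1 for row in rows if row["search_conducted"]) / len(rows)


strict = drop_incomplete(stops)                                   # any blank -> dropped
targeted = drop_incomplete(stops, required=["search_conducted"])  # only what we need

print("missing per column:", missing_report(stops))
print("strict:", len(strict), "rows, search rate", search_rate(strict))
print("targeted:", len(targeted), "rows, search rate", search_rate(targeted))
```

On this toy data the two cleanup strategies give different search rates, which is the reason to say which one was used when publishing a chart.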
Starting point is 00:05:11 But if you're a new listener, maybe new to Python, CPython is what most people mean when they talk about Python. We also have a JIT, just-in-time, compiler version of Python called PyPy. We have IronPython, we have Jython, we have Cython. There's all these different variations. So one of the ones that promises to take basic working Python code and make it a lot faster is PyPy. So you'll hear people talk about, hey, you could make your code five times faster with PyPy under a whole bunch of constraints. So the thing I want to talk about this time on this show is how somebody went and took their code for IoT stuff and made it 80 times faster with PyPy. Yeah, that's incredible. 80 times. Wow. Yeah. And I don't think they even really
Starting point is 00:05:55 changed the code much. They did change one little bit of an inner loop, but that was about it. All right. So here's the deal. This person was working on evolutionary algorithms, which they were trying to create basically a self-learning adjusting algorithm that could evolve a logic to control a quadcopter. Oh, nice. Right? Like one of these drone things. It was a simulated one, but, you know, it could be hooked to a real one. It didn't really matter.
Starting point is 00:06:19 And in order to drive the quadcopter, this object had to basically run a certain operation every so often. And the so often is like 100 times a second, so quite often. And they would input this numpy array. They would do some processing on it and output another one, like how much thrust goes to each motor and things like that. So this is happening 100 times a second. And they ran it with CPython.
Starting point is 00:06:44 And their test, which ran not just one but a whole bunch of operations, took about six seconds. And they said, okay, well, let's try PyPy. I heard using PyPy makes it faster. So instead of typing python space my program, they typed pypy space my program. And it turns out it was, wait for it, five and a half times slower. That, no. That's not faster. Exactly. It's like, oh, who sold me this bill of goods, man? I thought this was supposed to be faster.
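The comparison described above amounts to running one unchanged script under two interpreters and timing it. A minimal sketch of such a script follows; `control_step`, the gain values, and the step count are hypothetical stand-ins, not the code from the article.

```python
import platform
import time


def control_step(state, gains):
    """Toy stand-in for one controller update: a small matrix-vector product."""
    return [sum(g * s for g, s in zip(row, state)) for row in gains]


def run_benchmark(steps=10_000):
    """Run the update repeatedly and return (elapsed seconds, final state)."""
    state = [0.1, 0.2, 0.3, 0.4]
    gains = [
        [0.5, 0.1, 0.0, 0.2],
        [0.1, 0.5, 0.2, 0.0],
        [0.0, 0.2, 0.5, 0.1],
        [0.2, 0.0, 0.1, 0.5],
    ]
    start = time.perf_counter()
    for _ in range(steps):
        state = control_step(state, gains)
    elapsed = time.perf_counter() - start
    return elapsed, state


if __name__ == "__main__":
    elapsed, _ = run_benchmark()
    # Reports "CPython" or "PyPy" depending on which interpreter ran the file.
    print(f"{platform.python_implementation()}: {elapsed:.3f} s")
```

Saved under a name such as bench.py (a placeholder), the same file can be run once with the CPython executable and once with the PyPy executable, and the two printed timings compared.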
Starting point is 00:07:10 So it turns out that the integration with NumPy, which is what they're using, specifically some of the C interop stuff, is quite a bit slower on PyPy. So they actually used this PyPy implementation of NumPy. It's called NumPyPy instead of NumPy. And that made it faster. So now it was two times faster just by switching out the imports of the library. So that's pretty good. They said, but you know, we actually think we could do better. And so what they did is they started profiling it, and they looked at where it was slow and said, you know, this thing, actually, we could just rewrite this algorithm just a tiny bit so it's more friendly to the
Starting point is 00:07:48 JIT compiler, and they got it to go 80 times faster than the original CPython plus NumPy version. No, that's nice. That's awesome, right? And that was even 35 times faster than the native PyPy plus NumPyPy version. So I think that's a really interesting lesson. Yeah, nice use of a profiler too. Yeah, exactly. That's the point I was going to make. Yeah, it's one thing to throw PyPy or some other technology at it, but it's not always just going to be this silver bullet that solves your problem, right? But PyPy plus a little intelligent problem solving, that made a huge difference for these guys. Yeah, so you said this was a simulation?
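The two steps mentioned above, profiling to find the slow spot and then rewriting a small inner loop, might look roughly like this. It is a sketch under assumptions, not the article's code: the functions are hypothetical, and whether the explicit-loop version is actually faster depends on the interpreter and should be measured.

```python
import cProfile
import io
import pstats


def matvec_generic(matrix, vector):
    """Straightforward version built from a generator expression and zip."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matvec_loops(matrix, vector):
    """Same arithmetic written as plain indexed loops over floats."""
    n_rows = len(matrix)
    n_cols = len(vector)
    result = [0.0] * n_rows
    for i in range(n_rows):
        row = matrix[i]
        total = 0.0
        for j in range(n_cols):
            total += row[j] * vector[j]
        result[i] = total
    return result


def simulate(step, steps=2000):
    """Apply the chosen step function repeatedly, like a control loop."""
    matrix = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
    vector = [1.0, 2.0, 3.0, 4.0]
    for _ in range(steps):
        vector = step(matrix, vector)
    return vector


def profile(func, *args):
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return out.getvalue()


report = profile(simulate, matvec_generic)
print(report)  # the hot function shows up near the top of the listing
```

The profile points at the function worth rewriting; the rewrite keeps the results identical, so the two versions can be swapped and timed against each other.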
Starting point is 00:08:24 Yeah, they were doing it as a simulation. Okay, I was wondering. I don't know whether or not you can run PyPy on a little tiny, like, quadcopter controller. I guess it depends on your definition of tiny. Like, if it's a Raspberry Pi, I think you probably could. If it's Adafruit size, that might be too small. I actually don't know. Okay, cool. Yeah. So the next one is an interesting one I saw go by and I saw that you picked here as well.
Starting point is 00:08:51 It has to do with the longevity of open source projects. Yeah, this is actually a Wired article called Giving Open Source Projects Life After a Developer's Death. And I hadn't really thought about that too much before. But there is, I mean, we've got more and more, as the article goes on, we've got more and more critical projects using open source projects. And there's a lot of them that don't have that many maintainers, or sometimes just a handful or one. So really, how do you deal with that? So part of the article is just talking about this as a problem, but then also there's possibly some solutions. And I was just wondering if we had any solutions. I also had some terrible puns I was going to try to throw in. What to do after you hit your corporeal seg fault and raise an end-of-life exception. Yeah, it's definitely something that people, I guess, really want to think of.
Starting point is 00:09:55 I mean, if you're in business, they do talk about the, like, what if so-and-so gets hit by a bus, right? Will that kill the business? Well, if this open source project is used by many businesses, it could be that that actually kills a lot of projects, a lot of companies. Yeah, so I didn't know this was out there, but apparently there's a place called libraries.io that has a bus factor evaluator for different libraries. So you can plug in a library that you depend on and look up its bus factor. Oh, really? Yeah, how many of the core developers would have to be on a bus
Starting point is 00:10:31 that got hit before the project went away. But one of the things that it did bring up, I looked up some of the Python stuff, and some of them are core things that, even though there's only a handful of maintainers, I think would get picked up anyway because they're part of the standard library. But there's definitely others that are of concern. And one person points out that perhaps we could build some more things into like GitHub or PyPI
Starting point is 00:11:00 or other places to have maybe heirs put in place, so you could say, you know, if I don't check in for like six months, then transfer ownership to these people or something. Oh, that's a pretty cool idea. Almost like an escrow for open source. Yeah. Like, I mean, if you get a big investment account, you got to list who gets it if something happens to you. So maybe something ought to be like that for open source projects. It sounds like a project somebody could create and integrate with GitHub. Yeah, maybe. One of the other solutions that I have seen is like the pytest community for a lot of the plugins.
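The "if I don't check in for six months, transfer ownership" idea above reduces to a simple date rule. The sketch below only illustrates that rule; it does not call GitHub or PyPI, and the project record, names, and 182-day threshold are all hypothetical.

```python
from datetime import date, timedelta

# Roughly six months; the exact threshold is an assumption.
INACTIVITY_LIMIT = timedelta(days=182)


def should_transfer(last_activity, today, limit=INACTIVITY_LIMIT):
    """True when the owner has been inactive for longer than the limit."""
    return today - last_activity > limit


def current_maintainers(project, today):
    """Return the named successors once the owner has gone quiet."""
    if should_transfer(project["last_activity"], today):
        return project["successors"]
    return [project["owner"]]


project = {
    "name": "example-plugin",          # hypothetical project
    "owner": "alice",
    "successors": ["bob", "carol"],    # people the owner named in advance
    "last_activity": date(2017, 1, 15),
}

print(current_maintainers(project, today=date(2017, 3, 1)))   # owner still active
print(current_maintainers(project, today=date(2017, 11, 7)))  # limit exceeded
```

A real service would also need a way to confirm the owner's activity and to notify everyone before any handover; those parts are left out here.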
Starting point is 00:11:38 Try to encourage people to, once you get quite a few users of your plugin, to push it over to a development group on GitHub, and then anybody on the group can maintain it if necessary. Yeah, that makes a lot of sense. I mean, you could always fork the repo and just say, the real one is here, but I can just see a lot of skirmishes around, like, no, your fork is not real, my fork is real. People could fight over ownership, right?
Starting point is 00:12:01 Yeah, and that's doable on GitHub, but you can't really do that on PyPI that I know of. I don't know how to get transfer ownership on PyPI. Yeah, I think you've got to contact them directly and just lobby your case, right? Which doesn't sound like a great long-term widespread solution. I don't have that problem currently of having any super popular packages, but, you know.
Starting point is 00:12:23 Yeah, that's cool. Definitely worth thinking about. Yeah, so before we get to the next topic, let me tell you about Datadog, right? So performance and bottlenecks, these don't just exist in just your application code, right? Like just because your code is slow, well, you could be waiting on the database. The database could be waiting on some Linux internal behavior, who knows, right? So these are layers upon layers upon layers across systems that we really build our apps with. And Datadog lets you view all of those as one whole thing. So let's say you have a Python web app running on Flask.
Starting point is 00:12:55 It's built on Mongo, hosted on a scaled out set of Ubuntu servers running Nginx and uWSGI. Datadog will let you view and monitor all those things as one system. So that is really, really super awesome. And the more you scale out and the more diverse your system gets, the better Datadog can help you. They've got a getting started tutorial. Just take a few moments.
Starting point is 00:13:14 And if you finish it, they'll send you a sweet Datadog t-shirt. So you can check that out at pythonbytes.fm slash Datadog and see what you've been missing. I still have to do that. I need that t-shirt. Yeah, we've got to get our t-shirt. So maybe this is the wrong season here in the Northern Hemisphere. Maybe this is going to resonate a little more
Starting point is 00:13:31 with our Australian friends. But this next project I want to talk about is a solar-powered, internet-connected lawn sprinkler. Oh, nice. Yeah. So Lennon, one of our listeners, friends of the show, sent this in and said, hey, I created this really cool project
Starting point is 00:13:44 and I'd like to share it with everyone. And I thought, yeah, this is actually a really neat example. So I thought we'd throw it in here. And the idea is he went and got this little tiny Adafruit Feather Huzzah board. And this is like a little tiny microchip type thing, but it has Wi-Fi. So that's important. You can plug it in somewhere and talk to it. And it works with MicroPython. So MicroPython is the Python that works in the smallest devices.
Starting point is 00:14:11 Like you said with PyPy before, I'm not sure if you can get PyPy to run on this one, but MicroPython is so super low level. It basically is the operating system. Your app is basically the operating system. Yeah. So you can take like a lambda function and connect it right to a hardware interrupt directly. Like, that's how low level it is. That's insane, right? Oh, that's, that's nice. Right. And he combines a couple of other interesting things. He combines Home Assistant, which is the biggest home automation project in Python, like a really cool app that integrates tons and tons of different IoT and smart home things. And he gives a really nice list of like, here's every single piece of hardware I used. Here's the solar board that I used.
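Connecting a lambda to a hardware interrupt, as mentioned above, looks roughly like this in MicroPython. This is a sketch and not the listener's project code: the pin number and the handler are assumptions, and the small stand-in `Pin` class (including its `simulate_edge` method) is hypothetical, there only so the file also runs on a desktop Python where the `machine` module does not exist.

```python
try:
    from machine import Pin  # present on MicroPython boards
except ImportError:
    class Pin:
        """Hypothetical desktop stand-in so the sketch runs off-device."""
        IN = 0
        PULL_UP = 1
        IRQ_FALLING = 2

        def __init__(self, pin_id, mode=None, pull=None):
            self.pin_id = pin_id
            self._handler = None

        def irq(self, trigger=None, handler=None):
            # Remember the callback instead of wiring up real hardware.
            self._handler = handler

        def simulate_edge(self):
            # Pretend the pin changed level and fire the callback.
            if self._handler is not None:
                self._handler(self)


press_count = [0]  # mutable holder so the handler can update it


def on_press(pin):
    """Keep interrupt handlers short: just record that something happened."""
    press_count[0] += 1


# Pin 0 is an assumption; use whichever pin the button or sensor is wired to.
button = Pin(0, Pin.IN, Pin.PULL_UP)

# The lambda is attached directly as the interrupt handler.
button.irq(trigger=Pin.IRQ_FALLING, handler=lambda pin: on_press(pin))
```

On a board, pressing the button pulls the pin low and the handler runs; on a desktop, calling `button.simulate_edge()` exercises the same path.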
Starting point is 00:14:51 Here's the container for the feather Huzzah board. And it's just a really nice example of a small, compact IoT project. Yeah, and useful and not creepy. I like it. Yeah, exactly. The more we seem to go back and forth on these little IoT things, I really want to create one of these that goes on my front door that uses machine learning to determine what type of person
Starting point is 00:15:11 is on my front door when they ring the doorbell. Yeah, or a dinosaur. Yeah, or a dinosaur. That's right. Yeah, and also I put a link in here to TalkPython episode 108 where I actually had the guys from Adafruit come on and talk about a whole bunch of these different projects. But yeah, nice job, London. This is a cool one. Yeah. And shout out to Adafruit too for doing all sorts of cool stuff with hardware and software.
Starting point is 00:15:32 I like that. Yeah. They have a big educational aspect to them. Not just education education, but teaching people who want to learn about IoT. I'm definitely planning on playing with some of these little devices in MicroPython. It just seems really fun. Okay, so I am going to be perfectly honest with this last one. My last thing I was going to talk about was going to be another packaging story, but I kind of went down a rabbit hole. So instead of getting into that, that's my homework for next week. So I've already set up what I'm going to talk about next week,
Starting point is 00:16:04 but what I want to highlight is some books that came out. So we had some new Python books that came out recently. It's a big week for Python books. Yeah, we've got Python Tricks from Dan Bader. That came out. And Matt Harrison's Illustrated Guide to Python 3. And I'm going to read, or at least peruse, both of these. I've just started Python Tricks, and I like the format. It's cool. And then I'm going to take a read of Matt Harrison's book,
Starting point is 00:16:36 and the cover's awesome, and I want to try to get, actually I want to have that around my office so that other people can look up Python 3 stuff pretty easy. That seems really nice. I looked through it as well, and the illustrated aspect is cool on Matt's. So, yeah, congrats both to Dan and Matt on this. This is cool. Yeah, and then the last thing is I was on Twitter.
Starting point is 00:16:57 I was talking with a handful of people, authors. There are some Python books out there that really could use some Amazon review love. So I'm going to drop a link to my book and Harry Percival's Test Driven Development, which has been out for a while, but it's only got six reviews. And I know a lot of people have read this stuff. And then also... Yeah, Harry's book is really great. Obey the Testing Goat and all that stuff. And then also the Greenfield's Two Scoops of Django. I know a lot of people have started Django or gotten a lot better at it using this. So go out and show some Amazon love to these people.
Starting point is 00:17:35 Yeah, definitely. I think it really helps if you've read a book to write a review, if you've taken a class, give a review, right? These things actually make a big difference. Yeah, it really does. And we're all trying to do things the right way and trying to support the community. So yep, that's cool. Awesome. Well, the last one I want to talk about sort of harkens back to the first one that you did, the data science space, and that's the Anaconda distribution. So you probably know the Anaconda distribution is an alternate distribution for Python. It's basically CPython. But instead of just being a standalone CPython
Starting point is 00:18:12 where you pip install stuff, it comes packaged with most of the machine learning, data science, and popular libraries you already need, pre-compiled for your machine. So if you want to use some weird package that requires like a Fortran compiler or something, you can just install it. Either it comes with Anaconda or you conda install it and it actually downloads the binary version.
Starting point is 00:18:36 So there's no worries about it not installing correctly. Yeah, there's also the free distribution. There's both paid and free distributions. But even the free one is one of the few multiple package distributions that is completely legitimate to do within a company as long as you're not reselling that itself. Oh, yeah, that's awesome. So the news about Anaconda distribution is version 5 is released. So they have 100 packages that have been updated or added. They have JupyterLab, Alpha Preview included, updated MKL.
Starting point is 00:19:11 Nice. That's the Intel high-performance compiled stuff. So it used the Intel sort of low-level speedups for the machine learning and computational stuff. New compilers for Mac OS and Linux. So what is that? That's Clang and GCC respectively. But what's important here is they flipped the flags on the compile steps for all of these things
Starting point is 00:19:32 to use the newest and most compatible flags for security. So the stuff that gets compiled out of here is less likely to suffer from things like buffer overflow sort of attacks and stuff. So that's really cool. The compilers are now a little safer for you. They've got updated conda-forge and some other things, and they're creating another channel that has to do with this new compiler thing I talked about, and so on. So pretty cool. I still have on my to-do list to go check out JupyterLab. Have you done any of that? I have not checked out JupyterLab. Okay, but it looks fun.
Starting point is 00:20:05 Yeah, it definitely does. There's a lot of cool stuff happening with Jupyter and social coding and stuff these days. It's exciting. Well, that's it for my news this week, Brian. Anything else you want to add? I did find out last week that Python Testing with pytest is available on, what is that thing?
Starting point is 00:20:23 Safari Books Online? Yeah, Safari Online. It's now there. That's awesome. Tons of people have that as just available to them. It's part of their company, right? So people can now get your book that way. I don't think I do. I'm going to check it out.
Starting point is 00:20:37 Yeah, people can check it out. I have no idea if any of that money comes back to me with that, but I'm glad that a lot of people can read it. I think one of the things that's great about creating something like a book or a course or whatever, even a podcast per se,
Starting point is 00:20:49 is that people just use it and enjoy it, right? You put a lot of energy to creating it and if it just sat there like digitally silent, that would be sad. Yeah, yeah, it would be. So cool. Yeah, I'm glad it's out there and yet another channel for people.
Starting point is 00:21:02 All right, well, Brian, thanks again. Thank you. For everything this week and talk to you all later. All right. Bye. Thank you for listening to Python Bytes. Follow the show on Twitter via at Python Bytes. That's Python Bytes as in B-Y-T-E-S.
Starting point is 00:21:15 And get the full show notes at PythonBytes.fm. If you have a news item you want featured, just visit PythonBytes.fm and send it our way. We're always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.
