Python Bytes - #48 Garbage collection and memory management in Python

Episode Date: October 19, 2017

Topics covered in this episode:
The Python Graph Gallery
pynesis
Things you need to know about garbage collection in Python
WSGI Is Not Enough Anymore, part 1 and part 2
Queues in Python
Using R...eflection: A Podcast About Humans Engineering
Extras
Joke
See the full show notes for this episode on the website at pythonbytes.fm/48

Transcript
Starting point is 00:00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is episode 48, recorded October 18th, 2017. I'm Michael Kennedy. And I'm Brian Okken. And we've got a bunch of awesome stuff lined up for you. We're both dialing in from Portland, Oregon. We've scoured the internet and we're going to start with some graphs. But before we do, let's just say really quick, a thanks to DigitalOcean.
Starting point is 00:00:23 A big thanks to DigitalOcean, thanks. Yeah, they totally blew S3 out of the water and they've got an awesome thing called Spaces. We'll tell you more about it later. Right now, I want to hear about cool graphs. Well, I came across this last week, a website called Python-Graph-Gallery, the Python Graph Gallery. And it is cool. I was describing it as graph examples times your head explodes with options. It's got a whole bunch of different types of graphs you might want to do. There are all sorts of different types of graphs that you see around the internet, and basically, to help you visualize your data, you've got kind of the standard ones like a histogram or a box plot, but then
Starting point is 00:01:04 you also have really cool ones like 2D density plots or bubble plots or connected graphs or correlograms. Yeah, there's amazing stuff here. And they all come with little IPython scripts, right? If you click on them, you get the details. You dive down into exactly what you want to do, and it shows you exactly how to make those plots within Matplotlib, and I think in IPython, but that's the same thing, right? But then also some of them explain how to do something and then they have alternates, and there's some opinion there. Some of the graphs they don't really like, and they'll tell you why they don't like them and what some good alternatives are. Yeah. Another thing that's
Starting point is 00:01:47 cool about it is you go to one graph, you're like, huh, I think I need a bar chart or something like that. And it pulls up and says, these are the related ones. You're like, oh, this one is way cooler. I didn't even know about it. Like, maybe I haven't read the Tufte visualizing information book. I don't know all the options. Right. And you can discover them. I like that. Yeah. And then it includes some of the extensions. Like I just dove into seeing how to do a vertical histogram and it mentions that you need to have the seaborn library and use it for these. So yeah, it looks pretty cool. It's great. Yeah. And I guess there's also some R ones out there, like an R part that's sort of tied to it somehow as well if you do R.
Starting point is 00:02:25 But yeah, I've been thinking a lot about doing some stuff recently that would require some really cool interactive graphs. So this definitely catches my interest. Yeah. All right. So check out the Python Graph Gallery. That's cool. Moving on to the next one.
Starting point is 00:02:38 Brian, do you know what Kinesis streams are? I don't. I do have a Kinesis keyboard, but I don't think that's related at all. Kinesis keyboards are wild, man. I have the Sculpt Ergonomic mini thing from Microsoft. I used to have one of those. But Kinesis streams are these things that AWS released. And the idea is you can stream tons of real-time data through it and apply filters and transformations and get additional real-time insight. So like under the description, it'll say things like, you can continuously capture and store terabytes of data per hour from hundreds of thousands of sources, such as web click streams, financial transactions, social media feeds, et cetera, et cetera. So this sounds like a really cool service. You can go sign up for it at AWS. It looks like, at least the folks that sent in this recommendation say, look, it really requires
Starting point is 00:03:25 Java right now for the API to do it. So they felt that that was wrong. And they created this thing called PyNesis for Python APIs talking to Kinesis streams. How about that? That's great. Yeah. So if you're out there and you've got tons of data streaming in, and especially if you're already an AWS customer, you already have an account, you already work there, maybe your apps run there, then it's really cool. So this library does some cool stuff. It's worked for 2.7 and 3.6. It has a Django extension helper. It automatically detects shard changes. So like this thing can do sharding, it'll like adjust for that. It'll create checkpoints and even has a dummy Kinesis implementation for testing. How about that? That's great. That's cool. Yeah, and this is an open source project too,
Starting point is 00:04:08 so you can extend it if you need to. Right on. Yeah, it's pretty new, but check it out. And thanks for recommending PyNesis. I forgot the guy who sent it in over Twitter, but yeah, thank you. That's awesome. So one of the more mysterious things, I think, in Python,
Starting point is 00:04:21 relative to, say, other languages like C, for example, is how memory works, right? In C, I call malloc or I call free. In Python, I just do stuff and, like, I never run out of memory. That's kind of cool. Yeah, it is cool, but it has some downsides. A little bit, I guess. Not really, at least some complexity, right? Yeah, well, and it hides that complexity from the users, but especially when you have an application or a service or something that's a long-running Python application, you kind of have to care about what's going on and make sure that you don't continually grow in memory. There's an article that we're going to link to called Things You Need to Know About Garbage Collection in Python. And it just came out recently. And I
Starting point is 00:05:05 sat down with a cup of coffee this morning and really read it and tried to grok it. And I think it helped me a lot to understand how Python does it. There's two levels of garbage collection. There's the automatic stuff that's just, if an object goes out of scope, it disappears. And then Python can reclaim that memory. And there's something about, like, it treats small objects, like under 512 bytes, a little differently to save time. And that's cool. But then there's this other thing to detect loops and other dead memory. Because with reference counting, you can have objects point to each other, and you can get these loops of memory that just sit around forever. And so there's this other system, the generational garbage collector, that goes through and looks for all of these dead items and cleans them out. And that runs periodically. But that one you can
Starting point is 00:05:57 control if you need to. If you really can't handle it going off and doing its own thing, you can turn it off and call it yourself once in a while if you need to. What's really interesting about this is one of the benefits of, like, C or C++ really is you get totally deterministic behavior, but the drawback is you've got to manage it manually. With reference counting, you also get totally deterministic behavior, right? You run it many times, it's going to behave exactly the same way. So if you're doing something where timing really matters, that's cool. The reference counting GCs or reference counting algorithm has the problem of
Starting point is 00:06:33 cycles. So if I have like a parent-child relationship, they're always going to have at least one reference because parent knows a child, child knows parent. So that thing's never going to go to zero and will leak. So you have this secondary, like, mark-and-sweep garbage collector type thing that comes in. And I think it's really interesting how they've chosen this combination. And the mark-and-sweep garbage collector is similar to, like, .NET or Java, which, that's all they have over there, right? I didn't know. Yeah, yeah. Those two basically work in this generational garbage collector way, very similar.
Starting point is 00:07:02 I don't know that it's exactly the same, but it's similar for Java and .NET, but in Python that's not the main way it works. So I think that's actually pretty interesting. I mean, the article here doesn't go into too much depth, but deep enough to where you can understand it. And, you know, I knew that you could mess around with stopping the garbage collector, or the generational one, and controlling that yourself, but I didn't know how to do it. And it's really not that complicated. It's a few lines of code is all. Yeah, there's a couple of neat things about this article. One is there are some very nice specifics. Like, did you know that objects that are equal to or smaller than 512 bytes have a different
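The parent-child situation described here can be reproduced in a few lines. This is a minimal sketch and not code from the article: the `Node` class and `make_family` helper are hypothetical names made up for the illustration, and the behavior shown assumes CPython, where reference counting frees most objects and the `gc` module handles cycles.

```python
import gc
import weakref


class Node:
    """Hypothetical tree node, used only for this illustration."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []


def make_family():
    """Build a parent/child cycle and return only a weak reference to it."""
    parent = Node("parent")
    child = Node("child")
    parent.children.append(child)  # parent -> child
    child.parent = parent          # child -> parent, which closes the cycle
    return weakref.ref(parent)     # a weak reference does not keep parent alive


gc.disable()  # pause the automatic cycle collector so the demo is predictable
try:
    probe = make_family()
    # Both local names are gone, but each object still holds the other,
    # so reference counting alone never frees them.
    alive_before = probe() is not None

    found = gc.collect()  # run the cycle collector by hand
    alive_after = probe() is not None
finally:
    gc.enable()

print(alive_before, alive_after, found >= 2)
```

On CPython this should print `True False True`: the pair survives leaving scope, and disappears only once the cycle collector runs.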
Starting point is 00:07:40 allocator and mechanism right like knowing that cutoff and those sorts of things and knowing when the GC kicks in and when to turn it off. Like there's also a lot of references. Like if you don't know more about this, read about this section. You don't know more about this, read about this section. So I think this is a great place to start this exploration. And then at the end, it talks about how to find these,
Starting point is 00:08:00 you know, these cycles are bad and you kind of want to get those out of your code if you really want to care about this a lot. And it talks about how to go about looking for that stuff and visualizing it. So you can try to find these cycles in your code and get rid of them. That's cool. Yeah. And the other thing to consider when you're thinking about stuff, especially if it kicks into the actual mark-and-sweep cycle garbage collector type thing, is algorithms and data structures. So you can have a data structure that is like many, many objects that point at each other. Think of like linked list
Starting point is 00:08:31 type of things. There's tons of work to process those. If you've got ginormous ones, it's tons of work to process and determine if that's garbage, right? You might be able to use a sparse array or something that uses almost no pointers but stores the same data and is more efficient. So there's a lot of interesting follow-on things to explore here. And again, yeah, this is mostly a concern for people that have long-running Python applications. For short-running things, it's not a problem, so you don't really have to care about it.
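For the cycle hunting mentioned a moment ago, one standard-library option is the `gc.DEBUG_SAVEALL` flag, which keeps whatever the cycle collector finds in `gc.garbage` so it can be inspected. The sketch below is an assumption about how one might use it, not the approach the article takes; `Session`, `Handler`, `workload`, and `cyclic_garbage_types` are hypothetical names.

```python
import gc


class Session:
    """Hypothetical object that accidentally forms a cycle with its handler."""

    def __init__(self):
        self.handler = Handler(self)


class Handler:
    def __init__(self, session):
        self.session = session  # back-reference that closes the cycle


def cyclic_garbage_types(workload):
    """Run workload() and return the type names the cycle collector had to clean up."""
    gc.collect()                    # start from a clean slate
    old_flags = gc.get_debug()
    gc.set_debug(gc.DEBUG_SAVEALL)  # keep unreachable objects in gc.garbage
    try:
        workload()
        gc.collect()
        return sorted({type(obj).__name__ for obj in gc.garbage})
    finally:
        gc.set_debug(old_flags)     # restore the previous debug flags
        gc.garbage.clear()


def workload():
    Session()  # discarded immediately, but the cycle keeps it alive


print(cyclic_garbage_types(workload))
```

The printed list should include `Session` and `Handler`. Anything that shows up here was kept alive by a cycle, which makes it a candidate for a weak reference or an explicit teardown step.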
Starting point is 00:09:01 Also, another final thought is, you said you can turn off the garbage collector. I think, was it Instagram that turned off the garbage collector in their system? It was either, I feel like it was Instagram or Quora, one of those people, one of those companies turned off the garbage collector and they found they were able to get much better memory reuse on Linux across the processes and actually was better off by just letting the cycles leak. In this article, you can determine it yourself.
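Turning the collector off and running it at chosen moments really is only a few lines with the standard `gc` module. The sketch below is one possible pattern, not what any particular company did; `run_with_gc_paused` is a hypothetical helper name.

```python
import gc


def run_with_gc_paused(jobs):
    """Run callables with the cycle collector paused, then collect once afterwards."""
    was_enabled = gc.isenabled()
    gc.disable()                # reference counting still frees non-cyclic objects
    try:
        results = [job() for job in jobs]
    finally:
        if was_enabled:
            gc.enable()         # restore the previous setting
    unreachable = gc.collect()  # one explicit pass at a moment we choose
    return results, unreachable


results, unreachable = run_with_gc_paused([lambda: 1 + 1, lambda: "done"])
print(results, unreachable)
```

The trade-off is that any cycles created while the collector is paused stay in memory until the explicit `gc.collect()` call, so this suits workloads with natural quiet points, such as between batches or requests.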
Starting point is 00:09:27 You can have predetermined times where you're going to go out and let it run. Yep, pretty interesting. You know what else is interesting? Spaces. Yeah, it is. Spaces is pretty awesome.
Starting point is 00:09:37 Yeah, so like this audio you guys all are listening to came over DigitalOcean Spaces. And if you're familiar with S3, this is like S3, but way better. So very deterministic pricing. You pay $5 a month for a terabyte of outbound traffic, no inbound traffic.
Starting point is 00:09:52 And beyond that, it's like one ninth the price of bandwidth and traffic for S3. So if you're using S3 now, definitely consider DigitalOcean Spaces. They're doing really cool stuff there. All the APIs, the libraries, and the tools that work against S3 also work against Spaces. They've made that sort of a compatibility layer for them. So I've been using it.
Starting point is 00:10:14 I really, really like it. And I definitely encourage you to check it out at do.co/python. Help support the show. And like I said, I think it's pretty awesome. So let's talk about the web for a little bit. You know, many times we've touched on asynchronous programming of one variety or another: threads, multiprocessing, asyncio type of things. But the truth is that on the web, almost all the frameworks are built in a way that cannot take advantage of that at all, or very,
Starting point is 00:10:46 very rarely, I guess, because they're built upon WSGI, the Web Server Gateway Interface. And that basically has a single serial function call for each request. And that's that. There's really not much of a way to expand or to change how the web processing works. So if you want to do maybe some async and await on database calls, or against web services, like with requests, for example, that's basically not going to have any effect. It's still going to be blocking somewhere along in this WSGI request. There's no way for the server to take advantage of that. Some of the servers use threads, like uWSGI,
Starting point is 00:11:25 but still, it's not nearly the same level of benefit. So there's this article I want, or series, I guess, that's starting to come out here called WSGI is not enough anymore. I'm referencing part one and part two, and part one really lays out the problem. Basically, there are two problems. One is concurrency, right, which I just described. The other problem really
Starting point is 00:11:45 is that HTTP isn't the only protocol anymore. So things like WebSockets and other bidirectional communication and binary stuff are happening. That's also not supported by WSGI, right? So this article and series sort of explores how we solve this with event-driven programming. They're not quite done, they're still working on it, but I thought it was a cool thing. So the next thing that's coming out is talking about libraries to solve the concurrency problem in Python, and then onwards to the other things. So pretty cool. Yeah, that's very interesting. I can't wait for the day when, you know, these things really unlock, because we talk about things
Starting point is 00:12:24 like async and await, and they're pretty cool, but they're really hard to make practical use of. Once the web server requests themselves can participate in these async event loops, then it's on. It just breaks open, and all sorts of amazing stuff can happen.
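To show what "a single serial function call for each request" looks like, here is a minimal WSGI application driven by hand. The `app` callable follows the standard WSGI signature; `call_app` is a hypothetical helper that stands in for the server, and `setup_testing_defaults` comes from the standard library's `wsgiref.util`.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A minimal WSGI application: one blocking call handles one request."""
    body = f"You asked for {environ['PATH_INFO']}\n".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    # Any slow work (database, outbound HTTP) would block right here;
    # the server gets control back only when this function returns.
    return [body]


def call_app(path):
    """Hypothetical helper that plays the role of the server for a single request."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


print(call_app("/hello"))  # ('200 OK', b'You asked for /hello\n')
```

Because the interface is a plain synchronous call, there is no point at which the application can hand control back to an event loop while it waits, which is the concurrency limitation the series describes.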
Starting point is 00:12:38 So I guess I didn't realize that these frameworks couldn't take advantage of WebSockets, or can they with add-on libraries or something? Yeah, you've got to set up some kind of separate server. I can't remember what it's called, unfortunately. But they can send it over, say, we're going to upgrade this to a socket, so send it over to the separate process,
Starting point is 00:12:59 like the separate server type of thing. There's a lot of work to juggle these different protocols right now. So yeah, it'll be nice when that's more seamless. Well, I'll have to follow along with these. This is great. Yeah. And for now we can use things like queues even for a little asynchronous concurrency, drop off a little job and pick it back up. I was looking for a queue, a last in, first out queue. I needed that for a project I was working on. I just needed it as a data structure. I didn't have different producers and consumers. I just had one part of the program where I was collecting stuff
Starting point is 00:13:30 and another part where I had to get it out last in, first out. So I was looking around, and there was an article from Dan Bader, and it's called Queues in Python. And it's a decent, I guess I'd just forgotten about a lot of this stuff. And it kind of goes over how to use queues in Python and how to use a list, how to use the queue library. There's actually a queue built-in library. And collections.deque also is something you can use. The deque is a doubly linked list.
Starting point is 00:14:05 And then it talks about pretty much how to use them. And it's a pretty good article. And it mentions that you can use all of these for last in, first out, but I didn't quite know how to use those. So I went ahead and explored all the different ways to use these three, or just a way to use these three as a last in, first out queue, and threw it in the show notes. So yeah, it's really cool, really simple. I think, you know, knowing about data structures and especially knowing about the built-in ones is really valuable. And I feel like we've been doing Python for a long time, but I still continuously learn about these things. It's good to come back. When you're just using the same data structures all the time and you need something else, going ahead and looking at what's around is neat. I was also curious about timing. So I went ahead on a sample program
Starting point is 00:14:50 and timed all these, with some huge objects I was throwing in there, to see if any of them were faster or slower. And with small objects, they're all kind of about the same. And with large objects, it looks like collections.deque is a tad bit faster for my use. But none of them are really out of the ballpark slower. So to me, the deque has just the best interface, because you can just iterate over it and it looks cleaner. But that was my opinion. Yeah, that's really cool. Thanks for pointing that out.
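The three options just named (a plain list, the `queue` library, and `collections.deque`) can each act as a last in, first out queue. The sketch below is an illustration written for this transcript, not the code from the show notes, and the sample `items` are made up.

```python
from collections import deque
from queue import LifoQueue

items = ["first", "second", "third"]

# 1. A plain list: append() pushes, pop() removes the newest item.
stack = []
for item in items:
    stack.append(item)
from_list = []
while stack:
    from_list.append(stack.pop())

# 2. queue.LifoQueue: put() pushes, get() removes the newest item.
#    It adds locking, which matters only when threads share the queue.
lifo = LifoQueue()
for item in items:
    lifo.put(item)
from_lifo = []
while not lifo.empty():
    from_lifo.append(lifo.get())

# 3. collections.deque: append() and pop() both work on the right-hand end.
dq = deque()
for item in items:
    dq.append(item)
from_deque = []
while dq:
    from_deque.append(dq.pop())

print(from_list, from_lifo, from_deque)
```

All three lists come out as `['third', 'second', 'first']`. For a single-threaded data structure like the one described, the list or the deque is usually enough; `LifoQueue` earns its overhead when separate producer and consumer threads are involved.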
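A timing comparison along these lines can be put together with the standard `timeit` module. This is a rough sketch under stated assumptions, not the sample program mentioned above: it pushes small integers rather than huge objects, the function names are hypothetical, and the numbers will vary by machine and Python version.

```python
import timeit
from collections import deque
from queue import LifoQueue


def drain_list(n):
    """Push n integers onto a list, then pop them all off."""
    stack = []
    for i in range(n):
        stack.append(i)
    return [stack.pop() for _ in range(n)]


def drain_deque(n):
    """Same workload using collections.deque."""
    dq = deque()
    for i in range(n):
        dq.append(i)
    return [dq.pop() for _ in range(n)]


def drain_lifo_queue(n):
    """Same workload using the thread-safe queue.LifoQueue."""
    q = LifoQueue()
    for i in range(n):
        q.put(i)
    return [q.get() for _ in range(n)]


def time_all(n=1_000, number=20, repeat=3):
    """Return the best total time in seconds (for `number` runs) of each approach."""
    candidates = {"list": drain_list, "deque": drain_deque, "LifoQueue": drain_lifo_queue}
    return {
        name: min(timeit.repeat(lambda: func(n), number=number, repeat=repeat))
        for name, func in candidates.items()
    }


for name, seconds in time_all().items():
    print(f"{name:10s} {seconds:.4f}s")
```

Taking the minimum of several repeats is the usual way to reduce noise from other processes. To test with large objects, as described in the episode, the integers pushed in each function could be swapped for bigger payloads.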
Starting point is 00:15:23 All right. I want to sort of close this out with something kind of meta. So on our podcast, I want to talk about a new podcast. So a guy named Mark Weiss created a podcast called Using Reflection, a podcast about humans and engineering. So he started out interviewing Jesse Davis from MongoDB, one of the main Python guys in the space. So there's a really cool interview about him. And if you're thinking about, you want to look at these notable people and how they've become leaders within their companies or within their industry. And you want to sort of explore that journey with them.
Starting point is 00:15:59 It's a pretty cool podcast. So I thought I'd give a shout out to it. I listened to a couple episodes and I like his interview style. It's very conversational and laid back. It's cool. Yeah, it's like you just kick back, grab a coffee with the two guys, and you just don't say anything because they can't hear you. Yeah, well, you can say stuff, but they still don't hear you. Yeah, who knows. That's awesome. All right, so yeah, check out Using Reflection. It's a cool podcast. All right, so I guess that's it for our news this week, Brian. Anything else you got you want to share with the people?
Starting point is 00:16:28 I got nothing this week. No more book writing? Just hanging out at the zoo now, huh? That was fun. If your idea of fun is trying to herd six eight-year-olds around a zoo for a day, then it was fun. Give me a tricky bug. I'll take that instead.
Starting point is 00:16:42 Yeah. So last week I announced my free MongoDB course at freemongodbcourse.com. And that thing has been going super well, like over 5,000 people have taken that course in a week. That's pretty amazing. I have to admit that I was doing your longer Mongo course, and I thought I'd watch this first. So I've started it myself. I'm one of those sign-ups. Oh, cool. You're like, I don't know what that percent is.
Starting point is 00:17:10 Cool. Very nice. Yeah, people seem to be enjoying it, so I'm glad everyone could take advantage of it. I'm glad you put that out there. It's really cool, and people should check it out. Yeah, thanks. All right, well, I guess until next week, Brian. Yeah, talk to you next week.
Starting point is 00:17:23 All right, talk to you next week. Thank you for listening to Python Bytes. Follow the show on Twitter via at Python Bytes. That's Python Bytes as in B-Y-T-E-S. And get the full show notes at pythonbytes.fm. If you have a news item you want featured, just visit pythonbytes.fm and send it our way. We're always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.
