Python Bytes - #108 Spilled data? Call the PyJanitor

Episode Date: December 11, 2018

Topics covered in this episode:
[0:45] pyjanitor - for cleaning data
[3:12] What Does It Take To Be An Expert At Python?
[5:38] Awesome Python Applications
[8:26] Django Core no... more
[12:06] wemake django template
[15:16] Django Hunter
Extras
Joke
See the full show notes for this episode on the website at pythonbytes.fm/108

Transcript
Starting point is 00:00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is episode 108, recorded December 10th, 2018. I'm Michael Kennedy. And I'm Brian Okken. And this episode is brought to you by DigitalOcean. Check them out at pythonbytes.fm/digitalocean. We'll tell you more about that later. Right now, I wonder how you're doing, Brian.
Starting point is 00:00:20 I'm doing great. Yeah? Do you find that sometimes you end up with messy data? It's kind of, you've got to clean it up. You're like, ah, gosh, not again. Yeah, you've got empty spots and bad stuff. Yeah, anyway, data. Yeah, somebody spilled a string where a number was supposed to go.
Starting point is 00:00:39 Yeah. Maybe there is something you can do. Yeah, you can get a janitor. A PyJanitor. Clean up your janitor. Yeah, first project we're going to talk about is PyJanitor. It's a package for cleaning up your data. Originally it was a port of an R package called janitor, but now it's grown from that.
Starting point is 00:00:58 So it's both for cleanliness of data, but also just a really clean interface and convenient routines. So it's kind of for anybody working with data that can have problems with it. There's a whole bunch of stuff involved with this. So some of the functionality includes cleaning up column names, which I'm not sure why you would have bad column names in the first place,
Starting point is 00:01:23 but if you're pulling it from somewhere else. Yeah, like a lot of times people load CSVs into pandas DataFrames and things like that, I think. Yeah, okay, so cleaning those out. Removing empty rows and columns and identifying duplicate entries. There's some stuff that can just happen with data like that. But it has a whole bunch of other cool things, like telling your system how to deal with empty values
Starting point is 00:01:44 and expanding columns and coalescing multiple columns into a single column and a whole bunch of stuff like that. Part of it is dealing with messy data. But the other thing is to try to keep your code clean. So on the other side of it, it has more of a functional programming style of using it. And I'm not going to try to really talk about this too much, but we have a code snippet in our show notes where it shows kind of how you would deal with DataFrames and doing things like dropping columns and stuff within pandas, and then how that would look in PyJanitor code. And it just makes it a lot more maintainable, I think. It's neat. Yeah, I really like this. It looks super handy. It's like a set of utilities on top of pandas, which is great.
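To make the chaining style described here concrete, the following is a minimal sketch in plain Python. The `Table` class and its methods (`clean_names`, `remove_empty`, `drop_column`) are hypothetical names chosen for illustration; this is not pyjanitor's or pandas' implementation, only the general shape of a fluent interface in which every step returns a new object that the next step can be called on.

```python
# Hypothetical sketch of a fluent (method-chaining) interface.
# Table and its method names are illustrative assumptions; they are not
# taken from pyjanitor or pandas.

class Table:
    def __init__(self, rows):
        # rows: an iterable of dicts mapping column name -> value
        self.rows = [dict(row) for row in rows]

    def clean_names(self):
        """Return a new Table with lowercase, underscore-separated column names."""
        return Table(
            {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}
            for row in self.rows
        )

    def remove_empty(self):
        """Return a new Table without rows whose values are all None or ''."""
        return Table(
            row for row in self.rows
            if any(value not in (None, "") for value in row.values())
        )

    def drop_column(self, name):
        """Return a new Table without the named column."""
        return Table(
            {key: value for key, value in row.items() if key != name}
            for row in self.rows
        )


raw = Table([
    {"First Name": "Ada", "Unused ": "x"},
    {"First Name": None, "Unused ": None},
])

# Each method returns a Table, so the steps read top to bottom as a pipeline.
cleaned = raw.clean_names().remove_empty().drop_column("unused")
print(cleaned.rows)  # [{'first_name': 'Ada'}]
```

Because each method returns a new `Table` rather than modifying the original, `raw` is left untouched and intermediate stages can be inspected separately if needed.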
Starting point is 00:02:29 And I like how they describe this feature you just talked about. It's a cleaner, method-chaining, verb-based API for common pandas routines, otherwise known as a fluent interface, if you're looking for one word there. But that's great. So, yeah, I think it's a really nice way to work with it. It looks really approachable. And I do like the fluent interface. Like I really, really wish more things like that operated that way. Yeah. For instance, a lot of the functions return DataFrames so that you can just chain on a new function to do
Starting point is 00:02:59 multiple stages of a workflow. Now this is all around pandas. So it's not just general data cleaning; you've got to be working with pandas. But I think a lot of people who do that kind of work probably are. So it seems really helpful. Yeah. So would you consider yourself an expert at Python? No, I know enough to know what I don't know. Yeah, I think this is one of the things where it's like, there's always stuff that you don't know. So you don't ever necessarily feel like an expert. Well, in this field, I'm always researching stuff from people who know more about it than me. But all in all, I think I am. Yeah, I would say that you are. So I think there's a really interesting presentation done by James Powell. This was recommended to us by one of
Starting point is 00:03:39 our listeners. I just don't remember who, so I can't give them credit, but thank you to whoever sent this in. And this is a presentation at PyData 2017 by James Powell. And it's a YouTube video, and it's not just a little light, you know, here's my quick rundown of the five things. It's like an hour and a half sort of deep dive into what it takes to be an expert. So basically, James says, hey, look, it's pretty easy to be competent with Python, right? You can learn a couple of things, and whatever other programming language you use, you can kind of make Python do that. But to really understand it properly and take full advantage of it, write Pythonic code, things like that, is a whole lot harder.
Starting point is 00:04:21 So he runs through some of the things that he thinks people should know, and it's really focused at the maybe advanced-beginner, early-intermediate type of developer who can do stuff with Python but maybe stopped learning about the language and the features when they got whatever they were trying to make work working. Oh, good. Yeah, so it covers things like the Python data model, you know, otherwise known as the magic methods or dunder methods, metaclasses, a bunch of other stuff. And it's really nicely done. You know, I'm not a fan of presentations that are like, hey, here's seven slides and me talking about it. Like, woohoo. You know, he just fires up an editor and says, I have no slides. The editor is going to be the presentation. Let's start talking about these
Starting point is 00:05:03 things and just does it from scratch, which I think is a real genuine way to do it. So well done. I'm going to have to check that out myself. Yeah. Yeah, I've watched some of it. I haven't watched all of it. I've only watched like maybe half, but definitely watched enough to recommend it to folks. If you feel like you're in the stage, like, am I an expert? I'm not sure. Watch this and, you know, you'll probably get some things reinforced, and others, maybe feel like, yeah, I knew that. That's great. Yeah. I would even say it was awesome. That's very awesome. There's a lot of awesomeness in Python, and on GitHub, there's quite a few different awesome lists.
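As a small illustration of the kind of language feature that separates working Python from Pythonic Python, here is a sketch of the data model (the "dunder" methods the talk is said to cover). The `Playlist` class and its contents are made up for this example and are not taken from the talk; the point is only that implementing a few special methods lets a custom object work with built-ins such as `len()`, indexing, `in`, and `for` loops.

```python
# Illustrative only: Playlist is a hypothetical class, not an example from the talk.

class Playlist:
    def __init__(self, titles):
        self._titles = list(titles)

    def __len__(self):
        # Called by the built-in len()
        return len(self._titles)

    def __getitem__(self, index):
        # Called for playlist[index]; also enables iteration and "in"
        # through Python's sequence protocol when __iter__ is not defined.
        return self._titles[index]

    def __repr__(self):
        # Called by repr() and shown in the interactive interpreter
        return f"Playlist({self._titles!r})"


episodes = Playlist(["pyjanitor", "awesome lists", "Django Hunter"])

print(len(episodes))                   # 3
print(episodes[0])                     # pyjanitor
print("awesome lists" in episodes)     # True
print([title.upper() for title in episodes])
```

None of the calling code mentions `Playlist` methods by name; the built-in syntax dispatches to them, which is what makes such classes feel like native Python objects.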
Starting point is 00:05:37 And so that's what I want to talk about today is another awesome list. And this one is called Isosceles Triangles. No, it's called Awesome Python Applications. It's kind of just a way for you to try to highlight a bunch of different cool applications. Because if you're looking for packages to base your own project on, you can look at PyPI. But that's not as easy to do with applications because they don't exist in PyPI. So that's why this has been created. There's quite a few categories already, and Mahmoud Hashemi has started it,
Starting point is 00:06:12 and he wants people to help him out and fill this in because it's kind of hard to find applications sometimes. So these are all applications written in Python that are open source that you can look at how they're doing things. Yeah, I really like this because so often people say, well, I'd like to use Python for this project, but to sell this to my teammates and my manager or my company, it would be great to say, well, YouTube is written in Python. And I know you think Python doesn't scale, but I'm sure we're doing less than a million requests a second.
Starting point is 00:06:42 So we'll probably be okay also. Having the examples for those kinds of comparisons is really great. So this is a little bit like that. There's a bunch of stuff for, like, biology, you know, and, like, cell profilers and things like that, right? Yeah. And even, like, I had to look this one up, ERPs: enterprise resource planning. No, I don't need one of those, but cool, it's there. But a lot of these, one of the things I like about this is there's a lot of custom applications that people end up writing, and they know that their problem space is very specific. And instead of writing everything from scratch, you could take one of these open source projects and fork it or customize it for your own
Starting point is 00:07:22 need. And that's one of the benefits of open source, of course, but good starting point. Yeah, that's awesome. I really like this again. Well done, Mahmoud. And it's just, it's cool to have these examples out there, you know? Yeah. Do you know what else is cool? I do. And if I had a cool application, I would like to put my application there in DigitalOcean. Absolutely. So DigitalOcean is sponsoring the show. They've been sponsoring most episodes of Python Bytes and they're big supporters of it. So thank you to them. We use them for some of our infrastructure and it's working out great. One thing I want to highlight this time around is their early access Kubernetes project. So if you're doing anything with Docker
Starting point is 00:08:01 and Kubernetes and things like that, they have some special tools for deploying and managing your containers in the cloud. So just go over to pythonbytes.fm/digitalocean. You can sign up there or go over to the products and just pick Kubernetes and get started on that. There's tons of other stuff that you can do as well, but the Kubernetes work they're doing is quite cool. Very cool. Nice. Indeed. So the next one I want to talk about is something we haven't covered a ton on the show, but I think has some interesting shadows and parallels with Python itself. And that is some governance around Django itself. So there's an article called Django Core No More by James Bennett. Okay. So this is not core as in like some library.
Starting point is 00:08:47 This is core as in core developers. So Django has been around for a long time, 2005 onward, I believe. And it's obviously a very polished professional web framework. One of the most, if not the most, you know, it's in there fighting with Flask for that title, but one of the most popular Python frameworks. Lots of amazing apps are built on it. But what they're finding is, actually, Django as an open source project is not recruiting enough active contributors. That's surprising, right? Yeah, it is. They said one of the reasons they think this is not working so well is they feel like
Starting point is 00:09:24 there's these people called Django core developers and then there's everyone else. And if you're not a core developer, well, you probably don't have any business messing around with Django or submitting any fixes or anything. Maybe you'll tell a core developer and they can go do it, right? But not do it yourself. So the proposal in summary is more or less to abolish this concept of a core developer altogether. Okay. Okay? So that when people come to look at Django, they don't go, oh, there's this special group of, like, selected core developers and then everyone else.
Starting point is 00:09:56 So instead, what they found was, in practice, these core developers all had the commit bit. They could just commit straight to the repo and just have stuff happen. But no one was doing that. They were all creating pull requests and having a conversation around their changes anyway. Yeah. And that's how somebody would make a contribution to Django from the outside. So they said, let's have a more spread out, even way of talking about people who contribute to Django so that people are more likely to come and make contributions.
Starting point is 00:10:29 Okay, so they'll still have some sort of process for deciding on which pull requests to do, right? Yeah, so now they're going to have two different groups of people who are formalizing this stuff. There's mergers and releasers who would respectively merge PRs and then package it up and release it. So these are more like bureaucratic roles, like sort of finalizing it, right? But the idea is to have PRs
Starting point is 00:10:55 and open discussions around issues and PRs. And then these folks kind of say, yeah, okay, we're all good with this. Interesting, okay. Yeah, so it's a little bit of a parallel of Guido stepping back and saying, okay, everybody, you guys got to spread out some of this decision making and not just, you know, leave it all on my back. Yeah. I like what they're doing. I also like doing a lot of this stuff in the open and having the governance models be sort of an open
Starting point is 00:11:18 discussion so that different groups can learn from it. So like, for instance, I was listening to your interview about Sanic, and they were talking about basically still figuring out how to govern the Sanic project. And so doing all this in the open and having everybody be able to give feedback and stuff is cool. Yeah, it's definitely cool. Now, I don't believe this is the way that things are. This is a proposal for the way that James and folks want this to be. So kind of take it in that sense, right? This is not an official decision as far as I know, but this is the proposal. Okay, neat.
Starting point is 00:11:52 Yeah, cool. Speaking of Django, what do you got? What's next? Yeah, I wanted to shoehorn this in. I think somebody mentioned this on Twitter. Again, I can't remember who, sorry, but thanks to everybody for giving us tips on things. There is a Django template that is called the wemake Django template. I think,
Starting point is 00:12:12 actually, I don't really know who wemake is. I think wemake is a group that does like customer websites and stuff. Let me just say really quick, this is not templates as in Django templates or Jinja2 or Chameleon. This is like I'm making a project from scratch. It generates the project structure, like a project template, not a web HTML template, right? Right. Okay. I guess it should be called a cookiecutter because it is based on Cookiecutter. More or less.
Starting point is 00:12:36 Yeah. So it's based on Cookiecutter. I'm sure everybody's familiar with Cookiecutter. You use it to start a project and it pulls stuff off of GitHub and initializes your project and then asks you a bunch of questions. But it has a whole bunch of really cool things that you might not actually think to do right away in a Django project. They're saying that it's more for larger projects,
Starting point is 00:13:00 but I'm sure that a lot of these, you could do them for smaller projects too. But it uses a system called Dependabot, which I hadn't heard of before. But it's one of those systems to keep your dependencies up to date. It's got Poetry for package management, which is kind of neat. pytest for testing, of course, that's awesome. That's one of the reasons why I think this is neat, because Django doesn't do pytest automatically. So having somebody initialize that and set it up for you is cool. And then some of the other things are mypy for static typing, pre-commit hooks already set up, flake8
Starting point is 00:13:35 and an extension to the style guide already built in, so you can use that as a template for your own style guide, and a whole bunch of other cool things like Docker integration already, GitLab CI for building and testing, and then something I hadn't heard of before, which is Caddy, which is, um, I'm probably going to get this wrong, but I think it's something to do with secure web sockets or something. I don't know. HTTPS, whatever that is. Sounds good. Yeah, I don't know what Caddy is either.
Starting point is 00:14:09 I should check it out. But it looks pretty cool. And yeah, I think if you're creating Django projects, or any form of web project, there's some for Flask, there's some for Pyramid, there's some for Django. I think looking at these more full-featured, more structured starter cookiecutters is really valuable. And I think actually the biggest value comes to people using Flask. Yeah. By the way, the reason I
Starting point is 00:14:30 say that is Django already has a structure, right? There's a lot of structure, like static files go here, whatever. A lot of stuff is set up when you create a site with Django. Same thing with Pyramid; you already use Cookiecutter templates. But Flask is like, well, you create a file and then you're on your own, you know? So all that structure is not anywhere to be seen, but it's still going to have to exist on real apps eventually. And so having some projects that you can follow, I think it's really great. Yeah. So that's a good segue, not a segue into the next one, but I would love it if people would share with us some of their favorite Flask cookiecutter starter projects.
Starting point is 00:15:05 Yeah, for sure. We can give them a little shout out in the extra section or something. So you want to make it just a straight three of a kind for Django in a row here? Let's just wrap it up with Django. Yeah, we're already here. So you've gone and you've created your project with one of these Django templates. You've got it working. You've done some testing.
Starting point is 00:15:24 Maybe something wasn't working. So you flipped it into debug mode. That's cool. Maybe set some other stuff. And then you're like, all right, ready to push it out. It has like, this template you told me about here, it already has like integrated deploy steps into the CI build. So that's pretty cool. You just type deploy and boom, off you go. And then a little bit later, something starts happening to your AWS account or your database records that is not so good. That might be because you left the debug mode or some other setting on that exposes all sorts of information. So there are ways to run Django that it's helpful for development, but then you obviously don't want to share that information with everyone else. So there's this project called Django Hunter, and it looks for
Starting point is 00:16:09 insecure Django sites. Okay. So if you deploy your app, you can point this thing at it and ask it, you know, what's the status of this thing here. So the person who wrote it said, why did we create this? Well, it's a tool to help identify incorrectly configured Django apps that are exposing sensitive information. For example, in March 2018, there were 28,165... it's a weird way to write it, it says 28,165 thousand Django servers. Is that 28 million? I'm just going to say there's a lot of Django servers that are exposed on the internet showing off things like their AWS keys, their database passwords and connection strings, et cetera, that you don't want. So there's this cool tool called Django Hunter, and you can basically point it at your projects and it will tell you if something's going wrong with them.
Starting point is 00:16:55 That's cool. I love projects like this because Python is so easy to get started on things. You can, I guess, jump into the deep end before you're quite ready, and having tools like this helps you jump in safely. Yeah, it's good. Absolutely. So, you know, it's easy for people to say, well, that was sure stupid. You got hacked because you didn't, you know, set the debug mode to false. Well, if you're struggling to figure out, like, what does deployment mean? Like, I can barely get this thing to run on a web server, and I'm trying to understand Linux and databases and firewalls. Like, it's pretty easy to overlook that kind of stuff when you're struggling to just make the thing work, right? There's a lot of these settings where you're like, I just want to test this out and show it to somebody. You know, I haven't been running
Starting point is 00:17:39 Django for 10 years. So there was actually a conversation either on Twitter or on Reddit about this. And somebody said, yeah, this is great. Like this guy jumped in and says, hey, I'm probably one of those, you know, 20,000 servers. One of my early projects that's on Heroku accidentally exposed my AWS password, and all hell broke loose. The problem is, as a beginner, it's not obvious how to separate development and production settings and keep that stuff out of your public repo, of course. Yeah, the other thought was, somebody said, you know, there's a reasonable argument to be made that debug should be set to false by default. If you turn it on, then maybe you know about it. So you know to turn it off. But if you never turn it on, how do you know, right? There's a setting, there's a huge
Starting point is 00:18:23 comment right by where the setting is. It says, never put this in production with debug equals true, but it's in a settings file you might not ever open. So if you don't look at it, you know, that's bad. So anyway, there's some interesting thoughts around what to do to Django to make it better, but certainly having a tool to tell you if something is wrong is good. Yeah. Okay. All right. So Django Hunter is for those Django developers or DevOps running Django. People running Django servers. Yeah.
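To give a rough idea of what checking a site for exposed debug output might involve, here is a minimal sketch. It is not how Django Hunter works (the episode does not describe its internals). The function name and the marker strings are assumptions: the markers are based on the wording Django's debug error pages commonly use, which may differ between Django versions, and the sample HTML below is invented for the example.

```python
# Hypothetical heuristic, not Django Hunter's implementation.
# The marker strings are assumed wording from Django's debug error pages
# and may vary between Django versions.
DEBUG_PAGE_MARKERS = (
    "<code>DEBUG = True</code>",
    "Django settings file",
)


def looks_like_django_debug_page(html: str) -> bool:
    """Return True if the response body contains every assumed debug-page marker."""
    return all(marker in html for marker in DEBUG_PAGE_MARKERS)


# Invented sample bodies standing in for real HTTP responses.
debug_body = (
    "<p>You're seeing this error because you have <code>DEBUG = True</code> "
    "in your Django settings file.</p>"
)
normal_body = "<h1>Not Found</h1><p>The requested resource was not found.</p>"

print(looks_like_django_debug_page(debug_body))   # True
print(looks_like_django_debug_page(normal_body))  # False
```

In practice the HTML would come from requesting a URL that is expected to fail (for example a path that does not exist) on a site you own or are authorized to test.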
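One common way to address the concerns raised here (debug mode left on, secrets committed to a public repo) is to read such values from environment variables and default to the safe choice. The sketch below shows what that could look like in a settings module. The helper `env_flag` and the variable name `DJANGO_DEBUG` are hypothetical names chosen for this example, not something prescribed by Django or mentioned in the episode.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment; fall back to default if unset."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


# DJANGO_DEBUG is a hypothetical variable name. Debug stays off unless a
# developer explicitly turns it on in their own environment.
DEBUG = env_flag("DJANGO_DEBUG", default=False)

# Secrets come from the environment rather than from the repository.
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")

print(f"DEBUG is {DEBUG}")
```

With this arrangement a fresh production deployment that sets nothing runs with debug off, and turning it on locally is a deliberate step such as exporting `DJANGO_DEBUG=1` in the development shell.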
Starting point is 00:18:52 That was good. And that was all of our items. I want to first give a quick shout out to you, Brian, in our extra section. Thanks for having me on your show. And I blogged about that. So I put a link to the blog. But we had a great time talking about what it takes to be a good podcast guest and how to prepare for that, which is more broad than just podcast guests, I guess.
Starting point is 00:19:09 Yeah, and we've actually already gotten a whole bunch of positive feedback on that episode. So I'm glad we did it. Yeah, great. Anything else extra? You know, I was just thinking Christmas is coming up. Just, you know, at least in the United States, there's this weird tradition, but it is a thing where, like, in shopping malls, like, a Santa will be hired, a Santa Claus, and will sit there. And there's typically photographers around. And, like, parents will bring their children to the Santa.
Starting point is 00:19:37 And the purpose is the child sits on the lap of the Santa, asks for something probably totally unreasonable, and they take pictures of it, of the whole situation, and hopefully the kid doesn't cry and get afraid of Santa. So you have a good version of this, right? Yeah, I'm going to read it out loud because it's a comic, but it's hilarious. And it was posted by ChangeLog on Twitter. This little girl is sitting on Santa's lap, and she says, for Christmas, I want a dragon. Of course, Santa says, be realistic. Okay, I want enough donations to support my open source work. With a response of, what color do you want your dragon?
Starting point is 00:20:15 That's so awesome. I really love it. It's sad that it's true, but yeah, it's pretty awesome. Be realistic. What color do you want your dragon to be? All right, well, I thought I'd throw one in here for you as well. It has nothing to do with any seasonal stuff. So this is good year round.
Starting point is 00:20:31 Okay. Has more to do with race conditions, deadlocks, and that sort of weird timing problems you run into with multi-threading. Yeah. So you've heard the joke, why did the chicken cross the road? Which has all sorts of weird answers, but sometimes just to get to the other side, right? Why did the multi-threaded chicken cross the road? I don't know. Why? Road the side get to the other of the two. Ask me again. Why did the multi-threaded chicken cross the road? The side of to the road other get. It's always ground love. It's always different.
Starting point is 00:21:00 I love it. That concludes the joke section, I suppose. Yeah. So yeah, we'd also like, we're both sort of silly people, and we'd like to have some feedback as well from people to see whether or not we should keep a joke or two in the episodes, or actually just whether or not we should. So that'd be great. Yeah, sounds good. And if you have good jokes.
Starting point is 00:21:20 Yeah, send them. Send them. All right. Well, Brian, thank you for doing this, and everyone, thank you for listening. Thank you. Bye. Thank you for listening to Python Bytes. Follow the show on Twitter via @pythonbytes.
Starting point is 00:21:30 That's Python Bytes as in B-Y-T-E-S. And get the full show notes at pythonbytes.fm. If you have a news item you want featured, just visit pythonbytes.fm and send it our way. We're always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.
