Python Bytes - #157 Oh hai Pandas, hold my hand?

Starting point is 00:00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is episode 157, recorded November 14, 2019. I'm Brian Okken. And I'm Michael Kennedy. And this episode is brought to you by DigitalOcean. So, Michael, we're going to cover a topic that we've covered a little bit before, I think. We covered Cerberus, right? Or Cerberus? Cerberus. Yeah, we covered Cerberus, which is like a validation layer for unstructured data.

Starting point is 00:00:29 So this is built as part of the Eve framework by Nicola, who runs both of those projects. And it's really nice, right? So I get like some JSON posts back to my REST. It's a REST framework. I get a JSON post for some data. I have some models defined. It can tell you whether they're a fit or not. It can tell you what's required. I do think the way you set it

Starting point is 00:00:51 up is a little bit out of band. So Colin Sullivan shot us a note after that said, Hey, that's really cool. You should also talk about Pydantic. Had you heard of Pydantic? I think so, but yeah, tell me more. And it's got a great name. Yeah, it definitely has. Yeah, yeah, it's got a super name. I believe I've heard of it, but I didn't do anything with it. So on Colin's suggestion, I checked it out. And yeah, this is a sweet, simple framework that solves some really nice problems.

Starting point is 00:01:17 And a lot of times with these frameworks, I'm like, yeah, I would love to use this. But at the same time, it's not that helpful. And so I'm not sure I'm actually going to use it. I could just put a little test in my class to make sure this thing parses an int or this name is here, whatever. But this one might convince me to do it because, yeah, this is super, super cool. All right, let me tell you what it is. So it's data validation and settings management using Python type annotations. And it's the type annotations that make me really extra happy. Oh, really? Okay. Yeah. So you know how we've got data classes and you can

Starting point is 00:01:50 have like annotated values there and you get a little validation and whatnot, but this is super cool. So I can just create a class and say it has things like an ID, which is an integer, a name, which equals a default string, a date time, which has a default of None, things like that, right? So basically, either you have type annotations or the thing has a default value, which implies the type, okay? Yeah. And this probably represents some data that's exchanged over REST or something like that, right? Some sort of dictionary. So if I get a dictionary back, then what I can do is I can just star star unpack that dictionary into the object, the class that I've defined, right? So basically keyword arguments, ID equals whatever the value is,

Starting point is 00:02:35 name equals whatever the name in the dictionary is and so on. And it will validate all that using some really simple rules that you follow along. So we've got a class and it has an ID, which is an integer, but has no default value. It has no None. That means the ID has to obviously be an integer, but it's also required. If it's not there, an error will be raised. The name is a string, so it has to be a string,

Starting point is 00:02:58 but because it has a default value, it's optional to pass it. Oh, okay. That's cool, right? The date time, which is a date time field, is not required because it has None as a value, but if nothing's passed, it's just going to be None. So it's an optional date time. That's pretty cool. So some of the reasons that I think this is cool, and they call out on their web page, is that it works automatically with all the IDEs that you already have, right? There's no like special, oh yeah, there's a YAML file that tells me what the schema

Starting point is 00:03:28 looks like for this, or there's a JSON schema that comes back and like, no, it's standard Python with type annotations. So your IDE knows already what all those things are and you don't have to like backfill that, right? So the validation also works for just working with the classes. That's pretty cool, right? Yeah, that's cool. Yeah, it's supposed to be faster

Starting point is 00:03:47 than all the other libraries they tested and they have a link to the ones that they did. It also supports really rich recursive validation. So if you've got like a list or a tuple and maybe like stuff is inside there, right? Or something like that, right? You've got some nested types so it'll actually recursively traverse

Starting point is 00:04:08 the stuff that you're looking for. So it doesn't just test the top level things, it tests the entire object graph and by default the way it works is you derive from the Pydantic BaseModel, which is cool, but you can also use a decorator on a data class, which we talked about.

Starting point is 00:04:24 It's very similar because of the type annotations, and it'll actually put parsing and validation on there for you. Oh, that's neat. Yeah, so if you really want to use data classes, you can make them better with Pydantic as well. Okay. Yeah, simple, right? Yeah, so where do you put it in your,

Starting point is 00:04:36 so do you get it like when you get data in, you validate the data with Pydantic then? That's the thing is you don't even validate it. That's what's so sweet about it. There's not even a validation step. You have the class, and in the notes, I'm putting a user class. So I'll maybe reference that. And then you've got some external data, which is a dictionary. And then you just create the user objects, a user of star star dictionary, and it'll unpack it keyword. The validation happens in the fact that you don't have a dunder init on your class and

Starting point is 00:05:04 it derives from the Pydantic BaseModel. So it uses its dunder init, which theoretically does the validation. Okay, so you can't not validate then. Yes, exactly. Yeah, you just basically, I tried to create the class and either that worked really well or not so much. Oh, that's actually pretty cool. And you get like a JSON response of all the things that were wrong with the validation as part of the exception, I believe. So you can actually go, you know, there's actually three things wrong here, not just, well, the first thing it hit made it crash. So this is obviously useful for like REST APIs and stuff like that

Starting point is 00:05:31 or grabbing external data. But there's a lot of times where we're passing dictionaries around between components, and it'd be good to have some, if there's less trusted components, to have some sort of validation. So this is pretty cool. Yeah, even web forms that get posted back, a lot of times those come back in Pyramid or Flask as dictionaries, right?

Starting point is 00:05:50 If you wanted to map that to a class, you could get validation. There's a lot of places, yeah. Even settings files, right? Yeah. Yeah, there's a lot of people that just throw stuff as JSON or something into a file. And it's user editable also. So you have to validate it because who knows what somebody edited it to.
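To make the Pydantic discussion concrete, here is a minimal sketch of the kind of user class described above. The field name signup_ts, the default name, and the sample dictionaries are assumptions made up for illustration; only BaseModel, ValidationError, and errors() come from Pydantic itself.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int                               # no default: required, must be an int
    name: str = "Jane Doe"                # has a default, so passing it is optional
    signup_ts: Optional[datetime] = None  # optional; stays None when not passed


# External data, e.g. the decoded body of a JSON POST (values are made up).
external_data = {"id": "123", "signup_ts": "2019-11-14 12:00"}

# There is no separate validation step: it happens while the object is built.
user = User(**external_data)
print(user.id, user.name, user.signup_ts)

problems = []
try:
    # Missing id and an unparseable date: two separate problems.
    User(name="No ID", signup_ts="not a date")
except ValidationError as exc:
    # errors() reports every problem found, not just the first one hit.
    problems = exc.errors()
    for problem in problems:
        print(problem["loc"], problem["msg"])
```

Creating the object either succeeds with parsed, typed values or raises a single exception that lists all the fields that did not fit.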

Starting point is 00:06:09 Yeah, absolutely. All right, what you got next for us? I am hopefully doing a favor, adding work to Ned Batchelder. So he posted on Twitter recently that there is changes afoot in coverage.py. So coverage is, hopefully everybody knows, coverage is great for using to tell you how much of your code base your test suites are covering. I mean, that's how it's usually used.

Starting point is 00:06:34 You could potentially do anything to try to measure coverage, but usually it's around a test suite or something. Anyway, so the change is they've added measurement contexts. So while it's collecting data for coverage, it also collects the context of what it was doing, like which test covered which line of code. And that's a lot of data. So he's changed the way the data for coverage is being stored. And it's pretty cool. So I'm going to jump to the conclusion. There's this cool feature. The context feature is very cool. I want to talk about that. But first of all, it is a little bit of a break in the use of coverage. I think the reason is just because

Starting point is 00:07:25 of the way the data is stored. There's a little local database stored, so there's another dependency. It isn't an external dependency, it's a standard built-in dependency, but it's something that some versions of Python don't always have, I guess. So for that reason, he's asking everybody, please try out coverage 5.0 beta 1 and let him know if there's any issues. Right. So basically the idea is go try it and see if what you're doing before still works. If not, let him know real quick before it becomes permanent, right? Right, exactly. And I really want this to become permanent because measurement context is so cool. I tried it out this morning. I'm going to put in show notes. I wasn't really clear on how to download, how to install a beta version of something. So you just do the, like for this, it's pip install coverage==5.0b1. Okay. So we'll put that in the show notes. It's not too bad to install it. And then also, I didn't

Starting point is 00:08:25 put this in the show notes, but one of the other tricks I found out is if you want to know what versions are available to pip install, you can just do pip install coverage==, and then don't list a version. And you'll get an error message that says, I don't know what you're talking about, but here's all of the versions that are available. That's pretty awesome. I didn't know that. Yeah, that's pretty cool. So I tried out a few lines of command line stuff to run coverage on a little dummy file. And sure enough, if I generate the HTML report,

Starting point is 00:08:54 on the right-hand side of the screen, I've got little dropdowns on every line of code to tell me which test covered which line of code. I like that a lot. That's cool. Yeah, very neat. Yeah, super nice.

Starting point is 00:09:07 I look forward to it. Okay. I don't know why I think this is funny. My brain's just not working, man. Will you do the ad read? Got it. Now, this episode is brought to us by DigitalOcean. And I just want to tell you about something brand new

Starting point is 00:09:18 that's gone from beta to general availability, memory-optimized droplets. Droplets are DigitalOcean's words for virtual machines, right? Goes to the cloud, cloud's full of rain, rain droplets, that sort of thing. And if you have some sort of workload

Starting point is 00:09:34 that requires a lot of memory, well, then these things are like super optimized for that. So it has eight gigs of RAM for each dedicated virtual CPU. You can get them with, you know, two or many, many more multi-core systems. So basically, you can go all the way from 16 gigs to 256 gigs of RAM, which is a ridiculous amount of RAM. One thing you do to make your app run faster is to make sure it never

Starting point is 00:10:01 touches the disk, right? So if it could just cache everything, that would be great. So they're really good for things like high-performance SQL or NoSQL databases, large memory caches and indices, indexes, things like that, and just lots of big data and stuff running with large runtime requirements. So if you need between 16 to 256 gigs of RAM and you want to just pay mostly for the memory,

Starting point is 00:10:26 the pricing's optimized around that use case, then check them out at pythonbytes.fm slash digitalocean. They're a big supporter of the show. Speaking of cool stuff, the PSF, the Python Software Foundation Packaging Working Group, actually,

Starting point is 00:10:42 that group of the PSF, they're looking to hire some folks. They're looking for, I think, three developers and maybe a project manager. I can't remember exactly all the details, but quite a few people to make pip better. Like you just said, if you said, you know, pip install coverage, double equals, it will help you, right? So this is supposed to be a much better setup. So the idea is that one of the things that could be improved in pip is its dependency resolver, right? So it's, you know, this package depends on this thing, but other packages also maybe depend on that, but a different version

Starting point is 00:11:17 or, you know, I don't know how often it's happened to you, but I've had the order in which I list stuff in the requirements causing issues because one requires, I don't know, docopt of this version. The other one requires docopt of another version. And how can you possibly install them both at the same time, right? Weird stuff like that. Poetry has noticed this problem and it has a solution to it, but it's around Poetry. And it'd be really cool if that sort of dependency resolution was built in to pip. That'd be great. Yeah, the underlying idea is to make distributing and installing Python

Starting point is 00:11:50 software just more reliable and easier. So funding has been allocated to two contractors, a senior developer and an intermediate developer, that's what it is, to work on developing, testing, and building this feature, the test infrastructure, code review, bug triage, all that kind of stuff. And this is a non-trivial offering. So I believe the senior developer will end up getting 116,000 out of this based on the time they're estimating and the rate. And then the either senior developer or the contractors, I can't remember, get 103,000 each. This is quite significant. Not too shabby. Yeah, that's like a, not just a, hey, I need somebody to work on this for a couple of weeks. That's like a legit thing. So if you'd like to

Starting point is 00:12:35 contribute to Python, work on pip, things like that, just, you know, go check out this link. It shows you how to apply. Very cool. Yeah. So when I work on pandas, Brian, I kind of feel a little bit lost. There's all these operations and I don't use pandas enough to kind of actually know what I should be doing. Often it's in the context of Jupyter notebooks where the autocomplete is slightly less good than PyCharm or VS Code. I could always use some help when I'm working on pandas. How about you? Yeah, I could. And I know people that, there's a lot of people that work in it all the time, but I usually just jump in for some particular use. And I know I don't know the best way to do things.

Starting point is 00:13:10 There's a thing called dovpanda. I think I'm saying that right. Dovpanda. And this was submitted by Dean Langstrom, Langsam, sorry. I think that it's his project, but essentially it's an overlay on, I'm just going to read his thing. He says directions over pandas. Directions are hints and tips for using pandas in an analysis environment. Dovpanda is an overlay for working with pandas.

Starting point is 00:13:39 And so the idea is, if you have this installed, and you're working in a Jupyter notebook and you start typing stuff, you start doing pandas operations, it looks at what you did and provides hints, and it pops up in little windows in your notebook to give you hints on, I think you're doing this, but there's a better way to do it, or giving you tips. So it's like Clippy for pandas and Jupyter. Yeah, it's definitely sort of, but instead of having just one Clippy that pops down, they're in your notebook, so you don't have to deal with them right away, but you can go back and improve your use of pandas within the notebook. It's pretty, yeah, it actually looks really helpful. So the example they have,

Starting point is 00:14:24 they've got a bunch of pictures on the GitHub repo you all can check out. But like, for example, there's one where someone's calling pd.concat and taking two data frames and specifying the axis equals one. And then the little panda pops up and says, all data frames have the same columns,

Starting point is 00:14:40 which hints for concat on axis zero. You specified axis one, which may result in unwanted behavior, and it'll show you the code. Or after concatenation, you're going to have duplicate column names, pay attention, and things like that. It's got a bunch of great little tricks. And then

Starting point is 00:14:58 you know how you mentioned Kevin Markham from DataSchool.io and his tips? You can type dovpanda.tip and it'll pull up a Kevin Markham tweet. That's pretty cool. Like inside your notebook, it'll pull up like some random tip.
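The pd.concat hint mentioned above is easy to reproduce with plain pandas. This is only a sketch of the situation the hint warns about, with made-up frames and column names; it does not use dovpanda itself.

```python
import pandas as pd

# Two small frames with identical columns (made-up data).
first = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})
second = pd.DataFrame({"name": ["c", "d"], "value": [3, 4]})

# axis=1 glues the frames side by side, so the column names repeat.
side_by_side = pd.concat([first, second], axis=1)
print(list(side_by_side.columns))

# axis=0 stacks the rows, which is usually what is wanted
# when every frame has the same columns.
stacked = pd.concat([first, second], axis=0, ignore_index=True)
print(stacked.shape)
```

With duplicate column names, selecting side_by_side["name"] returns two columns instead of one, which is the kind of surprise the hint is meant to catch early.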

Starting point is 00:15:11 Yeah, that's pretty cool. Yeah. Circle there. And if you like, you can use it, apparently you can use it, not even just in notebooks. So there's a command line mode where you can set the output to be,

Starting point is 00:15:21 you know, there's no inline output to go to. So you can tell it to print the output to just, you know, there's no inline output to go to. So you can tell it to print the output to just standard out or to a display or to somewhere else. That's nice. So if you're using, you want to have these sorts of tips, but you're not using notebooks, you can still get them. So yeah, very cool. This next one is really simple, but I think some folks will find it super useful. You know, maybe you've picked up that project from someone else at work and they're not following all the best Python practices. You see a bunch of import stars all over the place

Starting point is 00:15:54 and you're like, man, didn't somebody tell these people that import star is not worth it, right? That there's all these potential drawbacks. So enter removestar. Removestar is a command line app you can run or command you can run. And you point it at either a module, a file, a directory, something like that, and it will go through and by default, it'll just find the issues where import star is done. And then it will look at the actual files and say, well, you said import star, but you're actually just, you know, like from collections import star. Maybe you're actually just using namedtuple and Counter or something like that.
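As a sketch of the rewrite being described, here is a small made-up module that originally used from collections import star, shown after the star has been replaced by the names it really uses. The Point type and count_words function are hypothetical, invented only for this illustration.

```python
# Before (the kind of line removestar flags):
#     from collections import *
# After: import only the names this module actually uses.
from collections import Counter, namedtuple

Point = namedtuple("Point", ["x", "y"])


def count_words(text):
    """Count how often each whitespace-separated word appears."""
    return Counter(text.split())


origin = Point(0, 0)
counts = count_words("spam eggs spam")
print(origin, counts["spam"])
```

The explicit form tells both the reader and the IDE exactly where Counter and namedtuple come from, which the star import hides.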

Starting point is 00:16:34 Maybe that's it. Anyway, you're just using one or two things and it'll say, you know what? You could replace that line with from collections import namedtuple. Right. And it could suggest that, or you could actually give it a command to say, no, just change all my files, fix it. Yeah, this is very cool. Yeah, it's great. So it's not that it just says import star is bad, it actually figures out what of that star is being used and what you should actually write, and then will write it for you. Yeah, so my

Starting point is 00:17:01 normal operation when I see something like this is just to comment out the import statement and see what breaks. And that's not the best way to do things. So this is way better. I like it. Yeah, yeah. It reminds me a little bit of Flynt, F-L-Y-N-T, which will take all your strings and rewrite them as f-strings. This will take all your import stars and rewrite them as proper specific imports. OMG, I totally forgot about Flynt. We've got a whole bunch of code that we wrote for 3.5 that still has all the old stuff in it. So yeah, I gotta use it. Well, it's about to get a whole lot better. Hit it with Flynt. It's so good. Yeah, definitely. Awesome. All right, well, that's it. Removestar's not a whole lot to it. It's just a great little command line tool you can use to make your Python code better. Yeah. So the last thing I want to talk about today, actually, oddly enough, we didn't plan

Starting point is 00:17:52 this, is another, it came from Brian Rutledge too. So the PSF thing that we talked about, the hiring developers, came from him too. So we've got two stories from Brian. So thanks, Brian, for helping us out. Yeah, absolutely. Thanks Brian. Double thanks. Well, so one of the things that Brian's working on is a pytest plugin called pytest-quarantine. This is so cool. Hopefully all your tests pass, but let's say you got really fantastic, you got into testing and you started writing a bunch of tests, and you put it on a code base, and you got a bunch of failures. You know you're going to fix them, but you're not going to fix them right away. So what do you do? And the idea with pytest-quarantine is it saves a

Starting point is 00:18:34 list. So you run it once, and you tell it to save a list of all the failing tests. And it saves it somewhere, and you can throw it in Git or something, store it, and then you run it again with that list and it automatically marks all of the tests that have failed in the past as xfails. Now this is something you can do manually to say, I know this is going to fail, just run it as an xfail instead, and it separates it from a failure. You know, there's arguments of whether that's good or bad, but it's very useful so that you can still use your suite to find new failures while you're working on the old ones. Anyway, this is a nice little extra tool. I think it's super cool. I also wanted to bring this up because he sent me this really nice email.
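For reference, this is roughly what marking a known failure by hand looks like in pytest, which is the step the plugin is described as automating from a saved list. The parse_price function and both tests are hypothetical; only the pytest.mark.xfail marker is pytest's own.

```python
import pytest


def parse_price(text):
    """Hypothetical function under test; it does not handle commas yet."""
    return float(text)


def test_plain_price():
    assert parse_price("19.99") == 19.99


@pytest.mark.xfail(reason="known failure: thousands separators, fix scheduled")
def test_price_with_comma():
    # float("1,999.00") raises ValueError today, so pytest reports this
    # test as an expected failure instead of a regular failure.
    assert parse_price("1,999.00") == 1999.0
```

Run under pytest, the first test passes and the second is reported as an expected failure, so a newly broken test still stands out in the results.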

Starting point is 00:19:21 So apparently I met Brian a couple of times at PyCon in Cleveland and he said he started out as a complete pytest newbie and bought my book, started working through it, loved pytest and then helped his company to adopt pytest and then wrote this plugin. And he wrote it at work and convinced his company to be able to release it as open source. So that's super cool. Yeah, that's really great. Yeah, good work, Brian. This sounds like super useful. You know, you've got to make some huge change, if it breaks 50 tests, you can't start solving all 50 at once, right? You got to like chop your way out of them. So yeah, exactly, quarantine them and then just, you know, take them one at a time. So yeah, I like it. I mean, there are ways in which you can deal with this, like in PyCharm you could say run only this test or run certain ones,

Starting point is 00:20:11 but, you know, it doesn't help you on continuous integration or something like that, right? So yeah, I think this is great. And one of the things I wanted to bring up also is I've dealt with this in the past on a temporary basis, of course, where you've got, for some reason, a breaking change that fails some things, you're working through them. And we have occasionally, if there's like a known failure, the fix is scheduled, right? We know about it, we're going to fix it, but it's not going to be fixed for like three weeks. You can add xfail to the test itself. But one of the issues with that is to add the xfail mark, you edit the test file. So one of the benefits of this

Starting point is 00:20:48 is you're not actually editing the test file. You're editing a different file that marks those. So that's kind of cool. Right, you don't want those changes to show up in Git saying, well, we made all these changes to these tests, but actually, no, we're just trying to fix something else

Starting point is 00:21:00 and get them out of our way. Yeah, I like it. Yeah. All right, well, that's it for all of our main items. Brian, you got anything extra you want to throw out there? I do not. How about you? I've got some pretty cool news. So I recently decided to go through the

Starting point is 00:21:12 effort of figuring out how much energy all of our services and servers use, right? So for like delivering Python Bytes and Talk Python and Talk Python Training courses and all that stuff. And I figured out how much that was and went out and bought renewable energy credits to offset all the carbon from all of our infrastructure. Wow, that's neat. Yeah, yeah. So I'm gonna keep doing that going

Starting point is 00:21:36 forward so um not a huge huge amount but it's uh you know i think a good signal for all the other companies out there as well to say look if this podcast or these podcasts can be carbon neutral for their server structure, why can't we, right? Yeah. Yeah, cool. So anyway, small, but hopefully can trigger some good change. All right, ready for a joke? I am so ready for a joke. I need it this week.

Starting point is 00:21:59 Well, it's more science than it is programming, but I think our audience will generally like it. So I'm going to tell the joke and then explain the joke because I'm not sure everyone will know, but I think a lot of us will get it. And jokes are so much more funny if you explain them also. I know. Absolutely, they are. So imagine a time not too long ago, Dr. Heisenberg from Quantum Mechanics fame, he's driving down the highway and he gets pulled over for speeding. The policeman comes over, the officer says, excuse me, sir, do you know how fast you were going? Heisenberg pauses for a moment and then answers, no, but I do know where I am.

Starting point is 00:22:38 I love that. That's so funny. Yeah. Thanks. Yeah. So the Heisenberg uncertainty principle basically says that the position and velocity of an object cannot both be measured exactly at the same time, not even theoretically. You can know one or the other, but not both. So yeah, he knows where he is. Yeah. Funny. Pretty good.

Starting point is 00:22:57 All right. Well, thanks for being here. Good to be back together after taking off and hiding in Florida for a while now. Now we're back on the usual track. Yeah. Yeah. All right. Have a good one. You too.

Starting point is 00:23:06 Bye. Bye. Thank you for listening to Python Bytes. Follow the show on Twitter at Python Bytes. That's Python Bytes as in B-Y-T-E-S. And get the full show notes at PythonBytes.fm. If you have a news item you want featured, just visit PythonBytes.fm and send it our way. We're always on the lookout for sharing something cool.

Starting point is 00:23:24 This is Brian Okken. And on behalf of myself and Michael Kennedy, thank you for listening and sharing this podcast with your friends and colleagues.

