Python Bytes - #259 That argument is a little late-bound
Episode Date: November 17, 2021. Topics covered in this episode: pypi-changes, late-bound argument defaults for Python, pandas.read_sql, pyjion, tips for debugging with print(), SHAP (and beeswarm plot), Extras, Joke. See the full show notes for this episode on the website at pythonbytes.fm/259
Transcript
Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
This is episode 259, recorded November 17th, 2021. And I'm Brian Okken.
I'm Michael Kennedy.
And I'm Renee Teate.
Well, thanks, Renee, for joining us today. Can you tell us a little bit about who you are?
I'm the Director of Data Science at HelioCampus. And a lot of people know me as Data Science Renee or Becoming Data Sci on Twitter.
So that's where a lot of people follow me. And then I started with, I had a podcast that's not
actively recording, but it's called Becoming a Data Scientist Podcast. So some people listening
probably know me from that as well. Cool. Yeah. Awesome. You were doing a bunch of cool stuff
there and any chances of maybe going back to podcasting?
It's definitely still open.
I've always told myself this is a pause, not a stop.
It's just an extended pause.
So yes, hopefully I will get back.
It's hard to keep going, isn't it?
I mean, life gets in the way and then you get busy.
I'm always so impressed with those of you that have hundreds of episodes, very consistently
recorded.
Well, Brian makes me show up every week.
Well, yeah, it definitely helps having a partner so that you can coerce each other in.
That's right.
Well, Michael, speaking of partners, want to tell us about something?
Let's talk about some changes, some PyPI changes.
These come to us from Brian Skinn.
Thank you, Brian, for shooting this over.
And it's a project by Bernát Gábor here.
And if we pull this up, it says, have you ever wondered when did your Python packages,
the packages in your environment that you have active or any given environment, how
old are they?
When were they last updated?
Is there a version of them that's out of date?
So I've been solving this by just forcing them to update, using pip-compile and the pip-tools stuff to just regenerate and reinstall the requirements files.
But this is a way to just ask the question, hey, what's the status?
And it wouldn't be an episode if we didn't somehow feature Will McGugan.
So this is based on Rich, of course.
So let's go check this thing out.
So over here, if we go to the homepage, we get, as all projects should, a nice animation here.
And if you look at it, you can just see: type pypi-changes and you specify the path to a Python in a virtual environment.
So in this example, it's like pypi-changes venv/bin/python.
It does some thinking on the internet,
caches some information about the packages,
and it says, all right,
you've got all these things installed.
They're this version.
Some of them, it'll just say,
this was updated 10 months ago
or a year and three days ago.
Others, it'll say it was updated a year ago,
but only six months on the latest version.
It says remote such and such.
That's the one you could install
if you were to update it.
So it's a real nice way to see, well, which ones are here that could be updated
or even also sometimes it's interesting to know like, oh, this library, it doesn't have an update,
but it's 10 years old. Maybe I should consider switching to a library that's a little more
maintained and making progress. Right. What do you all think? That's handy. Cool. Right.
It is pretty neat.
So yeah, I've been playing with this today. Installed it, checked it out. It even pointed out that, you know, since yesterday some things changed in one of my projects that I want to keep up to date, so I updated it.
Yeah, so I like it. It's got a nice command line interface. You basically specify the Python that is in the environment that you want to check. That could either be the main Python or a virtual environment Python. Like I said, you can control the caching because
the first time it runs, it has to go get lots of information about each package that's installed
and it's faster the second time. It also has some cool parallelism. So you can say number of jobs, like --jobs. And by default, it runs 10 downloads in parallel as it's pulling this information in, but I guess you could go crazy there. So anyway, I thought this was pretty cool. It's a nice little thing to have. So I pipx installed this. It's perfect for pipx because it doesn't need to be in the project it's testing; it just needs to be on your machine as a command. And then you point it at the environment, different environments, and it gives you reports on those environments.
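For a sense of what a tool like this does under the hood, here's a rough sketch using PyPI's public JSON API. This only illustrates the kind of lookup pypi-changes automates; it is not the tool's actual code:

```python
import json
from urllib.request import urlopen

def latest_release(name: str):
    """Ask PyPI's JSON API for a package's newest version and its upload time."""
    with urlopen(f"https://pypi.org/pypi/{name}/json") as resp:
        data = json.load(resp)
    version = data["info"]["version"]
    # Each release maps to a list of uploaded files; take the first file's timestamp.
    # (Real code would handle releases that have no files.)
    uploaded = data["releases"][version][0]["upload_time"]
    return version, uploaded

print(latest_release("rich"))  # -> (latest version string, upload timestamp)
```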
Yeah, I love pipx too.
One of the things I want to note, since I know a lot of package maintainers: I mean, it's worth checking things out if it's a really old package, if it hasn't been updated for a long time. But some things are pure Python packages that just do a little tiny thing and don't need updating very often.
So yeah, it's not necessarily a bad thing that it's not updated, but it's an indicator
of something.
Yes, exactly.
Let's see out in the live audience.
Anthony Lister, hey, says, can the changes be exported to a text file?
I haven't seen anything about that other than just, you know, piping it into a text
file and who knows what happens with all the color in there, but, uh, perhaps. Yeah. Renee, what do you do to manage
your dependencies and all those kinds of things? Um, well at work we started using Docker for that.
So we have a centralized Docker container that everyone on my team uses and we make sure we have the same setup in there. So I'm not the one that directly manages it, but that's the solution that we've gone toward to make sure we're all on the same page.
How interesting. So you've got the Docker environment that has some version of Python
set up with all the libraries you need pre-installed, and then you just use that to
run and that way you know it's the same. Yep. And then it's also nice because when we
kind of move some of our projects into production, we can include that Docker container with it.
So it will have whatever version it had at the time.
So if for some reason it's not compatible with some later version we upgraded to, it still lives out there with the version of the tools that it had until we have a chance to update everything.
One of the challenges that people have sometimes is they say, even though you've got some kind of version management, pyproject.toml or requirements.txt or whatever,
that doesn't necessarily mean that people actually installed the latest.
So you could still be out of sync, right?
So having the image that's constantly the same, constantly in sync,
that's kind of a way to force it.
I also want to give a quick shout out to this project, pipdeptree.
Remember, Brian, we spoke about that before.
Yeah.
Just pretty cool.
And what it'll show you is the things you've directly installed versus the things that happen to be installed. So if we go back to this animation here, you can see that it's got Flask, which is 2.0.2. But then it's got MarkupSafe, it's got itsdangerous. Like, nobody installed itsdangerous; that's a thing that was installed because of Flask. And so, for example, when I look at my environments, there were some things that were out of date, but they were out of date because they were pinned requirements of other things that I actually wanted to install. So, for example, docopt and some other things are pinned to lower versions and I can't really update those, but they'll show up as outdated. So you might pair this with some pipdeptree to see which ones you're in control of and which ones are just kind of out there.
That's pretty cool.
That's that one.
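For a rough sense of the idea behind that pairing, here's a minimal sketch that separates directly installed packages from transitive dependencies, using the standard library plus the packaging library. It illustrates the concept only; it is not how pipdeptree is actually implemented:

```python
from importlib import metadata
from packaging.requirements import Requirement

# Packages that no other installed package depends on are, roughly,
# the ones you installed directly. (Ignores extras and markers for brevity.)
installed = {dist.metadata["Name"].lower() for dist in metadata.distributions()}
required_by_others = {
    Requirement(req).name.lower()
    for dist in metadata.distributions()
    for req in (dist.requires or [])
}

top_level = sorted(installed - required_by_others)
print(top_level)  # shows e.g. 'flask' but not 'itsdangerous' or 'markupsafe'
```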
Well, you got it for us. Well, this is interesting. There's a discussion about a possible change to
future Python. Again, this is just stuff that people are discussing. It's nothing that's
even decided on, but it's an idea of late-bound arguments for Python, or rather, late-bound argument defaults, for functions.
So here's the idea. We know that if you assign the default value for a function argument, that is bound at definition time, when Python first goes and reads it. That seems fine, but there's a weird thing about the namespace there. So what happens is, if you have a variable foo, for instance, or a value foo, the value expression can be looked up in the defining area. So the namespace where the function is defined. It's a little specific, but it causes some weirdness. It's not the namespace of the function; it's the namespace surrounding the function.
The problem with that really is that, for instance, if we wanted to do something like a bisect function: you give it an array and maybe an x value for the middle or something, and we also have a high and a low. We know the low index would be zero as a default, but what the high should be is the length of the array. And you can't do that, because you can't reference the array as a default value.
So that's kind of what this discussion is about
is trying to figure out a way
to possibly have an optional late binding of those values.
And in this specific case, it'd be very helpful to be able to late bind that value
at the time that the function's called, not at the time that it's defined.
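To make that concrete, here's the workaround people use today, with one of the floated late-bound spellings shown in a comment. The syntax is still being debated, so the commented line is a possibility, not a final design:

```python
# Today: you can't write hi=len(a) as a default, so you use a sentinel
# and resolve the "real" default at call time.
def bisect(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a)  # computed when the function is called, not when defined
    # ... do the actual bisection ...

# Under the proposal, something like this (spelling undecided):
# def bisect(a, x, lo=0, hi=>len(a)): ...
```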
Right, so you want to take the first parameter and use it to set the default value of the subsequent parameter.
Yeah, to say the length of the array is the default for high, or something. And it was, who was it, Chris Angelico that suggested this. And in the discussion, even Guido said, I'm not really opposed to it, let's explore it a little bit. So Chris is trying to do a proof of concept.
There is some question about what the syntax should be.
So Chris suggested equals-colon (=:), so like the reverse of the walrus operator, because apparently that's available. Another suggestion was equals-greater-than (=>), to kind of look like an arrow, but we already have the dash arrow (->) to mean something else. So it's up in the air on the syntax. But anyway, one of the things I wanted to comment on is, in the article we're linking to, it says: at first blush, Angelico's idea to fix this wart in Python
seems fairly straightforward, but the discussion has shown that there are multiple facets to
consider. And it's always tricky to add complexity to the language. So the people in the steering
council will think about it, right? Under consideration. Okay. Renee,
what do you think about this? I'm going to be honest. It's going over my head a little bit.
I don't consider myself like a real software developer. So I usually use Python for, you know,
standard data science type of scripts. I'm trying to sit here thinking of a use case for this that
I would use and not coming up with one. So yeah, I'm with you on that one as well.
It doesn't mean it's a bad idea necessarily. Well, one use case would be to be able to set an empty list as a default value. You can't do that now, because the list is bound once; all calls to the function will get whatever previous calls left in it. And that's a weirdness in Python, but we could probably fix that with this. Yeah. Yeah. That's what I was thinking as well,
is if you pass a mutable value as the default, then you're asking for trouble, right? Because if it gets changed anywhere, then every subsequent call gets those changes applied to it. So that
seems useful. This like sort of flowing one parameter into the next,
I'm not sure it's worth the complexity.
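Here is the classic gotcha being described, along with the usual None-sentinel workaround. This is standard Python behavior today, independent of the proposal:

```python
def append_bad(item, items=[]):   # the default list is created once, at definition time
    items.append(item)
    return items

print(append_bad(1))  # [1]
print(append_bad(2))  # [1, 2] -- the "empty" default remembers earlier calls

def append_good(item, items=None):  # the common workaround today
    if items is None:
        items = []                  # a fresh list on every call
    items.append(item)
    return items

print(append_good(1))  # [1]
print(append_good(2))  # [2]
```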
So Renee, what I wanted to ask you was,
as somebody who doesn't dive deep into the low levels of the language
and compiler, parsing, all that kind of stuff,
which is totally fine, that's like 99% of the people,
how do you feel about these kinds of new features
coming along?
Are you like, oh, geez, now I gotta learn the walrus operator, I gotta learn pattern matching. I was fine. And
now I got to deal with this code. What is this? Or do you see it as like, oh, awesome. Here's new
stuff. Yeah. I mean, I guess it depends how much it really impacts my day-to-day work. If it's
something that it's not impacting something I use frequently, or it's kind of abstracted away from
me or optional, then, you know, go ahead.
But if it's something that some, you know, some features they roll out clearly have a
wide ranging impact and you have to go update everything.
So I'm not great at keeping up with that, which is one reason that, you know, of course,
you have to be so careful when you update to a new version.
But, you know, I guess that's why people listen to podcasts like this.
So, you know, it's potentially coming. So you're aware when it does come out, you're on top of it. But I don't have strong
opinions. And what we worry about a lot in data science is the packages, right? So not the base
Python, but the packages are constantly changing and the dependencies and the versions. So that
does end up affecting us when it follows through to that level.
Yeah, my concern is around teaching Python, because every new syntax thing you put in
makes it something that you potentially have to teach somebody.
And maybe you don't have to teach newbies this, but they'll see it in code.
So they should be able to understand what it is. But on the other hand, things like, you can do really crazy comprehensions, list comprehensions and stuff, but you don't have to, and most of the ones I see are fairly simple ones. So I don't think we should nix something just because it can be complicated. Anyway.
Yeah, cool. Indeed. Yeah, good one. All right, Renee, you got the next one.
All right. So speaking of data science packages, a lot of us use pandas. So I wrote a book, which I'll come back to later, called SQL for Data Scientists. And since I wrote that, some people that have been learning data science in school or on the job haven't always used SQL, or they use it as kind of a separate process from their Python. So they started asking me, how do you use SQL alongside Python? So this is kind of beginner level, but also something that's just very useful. In the pandas package, there's a read_sql function. And so you can read a SQL query. It runs the query. It's kind of a wrapper
around some other functions. It will run
the query and return the data set into your data frame. And so basically you're just running a
query and the results become the data frame right in your notebook. So let's see some of my notes
on here. So you can save your SQL as a text file. So you don't have to have the string in your actual
notebook, which is sometimes useful.
And then once you've got it into that pandas data frame, that's where a lot of people do their data cleaning and feature engineering and everything like that. So you could just pull
in the raw data from SQL and do a lot of the data engineering there. Sometimes I do feature
engineering in SQL and then pull it in. So that's kind of up to each user. But you really just set up the connection to your database using a package like SQLAlchemy. So you have a connection to the database and you
pass your SQL string either directly or from the file and the database connection, and it returns
a pandas data frame. So I'm happy to talk a little bit more about, you know, how I use this at work.
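A minimal sketch of the pattern being described, reading a query from a .sql text file into a DataFrame. The connection string and file name are hypothetical placeholders:

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection details; swap in your own database URL.
engine = create_engine("postgresql://user:password@localhost:5432/warehouse")

# Keeping the SQL in a text file keeps the notebook uncluttered.
query = Path("enrollment_features.sql").read_text()

# read_sql runs the query and returns the result set as a DataFrame.
df = pd.read_sql(query, engine)
print(df.head())
```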
Yeah, I think this is really good. You know, one of the things to do with pandas is there's just so many of these little functions
that solve whole problems. You know, it's like, oh, you could go and use requests to download some HTML, and you could use Beautiful Soup to do some selectors, and you could get some stuff and parse
out some HTML, and then you could get some table information out and then convert that into a data frame. Or you could just say read_html, take the table at bracket zero or whatever, and boom, you have it. Knowing about these, I think, is really interesting. So it's cool that you highlighted this one. Actually, just as a side note, literally in like an hour, so probably before this show ships, we'll ship this episode I did with Bex, which is about 25 pandas functions you didn't even know existed.
And what's interesting is like,
this one wasn't even on the list.
So good.
So I'll highlight another one now you know exists.
That's pretty cool.
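As a quick illustration of that read_html shortcut, here's a sketch. The URL is just an example, and read_html needs an HTML parser such as lxml installed:

```python
import pandas as pd

# read_html fetches the page and parses every <table> into a DataFrame,
# with no requests/BeautifulSoup plumbing required.
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)")
df = tables[0]  # "bracket zero": the first table on the page
print(df.head())
```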
Let's see a couple of comments from the audience.
Sam says, Pandas is so amazing.
I always find out too late
that it has all
of these IO functions. And then we have Paul says, do you have any recommendations on tutorials for
how to create good SQL alchemy selectables? This always feels like the scariest bit.
I don't have any of that on hand. I'll try to find something later or I'll ask my Twitter
following and see what they recommend. I don't have a good list of tutorials for that one. I can talk about,
yeah, by selectable, he said he means connectable. So yeah, I don't have a tutorial for that.
There's a lot of documentation and I know that SQL alchemy can be a little mysterious sometimes.
Maybe that's why it's alchemy. But yeah, I will try to share that later on Twitter.
Yeah. Fantastic. All right. And Paul says, read clipboard is pretty great.
Yeah. Yeah. Very cool. Bunch of different things there.
Yeah. So if you want me to walk through an example of how I use this at work, I'm happy.
Yeah. Give us a quick example. Yeah. So at HelioCampus, one thing we do is we connect to a lot of
different databases at universities. So the universities will have separate databases for admissions, enrollment, financial aid. Those are all separate systems. And so we pull all that data into a data warehouse. And in SQL, we can combine that data, build some extracts that we're all using the same way. And so we can either use this to just read one of
those tables directly, or we can combine what I typically do is do a little bit of cleanup and
feature engineering and narrow down my data set to the population that I want to run through my
model in SQL, and then just pull those final results. And now I've got my data set with at
least preliminary features. I might do some standardization and things like that in pandas, but I've got a pretty clean
subset of the data that I need
right into my Jupyter notebook.
Oh, that's fantastic.
Pretty great.
Yeah, I think definitely understanding SQL
is an important skill for data scientists.
And it's slightly different than for, say,
like a web API developer, right?
Absolutely.
That's why I wrote the book.
That's awesome. Yeah, for sure. So on the API side, you kind of get something set up. You're
very likely using an ORM like SQL Alchemy and you just connect it and go. And once you get it set,
like you can kind of forget about it and just program against it. As a data scientist,
you're exploring. You don't totally know, right? You're kind of out there testing and
digging into stuff and sorting and filtering. And yeah, I would say you probably need a better fluency with SQL as a data scientist.
Absolutely.
Than as a web developer, because you need to know what you need and why, and which fields you need.
Now you could just do it yourself
or add a field if you need it.
You can do more sophisticated things like window functions.
So yeah, I think knowing SQL is really a value add
if you're looking to become a data scientist
and putting yourself out there on the market.
If you can do the whole pipeline end to end,
it definitely makes you stand out.
I would think so.
All right.
One thing to wrap up on this.
Sam asks, can you configure SQL Alchemy to dump the raw queries that it runs?
Yes.
Yeah.
In this case, you have the raw query in your function call.
So I'm not actually using SQL Alchemy for that because I'm providing a query.
You just got like a select statement, right?
The problem with SQLAlchemy and data science is that the structure of your models has to exactly match the structure of the data, and often I imagine you're just kind of dealing with loose data, and it doesn't make sense to take the time to model it in classes. But for SQLAlchemy, you can just set echo equal to true when you create the engine, and then everything that would get sent across to the database gets echoed, like the DDL or SQL or whatever it does.
So yes.
Cool.
For sure.
Yeah.
All right.
Brian, want to tell us about our sponsor?
Yeah, let's.
I am pleased and happy to say that Shortcut is sponsoring the episode.
So thank you, Shortcut, formerly Clubhouse, for sponsoring the episode.
There are a lot of project management tools out there, but most suffer from common problems. Like it's too simple for an engineering team to use
on several projects, or it's too complex and it's hard to get started. And there's tons of options.
And some of them are great for managers, but bad for engineers. And some are great for engineers
and bad for managers. Shortcut is different. It's built for software teams and based
on making workflows super easy. For example, keyboard-friendly user interface. The UI is
intuitive for mouse lovers, of course, but the activities that you use every day can be set to
keyboard shortcuts if they aren't already. Just learn them and you'll start working faster. It's
awesome. Tight VCS integration, so you can update tasks and progress with a commit or a PR.
That's sweet.
And iteration planning is a breeze.
I like that there are burndown and cycle time charts built in.
They just are set up already for you when you start using this.
So it's a pretty clean system.
Give it a try at shortcut.com/pythonbytes.
Yeah, absolutely. Thanks shortcut for sponsoring this episode. Now, what have we got next here?
Pyjion. I want to talk about Pyjion. So, uh, we already talked about Will McGugan and Rich. So it's time to talk about Anthony Shaw, so that we can complete the shout-outs we always seem to give over on the podcast. So I want to talk about Pyjion because I just interviewed Anthony Shaw. More importantly, he just released Pyjion as 1.0.
So Pidgin is a drop-in JIT compiler for Python 3.10. Let me say that again, a JIT compiler for
Python. And there've been other speed up type of attempts where people will like fork CPython and they'll
do something inside of it to make it different.
Think Cinder.
There've been attempts to create a totally different but compatible one like PyPy, P-Y-P-Y.
And that's, they've worked pretty well, but they always have some sort of incompatibility
or something.
It would be nice if just the Python you ran could be compiled to go faster if you want it to be.
So that's what this is.
It uses a PEP whose number I forgot
that allows you to plug in something
that inspects the method frames before they get executed.
And then instead of just interpreting that code,
the bytecode as Python bytecode,
it'll actually compile it to machine instructions,
first to .NET intermediate instructions, intermediate language, and then that gets compiled to machine instructions
that then run directly.
Works on Linux, macOS, Windows, x64, and ARM64.
So this is a pretty cool development.
This is pretty cool.
Yeah, so if we go over here and check it out, in order to use it, it has some requirements.
You just pip install pyjion.
That's it.
That's crazy, right?
And then it has to be on 3.10.
It can't be older than that.
And you have to have .NET 6 installed.
Okay.
So that just got released.
It's a good chance you don't have .NET 6 installed.
But then once you set it up right, you can just say import pyjion and pyjion.enable() at the startup of your app,
and then it will look at all the methods
and JIT compile them.
So if you come down here,
like he has an example of a half function
that Anthony put up here.
And when it first loads, it's not JIT compiled.
But after that, you can go and say,
if you run it, you can say disassemble this thing, and it'll show you basically the assembly instructions of what would otherwise have been Python code.
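A minimal sketch of that flow, following the usage shown in Pyjion's README (Python 3.10 plus the .NET 6 runtime assumed):

```python
import pyjion
import pyjion.dis

pyjion.enable()  # from here on, frames get JIT-compiled as they execute

def half(x):
    return x / 2

half(10)              # run it once so the JIT has compiled it
pyjion.dis.dis(half)  # dump the compiled instructions for the function
```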
Wow.
It's wild, right?
So it's a little bit like Numba.
It's a tiny bit like Cython, in the sense that it takes Python code, translates it into something else that then can be interoperated with, and then makes it go fast.
So this is all well and good. If you're going to use it on the web, by default it would be just fine. Except if you're hosting it, normally you host it in this supervisor process and then
a bunch of forked-off processes. So there's a WSGI app configuration thing you can do as well.
Somewhere in the docs, I'm not seeing it right now, but you basically allow it to push the pigeon changes
on down into the worker processes,
which is pretty cool.
And it has a bunch of comparisons against PyPy, Piston, Numba, IronPython, Nuitka, and so on.
Now it's not that much faster.
It is faster when you're doing more
like data science-y things, I believe,
than if you're doing just like a query against a database where you're mostly just waiting anyway.
But still, I think this is promising and it's really pretty early days. So the thing to look
at is whether there are optimizations coming along here. Somewhere in the docs, Anthony lists out the
various optimizations he's put in so far. And really, it just needs more
optimizations to make it faster still, which is pretty neat. I think that's pretty cool.
One of the things, my first reaction was, oh, it's .NET only, so I have to use it on Windows.
But that hasn't been that way for a long time. So .NET runs on just about everything.
Yeah, exactly. It supports all the different frameworks. There's even this thing called live.trypyjion.com where you can write Python code, like over here on the left, and then you can say compile it, and it will actually show you the assembly that it would compile to. And then here's the .NET intermediate language. I guess maybe they should switch the order here. Like, first it goes to IL and then it goes to machine instructions through the JIT, but it shows
you all the stuff that it's, it's doing to make this work. And you could even see at the bottom,
there's like sort of a visual understanding of what it's doing. One of the things that's really
cool that it does is, imagine you've got a math problem up here. Like you're saying x equals y times y plus z times z, or something like that. Each one of those steps generates an intermediate number. So for example, z times z would generate, by default, a Python number. And then so would y times y, and then the addition, and finally the assignment. What it'll do is it'll say, okay, if those are two floats, let's just store that as a C float in the intermediate computation. And so it can sort of
stay lower level as it's doing a lot of computational type of things. So there's a
bunch of interesting optimizations. People can check this out. I haven't had a chance to try it
yet. I was hoping to, but haven't gotten there yet. Yeah. Really interesting conversation you had with him too. And it's interesting timing to just get him to jump on this.
Like right after he wrote the book on Python internals,
CPython internals, to jump into this.
Well, I guess he's working on it before, but still.
Yeah, you definitely got to know CPython internals to do this.
Renee, do you guys do anything to optimize your code
with like Numba or Cython or anything like that?
Or are you just running straight Python and letting the libraries deal with it?
Yeah, not currently.
We have a pretty good server and are working with relatively small data sets, you know, not millions of rows, for example.
So for right now, we haven't gone in this direction at all. I can imagine this would also be really useful if you were a computer science student and
trying to understand what's going on under the hood when you run these things.
So it's interesting. For the people that aren't seeing the visual, you kind of have three columns here with the code side by side, to kind of get a peek under the hood at what's going on there.
But no, this isn't something I've used personally.
Yeah, yeah.
I haven't used it either. But like I said, I would like to. I think it's got the ability to just plug in and make things faster. And really, it is faster to some degree. Sometimes I think it's slower, sometimes faster, but the more optimizations the JIT compiler gets, the better it could be.
Right. So like, if it could inline function calls rather than calling them. Or there's things like, if it sees you allocating a list and putting stuff into it, it can skip some intermediate steps and just straight allocate that. Or if you're accessing elements by index out of the list, it can just do pointer operations instead of going through the Python APIs. There's a lot of hard work that Anthony's put into this, and I think it's pretty
cool. Yeah, I haven't tried it. I would like to. Yeah. Cool. Indeed. All right, Brian, what do you
got for us? Well, actually, before I jump to the next topic, I wanted to shoehorn something into this last conversation. Brett Cannon just wrote an interesting article called Selecting a Programming Language Can Be a Form of Premature Optimization. And this is relevant to the conversation because of the steps he lays out. He says, if you think Python might be too slow, trying another implementation like Pyjion is like step three.
So first, prototype in Python. Then optimize your data structures and algorithms, and also, you know, profile it. And then try another implementation before you abandon Python altogether. And then you can do some language bindings to connect to C if you need to, or Rust. But I think it ties in as, like, when would I choose Pyjion or PyPy over CPython?
Well, it's step three, just to let people know.
Step three, got it.
Step three.
I wanted to do something more lighthearted, like use print for debugging.
So I love this article.
I am guilty of this.
Of course, I use debuggers and logging systems as well, but I also throw print statements in there sometimes, and I'm not ashamed to say it. So Adam Johnson wrote Tips for Debugging with print(), and there were a couple that stood out to me that I really wanted to mention because I use them a lot, even with logging. One is to use debug variables with
f-strings and the equal sign.
So this is brilliant.
It's been in since 3.8.
Instead of typing, like, print, "widget equals" in a string, and then the widget number or something, you can just use the f-string and the equal sign. And it doesn't just interpolate; it prints the expression and its value for you.
So it's nice.
I like that.
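Here's what that tip looks like in practice, plain Python 3.8+; the last line previews the next tip:

```python
widget_count = 3

# The "=" specifier prints the expression itself along with its value.
print(f"{widget_count=}")          # widget_count=3
print(f"{widget_count * 2 = }")    # widget_count * 2 = 6

# An emoji makes an important debug line easy to spot while scrolling.
print(f"🔥 {widget_count=}")
```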
The next one is, I love this.
Use emojis.
I never thought to do this.
This is brilliant.
Throw emojis in your print statements so they pop out when you're debugging.
Have you ever used emojis?
I started using emojis in comments.
Oh, okay.
Comments, nice.
Yeah, so different emojis mean different things. For me, I was doing some API stuff. So like, this is the read-only method here of an API, so I'll put a certain emoji up there. And then this is the one that changes data. And here's one that returns a list versus a single thing. So I'll put a whole bunch of those emojis and stuff.
Yeah.
Well, I mean, I used to do like a whole bunch of plus signs because they're easy to see,
but an emoji would be way easier to see.
So way more fun, man.
Yeah.
Yeah.
I do this as well.
I print all the time for debugging, especially in Jupyter notebooks, because you don't always
have the most sophisticated debugging tools in there, but being able to print and see
what's going on as you go through each step of the notebook. And emojis are a great idea for that because it's so visual as you're scrolling through. They're showing there the X and the check mark emoji; I like those for my little to-do lists and the comments that I leave. Yeah. I thought so. I've done that as well. That's cool. Chris May in the audience just put a, you know, a heart sign, smiley face emoji as a response to this.
Last thing, he's got like seven tips.
The last tip I wanted to talk about was using rich, and specifically rich.print, or pprint. So for pprint, you have to do from pprint import pprint, unless you want to say it twice with pprint.pprint. pprint stands for pretty printing, and the gist of this really is that data structures print horribly by default. If you just print, like, a dict or a set or something, it looks gross, but rich and pretty print make it look nice. So if you're printing those while debugging, use that. There's also exception handling stuff in there for it; there's a lot of that kind of debugging stuff in rich.
Yeah, printing exceptions is great with that.
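A quick sketch of the difference. pprint is in the standard library; rich has to be pip-installed:

```python
from pprint import pprint
from rich import print as rprint  # rich's drop-in replacement for print

data = {"name": "widget", "tags": ["a", "b"], "nested": {"x": 1, "y": [2, 3]}}

print(data)   # one long, hard-to-scan line
pprint(data)  # indented, structured output
rprint(data)  # colorized, nicely formatted output via Rich
```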
I also wanted to say, one of the places where I use printing a lot for debugging is to print out what I expect is going on when I'm writing a test function. So I'll often print out the flow, what's going on. The reason I do that is, for pytest, if a test fails, pytest dumps the standard out. So it'll dump all of your print statements from the failed procedure. So that's either the code under test or the test itself. If there's print statements, it gets dumped out.
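A tiny illustration of that behavior. Run it with pytest and the print output appears in the failure report (output from passing tests stays hidden by default):

```python
# test_flow.py
def test_flow():
    total = 0
    for step in (1, 2, 3):
        total += step
        print(f"after step {step}: total={total}")  # shown because the test fails
    assert total == 7  # deliberately wrong: pytest dumps the captured stdout above
```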
So that's helpful. Yeah. Nice. That's great. I love it. I use print statements a lot. My output
is very verbose. You can see right in order what's happening. Sometimes a debugger helps,
but sometimes it's time to just print. Yeah. Speaking of visual stuff, what's your last one here, Renee?
Yeah, so in our line of work with data science,
especially when you're providing the models as tools
for end users that aren't the data scientists themselves,
explainability is really important.
So being able to explain why a certain prediction
got the value it did, what the different inputs are,
we're always working to make
that more transparent for our end users. In our case, for example, we might be predicting which
students might be at risk of not retaining at the university, so not being enrolled a year later.
So what are the different factors, both overall for the whole population that are correlated with
not being enrolled for a year.
And for each individual student, what might be particular factors that, at least from the model's perspective, puts them at higher risk. So this package is called SHAP, and that stands for
Shapley Additive Explanations. It was brought to my attention by my team member, Brian Richards,
at HelioCampus. And now we use it very frequently because it has really good visualizations.
So these Shapley values, apparently they're from game theory.
I won't pretend to understand the details of how they're generated.
But you could think of it as like a model on top of your model.
So it's additive and all the different features.
If you see the visualization here, it's showing kind of a little waterfall chart.
So some of the values that, think of a particular row that you're running through your algorithm.
Some of the values in that row are going to make the, if you're doing a classification
model, some values might make you more likely to be in one class.
Some might make you more likely to be in the other class.
So you have these visuals of kind of the push and pull of each value. In this visual, we're seeing age is pushing a number to the right, sex is pushing it to the left, and I guess BP, that looks like blood pressure, and BMI.
So it's got this like waterfall type of chart. And what it's actually doing is it's comparing,
it starts with the expected value for the whole population, and then it's showing you where, for this particular record, each of the input values is kind of nudging that eventual prediction in one direction or the other.
So it's just nice visually to have those waterfall plots and to see which features are negatively or positively correlated with the end result. And you could also do some cool scatter plots with this. So you can do the input value versus the SHAP value and have a point for every item in your population.
So in our example, that would be students. So we can have a scatter plot of all the students and something like the number of cumulative credits that they have as of that term. And so you'll see the gradient from low credits to high credits.
It's not usually linear. What are those kind of breakpoints? And at what point are the values
positively impactful to likelihood to retain or in the opposite direction. Of course, I'm glad that they put in this
documentation, they have a whole section on basically correlation is not causation.
And we're constantly having to talk to our end users about that. But if we say a student that
lives in a certain town is potentially more likely to retain, maybe because of distance from campus, or maybe you have traditionally recruited a lot of students from that town. It doesn't mean that if you force
someone to move to that town, they're more likely to stay at your institution, right?
So yeah, correlation is not causation. And I'm going to switch over here to the visual, to something called a beeswarm plot, which you can output right in your Jupyter notebook, which is really handy when you develop a new model. And I'll try to describe this for people
who are listening to the audio. It has along the y-axis a list of features. So you've got,
in this example on their website, age, relationship, capital gain, marital status.
And then you see a bunch of dots going across horizontally.
And there's areas where there's little clusters of dots. So what this is showing is the x-axis
is the SHAP value. So what this SHAP package outputs. So you can see visually across what
is the spread of the impact of this value. So if each dot is a person in this case, you see people all the way to
the right, whatever their age was positively impacted their eventual score. People all the
way to the left, their age negatively impacted the score. And then each dot is a color that ranges
from blue to red. So the blue ones are people with low age and the red ones are people with a high age.
So you can see here that basically the higher the age, the more positive their eventual prediction.
So just an interesting way to get both like a feature importance and see the distribution of the values within each feature.
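Here's a minimal sketch of producing these plots with the shap package, following its documented API and using its bundled census example data. The hosts' actual student-retention model and features are not shown here:

```python
import shap
import xgboost

# Example data shipped with shap (the census "adult" dataset from its docs).
X, y = shap.datasets.adult()
model = xgboost.XGBClassifier().fit(X, y)

explainer = shap.Explainer(model, X)
shap_values = explainer(X)

shap.plots.waterfall(shap_values[0])  # push and pull of each feature for one row
shap.plots.beeswarm(shap_values)      # per-feature distribution of SHAP values
shap.plots.scatter(shap_values[:, "Age"], color=shap_values)  # input value vs. impact
```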
So it's just really helpful when you're doing predictive models, both for evaluating your own model and then eventually explaining it to end users. So would a wider spread mean that the
feature is more useful or does it have any? Yeah, especially if you can see a split in the numbers
there. So you see in this example relationship, you've got all the red ones to the right and all
the blue ones to the left. That means that there's a clear relationship from this relationship field with
the target variable. So there's a clear split where the low values are on one side and the
high values are on the other side. And then, yeah, the spread means that if there's not a good
example here, but sometimes you'll see like two clumps, two beeswarms spread apart with nothing in the middle.
So that's when you have a really clear spread of the high impact group versus the low impact
group.
And if it's more narrow, that's less of an important variable.
So you see, if you look at the one that's sorted by max, here we go, absolute value of the SHAP value.
The ones near the bottom for the population have less impact.
Now, there might be one person in there where that particular value was like the deciding
factor of which class they ended up in.
But for the population as a whole, there's less differentiation across these values than
across the ones near the top of the list.
Yeah, this is cool because that visualization of models is very tricky, right? And it's something
like knowing why you got an answer. Now this looks very helpful.
Yeah, it's really useful. And the visuals are so pretty by default, but then you can also pull
those values into other tools. So, for example, for each feature in each row, you get a SHAP value.
So you can write those back to the database and then pull those values into another tool.
We use it in Tableau to highlight for each student what those really important features are, either making them, well, not making them, it's not causation, but correlated with them being more likely to retain or less likely to retain.
So we might say, well, for this student, their GPA, that's the main factor.
Their GPA is really low.
Students with low GPAs tend not to retain.
And so when the end user is looking at all of their values in a table or some other kind
of view, you can use the SHAP value to highlight:
GPA is the one you need to hone in on.
The student is struggling academically. Try to help them get some help with
grades, for example. Cool. Yeah, this is a great find. Indeed, indeed. Brian, that brings us to
the extras. Extras. Got any extras for us? I do. I've got one that was just a quick one. Let's see.
Pull it up. Matthew Feickert mentioned on Twitter that pip index is a cool thing, and I kind of didn't know about it, so this is neat. So pip index, well, specifically pip index versions. pip index does a whole bunch of stuff, but pip index versions, if you give it a package, will tell you all the different versions that are available on PyPI, which one you have, and, you know, whether you're out of date and stuff. So for instance, if you're thinking about upgrading something and you don't know what to upgrade to, you can look to see what all is there, I guess.
Or you want to roll it back. You're like, oh, this version is not working. I want to go back to
a lower one. But if you're on 2.0, it's not 1.0. What is it, right? What do you go back to? And so
this will list all the available versions. Basically, this is a CLI version of the releases page on PyPI.org.
Right. And it's not obvious how to get to it on PyPI, but I know you can get to it.
You can see all the releases,
but by default, it doesn't show those.
So this is pretty quick.
Yeah, pretty neat.
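For reference, the command they're describing takes a package name. It's wrapped in Python here only to keep the examples in one language; normally you'd run it straight from your shell:

```python
import subprocess

# Equivalent to running `pip index versions rich` at the terminal.
# pip index is still marked experimental, so the output format may change.
subprocess.run(["python", "-m", "pip", "index", "versions", "rich"], check=True)
```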
Good one, good one.
Renee, how about you?
Some extras?
Sure.
I wanted to make sure to mention my book.
So just published, and just out in Europe this week, actually, in paperback, but it's been out since September in the US: SQL for Data Scientists, A Beginner's Guide for Building Datasets for Analysis.
So I mentioned earlier, you know, I wrote this book because I think a lot of students
coming out of data science programs or people who are coming from maybe a statistics background
that are in data science might not
have experience pulling the data. So in class, a lot of times you're given a clean spreadsheet to
start with when you're building your predictive model. Then you get to the job and you sit down
the first day and they say, great, build us a model. And you say, well, where's the data?
It's in a raw form in the database. So you have to build your own data set. So that's what the
purpose of the book is to kind of get you from that point of when you have access to raw data to exploring
and building your data set so that you can run it through your predictive model. So on the screen
there, you see my website, and for people on the audio, it's sqlfordatascientists.com. And you can go
to different chapters on the website and I have some example SQL and you can also run it.
So there's a SQLite database in the browser here.
And so you can actually copy and paste
some of the SQL on the page,
click execute and it shows up in a table down here.
You can edit it and rerun it.
So you get a little bit of practice
with using the database in the book.
Neat.
Yeah.
Cool book. And wow, SQLite in the browser.
Very neat.
Thank you.
Yeah.
Awesome.
That's a book that definitely should exist.
All right.
Really quickly, I'm going to do a webcast with Paul Everett.
I haven't seen Paul in the audience today.
Paul, where are you?
No, I'm not sure.
He might be working.
But on November 23rd, I'm going to be doing a webcast around PyCharm.
I've updated my PyCharm course
with all sorts of good stuff.
I haven't quite totally announced it yet
because there are a few things I'm waiting on,
slightly more stable versions to come out of JetBrains
to finish some of their data science tools, actually.
And then I'll talk more about it,
but we're doing a webcast in about a week or so.
So that should be a lot of fun.
And yeah, come check it out.
Watch Paul and me make the code go.
Two days before Thanksgiving.
Yes, indeed.
All right, that brings us to our joke.
I need a joke.
And the joke is a response
to something that you posted on Twitter.
Really appreciating my foresight, using "lots of stuff" as the git commit message.
Well, it actually confused me
because I did a git rebase main
and it said applying lots of stuff.
And I thought it was like a feature of git rebase
and it just happened to be my commit message.
Yeah, it's like, oh, git's gotten real casual.
It's lots of stuff.
So Francois Voron said,
time for a classic XKCD, link here. Yeah, yeah. And so this
is like the commit history throughout the project as you get farther into it. So it starts out with very formal, proper comments like "created main loop and timing control." The next commit is "enabled config file parsing," and then it moves on to "miscellaneous bug fixes," and then "code additions and edits," and then a branch, "more code," "here have code," just eight letter A's. It comes back with screaming. Exactly. Just A, D, K, F, J, S, L, K, just a bunch of home row. And then "my hands are typing words" and then just "hands." And the title is, as a project drags on, my git commit messages get less and less informative.
We've all been there, right?
Yeah.
Yes.
It happens to me with branch names too.
Because if I'm working on one feature and push part of it and then I go and I'm still working on it. I, I, I like to use a new branch name and I just, I, I can't, it's hard to come up with good branch
names for a feature. I'm branching. Exactly. I try, I try to be more formal on the branches,
at least so I know I can delete them later. And so when I'm working on projects that are mostly
just me, I'll, I'll create a GitHub issue and then create the branch name to be like a short version of the issue title and the issue number.
So then when I commit back in, I can just look at the branch name and put a hash plus that number, and it'll tag the commit on the issue in GitHub.
If I'm working with someone like a team, I might put like my name slash branch name.
And then actually in some of the tools like SourceTree,
you have like little expando widgets
around that on the branches.
So you can say, these are Michael's branches
and these are Renee's branches and so on.
Yeah, we got into the habit of doing that too.
It helps a lot to see right up front,
whose branch is this?
Yeah, it can get out of control, right?
All right, a quick couple of follow-ups, Brian.
Anthony says, the book looks great, Renee.
I'll check it out. Great, thank you. Chris May likes it, Brian. Anthony says, the book looks great, Renee. I'll check it out.
Great, thank you.
Chris May likes it as well.
It's a great book idea.
Glad you like it.
Especially when I keep working long after I should have gone home.
No, this is about the joke: this is me, especially after I keep working long after I should have gone home.
Yeah, absolutely.
And Sam: "oops, forgot to stage this" as a common commit message in my repositories.
Nice. Indeed. So, cool.
It was a fun episode. Thanks, Renee, for coming on.
Yeah, thanks for having me. It's fun. I don't get to dive into Python too often. I'm using the same type of things over and over, so it's nice to see what's new and what's on the horizon.
Awesome. Yep, thanks so much today. Thanks, Brian.
Bye.