Python Bytes - #259 That argument is a little late-bound
Episode Date: November 17, 2021. Topics covered in this episode: pypi-changes, late-bound argument defaults for Python, pandas.read_sql, pyjion, tips for debugging with print(), SHAP (and beeswarm plot), Extras, Joke. See the full show notes for this episode on the website at pythonbytes.fm/259
Transcript
Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
This is episode 259, recorded November 17th, 2021. And I'm Brian Okken.
I'm Michael Kennedy.
And I'm Renee Teate.
Well, thanks, Renee, for joining us today. Can you tell us a little bit about who you are?
I'm the Director of Data Science at HelioCampus. And a lot of people know me as Data Science Renee or Becoming Data Sci on Twitter.
So that's where a lot of people follow me. And then I started with, I had a podcast that's not
actively recording, but it's called Becoming a Data Scientist Podcast. So some people listening
probably know me from that as well. Cool. Yeah. Awesome. You were doing a bunch of cool stuff
there and any chances of maybe going back to podcasting?
It's definitely still open.
I've always told myself this is a pause, not a stop.
It's just an extended pause.
So yes, hopefully I will get back.
It's hard to keep going, isn't it?
I mean, life gets in the way and then you get busy.
I'm always so impressed with those of you that have hundreds of episodes, very consistently
recorded.
Well, Brian makes me show up every week.
Well, yeah, it definitely helps having a partner so that you can coerce each other in.
That's right.
Well, Michael, speaking of partners, want to tell us about something?
Let's talk about some changes, some PyPI changes.
These come to us from Brian Skinn.
Thank you, Brian, for shooting this over.
And it's a project by Bernát Gábor here.
And if we pull this up, it says, have you ever wondered when did your Python packages,
the packages in your environment that you have active or any given environment, how
old are they?
When were they last updated?
Is there a version of them that's out of date?
So I've been solving this by just forcing them to update, using pip-compile and the pip-tools stuff to just regenerate and reinstall the requirements files.
But this is a way to just ask the question, hey, what's the status?
And it wouldn't be an episode if we didn't somehow feature Will McGugan.
So this is based on Rich, of course.
So let's go check this thing out.
So over here, if we go to the homepage, we get, as all projects should, a nice animation here.
And if you look at it, you can just see: type pypi-changes and you specify the path to a Python in a virtual environment.
So in this example, it's like pypi-changes venv/bin/python.
It does some thinking on the internet,
caches some information about the packages,
and it says, all right,
you've got all these things installed.
They're this version.
Some of them, it'll just say,
this was updated 10 months ago
or a year and three days ago.
Others, it'll say it was updated a year ago,
but only six months on the latest version.
It says remote such and such.
That's the one you could install
if you were to update it.
So it's a real nice way to see, well, which ones are here that could be updated
or even also sometimes it's interesting to know like, oh, this library, it doesn't have an update,
but it's 10 years old. Maybe I should consider switching to a library that's a little more
maintained and making progress. Right. What do you all think? That's handy. Cool. Right.
It is pretty neat.
So yeah, I've been playing with this today. Installed it, checked it out. It even pointed out that, you know, since yesterday some things changed in one of my projects that I want to keep up to date, so I updated it.
Yeah, so I like it. It's got a nice command line interface. You basically specify the Python that is in the environment that you want to check. That could either be the main Python or a virtual environment Python. Like I said, you can control the caching because
the first time it runs, it has to go get lots of information about each package that's installed
and it's faster the second time. It also has some cool parallelism. So you can say number of jobs, like --jobs. And by default, it runs 10 downloads in parallel as it's pulling this information in, but I guess you could go crazy there. So anyway, I thought this was pretty cool. It's a nice little thing to have. So I pipx installed this. It's perfect for pipx because it doesn't need to be in the project it's testing; it just needs to be on your machine as a command. And then you point it at the environment, different environments, and it gives you reports on those environments.
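For a sense of what a tool like this does under the hood, here's a rough sketch using PyPI's public JSON API. This only illustrates the kind of lookup pypi-changes automates; it is not the tool's actual code:

```python
import json
from urllib.request import urlopen

def latest_release(name: str):
    """Ask PyPI's JSON API for a package's newest version and its upload time."""
    with urlopen(f"https://pypi.org/pypi/{name}/json") as resp:
        data = json.load(resp)
    version = data["info"]["version"]
    # Each release maps to a list of uploaded files; take the first file's timestamp.
    # (Real code would handle releases that have no files.)
    uploaded = data["releases"][version][0]["upload_time"]
    return version, uploaded

print(latest_release("rich"))  # -> (latest version string, upload timestamp)
```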
Yeah, I love pipx too.
One of the things I want to note, since I know a lot of package maintainers: I mean, it's worth checking things out if it's a really old package, if it hasn't been updated for a long time. But some things are pure Python packages that just do a little tiny thing and don't need updating very often.
So yeah, it's not necessarily a bad thing that it's not updated, but it's an indicator
of something.
Yes, exactly.
Let's see out in the live audience.
Anthony Lister, hey, says, can the changes be exported to a text file?
I haven't seen anything about that other than just, you know, piping it into a text
file and who knows what happens with all the color in there, but, uh, perhaps. Yeah. Renee, what do you do to manage
your dependencies and all those kinds of things? Um, well at work we started using Docker for that.
So we have a centralized Docker container that everyone on my team uses and we make sure we have the same setup in there. So I'm not the one that directly manages it, but that's the solution that we've gone toward to make sure we're all on the same page.
How interesting. So you've got the Docker environment that has some version of Python
set up with all the libraries you need pre-installed, and then you just use that to
run and that way you know it's the same. Yep. And then it's also nice because when we
kind of move some of our projects into production, we can include that Docker container with it.
So it will have whatever version it had at the time.
So if for some reason it's not compatible with some later version we upgraded to, it still lives out there with the version of the tools that it had until we have a chance to update everything.
One of the challenges that people have sometimes is they say, even though you've got some kind of version management, pyproject.toml or requirements.txt or whatever,
that doesn't necessarily mean that people actually installed the latest.
So you could still be out of sync, right?
So having the image that's constantly the same, constantly in sync,
that's kind of a way to force it.
I also want to give a quick shout out to this project, pipdeptree.
Remember, Brian, we spoke about that before.
Yeah.
Just pretty cool.
And what it'll show you is the things you've directly installed versus the things that happen to be installed. So if we go back to this animation here, you can see that it's got Flask, which is 2.0.2. But then it's got MarkupSafe, it's got itsdangerous. Like, nobody installed itsdangerous; that's a thing that was installed because of Flask. And so, for example, when I look at my environments, there were some things that were out of date, but they were out of date because they were pinned requirements of other things that I actually wanted to install. So, for example, docopt and some other things are pinned to lower versions and I can't really update those, but they'll show up as outdated. So you might pair this with some pipdeptree to see which ones you're in control of and which ones are just kind of out there.
That's pretty cool.
That's that one.
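For a rough sense of the idea behind that pairing, here's a minimal sketch that separates directly installed packages from transitive dependencies, using the standard library plus the packaging library. It illustrates the concept only; it is not how pipdeptree is actually implemented:

```python
from importlib import metadata
from packaging.requirements import Requirement

# Packages that no other installed package depends on are, roughly,
# the ones you installed directly. (Ignores extras and markers for brevity.)
installed = {dist.metadata["Name"].lower() for dist in metadata.distributions()}
required_by_others = {
    Requirement(req).name.lower()
    for dist in metadata.distributions()
    for req in (dist.requires or [])
}

top_level = sorted(installed - required_by_others)
print(top_level)  # shows e.g. 'flask' but not 'itsdangerous' or 'markupsafe'
```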
Well, you got it for us. Well, this is interesting. There's a discussion about a possible change to
future Python. Again, this is just stuff that people are discussing. It's nothing that's
even decided on, but it's an idea of late-bound arguments for Python, or rather, late-bound argument defaults, for functions.
So here's the idea. We know that if you assign the default value for a function argument, that is bound at definition time, when Python first goes and reads it. That seems fine, but there's a weird thing about the namespace there. So what happens is, if you have a variable foo, for instance, or a value foo, the value expression can be looked up in the defining area. So the namespace where the function is defined. It's a little specific, but it causes some weirdness. It's not the namespace of the function; it's the namespace surrounding the function.
The problem with that really is that, for instance, if we wanted to do something like a bisect function: you give it an array and maybe an x value for the middle or something, and we also have a high and a low. We know the low index would be zero as a default, but what the high should be is the length of the array. And you can't do that, because you can't reference the array as a default value.
So that's kind of what this discussion is about
is trying to figure out a way
to possibly have an optional late binding of those values.
And in this specific case, it'd be very helpful to be able to late bind that value
at the time that the function's called, not at the time that it's defined.
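To make that concrete, here's the workaround people use today, with one of the floated late-bound spellings shown in a comment. The syntax is still being debated, so the commented line is a possibility, not a final design:

```python
# Today: you can't write hi=len(a) as a default, so you use a sentinel
# and resolve the "real" default at call time.
def bisect(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a)  # computed when the function is called, not when defined
    # ... do the actual bisection ...

# Under the proposal, something like this (spelling undecided):
# def bisect(a, x, lo=0, hi=>len(a)): ...
```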
Right, so you want to take the first parameter and use it to set the default value of the subsequent parameter.
Yeah, to say the length of the array is the default for high, or something. And it was, who was it, Chris Angelico that suggested this. And in the discussion, even Guido said, I'm not really opposed to it, let's explore it a little bit. So Chris is trying to do a proof of concept.
There is some question about what the syntax should be.
So Chris suggested equals-colon (=:), so like the reverse of the walrus operator, because apparently that's available. Another suggestion was equals-greater-than (=>), to kind of look like an arrow, but we already have the dash arrow (->) to mean something else. So it's up in the air on the syntax. But anyway, one of the things I wanted to comment on is, in the article we're linking to, it says: at first blush, Angelico's idea to fix this wart in Python
seems fairly straightforward, but the discussion has shown that there are multiple facets to
consider. And it's always tricky to add complexity to the language. So the people in the steering
council will think about it, right? Under consideration. Okay. Renee,
what do you think about this? I'm going to be honest. It's going over my head a little bit.
I don't consider myself like a real software developer. So I usually use Python for, you know,
standard data science type of scripts. I'm trying to sit here thinking of a use case for this that
I would use and not coming up with one. So yeah, I'm with you on that one as well.
It doesn't mean it's a bad idea necessarily. Well, one use case would be to be able to set an empty list as a default value. You can't do that now, because the list is bound once; all calls to the function will get whatever previous calls left in it. And that's a weirdness in Python, but we could probably fix that with this. Yeah. Yeah. That's what I was thinking as well,
is if you pass a mutable value as the default, then you're asking for trouble, right? Because if it gets changed anywhere, then every subsequent call gets those changes applied to it. So that
seems useful. This like sort of flowing one parameter into the next,
I'm not sure it's worth the complexity.
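Here is the classic gotcha being described, along with the usual None-sentinel workaround. This is standard Python behavior today, independent of the proposal:

```python
def append_bad(item, items=[]):   # the default list is created once, at definition time
    items.append(item)
    return items

print(append_bad(1))  # [1]
print(append_bad(2))  # [1, 2] -- the "empty" default remembers earlier calls

def append_good(item, items=None):  # the common workaround today
    if items is None:
        items = []                  # a fresh list on every call
    items.append(item)
    return items

print(append_good(1))  # [1]
print(append_good(2))  # [2]
```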
So Renee, what I wanted to ask you was,
as somebody who doesn't dive deep into the low levels of the language
and compiler, parsing, all that kind of stuff,
which is totally fine, that's like 99% of the people,
how do you feel about these kinds of new features
coming along?
Are you like, oh, geez, now I gotta learn the walrus operator, I gotta learn pattern matching. I was fine. And
now I got to deal with this code. What is this? Or do you see it as like, oh, awesome. Here's new
stuff. Yeah. I mean, I guess it depends how much it really impacts my day-to-day work. If it's
something that it's not impacting something I use frequently, or it's kind of abstracted away from
me or optional, then, you know, go ahead.
But if it's something that some, you know, some features they roll out clearly have a
wide ranging impact and you have to go update everything.
So I'm not great at keeping up with that, which is one reason that, you know, of course,
you have to be so careful when you update to a new version.
But, you know, I guess that's why people listen to podcasts like this.
So, you know, it's potentially coming. So you're aware when it does come out, you're on top of it. But I don't have strong
opinions. And what we worry about a lot in data science is the packages, right? So not the base
Python, but the packages are constantly changing and the dependencies and the versions. So that
does end up affecting us when it follows through to that level.
Yeah, my concern is around teaching Python, because every new syntax thing you put in
makes it something that you potentially have to teach somebody.
And maybe you don't have to teach newbies this, but they'll see it in code.
So they should be able to understand what it is. But on the other hand, things like, you can do really crazy comprehensions, list comprehensions and stuff, but you don't have to, and most of the ones I see are fairly simple ones. So I don't think we should nix something just because it can be complicated. Anyway.
Yeah, cool. Indeed. Yeah, good one. All right, Renee, you got the next one.
All right. So speaking of data science packages, a lot of us use pandas. So I wrote a book, which I'll come back to later, called SQL for Data Scientists. And since I wrote that, some people that have been learning data science in school or on the job haven't always used SQL, or they use it as kind of a separate process from their Python. So they started asking me, how do you use SQL alongside Python? So this is kind of beginner level, but also something that's just very useful. In the pandas package, there's a read_sql function. And so you can read a SQL query. It runs the query. It's kind of a wrapper
around some other functions. It will run
the query and return the data set into your data frame. And so basically you're just running a
query and the results become the data frame right in your notebook. So let's see some of my notes
on here. So you can save your SQL as a text file. So you don't have to have the string in your actual
notebook, which is sometimes useful.
And then once you've got it into that pandas data frame, that's where a lot of people do their data cleaning and feature engineering and everything like that. So you could just pull
in the raw data from SQL and do a lot of the data engineering there. Sometimes I do feature
engineering in SQL and then pull it in. So that's kind of up to each user. But you really just set up the connection to your database using a package like SQLAlchemy. So you have a connection to the database and you
pass your SQL string either directly or from the file and the database connection, and it returns
a pandas data frame. So I'm happy to talk a little bit more about, you know, how I use this at work.
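A minimal sketch of the pattern being described, reading a query from a .sql text file into a DataFrame. The connection string and file name are hypothetical placeholders:

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection details; swap in your own database URL.
engine = create_engine("postgresql://user:password@localhost:5432/warehouse")

# Keeping the SQL in a text file keeps the notebook uncluttered.
query = Path("enrollment_features.sql").read_text()

# read_sql runs the query and returns the result set as a DataFrame.
df = pd.read_sql(query, engine)
print(df.head())
```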
Yeah, I think this is really good. You know, one of the things to do with pandas is there's just so many of these little functions
that solve whole problems. You know, it's like, oh, you could go and use requests to download some HTML, and you could use Beautiful Soup to do some selectors, and you could get some stuff and parse
out some HTML, and then you could get some table information out and then convert that into a data frame. Or you could just say read_html, take the table at bracket zero or whatever, and boom, you have it. Knowing about these, I think, is really interesting. So it's cool that you highlighted this one. Actually, just as a side note, literally in like an hour, so probably before this show ships, we'll ship this episode I did with Bex, which is about 25 pandas functions you didn't even know existed.
And what's interesting is like,
this one wasn't even on the list.
So good.
So I'll highlight another one now you know exists.
That's pretty cool.
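As a quick illustration of that read_html shortcut, here's a sketch. The URL is just an example, and read_html needs an HTML parser such as lxml installed:

```python
import pandas as pd

# read_html fetches the page and parses every <table> into a DataFrame,
# with no requests/BeautifulSoup plumbing required.
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)")
df = tables[0]  # "bracket zero": the first table on the page
print(df.head())
```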
Let's see a couple of comments from the audience.
Sam says, Pandas is so amazing.
I always find out too late
that it has all
of these IO functions. And then we have Paul says, do you have any recommendations on tutorials for
how to create good SQL alchemy selectables? This always feels like the scariest bit.
I don't have any of that on hand. I'll try to find something later or I'll ask my Twitter
following and see what they recommend. I don't have a good list of tutorials for that one. I can talk about,
yeah, by selectable, he said he means connectable. So yeah, I don't have a tutorial for that.
There's a lot of documentation and I know that SQL alchemy can be a little mysterious sometimes.
Maybe that's why it's alchemy. But yeah, I will try to share that later on Twitter.
Yeah. Fantastic. All right. And Paul says, read clipboard is pretty great.
Yeah. Yeah. Very cool. Bunch of different things there.
Yeah. So if you want me to walk through an example of how I use this at work, I'm happy.
Yeah. Give us a quick example. Yeah. So at HelioCampus, one thing we do is we connect to a lot of
different databases at universities. So the universities will have separate databases for admissions, enrollment, financial aid. Those are all separate systems. And so we pull all that data into a data warehouse. And in SQL, we can combine that data, build some extracts that we're all using the same way. And so we can either use this to just read one of
those tables directly, or we can combine what I typically do is do a little bit of cleanup and
feature engineering and narrow down my data set to the population that I want to run through my
model in SQL, and then just pull those final results. And now I've got my data set with at
least preliminary features. I might do some standardization and things like that in pandas, but I've got a pretty clean
subset of the data that I need
right into my Jupyter notebook.
Oh, that's fantastic.
Pretty great.
Yeah, I think definitely understanding SQL
is an important skill for data scientists.
And it's slightly different than for, say,
like a web API developer, right?
Absolutely.
That's why I wrote the book.
That's awesome. Yeah, for sure. So on the API side, you kind of get something set up. You're
very likely using an ORM like SQL Alchemy and you just connect it and go. And once you get it set,
like you can kind of forget about it and just program against it. As a data scientist,
you're exploring. You don't totally know, right? You're kind of out there testing and
digging into stuff and sorting and filtering. And yeah, I would say you probably need a better fluency with SQL as a data scientist.
Absolutely.
Than as a web developer, because you need to know what you need and why, and which fields you need.
Now you could just do it yourself
or add a field if you need it.
You can do more sophisticated things like window functions.
So yeah, I think knowing SQL is really a value add
if you're looking to become a data scientist
and putting yourself out there on the market.
If you can do the whole pipeline end to end,
it definitely makes you stand out.
I would think so.
All right.
One thing to wrap up on this.
Sam asks, can you configure SQL Alchemy to dump the raw queries that it runs?
Yes.
Yeah.
In this case, you have the raw query in your function call.
So I'm not actually using SQL Alchemy for that because I'm providing a query.
You just got like a select statement, right?
The problem with SQLAlchemy and data science is that the structure of your models has to exactly match the structure of the data, and often I imagine you're just kind of dealing with loose data, and it doesn't make sense to take the time to model it in classes. But for SQLAlchemy, you can just set echo equal to true when you create the engine, and then everything that would get sent across to the database gets echoed, like the DDL or SQL or whatever it does.
So yes.
Cool.
For sure.
Yeah.
All right.
Brian, want to tell us about our sponsor?
Yeah, let's.
I am pleased and happy to say that Shortcut is sponsoring the episode.
So thank you, Shortcut, formerly Clubhouse, for sponsoring the episode.
There are a lot of project management tools out there, but most suffer from common problems. Like it's too simple for an engineering team to use
on several projects, or it's too complex and it's hard to get started. And there's tons of options.
And some of them are great for managers, but bad for engineers. And some are great for engineers
and bad for managers. Shortcut is different. It's built for software teams and based
on making workflows super easy. For example, keyboard-friendly user interface. The UI is
intuitive for mouse lovers, of course, but the activities that you use every day can be set to
keyboard shortcuts if they aren't already. Just learn them and you'll start working faster. It's
awesome. Tight VCS integration, so you can update tasks and progress with a commit or a PR.
That's sweet.
And iteration planning is a breeze.
I like that there are burndown and cycle time charts built in.
They just are set up already for you when you start using this.
So it's a pretty clean system.
Give it a try at shortcut.com/pythonbytes.
Yeah, absolutely. Thanks shortcut for sponsoring this episode. Now, what have we got next here?
Pyjion. I want to talk about Pyjion. So, uh, we already talked about Will McGugan and Rich. So it's time to talk about Anthony Shaw, so that we can complete the shout-outs we always seem to give over on the podcast. So I want to talk about Pyjion because I just interviewed Anthony Shaw. More importantly, he just released Pyjion as 1.0.
So Pidgin is a drop-in JIT compiler for Python 3.10. Let me say that again, a JIT compiler for
Python. And there've been other speed up type of attempts where people will like fork CPython and they'll
do something inside of it to make it different.
Think Cinder.
There've been attempts to create a totally different but compatible one like PyPy, P-Y-P-Y.
And that's, they've worked pretty well, but they always have some sort of incompatibility
or something.
It would be nice if just the Python you ran could be compiled to go faster if you want it to be.
So that's what this is.
It uses a PEP whose number I forgot
that allows you to plug in something
that inspects the method frames before they get executed.
And then instead of just interpreting that code,
the bytecode as Python bytecode,
it'll actually compile it to machine instructions,
first to .NET intermediate instructions, intermediate language, and then that gets compiled to machine instructions
that then run directly.
Works on Linux, macOS, Windows, x64, and ARM64.
So this is a pretty cool development.
This is pretty cool.
Yeah, so if we go over here and check it out, in order to use it, it has some requirements.
You just pip install pyjion.
That's it.
That's crazy, right?
And then it has to be on 3.10.
It can't be older than that.
And you have to have .NET 6 installed.
Okay.
So that just got released.
It's a good chance you don't have .NET 6 installed.
But then once you set it up right, you can just say import pyjion and pyjion.enable() at the startup of your app,
and then it will look at all the methods
and JIT compile them.
So if you come down here,
like he has an example of a half function
that Anthony put up here.
And when it first loads, it's not JIT compiled.
But after that, you can go and say,
if you run it, you can say disassemble this thing, and it'll show you basically the assembly instructions of what would otherwise have been Python code.
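A minimal sketch of that flow, following the usage shown in Pyjion's README (Python 3.10 plus the .NET 6 runtime assumed):

```python
import pyjion
import pyjion.dis

pyjion.enable()  # from here on, frames get JIT-compiled as they execute

def half(x):
    return x / 2

half(10)              # run it once so the JIT has compiled it
pyjion.dis.dis(half)  # dump the compiled instructions for the function
```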
Wow.
It's wild, right?
So it's a little bit like Numba.
It's a tiny bit like Cython, in the sense that it takes Python code, translates it into something else that then can be interoperated with, and then makes it go fast.
So this is all well and good. If you're going to use it on the web, by default it would be just fine. Except if you're hosting it, normally you host it in this supervisor process and then
a bunch of forked-off processes. So there's a WSGI app configuration thing you can do as well.
Somewhere in the docs, I'm not seeing it right now, but you basically allow it to push the pigeon changes
on down into the worker processes,
which is pretty cool.
And it has a bunch of comparisons against PyPy, Piston, Numba, IronPython, Nuitka, and so on.
Now it's not that much faster.
It is faster when you're doing more
like data science-y things, I believe,
than if you're doing just like a query against a database where you're mostly just waiting anyway.
But still, I think this is promising and it's really pretty early days. So the thing to look
at is whether there are optimizations coming along here. Somewhere in the docs, Anthony lists out the
various optimizations he's put in so far. And really, it just needs more
optimizations to make it faster still, which is pretty neat. I think that's pretty cool.
One of the things, my first reaction was, oh, it's .NET only, so I have to use it on Windows.
But that hasn't been that way for a long time. So .NET runs on just about everything.
Yeah, exactly. It supports all the different frameworks. There's even this thing called live.trypyjion.com where you can write Python code, like over here on the left, and then you can say compile it, and it will actually show you the assembly that it would compile to. And then here's the .NET intermediate language. I guess maybe they should switch the order here. Like, first it goes to IL and then it goes to machine instructions through the JIT, but it shows
you all the stuff that it's, it's doing to make this work. And you could even see at the bottom,
there's like sort of a visual understanding of what it's doing. One of the things that's really
cool that it does is, imagine you've got a math problem up here. Like you're saying x equals y times y plus z times z, or something like that. Each one of those steps generates an intermediate number. So for example, z times z would generate, by default, a Python number. And then so would y times y, and then the addition, and finally the assignment. What it'll do is it'll say, okay, if those are two floats, let's just store that as a C float in the intermediate computation. And so it can sort of
stay lower level as it's doing a lot of computational type of things. So there's a
bunch of interesting optimizations. People can check this out. I haven't had a chance to try it
yet. I was hoping to, but haven't gotten there yet. Yeah. Really interesting conversation you had with him too. And it's interesting timing to just get him to jump on this.
Like right after he wrote the book on Python internals,
CPython internals, to jump into this.
Well, I guess he's working on it before, but still.
Yeah, you definitely got to know CPython internals to do this.
Renee, do you guys do anything to optimize your code
with like Numba or Cython or anything like that?
Or are you just running straight Python and letting the libraries deal with it?
Yeah, not currently.
We have a pretty good server and are working with relatively small data sets, you know, not millions of rows, for example.
So for right now, we haven't gone in this direction at all. I can imagine this would also be really useful if you were a computer science student and
trying to understand what's going on under the hood when you run these things.
So it's interesting. For the people that aren't seeing the visual, you kind of have three columns here with the code side by side, to kind of get a peek under the hood at what's going on there.
But no, this isn't something I've used personally.
Yeah, yeah.
I haven't used it either. But like I said, I would like to. I think it's got the ability to just plug in and make things faster. And really, it is faster to some degree. Sometimes I think it's slower, sometimes faster, but the more optimizations the JIT compiler gets, the better it could be.
Right. So like, if it could inline function calls rather than calling them. Or there's things like, if it sees you allocating a list and putting stuff into it, it can skip some intermediate steps and just straight allocate that. Or if you're accessing elements by index out of the list, it can just do pointer operations instead of going through the Python APIs. There's a lot of hard work that Anthony's put into this, and I think it's pretty
cool. Yeah, I haven't tried it. I would like to. Yeah. Cool. Indeed. All right, Brian, what do you
got for us? Well, actually, before I jump to the next topic, I wanted to shoehorn something into this last conversation. Brett Cannon just wrote an interesting article called Selecting a Programming Language Can Be a Form of Premature Optimization. And this is relevant to the conversation because of the steps he lays out. He says, if you think Python might be too slow, trying another implementation like Pyjion is like step three.
So first, prototype in Python. Then optimize your data structures and algorithms, and also, you know, profile it. And then try another implementation before you abandon Python altogether. And then you can do some language bindings to connect to C if you need to, or Rust. But I think it ties in as, like, when would I choose Pyjion or PyPy over CPython?
Well, it's step three, just to let people know.
Step three, got it.
Step three.
I wanted to do something more lighthearted, like use print for debugging.
So I love this article.
I am guilty of this.
Of course, I use debuggers and logging systems as well, but I also throw print statements in there sometimes, and I'm not ashamed to say it. So Adam Johnson wrote Tips for Debugging with print(), and there were a couple that stood out to me that I really wanted to mention because I use them a lot, even with logging. One is to use debug variables with
f-strings and the equal sign.
So this is brilliant.
It's been in since 3.8.
Instead of typing, like, print, "widget equals" in a string, and then the widget number or something, you can just use the f-string and the equal sign. And it doesn't just interpolate; it prints the expression and its value for you.
So it's nice.
I like that.
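Here's what that tip looks like in practice, plain Python 3.8+; the last line previews the next tip:

```python
widget_count = 3

# The "=" specifier prints the expression itself along with its value.
print(f"{widget_count=}")          # widget_count=3
print(f"{widget_count * 2 = }")    # widget_count * 2 = 6

# An emoji makes an important debug line easy to spot while scrolling.
print(f"🔥 {widget_count=}")
```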
The next one is, I love this.
Use emojis.
I never thought to do this.
This is brilliant.
Throw emojis in your print statements so they pop out when you're debugging.
Have you ever used emojis?
I started using emojis in comments.
Oh, okay.
Comments, nice.
Yeah, so different emojis mean different things. For me, I was doing some API stuff. So like, this is the read-only method here of an API, so I'll put a certain emoji up there. And then this is the one that changes data. And here's one that returns a list versus a single thing. So I'll put a whole bunch of those emojis and stuff.
Yeah.
Well, I mean, I used to do like a whole bunch of plus signs because they're easy to see,
but an emoji would be way easier to see.
So way more fun, man.
Yeah.
Yeah.
I do this as well.
I print all the time for debugging, especially in Jupyter notebooks, because you don't always
have the most sophisticated debugging tools in there, but being able to print and see
what's going on as you go through each step of the notebook. And emojis are a great idea for that because it's so visual as you're scrolling through. They're showing there the X and the check mark emoji; I like those for my little to-do lists and the comments that I leave. Yeah. I thought so. I've done that as well. That's cool. Chris May in the audience just put a, you know, a heart sign, smiley face emoji as a response to this.
Last thing, he's got like seven tips.
The last tip I wanted to talk about was using rich, and specifically rich.print, or pprint. So for pprint, you have to do from pprint import pprint, unless you want to say it twice with pprint.pprint. pprint stands for pretty printing, and the gist of this really is that data structures print horribly by default. If you just print, like, a dict or a set or something, it looks gross, but rich and pretty print make it look nice. So if you're printing those while debugging, use that. There's also exception handling stuff in there for it; there's a lot of that kind of debugging stuff in rich.
Yeah, printing exceptions is great with that.
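A quick sketch of the difference. pprint is in the standard library; rich has to be pip-installed:

```python
from pprint import pprint
from rich import print as rprint  # rich's drop-in replacement for print

data = {"name": "widget", "tags": ["a", "b"], "nested": {"x": 1, "y": [2, 3]}}

print(data)   # one long, hard-to-scan line
pprint(data)  # indented, structured output
rprint(data)  # colorized, nicely formatted output via Rich
```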
I also wanted to say, one of the places where I use printing a lot for debugging is to print out what I expect is going on when I'm writing a test function. So I'll often print out the flow, what's going on. The reason I do that is, for pytest, if a test fails, pytest dumps the standard out. So it'll dump all of your print statements from the failed procedure. So that's either the code under test or the test itself. If there's print statements, it gets dumped out.
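A tiny illustration of that behavior. Run it with pytest and the print output appears in the failure report (output from passing tests stays hidden by default):

```python
# test_flow.py
def test_flow():
    total = 0
    for step in (1, 2, 3):
        total += step
        print(f"after step {step}: total={total}")  # shown because the test fails
    assert total == 7  # deliberately wrong: pytest dumps the captured stdout above
```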
So that's helpful. Yeah. Nice. That's great. I love it. I use print statements a lot. My output
is very verbose. You can see right in order what's happening. Sometimes a debugger helps,
but sometimes it's time to just print. Yeah. Speaking of visual stuff, what's your last one here, Renee?
Yeah, so in our line of work with data science,
especially when you're providing the models as tools
for end users that aren't the data scientists themselves,
explainability is really important.
So being able to explain why a certain prediction
got the value it did, what the different inputs are,
we're always working to make
that more transparent for our end users. In our case, for example, we might be predicting which
students might be at risk of not retaining at the university, so not being enrolled a year later.
So what are the different factors, both overall for the whole population that are correlated with
not being enrolled for a year.
And for each individual student, what might be particular factors that, at least from the model's perspective, puts them at higher risk. So this package is called SHAP, and that stands for
Shapley Additive Explanations. It was brought to my attention by my team member, Brian Richards,
at HelioCampus. And now we use it very frequently because it has really good visualizations.
So these Shapley values, apparently they're from game theory.
I won't pretend to understand the details of how they're generated.
But you could think of it as like a model on top of your model.
So it's additive and all the different features.
If you see the visualization here, it's showing kind of a little waterfall chart.
So some of the values that, think of a particular row that you're running through your algorithm.
Some of the values in that row are going to make the, if you're doing a classification
model, some values might make you more likely to be in one class.
Some might make you more likely to be in the other class.
So you have these visuals of kind of the push and pull of each value. In this visual, we're seeing age is pushing a number to the right, sex is pushing it to the left, and I guess BP, that looks like blood pressure, and BMI.
So it's got this like waterfall type of chart. And what it's actually doing is it's comparing,
it starts with the expected value for the whole population, and then it's showing you where, for this particular record, each of the input values is kind of nudging that eventual prediction in one direction or the other.
So it's just nice visually to have those waterfall plots and to see which features are negatively or positively correlated with the end result. And you could also do some cool scatter plots with this. So you can do the input value versus the SHAP value and have a point for every item in your population.
So in our example, that would be students. So we can have a scatter plot of all the students and something like the number of cumulative credits that they have as of that term. And so you'll see the gradient from low credits to high credits.
It's not usually linear. What are those kind of breakpoints? And at what point are the values
positively impactful to likelihood to retain or in the opposite direction. Of course, I'm glad that they put in this
documentation, they have a whole section on basically correlation is not causation.
And we're constantly having to talk to our end users about that. But if we say a student that
lives in a certain town is potentially more likely to retain, maybe because of distance from campus, or maybe you have traditionally recruited a lot of students from that town. It doesn't mean that if you force
someone to move to that town, they're more likely to stay at your institution, right?
So yeah, correlation is not causation. And I'm going to switch over here to the visual, to something called a beeswarm plot, which you can output right in your Jupyter notebook, which is really handy when you develop a new model. And I'll try to describe this for people
who are listening to the audio. It has along the y-axis a list of features. So you've got,
in this example on their website, age, relationship, capital gain, marital status.
And then you see a bunch of dots going across horizontally.
And there's areas where there's little clusters of dots. So what this is showing is the x-axis
is the SHAP value. So what this SHAP package outputs. So you can see visually across what
is the spread of the impact of this value. So if each dot is a person in this case, you see people all the way to
the right, whatever their age was positively impacted their eventual score. People all the
way to the left, their age negatively impacted the score. And then each dot is a color that ranges
from blue to red. So the blue ones are people with low age and the red ones are people with a high age.
So you can see here that basically the higher the age, the more positive their eventual prediction.
So just an interesting way to get both like a feature importance and see the distribution of the values within each feature.
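Here's a minimal sketch of producing these plots with the shap package, following its documented API and using its bundled census example data. The hosts' actual student-retention model and features are not shown here:

```python
import shap
import xgboost

# Example data shipped with shap (the census "adult" dataset from its docs).
X, y = shap.datasets.adult()
model = xgboost.XGBClassifier().fit(X, y)

explainer = shap.Explainer(model, X)
shap_values = explainer(X)

shap.plots.waterfall(shap_values[0])  # push and pull of each feature for one row
shap.plots.beeswarm(shap_values)      # per-feature distribution of SHAP values
shap.plots.scatter(shap_values[:, "Age"], color=shap_values)  # input value vs. impact
```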
So it's just really helpful when you're doing predictive models, both for evaluating your own model and then eventually explaining it to end users. So would a wider spread mean that the
feature is more useful or does it have any? Yeah, especially if you can see a split in the numbers
there. So you see in this example relationship, you've got all the red ones to the right and all
the blue ones to the left. That means that there's a clear relationship from this relationship field with
the target variable. So there's a clear split where the low values are on one side and the
high values are on the other side. And then, yeah, the spread means that if there's not a good
example here, but sometimes you'll see like two clumps, two beeswarms spread apart with nothing in the middle.
So that's when you have a really clear spread of the high impact group versus the low impact
group.
And if it's more narrow, that's less of an important variable.
So you see, if you look at the one that's sorted by max, here we go, absolute value of the SHAP value.
The ones near the bottom for the population have less impact.
Now, there might be one person in there where that particular value was like the deciding
factor of which class they ended up in.
But for the population as a whole, there's less differentiation across these values than
across the ones near the top of the list.
Yeah, this is cool because that visualization of models is very tricky, right? And it's something
like knowing why you got an answer. Now this looks very helpful.
Yeah, it's really useful. And the visuals are so pretty by default, but then you can also pull
those values into other tools. So, for example, for each feature in each row, you get a SHAP value.
So you can write those back to the database and then pull those values into another tool.
We use it in Tableau to highlight for each student what those really important features are, either making them, well, not making them, it's not causation, but correlated with them being more likely to retain or less likely to retain.
So we might say, well, for this student, their GPA, that's the main factor.
Their GPA is really low.
Students with low GPAs tend not to retain.
And so when the end user is looking at all of their values in a table or some other kind
of view, you can use the SHAP value to highlight:
GPA is the one you need to hone in on.
The student is struggling academically. Try to help them get some help with
grades, for example. Cool. Yeah, this is a great find. Indeed, indeed. Brian, that brings us to
the extras. Extras. Got any extras for us? I do. I've got one that was just a quick one. Let's see.
Pull it up. Matthew Feickert mentioned on Twitter that pip index is a cool thing, and I kind of didn't know about it, so this is neat. So pip index, well, specifically pip index versions. pip index does a whole bunch of stuff, but pip index versions, if you give it a package, will tell you all the different versions that are available on PyPI, which one you have, and, you know, whether you're out of date and stuff. So for instance, if you're thinking about upgrading something and you don't know what to upgrade to, you can look to see what all is there, I guess.
Or you want to roll it back. You're like, oh, this version is not working. I want to go back to
a lower one. But if you're on 2.0, it's not 1.0. What is it, right? What do you go back to? And so
this will list all the available versions. Basically, this is a CLI version of the releases page on PyPI.org.
Right. And it's not obvious how to get to it on PyPI, but I know you can get to it.
You can see all the releases,
but by default, it doesn't show those.
So this is pretty quick.
Yeah, pretty neat.
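For reference, the command they're describing takes a package name. It's wrapped in Python here only to keep the examples in one language; normally you'd run it straight from your shell:

```python
import subprocess

# Equivalent to running `pip index versions rich` at the terminal.
# pip index is still marked experimental, so the output format may change.
subprocess.run(["python", "-m", "pip", "index", "versions", "rich"], check=True)
```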
Good one, good one.
Renee, how about you?
Some extras?
Sure.
I wanted to make sure to mention my book.
So just published, and just out in Europe this week, actually, in paperback, but it's been out since September in the US: SQL for Data Scientists, A Beginner's Guide for Building Datasets for Analysis.
So I mentioned earlier, you know, I wrote this book because I think a lot of students
coming out of data science programs or people who are coming from maybe a statistics background
that are in data science might not
have experience pulling the data. So in class, a lot of times you're given a clean spreadsheet to
start with when you're building your predictive model. Then you get to the job and you sit down
the first day and they say, great, build us a model. And you say, well, where's the data?
It's in a raw form in the database. So you have to build your own data set. So that's what the
purpose of the book is to kind of get you from that point of when you have access to raw data to exploring
and building your data set so that you can run it through your predictive model. So on the screen
there, you see my website, and for people on the audio, it's sqlfordatascientists.com. And you can go
to different chapters on the website and I have some example SQL and you can also run it.
So there's a SQLite database in the browser here.
And so you can actually copy and paste
some of the SQL on the page,
click execute and it shows up in a table down here.
You can edit it and rerun it.
So you get a little bit of practice
with using the database in the book.
Neat.
Yeah.
Cool book. And wow, SQLite in the browser.
Very neat.
Thank you.
Yeah.
Awesome.
That's a book that definitely should exist.
All right.
Really quickly, I'm going to do a webcast with Paul Everett.
I haven't seen Paul in the audience today.
Paul, where are you?
No, I'm not sure.
He might be working.
But on November 23rd, I'm going to be doing a webcast around PyCharm.
I've updated my PyCharm course
with all sorts of good stuff.
I haven't quite totally announced it yet
because there are a few things I'm waiting on,
slightly more stable versions to come out of JetBrains
to finish some of their data science tools, actually.
And then I'll talk more about it,
but we're doing a webcast in about a week or so.
So that should be a lot of fun.
And yeah, come check it out.
Watch Paul and me make the code go.
Two days before Thanksgiving.
Yes, indeed.
All right, that brings us to our joke.
I need a joke.
And the joke is a response
to something that you posted on Twitter.
Really appreciating my foresight, using "lots of stuff" as the git commit message.
Well, it actually confused me
because I did a git rebase main
and it said applying lots of stuff.
And I thought it was like a feature of git rebase
and it just happened to be my commit message.
Yeah, it's like, oh, git's gotten real casual.
It's lots of stuff.
So Francois Voron said,
time for a classic XKCD, link here. Yeah, yeah. And so this
is like the commit history throughout the project as you get farther into it. So it starts out with very formal, proper comments like "created main loop and timing control." The next commit is "enabled config file parsing," and then it moves on to "miscellaneous bug fixes," and then "code additions and edits," and then a branch, "more code," "here have code," just eight letter A's. It comes back with screaming. Exactly. Just A, D, K, F, J, S, L, K, just a bunch of home row. And then "my hands are typing words" and then just "hands." And the title is, as a project drags on, my git commit messages get less and less informative.
We've all been there, right?
Yeah.
Yes.
It happens to me with branch names too.
Because if I'm working on one feature and push part of it and then I go and I'm still working on it. I, I, I like to use a new branch name and I just, I, I can't, it's hard to come up with good branch
names for a feature. I'm branching. Exactly. I try, I try to be more formal on the branches,
at least so I know I can delete them later. And so when I'm working on projects that are mostly
just me, I'll, I'll create a GitHub issue and then create the branch name to be like a short version of the issue title and the issue number.
So then when I commit back in, I can just look at the branch name and put a hash plus that number, and it'll tag the commit on the issue in GitHub.
If I'm working with someone like a team, I might put like my name slash branch name.
And then actually in some of the tools like SourceTree,
you have like little expando widgets
around that on the branches.
So you can say, these are Michael's branches
and these are Renee's branches and so on.
Yeah, we got into the habit of doing that too.
It helps a lot to see right up front,
whose branch is this?
Yeah, it can get out of control, right?
All right, a quick couple of follow-ups, Brian.
Anthony says, the book looks great, Renee.
I'll check it out. Great, thank you. Chris May likes it, Brian. Anthony says, the book looks great, Renee. I'll check it out.
Great, thank you.
Chris May likes it as well.
It's a great book idea.
Glad you like it.
Especially when I keep working long after I should have gone home.
No, this is about the joke: this is me, especially after I keep working long after I should have gone home.
Yeah, absolutely.
And Sam: "oops, forgot to stage this" as a common commit message in my repositories.
Nice. Indeed. So, cool.
It was a fun episode. Thanks, Renee, for coming on.
Yeah, thanks for having me. It's fun. I don't get to dive into Python too often. I'm using the same type of things over and over, so it's nice to see what's new and what's on the horizon.
Awesome. Yep, thanks so much today. Thanks, Brian.
Bye.