Python Bytes - #238 A cloud-based file system for Python and a new GUI!
Episode Date: June 15, 2021
Topics covered in this episode: Practical SQL for Data Analysis; Git Blame in your Python Tracebacks; fsspec: a unified file system library; The need for slimmer containers; PandasGUI: A GUI for analyzing Pandas DataFrames; xarray: pandas-like API for labeled N-dimensional data; Extras; Joke
See the full show notes for this episode on the website at pythonbytes.fm/238
Transcript
Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
This is episode 238, recorded June 15th, 2021.
I'm Michael Kennedy.
And I'm Brian Okken.
And I'm Julia Signell.
Hey, Julia. Thanks for coming on the show.
Yeah, thanks for having me.
Yeah, it's great. Why don't you tell folks a bit about yourself?
Yeah, so I'm the head of open source at Saturn Cloud and a maintainer of Dask.
So I split my time half and half.
I spend half my time just doing regular like maintenance-y stuff on Dask
and then half my time doing like engineering and product management on Saturn Cloud.
Saturn Cloud is a data science platform that really specializes in distributed Dask clusters in Jupyter
and making it really easy for people to get up and going with those things on AWS.
Yeah, Dask is really interesting.
You know, when I first heard about it,
I thought, okay, this is like a grid computing scale-out thing,
which I probably don't have a lot of use for.
But then I was speaking with Matthew Rocklin about it,
and it has a lot of applicability,
even if you have not huge data, huge clusters, right?
Like you can say, even on your local machine, scale this out across my cores or, you know,
allow me to work with more data than will fit in RAM on my laptop and stuff like that,
right?
It's a cool idea.
Yeah.
Yeah.
It has like a whole number of different ways of interacting with it, right?
Like there's that, there's like, just make this thing go faster by parallelizing it.
There's all the data framey stuff.
There's all the array stuff for more dimensional data.
So it's got a, it's got a large API.
Yeah.
Cool.
And we're going to touch on a couple of topics that are not all that unrelated to those things
here.
And so, yeah.
Speaking of data science, Brian, you want to kick us off?
Sure.
Yeah.
The first thing I want to cover is an article called Practical SQL for Data Analysis.
This is by Haki Benita.
So one of the things I liked about this is that the first bit of the article talks about how, with data science, you've got pandas and NumPy and stuff.
And you also often you're dealing with a database
and SQL on the back end.
So the first part of the article talks about
how some things you can do both in pandas
and in SQL, like SQL queries, it's faster in SQL.
So there's a big chunk that's just talking about how that's faster.
But then he also talks about just basically there's a lot of benefits to the flexibility
and the comfortableness you can have with pandas, though.
So there are trade-offs as to where you draw the line: you can push too far into SQL, or you can have a nice split. But then he goes through and talks about a whole bunch of
great examples of different things like pivot tables and roll-ups and and choices and different
things you can do with either pandas or sql and really what his recommendations are for whether
it should be in in pandas or in SQL query, and then how
to do those queries.
Because, I mean, really the gist of the article is, and this problem space is people are comfortable
with pandas, but they don't really understand SQL queries.
So this sort of good cheat sheet for how to do the queries is, I think, really kind of
a cool thing.
Yeah.
I think it's really neat.
And you have these problems, you know how to solve them in one or the other.
And I think this compare and contrast is really valuable, right?
Like, I know how to take the mean of some column in SQL,
but I haven't done it in pandas yet.
Let's go see how to do that.
Or I'm really good at doing pivot tables in pandas,
but boy, I always kind of avoided joins in SQL. They scared me. And then how does that even translate? Right. I think that back and
forth is really valuable. Yeah. Yep. And then it covers things that I don't even know what they
are, like aggregate expressions. I don't even know what that is, but apparently that's a thing
that people do. I can help you out at aggregate stuff. No, just kidding. Julia, what do you think
of this? Yeah, no, it's really cool. Like I agree that having it in pandas
and then in SQL, that comparison is super helpful. Like SQL is always super scary to me. And I always
end up like Googling a bunch of stuff whenever I have to mangle my SQL. But I know it's so fast,
so it's cool to see a way to access that. Yeah, absolutely. This is a good one, Brian. I
think a lot of people will find it useful. I also want to just give a quick shout out to the past a
little bit. Not too long ago, we talked about an efficient SQL on pandas with DuckDB, where you
actually do the SQL queries against pandas data frames. So if you're finding that you're trying
to do something, and maybe it would be better in SQL, but you don't want to say completely switch
all your data over to a relational database.
You just kind of want to stay in the Panda side,
but there's that one or two things.
Like this is really cool.
This sort of upgrade your data frame
to execute SQL with the DuckDB query optimizer
is also a kind of a nice intermediary there.
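The compare-and-contrast idea discussed above (the same aggregate written once in SQL and once in pandas) can be sketched as follows. This is a minimal illustration, not code from the article: the `sales` table, its columns, and the sample rows are made up, and the standard-library `sqlite3` module stands in for whatever database (or DuckDB) you would actually use.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data; table and column names are invented for illustration.
rows = [("north", 10.0), ("north", 30.0), ("south", 5.0), ("south", 15.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# SQL: the database computes the aggregate and returns one row per region.
sql_means = dict(
    conn.execute(
        "SELECT region, AVG(amount) FROM sales GROUP BY region"
    ).fetchall()
)

# pandas: load every row into a DataFrame, then aggregate in memory.
df = pd.DataFrame(rows, columns=["region", "amount"])
pandas_means = df.groupby("region")["amount"].mean().to_dict()

print(sql_means)
print(pandas_means)
```

Both paths give the same per-region means; the difference the article cares about is where the work happens and how much data has to travel to Python first.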
Yeah, Dask also does some, I'm going to try not to make everything about Dask, but Dask does
some things that are kind of, that kind of take some of the ideas from this article of
like doing predicate pushdown of like, of pushing down some of the like filters into
the read because it evaluates lazily.
It doesn't have to like grab all the data greedily up front.
It can like do that later.
So you can get some of the benefits.
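The lazy-evaluation point can be illustrated without Dask at all. The sketch below is plain Python, not Dask's implementation: `read_rows` and the CSV content are hypothetical, and the only idea it demonstrates is that a filter handed to the reader is applied while reading, so non-matching rows are never collected.

```python
import csv
import io

# Hypothetical CSV content standing in for a large file on disk.
RAW = "city,temp\nOslo,3\nCairo,31\nLima,19\nDelhi,35\n"


def read_rows(source, predicate=None):
    """Yield parsed rows lazily, applying the filter while reading."""
    for row in csv.DictReader(source):
        row["temp"] = int(row["temp"])
        if predicate is None or predicate(row):
            yield row


# Eager: materialize every row first, then filter afterwards.
all_rows = list(read_rows(io.StringIO(RAW)))
eager_hot = [r["city"] for r in all_rows if r["temp"] > 30]

# Lazy with the filter pushed into the read: nothing is parsed until the
# generator is consumed, and non-matching rows are never kept.
lazy_hot = read_rows(io.StringIO(RAW), predicate=lambda r: r["temp"] > 30)
hot_cities = [r["city"] for r in lazy_hot]

print(hot_cities)
```

With a columnar format and a real engine, the same idea lets the reader skip whole chunks of a file instead of single rows.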
That's cool.
And it can also distribute the filter bit,
I guess at that point.
Yeah.
Nice.
All right.
I want to talk about the usual suspects.
So, okay.
That was, that was a pretty good show.
Was that Quentin Tarantino or something like that?
It's not actually about this.
This comes to us from Ruslan Portnoy.
And thank you for sending this in. Mentioned an article that has this really interesting idea.
How do you apply git blame when you encounter a Python traceback? So here's the scenario. Your
code crashes and you either print out the traceback or Python does it for you because it's just
crashed. And normally it says, here's the
value. Here's the line of code. Here's the file it's in. Here's the next line in the call stack.
Here's a line of code it's in. The idea is you can take git blame, which is a command that says,
show me who changed this line of code or who wrote this line of code, at least touched it last
on every single line of code. And I love this
whole idea of like, all right, who did this? And sometimes I'll come across code. I'm like, this is
so crappy. Like who did this? Oh, wait, that's me. Okay. Well, at least I know what I would feel
about it. But the idea is what if your trace back on each line where it had an exception could also
show who wrote that line of code. Cool, huh? Yeah, so let's check it out. It's pretty straightforward.
This is an article by Ofer Koren, and it basically uses two libraries that are themselves both
pretty straightforward. So like, here's a straightforward example of a traceback,
like trying to pop something off of an empty list. It says on this line in the function pop sum,
you know, there's this line here in the call stack.
And then the next line, this line in the call stack, and eventually raise a value error, you know, empty range, can't pop nothing off, you know, something off of nothing, basically.
But this doesn't show you any information about like maybe who wrote that line and who wrote this other line up here, right?
So what they did is they took a couple of modules, traceback and linecache.
And it turns out when traceback shows you this traceback, it uses linecache to figure out, okay, from this actual, I'm guessing, bytecode that it's going to run, this CPython interpreter code, what line of the file did this actually come from, right? Yeah, so here's the
the insight or the thing you can actually change what's in the cache and because it's a cache
once it's figured out what the lines are it's not going to read it again so it's like um like a list
for each line that you get back and you can just change the value so it said okay well here's like
return random.
That's what the line of text was. They're like, no, no, no, there's nothing to see here. Move
along. If you make that and then you cause it to crash again, what comes out is if you go a little
bit further down, normal code, normal code or normal trace back, normal trace back. Then it
just, instead of the line of code, it says nothing to see here. Please move along. All right. So what
are you going to do with that now that you realize like you can actually change
what appears in the traceback?
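The cache trick described here can be tried in a few lines. This is a sketch under assumptions, not the article's code: the file name `demo_module.py` and its source are hypothetical and are registered in `linecache.cache` by hand so that no file on disk is needed.

```python
import linecache
import traceback

# Hypothetical module name and source, registered by hand so the sketch
# needs no file on disk.
FILENAME = "demo_module.py"
SOURCE = "def pop_some(items):\n    return items.pop()\n"

# A cache entry is (size, mtime, lines, fullname); an mtime of None makes
# linecache.checkcache() leave the entry alone instead of checking the disk.
linecache.cache[FILENAME] = (len(SOURCE), None, SOURCE.splitlines(True), FILENAME)

namespace = {}
exec(compile(SOURCE, FILENAME, "exec"), namespace)


def last_frame_text():
    """Trigger the error and return the source text shown for the last frame."""
    try:
        namespace["pop_some"]([])
    except IndexError as exc:
        return traceback.extract_tb(exc.__traceback__)[-1].line


before = last_frame_text()

# Overwrite line 2 in the cached list; later tracebacks read this text.
linecache.cache[FILENAME][2][1] = "    nothing to see here, please move along\n"
after = last_frame_text()

print(before)
print(after)
```

The code that runs is unchanged; only the text that the traceback machinery looks up for display is different.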
So you write a little regular expression to go and execute git blame on the various files
and then to re-inject that back into linecache.
And so what they do is they just put, if they know the blame, they just put, you know, like
80 lines, 80 characters, up to 80 characters
of the line, and then edited on such and such and such date by such and such person. And here's the
commit message, right? And so just basically shelling out to git blame when it crashes. Now
you get some really cool stuff. Like on this line, it says this is edited by, you know, many,
many days ago by so-and-so in this Git commit and so on.
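A rough sketch of the "shell out to git blame and append the result" step follows. It is not the article's implementation: instead of a regular expression it parses the key/value lines of `git blame --line-porcelain`, and the helper names `parse_porcelain`, `blame_line`, and `annotate` are hypothetical. `blame_line` assumes `git` is installed and the path is tracked in a repository.

```python
import subprocess
from datetime import datetime, timezone


def parse_porcelain(output):
    """Pull commit, author, time and summary out of --line-porcelain text."""
    info = {"commit": output.split(" ", 1)[0]}
    for line in output.splitlines()[1:]:
        if line.startswith("\t"):
            break  # the tab-prefixed line is the source text itself
        key, _, value = line.partition(" ")
        info[key] = value
    return info


def blame_line(path, lineno):
    """Run git blame for a single line of a tracked file."""
    out = subprocess.run(
        ["git", "blame", "--line-porcelain", "-L", f"{lineno},{lineno}", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_porcelain(out)


def annotate(source_line, info, now=None):
    """Append who, how long ago, and which commit to up to 80 chars of the line."""
    now = now or datetime.now(timezone.utc)
    when = datetime.fromtimestamp(int(info["author-time"]), timezone.utc)
    days = (now - when).days
    return (
        f"{source_line.rstrip()[:80]}  "
        f"# {info['author']}, {days} days ago, {info['commit'][:8]}: {info['summary']}"
    )
```

The annotated string is what would be written back into the linecache entry for that line, and the `days` value is also enough to flag lines that were touched very recently or not for years.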
And what's interesting, like this is already in itself useful, I think.
But what's more interesting is other tools use this as well.
So, for example, if you use PuDB, which is a sort of visual debugger, kind of.
It's like a command line one.
I know visual in the sense of like Emacs is visual, not like PyCharm is visual.
But it will actually pull up that data.
So you can see they jumped into the PuDB debugger
and it's actually showing all this git blame attribution
as well that they've added.
So yeah, pretty interesting.
What do you all think?
Yeah, I think that looks really cool.
I mean, I always do git blame
whenever I run into something that's weird
with the hope that someone else
will be able to explain it to me.
Exactly, who knows about this or who do I talk to about breaking this? Right. Yeah. You
could even put like PR numbers and stuff in here. Right. And that'd be pretty cool. Yeah. Yeah.
Yeah. That'd be super cool. Yeah. One of the things, I don't really like the name
git blame, but it's there. But I agree with Julia that the main thing I use it for isn't
to try to figure out who broke it, but who to ask about this chunk of the code.
I agree.
Because usually when you see something that's really confusing and weird, you're like, I know they didn't just pick the hard way of doing this because they didn't want to do the easy way.
There's something that I don't fully understand, some edge case that's crazy here.
I'm going to go talk to that person.
So, yeah.
Also the,
the,
how long ago it was edited.
So if there was something that was edited yesterday,
that's probably the problem.
Yeah,
exactly.
Like in this little screenshot here,
some of these are edited like 1,427 days ago.
That's probably not the problem.
Maybe,
but I feel like I have the opposite assumption.
Like if something is from six years ago and it's weird,
I'm like, well, probably things were different back then.
Okay. Yeah.
Yeah. Yeah. It's no longer applicable to the new data, new situation.
Yeah.
Oh, that'd be an interesting thing also is to have like a tool that would tell you if something's like over a thousand days old or something like that, you probably should go refactor it to make sure somebody understands that code.
Yeah. Yeah.
Yeah, for sure.
All right.
Jumping back to the first item really quick.
In the live stream, Alexander out there.
Hey, Alexander.
Says, I wonder if graph databases with Gremlin queries could be more suitable for data science.
You know, SQL joins are way harder.
Yeah, graph databases are pretty interesting.
If you're trying to understand the relationships, that may well be better.
I don't know.
Julia, do you have any thoughts on this? I don't know anything about graph databases. So out of my league. I didn't have a desire to understand graph databases until I found out that there
were Gremlin queries. Now I think I will. Well, Brian, they don't start out as Gremlin queries.
They're mogwai inserts. And then if you insert them after midnight, then they become
a Gremlin query. I mean, come on, we all know how it goes. You definitely don't want to get them wet.
Oh, that's an old show. I'm not sure if everyone's going to get that reference, but yeah, I
love that show. Okay, anyway, let's move on to the next one. The next one is you, Julia. Yeah, so I
wanted to highlight fsspec, so file system spec for people who
can't hear letters very well. So this is the basis for s3fs, gcsfs, I'm not getting the letters
right, but there's one for GCP, there's one for S3, and it's a file system storage interface or, like, the basis for a file system.
And so, you can do things like you can open just files as you can just take a path and open it as
a file object in Python and read it with all the normal, like, read write operations.
Oh, interesting.
But from anywhere. So, like, there's all these different ones for S3, for GCS,
and for, like, even for, like, HTTP and just basically anything you can imagine. Anywhere
you can imagine a file being. Either there's already been one of these written. It's kind
of like a, it's an interface and then you write different packages on top of it that are like drivers or something.
They have some name for it.
And it allows you to treat the file system as like this interchangeable building block.
So you don't end up writing like boto3 code or something that's like very specific to a specific cloud
storage you write like this more general code and then um it's really useful for like a lot of free
data sets that are hosted on different clouds but like they'll sometimes be on one cloud and
sometimes be on another but like basically it's the same data um or if you're at a company and
you want to like switch clouds it just makes that whole thing so much easier it looks really really useful especially for avoiding cloud lock-in yeah yeah and you can
always write like you can always write your own one if something else pops up you can write your
own implementation of that all right so there's an example here talking about using a file system
in the docs that says something to the effect of, well, you want to open up a CSV and feed it off to pandas read_csv. So normally you would say open
CSV file, and then you just say pandas read_csv and give it the file stream. But what if that's
on the internet? What if that's on S3 with authentication? What if that's,
you know, somewhere else, right? And so with this one, you can just say fsspec.open,
here's a URL. And now that's a stream, right? Or that could be, here's an S3 location,
S3 bucket, go get that, right? Yeah, yeah. So instead of passing the path directly into the read function, you pass in the file object. And it's really powerful. Like it seems like a thing that we shouldn't need, but, um, files get like the file locations
can get so crazy so quickly.
Um, and this just really helps simplify and like make it so you don't have to think about
this stuff, which I think is what most people want.
It's what I want.
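A small sketch of the "same code, different storage" idea, assuming the `fsspec` package is installed. The helper `read_table` and the sample CSV are hypothetical. Only the built-in `memory://` and local `file://` backends are used so that nothing needs credentials; an `s3://` or `gcs://` URL would additionally need s3fs or gcsfs installed and configured, which is not shown here.

```python
import csv
import tempfile
from pathlib import Path

import fsspec


def read_table(url):
    """Read a CSV from any location fsspec understands and return its rows."""
    with fsspec.open(url, mode="r") as f:
        return list(csv.DictReader(f))


CONTENT = "name,score\nada,3\ngrace,5\n"

# The same data in two places: the in-memory file system and a local temp file.
with fsspec.open("memory://scores.csv", mode="w") as f:
    f.write(CONTENT)

local_path = Path(tempfile.mkdtemp()) / "scores.csv"
local_path.write_text(CONTENT)

from_memory = read_table("memory://scores.csv")
from_local = read_table(local_path.as_uri())  # a file:// URL

print(from_memory == from_local)
```

Because `read_table` only ever sees a URL, a test can hand it a `memory://` path while production code hands it a cloud location.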
Yeah, for sure.
So like there's a local file system option, but then you could also have an FTP file system or you could have something else, right?
All sorts of different options.
Yeah.
Yeah.
All sorts of stuff.
Yeah.
Okay.
That's cool.
Brian, what do you think?
Does this have any applicability for you?
Oh, yeah, definitely.
And that's a great abstraction layer to put in place to just have reading as if it was a file and have it moved. It also helps you develop
tools locally and then be able to deploy them into a larger space. So it's cool.
Yeah, for sure. One of the things that always makes me a little hesitant when I hear people
say things like we're cloud native, like my app is cloud native. That's always code word for me.
Like I will never be able to run my app unless I'm connected to the internet.
You know, it's like,
it depends on all these services together.
And there's no way I can recreate that locally.
But something like this could allow you to say,
well, we're going to have a local file system version.
But then when we go to production,
we'll switch to, I don't know, S3 or, you know, pick it,
pick something.
I've always wanted to make it either a t-shirt or a sticker or both that says,
not a cloud native, just visiting.
Nice. I also think, Brian, there might be testing opportunities here. Yeah, definitely. Give it a test file system. That'd be cool. Yeah, and like Julia said, swapping things out to just have
your logic not have to care where it's coming from. But I guess you'd have to
make sure all of the interfaces,
the different storage systems really are equal.
But I guess you'd have to try that out yourself.
Yeah, there's kind of a bucket, right?
There's kind of like a dict that you can pass, which is like storage options.
So I think that might get a little wonky,
depending on what the different backends need.
But the general principles are the same. And it also, I should have said this originally, but it also
allows like fsspec itself to contain logic to do things that are general to all the different
libraries, like caching and things like that. Interesting. Like you could put
a caching layer on top of arbitrary things like S3, Google storage and Azure buckets or blob storage.
Yeah.
Yeah.
Maybe even save money on bandwidth there if you can do some caching.
Yeah.
If you can do it right.
Yeah.
Super, super neat.
Brian, you're going to tell us about how to slim down our Docker containers.
But before you do, I want to tell people about our sponsor for this episode brought to you by Sentry.
So how would you like to remove a little stress from your life in addition to just abstracting your file system, maybe tracking down
some errors. So do you worry that your users may be having difficulties or encountering errors with
your app right now? And would you even know it until they send that support email? How much
better would it be if you got the error or performance details sent right away and with
all the call stack, maybe with git
blame in there, the local variables, the active user who was logged in while this happened, all
that kind of stuff. So with Sentry, it's not only possible, it's actually really simple. I've used
Sentry on our websites before. So it's on Python Bytes, TalkPython
Training, all those different sites. And I've actually had someone encounter an error trying
to buy a course over on TalkPython Training. I got the Sentry notification.
I said, oh, geez, I can't believe this problem crept in here. And I fixed it really quick and
started to roll out the fix and actually got an email. They said, hey, we're having this problem
buying a course. I know, I've almost got it fixed. Just give me a moment and try again.
And they were just like, what? That doesn't make sense. So they were very surprised.
And so surprise and delight your users.
Create your Sentry account at pythonbytes.fm slash Sentry.
And when you sign up, there's a promo code field.
Make sure that you put Python Bytes, all one word, all caps with a Y in there.
And you'll get two free months plus a bunch of extra features and so on.
So also, it really lets them know that you came from us rather than just somewhere else.
And that helps support the show a lot.
So pythonbytes.fm slash Sentry
and promo code Python Bytes.
Awesome.
Thanks for supporting the show, Sentry.
And Brian, let's talk Docker.
Yeah, let's talk Docker.
I mean, I'm starting to use Docker
more and more.
And I like the experience,
but I was interested
when this article came up.
So it was in June.
I saw this article called The Need for Slimmer Containers.
And this is from somebody, Ivan.
Ivan, I'm not going to try his last name.
Ivan something.
But anyway, it's an interesting discussion.
And the idea around the original post was that there's now a Docker scan that you can use.
So you can use Docker scan to scan for vulnerabilities in your Docker containers.
And I even thought, well, I'll look at some of the standard Python containers that are available.
Right. Theoretically, some of the things that are nice is I can just go and say docker or in my my docker container and say from python colon three nine and i don't have to think about
how do i install python how do i keep it up to date you know make sure that pip is there and
that i'll be able and you know pip install stuff that needs to do build things and that's all that
stuff will be there right so it seems like of course this is what you want yeah well and also
just that's kind of the one of the
neat things about dockers i can just say i have these standard parts now i just want to put my
custom stuff on top of it and um and it's great so well what did he find so he used uh so docker
scan apparently uses a third-party tool called Snyk Container, and we've covered Snyk before, not the container
version, but we covered Snyk in episode 227. So it's looking for vulnerabilities,
and that's a good thing, but he found them in everything, and he found them in all of
the standard Python ones except for Alpine, I guess. And so he didn't really know what to make of it,
really. He was just sort of reporting his results that maybe Alpine is the only one with few
vulnerabilities. But then this went out on Hacker News and there was a big discussion around it.
So he updated the article, which I appreciate, with some of the feedback that he got.
And so some of the feedback was that these vulnerability checkers sometimes give you false positives.
I know what that means,
but I don't have enough experience to know if these really are false positives or if they're actual vulnerabilities or not. The other thing was that maybe
some people suggested that these standard ones really aren't updated very much. So I don't
really know much about that either. And if they're not, that's kind of a bummer because I
think people are relying on them. So I actually just kind of am left with a little bit of a confusion as to what to do.
I also want to mention Alpine.
In the original article,
he says Alpine is pretty good for vulnerabilities, but then his follow-up says,
well, there's a lot of applications that can't run on Alpine because of some issue or another.
So anyway, I'm not sure what to make of it.
So I was hoping Michael might give us some insight.
I did some thinking about this this morning.
And in fact, I recently spoke a lot about this
over on TalkPython.
So I had Itamar on the show
and we talked about best practices for Docker packaging.
And we talked a lot about both security and package size.
So I can try to relay
a couple of things from that. So we've got our official image over here, our Python official
image. There's actually a bunch of options. As you can see, there's a few, like 3.10 beta 2 buster
or the 3.10 RC buster. That sounds bad, but I think it's actually good. No, I'm just kidding. I know what it is.
So these are by default based on Debian,
and Buster is the latest version of Debian.
And so you can do a Buster,
which is like full Debian with 3.10,
or you can do a 3.10 Slim Buster,
which is like a slimmed down version of Debian Buster
that supports Python 3.10.
Okay, so there's a lot going on here
in terms of the options.
One of, so the article talks about
how Alpine had the fewest security vulnerabilities.
And actually, so the Python latest,
if you run the Snyk package scanner thingy on it,
it says there's 364 vulnerabilities
if you just do Python latest 3.9 and 353 after
you run apt update apt upgrade. So if you try to get the container to update itself,
there's still 353 in that one. I don't use that. I use Ubuntu. So I use the Ubuntu latest
and the bare version of that one had 31 vulnerabilities.
But then if I either install Python through apt or build it through source and put in
the necessary foundational bits like build essentials and stuff to build Python, it goes
up to 35 total problems where 28 of them are low.
So seven are medium, nothing major.
One thing I thought was weird was I actually ran another step where I said, okay, let's uninstall those intermediate tools like GCC and Wget and stuff
like that, that I needed to get stuff on the machine, but I'm not going to use again.
And I took them away. And almost all those warnings were about those tools that I had
apt uninstalled. So I don't know why Snyk is still showing them. Cause if I go into the container,
I type Wget, it says nope, this thing
is not installed sorry but it still says the warning is that wget has a vulnerability in it
for example right so there's like there's like this over reporting for sure but i mean the
difference between 28 and 350 is not trivial right right so like run an apt install python 3
type of thing is not you know it's it's probably worth it, for example.
When I switched from Python 3.9 to Python 3.9 Slim Buster,
it went from 350 to 69.
So that's a lot better, right?
Yeah.
It's still not as good as Ubuntu, but it's a lot better.
It's still twice as many.
It sounds better, but it could be like 359 low
problems and then 69 critical ones. It totally could, it totally could. Yeah, also
if we can't trust Snyk necessarily, if you can't trust your
reporting system, then maybe none of this means anything, right?
Yeah.
Yeah.
I think one of the things the article originally started out to address was if you have fewer subsystems, there's no chance the missing subsystem could get hacked because it's not there.
Right?
So if there's a vulnerability in SSH, but you literally don't install SSH, who cares?
Whereas if you just take the full distribution, you may potentially get
affected by something you dragged along. And then it went down this rabbit hole of like, well, let me
scan it and so on. So I want to add one more thing. Alpine did result in the best outcome from the
scanner, but there's a lot of issues with Alpine and Python. So for example, there's this PEP here, 656, that right now, if I try to
pip install something on Alpine, so especially in the data science world where things are large and
the compiling takes a lot of steps and so on, the wheels that are built for Linux are built for,
what is it, glib, glibc, I mean, hold on. I'll look over here. I wrote it down so I know.
No, I didn't write it down.
Sorry.
I think it's glibc, which is the C runtime on Ubuntu and Debian.
But there's a different one, musl, on Alpine.
And the wheels are not built for musl.
They're built for glibc.
And so you can't pip install that.
You've got to download everything and then compile it
and it's like compiling matplotlib and jupyter from scratch can take a really long time versus
just downloading the wheel and it takes up a lot of space and there's there's a bunch of issues and
things around that that make it slightly not Python friendly. That's why there's this PEP 656 to allow wheels to be tagged as supporting musl, not glibc.
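To make the tagging point concrete: PEP 656 adds a `musllinux` platform tag alongside the glibc-oriented `manylinux` tags, and the tag is visible at the end of a wheel's file name. The helper below is a hypothetical sketch that only inspects file names (the package names are invented); it is not how pip itself selects wheels.

```python
def libc_family(wheel_filename):
    """Classify a wheel by the platform tag at the end of its file name."""
    stem = wheel_filename[: -len(".whl")]
    platform_tag = stem.split("-")[-1]
    families = set()
    # A wheel may carry several platform tags joined by dots.
    for tag in platform_tag.split("."):
        if tag.startswith("musllinux"):
            families.add("musl")
        elif tag.startswith("manylinux"):
            families.add("glibc")
        elif tag == "any":
            families.add("pure-python")
        else:
            families.add("other")
    return sorted(families)


# Hypothetical wheel file names for illustration.
print(libc_family("example_pkg-1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"))
print(libc_family("example_pkg-1.0-cp39-cp39-musllinux_1_1_x86_64.whl"))
print(libc_family("example_pkg-1.0-py3-none-any.whl"))
```

A project that publishes only `manylinux` wheels leaves Alpine users compiling from source, which is the slow path described above.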
Is that more than you wanted, Brian, or are you good?
Okay, so the takeaway that I'm getting is probably not panic on some of these, but maybe at least pay attention to them.
And it is good, like you said, to remove tools out of your Docker images that you're not using. If you're
not using Wget in your application, take it off. Things like that. Yeah, exactly. I think Julia's
point was great, right? It might be a false positive, but at the same time, if you're not
going to use it again, because Docker, a lot of times you pip install all your stuff and then it's
kind of ready to run, but you're not going to go and pip install something again you're going to do a new docker build from scratch right
like one of the final lines could be remove remove all those intermediate things that could have
problems and make it larger and whatnot yeah i've thought um so i've only thought about this from
like packet from like image size right like that that you want some more images just because it
takes forever to get them around. But, um, it's interesting to think about from the vulnerability
perspective. And I've always seen it done as, um, you do whatever installation you need and then
you do all these like cleaning steps. But what you said, Michael, about like not ever putting
certain things on your image was, is interesting. I haven't heard of that before.
Yeah. Thanks. I also had Peter McKee from, who works at Docker on TalkPython a little while,
like six months ago or something. And he talks about having these multi step builds,
something to the effect of doesn't make as much sense with Python. I'll try to put it together.
But like, imagine you're building a Go library, you could put the Go runtime and build tools on a
container, build your thing. but the thing you get from Go
is an actual binary that's all self-contained. You could throw that container away and just copy the
output of that into your actual container and never even put all those tools on the actual
system that goes to production. With Python, that might look something like maybe using Pex to
package up all the stuff inside of a virtual environment. And as long as Python, the runtime, is there,
then you can run the Pex file on your other machine,
and you could potentially not even ever install those tools, which might be good.
Yeah, that makes sense.
Yeah. There's a lot there that
is sort of beyond my comfort level, but
that's what I thought as I looked at this, Brian.
Well, thanks for taking a look.
Yeah, you bet. All right.
We'd like to talk about GUIs on the show every now and then. And so, and we want to talk about pandas and data frames and
data science and all that. So let's put those together. There's this project over here called
PandasGUI, and the documentation is sparse, let's say, but it's pretty easy. There's
an example or two. So I could come down here and I could do my pandas stuff and create a data frame. And then I could just import show
from pandasgui. And within my notebook, it will pop open a separate window that then allows
me to cruise around and check it out. So you can print out the data frame in a notebook and you
get kind of a static Excel grid looking thing. And that's nice. But with this,
you get a interactive one that lets you sort and select. You can actually copy and paste
chunks out of there as if it was Excel and then paste it in other places. It also has a plotting
library with like pictures. So I'm going to go click on the bar graph picture. And then there's,
there's a list of all the columns and the things that the bar graph needs. And you can drag and drop this column is the x axis. And this column is the y axis. And
I want to group by color and have, you know, group by color it by some other aspect of the data,
you know, like group into multiple charts, or multiple lines or plots on a chart, all sorts
of cool stuff like that. There's a statistics section. There's you can export important export, I guess, important CSV files with drag and drop.
And there's also search that you can do.
So it's a pretty neat, quick way to explore pandas.
Yeah, it's a neat idea.
When you first encounter a data frame, you really want to just be able to look at it
without any assumptions. And there's a
lot of stuff that kind of goes towards that, like the .plot API in pandas,
making it really accessible to make plots really quickly. But this is kind of like the
step beyond that, right? Of just visualizing it immediately.
Yeah. Like, one thing you get when
you view the data frame, like I said,
is it looks kind of just like printing df, or just typing df in the notebook. But then on the right,
you can say, oh, I want to see the filters. And you can type in these filter expressions,
these query expressions, and then, like, pile them on. You can have little checkboxes
to optionally turn them off, but not delete them. And then of course you can sort within there
like that. And the graphing, I think the support for the graphing part is really, really helpful.
So the fact that you can just go and click and say, oh, I want a box plot. And then the box plot
needs these things. You can just drag and drop the column from your data frame definition
over, and it just live updates.
Yeah. I think that really lets people visualize the data in the way that they want to sometimes,
rather than the way they already know how in Matplotlib, which I think is what people
end up doing, at least for exploratory stuff.
Yeah, exactly.
You could real quickly switch between a bar, a box, a scatterplot, back and forth without
having to actually be familiar with how those work.
Can you tell if there's a way to export the filters or is there any mechanism for that?
I don't think so. At least in the YouTube explainer video, there were some comments
like, you know what would be awesome? Export this as code from here so that I can just
turn it back into Python. I didn't see anything like that, though.
Yeah. Sometimes GUIs are a little weird for me because of that. You end up in this
GUI world and you can't reproduce anything.
I clicked on a whole bunch of stuff and it looked great, but
don't touch it. I can't do it again.
Okay, but to be fair, it is a fairly quick way to look
at the data. Maybe
you can't produce that exact plot again, but you know what the data looks like.
And you can use a different plotting mechanism to do that.
Yeah, and the visual is pretty clear.
Like, okay, well, x is assigned to speed and we know it's a histogram.
And so you could pretty quickly, you know, with some Googling and Stack Overflow, go,
all right, how do I Matplotlib a histogram, and get that going?
You know, that's a huge time saver.
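To make that concrete, here is a minimal sketch of redoing a GUI-built filter and histogram in plain code so it can be rerun later. The DataFrame, its `speed` and `color` columns, and the `speed > 40` threshold are hypothetical stand-ins for whatever was dragged around in the GUI, and the sketch assumes pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the DataFrame explored in the GUI.
df = pd.DataFrame(
    {
        "speed": [12, 35, 47, 51, 68, 73, 88, 95],
        "color": ["red", "blue", "red", "blue", "red", "blue", "red", "blue"],
    }
)

# A filter expression like the ones typed into a filter panel,
# written with DataFrame.query so it is reproducible.
fast = df.query("speed > 40")

# Histogram of the filtered column: counts per bin plus the bin edges.
counts, edges = np.histogram(fast["speed"], bins=3, range=(40, 100))

print(fast)
print(counts, edges)
```

With Matplotlib installed, `fast["speed"].plot.hist(bins=3)` would draw a comparable chart through the pandas `.plot` API.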
Yeah.
But some sort of export of, like, okay, give me the code to make this plot in
my own code.
That would be great.
Yeah, absolutely.
Absolutely.
All right.
On to the next. But before we get there, I do want to call out a shout-out by Piling
that fsspec is sweet. Good mention. Yeah.
I like it as well. Cool. All right. xarray.
Okay. So xarray is my favorite library.
It's a pandas-like API, but it's for n-dimensional data. A lot of times people talk about it in geospatial data, where there's
lat, lon, time, and others, but it's also for image data, where there's maybe a bunch of different
bands from, like, satellite imagery, or other disciplines where you just have labeled data
that's not tabular. So the axes mean something,
but there's not just one or two of them.
Then xarray is great for that
because it lets you do things like select
a certain subset of time
or a certain subset of whatever your dimension is.
And you can also aggregate across different dimensions
and you can use the labels directly.
If you don't have a tool like this, and I see people doing this a lot with machine learning workflows, they'll have a separate list of all their labels, and then they'll have their data, and they'll do some manipulation and try to reattach them at the end, and it just turns into a mess. xarray just takes
care of all that for you. It's pretty great. And I think that it has applications that have not
been fully realized yet. It's starting to take off in other spaces, but it really comes from
this geospatial world. But I think it could be useful for all sorts of people.
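The label-based selection and aggregation described here can be sketched roughly as follows. The dimension names (`time`, `lat`, `lon`), the coordinate values, and the data itself are made up for illustration, and the sketch assumes xarray, pandas, and NumPy are installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up 3-D data: 4 time steps x 2 latitudes x 3 longitudes.
data = np.arange(24, dtype=float).reshape(4, 2, 3)

da = xr.DataArray(
    data,
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2021-06-01", periods=4),
        "lat": [10.0, 20.0],
        "lon": [100.0, 110.0, 120.0],
    },
)

# Select a subset of time by label instead of by integer position.
subset = da.sel(time=slice("2021-06-02", "2021-06-03"))

# Aggregate across one named dimension; the lat/lon labels stay attached.
time_mean = da.mean(dim="time")

# Look a single value up by its labels.
value = time_mean.sel(lat=10.0, lon=100.0).item()

print(subset.shape, time_mean.dims, value)
```

Because the labels travel with the array, there is no separate list of labels to reattach after the manipulation.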
Right.
Because in geospatial, sometimes you have three dimensions, not just two.
Yeah, you almost always have three, right?
Sorry, Brian.
No, the documentation looks great too.
The documentation has like getting started guides and tutorials and videos and galleries
and stuff.
So definitely check out the documentation.
Yeah, I think it got a major... I looked at it for this too,
and it seems like it got a major facelift.
So it looks really nice.
It also has plotting. It supports the .plot API,
or some different version of it that's like the pandas version,
but you can plot in, you know, three dimensions, or aggregate
and then plot. And that's a really nice way to get the visuals quickly. And then
the last thing I wanted to say about it is that it's normally backed by NumPy arrays, but it can
also be backed by Dask arrays or sparse arrays or all sorts of different arrays natively. So
it's a really cool,
it's another one of those building-block things
where you can have xarray's labeling
and indexing and all the nice stuff.
And then down inside it can be NumPy or CuPy or Dask.
How interesting.
So it can do that juggling and piecing back together
that other people are mainly doing
and you just have this simple API and if it has to do that,
it'll figure it out.
Yeah.
Yeah.
That's pretty cool.
Nice.
And you talked about CuPy and Dask.
Those are some pretty interesting back ends for this.
Yeah.
Yeah.
The Dask one is...
I said CuPy, and now I'm wondering if maybe it's just Dask and then CuPy.
So don't quote me on that.
But yeah, the Dask one is really integrated with the xarray code. They
do some special things to make it work with parallelizing and things, but
from the user experience, it's the same.
Yeah, fantastic. And then I also noticed it requires
Python 3.7. Really nice to see tools sort of keeping up with the latest, not really old stuff.
Well, hopefully it's 3.7 and above.
Well, yeah. Yeah. Greater than or equal to.
Well, I mean, I ran into a library. It was an internal thing that was only 3.7. I
assumed "or above" and I tried it on 3.9, and it fell over. I'm like, what's going on?
It was only 3.7.
It's weird.
Okay, that is weird.
It'd be interesting to think about what special features of 3.7 they're depending on that broke in 3.8.
Yeah, that's what I was thinking.
How do you do that without just checking for == 3.7 on the version?
Yeah.
So anyway.
Yeah.
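The difference between "3.7 and above" and "only 3.7" can be sketched as two small checks against `sys.version_info`. The helper names `meets_minimum` and `is_exactly` are hypothetical, and nothing here is known about how the internal library in the story actually did its check.

```python
import sys

MINIMUM = (3, 7)


def meets_minimum(version=None, minimum=MINIMUM):
    """True if the (major, minor) version is at least `minimum`."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


def is_exactly(version=None, pinned=MINIMUM):
    """True only if the (major, minor) version equals `pinned`."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) == pinned


# A 3.9 interpreter passes the "or above" check but fails the pinned one.
print(meets_minimum((3, 9, 0)), is_exactly((3, 9, 0)))
```

In packaging metadata, the same distinction is usually expressed as `python_requires=">=3.7"` versus `python_requires="==3.7.*"`.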
All right.
Well, that's it for our six main topics.
Brian, you got anything else you want to throw out there quickly?
Yeah, actually. I
didn't have this up, but on Twitter somebody reacted to me with an emoji and I
didn't know what they meant. So I looked it up. Let me pop this up,
this Emojipedia, and it was helpful.
And you can just copy and paste the emoji
that somebody uses in there,
and it tells you what it means.
And, you know, not just what it's supposed to mean,
but also what people are using it for.
I don't know, for somebody that's sort of an old guy
that is out of touch sometimes, this was helpful.
So anyway.
Yeah, I mean, sometimes it's obvious,
like a heart, we know what a heart means, right?
But, you know, like hands together,
it's not necessarily obvious that that's like a thank-you,
sort of bow type of thing.
I mean, there's certain ones where you're like,
ah, what does that mean?
It was like hands together with, like, arrows coming out of the top, and I'm like, I don't know what this is.
But apparently it's just raising hands, like you're saying hooray for somebody.
Oh, okay, that's nice.
So, okay, it's good.
I use Emojipedia all the time, but I think I use it in the opposite way.
I use it to get an emoji to put somewhere, because I don't have an emoji
keyboard or whatever.
Oh yeah, that would be good too.
The other thing I wanted to bring up is
I hopefully have some cool news
to share tomorrow
about the pytest book, and
the news will show up on a revamped
pytest book site.
So if you go to pytestbook.com,
you get redirected to this
pythontest.com page
where I'll talk about the second edition.
So hopefully there'll be news
about the second edition coming out tomorrow.
And I-
Is this your new static site, Magic?
Yeah, yeah, static site.
And I totally, and it goes dark and light.
But I totally stole from Prajan.
So Prajan has the same, he's got a really nice site. So it's a bunch
of great, great. It looked great. And I'm like, that'll work. I'll just do what he's
doing. So that's what I did.
Yeah. Very cool.
I think we have exactly the same stack for our Saturn Cloud site now.
Oh, how neat.
That's cool.
Awesome. How about you, Julia? Anything else you want to give a shout out to?
Well, I've been really into entry points recently.
Just like the concept of them is very cool.
As in, like, Python packages, where you can give them almost like CLI-command-type entry points?
Yeah, but the thing that I think is really cool,
and this is the example that first made me aware of entry points,
is that pandas has this .plot API. I think I've mentioned this three times now.
But you can swap out the backends.
So you don't have to use Matplotlib.
You can use other backends.
And all the logic for that is in the other visualization
libraries themselves, not in pandas.
So it's just like you can swap out other things.
It's not just for CLIs, I guess.
Yeah.
How neat.
All right.
Yeah. I learned about entry points a year, year and a half ago.
And ever since I'm like, oh yeah, this is awesome.
I can now create these little commands that'll be part of just my shell.
I love it.
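Both uses of entry points mentioned here, shell commands and swappable plotting backends, can be inspected from the standard library. This is a rough sketch: `entry_points_in` is a hypothetical helper name, `console_scripts` is the group behind installed shell commands, and `pandas_plotting_backends` is the group pandas looks in when a different plotting backend is selected. What it prints depends entirely on which packages are installed.

```python
from importlib.metadata import entry_points


def entry_points_in(group):
    """Return the entry points registered under `group` as a list."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later: selectable entry points.
        return list(eps.select(group=group))
    # Python 3.8 and 3.9: a plain dict keyed by group name.
    return list(eps.get(group, []))


for group in ("console_scripts", "pandas_plotting_backends"):
    names = sorted(ep.name for ep in entry_points_in(group))
    print(group, names)
```

A package that registers itself under the pandas group can then be chosen with `pd.options.plotting.backend`, which is how the backend logic stays in the visualization library rather than in pandas.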
Yeah.
The other thing I wanted to say was the GitHub CLI is really cool.
I think that's standalone, but I've been using it a lot.
I'm sure people know the Git CLI, but what's the story of the GitHub CLI?
Oh, well, the GitHub CLI
makes it so,
if you have ever tried to
check out a branch on someone else's fork,
like, if you want to
evaluate a PR that someone has put on a fork,
that is the situation
where the GitHub CLI is really great,
because you can just do, like,
gh checkout pr or gh pr checkout, whatever the number is.
And then you're just on their branch.
And if you have push access to their branch,
if you're a maintainer and they've allowed it,
you can just push directly.
I mean,
I was always looking up that sequence of commands before.
I know people have, like, Git aliases and stuff,
but yeah,
I'd really recommend checking it out if you do a lot of GitHub stuff.
Okay.
Awesome.
Yeah.
That's great advice.
Yeah.
I often want to, like, check out someone's
pull request.
I want to be able to like play with it and run their code.
And yeah.
And so it's the best.
Yeah.
Awesome.
All right.
I got a couple of things to add,
by the way.
First of all,
just that first one, Practical SQL for Data Analysis, that you talked about.
It also has a similar theme to what you were talking about, Brian.
One of the things I thought was cool, though: as you scroll through it,
it has a progress bar for reading at the top, and that just made me so happy.
I don't know why. That was really neat. All right.
But I have a bunch of "hear all about it" sort of things. So really quick,
Python B2, it's got the... yeah, okay, live update: Python 3.10
beta 2 is out if people want to check that out, and you can go download that. It also highlights
all the major features, like the pipe operator for writing unions in type specifications and
a bunch of other stuff that people might care about. Structural pattern matching is probably a big one.
Yeah, go to the completely different.
Is that on here?
And now for something completely different.
I love that part.
So write about the files.
Yeah.
Oh, interesting.
The Ehrenfest paradox concerns the rotation of a rigid disk
in the theory of relativity.
Its original 1909 formulation as presented by...
Yeah, okay. That is unexpected, but very cool.
And completely different and irrelevant.
Yeah. Awesome. Okay, so
takeaway: 3.10 beta 2
is out. People can check that out. There's also some
security patches for Django, so
be sure to check that out. One thing that surprised
me is that the Microsoft
"install Python from the Windows Store" option
already has a 3.10 beta
Store install. So, okay, that's pretty cool that they're keeping that up to date.
And it's rated E for everyone.
Yeah, even kids can pip install. Awesome. So Frederick Bankston sent a message in response
to our last show, where we talked about method overloading by type.
Like if it takes an int or a string, it calls different functions.
He also pointed us towards this other library, multimethod, that is similar.
So people can check that out.
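The idea of calling different functions depending on whether the argument is an int or a string can be sketched with the standard library's `functools.singledispatch`. To be clear, this is not the multimethod library mentioned above and not the library from the previous episode; it is a stdlib stand-in that dispatches on the type of the first argument only, and the `render` function is made up for the example.

```python
from functools import singledispatch


@singledispatch
def render(value):
    """Fallback used when no more specific implementation is registered."""
    return f"unsupported: {type(value).__name__}"


@render.register
def _(value: int):
    # Chosen when the first argument is an int.
    return f"int: {value}"


@render.register
def _(value: str):
    # Chosen when the first argument is a str.
    return f"str: {value!r}"


print(render(5))
print(render("a"))
print(render(2.5))
```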
That's cool.
Neat.
Speaking of the GitHub stuff, I've been starting to use the PyCharm 2021.2 early access version, the Early Access Program version.
And it's been working fine, so if people want
to try out the new features, there's a bunch of cool stuff. You have support for Python 3.10
and new stuff for pytest. I don't remember if this came in here, but one thing that I did
learn about recently that's in there that's super cool is, in PyCharm, if you log
PyCharm into your GitHub account, there's a pull request section and you can just click it and it'll do those same steps that Julia was talking about.
Like right there in PyCharm.
Just go, I want to try that PR before I accept it.
And just click that and go.
You can even have comments.
You see the conversation inside there and everything.
It's cool.
Never go to GitHub again.
Exactly.
And just forget how to use it, basically.
All right. That's it. That's all the items I got.
So yeah, I've got other stuff that's just hanging around from
before.
Cool. All right. Well, you want to close it out with a joke?
Yeah, a couple of jokes, always.
All right, so over at upjoke.com/programmer-jokes you'll find many bad jokes, some even
that are not very appropriate or whatever, but there's a few that are funny. So I pulled out three here. I'll do the first one. Brian,
you can do the second. Julia, you can do the third, I guess, if you're up for it.
Okay.
So this one we should have saved for six months from now, but I asked a programmer
what her new year's resolution would be. She answered 1920 by 1080.
That's so bad. No, that's awesome.
It's really bad.
All right, well,
you got to do the next one.
How does a programmer
confuse a mathematician?
I don't know how.
Just saying that
X equals X plus one.
All right, Julia.
Okay.
Why do Python programmers
have low self-esteem?
They're constantly
comparing their self to other.
Also bad. Probably the worst. Sorry we gave you that one.
It's okay. I saw the one that Brian did and I was like, oh, it should be x += 1.
I was like, no, that ruins the joke.
Exactly. Yeah.
Yeah, I actually often do the slow way, or the non-obvious way,
yeah, x = x + 1, just to make it more obvious to people reading it sometimes.
Yeah.
Yeah, no, I agree.
Yeah, at least it's not C++ with x++.
I love that. No, no, we should have that.
I'm okay with x++,
but not that, also, ++x.
Oh,
the pre-increment.
Yeah.
The pre-increment.
The slight.
That's weird.
Yes,
exactly.
Exactly.
But I could go for it.
X plus plus.
Come on.
All right.
Well,
Julia,
thanks for joining us this week.
And Brian,
thanks as always.
Oh,
it was a pleasure.
Thanks,
Julia.
Yeah.
Bye everyone.