Python Bytes - #238 A cloud-based file system for Python and a new GUI!

Starting point is 00:00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is episode 238, recorded June 15th, 2021. I'm Michael Kennedy. And I'm Brian Okken. And I'm Julia Signell. Hey, Julia. Thanks for coming on the show. Yeah, thanks for having me. Yeah, it's great. Why don't you tell folks a bit about yourself?

Starting point is 00:00:19 Yeah, so I'm the head of open source at Saturn Cloud and a maintainer of Dask. So I split my time half and half. I spend half my time just doing regular like maintenance-y stuff on Dask and then half my time doing like engineering and product management on Saturn Cloud. Saturn Cloud is a data science platform that really specializes in distributed Dask clusters in Jupyter and making it really easy for people to get up and going with those things on AWS. Yeah, Dask is really interesting. You know, when I first heard about it,

Starting point is 00:00:51 I thought, okay, this is like a grid computing scale-out thing, which I probably don't have a lot of use for. But then I was speaking with Matthew Rocklin about it, and it has a lot of applicability, even if you have not huge data, huge clusters, right? Like you can say, even on your local machine, scale this out across my cores or, you know, allow me to work with more data than will fit in RAM on my laptop and stuff like that, right?

Starting point is 00:01:14 It's a cool idea. Yeah. Yeah. It has like a whole number of different ways of interacting with it, right? Like there's that, there's like, just make this thing go faster by parallelizing it. There's all the data framey stuff. There's all the array stuff for more dimensional data. So it's got a, it's got a large API.

Starting point is 00:01:30 Yeah. Cool. And we're going to touch on a couple of topics that are not all that unrelated to those things here. And so, yeah. Speaking of data science, Brian, you want to kick us off? Sure. Yeah.

Starting point is 00:01:43 The first thing I want to cover is an article called Practical SQL for Data Analysis. This is by Haki Benita. One of the things I liked is that the first bit of the article talks about how, with data science, you've got pandas and NumPy and stuff, and often you're also dealing with a database and SQL on the back end. So the first part of the article talks about how some things you can do both in pandas and in SQL, like SQL queries, are faster in SQL.

Starting point is 00:02:25 So there's a big chunk that's just talking about how that's faster. But then he also talks about how there are a lot of benefits to the flexibility and the comfort you can have with pandas, though. So there are trade-offs as to whether you push it too far into SQL, or whether a nice split is good. But then he goes through and talks about a whole bunch of great examples of different things like pivot tables and roll-ups and choices and different things you can do with either pandas or SQL, and really what his recommendations are for whether it should be in pandas or in a SQL query, and then how to do those queries.

Starting point is 00:03:06 Because, I mean, really the gist of the article, and this problem space, is that people are comfortable with pandas, but they don't really understand SQL queries. So this sort of good cheat sheet for how to do the queries is, I think, really kind of a cool thing. Yeah. I think it's really neat. And you have these problems, you know how to solve them in one or the other. And I think this compare and contrast is really valuable, right?
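
As a rough sketch of the compare-and-contrast being described here, the same per-group mean can be computed in SQL or in pandas. This is not taken from the article: the table name, column names, and rows are made up for illustration, and SQLite stands in for whatever database is actually on the back end.

```python
import sqlite3

import pandas as pd

# Made-up sample rows, for illustration only.
rows = [("east", 10.0), ("east", 20.0), ("west", 5.0), ("west", 7.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# SQL: the database does the aggregation and returns one row per region.
sql_means = pd.read_sql_query(
    "SELECT region, AVG(amount) AS mean_amount "
    "FROM sales GROUP BY region ORDER BY region",
    conn,
)

# pandas: pull every row into a DataFrame first, then aggregate in memory.
df = pd.read_sql_query("SELECT region, amount FROM sales", conn)
pandas_means = (
    df.groupby("region", as_index=False)["amount"]
    .mean()
    .rename(columns={"amount": "mean_amount"})
    .sort_values("region")
    .reset_index(drop=True)
)

print(sql_means)
print(pandas_means)
```

Both routes give the same answer; the difference is where the work happens and how many rows have to travel from the database to Python.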

Starting point is 00:03:34 Like, I know how to take the mean of some column in SQL, but I haven't done it in pandas yet. Let's go see how to do that. Or I'm really good at doing pivot tables in pandas, but boy, I always kind of avoided joins in SQL. They scared me. And then how does that even translate? Right. I think that back and forth is really valuable. Yeah. Yep. And then it covers things that I don't even know what they are, like aggregate expressions. I don't even know what that is, but apparently that's a thing that people do. I can help you out at aggregate stuff. No, just kidding. Julia, what do you think

Starting point is 00:04:03 of this? Yeah, no, it's really cool. Like, I agree that having it in pandas and then in SQL, that comparison is super helpful. Like SQL is always super scary to me. And I always end up like Googling a bunch of stuff whenever I have to mangle my SQL. But I know it's so fast, so it's cool to see a way to access that. Yeah, absolutely. This is a good one, Brian. I think a lot of people will find it useful. I also want to just give a quick shout out to the past a little bit. Not too long ago, we talked about efficient SQL on pandas with DuckDB, where you actually do the SQL queries against pandas data frames. So if you're finding that you're trying to do something, and maybe it would be better in SQL, but you don't want to say completely switch

Starting point is 00:04:46 all your data over to a relational database. You just kind of want to stay on the pandas side, but there's that one or two things. Like this is really cool. This sort of upgrade your data frame to execute SQL with the DuckDB query optimizer is also kind of a nice intermediary there. Yeah, Dask also does some, I'm going to try not to make everything about Dask, but Dask does

Starting point is 00:05:09 some things that are kind of, that kind of take some of the ideas from this article of like doing predicate pushdown of like, of pushing down some of the like filters into the read because it evaluates lazily. It doesn't have to like grab all the data greedily up front. It can like do that later. So you can get some of the benefits. That's cool. And it can also distribute the filter bit,

Starting point is 00:05:30 I guess at that point. Yeah. Nice. All right. I want to talk about the usual suspects. So, okay. That was, that was a pretty good show. Was that Quentin Tarantino or something like that?

Starting point is 00:05:40 It's not actually about this. This comes to us from Ruslan Portnoy. And thank you for sending this in. Mentioned an article that has this really interesting idea. How do you apply git blame when you encounter a Python traceback? So here's the scenario. Your code crashes and you either print out the traceback or Python does it for you because it's just crashed. And normally it says, here's the value. Here's the line of code. Here's the file it's in. Here's the next line in the call stack. Here's a line of code it's in. The idea is you can take git blame, which is a command that says,

Starting point is 00:06:17 show me who changed this line of code or who wrote this line of code, or at least touched it last, on every single line of code. And I love this whole idea of like, all right, who did this? And sometimes I'll come across code. I'm like, this is so crappy. Like who did this? Oh, wait, that's me. Okay. Well, at least I know what I would feel about it. But the idea is what if your traceback, on each line where it had an exception, could also show who wrote that line of code. Cool, huh? Yeah, so let's check it out. It's pretty straightforward. This is an article by Ofer Koren, and it basically uses two libraries that are themselves both pretty straightforward. So like, here's a straightforward example of a traceback,

Starting point is 00:06:57 like trying to pop something off of an empty list. It says on this line in the function pop some, you know, there's this line here in the call stack, and then the next line, this line in the call stack, and eventually raise a value error, you know, empty range, can't pop something off of nothing, basically. But this doesn't show you any information about like maybe who wrote that line and who wrote this other line up here, right? So what they did is they took a couple of modules, traceback and then linecache. And it turns out when traceback shows you this traceback, it uses linecache to figure out, okay, from this actual, I'm guessing, bytecode that it's going to run, this CPython interpreter code, what line of what file did this actually come from, right? Yeah, so here's the insight: you can actually change what's in the cache, and because it's a cache, once it's figured out what the lines are, it's not going to read it again. So it's like a list

Starting point is 00:07:58 for each line that you get back and you can just change the value so it said okay well here's like return random. That's what the line of text was. They're like, no, no, no, there's nothing to see here. Move along. If you make that and then you cause it to crash again, what comes out is if you go a little bit further down, normal code, normal code or normal trace back, normal trace back. Then it just, instead of the line of code, it says nothing to see here. Please move along. All right. So what are you going to do with that now that you realize like you can actually change what appears in the traceback?
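
A minimal sketch of that cache trick, under some assumptions: the function name `pop_some`, the file name, and the source text are made up here, and the source is registered in `linecache.cache` by hand (with an mtime of `None`, which `linecache.checkcache` leaves alone) so the example does not depend on a file on disk. It is not the article's code, just the same idea in a self-contained form.

```python
import linecache
import traceback

# Hypothetical source and file name, for illustration only.
SOURCE = "def pop_some(items):\n    return items.pop()\n"
FILENAME = "demo_pop_some.py"  # no such file needs to exist on disk

# A cache entry is (size, mtime, lines, fullname). With mtime set to None,
# linecache.checkcache() does not try to stat the file or evict the entry.
linecache.cache[FILENAME] = (
    len(SOURCE),
    None,
    SOURCE.splitlines(keepends=True),
    FILENAME,
)

namespace = {}
exec(compile(SOURCE, FILENAME, "exec"), namespace)


def crash_report():
    """Trigger the error and return the formatted traceback text."""
    try:
        namespace["pop_some"]([])
    except IndexError:
        return traceback.format_exc()


before = crash_report()

# Overwrite the cached text of line 2. The traceback module reads source
# lines through linecache, so it now prints this text instead.
linecache.cache[FILENAME][2][1] = "    nothing to see here, please move along\n"
after = crash_report()

print(before)
print(after)
```

The code that runs is unchanged; only the text that the traceback displays for that line is different.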

Starting point is 00:08:27 So you write a little regular expression to go and execute git blame on the various files and then to re-inject that back into linecache. And so what they do is, if they know the blame, they just put, you know, up to 80 characters of the line, and then edited on such and such date by such and such person. And here's the commit message, right? And so just basically shelling out to git blame when it crashes. Now you get some really cool stuff. Like on this line, it says this was edited, you know, many, many days ago by so-and-so in this Git commit and so on.
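
One way the blame lookup could be sketched is below. This is an assumption-laden variant rather than the article's code: it parses `git blame --porcelain` output instead of using a regular expression, the helper names are invented, and the sample blame text (commit hash, author, message) is made up so the parsing can be shown without a repository.

```python
import subprocess
from datetime import datetime, timezone


def parse_blame_porcelain(text):
    """Collect the header fields and code line from porcelain blame output for one line."""
    lines = text.splitlines()
    info = {"commit": lines[0].split()[0]}
    for line in lines[1:]:
        if line.startswith("\t"):  # the source line itself is tab-prefixed
            info["code"] = line[1:]
            break
        key, _, value = line.partition(" ")
        info[key] = value
    return info


def annotate(info, width=80):
    """Build the text to show in place of the plain source line."""
    when = datetime.fromtimestamp(int(info["author-time"]), tz=timezone.utc)
    return f'{info["code"][:width]}  # {info["author"]}, {when:%Y-%m-%d}: {info["summary"]}'


def blame_line(path, lineno):
    """Shell out to git for one line; must be run inside a git work tree."""
    result = subprocess.run(
        ["git", "blame", "--porcelain", "-L", f"{lineno},{lineno}", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_blame_porcelain(result.stdout)


# Made-up porcelain output, for illustration only.
SAMPLE = (
    "1234567890abcdef1234567890abcdef12345678 2 2 1\n"
    "author Jane Doe\n"
    "author-mail <jane@example.com>\n"
    "author-time 1623715200\n"
    "author-tz +0000\n"
    "summary Pick an item to pop\n"
    "filename demo_pop_some.py\n"
    "\t    return items.pop()\n"
)

annotated = annotate(parse_blame_porcelain(SAMPLE))
print(annotated)
```

In a real checkout, the result of `annotate(blame_line(path, lineno))` is the kind of string that would be written back into the cached source lines.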

Starting point is 00:09:06 And what's interesting, like this is already in itself useful, I think. But what's more interesting is other tools use this as well. So, for example, if you use PuDB, which is a sort of visual debugger, kind of. It's like a command line one. I know, visual in the sense of like Emacs is visual, not like PyCharm is visual. But it will actually pull up that data. So you can see they jumped into the PuDB debugger and it's actually showing all this git blame attribution

Starting point is 00:09:32 as well that they've added. So yeah, pretty interesting. What do you all think? Yeah, I think that looks really cool. I mean, I always do git blame whenever I run into something that's weird with the hope that someone else will be able to explain it to me.

Starting point is 00:09:43 Exactly, who knows about this or who do I talk to about breaking this? Right. Yeah. You could even put like PR numbers and stuff in here. Right. And that'd be pretty cool. Yeah. Yeah. Yeah. That'd be super cool. Yeah. One of the things, I don't really like the name git blame, but it's there. But I agree with Julia that the main thing I use it for isn't to try to figure out who broke it, but who to ask about this chunk of the code. I agree. Because usually when you see something that's really confusing and weird, you're like, I know they didn't just pick the hard way of doing this because they didn't want to do the easy way. There's something that I don't fully understand, some edge case that's crazy here.

Starting point is 00:10:23 I'm going to go talk to that person. So, yeah. Also, how long ago it was edited. So if there was something that was edited yesterday, that's probably the problem. Yeah,

Starting point is 00:10:31 exactly. Like in this little screenshot here, some of these are edited like 1,427 days ago. That's probably not the problem. Maybe, but I feel like I have the opposite assumption. Like if something is from six years ago and it's weird, I'm like, well, probably things were different back then.

Starting point is 00:10:47 Okay. Yeah. Yeah. Yeah. It's no longer applicable to the new data, new situation. Yeah. Oh, that'd be an interesting thing also is to have like a tool that would tell you if something's like over a thousand days old or something like that, you probably should go refactor it to make sure somebody understands that code. Yeah. Yeah. Yeah, for sure. All right. Jumping back to the first item really quick.

Starting point is 00:11:10 In the live stream, Alexander out there. Hey, Alexander. Says, I wonder if graph databases with Gremlin queries could be more suitable for data science. You know, SQL joins are way harder. Yeah, graph databases are pretty interesting. If you're trying to understand the relationships, that may well be better. I don't know. Julia, do you got any thoughts on this? I don't know anything about graph databases. So out of my league. I didn't have a desire to understand graph databases until I found out that there

Starting point is 00:11:35 were Gremlin queries. Now I think I will. Well, Brian, they don't start out as Gremlin queries. They're mogwai inserts. And then if you insert them after midnight, then they become a Gremlin query. I mean, come on, we all know how it goes. You definitely don't want to get them wet. Oh, that's an old show. I'm not sure if everyone's going to get that reference, but yeah, I love that show. Okay, anyway, let's move on to the next one. The next one is you, Julia. Yeah, so I wanted to highlight fsspec, so file system spec, for people who can't hear letters very well. So this is the basis for s3fs, gcsfs, I'm not getting the letters right, but there's one for GCP, there's one for S3, and it's a file system storage interface or, like, the basis for a file system.

Starting point is 00:12:28 And so you can do things like take a path and open it as a file object in Python and read it with all the normal, like, read/write operations. Oh, interesting. But from anywhere. So, like, there's all these different ones for S3, for GCS, and even for, like, HTTP and just basically anything you can imagine. Anywhere you can imagine a file being, either there's already been one of these written... It's kind of like, it's an interface and then you write different packages on top of it that are like drivers or something. They have some name for it.

Starting point is 00:13:12 And it allows you to treat the file system as like this interchangeable building block. So you don't end up writing like Boto3 code or something that's very specific to a specific cloud storage. You write this more general code. And it's really useful for a lot of free data sets that are hosted on different clouds, where they'll sometimes be on one cloud and sometimes be on another, but basically it's the same data. Or if you're at a company and you want to switch clouds, it just makes that whole thing so much easier. It looks really, really useful, especially for avoiding cloud lock-in. Yeah, yeah. And you can always write your own one. If something else pops up, you can write your own implementation of that. All right, so there's an example here talking about using a file system

Starting point is 00:14:01 in the docs that says something to the effect of, well, you want to open up a CSV and feed it off to pandas read_csv. So normally you would say open CSV file, and then you just say pandas read_csv and give it the file stream. But what if that's on the internet? What if that's on S3 with authentication? What if that's, you know, somewhere else, right? And so with this one, you can just say fsspec.open, here's a URL. And now that's a stream, right? Or that could be, here's an S3 location, S3 bucket, go get that, right? Yeah, yeah. So instead of passing the path directly into the read function, you pass in the file object. And it's really powerful. Like it seems like a thing that we shouldn't need, but the file locations can get so crazy so quickly. And this just really helps simplify and make it so you don't have to think about

Starting point is 00:14:55 this stuff, which I think is what most people want. It's what I want. Yeah, for sure. So like there's a local file system option, but then you could also have an FTP file system or you could have something else, right? All sorts of different options. Yeah. Yeah. All sorts of stuff.
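
A small sketch of that interchangeability, assuming the fsspec and pandas packages are installed. The helper name `load_csv`, the paths, and the CSV contents are made up; fsspec's built-in in-memory filesystem and a local temporary file stand in for S3 or GCS, which would need their own backend packages and credentials.

```python
import os
import tempfile

import fsspec
import pandas as pd

CSV_TEXT = "name,value\na,1\nb,2\n"  # made-up data, for illustration only


def load_csv(url):
    """Read a CSV from any location fsspec can open and hand it to pandas."""
    with fsspec.open(url, mode="r") as f:
        return pd.read_csv(f)


# Backend 1: fsspec's in-memory filesystem, selected by the memory:// prefix.
memory_url = "memory://demo_data.csv"
with fsspec.open(memory_url, mode="w") as f:
    f.write(CSV_TEXT)

# Backend 2: the local filesystem (a plain path with no protocol prefix).
local_path = os.path.join(tempfile.mkdtemp(), "demo_data.csv")
with fsspec.open(local_path, mode="w") as f:
    f.write(CSV_TEXT)

# The reading code is identical for both; only the location string differs.
from_memory = load_csv(memory_url)
from_local = load_csv(local_path)
print(from_memory)
print(from_memory.equals(from_local))
```

The same split is handy for tests: code written against `load_csv` can be pointed at an in-memory or local file during development and at a cloud location in production.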

Starting point is 00:15:10 Yeah. Okay. That's cool. Brian, what do you think? Does this have any applicability for you? Oh, yeah, definitely. And that's a great abstraction layer to put in place to just have reading as if it was a file and have it moved. It also helps you develop tools locally and then be able to deploy them into a larger space. So it's cool.

Starting point is 00:15:31 Yeah, for sure. One of the things that always makes me a little hesitant when I hear people say things like we're cloud native, like my app is cloud native. That's always code word for me. Like I will never be able to run my app unless I'm connected to the internet. You know, it's like, it depends on all these services together. And there's no way I can recreate that locally. But something like this could allow you to say, well, we're going to have a local file system version.

Starting point is 00:15:55 But then when we go to production, we'll switch to, I don't know, S3 or, you know, pick something. I've always wanted to make either a t-shirt or a sticker or both that says, not a cloud native, just visiting. Nice. I also think, Brian, there might be testing opportunities here. Yeah, definitely. Give it a test file system. That'd be cool. Yeah, and like Julia said, swapping things out to just have your logic not have to care where it's coming from. But I guess you'd have to make sure all of the interfaces,

Starting point is 00:16:26 the different storage systems really are equal. But I guess you'd have to try that out yourself. Yeah, there's kind of a bucket, right? There's kind of like a dict that you can pass, which is like storage options. So I think that might get a little wonky, depending on what the different backends need. But the general principles are the same. And it also, I should have said this originally, but it also

Starting point is 00:16:49 allows, like, fsspec itself can contain logic to do things that are general to all the different libraries, like caching and things like that. Interesting. Like you could put a caching layer on top of arbitrary things like S3, Google storage and Azure buckets or blob storage. Yeah. Yeah. Maybe even save money on bandwidth there if you can do some caching. Yeah. If you can do it right.

Starting point is 00:17:13 Yeah. Super, super neat. Brian, you're going to tell us about how to slim down our Docker containers. But before you do, I want to tell people about our sponsor for this episode brought to you by Sentry. So how would you like to remove a little stress from your life in addition to just abstracting your file system, maybe tracking down some errors. So do you worry that your users may be having difficulties or encountering errors with your app right now? And would you even know it until they send that support email? How much better would it be if you got the error or performance details sent right away and with

Starting point is 00:17:43 all the call stack, maybe with git blame in there, the local variables, the active user who was logged in while this happened, all that kind of stuff. So with Sentry, it's not only possible, it's actually really simple. I've used Sentry on our websites before. So it's on Python Bytes, TalkPython Training, all those different sites. And I've actually had someone encounter an error trying to buy a course over on TalkPython Training. I got the Sentry notification. I said, oh, geez, I can't believe this problem crept in here. And I fixed it really quick and started to roll out the fix and actually got an email. They said, hey, we're having this problem

Starting point is 00:18:17 buying a course. I know, I've almost got it fixed. Just give me a moment and try again. And they were just like, what? That doesn't make sense. So they were very surprised. And so surprise and delight your users. Create your Sentry account at pythonbytes.fm slash Sentry. And when you sign up, there's a little promo code field. Make sure that you put Python Bytes, all one word, all caps with a Y in there. And you'll get two free months plus a bunch of extra features and so on. So also, it really lets them know that you came from us rather than just somewhere else.

Starting point is 00:18:45 And that helps support the show a lot. So pythonbytes.fm slash Sentry and promo code Python Bytes. Awesome. Thanks for supporting the show, Sentry. And Brian, let's talk Docker. Yeah, let's talk Docker. I mean, I'm starting to use Docker

Starting point is 00:18:58 more and more. And I like the experience, but I was interested when this article came up. So it was in June. I saw this article called The Need for Slimmer Containers. And this is from somebody, Ivan. Ivan, I'm not going to try his last name.

Starting point is 00:19:17 Ivan something. But anyway, it's an interesting discussion. And the idea around the original post was that there's now a docker scan that you can use. So you can use docker scan to scan for vulnerabilities in your Docker containers. And I even thought, well, I'll look at some of the standard Python containers that are available. Right. Theoretically, some of the things that are nice is I can just go in my Docker container and say FROM python:3.9, and I don't have to think about how do I install Python, how do I keep it up to date, you know, make sure that pip is there and that I'll be able to, you know, pip install stuff that needs to do build things, and all that

Starting point is 00:19:58 stuff will be there, right? So it seems like, of course, this is what you want. Yeah, well, and also that's kind of one of the neat things about Docker, is I can just say I have these standard parts, now I just want to put my custom stuff on top of it. And it's great. So, well, what did he find? So docker scan apparently uses a third-party tool called Snyk, Snyk Container. And we've covered Snyk before, not the container version, but we covered Snyk in episode 227. So it's looking for vulnerabilities, and that's a good thing, but he found them in everything. He found them in all of the standard Python ones except for Alpine, I guess. And so he didn't really know what to make of it,

Starting point is 00:20:47 really. He was just sort of reporting his results that maybe Alpine is the only one with few vulnerabilities. But then this went out on Hacker News and there was a big discussion around it. So he updated the article, which I appreciate, with some of the feedback that he got. And so some of the feedback was that these vulnerability checkers sometimes give you false positives. And I know what that means, but I don't have enough experience to know if these really are false positives or if they're actual vulnerabilities or not. The other thing was that some people suggested that these standard ones really aren't updated very much. So I don't really know much about that either. And if they're not, that's kind of a bummer because I

Starting point is 00:21:38 think people are relying on them. So I'm actually just kind of left with a little bit of confusion as to what to do. I also want to mention Alpine. In the original article, he says Alpine is pretty good for vulnerabilities, but then his follow-up says, well, there's a lot of applications that can't run on Alpine because of some issue or another. So anyway, I'm not sure what to make of it. So I was hoping Michael might give us some insight. I did some thinking about this this morning.

Starting point is 00:22:08 And in fact, I recently spoke a lot about this over on TalkPython. So I had Itamar on the show and we talked about best practices for Docker packaging. And we talked a lot about both security and package size. So I can try to relay a couple of things from that. So we've got our official image over here, our Python official image. There's actually a bunch of options. As you can see, there's a few, like 3.10 beta 2 buster

Starting point is 00:22:39 or the 3.10 RC buster. That sounds bad, but I think it's actually good. No, I'm just kidding. I know what it is. So these are by default based on Debian, and Buster is the latest version of Debian. And so you can do a Buster, which is like full Debian with 3.10, or you can do a 3.10 slim-buster, which is like a slimmed down version of Debian Buster that supports Python 3.10.

Starting point is 00:23:02 Okay, so there's a lot going on here in terms of the options. So the article talks about how Alpine had the fewest security vulnerabilities. And actually, so the Python latest, if you run the Snyk package scanner thingy on it, it says there's 364 vulnerabilities if you just do Python latest 3.9 and 353 after

Starting point is 00:23:29 you run apt update apt upgrade. So if you try to get the container to update itself, there's still 353 in that one. I don't use that. I use Ubuntu. So I use the Ubuntu latest and the bare version of that one had 31 vulnerabilities. But then if I either install Python through apt or build it through source and put in the necessary foundational bits like build essentials and stuff to build Python, it goes up to 35 total problems where 28 of them are low. So seven are medium, nothing major. One thing I thought was weird was I actually ran another step where I said, okay, let's uninstall those intermediate tools like GCC and Wget and stuff

Starting point is 00:24:10 like that, that I needed to get stuff on the machine, but I'm not going to use again. And I took them away. And almost all those warnings were about those tools that I had apt uninstalled. So I don't know why Snyk is still showing them. Because if I go into the container and I type wget, it says, nope, this thing is not installed, sorry. But it still says the warning is that wget has a vulnerability in it, for example, right? So there's like this over-reporting for sure. But I mean, the difference between 28 and 350 is not trivial, right? Right. So, like, running an apt install python3 type of thing, it's probably worth it, for example.

Starting point is 00:24:46 When I switched from Python 3.9 to Python 3.9 Slim Buster, it went from 350 to 69. So that's a lot better, right? Yeah. It's still not as good as Ubuntu, but it's a lot better. It's still twice as many. It sounds better, but it could be like 359 low problems and then 69 critical ones um it totally could it totally could yeah also if the reporting

Starting point is 00:25:13 like, if we can't trust Snyk necessarily, then, you know, if you can't trust your reporting system, then maybe none of this means anything, right? Yeah. Yeah. I think one of the things the article originally started out to address was if you have fewer subsystems, there's no chance the missing subsystem could get hacked because it's not there. Right? So if there's a vulnerability in SSH, but you literally don't install SSH, who cares? Whereas if you just take the full distribution, you may potentially get

Starting point is 00:25:47 affected by something you dragged along. And then it went down this rat hole of like, well, let me scan it and so on. So I want to add one more thing. Alpine did result in the best outcome from the scanner, but there's a lot of issues with Alpine and Python. So for example, there's this PEP here, 656, that right now, if I try to pip install something on Alpine, so especially in the data science world where things are large and the compiling takes a lot of steps and so on, the wheels that are built for Linux are built for, what is it, glibc, I mean, hold on. I'll look over here. I wrote it down so I know. No, I didn't write it down. Sorry.
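
For the wheel issue being described, PEP 656 defines a musllinux platform tag alongside the existing manylinux tags, which target glibc. The sketch below only looks at that tag in a wheel filename; the helper names and the example filenames are hypothetical, and it ignores the architecture and libc version checks that pip performs, so treat it as an illustration of the tag format rather than a compatibility checker.

```python
def wheel_platform_tags(filename):
    """Return the platform tags from a wheel filename (its last dash-separated field)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    stem = filename[: -len(".whl")]
    # A compressed tag set joins several platform tags with dots.
    return stem.split("-")[-1].split(".")


def libc_family(tag):
    """Map one platform tag to the C library it targets."""
    if tag == "any":
        return "any"  # pure-Python wheel, no compiled code
    if tag.startswith("musllinux_"):
        return "musl"  # the PEP 656 tag, e.g. Alpine
    if tag.startswith("manylinux"):
        return "glibc"  # e.g. Debian, Ubuntu
    return "other"


def usable_on(filename, libc):
    """Rough check: does the wheel claim support for the given libc family?"""
    families = {libc_family(tag) for tag in wheel_platform_tags(filename)}
    return "any" in families or libc in families


# Hypothetical filenames, for illustration only.
glibc_wheel = "example_pkg-1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
musl_wheel = "example_pkg-1.0-cp39-cp39-musllinux_1_1_x86_64.whl"
pure_wheel = "example_pkg-1.0-py3-none-any.whl"

for name in (glibc_wheel, musl_wheel, pure_wheel):
    print(name, "usable on musl:", usable_on(name, "musl"))
```

When a project publishes only manylinux wheels, an installer on a musl-based system has nothing it can use directly and falls back to building from the source distribution.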

Starting point is 00:26:27 I think it's glibc, which is the C runtime on Ubuntu and Debian. But there's a different one, musl, on Alpine. And the wheels are not built for musl. They're built for glibc. And so you can't pip install that. You've got to download everything and then compile it. And, like, compiling matplotlib and Jupyter from scratch can take a really long time versus just downloading the wheel, and it takes up a lot of space, and there's a bunch of issues and

Starting point is 00:26:56 things around that that like make it slightly not Python friendly. That's why there's this PEP 656 to allow wheels to be tagged as supporting musl, not glibc. Is that more than you wanted, Brian, or are you good? Okay, so the takeaway that I'm getting is probably not to panic on some of these, but maybe at least pay attention to them. And it is good, like you said, to remove tools out of your Docker images that you're not using. If you're not using wget in your application, take it off. Things like that. Yeah, exactly. I think Julia's point was great, right? It might be a false positive, but at the same time, if you're not going to use it again, because with Docker, a lot of times you pip install all your stuff and then it's kind of ready to run, but you're not going to go and pip install something again, you're going to do a new docker build from scratch, right?

Starting point is 00:27:48 Like one of the final lines could be remove all those intermediate things that could have problems and make it larger and whatnot. Yeah, so I've only thought about this from, like, image size, right? Like that you want smaller images just because it takes forever to get them around. But it's interesting to think about from the vulnerability perspective. And I've always seen it done as, you do whatever installation you need and then you do all these like cleaning steps. But what you said, Michael, about like not ever putting certain things on your image is interesting. I haven't heard of that before. Yeah. Thanks. I also had Peter McKee, who works at Docker, on TalkPython a little while,

Starting point is 00:28:30 like six months ago or something. And he talks about having these multi-stage builds, something to the effect of, it doesn't make as much sense with Python, I'll try to put it together, but like, imagine you're building a Go library. You could put the Go runtime and build tools on a container, build your thing, but the thing you get from Go is an actual binary that's all self-contained. You could throw that container away and just copy the output of that into your actual container and never even put all those tools on the actual system that goes to production. With Python, that might look something like maybe using Pex to package up all the stuff inside of a virtual environment. And as long as Python, the runtime, is there,

Starting point is 00:29:05 then you can, like, Pex run on your other machine, but you could potentially not even ever install those, which might be good. Yeah, that makes sense. Yeah. There's a lot there that is sort of beyond my comfort level, but that's what I thought as I looked at this, Brian. Well, thanks for taking a look. Yeah, you bet. All right.

Starting point is 00:29:23 We like to talk about GUIs on the show every now and then. And we want to talk about pandas and data frames and data science and all that. So let's put those together. There's this project over here called PandasGUI, and the documentation is sparse, let's say, but it's pretty easy. There's an example or two. So I could come down here and I could do my pandas stuff and create a data frame. And then I could just import show from pandasgui. And within my notebook, it will pop open a separate window that then allows me to cruise around and check it out. So you can print out the data frame in a notebook and you get kind of a static Excel-grid-looking thing. And that's nice. But with this, you get an interactive one that lets you sort and select. You can actually copy and paste

Starting point is 00:30:11 chunks out of there as if it was Excel and then paste them in other places. It also has a plotting library with, like, pictures. So I'm going to go click on the bar graph picture, and then there's a list of all the columns and the things that the bar graph needs. And you can drag and drop: this column is the x-axis, and this column is the y-axis, and I want to color it by some other aspect of the data, or group into multiple charts, or multiple lines or plots on a chart, all sorts of cool stuff like that. There's a statistics section. You can import and export CSV files with drag and drop. And there's also search that you can do. So it's a pretty neat, quick way to explore pandas.
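
As a rough sketch of the workflow described here: build a data frame, then hand it to `show` from the `pandasgui` package. The column names and values below are made up for illustration, and the GUI call is wrapped in a hypothetical helper called `explore` so that nothing tries to open a window unless you call it yourself.

```python
import pandas as pd

# Made-up sample data, purely for illustration.
df = pd.DataFrame(
    {
        "model": ["a", "b", "c", "d"],
        "speed": [120, 95, 150, 110],
        "color": ["red", "blue", "red", "green"],
    }
)


def explore(frame):
    """Open the frame in PandasGUI (assumes `pip install pandasgui`)."""
    # Imported lazily so the rest of the script runs without a desktop session.
    from pandasgui import show

    return show(frame)


# In a notebook or script with a display available:
# gui = explore(df)

# The static, non-interactive equivalent of sorting by a column:
print(df.sort_values("speed", ascending=False).head(2))
```

Sorting, filtering, and plotting then happen in the separate window rather than in code, which is the trade-off discussed in the rest of this segment.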

Starting point is 00:30:55 Yeah, it's a neat idea. When you first encounter a data frame, you really want to just be able to look at it without any assumptions. And there's a lot of stuff that kind of goes towards that, with the .plot API in pandas making it really accessible to make plots really quickly. But this is kind of the step beyond that, right? Of just, yeah, visualizing it immediately. Yeah, one thing you get when you view the data frame, like I said, is it looks kind of just like printing df, or just typing df in the notebook, but then on the right,

Starting point is 00:31:30 you can say, Oh, I want to see the filters. And you could type in these filter expressions, these query expressions, and then turn them all like pile them on. You can have little checkboxes to like optionally turn them off, but not delete them. And then of course you can sort within there like that. And the graphing, I think the support for the graphing part is really, really helpful. So the fact that you can just go and click and say, oh, I want a box plot. And then the box plot needs these things. You can just drag and drop from the column from your data frame definition over and it just live updates. Yeah. I think that really lets people visualize the data in the way that they want to sometimes rather than the way they already know how in Matplotlib, which I think is what people

Starting point is 00:32:12 end up doing, at least for exploratory stuff. Yeah, exactly. You could real quickly switch between a bar, a box, a scatterplot, back and forth without having to actually be familiar with how those work. Can you tell if there's a way to export the filters or is there any mechanism for that? There is, I don't think so. At least in the YouTube explainer video, there were some comments like, you know what would be awesome? Export this as code from here so that I can just turn it back into Python. I didn't see anything like that, but.

Starting point is 00:32:43 Yeah. Sometimes GUIs are a little weird for me because of that. You end up in this GUI world and you can't reproduce anything. I clicked on a whole bunch of stuff and it looked great, but don't touch it. I can't do it again. Okay, but to be fair, it is a fairly quick way to look at the data and know what you, maybe you can't produce that exact plot again, but you know what the data looks like. And you can use a different plotting mechanism to do that.

Starting point is 00:33:13 Yeah, and the visual is pretty clear. Like, okay, well, X is assigned to speed and we know it's a histogram. And so you could pretty quickly, you know, with some Googling and Stack Overflow, go, all right, how do I Matplotlib a histogram, and get that going. You know, that's a huge time saver. Yeah. But some sort of export of, like, okay, give me the code to make this plot in my own code.
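
Until an export-as-code feature exists, a plot like the one described (a histogram with X assigned to a speed column) is short to rebuild by hand. This is a minimal sketch, assuming a column named `speed` with made-up values; the filter string is only an example of a pandas query expression, not something exported from the GUI.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data standing in for whatever was loaded into the GUI.
df = pd.DataFrame({"speed": [88, 95, 101, 110, 120, 133, 150, 152]})

# A filter expression of the kind you would type into a filter box.
fast = df.query("speed > 100")

# The histogram itself: one column on the x-axis, counts on the y-axis.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(fast["speed"], bins=5)
ax.set_xlabel("speed")
ax.set_ylabel("count")

# Uncomment to write the figure to disk:
# fig.savefig("speed_hist.png")
```

Keeping the filter and the plot call in a script is what makes the exploration reproducible later.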

Starting point is 00:33:34 That would be great. Yeah, absolutely. Absolutely. All right. On to the next, but before we get there, I do want to call out just a shout out by Piling that fsspec is sweet. Good mention. Yeah, I like it as well. Cool. All right, xarray. Okay, so xarray is my favorite library. It's a pandas-like API, but it's for n-dimensional data. A lot of times people talk about it in, like, geospatial data, where there's

Starting point is 00:34:10 lat, long, time, and others, but also for image data where there's maybe a bunch of different bands, like from satellite imagery, or other disciplines where you just have labeled data that's not tabular. So the axes mean something, but there's not just one or two of them. Then xarray is great for that because it lets you do things like select a certain subset of time

Starting point is 00:34:37 or a certain subset of whatever your dimension is. And you can also aggregate across different dimensions, and you can use the labels directly. If you don't have a tool like this, and I see people doing this a lot with machine learning workflows, they'll have a separate list of all their labels, and then they'll have their data, and they'll do some manipulation and try to reattach the labels at the end, and it just turns into a mess. xarray just takes care of all that for you. It's pretty great. And I think that it has applications that have not been fully realized yet. It's starting to take off in other spaces, but it really comes from this geospatial world. I think it could be useful for all sorts of people. Right.
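
To make the "labels stay attached" point concrete, here is a small sketch of a three-dimensional xarray `DataArray`. The dimension names (`time`, `lat`, `lon`), the coordinate values, and the variable name `temperature` are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Four daily time steps on a tiny 2 x 3 lat/lon grid (made-up values).
times = pd.date_range("2021-06-01", periods=4, freq="D")
temps = xr.DataArray(
    np.arange(24, dtype=float).reshape(4, 2, 3),
    dims=("time", "lat", "lon"),
    coords={"time": times, "lat": [10.0, 20.0], "lon": [100.0, 110.0, 120.0]},
    name="temperature",
)

# Select a subset of time by label rather than by integer position.
first_two_days = temps.sel(time=slice("2021-06-01", "2021-06-02"))

# Aggregate across a named dimension; the lat/lon labels survive.
mean_over_time = temps.mean(dim="time")

# Pull out a time series at one labeled grid point.
point_series = temps.sel(lat=20.0, lon=110.0)

print(first_two_days.shape)
print(mean_over_time.dims)
```

Because the coordinates travel with the data, there is no separate list of labels to reattach after the manipulation.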

Starting point is 00:35:26 Because in geospatial, sometimes you have three dimensions, not just two. Yeah, you almost always have three, right? Sorry, Brian. No, the documentation looks great too. The documentation has like getting started guides and tutorials and videos and galleries and stuff. So definitely check out the documentation. Yeah, I think it got a major... It seems like I looked at it for this too,

Starting point is 00:35:48 and it seems like it got a major facelift. So it looks really nice. It also has plotting. It supports the .plot API, or some different version of it that's like the pandas version, but you can plot in different, you know, three dimensions, or aggregate and then plot. And that's a really nice way to get the visuals quickly. And then the last thing I wanted to say about it is that it's normally backed by NumPy arrays, but it can also be backed by Dask arrays or sparse arrays or all sorts of different arrays natively. So

Starting point is 00:36:23 it's really cool. It's another one of these building-block things, where you can have xarray as your labeling and your indexing and all the nice stuff, and then down inside it can be NumPy or CuPy or Dask. How interesting. So it can do that juggling and piecing back together that other people are mainly doing

Starting point is 00:36:44 and you just have this simple API and if it has to do that, it'll figure it out. Yeah. Yeah. That's pretty cool. Nice. And you talked about CuPy and Dask. Like those are some pretty interesting back ends for this.

Starting point is 00:36:55 Yeah. Yeah. The Dask one is... I said CuPy, and now I'm wondering if maybe it's just, like, Dask and then CuPy. So don't quote me on that. But yeah, the Dask one is really integrated with the xarray code. They do some special things to make it so that it works with parallelizing and things, but from the user experience it's the same. Yeah, fantastic. And then I also noticed it requires

Starting point is 00:37:22 Python 3.7. Really nice to see tools sort of keeping up with the latest, not really old stuff. Well, hopefully it's 3.7 and above. Well, yeah. Yeah. Greater than or equal to. Well, I mean, I ran into a library, it was an internal thing, that was only 3.7. I assumed "or above" and I tried it on 3.9, and it fell over. I'm like, what's going on? It was only 3.7. It's weird. Okay, that is weird.
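
The difference between "3.7 and above" and "only 3.7" comes down to which comparison a version check uses. A minimal sketch using only the standard library; the helper names `is_supported` and `is_exactly_37` are hypothetical and not taken from any library mentioned here.

```python
import sys

MINIMUM = (3, 7)


def is_supported(version_info=sys.version_info):
    """The usual intent: accept 3.7 and anything newer."""
    return tuple(version_info[:2]) >= MINIMUM


def is_exactly_37(version_info=sys.version_info):
    """The overly strict variant: rejects 3.8, 3.9, and later."""
    return tuple(version_info[:2]) == MINIMUM


# Compare how each check treats a 3.9 interpreter.
print(is_supported((3, 9, 0)))
print(is_exactly_37((3, 9, 0)))
```

In packaging metadata the same intent is normally written as a specifier such as `>=3.7`, which is presumably what a "requires Python 3.7" note is meant to convey.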

Starting point is 00:37:50 It'd be interesting to think about what special features of 3.7 they're depending on that broke in 3.8. Yeah, that's what I was thinking. How do you do that without just checking for equal-equal 3.7 on version? Yeah. So anyway. Yeah. All right. Well, that's it for our six main topics.

Starting point is 00:38:06 Brian, you got anything else you want to throw out there quickly? Yeah, actually. I didn't have this up, but on Twitter somebody reacted to me with an emoji and I didn't know what they meant. So I looked up, let me pop this up, this Emojipedia, and it was helpful. And you can just copy and paste the emoji that somebody uses in there, and it tells you what it means. And, you know, kind of not just what it's supposed to mean,

Starting point is 00:38:42 but also what people are using it for. I don't know, for somebody that's sort of an old guy that is out of touch sometimes, this was helpful. So anyway. Yeah, I mean, sometimes it's obvious, like a heart, we know what a heart means, right? But, you know, like hands together, it's not necessarily that that's like a thank you

Starting point is 00:38:59 sort of bow type of thing. I mean, there's certain ones where you're like, ah, what does that mean? It was like a hands together with, like, arrows coming out of the top, and I'm like, I don't know what this is. But apparently it's just raising hands, like you're saying hooray for somebody. Oh, okay, that's nice. So, okay, it's good. I use Emojipedia all the time, but I think I use it in the opposite way. Like, I use it to get an emoji to put somewhere, because I don't have, like, an emoji keyboard or whatever. Oh yeah, that would be good too.

Starting point is 00:39:26 The other thing I wanted to bring up is I hopefully have some cool news to share tomorrow about the pytest book, and the news will show up on a revamped pytest book site. So if you go to pytestbook.com you get redirected to this

Starting point is 00:39:41 PythonTest.com page where I'll talk about the second edition. So hopefully there'll be news about the second edition coming out tomorrow. And I- Is this your new static site, Magic? Yeah, yeah, static site.

Starting point is 00:39:56 And it goes dark and light. But I totally stole from Prajan. So Prajan has the same, he's got a really nice site. It looked great. And I'm like, that'll work. I'll just do what he's doing. So that's what I did. Yeah. Very cool. I think we have exactly the same stack for our Saturn Cloud site now. Oh, how neat.

Starting point is 00:40:18 That's cool. Awesome. How about you, Julia? Anything else you want to give a shout out to? Well, I've been really into entry points recently. Just, like, the concept of them is very cool. As in, like, Python packages, you can give them almost like CLI command type of entry points? Yeah, but the thing that I think is really cool, and this is the example that first made me realize about entry points, is pandas has this .plot. I think I've mentioned this three times now.

Starting point is 00:40:46 But you can swap out the backends. So you don't have to have matplotlib. You can use other backends. And all the logic for that is in the other visualization libraries themselves, not in pandas. So it's just like you can swap out other things. It's not just for CLIs, I guess. Okay.

Starting point is 00:41:05 Yeah. How neat. All right. Yeah. I learned about entry points a year, year and a half ago. And ever since I'm like, oh yeah, this is awesome. I can now create these little commands that'll be part of just my shell. I love it. Yeah.
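
Both uses of entry points mentioned here, shell commands and swappable plotting backends, can be inspected with the standard library. A minimal sketch: `console_scripts` is the group behind installed commands, and `pandas_plotting_backends` is the group pandas consults for third-party plotting backends. The helper name `find_entry_points` is hypothetical, and what gets printed depends entirely on which packages are installed in your environment.

```python
from importlib.metadata import entry_points


def find_entry_points(group):
    """Return the installed entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later expose a selectable interface.
        return list(eps.select(group=group))
    # Python 3.8 and 3.9 return a plain dict keyed by group name.
    return list(eps.get(group, []))


# Commands that installed packages have added to the shell.
for ep in find_entry_points("console_scripts")[:5]:
    print(f"{ep.name} -> {ep.value}")

# Plotting backends that other libraries have registered for pandas.
backends = [ep.name for ep in find_entry_points("pandas_plotting_backends")]
print("plotting backends:", backends)
```

The design point is that the registering package owns the logic: the host library only looks up the group name and loads whatever it finds.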

Starting point is 00:41:15 The other thing I wanted to say was the GitHub CLI is really cool. I think that's standalone, but I've been using it a lot. I'm sure people know the Git CLI, but what's the story of the GitHub CLI? Oh, well, the GitHub CLI is, makes it, so if you have ever tried to check out a branch on someone else's fork, like, if you want to, like,

Starting point is 00:41:36 evaluate a PR that someone has put on a fork, that is the situation where the GitHub CLI is really great, because you can just do, like, gh checkout pr or gh pr checkout, whatever the number is. And then you're just on their branch. And if you can push,

Starting point is 00:41:53 if you have push access to their branch, if you're a maintainer and they've allowed it, you can just push directly. I mean, I was always looking up that sequence of commands before. I know people have, like, Git aliases and stuff, but yeah, I'd really recommend checking it out if you do a lot of GitHub stuff.

Starting point is 00:42:09 Okay. Awesome. Yeah. That's great advice. Yeah. I often want to like check out some, so pull requests. I want to be able to like play with it and run their code.

Starting point is 00:42:16 And yeah. And so it's the best. Yeah. Awesome. All right. I got a couple of things to add, by the way, first of all,

Starting point is 00:42:23 just that first practical SQL analysis that you talked about. It also has a similar theme to what you were talking about, Brian. One of the things I thought was cool, though: as you scroll through it, it has a progress bar for reading at the top, and that just made me so happy. I don't know why. That was really neat. All right. But I have a bunch of "hear all about it" sort of things. So really quick: Python 3.10 beta 2 is out, if people want to check that out, and you can go download that. It also highlights

Starting point is 00:42:53 all the major features, like the pipe operator for writing unions in type specifications and a bunch of other stuff that people might care about. Structural pattern matching is probably a big one. Yeah, go to the completely different. Is that on here? And now for something completely different. I love that part. So write about the files. Yeah.

Starting point is 00:43:14 Oh, interesting. The Ehrenfest paradox concerns the rotation of a rigid disk in the theory of relativity. Its original 1909 formulation presented by... Yeah, okay. That is unexpected, but very cool. And completely different and irrelevant. Yeah. Awesome. Okay, so takeaway, 3.10 Beta 2

Starting point is 00:43:33 is out. People can check that out. There's also some security patches for Django, so be sure to check that out. One thing that surprised me is the Microsoft install of Python from the Windows Store already has a 3.10 beta store install. So, okay, that's pretty cool that they're keeping that up to date. And it's rated E for everyone.

Starting point is 00:43:54 Yeah, even kids can pip install. Awesome. So Frederick Bankston sent a message in response to our last show where we talked about method overloading by type. Like, if it takes an int or a string, it calls different functions. He also pointed us towards this other library, multimethod, that is similar. So people can check that out. That's cool. Neat. Speaking of the GitHub stuff, I've been starting to use the PyCharm 2021.2 early access version, early access program version.

Starting point is 00:44:24 And it's been working fine, so if people want to try out the new features, there's a bunch of cool stuff. You have support for Python 3.10 and new stuff for pytest. I don't remember if this came in here, but one thing that I did learn about recently that's in there that's super cool is, in PyCharm, if you log PyCharm into your GitHub account, there's a pull request section and you can just click it and it'll do those same steps that Julia was talking about. Like right there in PyCharm. Just go, I want to try that PR before I accept it. And just click that and go.

Starting point is 00:44:56 You can even have comments. You see the conversation inside there and everything. It's cool. Never go to GitHub again. Exactly. And just forget how to use it, basically. All right. That's it. That's all the items I got. So, yeah, I've got other stuff that's just hanging around from before. Cool. All right, well, you want to close it out with a joke? Yeah, a couple of jokes, always.

Starting point is 00:45:16 All right, so over at upjoke.com slash programmer to ask jokes you'll find many bad jokes, some even that are not very appropriate or whatever, but there's a few that are funny. So I pulled out three here. I'll do the first one. Brian, you can do the second. Julia, you can do the third, I guess, if you're up for it. Okay. So this one we should have saved for six months from now, but I asked a programmer what her New Year's resolution would be. She answered 1920 by 1080. That's so bad. No, that's awesome. It's really bad.

Starting point is 00:45:45 All right, well, you got to do the next one. How does a programmer confuse a mathematician? I don't know how. Just saying that X equals X plus one. All right, Julia.

Starting point is 00:46:00 Okay. Why do Python programmers have low self-esteem? They're constantly comparing their self to other. Also bad. Probably the worst. Sorry we gave you that one. It's okay. I saw the one that Brian did and I was like, oh, it should be x plus equals one. I was like, no, that ruins the joke. Exactly. Yeah. Yeah, I actually often do the slow way, or the non-obvious way,

Starting point is 00:46:30 yeah, x equals x plus one, just to make it more obvious to people reading it sometimes. Yeah, yeah, no, I agree. Yeah, at least it's not C++ with x plus plus. I love that. No, no, we should have that. I'm okay with x plus plus, but not that other one, plus plus x. Oh, the pre-increment. Yeah. The pre-increment.

Starting point is 00:46:53 The slight. That's weird. Yes, exactly. Exactly. But I could go for it. X plus plus. Come on.

Starting point is 00:46:58 All right. Well, Julia, thanks for joining us this week. And Brian, thanks as always. Oh, it was a pleasure.

Starting point is 00:47:03 Thanks, Julia. Yeah. Bye everyone.
