Python Bytes - #152 You have 35 million lines of Python 2, now what?

Episode Date: October 15, 2019

Topics covered in this episode: JPMorgan’s Athena Has 35 Million Lines of Python 2 Code, and Won’t Be Updated to Python 3 in Time; organize; PEP 589 – TypedDict: Type Hints for Dictionaries With a Fixed Set of Keys; gazpacho; How pip install Works; daily pandas tricks; Extras; Joke. See the full show notes for this episode on the website at pythonbytes.fm/152

Transcript
Starting point is 00:00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is episode 152, recorded October 9th, 2019. I'm Michael Kennedy. And I'm Brian Okken. And this episode is brought to you by DigitalOcean. Check them out at pythonbytes.fm slash DigitalOcean. Get $50 credit for new users. Now, we may have touched on this concept of legacy Python before, Brian.
Starting point is 00:00:22 Have we covered it? Yeah, I think we have. We definitely have. So we know that there are companies out there that say it's really tricky for us to upgrade to Python 3 because, and sometimes that's because, I don't know, just they don't put the resources into it, right? Like they would rather work on features
Starting point is 00:00:39 rather than going back and rewriting old code to do the same thing, just so it's not so old, things like that. Other times it's because they have a ton of Python code. And we're hearing more and more stories of these companies that have been like head in the sand, waiting until the very, very last minute to make those migrations. And they're just like, all right, finally,
Starting point is 00:00:59 somebody has raised it to the level that it has to be dealt with, right? Yeah. Well, it turns out that banks use a lot of Python code, as we know. And I've heard of Bank of America using a ton of code and having a lot of people working on some Python projects. But JPMorgan, JPMorgan Chase, they use maybe even more. They use a ton of Python. So there's an article that's based on this presentation by Misha Tselman,
Starting point is 00:01:24 who is the executive director at JPMorgan Chase, about this. It was given at PyData 2017. So they've been working on it. But the problem is they have 35 million lines of Python 2 code. Oh, that's a lot. In terms of Python code, that's kind of ridiculous, right? That's an insane amount. And so they've got a lot of Python code that has to be converted to Python 3.
Starting point is 00:01:45 And this is from their Athena trading platform, which is at the core of their business operations. So they got a late start migrating to Python 3. And people are pointing out this could be a security risk for them. We saw what happened with Equifax and some outdated things there. Who knows what the risks are? I think it's probably less than something like the web frameworks that were out of date at other places. But yeah, they have a lot of stuff that has to be migrated. And internally they use Python for pricing, trading,
Starting point is 00:02:15 risk management, analytics, and even machine learning. So just to look at some stats from this project, the feature set utilizes 150,000 Python modules, over 500 open source packages, 35 million lines of Python code contributed by 1,500 developers. Okay. So they got a big team. That's a huge scale. And by the way, I wonder how much JPMorgan Chase is contributing back to those 500 open
Starting point is 00:02:43 source projects. Hopefully some. All right. Now it says they're going to miss the deadline, right? That most of the strategic elements are going to be in place by Q1 2020, but they can't do it all. And I know it's probably like a good roadmap for folks, right? Like they don't have to upgrade it all and then release that new thing, right?
Starting point is 00:03:02 They can upgrade elements at a time. And there's a lot of great stories on how folks have done that. I think probably the Instagram project was the most awesome one I've seen where they didn't even branch, right? They just found a way to seamlessly move from Python 2 to 3 while still running on 2 and then finally flipping the switch. Here's another one I thought you would find interesting, though. They have some other stats.
Starting point is 00:03:24 You know, on your projects, how often do you commit code? Is it like once a week, once a day, once an hour? Yeah, several times a day. Yeah, I'm kind of the same. I do that. And you guys don't really release stuff, but like say the Python Bytes website or the Talk Python Training site,
Starting point is 00:03:39 you know, those probably do some form of website release every other day. Some sort of deploy, restart, like run through the whole deployment process. So JPMorgan Chase uses continuous integration and continuous delivery, with 10,000 to 15,000 production changes a week. That's amazing. It's like mind blowing, isn't it? Yeah.
Starting point is 00:04:01 Yeah. So they're on it, I guess. It's just such a project of massive scale that it's hard to get your mind around and hard to find analogies. So I'm sure there's a few other projects like this in the world, but it can't be many. No. Well, that's like one a second or faster. It's constantly deploying. It's got to be microservices and other stuff, right? Otherwise, just like, how would you go to the website? How would you use the services? Anyway, quite incredible. All right, well, what you got for our next one? This is just kind of a cool little tool called Organize, and it was suggested by Ariel Barkin on Twitter.
Starting point is 00:04:36 And I took a look at this, and I'm going to start using it right away. So it's a Python-based file management automation tool. And the idea is people are lazy with how they save files and download files and whatever. And on my Mac, for example, all the screenshots just show up on the desktop. And then, you know, occasionally I'll just take everything and lump them into a clutter folder or something. But this is a tool where you can give it rules in a YAML file and have it do things like move all your screenshots from the desktop into a screenshots folder, or look through all your downloads for the incomplete downloads that you canceled or something, that are still sitting there, and just trash those if they're more than a few days old, or do things like removing empty files from certain folders, like your downloads or desktop or other places. One of the examples is to organize your receipts and invoices into date-based folders, which is pretty cool because there's
Starting point is 00:05:34 macros involved, so you can look at the file touch time and figure out what date it is and extrapolate the dates and stuff. And yeah, I always, when I'm paying bills or something, I save the receipt to just wherever, in the downloads folder or something. And having this, just running this every once in a while, could clean it up and put everything in its place. It's pretty cool. And super cool. You could just put it on like a cron job
Starting point is 00:05:58 that runs every five minutes or every minute or something, right? It just goes, boop. It's gotta be super quick. Just looks at the files, a few folders, and then does some text matching. It's one of those automate the boring stuff sort of things that somebody thought, everybody has this problem,
Starting point is 00:06:13 so it's nice. Yeah, I like it. I have the same problem with receipts and stuff. I'll get them in an email or as a PDF attachment, or actually it's just an email that I'll print to PDF so that I can save it for taxes, and they just, like, clutter up. Yeah, I could totally see just using that. The rules seem like they're rich enough to do that. So yeah, it looks really good. Yeah, super cool. All right, speaking
Starting point is 00:06:34 of cool, let me just tell you about DigitalOcean. So all of our services run on DigitalOcean. The audio you're listening to now somehow flowed through the DigitalOcean servers to get to you. And they've got all sorts of great options out there. They're simple but powerful. There's not knobs for absolutely every little edge case, right? You set up the main servers that you want to work with. You have Spaces. You have hosted databases in MySQL and Postgres. And you even have caching like Redis and things like that.
Starting point is 00:07:05 So super nice. Check them out at pythonbytes.fm slash digitalocean and get $50 credit for new users. Highly recommended. Now, this next one is a fun one, and it took me a minute to realize what this was about, Brian. So I realized there's this new PEP, PEP 589, and it allows you to define typed dictionaries,
Starting point is 00:07:28 like define a type that represents a dictionary. Well, it turns out there was already a way to do that, which is why I was confused, because there's PEP 484, which has been around for a while, which lets you create a Dict of K, V, which is like, here's a dictionary of arbitrary keys, and it has maybe integers or it has user objects or whatever, right? So you can define these uniform dictionaries, which is kind
Starting point is 00:07:53 of interesting. But this new PEP, it lets you go much farther. And it's proposed by Jukka Lehtosalo, and it's actually sponsored by Guido van Rossum. So remember recently we spoke about Guido and we had this philosophical debate of like, well, he's all about typing these days, but originally typing was like explicitly left out of the language. What's the story? So here's another typing thing that he's participating in, which I think is interesting. So this is accepted. It's scheduled for 3.8. So all sorts of interesting stuff. And it's, you know, it's coming down the line, right? Soon, actually. So what it lets you do is imagine you have an arbitrary JSON document or an arbitrary Python dictionary, really, but right, like, it's super easy to think of, like, well, somebody sends
Starting point is 00:08:38 me a JSON request, and I want to treat it as if I know what's happening here. It lets you actually specify the shape of those things, both the keys as well as the values, and potentially nested documents, right? So you might have a JSON object that's got like some values, and one of those values might be a list of other JSON documents. You can describe that with this TypedDict thing. So the way it works kind of caught me off guard at first, but I think I like it. So what you do, instead of just saying, you know, there's a dictionary of like string comma user, you actually create a class which derives from TypedDict.
Starting point is 00:09:15 Okay. Okay. And then it has fields. It looks a little bit like data classes. So you might have like a name colon str and a year colon int. And this thing is not actually the dictionary, but it is the type that validates the dictionary. All right. Oh, okay. And then you can say it is one of those, right? So the example they give is there's a movie. So you say movie colon capital-M
Starting point is 00:09:37 Movie, which is the name of the class, and then it's just a dictionary, but the dictionary has the name, which is a string value, and a year, which is an integer value, and so on. And then you can actually validate it. And the static type checker, like mypy and so on, will, if you say movie of director, it'll say, no, no, no. You can't set this value into this dictionary because it doesn't have a key called director. Or if you try to set the year to the string 1982, in quotes, it'll say no, no, this is a string and it expected an integer. But the errors come at type checking time, right? This is a type checking time thing. Although, you know, it's totally reasonable that things like PyCharm
Starting point is 00:10:15 and VS Code would add edit time checking for this as well, because they do for all the other type stuff. Yeah, but it's not a runtime. It's not a runtime thing. Yeah, all the typing stuff. Okay. And this is definitely that way. So you're not like re-implementing the dictionary. You're not creating a dictionary type that is like different. You create a type which then talks
Starting point is 00:10:35 about just a plain dictionary. So quite interesting actually. Yeah, it does take a little while to look at it and go, does this make sense? But yeah, it does. Right, imagine you're getting, you're writing an API and somebody's submitting like a JSON post to you and you want to know, is it valid? Right.
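As a rough illustration of the Movie example discussed here, the following is a minimal sketch that follows the shape of the PEP 589 example. It assumes Python 3.8 or later, where TypedDict is available from the typing module; the movie title is just sample data.

```python
from typing import TypedDict


# The class only describes the expected keys and value types;
# it does not create a new kind of dictionary.
class Movie(TypedDict):
    name: str
    year: int


# At runtime this is an ordinary dict. The annotation is for type checkers.
movie: Movie = {"name": "Blade Runner", "year": 1982}

# A static checker such as mypy is expected to reject the two lines below,
# while Python itself would execute them without complaint:
# movie["director"] = "Ridley Scott"   # Movie has no key called "director"
# movie["year"] = "1982"               # a str where an int is expected

print(type(movie) is dict)          # no custom dictionary type is involved
print(sorted(Movie.__annotations__))  # the declared keys
```

Because nothing is checked at runtime, a JSON payload from an API would still need separate validation; the TypedDict mainly documents the expected shape for the type checker and the editor.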
Starting point is 00:10:50 You could use this to basically validate your schema, or at least describe the schema you expect. Yeah. Neat. It is neat. Speaking of APIs and new web things, your next one is one of those, right?
Starting point is 00:11:00 Oh, I got carried down that rabbit hole. No, that's cool. The next one, I was just enticed by the name. So there's a package called Gazpacho. It's just great. It's fun to say.
Starting point is 00:11:12 It's fun to eat. But anyway, Gazpacho is a web scraping library. And the goal of it is to replace Requests and Beautiful Soup for most web scraping projects. And I got to tell you, I have some web scraping projects that I wanted to do. And I know that Requests and Beautiful Soup are easy to use and are super powerful. But that one use case where you're just grabbing, like you're just doing a get, then you parse it, and then you find some stuff in it and separate it out, that's so common
Starting point is 00:11:45 that this is basically, it's optimizing for that. There's an example article that I'll link to also that uses Gazpacho to scrape hockey data for fantasy sports use. But it's just a really simple interface. From Gazpacho, you import get, and Soup as a class, and you can use those to grab some HTML and parse it and find some stuff in there. It's just, you know, a handful of lines of code and you've got a web scraper on your hands. So yeah, I like it. I think I'll give it a shot. But I tried it out, and I wanted to bring this up because I ran into a problem where I was getting these certificate errors. Have you ever gotten certificate errors when you're trying to parse things or pull things down? Yeah, just once or twice.
Starting point is 00:12:31 And it's the kind of thing where you bounce off the walls of Stack Overflow until you get it fixed and then you forget how to fix them. Yeah. So what did you do? I did the same thing. Went to Stack Overflow. And apparently, and I don't know if this is just a Mac thing or not, but on Macs at least, when you install Python, in the install directory in Applications, Python 3-whatever, there's a file called Install Certificates.command.
Starting point is 00:12:59 And you just have to run that. And then it has the list of certificates or something. I don't know how certificates work, but it makes it so that you can access SSL stuff from Python. So I ran into that today. That's right. I'm glad you're linking to it, so now we'll have it forever. Yeah. Yeah, that's cool.
Starting point is 00:13:18 It's nice. Gazpacho is like two to three times faster than Beautiful Soup, which is pretty sweet. I like that. It also does a lot less. So that makes sense. Yeah, for sure. It's a more focused thing. And that's like the 80% case though, right? You just need to go do simple things. That's what I'm going to use it for. So the last thing I want to cover for our main items is pip. So remember,
Starting point is 00:13:39 actually, we spoke about PyDist, P-Y-D-I-S-T. Yeah. Yeah. This is like a private PyPI as a service, I guess. It's kind of the way I would describe it. So before, when we talked about this, it was just in beta and
Starting point is 00:13:56 it didn't seem to have any pricing or anything like that. Now they have pricing and a little bit more detail. They've more or less launched at this point. This article is not about that, but it was written by the folks who run it. That's just the connection back to the previous thing. And it talks about how pip install works.
Starting point is 00:14:15 So for this section, I just want to talk to you real quick about, when you say pip install certifi, like it did in that previous article you just mentioned to fix your certificates, what does it do? How does it work? All right. So it walks you through all the steps and all the decisions and whatnot that pip has to make when you say pip install some package. So the first thing it has to decide, well, first, I guess, is does the package exist, right? And then it needs to figure out which distribution of the package to install. Because we have eggs, we have wheels, we have source.
Starting point is 00:14:49 We have all these different types of distributions. There are seven different kinds of distributions. But the most common are either source distributions or binary wheels. So focus on those, right? So a source distribution is just, here's your Python code and maybe the C code that comes with it, and as part of the setup, we're going to, like, run a compiler against the C code to make sure that that's compiled on your machine, right? Super easy to write, not so easy to make sure it works, you know, everywhere, not just works on my machine, right?
Starting point is 00:15:19 Because you've got to have compilers on all the platforms, and oh yeah, what about that old version of Windows that, like, was a minimal install and doesn't have GCC or Visual Studio or whatever? So wheels are a little bit more safe and also faster. But that means they have compiled C code, so you have to have multiple ones for different platforms, right? So Windows versus macOS or something. The benefit is stuff installs fast. So NumPy takes about four minutes to compile from source. So if you did a source dist of NumPy,
Starting point is 00:15:53 pip install might be slower than you would otherwise expect. So anyway. Yeah, the four-minute pip install, yes. Yeah, that's before you even hit the dependencies, right? That's just the primary thing. Yeah. Yeah, okay. So it has to figure out which one of those are.
Starting point is 00:16:07 And there's actually a known URL. So like pypi.org slash simple slash package name is where you would go. So you could go to that slash requests, for example. And there's a huge, just flat, it's a weird API. It's like an HTML list of a bunch of wheels with platform names and tarballs and all sorts of stuff. So it starts out by going there to figure out what is here. What can I find? So first it determines what system you're on
Starting point is 00:16:39 and what's compatible with the thing. So like if you have a binary wheel, there's actually a PEP that talks about how you figure out which one that is. And then if it's a source dist, well, you just assume it works. So once it has that, then it'll try to get the best,
Starting point is 00:16:55 and it prefers wheels, and then it has to figure out the dependencies. So for binary wheels, there's a file called METADATA that has a list of those. So that's cool. You can just look at that. If it's a source distribution, it figures it out by running the setup.py. So that's interesting.
Starting point is 00:17:11 So it'll run setup.py to actually figure out what dependencies it has to install, you know, and go do that. And then you might have two dependencies. You might have a thing and you might depend on, let's say, Beautiful Soup, but you also have some other library that also depends on Beautiful Soup if you follow the dependency tree. And they might even specify versions. So you might wonder, well, what happens if one depends on one version
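The lookup steps described so far (the simple index URL, the tags in a wheel's filename, and the dependency list in METADATA) can be sketched with the standard library alone. This is not pip's own code, only an illustration under stated assumptions: the helper names are hypothetical, the wheel filename is an illustrative one in the PEP 427 layout, and the METADATA text is made-up sample data for an imaginary package.

```python
import re
from email.parser import Parser


def simple_index_url(project_name):
    """Build the simple index URL, normalizing the name as PEP 503 describes."""
    normalized = re.sub(r"[-_.]+", "-", project_name).lower()
    return f"https://pypi.org/simple/{normalized}/"


def parse_wheel_filename(filename):
    """Split a wheel filename into the fields laid out in PEP 427."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # six parts when an optional build tag is present
        raise ValueError(f"unexpected wheel filename: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def requirements_from_metadata(metadata_text):
    """Return the Requires-Dist entries from METADATA, which uses email-style headers."""
    message = Parser().parsestr(metadata_text)
    return message.get_all("Requires-Dist") or []


# Made-up METADATA content for a hypothetical package.
SAMPLE_METADATA = """\
Metadata-Version: 2.1
Name: example-package
Version: 1.0.0
Requires-Dist: beautifulsoup4 (>=4.8)
Requires-Dist: requests

An example description.
"""

index_url = simple_index_url("requests")
wheel_info = parse_wheel_filename("numpy-1.17.2-cp37-cp37m-manylinux1_x86_64.whl")
requirements = requirements_from_metadata(SAMPLE_METADATA)

print(index_url)
print(wheel_info)
print(requirements)
```

The Python, ABI, and platform tags are what an installer compares against the running interpreter to decide whether a given wheel is compatible; a source distribution carries no such tags, which matches the "you just assume it works" remark.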
Starting point is 00:17:33 and the other depends on the other? Turns out it just installs it anyway. Let's take the latest. That's going to be fine, right? That's different than like a requirements file that has like different dependency, like pinned versions. Like there's a slight difference
Starting point is 00:17:45 there. So finally it gets it, builds it, installs it, and then it has to figure out where the path is that I'm going to install it to. Am I going to install it into a virtual environment? Am I going to install it into the system or the user path? Things like that. So you can look at sys.prefix to figure out which one of those it is, and there's some environment variables. And it copies it over into the right place, and before it considers your package installed, it also converts the source files into .pyc bytecode files so they don't have to get parsed again.
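To make the last two steps a little more concrete, here is a small standard-library sketch of checking where packages would land and of compiling a source file to bytecode. It is not what pip runs internally; the throwaway module and the helper name are assumptions for illustration, and the sys.base_prefix comparison applies to environments created with the built-in venv module (older virtualenv releases signaled this differently).

```python
import py_compile
import sys
import sysconfig
import tempfile
from pathlib import Path


def in_virtual_environment():
    """In a venv, sys.prefix points at the environment, sys.base_prefix at the base install."""
    return sys.prefix != sys.base_prefix


# Directory where pure-Python packages are installed for this interpreter.
site_packages = sysconfig.get_paths()["purelib"]

print("virtual environment:", in_virtual_environment())
print("pure-Python packages go to:", site_packages)

# Compile a throwaway module to bytecode, similar in spirit to the final install step.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "hello.py"
    source.write_text("print('hello')\n")
    pyc_path = py_compile.compile(str(source), doraise=True)
    pyc_exists = Path(pyc_path).exists()
    print("bytecode written to:", pyc_path)
```

The printed paths depend on the machine and on whether a virtual environment is active, so no fixed output is shown here.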
Starting point is 00:18:15 Then your package is installed. Okay. Yeah, so anyway. Simple. Yeah, so if you're wondering what happens as part of the pip install stuff, there's a lot of details, and I didn't cover all of it,
Starting point is 00:18:24 but as much as I thought made sense. I was just curious. I was going to try to find one of those complicated packages that I knew had to be compiled, because I went to a couple of mine, and they're just Python code, so there's just one per version, one wheel. But NumPy, for instance,
Starting point is 00:18:43 I know it's got some compiled code in it. It's got, like, I lost count, it's like 15, 16, 17 different wheels for each version. Yeah, Requests, it's got a ton as well. Yeah, it's interesting. It is interesting how that works. I'm glad it all works and I don't have to think about it either. But it turns out there's like a lot of conversation in there about some stuff that is not totally solved even today, right? About trying to resolve the dependencies in a totally predictable way before you start installing anything and stuff like that. So it's worth checking out. It's a hard problem.
Starting point is 00:19:15 Yep, for sure. But I want to finish up with a cool trick, like a zoo trick, a zoo animal trick. Oh, yeah. I'm zoning today. So Kevin Markham, he runs, what's the thing he runs? Data School. Data School.io. Data School.
Starting point is 00:19:28 Plus he's a super nice guy. He's doing something neat that's called Daily pandas Tricks or Tricks and Tips or something like that. But anyway, we'll get a link to it. He's sending out a little tip or trick about pandas every day on Twitter. And the page we're linking to has a whole bunch of them already built in. And I like the notion of just trying to fit something. Often they're little screenshots, but they're still pretty small,
Starting point is 00:19:55 a little lesson of how to do something cool. I just picked out one, which is like, let's say you wanted to rename all of the columns in a data frame the same way, like to replace all the spaces with underscores or something, and he just shows you how to do that in a little thing. I think that's neat, especially for a package like pandas, where there's a whole bunch of stuff you can do with it, to have a way to just see a little extra new thing every day and say, that's something I might use. I'll keep looking
Starting point is 00:20:25 at that later or something. So I don't think we've talked about it before. And I think it's a cool thing he's doing. So I wanted to highlight it. Yeah, it's definitely a cool thing he's doing. And pandas is one of those things where it's not always obvious, all the little magic that you can do, right? Like if you want to go to the columns and do string operations. Just dataframe.columns.str.applyYourOperation, right? Like that's, after you use it for a while, it's obvious, but maybe not right away. It definitely isn't to me. Pandas feels a little like magic to me.
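For reference, the column-renaming trick mentioned here might look like the short sketch below. It assumes pandas is installed; the data frame and its column names are invented sample data, not taken from Kevin's tip.

```python
import pandas as pd

# Invented sample data with spaces in the column names.
df = pd.DataFrame(
    {"first name": ["Ada"], "last name": ["Lovelace"], "birth year": [1815]}
)

# The .str accessor applies a string operation to every column label at once,
# so no explicit loop over the columns is needed.
df.columns = df.columns.str.replace(" ", "_", regex=False)

print(list(df.columns))  # ['first_name', 'last_name', 'birth_year']
```

The same accessor offers other string methods, such as lower() or strip(), which can be chained in the same way.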
Starting point is 00:20:55 I'm looking at this going, I would not have guessed that. Exactly. It's not obvious, but once you know it, it's like, well, of course that's better than, like, there's this saying that if you find yourself looping over things in like NumPy or pandas, you're probably doing it wrong. One of the nice fun things I think is if you get really good at something, you'll start learning the things that you shouldn't do, but that are fun. And some of Kevin's tips are, you can do this. It's sort of fun, but don't, because it's confusing to other people. But
Starting point is 00:21:25 anyway, here's the trick. Nice. It's neat that he's including those. It's clever, but too clever sometimes. Cool. All right. So do you have any extras to share? Oh, only that we just got finished with our first Python PDX West meetup last night, and it was both exhausting and really fun. So thanks for helping out with that. Yeah, you bet. Good job putting it together. It came out really well. Everyone seemed to have a great time. There was a totally good turnout.
Starting point is 00:21:51 I was blown away that it was actually basically sold out. Not sold out, but booked out on its very first run, which is crazy. And people out there listening, if you want to come and give a talk at the meetup and you're willing to find your way to Portland, shoot a message to Brian or me and let us know. That'd be cool. Yeah, would be cool. And then before anybody asks, it was not recorded. So you have to be here.
Starting point is 00:22:15 How about you? You got some news to share? I got all sorts of stuff. A few really quick things. One, I upgraded to macOS Catalina yesterday. And so far, so good. No major problems. All the Python things seem to be working.
Starting point is 00:22:27 So if you're wondering, I did hear that someone out there was having trouble with Miniconda. I don't use Miniconda, so I have no idea about that. Maybe do a Google search if that matters to you. Also, Brian, I switched to working with Adobe Audition. I've been using Audacity and GarageBand. Finally broke down and paid the $30 a month for Adobe Audition. I've been using Audacity and GarageBand. Finally broke down and paid the $30 a month for Adobe Audition. And wow, is it worth it? It is so good. What has been wrong with me to not do that? I just didn't want to learn new software. It's not so much about the money.
Starting point is 00:22:55 It's just like, I don't want to learn new hotkeys. I already know the hotkeys, but it's so super good. The reason I bring it up on the show instead of after is if you hear like weird artifacts or something odd in the audio, call attention to it, because there's all these dials and knobs that can do things like chop off the S's at the end of words if you turn them too far and stuff like that. So hopefully things sound better. If they don't, let us know. And then the two Python-related things, really quick. Azure Databricks also is dropping support for Python 2. So just one more brick to fall for legacy Python. The Python death clock continues to toll for those who hang on to their Python 2. And the folks over on the VS Code team, and Rong Lu in particular, just
Starting point is 00:23:42 announced that at PyCon, they just revealed a cool new Jupyter UI, variable explorer, and IntelliSense stuff for basically running Jupyter inside of VS Code. So if you're a VS Code user and you care about Jupyter, check that out. Very cool. Yeah, absolutely. Absolutely. Well, that's it for the stuff.
Starting point is 00:24:00 I got a story for you, a joke maybe. Yes, please. This one comes to us from maybe an unexpected place. A person on Twitter who goes by The Sarcastic Pharmacist sent us this actually really good joke and a nice comment. And the theme is that it's hard to distinguish between what is, like, super easy in programming and what is, like, nearly impossible, for people who are not doing the programming themselves. So this is actually XKCD comic 1425. It's got a programmer, a woman sitting there working at her desk,
Starting point is 00:24:33 and there's like a manager type who comes up and is issuing feature requests. Okay? Okay. The manager, I'm going to think of one of the people from Office Space maybe, comes over and says, when a user takes a photo with the app, it should check whether they're in a national park. And the woman says, sure.
Starting point is 00:24:49 Easy GIS lookup. Give me a few hours. Oh yeah. And it should also check whether the photo is of a bird. She says, I'll need a research team and five years.
Starting point is 00:24:58 The subtitle is: In CS, it can be hard to explain the difference between the easy and the virtually impossible. Yeah. So there you go. Yeah, I don't know. That resonates a lot with me, at least. Yeah, we'll probably get a bunch of the image people telling us that it's like five minutes now with all the new image libraries to do a bird.
Starting point is 00:25:17 Yeah, but that's now, right? Like, we probably should, I should see if there's a date for this, just to be fair. They don't have dates on these. That's kind of funky. All right. Anyway, well, there's probably some algorithm that figures out the number of the XKCD and maps it back to a date. But yeah.
Starting point is 00:25:31 Yeah. But that's funny. Cool. All right. Well, great to chat with you as always. Thank you. Yep. Bye.
Starting point is 00:25:36 Bye. Thank you for listening to Python Bytes. Follow the show on Twitter via at Python Bytes. That's Python Bytes as in B-Y-T-E-S. And get the full show notes at PythonBytes.fm. If you have a news item you want featured, just visit PythonBytes.fm and send it our way. We're always on the lookout for sharing something cool.
Starting point is 00:25:54 On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.
