Python Bytes - #152 You have 35 million lines of Python 2, now what?
Episode Date: October 15, 2019
Topics covered in this episode: JPMorgan’s Athena Has 35 Million Lines of Python 2 Code, and Won’t Be Updated to Python 3 in Time; organize; PEP 589 – TypedDict: Type Hints for Dictionaries With a Fixed Set of Keys; gazpacho; How pip install Works; daily pandas tricks; Extras; Joke.
See the full show notes for this episode on the website at pythonbytes.fm/152
Transcript
Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
This is episode 152, recorded October 9th, 2019.
I'm Michael Kennedy.
And I'm Brian Okken.
And this episode is brought to you by DigitalOcean.
Check them out at pythonbytes.fm slash DigitalOcean.
Get $50 credit for new users.
Now, we may have touched on this concept of legacy Python before, Brian.
Have we covered it?
Yeah, I think we have.
We definitely have.
So we know that there are companies out there that say it's really tricky for us
to upgrade to Python 3 because,
and sometimes that's because, I don't know,
just they don't put the resources into it, right?
Like they would rather work on features
rather than going back and rewriting old code
to do the same thing, just so it's not so old,
things like that.
Other times it's because they have a ton of Python code.
And we're hearing more and more stories of these companies that have been like head in the sand,
waiting until the very, very last minute to make those migrations.
And they're just like, all right, finally,
somebody has raised it to the level that it has to be dealt with, right?
Yeah.
Well, it turns out that banks use a lot of Python code, as we know.
And I've heard of Bank of America using a ton of code
and having a lot of people working on some Python projects.
But JPMorgan, JPMorgan Chase, they use maybe even more.
They use a ton of Python.
So there's an article that's based on this presentation by Misha Tselman,
who is the executive director at JPMorgan Chase, about this.
It was given at PyData 2017.
So they've been working on it.
But the problem is they have 35 million lines of Python 2 code.
Oh, that's a lot.
In terms of Python code, that's kind of ridiculous, right?
That's an insane amount.
And so they've got a lot of Python code that has to be converted to Python 3.
And this is from their Athena trading platform, which is at the core of their business operations.
So they got a late start on migrating to Python 3.
And people are pointing out this could be a security risk for them.
We saw what happened with Equifax and some outdated things there.
Who knows what the risks are?
I think it's probably
less than something like the web frameworks that were out of date at other places. But yeah, they
have a lot of stuff that has to be migrated. And internally they use Python for pricing, trading,
risk management, analytics, and even machine learning. So just to look at some stats from
this project, the feature set utilizes 150,000 Python modules,
over 500 open source packages, 35 million lines of Python code contributed by 1,500
developers.
Okay.
So they got a big team.
That's a huge scale.
And by the way, I wonder how much JPMorgan Chase is contributing back to those 500 open
source projects.
Hopefully some.
All right.
Now it says they're going to miss the deadline, right?
That most of the strategic elements are going to be in place by Q1 2020,
but they can't do it all.
And I know it's probably like a good roadmap for folks, right?
Like they don't have to upgrade it all and then release that new thing, right?
They can upgrade elements at a time.
And there's a lot of great stories on how folks have done that.
I think probably the Instagram project was the most awesome one I've seen where they
didn't even branch, right?
They just found a way to seamlessly move from Python 2 to 3 while still running on 2 and
then finally flipping the switch.
Here's another one I thought you would find interesting, though.
They have some other stats.
You know, on your projects, how often do you commit code?
It's like once a week, once a day, once an hour.
Yeah, several times a day.
Yeah, I'm kind of the same.
I do that.
And you guys don't really release stuff,
but like say the Python Bytes website
or the Talk Python Training site,
you know, those probably do some form of website release
every other day.
Some sort of deploy, restart, like
run through the whole deployment process. So JPMorgan Chase uses continuous integration
and continuous delivery, with 10,000 to 15,000 production changes a week.
That's amazing.
It's like mind blowing, isn't it?
Yeah.
Yeah. So they're on it, I guess. It's just such a project of massive
scale that it's hard to get your mind around and hard to find analogies. So I'm sure there's a
few other projects like this in the world, but it can't be many.
No. Well, that's like one a minute or faster.
It's constantly deploying. It's got to be microservices and other stuff, right? Otherwise,
just like, how would you go to the website? How would you use the services?
Anyway, quite incredible.
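As a rough check on that rate, here is a small sketch that converts 10,000 to 15,000 changes a week into an average gap between changes. It assumes, unrealistically, that changes are spread evenly around the clock, and the function name is made up for illustration.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def seconds_between_changes(changes_per_week: int) -> float:
    """Average gap between changes, assuming an even spread over the whole week."""
    return SECONDS_PER_WEEK / changes_per_week


for rate in (10_000, 15_000):
    gap = seconds_between_changes(rate)
    print(f"{rate:,} changes/week -> one every {gap:.1f} seconds on average")
```

Under that assumption the figure works out to roughly one production change every 40 to 60 seconds.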
All right, well, what you got for our next one?
This is just kind of a cool little tool called Organize,
and it was suggested by Ariel Barkin on Twitter.
And I took a look at this, and I'm going to start using it right away.
So it's a Python-based file management automation tool.
And the idea is people are lazy with how they save files and download files and whatever.
And on my Mac, for example, all the screenshots just show up on the desktop.
And then, you know, occasionally I'll just take everything and lump them into a clutter folder or something.
But this is a tool where you can give it rules in a YAML file and have it do things like move all your screenshots from the desktop into a screenshots folder, or look through all your downloads for the incomplete downloads that you canceled or something.
They're still sitting there, and it can just trash those if they're more than a few days old. Or it can do things like removing empty files from certain folders, like your downloads or desktop or other places. One of the examples is to organize
your receipts and invoices into date-based folders, which is pretty cool because there are
macros involved, so you can look at the file touch time and
extrapolate the dates and stuff. And yeah, I always, when I'm paying bills or something, I save the receipt to just wherever,
in the downloads folder or something.
And having this, just running this every once in a while
could clean it up and put everything in its place.
It's pretty cool.
And super cool.
You could just put it on like a cron job
that runs every five minutes
or every minute or something, right?
It just goes, boop.
It's gotta be super quick.
Just looks at the files, a few folders,
and then does some text matching.
It's one of those automate the boring stuff sort of things
that somebody thought, everybody has this problem,
so it's nice.
Yeah, I like it.
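To make the kind of rule described here concrete, the following is a plain standard-library sketch of two of them: moving screenshots off the desktop and finding stale partial downloads. It is not organize's own configuration format or API. The function names, the "Screen Shot" filename prefix, and the partial-download suffixes are all assumptions chosen for illustration.

```python
import shutil
import time
from pathlib import Path


def move_screenshots(desktop, dest, prefix="Screen Shot"):
    """Move files whose names start with `prefix` from `desktop` into `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() takes a snapshot of the listing before anything is moved.
    for path in sorted(desktop.iterdir()):
        if path.is_file() and path.name.startswith(prefix):
            target = dest / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved


def stale_partial_downloads(downloads, max_age_days=3.0,
                            suffixes=(".crdownload", ".part", ".download")):
    """List partial-download files last modified more than `max_age_days` ago."""
    cutoff = time.time() - max_age_days * 86400
    return [
        path
        for path in sorted(downloads.iterdir())
        if path.is_file()
        and path.suffix in suffixes
        and path.stat().st_mtime < cutoff
    ]


# Hypothetical usage, for example from a scheduled job:
# move_screenshots(Path.home() / "Desktop", Path.home() / "Desktop" / "Screenshots")
# for leftover in stale_partial_downloads(Path.home() / "Downloads"):
#     print("would trash", leftover)
```

The second function only reports candidates instead of deleting them, which seems like the safer default for something that runs unattended.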
I have the same problem with receipts and stuff.
I'll get them in an email or as a PDF attachment
or actually it's just an email that I'll print it to PDF
so that I can save it for
taxes, and they just like clutter up.
Yeah, I could totally see just using that. The rules seem
like they're rich enough to do that. So yeah, it looks really good.
Yeah, super cool. All right, speaking
of cool, let me just tell you about DigitalOcean. So all of our services run on DigitalOcean. The audio
you're listening to now somehow flowed through the DigitalOcean servers to get to you. And they've got all sorts of great options out there.
They're simple but powerful.
There's not knobs to run absolutely every little edge case, right?
You set up the main servers that you want to work with.
You have spaces.
You have hosted databases in MySQL and Postgres.
And you even have caching like Redis and things like that.
So super nice.
Check them out at pythonbytes.fm slash digitalocean
and get $50 credit for new users.
Highly recommended.
Now, this next one is a fun one,
and it took me a minute to realize what this was about, Brian.
So I realized there's this new PEP, PEP 589,
and it allows you to define typed dictionaries,
like define a type that represents a dictionary.
Well, it turns out there was already a way to do that,
which is why I was confused,
because there's PEP 484, which has been around for a while,
which lets you create a dict of K, V,
which is like, here's a dictionary of arbitrary keys,
and it has maybe integers or it
has user objects or whatever, right? So you can define these uniform dictionaries, which is kind
of interesting. But this new PEP, it lets you go much farther. And it's proposed by Jukka
Lehtosalo, and it's actually sponsored by Guido van Rossum. So remember
recently we spoke about Guido and we had this philosophical debate of like, well,
he's all about typing these days, but originally typing was like explicitly left out of the
language. What's the story? So here's another typing thing that he's participating in, which
I think is interesting. So this is accepted. It's scheduled for 3.8.
So all sorts of interesting stuff. And it's, you know, it's coming down the line, right?
Soon, actually. So what it lets you do is imagine you have an arbitrary JSON document or an arbitrary Python dictionary, really, but right, like, it's super easy to think of, like, well, somebody sends
me a JSON request, and I want to treat it as if I know what's happening here. It lets you actually specify the shape of those
things, both the keys, as well as the values and potentially nested documents, right? So you might
have a JSON object that's got like, some values, one of those values might be a list of other JSON
documents, you can describe that with this TypedDict thing. So the way it works, kind of caught
me off guard at first, but I think I like it.
So what you do, instead of just saying, you know,
there's a dictionary of like string comma user,
you actually create a class which derives from TypedDict.
Okay.
Okay.
And then it has fields.
It looks a lot like data classes a little bit.
So you might have like a name colon str and a year colon int.
In this thing that is not actually the dictionary,
but it is the type that validates the dictionary, all right?
Oh, okay.
And then you can say it is one
of those, right? So the example they give is there's a movie. So you say movie colon capital M
Movie is the name of the class, and then it's just a dictionary. But the dictionary has the name, which
is a string value, and a year, which is an integer value, and so on.
And then you can actually validate it.
And the static type checker, like mypy and so on, will, if you say movie['director'], it'll say, no, no, no.
You can't set this value into this dictionary because it doesn't have a key called director.
Or if you try to set the year to the string 1982, in quotes, it'll say no, no,
this is a string and it expected an integer. But the errors come at the type checking time, right?
This is a type checking time. Although, you know, it's totally reasonable that things like PyCharm
and VS Code would add edit time checking for this as well, because they do for all the other type
stuff. Yeah, but it's not a runtime. It's not a runtime thing. Yeah, all the typing stuff. Okay.
And this is definitely that way.
So you're not like re-implementing
the dictionary. You're not creating a dictionary
type that is like different.
You create a type
which then talks
about just a plain dictionary.
So quite interesting actually.
Yeah, it does take a little while to look at it
and go, does this make sense? But yeah, it does.
Right, imagine you're getting, you're writing an API and somebody's submitting like a JSON post to you and
you want to know,
is it valid?
Right.
You could use this to basically to validate your schema or at least
describe the schema you expect.
Yeah.
Neat.
It is neat.
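The Movie example discussed here can be sketched as follows, assuming Python 3.8 or later, where TypedDict is importable from typing. The Watchlist class and its field values are hypothetical, added only to show the nested case.

```python
from typing import List, TypedDict


class Movie(TypedDict):
    name: str
    year: int


class Watchlist(TypedDict):
    # Hypothetical nested shape: a document whose value is a list of documents.
    owner: str
    movies: List[Movie]


movie: Movie = {"name": "Blade Runner", "year": 1982}
watchlist: Watchlist = {"owner": "brian", "movies": [movie]}

# A static checker such as mypy is expected to reject both of these lines;
# they are commented out so the file passes type checking as written.
# movie["director"] = "Ridley Scott"   # no key called "director"
# movie["year"] = "1982"               # a str where an int is expected

# At runtime the value is still a plain dict.
print(type(movie) is dict)
```

Because nothing is checked at runtime, this describes the schema you expect from a JSON request but does not validate incoming data by itself; that still needs an explicit check or a validation library.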
Speaking of APIs and new web things,
your next one is one of those,
right?
Oh,
I got carried down that rabbit hole.
No,
that's cool.
The next one, I was just enticed by the name.
So there's a package called Gazpacho.
It's just great.
It's fun to say.
It's fun to eat.
But anyway, Gazpacho is a web scraping library.
And the goal of it is to replace requests and beautiful soup for most web scraping projects.
And I got to tell you, I have some web scraping projects that I wanted to do.
And I know that Requests and Beautiful Soup are easy to use and are super powerful.
But that one use case where you're just grabbing, like you're just doing a get,
then you parse it, and then you find some stuff in it and separate it out,
that's so common
that this is basically, it's optimizing for that. There's an example article that I'll link to also
that uses Gazpacho to scrape hockey data for fantasy sports. But it's just a really
simple interface. From gazpacho, you import get, and Soup as a class, and you can use those to grab some HTML and
parse it and find some stuff in there. It's just, you know, a handful of lines of code and you've
got a web scraper on your hands.
So yeah, I like it. I think I'll give it a shot.
But I tried it out, and
I wanted to bring this up because I ran into a problem where I was getting these certificate errors.
Have you ever gotten certificate errors when you're trying to parse things or pull things down?
Yeah, just once or twice.
And it's the kind of thing where you bounce off the walls of Stack Overflow until you get it fixed and then you forget how to fix them.
Yeah.
So what did you do?
I did the same thing.
Went to Stack Overflow. And apparently within the, and I don't know if this is just a Mac thing or not,
but on Macs at least, when you install Python,
in the install directory in applications, Python 3, whatever,
there's a file called Install Certificates.command.
And you just have to run that.
And then it has the list of certificates or something.
I don't know how certificates work, but it makes it so that you can access SSL stuff from Python.
So I ran into that today.
That's right.
I'm glad you're linking to it, so now we'll have it forever.
Yeah.
Yeah, that's cool.
It's nice.
Gazpacho is like two to three times faster than Beautiful Soup, which is pretty sweet.
I like that.
It also does a lot less. So that makes sense.
Yeah, for sure.
It's a more focused thing.
And that's like the 80% case though, right? You just need to go do simple things.
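The get, parse, find pattern described for that 80% case can be sketched with only the standard library. This is a stand-in to show the shape of the task, not gazpacho's implementation or API. The class name, helper name, and sample HTML are invented for illustration, and the fetch step is left as a comment so the example runs offline.

```python
from html.parser import HTMLParser


class TextByClass(HTMLParser):
    """Collect the text inside every <tag> element that carries `css_class`.

    Deliberately minimal: it does not handle nested elements of the same tag.
    """

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.matches = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._capturing = True
            self.matches.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.matches[-1] += data


def find_text(html, tag, css_class):
    parser = TextByClass(tag, css_class)
    parser.feed(html)
    parser.close()
    return [match.strip() for match in parser.matches]


# Step 1 (get) would normally be an HTTP request, for example with
# urllib.request.urlopen(url).read().decode(); a literal page is used here.
html = """
<table>
  <tr><td class="player">A. Example</td><td class="goals">12</td></tr>
  <tr><td class="player">B. Sample</td><td class="goals">9</td></tr>
</table>
"""

# Steps 2 and 3 (parse, find):
print(find_text(html, "td", "player"))
```

A focused library would presumably replace the parser class and helper with its own get function and Soup class, which is the appeal for simple jobs.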
That's what I'm going to use it for.
So the last thing I want to cover for our main items is pip. So remember,
actually, we spoke about PyDist, P-Y-D-I-S-T.
Yeah.
Yeah. This is like a private PyPI as a service, I guess.
It's kind of the way I would
describe it. When we talked
about this before,
it was just in beta,
and it didn't seem to have any pricing
or anything like that.
Now they have pricing and a little bit more detail.
They've more or less launched at this point.
This article is not about this,
but it was written by the folks who run that.
Just that's the connection back to the previous thing.
And it talks about how pip install works.
So for this section,
I just want to talk to you real quick about
when you say pip install certifi,
like it did in that previous article you just mentioned
to fix your certificates, what do you do? How does it work? All right. So it walks you through all the steps
and all the decisions and whatnot that pip has to make when you say pip install some package.
So the first thing it has to decide, well, first, I guess, is does the package exist, right?
And then it needs to figure out which distribution of the package to install. Because we have eggs, we have wheels, we have source.
We have all these different types of distributions.
There are seven different kinds of distributions.
But the most common are either source distributions or binary wheels.
So focus on those, right?
So a source distribution is just, here's your Python code
and maybe the C code that comes with it. And as part of the setup, we're going to like run a compiler
against the C code to make sure that that's compiled on your machine, right? Super easy to
write, not so easy to make sure it works, you know, everywhere, not just works on my machine, right?
Because you've got to have compilers on all the platforms. And oh yeah, what about that old version
of Windows that like was a minimal install and doesn't have GCC or Visual Studio or whatever?
So, wheels are a little bit more safe and also faster.
But that means they have compiled C code, which has to be, you have to have multiple ones for different platforms, right?
So, Windows versus macOS or something.
The benefit is stuff installs fast.
So NumPy takes about four minutes to compile from source.
So if you did a source dist of NumPy,
pip install might be slower than you would otherwise expect.
So anyway.
Yeah, the four-minute pip install, yes.
Yeah, that's before you even hit the dependencies, right?
That's just the primary thing.
Yeah.
Yeah, okay.
So it has to figure out which one of those are.
And there's actually a known URL.
So like pypi.org slash simple slash package name is where you would go.
So you could go to that slash requests, for example.
And there's a huge just flat, it's a weird API. It's like HTML list of like a bunch of wheels
with platform names and tar balls and all sorts of stuff.
So it starts out by going there to figure out what is here.
What can I find?
So first it determines what system you're on
and what's compatible with the thing.
So like if you have a binary wheel,
there's actually a PEP that talks about
how you figure out which one that is.
And then if it's a source dist,
well, you just assume it works.
So once it has that,
then it'll try to get the best,
and it prefers wheels,
and then it has to figure out the dependencies.
So for binary wheels,
there's a file called METADATA
that has a list of those.
So that's cool.
You can just look at that.
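The METADATA file inside a wheel uses email-style headers, and each dependency is declared on a Requires-Dist line. Here is a minimal sketch of reading those lines with the standard library. The package name and requirements in the sample text are invented for illustration.

```python
from email.parser import Parser

# Hypothetical METADATA contents; real files carry more fields than this.
METADATA_TEXT = """\
Metadata-Version: 2.1
Name: example-package
Version: 1.0.0
Requires-Dist: beautifulsoup4 (>=4.8)
Requires-Dist: requests
Requires-Dist: pytest ; extra == 'test'
"""


def declared_dependencies(metadata_text):
    """Return every Requires-Dist entry, or an empty list if there are none."""
    message = Parser().parsestr(metadata_text)
    return message.get_all("Requires-Dist") or []


for requirement in declared_dependencies(METADATA_TEXT):
    print(requirement)
```

For a package that is already installed, importlib.metadata.requires (Python 3.8+) returns the same kind of requirement strings without opening the file by hand.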
If it's a source distribution, it figures it out by running the setup.py. So that's interesting.
So it'll run setup.py to actually figure out what dependencies it has to install, you know,
go do that. And then you might have two dependencies, you might have a thing and you
might depend on, let's say, Beautiful Soup, but you also have some other library that also depends
on Beautiful Soup
if you follow the dependency tree.
And they might even specify versions.
So you might wonder, well, what happens
if one depends on one version
and the other depends on the other?
Turns out it just installs it anyway.
Let's take the latest.
That's going to be fine, right?
That's different than like a requirements file
that has like different dependency,
like pinned versions.
Like there's a slight difference
there. So finally it gets it, builds it, installs it. And then it has to figure out where's the path
that I'm going to install it to. Is it a virtual environment? Am I going to install it into the
system or the user path? Things like that. So you can look at sys.prefix to figure out which one
those are, and there's some environment variables. And it copies it over in the right place,
and before it considers your package installed,
it also converts the source files into .pyc bytecode files
so they don't have to get parsed again.
Then your package is installed.
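The last two steps, picking the install location and byte-compiling the source, can be poked at from the standard library. This is a sketch of the ideas, not of pip's internals. The helper names are hypothetical, and the virtual environment check assumes an environment created by the venv module or a recent virtualenv.

```python
import py_compile
import sys
import sysconfig
import tempfile
from pathlib import Path


def in_virtualenv() -> bool:
    """True when sys.prefix points somewhere other than the base interpreter."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def install_target() -> str:
    """Directory where pure-Python packages go for the current interpreter."""
    return sysconfig.get_paths()["purelib"]


def byte_compile(source: Path) -> Path:
    """Compile one source file to a .pyc file and return the .pyc path."""
    return Path(py_compile.compile(str(source), doraise=True))


print("sys.prefix:       ", sys.prefix)
print("in a virtualenv:  ", in_virtualenv())
print("packages go into: ", install_target())

with tempfile.TemporaryDirectory() as tmp:
    module = Path(tmp) / "hello.py"
    module.write_text("GREETING = 'hi'\n")
    compiled = byte_compile(module)
    print("bytecode file:    ", compiled.name)
```

The printed paths depend on the interpreter and environment the script runs in, so no particular output is shown.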
Okay.
Yeah, so anyway.
Simple.
Yeah, so if you're wondering what happens
as part of the pip install stuff,
there's a lot of details,
and I didn't cover all of it,
but as much as I thought made sense.
I was just curious.
I was going to try to find one of those complicated packages
that I knew had to be compiled,
because I went to a couple of mine,
and they're just Python code,
so there's just one per version, one wheel.
But NumPy, for instance,
I know it's got some compiled code in it. It's got like,
I lost count, it's like 15, 16, 17 different wheels for each version.
Yeah, requests, it's got a ton
as well.
Yeah, it's interesting. It is interesting how that works. I'm glad it all works. I don't have
to think about it either. But it turns out there's like a lot of conversation in there about some
stuff that is not totally solved even today, right?
About trying to resolve the dependencies in a totally predictable way before you start installing anything and stuff like that.
So it's worth checking out.
It's a hard problem.
Yep, for sure.
But I want to finish up with a cool trick, like a zoo trick, a zoo animal trick.
Oh, yeah.
I'm zoning today.
So Kevin Markham, he runs, what's the thing he runs?
Data School.
Data School.io.
Data School.
Plus he's a super nice guy.
He's doing something neat that's called daily pandas tricks, or tricks and tips or something like that.
But anyway, we'll get a link to it.
He's sending out a little tip or trick about pandas every day on Twitter.
And the page we're linking to has a whole bunch of them already built in.
And I like the notion of just trying to fit something.
Often they're little screenshots,
but they're still pretty small,
a little lesson of how to do something cool.
I just picked out one,
which is like,
let's say you wanted to rename all of the columns in a data frame the same way, like to
replace all the spaces with underscores or something. And he just shows you how to do that
in a little thing. I think that's neat, especially for a package like pandas, where
there's a whole bunch of stuff you can do with it, to have a way to just see a little extra new
thing every day and say, that's something I might use. I'll keep looking
at that later or something. So I don't think we've talked about it before. And I think it's
a cool thing he's doing. So I wanted to highlight it. Yeah, it's definitely a cool thing he's doing.
And pandas is one of those things where it's not always obvious, all the little magic that you can
do, right? Like if you want to go to the columns and do string operations. Just dataframe.columns.str.applyYourOperation, right?
Like that's, after you use it for a while, it's obvious,
but maybe not right away.
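The column-renaming trick mentioned here can be sketched like this, assuming pandas is installed. The data frame and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data frame whose column names contain spaces.
df = pd.DataFrame(
    {"First Name": ["Ada"], "Last Name": ["Lovelace"], "Birth Year": [1815]}
)

# String methods on the column index apply to every column name at once,
# so no explicit loop is needed.
df.columns = df.columns.str.replace(" ", "_", regex=False)

print(list(df.columns))  # ['First_Name', 'Last_Name', 'Birth_Year']
```

Other string methods such as lower() or strip() are available on df.columns.str in the same way.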
It definitely isn't to me.
Pandas feels a little like magic to me.
I'm looking at this going, I would not have guessed that.
Exactly.
It's not obvious, but once you know it, it's like,
well, of course that's better than, like,
there's this saying that if you find yourself looping over things in like numpy or pandas, you're probably doing it wrong.
One of the nice fun things I think is if you get really good at something, you'll start learning the things that you shouldn't do, but that are fun.
And some of Kevin's tips are you can do this.
It's sort of fun, but don't because it's confusing to other people. But
anyway, here's the trick.
Nice. It's neat that he's including those. It's clever,
but too clever sometimes. Cool. All right. So do you have any extras to share?
Oh, only that we just got finished with our first Python PDX West meetup last night,
and it was both exhausting and really fun. So thanks for helping out with that.
Yeah, you bet.
Good job putting it together.
It came out really well.
Everyone seemed to have a great time.
There was a totally good turnout.
I was blown away that it was actually basically sold out.
Not sold out, but booked out on its very first run, which is crazy.
And people out there listening, if you want to come and give a talk at the meetup and you're willing to find your way to Portland,
shoot a message to Brian or me and let us know.
That'd be cool.
Yeah, would be cool.
And then before anybody asks, it was not recorded.
So you have to be here.
How about you?
You got some news to share.
I got all sorts of stuff.
A few really quick things.
One, I upgraded to macOS Catalina yesterday.
And so far, so good.
No major problems.
All the Python things seem to be working.
So if you're wondering, I did hear that someone out there was having trouble with Miniconda.
I don't use Miniconda, so I have no idea about that.
Maybe do a Google search if that matters to you.
Also, Brian, I switched to working with Adobe Audition.
I've been using Audacity and GarageBand.
Finally broke down and paid the
$30 a month for Adobe Audition. And wow, is it worth it. It is so good. What has been wrong with
me to not do that? I just didn't want to learn new software. It's not so much about the money.
It's just like, I don't want to learn new hotkeys. I already know the hotkeys,
but it's so super good. The reason I bring it up on the show instead of after is if you hear like
weird artifacts or something odd in the audio, call attention to it, because there's all these dials and knobs that
can like do things like chop off the S's at the end of words if you turn them too far and stuff
like that. So hopefully things sound better. If they don't, let us know.
And then the two Python-related
things, really quick. Azure Databricks also is dropping support for Python 2. So just one more
brick to fall for legacy Python. The Python death clock continues to toll for those who hang
on to their Python 2. And the folks over on the VS Code team, and Rong Lu in particular, just
announced at PyCon, they just revealed a cool
new Jupyter UI, variable explorer, and IntelliSense stuff for basically running Jupyter inside
of VS Code.
So if you're a VS Code user and you care about Jupyter, check that out.
Very cool.
Yeah, absolutely.
Absolutely.
Well, that's it for the stuff.
I got a story for you, a joke maybe.
Yes, please.
This one comes to us from maybe an
unexpected space. A person on Twitter who goes by the Sarcastic Pharmacist sent us this
actually really good joke and a nice comment. And the theme is that it's hard to distinguish
between what is like super easy in programming and what is like nearly impossible, for people who are not doing the programming themselves.
So this is actually XKCD comic 1425.
It's got a programmer, a woman sitting there working at her desk,
and there's like a manager type who comes up and is issuing feature requests.
Okay?
Okay.
The manager, I'm going to think of one of the people from Office Space maybe,
comes over and says, when a user takes a photo with the app,
it should check whether they're in a national park.
And the woman says,
sure. Easy GIS lookup. Give me a few hours.
Oh yeah.
And it should also check whether the photo is a bird.
She says,
I'll need a research team and five years.
The subtitle is: in CS,
it can be hard to explain the difference between the easy and the virtually
impossible.
Yeah.
So there you go.
Yeah, I don't know.
That resonates a lot with me, at least.
Yeah, we'll probably get a bunch of the image people telling us that it's like five minutes now with all the new image libraries to do a bird.
Yeah, but that's now, right?
Like, we probably should, I should see if there's a date for this, just to be fair.
They don't have dates on these.
That's kind of funky.
All right.
Anyway, well, there's probably some algorithm that figures out the number of the XKCD and
maps it back to a date.
But yeah.
Yeah.
But that's funny.
Cool.
All right.
Well, great to chat with you as always.
Thank you.
Yep.
Bye.
Bye.
Thank you for listening to Python Bytes.
Follow the show on Twitter via at Python Bytes.
That's Python Bytes as in B-Y-T-E-S.
And get the full show notes at PythonBytes.fm.
If you have a news item you want featured,
just visit PythonBytes.fm and send it our way.
We're always on the lookout for sharing something cool.
On behalf of myself and Brian Okken,
this is Michael Kennedy.
Thank you for listening and sharing this podcast
with your friends and colleagues.