Python Bytes - #167 Cheating at Kaggle and uWSGI in prod
Episode Date: February 3, 2020. Topics covered in this episode: clize: turn functions into command-line interfaces; How to cheat at Kaggle AI contests; Configuring uWSGI for Production Deployment; Thinc: a functional take on deep learning, compatible with TensorFlow, PyTorch, and MXNet; pandas-vet; NumPy beginner documentation; Extras; Joke. See the full show notes for this episode on the website at pythonbytes.fm/167
Transcript
Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
This is episode 167, recorded January 29th, 2020.
I'm Michael Kennedy, and Brian Okken is away.
We miss you, Brian, but I have a very special guest to join me, Vicki Boykis.
Welcome, Vicky.
Thanks for having me.
Yeah, it's great to have you here.
I'm excited to get your take on this week's Python news.
I know you found some really interesting and controversial ones that we're going to jump into and that will be great.
Also great is Datadog. They're sponsoring this episode. So check them out at pythonbytes.fm
slash Datadog. I'll tell you more about them later. Vicky, let me kick us off with command
line interface libraries for lack of a better word. So back on episode 164, so three episodes ago,
I talked about this thing called Typer. Have you heard of Typer? T-Y-P-E-R?
I have not, but I've heard of Click. So I'm curious to see how this differs from that even.
Yeah, so this is sort of a competitor to Click. Typer is super cool because what it does is it uses native Python concepts to build out your
CLI rather than decorators where you describe everything. So for example, you can have a function,
and you just say this function takes a `name: str` to give it a type, or an int or whatever,
and then Typer can automatically use the type and the name of the parameters and stuff to generate like your help and the inbound arguments and so on.
So that's pretty cool, right?
Yeah, seems like a great excuse to start using type annotations if you haven't yet.
Yeah, exactly.
Very, very nice that it leverages type annotations, hence the name Typer, right?
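To make that concrete, here's a sketch of the mechanism, using only the standard library to show the metadata a tool like Typer can read off a plain function. The function name and parameters here are made up for illustration.

```python
import inspect

def greet(name: str, count: int = 1) -> None:
    """Say hello COUNT times."""
    for _ in range(count):
        print(f"Hello {name}")

# A tool like Typer reads exactly this metadata -- parameter names,
# type annotations, and defaults -- to generate --help text and
# argument parsing, no decorators required.
sig = inspect.signature(greet)
params = {p.name: (p.annotation, p.default) for p in sig.parameters.values()}
```

With Typer itself installed, `typer.run(greet)` would turn `name` into a required argument and `count` into an integer option, derived from the same signature.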
So our listeners are great.
They always send in stuff that we haven't heard about or, you know, like, I can't believe you didn't talk about this other thing.
So Marcello sent in a message and says, hey, you should talk about Clize, C-L-I-Z-E, which
turns functions into command-line interfaces.
So Clize is really cool.
And it's very similar in regard to how it works to Typer.
So what you do is you create functions, you give them variables,
you don't have to use the types in the sense that Typer does. But you have positional arguments,
and you have keyword only arguments. And you know, Python has that syntax that very few people use,
but it's cool if you want to enforce it, where you can say, here are some parameters to a function,
comma, star, comma, here are some more.
And the stuff after the star has to be addressed as a keyword argument, right? Yeah. So it leverages
that kind of stuff. So you can say, like their example says, here's a hello world function,
and it takes a name, which has a default of none, and then star comma, no capitalize is false,
and it gives it a default value. So all you got to do to run it is basically
import clize's run and call run on your function, and then what it does is it verifies those
arguments about whether or not they're required and then it'll convert the keyword arguments to
like dash dash this or that. So like dash dash no capitalize will pass true to no capitalize. If you omit it,
it'll pass, you know, whatever the default is, I guess. So false. So there's like positional ones
where you don't say the name, but then also this cool way of adding on these dash dash capitalize
and so on. So it seems like a really cool and pretty extensive library for building command
line interfaces. Yeah. So this seems like it'd be good if you have a lot of parameters that you have
to pass in.
I'm thinking specifically of some of the work
that you would do in the cloud, like in the AWS command line.
Yeah, yeah.
Or similar.
Yeah, for sure.
Another thing that's cool is it will take your doc strings
and use those as help messages.
Oh, that's neat.
Yeah, so you know, in like some editors,
you can type triple quote enter and it'll generate,
you know, here's the summary of the method, and then here's the arguments. You can put this in, or you can just write them out, of course, and then
here's the descriptions about each parameter. Those become help messages about each command in
there. So it's really nice, and I like how it uses just pure Python, sort of similar to Typer in that
regard that you don't put like three or four levels of decorators
on top of things and then reference other
parts of that. You just say, here's some
Python code. I want
to treat it as a command line interface.
clize.run. That is pretty cool.
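Here's a sketch of the hello-world function described above, close to the example in the Clize docs, runnable on its own without Clize installed; the comment shows how Clize would wire it up.

```python
def hello_world(name=None, *, no_capitalize=False):
    """Greets either the world or a specific person.

    :param name: If specified, greet this person instead of the world.
    :param no_capitalize: Don't capitalize the given name.
    """
    if name:
        if not no_capitalize:
            name = name.title()
        return f"Hello {name}!"
    return "Hello world!"

# With clize installed, `from clize import run; run(hello_world)` turns
# this into a CLI: `name` becomes a positional argument, the keyword-only
# `no_capitalize` becomes a --no-capitalize flag, and the docstring
# lines above become the --help text.
```

Note the bare `*` in the signature: everything after it is keyword-only, which is the Python syntax mentioned above that Clize leans on.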
So there's now a lot of choices
if you want to do command line interfaces.
Yeah, yeah. Definitely. And Click is
good and it's very popular, and argparse
as well, but I'm kind of a fan of these pure Python ones
that don't require me to go do a whole bunch of extra stuff.
So yeah, definitely loving that.
You know what?
I bet that Kaggle's not loving
what you're talking about next.
Before we get into this.
Well, I think they might be, but...
Yeah, we'll see.
Okay.
Tell us about Kaggle and what the big news here is.
Yeah, so there was a dust up at
Kaggle a couple of weeks ago. So just as a little bit of background, Kaggle is a platform that's now
owned by Google that allows data scientists to find data sets, to learn data science. And
most importantly, it's probably known for letting people participate in machine learning competitions.
That's kind of how it gained its popularity and notoriety. Yeah, that's how I know it. Yep. And so people can sharpen their data science and modeling skills on it.
So they recently, I want to say last fall, hosted a competition that was about analyzing
pet shelter data. And this resulted in enormous controversy. So what happened is there's this
website that's called petfinder.my that helps people find pets to rescue in Malaysia from shelters.
And in 2019, they announced a collaboration with Kaggle
to create a machine learning algorithm to predict
which pets would be most likely to be adopted
based on the metadata descriptions on the site.
So if you go to petfinder.my, you'd see that they'll have a picture of the pet
and then a description, how old they are and some other attributes about them.
Right. Were they vaccinated or things like that, right? Sort of, you might think, well,
if they're vaccinated or they're neutered or spayed, they may be more likely to be adopted,
but you don't necessarily know, right? So that was kind of some, like, what are the important
factors was this whole competition, right? Yeah, the goal was to help the shelters write better descriptions so that pets would be adopted more quickly.
So after several months, they held the competition for several months.
And there was a contestant that won and he was previously what was called a Kaggle Grandmaster.
So he'd won a lot of different stuff on Kaggle before and he won $10,000 in prize money.
But then what happened is they started to validate all of his data.
Because when you do a Kaggle competition, you then submit all of your data and all of your results and your notebooks and your code.
Like how you trained your models and stuff like that, right?
Yeah, all of that stuff.
And then what happened was Petfinder wanted to put this model
into production. So you initially have something like a Jupyter or a Colab notebook in this case,
and the idea is that now you want to be able to integrate it into the Petfinder website
so they can actually use this predict these predictors to fine-tune how they post the
content and so when a volunteer who who was Benjamin Minicoffer,
offered to put the algorithm into production, and he started looking at it, he found that there was
a huge discrepancy between the first and second place entrants in the contest. And so what happened
was, so a little to get more into the technical aspect, the data they gave to the contestants
asked them to predict the speed at which a pet would be adopted from one to five and included some of the features you talked about,
like animal type, breed, coloration, all that stuff. The initial training set had 15,000 animals.
And then after a couple months, the contestants were given 4,000 animals that had not been seen
before as a test of how accurate they were. So what the winner did was he actually
scraped basically most of the website so that he got that 4,000 set, the validation set also.
And he had the validation set in his notebook. So basically what he did was he used the MD5 library
to create a hash for each unique pet. And then he looked up the adoption score for each
of those pets, basically when they were adopted from that external data set. And there were about
3,500 that had overlaps with the validation set. And then he did a column manipulation in pandas
to get at the hidden prediction variable for every 10th pet, not every single pet,
but every 10th pet. So it didn't look too obvious.
Right, so he gave himself like a 10% head start or advantage or something like that.
Exactly.
And he replaced the prediction that should have been generated by the algorithm with the actual value.
And then he did a dictionary lookup between the initial MD5 hash and the value of the hash.
And this was all obfuscated in a separate function that happened in his data.
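As a rough sketch of the mechanism described here, not the actual competition code, the lookup could be as simple as the following; every field name and value is invented for illustration.

```python
import hashlib

# Hypothetical reconstruction of the cheat described above: hash each
# pet's identifying fields, then look up the true adoption speed from
# externally scraped data. All names and data here are made up.
def pet_hash(breed: str, color: str, age_months: int) -> str:
    key = f"{breed}|{color}|{age_months}"
    return hashlib.md5(key.encode()).hexdigest()

# Scraped from the live site: hash -> true adoption speed (1-5)
scraped = {pet_hash("tabby", "brown", 6): 2}

def predict(i: int, breed: str, color: str, age_months: int, model_guess: int) -> int:
    # Only swap in the true label for every 10th pet, so the boost
    # doesn't look too obvious in the leaderboard score.
    if i % 10 == 0:
        return scraped.get(pet_hash(breed, color, age_months), model_guess)
    return model_guess
```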
Wow. And so they must have been looking at this going,
what does the MD5 hash of the pet attributes have to do with anything?
You know what I mean, right?
The hashes are meant to obscure stuff, right?
Right, yeah.
So what was the fallout?
So the fallout was this guy worked at H2O.ai.
And so he was fired from there.
And Kaggle also issued an apology where they explained exactly what happened.
And they expressed the hope that this didn't mean that every contest going forward would
be viewed with suspicion, and they called for more openness and collaboration going forward.
Wow.
And it was an amazing catch.
Yeah, that's such a good catch.
I'm so, so, so glad that
Benjamin did that. I've got the whole deal here. Now, did Kaggle actually end up paying him the
$10,000 before they caught it? Is there like some sort of waiting period? Unfortunately, I think the
money had already been disbursed by that point. Yeah, I can easily see something like, well, you know, the prize money will be sent out after a very deep review.
It may change the timing of that for sure in the future.
Who knows?
But wow, that's crazy.
Do you know why he was fired?
I mean, they're just like, we don't want, you know, I mean, H2O.ai, they're kind of a,
we'll-help-you-with-your-AI-story company.
So I guess, you know,
they're probably just like, we don't want any of the negativity of that on our product.
Yeah, I think that's essentially it. And it was a pretty big competition in the data science
community. And I think also once they'd started to look into it in other places, previously,
he had talked about basically scraping data to win competitions as well.
So all of that stuff started to come out as well.
I think they wanted to distance themselves.
Yeah, I can imagine.
Yikes.
Okay, well, thank you for sharing that.
Now, before we get to the next one, let me tell you about this week's sponsor, Datadog.
They're a cloud scale monitoring platform that unifies metrics, logs, and traces.
Monitor your Python applications in real time, find bottlenecks with detailed
flame graphs, and trace requests as they travel across service boundaries. Their tracing client
auto-instruments popular frameworks like Django, asyncio, and Flask, so you can quickly get started
monitoring the health and performance of your Python apps. Do that with a 14-day free trial,
and Datadog will send you a complimentary t-shirt.
Cool little Datadog t-shirt.
So check them out at pythonbytes.fm slash Datadog.
This next one kind of hits home for me because I have a ton of services and a lot of servers
and websites and all these things working together running on uWSGI, micro-whiskey.
And I've had it running for quite a few years.
It's got a lot of traffic, you know, we do
like, I don't know, 14 terabytes of traffic a month, or maybe even more than that. So quite a
bit of traffic going around these services and whatnot. So it's been working fine. But I ran
across this article by the engineers at Bloomberg. So they talked about this thing called Configuring uWSGI for Production Deployment.
And I actually learned a lot from this article. So I don't feel like I was doing too many things
wrong, but there was a couple of things I'm like, oh yeah, I should probably do that. And other
stuff just that is really nice. So I just want to run you through a couple of things that I learned.
If you want to hear more about how we're using uWSGI, you can check that out on
TalkPython 215. Dan Bader and I swap stories about how we're running our various things,
you know, TalkPython training and realpython.com and whatnot. So this is guidance from Bloomberg's
engineering structured products application group. That's quite the title. And they decided to use
uWSGI because it's really, you know, good for performance, easy to work with. However, they said
as uWSGI is maturing, some of the defaults that made sense when it was new,
like in 2008, don't make sense anymore. The reason is partly just because the way people use
these sites is different, or these
servers, it's different. For example, putting proxies up in front of uWSGI with, say, Nginx,
that used to not be so popular. So they made these defaults built into the system that maybe don't
make sense anymore. And so what they did is they said, we're going to
go through and talk about all the things that we're going to override the defaults for, and why. Unbit, the developer of uWSGI, is going to
fix all these bad defaults in the 2.1 release. But right now it's 2.0 as of this recording. So
you're going to have to just, you know, hang in there or apply some of these changes. Now,
I do want to point out one thing. When I switched on a lot of these, I did them one at a time. And the way you get it to reload its config is you say, relaunch the
process, restart the process with systemctl, like a daemon management thing from Linux. And
one of their recommendations is to use this flag, die-on-term, which is for it to die on a different
signal that it receives. And for whatever reason,
maybe I'm doing it wrong. But whenever I turn that on, it would just lock up and it would take
about two minutes to restart the server because it would just hang until it eventually timed out.
It was like forcefully killed. So that seems bad. So I'm not using that. But I'll go quickly over
the settings that I use that I thought were cool here. So there's you've got these complicated
config files.
If you want to make sure everything's validated,
you can say strict equals true.
That's cool.
That will verify that everything that's typed in the file
is accurate and is valid
because it's kind of forgiving at the moment.
Master equals true is a good setting
because this allows it to create worker processes
and recycle them based on number of requests and so on.
Something that's interesting,
I didn't even realize you could do. Maybe tell me if you knew this was possible in Python apps.
You can disable the GIL, the global interpreter lock. You can say, you know what, for this
Python interpreter, let's not have a GIL. Wow. How does that work?
Yeah. Well, it's, I mean, people talk about having no GIL. It's like, oh, you can do all
this cool concurrency and whatnot. But what it really means is you're basically guaranteed you can only have one thread.
So if you try to launch, say, a background job
on a uWSGI server
and you don't pass enable threads as true,
it's just going to not run
because there's no GIL and there's no way to start it.
So that's something you want to have on.
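For reference, the flags covered so far might look like this in a uwsgi.ini; this is a sketch based on the option names in the uWSGI docs, not Bloomberg's full configuration.

```ini
[uwsgi]
strict = true           ; validate the config; unknown options become errors
master = true           ; master process manages and recycles workers
enable-threads = true   ; initialize the GIL so background threads can run
; die-on-term = true    ; Bloomberg's recommendation, but see the hang caveat above
```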
Vacuum equals true.
This one I had off and I turned it on.
Apparently, this cleans up like temporary files and so on. Also single-interpreter. It used to be that uWSGI was more of an
app server that might have different versions of Python and maybe Ruby as well. And this will just
say, no, no, it's just the one version. A couple other ones. You can specify the name that shows
up in like top or glances. So it'll say like, you can give it, say, your website name, and it'll say things like
that thing, worker process one, or that thing, master process, or whatnot, and so there's just
a bunch of cool things in here with nice descriptions of why you want these features,
so if you are out there and you're running uWSGI, give this a quick scan. It's really
cool. Now, this next one also is pretty neat. So this one comes from the people who did spaCy, right?
What do they got going on?
Yep, that's right.
So this was just released a couple days ago, and it's called Thinc,
and they bill it as a functional take on deep learning.
And so basically, if you're familiar with deep learning,
there's kind of two big competing frameworks right now,
TensorFlow and PyTorch, and MXNet is also in
there. So the idea of this library is that it abstracts away some of the boilerplate that you
have to write for both TensorFlow and PyTorch. PyTorch has a little bit less; TensorFlow with
Keras on top also has a little bit less, but you end up writing a lot of the same kind of stuff.
And there's also some stuff that's obfuscated away from you,
specifically some of the matrix operations that go on under the hood. And so with Thinc,
it already runs under the covers of spaCy, which is an NLP library. So what the team did
was they surfaced it so that other people could use it more generically in their projects. And so it has
that favorite thing that we love. It has type checking, which is particularly helpful for
tensors when you're trying to get stuff and you're not sure why it's not returning things.
It has classes for PyTorch wrappers and for TensorFlow. And you can intermingle the two
if you want to, if you have two libraries that bridge things. It has deep support for NumPy structures, which are the kind of the
underlying structures for deep learning. It operates in batches, which is also a common
feature of deep learning projects. So they process features and data in batches. And then it also,
sometimes a problem that you have with deep learning
is you are constantly tuning hyperparameters
or the variables that you put into your model
to figure out how long you're going to run it for,
how many training epochs you're going to have,
what size your images are going to be.
Usually those are clustered
at the beginning of your file
in kind of like a dump or a dictionary or whatever.
It has a special structure to handle those as well.
So it basically hopes to make it easier and more flexible to do deep learning,
especially if you're working with two different libraries
and it offers a nice higher level abstraction on top of that.
And the other cool thing is they have already released all the examples and code
that are available in Jupyter notebooks on their GitHub
repo. So I'm definitely going to be taking a closer look at that. Yeah, that's really cool.
They have all these nice examples there and even buttons to open them in Colab, which is, yeah,
that's pretty awesome. This looks great. And it looks like it's doing some work with FastAPI
as well. I know they hired a person who's maintaining FastAPI, which is cool. Also their
Prodigy project. So yeah, this looks like a really nice library that they put together.
Cool. And Ines has been on the show before, Ines from Explosion AI, here as a guest co-host as well.
Super cool. That's awesome. Yeah. This next one I want to talk about, you know, I'd love to get
your opinion because you're more on the data science side of things, right? Yeah, yeah. So this next one, I want to tell folks about, this is another one from listeners.
You know, we talked about something that validates pandas, and people are like, oh, you should also
check out this thing. So this comes from Jacob Deppen. Thank you, Jacob, for sending this in. And so
it's pandas-vet, and what it is is a plugin for Flake8 that checks pandas code, and it's this opinionated take
on how you should use pandas. They say one of the challenges is that if you go and search
on Stack Overflow or other tutorials or even maybe video courses, they might show you how to do
something with pandas, but maybe that's a deprecated way of working with pandas or some
sort of old API, and
there's a better way. So the idea is to make pandas more friendly for newcomers
by trying to focus on best practices and saying, don't do it that way, do it this way. You know, read_csv,
it has so many parameters, what are you doing? Here's how you use it. Things like that. So this
linter was created, the idea was sparked, by a talk by
Ania Kapuścińska, sorry, I'm sure I blew that name bad, at PyCascades 2019 in Seattle, Lint
Your Code Responsibly. So I'll link to that as well. So it's kind of cool to see the evolution, like
Ania gave a talk at PyCascades, and then this person's like, oh, this is awesome.
I'm going to actually turn this into a Flake8 plugin and so on.
What are your thoughts on this?
Do you like this idea?
Yeah, I'm a huge fan of it.
I think in general, there's been kind of like this, I wouldn't want to say culture war about whether notebooks are good or bad.
And there was recently a paper released, I want to say, not a paper, but a blog post a couple days ago about how you should never use notebooks. There was a talk by Joel Grus last year about all the things that notebooks are bad
with. I think they have their place. And I think this is one of the ways you can have, I want to
say, guardrails around them and help people do things. I like the very opinionated warning that
they have here, which is that DF is a bad variable name.
Be kinder to yourself, because that's always true.
You always start with the default of DF, and then you end up with 34 or 35 of them.
I joke about this on Twitter all the time.
But it's true.
So that's a good one.
The .loc and the .ix and .iloc is always a point of confusion.
So it's good that they have that.
And then the pivot_table one is preferred
to pivot or unstack. So there's a lot of places. So pandas is fantastic, but there's a lot of these
places where you have old APIs, you have new APIs, you have people who usually are both new to Python
and programming at the same time coming in and using these. So this is a good set of guardrails
to help write better code if you're writing it in a notebook. Oh, yeah, that's super cool.
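Since .loc versus .iloc came up, here's a small illustrative snippet with a hypothetical DataFrame showing where they diverge.

```python
import pandas as pd

# A tiny hypothetical frame with non-default integer index labels,
# which is exactly where .loc and .iloc get confused.
pets = pd.DataFrame(
    {"breed": ["tabby", "husky", "beagle"], "speed": [2, 4, 1]},
    index=[10, 20, 30],
)

by_label = pets.loc[20, "speed"]   # label-based: the row labeled 20
by_position = pets.iloc[1, 1]      # position-based: second row, second column
# Both happen to hit the same cell here, but pets.loc[1] would be a KeyError.
```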
Do you know, is there a way to make Flake8 run in the notebook automatically?
I don't know.
You probably can, yeah.
It probably wouldn't be too hard.
Yeah, but I don't know.
Yeah, but I think it's interesting that you ask that because that's not generally that's something you would do with notebooks.
But maybe this kind of stuff will push it in the direction of being more like what we consider quote unquote
mainstream or just web dev or backend programming. Yeah, cool. Well, I definitely think it's nice if
I were getting started with pandas. Give this a check. You also if you're getting started with
pandas, you may also be getting started with NumPy, right? Yep. So NumPy is the backbone of
numerical computing in Python. So I talked about TensorFlow, PyTorch,
machine learning in the previous stories.
All of that kind of rests on the work
and the data structures that NumPy created.
So pandas, scikit-learn, TensorFlow, PyTorch,
they all lean heavily,
if not directly depend on the core concepts,
which include matrix operation
through the NumPy array,
also known as an ndarray.
The problem with ndarrays is they're
fantastic, but the documentation was a little bit hard for newcomers. So Anne Bonner wrote a whole
new set of documentation for people that are both new to Python and scientific programming,
and that's included in the NumPy docs themselves. Before, if you wanted to find out what arrays
were, how they
worked, you could go to the section, and you could find out the parameters and attributes and all the
methods of that class, but you wouldn't find out how or why you would use it. And so this
documentation is fantastic because it has an explanation of what they are. It has visuals
of what happens when you perform certain operations on arrays. And it has a lot of really great
resources if you're just getting started with NumPy.
I strongly recommend,
if you're doing any type of data work in Python,
especially with pandas,
that you become familiar with NumPy arrays.
And this makes it really easy to do so.
Yeah, nice.
It has things like,
how do I convert a 1D array to a 2D array?
Or what's the difference between a Python list
and a NumPy array, and
whatnot? Yeah, it looks really helpful. I like the why, that's often missing. Yeah, you know, you'll see,
like, you do this, use this function for this, and here are the parameters. Sometimes they'll describe
them, sometimes not, you know. And then it's just like, well, maybe this is what I want. Stack Overflow
seemed to indicate this is what I want. I'm not sure, I'll give it a try. Right. So I like the little extra guidance behind it. That's great.
Yeah, it does a really good job of orienting you.
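For instance, the 1-D to 2-D question mentioned above mostly comes down to reshaping; a minimal sketch:

```python
import numpy as np

a = np.arange(6)        # 1-D array: [0 1 2 3 4 5]
b = a.reshape(2, 3)     # 2-D: two rows of three
c = a[np.newaxis, :]    # alternative: add a leading axis, shape (1, 6)
```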
Cool. All right. Well, Vicky, those are our main topics for the week. But we got a few extra
quick items just to throw in here at the end. I'll let you go first with yours.
Sure. This is just a bit of blatant self-promotion about who I am. So I am a data scientist on the side.
I write a newsletter that's called Normcore Tech, and it's about all the things that I'm not seeing
covered in the mainstream media. And it's just a random hodgepodge of stuff. It ranges from anything
like machine learning, how the data sets got created initially for NLP. I've written about
Elon Musk memes. I wrote about the recent raid of the NGINX office
in great detail and what happened there. So there's a free version that goes out once a week
and paid subscribers get access to one more paid newsletter per week. But really,
it's more about the idea of supporting in-depth writing. So it's just vicki.substack.com.
Cool. Well, that's a neat newsletter and I'm a subscriber. So very, very
nice. I have a quick one for you all out there, and maybe two, actually. One, pip 20.0 was released.
So not a huge change. Obviously, pip is compatible with the stuff that it did before and whatnot,
but it does a couple of nice things. And I think this is going
to be extra nice for beginners, because it's so challenging. You go to a tutorial, and it says,
all right, the first thing you got to do to run whatever, I want to run Flask, or I want to run
Jupyter, is you say pip install flask or pip install jupyter. And it says you do not have
permission to, you know, write to wherever those are going to install, right?
Depending on your system.
And so if that happens now in pip 20,
it will install as if dash dash user was passed, into the user profile.
That's cool, huh?
That's really neat.
Yeah, yeah.
So that's great.
And wheels built from Git requirements are now cached, and a couple of other things.
So yeah, nothing major, but nice to have that there. And then also, I'd previously gone on a bit of a rant saying I was
bugged that Homebrew, which is how I put Python on my Mac, was great for installing Python 3, up until 3.7.
It's even better because if you just say brew install python, that means Python
3, not legacy Python, which is great.
But that sort of stopped working.
It still works, but it installs Python 3.7.
So that was kind of like a sad face.
But I'm sorry, I forget the person who sent this over on Twitter,
but one of the listeners sent in a message that said,
you can brew install python@3.8, and that works.
Why?
That's nuts.
Is it safe to brew again?
I've just started downloading directly from python.org.
I know.
Exactly.
Exactly.
So I'm trying it today and so far it's going well.
So I'm really excited that on Mac OS, we can probably get the latest Python.
Even if you got to say the version, I just have an alias that re-aliases what python
means in my .zshrc file.
And it'll just say, you know, if you type Python,
that means Python 3.8 for now.
Anyway, I'm pretty, yeah, fingers crossed.
So it looks like it's good.
And that's nice.
Hopefully it just keeps updating itself.
I suspect it will, at least within the 3.8 branch.
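For reference, the alias trick mentioned above is a one-liner in ~/.zshrc; adjust the version to whatever Homebrew installed.

```shell
# ~/.zshrc: make plain `python` mean Homebrew's Python 3.8 build
alias python='python3.8'
```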
All right, you ready to close this out with a joke?
Yeah.
Yeah.
So I'm sure you've heard the type of joke,
you know, a mathematician
and a physicist walk into a bar, and, right, well, some weird thing about numbers and space ensues.
So this one is kind of like that one. It's about search engine optimization. So an SEO expert
walks into a bar, bars, pub, public house, Irish pub, tavern, bartender, beer, liquor, wine, alcohol, spirits, and so on.
It's bad, huh?
I like that.
That's nice.
Yeah, it's so true.
You remember how blatant websites used to be like 10 years ago?
They would just have like a massive bunch of just random keywords at the bottom.
Just, you know, like it seems like.
Yeah, and sometimes they would be in white-on-white text.
Yes, exactly, white on white.
But then if you highlight it, it would be like a whole three paragraphs.
And here's where the SEO hacker went.
I don't think that works so well anymore.
But yeah, it's a good joke nonetheless.
And Vicky, it's been great to have you here.
Thanks so much for filling in for Brian
and sharing the data science view of the world with us.
Thanks for having me. You bet. Bye.
Thank you for listening to Python Bytes.
Follow the show on Twitter via
at Python Bytes. That's Python Bytes as in
B-Y-T-E-S. And get the full
show notes at PythonBytes.fm.
If you have a news item you want featured,
just visit PythonBytes.fm and send it
our way. We're always on the lookout for
sharing something cool. On behalf of myself
and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your
friends and colleagues.