Python Bytes - #20 Finding similar but not identical images in 128 bits via Python
Episode Date: April 5, 2017
See the full show notes for this episode on the website at pythonbytes.fm/20...
Transcript
Hello and welcome to Python Bytes. This is episode 20 where we are delivering Python news and
headlines directly to your earbuds. I'm Michael Kennedy. And I'm Brian Okken. And we've got a
bunch of stuff lined up for you today. I'm really excited to share, especially this first article
that you chose, Brian, which is so clever. Before we do, I want to say thank you. Thank you to
Rollbar, who's back to sponsor a bunch more Python Bytes. And we'll
talk more about Rollbar later, but thanks, Rollbar. That's awesome. Yep. So we were just talking about
pictures. Like I have many gigabytes of pictures. And if, if you ran a website that accepted uploads
in large numbers of pictures, how do you deal with all that data? Especially there's probably
a lot of duplicate data, right? I'm not sure. And so this is an interesting article.
There's an article from Jetsetter.com, and they're an invitation-only travel community.
But the article is Duplicate Image Detection with Perceptual Hashing in Python.
And that actually sounds more—
Perceptual hashing.
That's awesome.
Perceptual hashing.
It's awesome. Perceptual hashing is awesome.
And the idea is they've got, I mean, the, the site's got a bunch of pictures of different
places around the world and they don't want pictures that are mostly close to each other.
I mean, for family photos, you've got a ton that are close to each other, but
there are a lot of cases where you don't want things that are almost the same. Right. Like pictures of hotels or pictures of a marina to say, here's the view out of the hotel.
Like if they're going to have a listing for some location, some hotel, and they ask people to upload photos, they don't need like 100 of them from this one view.
And if you check out Jetsetter.com, it is an intensely photo-heavy site.
Like I'm pretty impressed with the number of
photos on that page. With the idea of perceptual hashing, I was definitely interested
in reading about this, and I expected it to be a fairly complicated algorithm, but it's actually
ingenious. They use Python and shrink the image down to just a nine-by-nine
square of gray values. I don't
get how that's enough information, but it is apparently enough to determine whether or
not an image is close to another image. And they do a delta. I'm not going to be able to explain it.
Can you explain that much better? I can try. I mean, when I read that they take, you know, a five-megapixel image and generate a 128-bit
hash, and that the hash means something, like it means uniqueness, or actually it means similarity,
which is actually more important, I was like, okay, I have to figure this out. And I guess
what they do is they take a large image, and they like average it down to a nine by nine,
or they say for larger images, like a 17 by 17 image.
And to determine the similarity, maybe somebody's off by five feet to one side or the other when they take
a picture of a hotel or a view or something. But if you kind of average it down to that nine
by nine, that's where the similarities kind of collapse into those grids. And then they run
an algorithm on that grayscale grid, right?
Yeah, and then the interesting thing is that, of course,
it's clear to me that you could come up with a hash algorithm for an image,
but the difference in the hashes is enough to tell you how close the image is.
Yeah, and it's actually the opposite that really blows me away,
is like two similar images that are not the same generate the same hash.
That's what's the magic.
Like that totally blows my mind.
I could see like, well, obviously hash is different.
Images are different.
But images are similar, not the same.
Hash the same.
That blows me away.
Yeah.
And I like it that it's not that complicated of an algorithm, and it's a fun read. Yeah, you know, I think there's a couple levels
of interesting in this article that you brought up, and one of them I think is really interesting:
when I first heard that, I thought, okay, one, this is going to be super hard, super computational.
Two, maybe this is like machine learning or something like that, like
two images given to an AI, like a deep learning neural network or something, that says, yeah, these are
sufficiently similar in ways that people don't really understand, but with
magic on GPUs and lots of, you know, neurons, it works out somehow.
But the fact that it's really, really a simple algorithm is what's, I think, kind of special
about it, right?
It's like, hey, there's still lots of places to be clever and not just throw AI plus GPUs
at a thing.
Yes, definitely.
Yeah.
And not only that, you get to take it with you, right?
It's available on GitHub.
Yeah, they do have it.
What is it?
P-Y-B-K-tree?
pybktree, whatever that means.
Okay, awesome.
I'm sure it's part of the algorithm.
Excellent.
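[Editor's sketch] The idea discussed in this segment can be illustrated with a small difference-hash example. This is a minimal sketch, not the code from the article: it assumes the image has already been shrunk to a 9x9 grid of gray values (here a plain list of lists), and the names dhash_bits and hamming_distance are made up for illustration. Each pixel is compared with its right-hand neighbor (64 bits) and with the pixel below it (64 bits), which gives 128 bits; two hashes are then compared by counting the bits that differ.

```python
def dhash_bits(grid):
    """Return a difference hash (as an int) for a square grid of gray values.

    For a 9x9 grid this yields 64 row bits plus 64 column bits = 128 bits.
    """
    size = len(grid) - 1  # 8 comparisons per row and per column for a 9x9 grid
    row_hash = 0
    col_hash = 0
    for y in range(size):
        for x in range(size):
            # Row bit: 1 if brightness increases toward the right neighbor.
            row_hash = (row_hash << 1) | (grid[y][x] < grid[y][x + 1])
            # Column bit: 1 if brightness increases toward the pixel below.
            col_hash = (col_hash << 1) | (grid[y][x] < grid[y + 1][x])
    return (row_hash << (size * size)) | col_hash


def hamming_distance(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# A synthetic 9x9 "image": brightness rises to the right and downward.
original = [[x * 10 + y for x in range(9)] for y in range(9)]

# A near-duplicate: the same grid with one pixel darkened.
near_copy = [row[:] for row in original]
near_copy[4][4] = 0

h1 = dhash_bits(original)
h2 = dhash_bits(near_copy)
print(hamming_distance(h1, h2))  # 2 of the 128 bits differ
```

A small distance suggests the two images are near-duplicates; where to put the cutoff is a tuning choice this sketch does not try to answer.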
So keeping with open source projects that you can go find and just grab and do cool things with,
one of the listeners pointed me towards, pointed us towards Google Open Source.
In fact, it was the guy from Google Fire, Python Fire, which we'll talk more about later.
But he has one of the projects there.
And on Google Open Source, they've basically created like a listing directory of all of the open source projects. Now, many of
the projects still live on GitHub, but this is like a place where you can go search and analyze and
discover projects from Google. And what's cool is you can sort by language, so show me the Python
projects, show me the C++ projects, whatever. So I grabbed six or seven interesting projects. I just
wanted to run them down for you, Brian. Okay, yeah. Yeah, so one of them is subprocess32,
a reliable subprocess module for Python 2.
Apparently, subprocess, the built-in,
is not reliable for Python 2.
I don't know, but I didn't know that either.
That's partly why it's interesting to me,
but also, you know, there it is.
That's cool.
Grumpy, we've talked about Grumpy before.
Grumpy is Python on Go instead of Python on CPython.
Yeah, that's a good one.
Python Fire, of course.
Python Fire, of course, like I pointed out.
That's a way to take any Python object or module and turn it into a command line interface.
There's a Python client for Google Maps services.
So if you want to consume Google Maps from Python, do it. There's hyou, H-Y-O-U, a Python interface for manipulating Google spreadsheets.
That's cool, right?
Okay.
I'm going to have to try that out.
That's neat.
Yeah.
I mean, I've seen the stuff for working with docx and xlsx files, the Microsoft Office ones,
but I didn't know about the Google spreadsheet.
So this is cool.
Another thing that's always tricky for me is working with OAuth, right? There's always this, like, I've got some
app, the app needs to go open a browser window, and there's some sort of
funky callback and things happen. And so one of the places that's especially
challenging, I think, is over a command line interface. Well, there's oauth2l, I
think it's L. And what that is, is it's
a command line tool to get an OAuth token.
Just let that sink in for you.
Okay.
So I want to log in as Google.
I can do that like through my app.
Like I could basically create a shell script that through the CLI gets an
OAuth token from the user.
That's pretty interesting.
Okay.
And also I talked about the Google maps API.
Like that sounds like that's something that's
really hard to like unit test or test at all without actually going to Google. So there's
a mock Maps API. So a small little App Engine app for testing, like basically mocking out
Google Maps API. And last but not least, TensorFlow, the amazing deep learning, machine learning stuff.
That's about 50% Python, 50% C++, and a lot of GPUs in action there.
And I don't know where I read this, but I think that this Google open source location is not just all projects.
It's projects that they consider still active.
Okay. Yeah, that's cool.
I mean, obviously you don't want just like a dumping ground, right?
Yeah, cool.
I mean, everything there looked pretty neat and fresh, so it's good.
It's a fairly neat interface too with, I guess, with panels and stuff. Yeah, it's worth checking out.
Okay, what do we got next?
Oh, next is me.
Yeah, more machine learning type stuff.
Yeah, so there's an article from Jason Brownlee called,
and I just clicked away, How to Handle Missing Data with Python.
And this is something that I definitely deal with, with measurement values at work.
But the gist of it is a lot of times you're dealing with large or small data sets, and some of the values are missing.
And there's a whole bunch of different ways you can deal with missing
data. One of the things he talks about is replacing. You know, you have to know
what the magic number is, because some data collection will fill in a zero, maybe, if there's no data, or
some other known number, but all your math is going to get messed up if you actually just leave
that there. So there's a couple of ways to get around it.
One of the ways he lists is using NaN, not-a-number values. And I think pandas can deal with
that correctly and not average those in. Yeah. What I think is really nice about it is like,
I could be given a CSV file or some sort of set of data, and I could like work my way through it and maybe find the bad
data and fill it in potentially. But his fixes are like, you run this one line in pandas and magic
happens and it's better, right? It's like the fix is so much better than the fixes that I would come
up with. Yeah. And I do like that he's talking about different ways to deal with it with NumPy,
even without pandas also, because you might not be
using pandas. But, like, one of the ways you would do it with any math package really would
be to, oh, I guess I don't know how to do that, actually. Never mind. You'd somehow
have to find all of the missing values anyway and fill them in. Like, one of the ways is,
if you're calculating an average, calculate the average of everything else and then fill in the blanks with the average number.
Right.
I guess it depends on what you're going to do.
Are you going to average it?
Are you going to max it or min it?
You could, like, push that through, right?
Yeah.
Yeah.
Interesting.
The best solution definitely, I think, is using the not a number and letting the libraries take care of it for you. But I wanted to bring this up partly because anybody that's working with data collection
and doing math with that has to deal with the fact that sometimes there's not numbers there
and you have to deal with it.
Okay. Awesome.
He's from machinelearningmastery.com, I think,
and he's got just a ton of cool stuff going on over there.
It's not just this one article. So if you're into these kinds of things, definitely check it out.
Yeah, it looks good.
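[Editor's sketch] The "mark the magic number as missing, then fill in the average of everything else" idea from this segment can be sketched in plain Python. This is not code from the article, which works with pandas and NumPy; the function names and the sample readings are hypothetical, and 0 is assumed to be the placeholder the data collection wrote when there was no measurement.

```python
import math


def mark_missing(values, sentinel):
    """Replace the placeholder value with NaN so it is not mistaken for data."""
    return [math.nan if v == sentinel else float(v) for v in values]


def mean_ignoring_nan(values):
    """Average only the values that are actually present."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


def fill_with_mean(values):
    """Fill each missing entry with the mean of the present entries."""
    mean = mean_ignoring_nan(values)
    return [mean if math.isnan(v) else v for v in values]


# Hypothetical readings where 0 means "no measurement was taken".
readings = [2.0, 0, 4.0, 0, 6.0]

naive_mean = sum(readings) / len(readings)  # 2.4, dragged down by the zeros
marked = mark_missing(readings, sentinel=0)
filled = fill_with_mean(marked)

print(naive_mean)                 # 2.4
print(mean_ignoring_nan(marked))  # 4.0
print(filled)                     # [2.0, 4.0, 4.0, 4.0, 6.0]
```

Libraries such as pandas do the same kind of thing in far fewer lines; the point here is only to show why leaving the placeholder in skews the math.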
Okay. So what's up next is the Hug REST framework. But before we get to them,
I want to give Rollbar a hug. Rollbar is awesome. As people know,
I've been using them for a long time on the websites, and the websites are getting more and more traffic. And recently, I'm not sure
whether it was a wise decision or not because I'm really busy with other stuff, but I just got
really frustrated with the way my servers were working, the way I could sort of move them around,
and performance and stuff. So one day I just woke up and said, that's it. Converting it all to MongoDB. And so that was last week. And that took like three days of
rewriting all my sites to Mongo, which I really think Mongo is the right choice. And I'm just
loving the way it's working now. But that was pretty serious, like take the guts out of all my
web apps and stick in a new set of guts that are similar but not entirely compatible.
I spent a little time with Rollbar and they helped me out,
finding a few problems, like where maybe types used to be strings
and I could compare them, but now one was no longer a string and they didn't compare the same.
So I got weird errors, but Rollbar made it super easy to track that down.
So if you want to have reliability and most importantly awareness of the state of your apps, plug in Rollbar to your web apps.
You can use it in Pyramid, Flask, Django, whatever.
Just plug it in and you'll get notifications right away.
So be sure to visit rollbar.com slash pythonbytes and you'll get a special offer to get started there.
And I bet that you definitely noticed those messages, but I didn't even notice you were
mucking with things.
And I'm pretty sure that nobody else did, or very few people did either.
Yeah, that's true.
And thank you for saying that.
But I actually know how many people ran into problems, right?
There was a couple, but I got an email from a couple of people saying, hey, I had this
problem with your app.
I'm like, I know, but I didn't know your email address,
but I know what your problem was and it's already fixed.
I just couldn't contact them
because they hadn't actually created an account yet.
So it was really nice to be able to just say,
yeah, actually the problem you're telling me
is already fixed.
I just couldn't communicate that back to you.
Really sorry about that.
It's awesome.
You seem like a big team then because of that.
Oh yeah, definitely.
It's all the folks here in the cubicle farm.
We're busy.
You know, one of the next things that I want to do is build some nice APIs.
And I think it's really an interesting time for the web in Python.
There's a lot of flowers blooming, if you will.
Right.
We've got Pyramid, Django, Flask.
Those guys are all doing super stuff.
And like most of my stuff's Pyramid.
But we've got Japronto coming along, Sanic.
And another one that I just learned about is called Hug, at hug.rest.
How's that for a name and a domain?
Yeah, actually it is.
It's www.hug.rest.
Hug.rest.
That's beautiful. So Hug is a Python web framework just specifically for building RESTful, documented, documentable, versionable APIs.
And it's built both for like super simplicity and flexibility as well as performance.
So I started looking this up.
Wow, this is quite interesting.
Okay, so the idea is you can create an API once
and you can consume it in all these different ways.
So you can import it as a module or a package
into your project and use the API that way.
You can communicate it obviously over HTTP
as like a RESTful API,
or it also has a CLI, command line interface,
way to expose that.
So if you write like some kind of a web app or functionality,
you want to expose over an API, but you also want to call it locally.
It's like the same code.
That's interesting.
It's also written in Python 3.
It uses Cython all over the place.
So it's like super fast.
It's one of the fastest web frameworks out there for these kinds of things.
At least among the non-async ones, let's say.
If you compare those, it's pretty cool.
It's got a decorator model, so the code looks really clean.
Yeah, and the decorator model is cool because the decorator model will do version management.
You can have version 1 and version 2 of the API that have different data formats, and they can just coexist.
You get automatic documentation based on that.
Like it'll do type annotations
and then like use the type annotations
as part of the documentation and things like that.
Oh, that's great.
It's a pretty cool, simple little framework.
So, you know, hug for those guys.
Nice job.
Definitely.
Speaking of CLIs.
Yeah, speaking of CLIs,
I'm actually working on,
I had an example I wanted to do
that I'm running with in the pytest book that I'm working on.
And for the front end of it, I was punting before and not actually putting a front end on the application.
But I wanted to at least put a command line interface in.
And my first attempt was to go down the argparse route.
And with the particular quirks of this application, I needed subcommands.
Actually, the tutorials I found were out of date and didn't work.
And I was having a little bit of difficulty, so I went ahead and tried Click.
I'd heard of Click before and hadn't tried it.
And, man, a tutorial from like three years ago was about what I needed.
And it works right away. I've got like
half a page of code and my command line interface is done. So that's really cool.
It's also decorator heavy, right? Yeah. In my Sublime editor, it's colored nicely. And my wife
walked by and said, that's such beautiful code. Oh, lovely. Let's take that on many, many levels,
right? That's awesome. Yeah. That's by Armin Ronacher, the guy from Flask.
So definitely. Oh, did he do Click? I think so. Yeah, I believe so. Yeah, nice. Click is cool. I've
done a little bit of work with it and I've liked what I've seen. But I also kind of want to, yeah,
we'll talk about it later, but I might want to try adding a different CLI interface to it as well.
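[Editor's sketch] For readers wondering what "subcommands" means here, this is a minimal sketch using the standard library's argparse, the first tool mentioned in this segment. It is not the Click code described in the episode and not the book's application: the program name tasks and the add and list subcommands are hypothetical, chosen only to show the shape of a subcommand interface.

```python
import argparse


def build_parser():
    """Build a parser with two hypothetical subcommands: add and list."""
    parser = argparse.ArgumentParser(prog="tasks")
    subparsers = parser.add_subparsers(dest="command", required=True)

    add_parser = subparsers.add_parser("add", help="add a task")
    add_parser.add_argument("summary", help="short description of the task")

    subparsers.add_parser("list", help="list all tasks")
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and dispatch on the subcommand."""
    args = build_parser().parse_args(argv)
    if args.command == "add":
        return f"added: {args.summary}"
    return "listing tasks"


print(main(["add", "write tests"]))  # added: write tests
print(main(["list"]))                # listing tasks
```

Click expresses the same structure with decorators on plain functions, which is the "decorator heavy" style mentioned above.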
Yeah, cool. So the last one that I chose for us is kind of a refresher,
back to the fundamentals type thing. So: Python's Instance, Class, and Static Methods Demystified.
static methods demystified. So this one is on realpython.com. And I went over there and checked
it out. And I said, Okay, realpython.com. That's cool. And then I realized this is actually from
Dan Bader. And we seem to be covering a lot of Dan's stuff over here. And I actually have more to say about Dan later still.
So this was a guest post Dan did for that, although I didn't realize that until I started
getting into it. And the idea was to like demystify what's behind class methods, static methods,
and regular instance methods. If you learn classes and inheritance and object-oriented programming only through Python, this will be like obvious to you.
But if you come from other languages like C++ or Java or C# or JavaScript, there are differences in the way Python classes and inheritance work.
And it's worth kind of a compare and contrast.
So he comes up with a class and it's got like a regular method, a class method,
so with an @classmethod decorator, which takes a cls parameter,
and a static method with an @staticmethod decorator, which takes nothing,
and basically compares and contrasts how they work.
And so some of the things that I think are not obvious when you're first getting started is,
like instance methods, those are pretty straightforward.
Like you call them on instances, like all other languages, but the fact that I can call static
methods or class methods on instances, that's a little bit funky, right? That seems a little
weird. And then the other one, the main one, I think is like, what's the difference, why are
there two things like static method and class method? They seem the same. Why are there two? And then like, when would I use one versus the other? Right? The class method takes a cls
parameter, which is literally the type that it's on. And the static method just doesn't. But other than
that, they seem the same, right? And so if you're going to, say, interact with the class
during the class method, if you're going to create an instance of the class, you can use the cls parameter to support like inheritance and stuff. So if I got like a,
let's say a vehicle class and a car, like a Tesla car class, that class method could say,
like allocate a cls, whatever that is. And if you call it on a Tesla, that static-ish class
method would actually create a Tesla. It would change the type that it knows it is,
where the static method is just like a grouping.
So I thought that was interesting.
Does the class method follow then the hierarchy then?
So if I declare a class method on a base class,
is it available to the subclass?
Yes, always.
And that's always true for static methods.
But the difference is the static method doesn't really know what type it's being called on.
Oh, okay.
Whereas the class method, it's given the type.
So if, like, you call it farther down in the inheritance chain, whatever level you're at, that instance, or rather that type, actually is communicated to it.
And so you're kind of, you're told where you are in the hierarchy in a class method, where in static, it's just like, it's just a method. Go for it.
Okay. I don't think I've ever used static methods for anything.
Yeah. Well, they're out there hanging out with their friend class methods.
Interesting.
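[Editor's sketch] The vehicle and Tesla example from this segment can be written out as a short sketch. The class and method names below are made up for illustration and are not taken from the article; the point is only that a class method receives the class it was called on, while a static method receives nothing.

```python
class Vehicle:
    def __init__(self, wheels=4):
        self.wheels = wheels

    def describe(self):
        # Instance method: receives the instance as self.
        return f"{type(self).__name__} with {self.wheels} wheels"

    @classmethod
    def create(cls):
        # Class method: cls is whichever class the call was made on,
        # so subclasses get instances of themselves.
        return cls()

    @staticmethod
    def is_valid_wheel_count(count):
        # Static method: no self and no cls, just a function grouped with the class.
        return count > 0


class Tesla(Vehicle):
    pass


print(type(Vehicle.create()).__name__)  # Vehicle
print(type(Tesla.create()).__name__)    # Tesla
print(Tesla().create().describe())      # Tesla with 4 wheels (called on an instance)
print(Tesla.is_valid_wheel_count(4))    # True
```

Both kinds of method are inherited by Tesla and both can be called on an instance; only create knows which class it was reached through.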
Indeed. So I have a quick follow-up from the last show, from David Bieber from Google. He's
the guy who works on Python Fire, and he sent us a note. You said something to the effect of, look, Python Fire is
awesome, but IPython is a serious dependency to take if I just want a CLI, right? And I think
that's fair. That's fair. But he said, hey, you know what? One of our primary plans is to remove
IPython as a dependency. We're just not there yet. So if anybody in the audience wants to help those
guys move forward, they're totally working on that. And so Python Fire from Google is definitely getting some interesting
thinning out and will be very nice. And actually, I like to hear that,
that they're working on eventually getting rid of that dependency. And it's pretty cool. Also,
it's something I had mentioned when we talked about Python Fire,
that your development time is important too.
And putting an interface together with that is pretty fast.
So keep that in mind.
Yeah, it's not always about optimizing for the machines.
Definitely.
Hey, one more follow-up: we did cover pdir2, or pdir, a couple episodes ago,
with the dir that prints out in colors. One of the complaints
I had was that it didn't look that great on my black terminal. I had the same problem. I like
darker stuff and I'm like, wait, where's all the words? They just updated it, I guess yesterday,
I think, and it does have color configuration now.
So you can drop a pdir2 config file in your home directory.
And I set my background color to magenta so that it was visible for docs, visible on both black and white.
And now it looks great.
Oh, nice.
pdir2 now has themes.
Love it.
All right.
How's the book coming?
I heard there's a spotting.
Yeah.
So on Twitter the other day, somebody, a guy named Jacob Chirose, I think that's right,
noticed that it was listed on the Pragmatic Publisher's website.
So it's out there.
That's awesome.
I love the cover.
The rocket is cool.
Yeah.
A 50s sci-fi nerd.
Yeah, and it's perfect.
It's 50s, 60s vintage rocket.
So how about you?
Well, it has been a super busy couple of weeks.
I've been working on a couple of classes.
One of them I'm about to release.
By the time this recording comes out, it will be out.
So tomorrow, basically.
A course called Using and Mastering Cookiecutter.
So really deep dive into what is Cookiecutter?
How do you create and manage projects with Cookiecutter?
I think it's going to be a really fun course.
And I also just a few hours ago launched Managing Python Dependencies with pip and Virtual
Environments, which Dan Bader, speaking of Dan Bader,
came over to join me to write a class for us over here and we're shipping that as well. So I took
that course and I actually learned quite a bit from it. It's not just like pip install done,
it's what is the process that you use to manage your dependencies? What is the thinking and
workflow you use to evaluate
whether a package
is worth taking a dependency on
and all sorts of cool stuff like that.
Bunch of best practices.
Launched both of those
and I just started selling course bundles
on Talk Python Training as well
to sort of go along with those.
So lots of stuff.
That's pretty exciting.
I gotta check out the cookie cutter thing.
Yeah, thanks, Ed.
It'll be out tomorrow morning.
For everyone listening,
that's today.
But for you, Brian,
that's tomorrow morning.
The magic of time travel.
Thanks so much for finding
all these great items.
That was fun as always, Brian.
It was fun for me too.
And thanks to everybody
for all your feedback
that you send in.
Yep.
Thanks, everyone.
And thank you, Rollbar,
for supporting the show.
Thank you for listening
to Python Bytes.
Follow the show on Twitter via at Python Bytes.
That's Python Bytes as in B-Y-T-E-S.
And get the full show notes at pythonbytes.fm.
If you have a news item you want featured,
just visit pythonbytes.fm and send it our way.
We're always on the lookout for sharing something cool.
On behalf of myself and Brian Okken,
this is Michael Kennedy.
Thank you for listening and sharing this podcast
with your friends and colleagues.