Python Bytes - #20 Finding similar but not identical images in 128 bits via Python

Episode Date: April 5, 2017

See the full show notes for this episode on the website at pythonbytes.fm/20...

Transcript
Starting point is 00:00:00 Hello and welcome to Python Bytes. This is episode 20, where we are delivering Python news and headlines directly to your earbuds. I'm Michael Kennedy. And I'm Brian Okken. And we've got a bunch of stuff lined up for you today. I'm really excited to share, especially this first article that you chose, Brian, which is so clever. Before we do, I want to say thank you. Thank you to Rollbar, who's back to sponsor a bunch more Python Bytes. And we'll talk more about Rollbar later, but thanks, Rollbar. That's awesome. Yep. So we were just talking about pictures. Like, I have many gigabytes of pictures. And if you ran a website that accepted uploads of large numbers of pictures, how do you deal with all that data? Especially there's probably
Starting point is 00:00:42 a lot of duplicate data, right? I'm not sure. And so this is an interesting article. It's from Jetsetter.com, and they're an invitation-only travel community. The article is Duplicate Image Detection with Perceptual Hashing in Python. Perceptual hashing. That's awesome. Perceptual hashing is awesome.
Starting point is 00:01:06 And the idea is, the site's got a bunch of pictures of different places around the world, and they don't want pictures that are almost the same as each other. I mean, for family photos, you've got a ton that are close to each other, but there's a lot of cases where you don't want things that are almost the same. Right. Like pictures of hotels or pictures of a marina to say, here's the view out of the hotel. If they're going to have a listing for some location, some hotel, and they ask people to upload photos, they don't need like 100 of them from this one view. And if you check out Jetsetter.com, it is an intensely photo-heavy site. I'm pretty impressed with the number of photos on that page. With the idea of perceptual hashing, I was definitely interested
Starting point is 00:01:49 in reading about this, and I expected it to be a fairly complicated algorithm, but it's actually ingenious. They use Python to shrink the image down to just a nine-by-nine square of gray values. I don't even get how that's enough information, but it is apparently enough to determine whether or not an image is close to another image. And they do a delta. I'm not going to be able to explain it well. Can you explain that much better? I can try. I mean, when I read that we take, you know, a five-megapixel image and we generate a 128-bit hash, and that hash means uniqueness, or actually means similarity, which is actually more important, I was like, okay, I have to figure this out. And I guess
Starting point is 00:02:37 what they do is they take a large image, and they like average it down to a nine by nine, or they say for larger images, like a 17 by 17 image. And to determine the similarity, maybe somebody's off by five feet to one side or the other to take a picture of a hotel or a view or something. But if you like kind of average it down to that nine by nine, it's, that's where the similarities kind of collapse into those grids. And then they run an algorithm on that, that grayscale grid, right? Yeah, and then the interesting thing is that, of course, it's clear to me that you could come up with a hash algorithm for an image,
Starting point is 00:03:19 but the difference in the hashes is enough to tell you how close the image is. Yeah, and it's actually the opposite that really blows me away, is like two similar images that are not the same generate the same hash. That's what's the magic. Like that totally blows my mind. I could see like, well, obviously hash is different. Images are different. But images are similar, not the same.
Starting point is 00:03:35 Hash the same. That blows me away. Yeah. And I like that it's not that complicated of an algorithm, and it's a fun read. Yeah. You know, I think there's a couple of levels of interesting in this article that you brought up, and one of them that I think is really interesting is, when I first heard that, I thought, okay, one, this is going to be super hard, super computational. Two, maybe this is like machine learning or something like that, like two images given to an AI, like a deep learning neural network or something, to say, yeah, these are
Starting point is 00:04:03 sufficiently similar in ways that i don't really, people don't really understand, but magic on GPUs and lots of, you know, neurons, it works out somehow. But the fact that it's really, really a simple algorithm is what's, I think, kind of special about it, right? It's like, hey, there's still lots of places to be clever and not just throw AI plus GPUs at a thing. Yes, definitely. Yeah.
Starting point is 00:04:25 And not only that, you get to take it with you, right? It's available on GitHub. Yeah, they do have it. What is it? pybktree? P-Y-B-K tree, whatever that means. Okay, awesome. I'm sure it's part of the algorithm.
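The idea described above, shrinking an image to a nine-by-nine gray grid and hashing the differences between neighboring cells, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the article's actual code: the function names `dhash` and `hamming` are made up for this sketch, the resize-to-grayscale step (which would normally use an image library) is skipped, and the input is assumed to already be a 9x9 list of brightness values.

```python
def dhash(gray):
    """Return a 128-bit integer hash for a 9x9 grid of gray values.

    Assumption: `gray` is a list of 9 rows, each a list of 9 numbers.
    Each of the 8x8 inner cells contributes one bit for "is the pixel to
    the right brighter?" and one bit for "is the pixel below brighter?".
    """
    size = len(gray) - 1  # 8 comparisons per row and per column
    row_bits = []
    col_bits = []
    for y in range(size):
        for x in range(size):
            row_bits.append(gray[y][x] < gray[y][x + 1])
            col_bits.append(gray[y][x] < gray[y + 1][x])

    value = 0
    for bit in row_bits + col_bits:  # 64 + 64 = 128 bits
        value = (value << 1) | int(bit)
    return value


def hamming(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# A smooth gradient, the same gradient with one pixel changed, and its negative.
base = [[10 * x + y for x in range(9)] for y in range(9)]
tweaked = [row[:] for row in base]
tweaked[4][4] = 0
inverted = [[255 - v for v in row] for row in base]

print(hamming(dhash(base), dhash(tweaked)))   # 2: nearly identical
print(hamming(dhash(base), dhash(inverted)))  # 128: every bit differs
```

Similar images end up a small number of bits apart rather than necessarily identical, so a "duplicate" is usually defined by a distance threshold chosen for the data at hand.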
Starting point is 00:04:40 Excellent. So keeping with open source projects that you can go find and just grab and do cool things with, one of the listeners pointed us towards Google Open Source. In fact, it was the guy from Python Fire at Google, which we'll talk more about later. He has one of the projects there. And on Google Open Source, they've basically created a listing directory of all of the open source projects. Now, many of the projects still live on GitHub, but this is a place where you can go search and analyze and discover projects from Google. And what's cool is you can sort by language. So show me the Python
Starting point is 00:05:16 projects, show me the C++ projects, whatever. So I grabbed six or seven interesting projects. I just wanted to run them down for you, Brian. Okay, yeah. So one of them is subprocess32, a reliable subprocess module for Python 2. Apparently, subprocess, the built-in, is not reliable for Python 2. I don't know. I didn't know that either. That's partly why it's interesting to me, but also, you know, there it is.
Starting point is 00:05:40 That's cool. Grumpy, we've talked about Grumpy before. Grumpy is Python on Go instead of Python on CPython. Yeah, that's a good one. Python Fire, of course, like I pointed out. That's a way to take any Python object or module and turn it into a command line interface. There's a Python client for Google Maps services.
Starting point is 00:05:59 So if you want to consume Google Maps from Python, do it. There's hyou, H-Y-O-U, a Python interface for manipulating Google spreadsheets. That's cool, right? Okay. I'm going to have to try that out. That's neat. Yeah. I mean, I've seen the stuff for working with docx and xlsx files, the Microsoft Office ones, but I didn't know about the Google spreadsheet one.
Starting point is 00:06:21 So this is cool. Another thing that's always tricky for me is working with OAuth, right? There's always this, like, I've got some app, the app needs to go open a browser window, and there's some sort of funky callback, and things happen. And so one of the places that's especially challenging, I think, is over a command line interface. Well, there's oauth2l. I think it's an L. And what that is, is a command line tool to get an OAuth token. Just let that sink in for you.
Starting point is 00:06:50 Okay. So I want to log in as Google. I can do that like through my app. Like I could basically create a shell script that through the CLI gets an OAuth token from the user. That's pretty interesting. Okay. And also I talked about the Google maps API.
Starting point is 00:07:04 Like that sounds like that's something that's really hard to like unit test or test at all without actually going to Google. So there's a mock maps API. So a small little app engine app for testing, like basically mocking out Google Maps API. And last but not least, TensorFlow, the amazing deep learning, machine learning stuff. That's about 50% Python, 50% C++, and a lot of GPUs in action there. And I don't know where I read this, but I think that this Google open source location is not just all projects. It's projects that they consider still active. Okay. Yeah, that's cool.
Starting point is 00:07:45 I mean, obviously you don't want just like a dumping ground, right? Yeah, cool. I mean, everything there looked pretty neat and fresh, so it's good. It's a fairly neat interface too with, I guess, with panels and stuff. Yeah, it's worth checking out. Okay, what do we got next? Oh, next is me. Yeah, more machine learning type stuff. Yeah, so there's an article from Jason Brownlee called,
Starting point is 00:08:04 and I just clicked away, How to Handle Missing Data with Python. And this is something that I definitely deal with at work, with measurement values. But the gist of it is, a lot of times you're dealing with large or small data sets, and some of the values are missing. And there's a whole bunch of different ways you can deal with missing data, and he talks about a few of them. You have to know what the magic number is, because some data collection will fill in a zero, maybe, if there's no data, or some other known number, and all your math is going to get messed up if you actually just leave that there. So there's a couple of ways to get around it.
Starting point is 00:08:45 One of the ways he lists is using the magic not-a-number (NaN) values. And I think pandas can deal with that correctly and not average those in. Yeah. What I think is really nice about it is, I could be given a CSV file or some sort of set of data, and I could work my way through it and maybe find the bad data and fill it in, potentially. But his fixes are like, you run this one line in pandas and magic happens and it's better, right? The fix is so much better than the fixes that I would come up with. Yeah. And I do like that he's talking about different ways to deal with it with NumPy, even without pandas, because you might not be using pandas. But one of the ways you would do it with any math package really would
Starting point is 00:09:30 be to, oh, I guess I don't know how to do that, actually. Never mind. You'd somehow have to find all of the missing values anyway and fill them in. One of the ways is, if you're calculating an average, calculate the average of everything else and then fill in the blanks with the average number. Right. I guess it depends on what you're going to do. Are you going to average it? Are you going to take the max or the min? You could, like, push that through, right?
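The two steps talked through here, marking a known placeholder as not-a-number and then filling the blanks with the average of everything else, can be sketched with just the standard library. This is a hedged illustration, not the article's code: the helper names and the sample readings are invented for the example, and zero is assumed to be the placeholder for "no data". With pandas the same steps are typically one-liners built on `replace` and `fillna`.

```python
import math


def mark_missing(values, sentinel=0):
    """Replace the placeholder value (assumed to be 0) with NaN."""
    return [math.nan if v == sentinel else float(v) for v in values]


def nan_mean(values):
    """Average only the values that are actually present."""
    present = [v for v in values if not math.isnan(v)]
    if not present:
        raise ValueError("no non-missing values to average")
    return sum(present) / len(present)


def fill_with_mean(values):
    """Fill each NaN with the mean of the non-missing values."""
    mean = nan_mean(values)
    return [mean if math.isnan(v) else v for v in values]


# Hypothetical sensor readings where 0 really means "nothing was recorded".
readings = [4.0, 0, 6.0, 8.0, 0]
marked = mark_missing(readings)

print(sum(readings) / len(readings))  # 3.6: the zeros drag the average down
print(nan_mean(marked))               # 6.0: missing values are left out
print(fill_with_mean(marked))         # [4.0, 6.0, 6.0, 8.0, 6.0]
```

Mean filling keeps the average intact but would distort a max, a min, or a variance, which is the "it depends on what you're going to do" point raised in the conversation.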
Starting point is 00:09:55 Yeah. Yeah. Interesting. The best solution definitely, I think, is using the not a number and letting the libraries take care of it for you. But I wanted to bring this up partly because anybody that's working with data collection and doing math with that has to deal with the fact that sometimes there's not numbers there and you have to deal with it. Okay. Awesome. He's from machinelearningmastery.com, I think,
Starting point is 00:10:20 and he's got just a ton of cool stuff going on over there. It's not just this one article. So if you're into these kinds of things, definitely check it out. Yeah, it looks good. Okay. So what's up next is the Hug REST framework. But before we get to them, I want to give Rollbar a hug. Rollbar is awesome. As people know, I've been using them for a long time on the websites, and the websites are getting more and more traffic. And recently, I'm not sure whether it was a wise decision or not, because I'm really busy with other stuff, but I just got really frustrated with the way my servers were working, the way I could sort of move them around,
Starting point is 00:10:59 and performance and stuff. So one day I just woke up and said, that's it. Converting it all to MongoDB. And so that was last week. And that took like three days of rewriting all my sites to Mongo, which I really think is the right choice. And I'm just loving the way it's working now. But that was a pretty serious, like, take the guts out of all my web apps and stick in a new set of guts that are similar but not entirely compatible. I spent a little time with Rollbar and they helped me out. I found a few problems, like where types used to be strings and I could compare them, but now one was no longer a string and they didn't compare the same. So I got weird errors, but Rollbar made it super easy to track that down.
Starting point is 00:11:40 So if you want to have reliability and most importantly awareness of the state of your apps, plug in Rollbar to your web apps. You can use it in Pyramid, Flask, Django, whatever. Just plug it in and you'll get notifications right away. So be sure to visit rollbar.com slash pythonbytes and you'll get a special offer to get started there. And I bet that you definitely noticed those messages, but I didn't even notice you were mucking with things. And I'm pretty sure that nobody else did, or very few people did either. Yeah, that's true.
Starting point is 00:12:12 And thank you for saying that. But I actually know how many people ran into problems, right? There was a couple, but I got an email from a couple of people saying, hey, I had this problem with your app. I'm like, I know, but I didn't know your email address, but I know what your problem was and it's already fixed. I just couldn't contact them because they hadn't actually created an account yet.
Starting point is 00:12:32 So it was really nice to be able to just say, yeah, actually the problem you're telling me is already fixed. I just couldn't communicate that back to you. Really sorry about that. It's awesome. You seem like a big team then because of that. Oh yeah, definitely.
Starting point is 00:12:44 It's all the folks here in the cubicle farm. We're busy. You know, one of the next things that I want to do is build some nice APIs. And I think it's really an interesting time for the web in Python. There's a lot of flowers blooming, if you will. Right. We've got Pyramid, Django, Flask. Those guys are all doing super stuff.
Starting point is 00:13:06 And like most of my stuff's Pyramid. But we've got Japronto coming along, Sanic. And another one that I just learned about is called Hug, at hug.rest. How's that for a name and a domain? Yeah, actually it is. It's www.hug.rest. Hug.rest. That's beautiful. So Hug is a Python web framework specifically for building RESTful, documented, versionable APIs.
Starting point is 00:13:33 And it's built both for like super simplicity and flexibility as well as performance. So I started looking this up. Wow, this is quite interesting. Okay, so the idea is you can create an API once and you can consume it in all these different ways. So you can import it as a module or a package into your project and use the API that way. You can communicate it obviously over HTTP
Starting point is 00:13:57 as like a RESTful API, or it also has a CLI, command line interface, way to expose that. So if you write like some kind of a web app or functionality, you want to expose over an API, but you also want to call it locally. It's like the same code. That's interesting. It's also written in Python three.
Starting point is 00:14:14 It uses Cython all over the place, so it's super fast. It's one of the fastest web frameworks out there for these kinds of things. At least among the non-async ones, let's say. If you compare those, it's pretty cool. It's got a decorator model, so the code looks really clean. Yeah, and the decorator model is cool because it will do version management. You can have version 1 and version 2 of the API that have different data formats, and they can just coexist.
Starting point is 00:14:42 You get automatic documentation based on that. Like it'll do type annotations and then like use the type annotations as part of the documentation and things like that. Oh, that's great. It's a pretty cool, simple little framework. So, you know, hug for those guys. Nice job.
Starting point is 00:14:55 Definitely. Speaking of CLIs. Yeah, speaking of CLIs, I had an example I wanted to do that I'm running with in the pytest book that I'm working on. And for the front end of it, I was punting before and not actually putting a front end on the application. But I wanted to at least put a command line interface in.
Starting point is 00:15:16 And my first attempt was to go down the argparse route. With the particular quirks of this application, I needed subcommands. Actually, the tutorials I found were just out of date and didn't work. And I was having a little bit of difficulty, so I went ahead and tried Click. I'd heard of Click before and hadn't tried it. And, man, a tutorial from like three years ago was about what I needed. And it works right away. I've got like half a page of code and my command line interface is done. So that's really cool.
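For reference, subcommands in the standard-library argparse look roughly like the sketch below. This is only an illustration of the general shape, not the application from the book: the `tasks` program name, the `add` and `list` subcommands, and the `--owner` option are all hypothetical. Click expresses the same structure with decorators on plain functions, which is where the half page of code comes from.

```python
import argparse


def build_parser():
    """Build a parser with two hypothetical subcommands: add and list."""
    parser = argparse.ArgumentParser(prog="tasks")
    # dest stores which subcommand was chosen; required=True needs Python 3.7+.
    subparsers = parser.add_subparsers(dest="command", required=True)

    add = subparsers.add_parser("add", help="add a task")
    add.add_argument("summary")
    add.add_argument("--owner", default=None)

    listing = subparsers.add_parser("list", help="list tasks")
    listing.add_argument("--owner", default=None)
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and return a status message."""
    args = build_parser().parse_args(argv)
    if args.command == "add":
        return f"added {args.summary!r} for {args.owner or 'nobody'}"
    return f"listing tasks for {args.owner or 'everyone'}"


print(main(["add", "write tests", "--owner", "brian"]))
print(main(["list"]))
```

Passing an explicit list to `main` instead of reading `sys.argv` keeps the interface easy to exercise from tests.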
Starting point is 00:15:50 It's also decorator heavy, right? Yeah. In my Sublime editor, it's colored nicely. And my wife walked by and said, that's such beautiful code. Oh, lovely. Let's take that on many, many levels, right? That's awesome. Yeah. That's by Armin Ronacher, the guy from Flask. So, definitely. Oh, did he do Click? I think so, yeah. I believe so. Yeah, nice. Click is cool. I've done a little bit of work with it and I've liked what I've seen. But I also kind of want to, yeah, we'll talk about it later, but I might want to try adding a different CLI interface to it as well. Yeah, cool. So the last one that I chose for us is kind of a refresher, a back-to-the-fundamentals type thing. So, Python's instance, class, and
Starting point is 00:16:32 static methods demystified. So this one is on realpython.com. And I went over there and checked it out. And I said, okay, realpython.com. That's cool. And then I realized this is actually from Dan Bader. And we seem to be covering a lot of Dan's stuff over here. And I actually have more to say about Dan later still. So this was a guest post Dan did for them, although I didn't realize that until I started getting into it. And the idea was to demystify what's behind class methods, static methods, and regular instance methods. If you learned classes and inheritance and object-oriented programming only through Python, this will be obvious to you. But if you come from other languages like C++ or Java or C# or JavaScript, there are differences in the way Python classes and inheritance work. And it's worth kind of a compare and contrast.
Starting point is 00:17:22 So he comes up with a class, and it's got a regular method; a class method, so an @classmethod decorator, and it takes a cls parameter; and a static method with an @staticmethod decorator and nothing else, and basically compares and contrasts how they work. And so some of the things that I think are not obvious when you're first getting started: instance methods, those are pretty straightforward. You call them on instances, like in all other languages. But the fact that I can call static methods or class methods on instances, that's a little bit funky, right? That seems a little
Starting point is 00:17:56 weird. And then the other one, the main one, I think, is: what's the difference? Why are there two things, static method and class method? They seem the same. And then, when would I use one versus the other? Right? The class method takes a cls parameter, which is literally the type that it's on. And the static method just doesn't. But other than that, they seem the same, right? And so if you're going to interact with the class during the class method, if you're going to create an instance of the class, you can use the cls parameter to support inheritance and stuff. So if I've got, let's say, a vehicle class and a car, like a Tesla car class, that class method could, say, allocate a cls, whatever that is. And if you call that static-ish class method on Tesla, it would actually create a Tesla. It would change the type that it knows it is,
Starting point is 00:18:47 where the static method is just like a grouping. So I thought that was interesting. Does the class method follow then the hierarchy then? So if I declare a class method on a base class, is it available to the subclass? Yes, always. And that's always true for static methods. But the difference is the static method doesn't really know what type it's being called on.
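The vehicle and Tesla example from the conversation can be written out as a short sketch. The class names come from the discussion; the method names (`describe`, `create`, `wheels_for`) are made up for illustration and are not taken from the article.

```python
class Vehicle:
    wheels = 4

    def describe(self):
        """Regular instance method: receives the instance as self."""
        return f"{type(self).__name__} with {self.wheels} wheels"

    @classmethod
    def create(cls):
        """Class method: receives the class it was called on as cls."""
        return cls()

    @staticmethod
    def wheels_for(count):
        """Static method: receives neither the instance nor the class."""
        return count * 4


class Tesla(Vehicle):
    pass


print(type(Vehicle.create()).__name__)  # Vehicle
print(type(Tesla.create()).__name__)    # Tesla: cls follows the subclass
print(Tesla().create().describe())      # Tesla with 4 wheels (callable on instances too)
print(Tesla.wheels_for(3))              # 12: inherited, but unaware of Tesla
```

Because `create` builds its result from `cls` rather than naming `Vehicle` directly, subclasses get a correctly typed factory for free, which is the main practical reason to prefer a class method over a static one.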
Starting point is 00:19:08 Oh, okay. Whereas the class method, it's given the type. So if you call it farther down in the inheritance chain, whatever level you're at, that instance or that type actually is communicated to it. And so you're told where you are in the hierarchy in a class method, where in static, it's just a method. Go for it. Okay. I don't think I've ever used static methods for anything. Yeah. Well, they're out there hanging out with their friend class methods. Interesting. Indeed. So I have a quick follow-up from the last show, David Bieber from Google. And he,
Starting point is 00:19:39 the guy who works on Python Fire, sent us a note. And you said something to the effect of, look, Python Fire is awesome, but IPython is a serious dependency to take if I just want a CLI, right? And I think that's fair. That's fair. But he said, hey, you know what? One of our primary plans is to remove IPython as a dependency. We're just not there yet. So if anybody in the audience wants to help those guys move forward, they're totally working on that. And so Python Fire from Google is definitely getting some interesting thinning out and will be very nice. And actually, I like to hear that, that they're working on eventually getting rid of that dependency. And it's pretty cool. Also, it's something I had mentioned when we talked about Python Fire,
Starting point is 00:20:26 that your development time is important too. And putting an interface together with that is pretty fast. So keep that in mind. Yeah, it's not always about optimizing for the machines. Definitely. Hey, one more follow-up: we did cover pdir2, or pdir, a couple of episodes ago, with the colored dir printouts. One of the complaints I had was that it didn't look that great on my black terminal. I had the same problem. I like
Starting point is 00:20:56 darker stuff, and I'm like, wait, where are all the words? They just updated it, I guess yesterday, I think, and it does have color configuration now. So you can drop a pdir2 config file in your home directory. And I set my background color to magenta so that it was visible for docs, visible on both black and white. And now it looks great. Oh, nice. pdir2 now has themes. Love it.
Starting point is 00:21:23 All right. How's the book coming? I heard there's a spotting. Yeah. So on Twitter the other day, somebody, a guy named Jacob Chirose, I think that's right, noticed that it was listed on the Pragmatic Publisher's website. So it's out there. That's awesome.
Starting point is 00:21:40 I love the cover. The rocket is cool. Yeah. A 50s sci-fi nerd. Yeah. And it's awesome. I love the cover. The rocket is cool. Yeah, a 50s sci-fi nerd. Yeah, and it's perfect. It's 50s, 60s vintage rocket. So how about you? Well, it has been a super busy couple of weeks.
Starting point is 00:21:55 I've been working on a couple of classes. One of them I'm about to release. By the time this recording comes out, it will be out. So tomorrow, basically. A course called Using and Mastering Cookiecutter. So a really deep dive into what is Cookiecutter? How do you create and manage projects with Cookiecutter? I think it's going to be a really fun course.
Starting point is 00:22:16 And I also just a few hours ago launched Managing Python Dependencies with pip and Virtual Environments, which Dan Bader, speaking of Dan Bader, came over to join me to write a class for us over here, and we're shipping that as well. So I took that course and I actually learned quite a bit from it. It's not just like pip install, done. It's what is the process that you use to manage your dependencies? What is the thinking and workflow you use to evaluate whether a package is worth taking a dependency on
Starting point is 00:22:48 and all sorts of cool stuff like that. Bunch of best practices. I'm launching both of those, and I just started selling course bundles on Talk Python Training as well to sort of go along with those. So lots of stuff. That's pretty exciting.
Starting point is 00:22:59 I gotta check out the Cookiecutter thing. Yeah, thanks. It'll be out tomorrow morning. For everyone listening, that's today. But for you, Brian, that's tomorrow morning. The magic of time travel.
Starting point is 00:23:10 Thanks so much for finding all these great items. That was fun as always, Brian. It was fun for me too. And thanks to everybody for all your feedback that you send in. Yep.
Starting point is 00:23:19 Thanks, everyone. And thank you, Rollbar, for supporting the show. Thank you for listening to Python Bytes. Follow the show on Twitter via at Python Bytes. That's Python Bytes as in B-Y-T-E-S. And get the full show notes at pythonbytes.fm.
Starting point is 00:23:34 If you have a news item you want featured, just visit pythonbytes.fm and send it our way. We're always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.
