Python Bytes - #137 Advanced Python testing and big-time diffs
Episode Date: July 2, 2019
Topics covered in this episode: Comparing the Same Project in Rust, Haskell, C++, Python, Scala and OCaml; MongoDB 4.2; Deep Difference and search of any Python object/data; Advanced Python Testing; Un...derstanding Python's del; Extras; Joke.
See the full show notes for this episode on the website at pythonbytes.fm/137
Transcript
Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
This is episode 137, recorded June 26th, 2019. I'm Michael Kennedy.
And I'm Brian Okken.
And this episode is brought to you by Rollbar. I'll tell you all more about them later.
For now, Brian, I always wonder about, you know, you hear that Python is an efficient and expressive language,
and if you write code in C++, it'll be a lot longer. But, you know, how can you quantify that?
Well, you can set up a whole bunch of people
to write the same thing in a whole bunch of languages.
Well, that's awesome that people did that.
It seems like a lot of work, but yeah, I guess that's cool.
Tell us about it.
Like, this is your first item, right?
So this is an article called
Comparing the Same Project in Rust, Haskell, C++, Python, Scala,
and how do you pronounce that?
OCaml?
O-C-A-M-L?
OCaml, I think, yeah.
So this was written up by Tristan Hume,
and this is about a university project, which is kind of a neat project.
Basically, they had to implement a big chunk of Java.
So it's a Java to x86 compiler, as part of a compiler
class. They basically had to set up teams of people to do it, and they
could pick any language they wanted, which is kind of cool, because people are better at
different languages. So let them use what they're good at. Yeah, let them use what they're good at,
because then they'll do it properly and not just try to cram it in. They'll have the most
efficient use of that language for sure. Up to three people on a team, and it was a multi-month
project. And then also tests were added. So this is kind of a neat part of the process, which I
think is an awesome way to teach people: have some published tests, like, you're going to have
to run these test cases and they have to pass, but then also have some secret ones, where people don't know
what tests are going to be run against it. Which is kind of nice, because people
have to make sure their implementation is robust without knowing
all the test cases. It's kind of neat. Yeah, that's cool. So I do love it that there's
unknown tests. Like these are the specifications. You can kind of get close with these tests,
but to pass, you actually have to just work. That's totally like real life. You know,
you'll write down some specifications, and there are some specifications that are not written
down. They're just supposed to be known. And then there's other things that people, once they see
the implementation, they'll go, oh yeah, I wish it did this also.
So I think that's a cool idea.
And they weren't shooting for lines of code or complexity or anything.
They were just trying to finish the project.
So this analysis was done afterward. They had a baseline Rust implementation written by two people who were familiar with Rust.
And then they compared everything against that.
So there was another Rust team that chose different design decisions,
and they had like three times the code.
So these are all just comparing lines of code.
The Haskell implementation was about equal,
but depending on how you measure it, 1 to 1.6 times the code.
Same for the OCaml.
C++ was bigger, about 1.4 times the baseline.
And Scala was a little bit less with about 70% of the lines of code.
The big outlier was Python, which had a lot of standouts.
The Python implementation was approximately half the size,
plus it was written by one person,
and it had extra features and passed all the secret tests, plus others.
Somebody excellent at programming, of course,
who used some metaprogramming techniques.
And anyway, kind of a fun article.
One of the things I forgot to mention: one of
the hindrances was they were only supposed to use standard libraries, and not
any parsing libraries, even if they were part of the standard library. So even the parsing had to be
kind of built up from scratch. Yeah, how interesting. I wonder if that would make things like Python
even better. Possibly. I don't know about Rust, maybe as well, but C++ doesn't have parsing libraries built in that I know of.
Things like that, right?
There's a lot of mini language parsing libraries around Python. So,
I mean, it'd be interesting to do that with, you know, go wild and use whatever's available.
Right. Like maybe take this project and then go, all right, well,
what if we hit it with all the pip installable things? What happens then, right?
Yeah, exactly.
Yeah, it sounds like a super intense project though, right?
Like deep, deep into the language, right?
I mean, on one, you're writing the compiler, you're understanding Java,
you're compiling to x86, you're doing metaprogramming.
Like there's a lot of stuff going on here.
It's a pretty cool article.
Cool.
If that last one really connected with your deep geek side,
like, go really
hard into the language, this next one is going to connect with your "I just want it to work
really quick and easy" side.
Yeah.
So this one is really nice.
If I was a data scientist, or just any kind of person who wanted some
visualization of data,
I might use matplotlib for that.
And that's great.
Except, at least I personally can't make my matplotlib look super good, right?
Like if I used Excel, I could put the data in there and I'd highlight the stuff and I would say, okay, insert chart.
And I would pick the kind and then I would go and I would like right click and edit the chart.
And I would like maybe drag it around to size it correctly.
Double click on the axes to change the axes.
But in matplotlib, you just write code and the picture comes out, right?
Yeah.
And I know you can do all this stuff, but it's not obvious. And you have to look
every little thing up and tweak it. Yeah. So there's this project that we heard about from
one of our listeners. And I'm trying to remember who it was. Oh, here it is.
This is from Lee Wagner. So thank you, Lee, for sending it in, because this is killer. There's this project called Pylustrator for styling your matplotlib plots. So you just do your matplotlib thing, and what you get is much like Excel, where you can drag and drop
and arrange your different plots. You can go to the properties and edit the axes and the
colors and all the kind of stuff that you might do. It even has the cool design
layout stuff, where it'll help you equally space things. It puts up those
little bars to say, right there,
if you drag and drop it,
they'll be equally spaced, or it'll align the tops and the sides.
Yeah.
And with that,
the start thing,
you can even fill it with some of your data to begin with.
So you kind of know the data you want to plot, because that's going to affect how
you're going to design it.
So you can pre-fill it and then drag it around and design it. It's just totally cool.
It's totally cool.
I'm glad they have a little embedded video at the link we're going to show. I mean, talking about it, you're like, yeah, I think this might be useful. But you
watch this video and you're like, oh my God, I need to use this right away.
Yes. I had the exact same experience. I'm like, ah, kind of interesting. Oh,
look at the video. Oh my God, it's amazing. Yeah. So this is super cool. And obviously you don't save your changes to like an Excel workbook.
What you do is you save your changes, and you can actually call save in Pylustrator. And what it'll
do is it'll put the configuration, in Python, back into the file that ran it. So that's pretty wild
actually. Yeah. And then you comment out Pylustrator. You don't have to import it later, because it's not a dependency of
your project afterwards. Right, it's just a little design tool. So it's super cool if
anyone's doing matplotlib and they want to have it styled, especially if you're doing
more than one plot and you want to put them side by side. This is super cool. So check that out.
I'm definitely a fan. Another thing I'm a fan of, Brian, MongoDB.
Love it. Since you and I are paying attention to a lot of projects,
there's a lot of different release cycles,
and we kind of decided early on that we weren't going to try to track everybody's releases
because that might get boring to people.
However, we covered MongoDB 4 because it came out with transactions,
which was a big thing. But 4.2 is out, and I'm kind of
excited about a couple of features that it came out with. So the transactions are there, but now they're
multi-doc, they're distributed transactions. They're transactions that cross
sharded clusters and replica sets, and that's just really cool. Yeah, that's super cool. Yeah, I mean,
you could use transactions before,
but you're kind of limited, right?
And now it's like,
no matter what crazy cluster
with scaling and sharding
and replication you have set up,
you just do a transaction
and it's all good.
Pretty cool.
They're a good idea anyway.
But with testing,
you can set up a complex database
full of stuff.
And then at the beginning of your test,
start a transaction.
And then after your test, and you roll this into a fixture, you can just roll back and your next test has the same data.
It saves time.
So that's cool.
Yeah.
And it's probably got isolation if for some reason they ran in parallel or whatever.
Yeah, it's really cool.
Yeah.
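The roll-back-after-each-test pattern described above can be sketched roughly as follows. This is only an illustration of the idea: it uses the standard library's sqlite3 as a stand-in rather than MongoDB or PyMongo, and the helper names (make_seeded_db, rolled_back, count_users) are made up for the example. In a real test suite the context manager would likely become a pytest fixture.

```python
import sqlite3
from contextlib import contextmanager


def make_seeded_db():
    """Build the 'database full of stuff' once (here: just two rows)."""
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # BEGIN / ROLLBACK statements below are fully under our control.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [("ana",), ("bo",)])
    return conn


@contextmanager
def rolled_back(conn):
    """Start a transaction before the test body, roll it back afterwards."""
    conn.execute("BEGIN")
    try:
        yield conn
    finally:
        conn.execute("ROLLBACK")


def count_users(conn):
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


db = make_seeded_db()

with rolled_back(db) as conn:
    # The "test" changes the data...
    conn.execute("INSERT INTO users VALUES ('temp')")
    inside = count_users(conn)

# ...but the next test sees the original seed data again.
after = count_users(db)
print(inside, after)
```

The seed data is built once, and every test starts from the same state without paying to rebuild it.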
The other feature that's pretty amazing is the field level encryption.
And this is encryption done on a per field basis on the client side.
So the server doesn't even do that.
It's not done on the server.
So system administration can be done without having to make sure everybody
signs NDAs and all that stuff. You can manage your database without
even being exposed to any of the secret stuff. Yeah, that's awesome. Like most databases,
most of what's in them is not sensitive, but there's often a little bit that is,
that you really don't want anyone to get access to. And yeah,
this is really cool because like you said, it's done in the library that talks to MongoDB.
So in PyMongo for the Python folks.
And you just set the encryption key or decryption key over there.
And the server cannot decrypt it.
So if somebody breaks into the server or you lose it,
or it's like you set it up on the cloud for like testing,
you forget that it's there. All the kind of random stuff that happens to databases.
It doesn't matter in terms of this encrypted stuff
because literally the database doesn't know how to read it.
It's the drivers on the client side that have the keys.
So with GDPR stuff, if the customer says,
hey, delete my stuff,
that's always been an issue with databases.
It might be in a whole bunch of tables,
but if you destroy the customer key,
the data might still be there, but it's unreadable to anybody. So it may as well be garbage.
Absolutely.
And it gets to be really tricky because even if you set up the right code to delete all the customer data out of your database,
what about the backup that somebody made when the older admin was hired
and they stored that in the S3 buckets, so it was offsite, right? How do you delete the data
out of there? You know what I mean? But if it's encrypted, then you can just throw away
the encryption key and then it's just gobbledygook. Yeah, cool. Pretty cool. I like it.
Speaking of cool, Rollbar, happy to have them come along and sponsor the show.
We use Rollbar on PythonBytes.fm, among other things.
So if anything goes wrong, and it's kind of fortuitous, I guess,
I woke up this morning with a ton of Rollbar messages
because there was a data center failure that caused some connectivity issues
between MongoDB and PythonBytes.fm.
How about that for a funny thing?
So some network card broke, right?
And like the site couldn't talk to the database server, so it was freaking out.
How do I know?
Nobody complained to me.
They probably should.
Like, Michael, your site's down.
What's going on?
It's really messed up.
But I just opened up my email.
I'm like, whoa, there's a lot of Rollbar stuff going on here. So if you want to be notified right away, even when users don't
tell you, check them out. They have a free tier, they have some great paid tiers. Visit pythonbytes.fm
slash rollbar. Super easy to integrate into Python, into the web frameworks. They've just got like one
or two lines you enter, or maybe a little configuration, a few settings, and off you go. It's
really, really nice.
So check them out, pythonbytes.fm slash rollbar.
Nice.
So kind of like Pylustrator, that sounds kind of useful and interesting.
This next one also sounds useful and interesting, but like Pylustrator,
as you look into it, you're like, whoa, this thing does a lot, man.
Look at it go.
So there's this project that was recommended by Francois Leblanc. Thank you for
that, Francois. It's called Deep Difference, or really it's just called DeepDiff. And it does
deep differences and search of any Python object graph. So I've got an object which holds a list,
that list points to a bunch of objects, those have other pointers. Like, I want to know, is this thing somehow referenced by that?
Let me do a search on it.
Where is it?
Is this giant crazy data structure same or different than other giant crazy data structure?
And you could compare them.
So that's pretty cool.
So it has DeepDiff, it has DeepSearch, and then it also has DeepHash.
So if I've got some giant crazy data structure, you would like to know that if
the data is the same across two of those, the hash result is identical, and if any part of the
data changes, the hash then changes. Oh, yeah. So it will do that on
object graphs that are not even hashable themselves. Really? Yeah. So that's pretty wild.
There are just a lot of nice touches in here that
kind of made me realize, like, wow, this is wild. So for example, it'll give me the differences in
a list, ignoring orders and duplicates, right? Just what is the essence of this data? Or you
can say, is any data repeated in this list or in this dictionary or something like that?
You can exclude certain types.
Maybe I want to know the data is the same, but they're both using a thread object, and the thread object is different, so of course they're going to be different.
But say, don't check on the thread object.
Just check the other stuff.
So you can explicitly opt in or out data types that you might use. You can say, I'd like to compare these things,
but only to like four significant digits because I computed them slightly
differently. And maybe they're, you know,
I can't get them like to the decimal accuracy to be exactly the same,
just the way they're done. Right.
You can exclude parts of your object tree from being compared.
I mean, this is insane.
Being able to do significant digits in a deep data structure, that's amazing. That's really cool for a lot of stuff I work with.
Yeah, I can imagine. Exactly. And you know what? I bet this would be really good to mix in with
testing. Like, you create your test data and then you deep diff it against the result. Yeah, exactly.
Because there may be noise in the system, and you know some of the signals are noisy. So, yeah, this is awesome.
Cool.
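To make the "compare nested data, but only to a few digits and ignoring order" idea concrete, here is a small standard-library sketch. It is not DeepDiff's implementation or API: deep_equal and its places and ignore_order parameters are hypothetical names that only loosely mirror the options discussed above, it rounds to decimal places, and it returns just True or False instead of reporting what changed.

```python
def deep_equal(a, b, places=None, ignore_order=False):
    """Recursively compare nested dicts, lists, tuples and plain values.

    places: if given, floats are compared after rounding to that many
            decimal places.
    ignore_order: if True, sequences match when their items can be paired
            up in any order (duplicates are still counted, unlike the
            duplicate-ignoring behaviour described in the episode).
    """
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            deep_equal(a[key], b[key], places, ignore_order) for key in a
        )

    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        if len(a) != len(b):
            return False
        if not ignore_order:
            return all(
                deep_equal(x, y, places, ignore_order) for x, y in zip(a, b)
            )
        # Order-insensitive: pair each item of a with an unused item of b.
        remaining = list(b)
        for item in a:
            for index, candidate in enumerate(remaining):
                if deep_equal(item, candidate, places, ignore_order):
                    del remaining[index]
                    break
            else:
                return False
        return True

    if places is not None and isinstance(a, float) and isinstance(b, float):
        return round(a, places) == round(b, places)

    return a == b


expected = {"name": "run-1", "samples": [1.00001, 2.5], "tags": ["a", "b"]}
actual = {"name": "run-1", "samples": [1.00002, 2.5], "tags": ["b", "a"]}

strict = deep_equal(expected, actual)
relaxed = deep_equal(expected, actual, places=4, ignore_order=True)
print(strict, relaxed)
```

In a test, the relaxed comparison is the one that tolerates slightly noisy numeric results while still checking the overall shape of the data.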
It's super simple, but, yeah, it's pretty cool.
So if that sounds like problems you're trying to solve, it sounds like you are, Brian,
then I think it's definitely worth having a look at.
Yeah, thanks.
Yeah, you bet.
See, we just do this podcast to help each other out.
People can listen in.
Yeah.
Speaking of testing.
Josh Peek is somebody that, I'm sure we met him before at a previous PyCon, but he stopped by at PyCon this last year. He was in a situation at work where he was asked to do complex tasks, and he knew that testing, making sure he was doing things properly, and good coding practices would help the entire process go smoothly.
So this is sort of a start-to-finish summary of it. It's not that long of a read, and he talks about his learning journey, in which he includes some great podcasts, including ours.
Also, an awesome book on testing, and I know the author for that one.
Not just plugging our own stuff, he's got some great stuff in here.
He starts off with just the basics, for people new to testing: what a basic test function looks like and having good structure. But then he talks about how he wanted to
do static analysis and code style checks. So he uses Black within his testing. And when he was
talking about using Pylint, I don't use Pylint every day, so I didn't know that
it's a very comprehensive check, but it takes some time on large codebases. I didn't know
that. But he has a cool
hack that he puts in place
for check-in
tests: only lint
modified files. Oh, that's cool.
Because of course, if they're unmodified, then why would they
have a different outcome?
Right. And then he
incorporates Flake8 to do
docstring testing, to make sure that people are using consistent docstring styles.
He covers all of his tox.ini configuration changes.
He was trying to increase his code coverage,
so he includes coverage.py,
but then also has a --cov-fail-under flag that he adds for testing, to make sure that if code coverage
drops below a certain point, it fails the test. And then
he just gradually ratchets that up.
His target was 75%. It even goes into fixtures
and mocks and spies and stubs, and then even a
cool tool called pytest-vcr,
which records your network interactions
and then replays those for future test runs.
And he saw a 10x speedup from that.
That's really cool.
There's so much cool stuff in here.
pytest-vcr, that's really cool.
I think the only problem with it is like
maybe a lot of folks using it have no idea what VCR means.
Oh, yeah, that's true.
I mean, even, yeah.
Yeah, but no, it's awesome that you just record the network interactions and don't have to depend on anything at all.
I love it.
And the recordings are done on a per-test basis.
So if you rerun an individual test, it only plays back the recording for that portion.
It doesn't have order dependency built
in, which is cool. Yeah, super cool. I love it. Yeah, that's a really nice article, Josh. Well
done. The last one I want to talk about was sent over by Kevin Books. Now, we've covered a few of
the language-level learning things recently. We talked about the CPython
bytecode compiler, either last time or the
time before that, how it doesn't really optimize stuff. And maybe there's some opportunities there,
but more just to understand what's going on. So Kevin sent in a message and said, hey, I'm basically
a C, C++ guy. And I saw the del keyword in Python and it threw me for a loop, because del seems like delete in C++, which
means free memory. But it doesn't necessarily mean that in Python. So it even seems like some
of the books out there are being a little misleading, at least according to Kevin's reading
of them. So I thought I'd just pull up an article that he sent over and then talk a little bit
about some of the uses for del.
Great. I don't use it, so this will be good.
Yeah. So the context where I know del is: I want to get something out of a list, or I want to get
something out of a dictionary. Right. Okay. And it's a little bit weird. It's like the in keyword,
right? A lot of times I would expect some operator to be on the object I'm modifying, right? Like list or, you
know, string dot in or something, and you give it the value, right? But you say string, space, in, space,
the variable, right? So it's a little bit funky that you apply it not on the object, but as a
keyword in the language, and del's like that, right? So if I have a dictionary and I want to
remove a key, not set it to nothing, but make it not be in the keys collection, you can say del, dictionary, bracket, the key,
as if you're accessing that value, but putting the del there takes it out. Oh, okay. Yeah. And you
can also do that for lists. So I can go in and remove something from a list if I want.
There's a remove function on the list, but somewhat confusingly, potentially, it's by value,
right?
So I could say,
remove Jeff from the list and Jeff will no longer be in that list wherever he
appeared.
But if I want to say,
remove the third thing,
there's no remove at or anything like that,
right?
I can't pass 2,
that's not a value,
right?
So del will let me remove that.
You can also use pop for that,
I believe, on the list,
but del's a little more general purpose.
And you can also delete slices.
So I could say, go to this list and take out everything from 2 to 5.
Yeah.
You know, 2 colon 5, like that.
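The dictionary, list, and slice uses of del just described look roughly like this. It is a minimal sketch with made-up data; the comments track the state after each step.

```python
# Remove a key from a dictionary: the key is gone, not set to None.
ages = {"jeff": 41, "sarah": 35}
del ages["jeff"]
jeff_present = "jeff" in ages

names = ["ann", "jeff", "bob", "jeff", "cy"]

# list.remove works by value and removes only the first match.
names.remove("jeff")       # names is now ['ann', 'bob', 'jeff', 'cy']

# del works by position: take out the third item (index 2).
del names[2]               # names is now ['ann', 'bob', 'cy']

# pop also removes by index, and additionally returns the removed item.
popped = names.pop(0)      # popped is 'ann', names is now ['bob', 'cy']

# del with a slice removes a whole range (indexes 2, 3 and 4 here).
numbers = list(range(8))   # [0, 1, 2, 3, 4, 5, 6, 7]
del numbers[2:5]           # numbers is now [0, 1, 5, 6, 7]

print(jeff_present, names, popped, numbers)
```

Note that the slice end is exclusive, so 2 colon 5 removes three items, not four.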
All right, so these are all pretty interesting.
Now, I'm linking over to the official docs that talk about it.
And this article that kind of talks through some of these examples and shows you how to use it.
You can also delete a variable out of like a local or a global
namespace. So if there's a variable that's been defined and you want it to not be defined,
I can say del space variable name. And now it's as if I didn't do that line that defined it,
right? That created it. Does it remove it from the namespace?
It removes it, yeah. It doesn't free the memory necessarily, but it takes it out as a global
variable.
Okay.
Interesting.
Or a local one, right?
Yeah.
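Here is a small sketch of deleting a name, using a local variable inside a function (the function and variable names are just for illustration). After del, the name is no longer in the namespace and using it raises NameError, but the object itself survives because another name still refers to it.

```python
def demo():
    value = [1, 2, 3]
    alias = value                       # a second name for the same list

    before = "value" in locals()        # the name exists
    del value                           # unbind the name only
    after = "value" in locals()         # the name is gone

    try:
        value                           # as if it had never been defined
    except NameError:                   # UnboundLocalError is a subclass
        raised = True
    else:
        raised = False

    # The list was not destroyed: 'alias' still refers to it.
    return before, after, raised, alias


print(demo())
```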
So does it actually free any memory, right?
It depends, right?
So let's say it's a global, right?
Obviously, the thing that is the value of that variable, it's taking up some memory.
Even if nothing else is pointing at it, right,
it's still going to be around, because that global variable is pointing at it. But if you call del on that
variable, you'll drop that one reference to it, putting the reference count at zero and
freeing it up. So theoretically, you could free up memory using del. Similarly, if it's in a list or a dictionary,
and the only thing that has a reference to it is that list itself, and you
delete it out of there, it goes away, right, memory-wise. But if something else is pointing at it, then
obviously it's not going to go away. Yeah. We also talked about how the CPython bytecode compiler
is dumb, dumb as in not super optimizing, maybe on purpose. And I think you could also, you know, if
you're really dealing with memory issues and you're like, I really wish this thing would just go away sooner
in this one little edge case,
you could probably use del to put in some of the optimizations
that you might hope that the compiler itself might do but doesn't.
Like dereference a thing as soon as it's used within a function,
before you get to the end, or things like this.
Yeah, okay.
So is it for memory?
Sort of, not really.
But maybe as a side effect.
Yeah.
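The "sort of, as a side effect" answer can be observed with a weak reference, which watches an object without keeping it alive. This sketch assumes CPython, where an object is reclaimed as soon as its reference count reaches zero; other interpreters may free it later. Blob, probe, and holder are illustrative names.

```python
import weakref


class Blob:
    """Stand-in for some large object we would like to see freed."""


blob = Blob()
probe = weakref.ref(blob)   # observes the object, adds no strong reference
holder = [blob]             # a second strong reference, held by a list

del blob                    # the name is gone, but the list still holds it
alive_while_listed = probe() is not None

del holder[0]               # drop the last strong reference
alive_after = probe() is not None

print(alive_while_listed, alive_after)
```

So del only removes one reference; the memory is released when the last reference goes, whichever statement happens to remove it.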
This has been a long time, but I do remember it tripping me up because I was like, it seems
a lot like delete, which should have a matching new to it.
Exactly.
Exactly.
We've both done the C++ thing, right?
Like, where's the new that goes with del?
I've never seen a new.
Anyway, it's pretty cool.
There's a couple of links here. There's the official documentation, there's the article Understanding
Python's del, and then there's the reference to that bytecode compiler piece people can check out. Yeah,
in C++ I don't think there's a way to remove a name from a namespace. Yeah, I don't think so
either, right? You can make it point at null, but that's about it, right? Yeah. But I mean, you've got to think about it, right?
Like classes, you could delete a field out of a class, right?
Because it's just a dictionary, right?
So much of Python is built on like dictionaries, right?
Like the variable names are the keys in the dictionary, and their values
are the values.
So you just take it out of the global dictionary effectively, right?
Yeah.
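A quick sketch of the "it's just a dictionary" point: instance attributes live in the instance's __dict__ and class attributes in the class namespace, and del removes the entry in either case. Point and Config are made-up classes for the example.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class Config:
    debug = True               # a class-level attribute


p = Point(1, 2)
fields_before = dict(p.__dict__)   # {'x': 1, 'y': 2}

del p.y                            # removes the 'y' entry from p.__dict__
fields_after = dict(p.__dict__)    # {'x': 1}

del Config.debug                   # removes the name from the class itself
has_debug = hasattr(Config, "debug")

print(fields_before, fields_after, has_debug)
```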
Okay, cool.
Pretty sweet.
So those are our main items
for today. You got anything else you want to chat about, Brian? I'm just glad it's summer. It's
starting to feel nice, feels like summer. But other than that, not much. How about you? Summer's
awesome. It makes programming hard, because programming is indoors. Although some of my
friends and I who work from home, we try to get out and program in like a coffee shop or a cafe
by a lake or
something. And periodically, we have the weird experience of getting a sunburn while writing
code. And yeah, we've dubbed it a code burn. And it's kind of a badge of honor.
That's funny. Cool.
Yeah. So there's actually a couple of things I want to throw out here. We recently had Max
Sklar from the Local Maximum podcast. And afterwards, he had me on his podcast. So
I'll be on episode 73, which should be out. Not yet, but thanks to time shifting, when this episode
comes out, it should already be out. I'll put a link to that. Josh Thurston sent over a cool
video of the popularity of languages on Stack Overflow over time as a bar chart race. I didn't know about bar chart races,
but these are basically animated bar charts over time. And you just watch the bars grow and shrink.
And it's really cool. Python is kind of like a little tiny consideration at the bottom. And
obviously, we know that Python is crushing it on popularity and Stack Overflow and all those things.
So it's like a minute and a half video. I think everyone will appreciate watching it if they just got a minute to kill.
No, it's a fun video.
And one of the things I enjoy about it is early on,
you see the Java bar going up and down
based on the time of the year
because it was used in education a lot.
That totally made sense.
Exactly.
You're like, oh, there's a huge spike in September.
I wonder why.
Maybe a bunch of people got a job. No, like CS101 is now back in session.
Yeah.
Exactly. Then the last one I want to throw out is this thing called Pynsource.
So what this is, this comes to us from Anders Klint.
It's basically a UML diagram creation tool for Python code.
So you give it some Python files.
It will generate a UML diagram that shows the relationship of all the classes in there.
Oh, that's cool.
Yeah, it's pretty cool.
There's a free, maybe even open source version.
And then there's also a paid version.
So you can buy it.
I'm actually not a huge fan of UML.
But if you have Python code and you think a UML diagram
would help describing it, this thing's pretty cool actually.
And it's a little GUI app.
There's a bunch of screenshots.
You can check it out and see if it'll help you, but it looks pretty neat.
And it does proper UML, not just like sort of visualization of classes.
So that's kind of nice.
My favorite use of these kinds of diagrams is to print them out and pin them to your
wall, your cubicle wall, so that other programmers think that you're smarter than they are.
Absolutely.
Put some little cryptic notes on them, like as if, you know, you're marking them up.
Yeah, absolutely.
Love it.
Yeah, so you can do this with your project.
Yes, this huge thing is our project.
Anyway, it's pretty cool.
And there's a free version, like I said.
So maybe it'll help some folks out there.
All right. You ready for some jokes, Brian?
Yes, definitely.
All right. You've heard about the glass being half full and half empty and like,
oh, I'm a half empty sort of person. I kind of see the world as slightly negative.
Yes.
So here's the developer version. So we have an optimist who says the glass is half full. We have the pessimist who says the glass is half empty. And we have the programmer who says
the glass is twice as large as necessary.
Yes, definitely.
So I wanted to extend that with the pragmatist that says that I'm just allowing enough room for requirements oversight, scope creep, and schedule overrun.
That's right.
It's perfect.
I love it.
And then you have this other one about software startups.
Yeah, man. It's not really about any startup, but I watched The Upside with Kevin Hart last night, and there was a joke that I couldn't help but share. I can't remember the characters, but Kevin's character said, would you invest in my business idea? And the other guy says, that seems too niche. Kevin: what's niche mean? Oh,
it's the girl version of nephew.
It's terrible.
I love it. That's bad.
If you got to ask, that's a pretty good
answer. Yeah. Cool.
Well, thanks for putting all
the cool topics together, as always, and being here.
Thank you. Bye. Thank you for listening to
Python Bytes. Follow the show on Twitter via
at Python Bytes.
That's Python Bytes as in B-Y-T-E-S.
And get the full show notes at PythonBytes.fm.
If you have a news item you want featured, just visit PythonBytes.fm and send it our way.
We're always on the lookout for sharing something cool.
On behalf of myself and Brian Okken, this is Michael Kennedy.
Thank you for listening and sharing this podcast with your friends and colleagues.