Python Bytes - #53 Getting started with devpi and Git Virtual FS
Episode Date: November 22, 2017
Topics covered in this episode:
Exploring Line Lengths in Python Packages
NumPy: Plan for dropping Python 2.7 support
How to Learn Pandas
Microsoft and GitHub team up to take Git virtual file syste...m to macOS, Linux
Getting started with devpi
Marketing-for-Engineers
Extras
Joke
See the full show notes for this episode on the website at pythonbytes.fm/53
Transcript
Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
This is episode 53, recorded November 21st, 2017.
I'm Michael Kennedy.
And I'm Brian Okken.
And we've got a ton of cool things we've picked for you today.
Well, six plus a few more, but it feels like a lot of good stuff to share with you guys.
So I'm looking forward to it. How about you, Brian?
I'm really looking forward to it, yeah.
Yeah, definitely.
So before we get into that, let's say thank you, Rollbar.
If you think there are errors lurking in your app and you want to get notified right away,
go to pythonbytes.fm slash Rollbar and check it out.
Tell you more about that right now.
I want to know your philosophy on line length.
Are you a strictly 79 or less sort of person, Brian?
I'm trying to do the 79 thing, but it's really short.
So we use 120 in my work group at work.
What do you guys use, 44-inch HD TVs for your monitors or what?
It's still pretty good to have it a little bit shorter, so that you can do side-by-side diffs more easily and stuff.
Yeah, for sure.
But 79 is really tight.
It is tight.
How about you? What do you use?
I guess I do stick to 79 pretty much.
You know, if the editor says,
hey, this one's too long,
you should reformat it according to PEP 8.
I guess I do,
but I feel like it has a tendency
to put pressure on you to make bad decisions.
For example, if you have like an expression
involving like say five variables
and like a string,
like you're say formatting a string
and it would encourage you to make those variables super short and non-descriptive
so they fit within 79. But if they're long and descriptive, that line might be 100, right? And so I feel
like there's this pressure, but I guess I succumb to it anyway.
Things that I share on GitHub or something, I try to keep to 79, but I don't know if it's a good idea or not. Mostly because
I do testing and stuff,
people will run flake8 over my code and say, "Hey dude, how come your code doesn't..."
Exactly, "you failed the build." Yeah. So there's an article from Jake
VanderPlas. He's the astronomer guy who did a PyCon talk, the keynote.
Yeah, he did a keynote there. And I think he also did another talk, but yeah,
it was great. He's up at the University of Washington doing all sorts of cool astronomy stuff. So what
does he have to say about line lengths? Because of Twitter's switch from 140 to 280 characters,
he was intrigued by the statistics and did an exploration of line
lengths in Python packages. And he did it as a Jupyter Notebook-type article
so that you can kind of follow through all of his stuff
and mostly looking at NumPy, SciPy, Pandas,
scikit-learn, Matplotlib, and Astropy.
So I didn't know about Astropy,
but that makes sense because I'm not an astronomer.
Yeah.
How often do you analyze telescopic images
with machine learning? So far,
zero times. It sounds fun though, doesn't it? Yeah. But it's kind of a neat look at basically,
I wouldn't know how to do this right off the bat. I mean, but it's pretty simple to write a little
bit of code to import a bunch of modules and check out the line lengths and examine that and graph it
and plot it and clean it up.
It's a pretty cool article.
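A minimal sketch of the kind of measurement described here, not the code from the article: it walks a directory, counts how many lines of each length appear in its .py files, and reports the share of lines over a limit. The function names `line_length_counts` and `share_over` are hypothetical, and the demo runs on a throwaway directory rather than on a real package.

```python
import collections
import pathlib
import tempfile


def line_length_counts(root):
    """Count how many lines of each length appear in the .py files under root."""
    counts = collections.Counter()
    for path in pathlib.Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            counts[len(line)] += 1
    return counts


def share_over(counts, limit=79):
    """Return the fraction of lines longer than `limit` characters."""
    total = sum(counts.values())
    too_long = sum(n for length, n in counts.items() if length > limit)
    return too_long / total if total else 0.0


# Self-contained demo: one short line and one 104-character line.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "sample.py"
    sample.write_text("x = 1\n" + "y = " + "1" * 100 + "\n", encoding="utf-8")
    demo_counts = line_length_counts(tmp)
    demo_share = share_over(demo_counts)
```

To look at an installed package instead, the same function could be pointed at the directory that holds the package's source files, and the resulting counts could then be plotted as a histogram.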
And then also just sort of looking at it, it looks like most of them, they follow a distribution, a... Normal distribution?
It's not exactly normal, but it's...
An abnormal distribution?
An abnormal, a log normal distribution.
That's it.
Oh, wow. Okay.
That's a little bit more statistics than I understand, but it's sort of normal, I guess.
It follows a log-normal distribution, except that there's an artificial bump near the right side, the 80-character side, because many of these packages are trying to hit 80 or less.
But there's an argument there that you don't really need the limit, because code naturally
fits anyway. It's a cool look at it. I was thinking about using the code from the article to
take a look at our code at work, to see where our line lengths are.
Yeah, that'd be an interesting analysis, to run some PEP 8 style metrics across your organization. You know, I
think some enterprising listener out there should build a little
package we can all drop in and do cool stuff like that with.
Yeah, and at the end of the article, he asks where different popular packages fit into the line
length distribution. So that'd be neat, right? And other languages: how does this compare for, say, JavaScript versus C++ versus Python? Things like that. Also interesting to know, but I don't have
those answers, so they're open questions for now.
So it's yet another good day for
modern Python, and the sun continues to set on legacy Python. This time
around, you mentioned this package just previously: NumPy.
Yeah, there's some interesting news with NumPy.
So NumPy is dropping support for legacy Python.
They say, you know, we know that the Python core developers are dropping support for Python
2 in 2020. It's still an open question on the day. I like that guy who voted for the keynote of PyCon 2020 as the official
end date. But who knows what day it is? It hasn't been officially announced. But they say basically
this requirement to continue supporting Python 2 makes it harder and harder to advance NumPy.
And so they're going to drop it. I think that's great. I can see that. It's such an important
library. And, you know, data science is definitely moving towards Python 3.
And so their plans are December 31st, 2018.
Up until then, they're going to support Python 2 and Python 3 100%.
And that's not very far away.
What is that, like 41 days?
No, that's 41 days and a year.
So a little bit of time on that one.
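The "days and a year" figure can be checked with a little date arithmetic. This is a sketch that assumes the count starts from the recording date given at the top of the episode (November 21, 2017) and ends on the December 31, 2018 date quoted above. Counted this way it comes out to 40 days left in 2017 plus a full year, close to the number mentioned.

```python
from datetime import date

recorded = date(2017, 11, 21)               # recording date stated in the episode
full_support_ends = date(2018, 12, 31)      # end date quoted in the episode

# Total time until the quoted date, and the part that falls in 2017.
remaining = full_support_ends - recorded
days_left_in_2017 = (date(2017, 12, 31) - recorded).days
days_in_2018 = remaining.days - days_left_in_2017
```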
And then January 2019, all new features will be Python 3 only.
And then the year after that, I guess when Python 2 support goes out,
it probably goes out of here as well.
It isn't just a spiteful thing.
They've got real reasons to do it
because the increased burden of trying to be Python 2 compatible is unreasonable.
Yeah, definitely.
It means there are features that are not in NumPy because it has to work on Python 2.
Right.
So it's time to say thank you, but goodbye to Python 2, they say, which is, I think, great.
Speaking of data science, one thing I've tried to learn a lot,
but haven't done a great job of is pandas.
Actually, pandas and like kind of the whole data science tool chain.
It's something I'm curious about, but I'm not sure how to go about it. So I really liked this article from
Ted Petrou about how to learn pandas and how to go about it. It's his opinion, of course,
but it seems like a pretty reasonable approach. He recommends
reading the documentation and reading about pandas and how it works, but then also jumping back and forth and using it for small projects.
And I guess with any tool, that makes sense.
But he gives a little bit more detail on how to do that, so that you can jump back and forth and know what to
learn first.
Yeah, I think one of the challenges that I have learning pandas (I can sort of
do a few things with it, but not a lot) is that I don't really have a project to use it on. I just
kind of poke at it and go, Oh, okay, it does this cool stuff. But you know, like, I just haven't
done like data science-y things or financial analysis things. So he talks about things like
here's some Jupyter notebooks,
here's some Kaggle kernels and data sets in the form of,
these are data sets in the form of Jupyter notebooks.
So some concrete ways to play with it,
not just, you know, fire it up and poke at the API.
Yeah, or maybe go back to that Jake article
and examine your line lengths.
Exactly. There's an example.
And then one of the things I thought was a nice ending
is when you think you have it fairly well,
go a little bit further
and then start answering some questions on Stack Overflow
and kind of measure yourself against the other things
that people are running into problems with.
I think that's a cool idea.
That is a cool idea.
And the people on Stack Overflow
will let you know if you're wrong.
Yeah, definitely. It's one of the nice and not nice things about the internet is the best way
to find out whether you're right about something is to post the wrong answer.
Yeah, people don't really hold back on you too often, do they?
Yeah, no, no, you get that right away.
Yeah, if you have a thick skin or if you're willing to grow a thick skin,
then that's actually a great way to do it. Yeah. Reddit would probably also work too. Also, I'm sure the data science people are similar, but the Python community as
a whole is fairly gentle with people. They'll tell you you're wrong, but they'll be nice about
it and probably use more words than you wrote to explain why you're wrong
about it. Yeah. Maybe they'll have a good explanation of your misunderstanding and you can connect some more dots, right?
I depend on that a lot.
Nice.
All right, before we get to the next one, which is some more social coding stuff, I just want to say thank you to Rollbar.
If you have a web application and it's running on the internet, it's probably crashing at some point.
And it would be great to know about that.
Like, how often do you go back and read logs? Like, do you go and read logs at your work
very often, Brian? Actually, more than I want to. Yes. I'm in a manager role, so I get to tell other
people to do it. Here's a problem in the log. Go fix that. Yeah. But you don't want to have to
depend on reading that, right? If you could avoid it and just get the notifications right away,
that'd be awesome. So I normally talk about Rollbar in the context of Python, and that's totally true, but it actually supports 26 languages and
frameworks. So Python, obviously: Flask, Django, Pyramid, etc. But also Node and .NET, and it even has a Flash
plugin and client-side JavaScript. So totally cool. Like whatever you're using, you can use
rollbar. It's awesome. And they have this thing called people tracking. So for example, on like
my training site, people are logged in. And if there's a crash, I can emit a little thing that
will tell rollbar, this is the user that had this error. So not only do I know what the error was,
I can actually go back and send that person a message, say, I saw you run into a crash,
and here's how I fixed it. Like, whoa, I didn't even tell you what happened. That's kind of
creepy, but awesome. So anyway, if you want to be creepy and awesome, check out pythonbytes.fm slash rollbar and solve the problems before your users
even tell you about them. All right. So one of the things that came out recently was an announcement
from Microsoft and GitHub. I'm not sure in what order this came about, but
I think it started at Microsoft, and they want to use Git.
Okay. So everybody wants to use Git because Git is awesome. But the problem is they actually have
some pretty large projects and it turns out they tried to use Git and it was basically unusable
for some of their projects at Microsoft. So Brian, you're probably thinking Git was built for Linux
and Linux is a huge project,
right? Yeah. So what's up with these Microsoft people? They must be doing it wrong. And I kind
of actually thought that when I read this first as well, but it turns out if you look at the Linux
kernel, it's like 640 megs of data in the source code repository in Git. That's big, right? That's
quite big. But it turns out that if you look at the Visual Studio tools, those are three gigabytes, which is about five times bigger than Linux.
And they're trying to use it for that. And that was kind of a little sketchy,
but then they wanted to use it for Windows. And apparently the repository for Windows is
270 gigabytes, or 421 times larger than Linux. Wow. No wonder it's slower. That's a little bit bigger.
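The size comparisons can be reproduced from the three figures quoted here. This sketch assumes decimal units (1 GB = 1000 MB), which is the reading under which 270 GB works out to the 421 mentioned. The variable names are only for illustration.

```python
MB_PER_GB = 1000  # assumption: decimal units, not 1024

linux_mb = 640                       # Linux kernel repository, as quoted
visual_studio_mb = 3 * MB_PER_GB     # Visual Studio tools, as quoted
windows_mb = 270 * MB_PER_GB         # Windows repository, as quoted

# How many times larger each repository is than the Linux kernel's.
vs_ratio = visual_studio_mb / linux_mb
windows_ratio = windows_mb / linux_mb
```

Under that assumption, the Visual Studio figure is a bit under five times the Linux one, and the Windows figure truncates to 421.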
And there's 4,000 people committing to it like all day as their job, right? So it's got a lot
of contention as well. And so what they've done in the announcement is Microsoft and GitHub team up
to create a Git virtual file system. And the GitHub part is mostly to make this work on other platforms, macOS and Linux and
things like that. So what they did is they said, look, the problem is, we literally have like,
I don't know how many, thousands, maybe millions of files when we do a checkout. So
when they did a regular Git checkout, it would take 12 hours to clone the repository, three hours to do
just a straight checkout of a branch, eight minutes to run git status, and 30 minutes to commit
one file. So it was pretty broken. And they said the reason it's broken, primarily, is that there are
all these files, and generally you're only working with a little subset of them. So what they
did is they created a virtual file system that understands Git
repositories, and it only checks out a metadata list, like a directory listing.
Wow, cool.
And then if you interact with it, it basically will create those files by getting them from the
server on demand. And it doesn't have to be like some plugin, it's like at the file system level.
So if I open up a command prompt, or I open up some editor, or I just type, like, gcc,
and it has to touch 10 files, that will automatically get them from Git if they weren't there. Isn't
that crazy? It sounds a lot like ClearCase before ClearCase started to suck. Yeah, exactly. So
they built this for Windows and they got really good success. They said instead of 12 hours to
clone it, it takes 90 seconds. Instead of eight minutes to do a Git status, it takes three seconds.
Instead of 30 minutes to do Git commit, it takes eight seconds. And so they've actually been pushing about half of these changes back upstream into Git. And they've been working with the Git
developers to make this a general thing, not a Microsoft thing, which I think is pretty noble.
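The on-demand idea described above (list everything from metadata, download a file only when something first touches it) can be modeled in a few lines. This is a toy sketch, not the actual Git virtual file system design: the class `LazyCheckout`, the `fake_server_fetch` function, and the file paths are all hypothetical, and a real implementation works at the file system level rather than through a Python object.

```python
class LazyCheckout:
    """Toy model of a working tree whose file contents are fetched on demand."""

    def __init__(self, listing, fetch):
        self._listing = set(listing)   # paths known from metadata only
        self._fetch = fetch            # hypothetical callable: path -> bytes
        self._local = {}               # contents actually downloaded so far

    def listdir(self):
        # Listing files uses metadata only and never triggers a download.
        return sorted(self._listing)

    def read(self, path):
        if path not in self._listing:
            raise FileNotFoundError(path)
        if path not in self._local:
            # First access: pull the content from the server and keep it.
            self._local[path] = self._fetch(path)
        return self._local[path]

    def hydrated(self):
        """Paths whose contents have been downloaded."""
        return sorted(self._local)


fetch_log = []


def fake_server_fetch(path):
    # Stand-in for a network round trip to the Git server.
    fetch_log.append(path)
    return f"contents of {path}".encode()


tree = LazyCheckout(["kernel/io.c", "kernel/mm.c", "shell/ui.c"], fake_server_fetch)
names = tree.listdir()            # all three paths, nothing downloaded yet
first = tree.read("kernel/io.c")  # triggers exactly one fetch
again = tree.read("kernel/io.c")  # served from the local copy
```

The point of the design is that the cost of a clone or a status check scales with the files actually touched, not with the total number of files in the repository.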
That's definitely like a new Microsoft, not the old Steve Ballmer Microsoft.
Is it just for GitHub or can we use it with other Git?
This is just purely for Git. So they're pushing this back to the Git developers, not for GitHub.
But where GitHub comes into this is GitHub, maybe they have this problem for projects hosted on
GitHub, but people are already using those projects on GitHub. So it's probably okay,
but they're trying to sell enterprise GitHub, which is like a box you put in your company
to run those things. And these enterprise projects can be like huge, like this Windows problem.
And so GitHub is trying to basically expand this to Linux and Mac OS so that they can make that
part of their enterprise story.
That'd be cool. I'd like to have it be part of the GitLab experience as well.
That'd be good.
Yeah, absolutely.
Yeah, so hopefully this makes it back into Git proper.
And then the OS support can come from Microsoft and GitHub.
That'd be awesome.
Yeah, this is pretty cool, actually.
I'll keep an eye on this.
Yeah, yeah, we'll see where it goes.
But they've already got demos and stuff working for Microsoft Windows.
And there's actually a 10-minute little video as they work through this stuff, you can check it out. It's really short. I think that
as well. Speaking of downloading stuff from servers and getting your libraries all put
together: I don't know if I'm just dense or what, but multiple times I've tried to set
up a devpi server for caching PyPI stuff locally. I need this partly for
a laptop setup, for while you're on a plane or something, but also behind a firewall,
so my build server doesn't have to go outside the firewall and stuff like that.
I'd like to have a local one. And I ran across this article. I haven't actually gone through it;
I was going to do that this morning. But it looks pretty good. It's from Stefan Scherfke, and it's called Getting started with devpi.
It walks through it. Basically, he had the same problem.
He needed to set up a local server again.
Couldn't remember how to do it.
The documentation is okay, but it still has some issues.
And so he just sort of walks through the whole thing and shows you how to do it in at least one use case,
which is pretty close to what I think most people need,
which is mostly mirroring the packages from PyPI
that your company actually uses,
not everything, just the stuff you're using,
and then also being able to store your own local things there.
Yeah, that's a great combination.
I think the caching bit is really nice.
Like you can just point at this thing and it'll just pass through and get the ones from the full PyPI, right?
And then you can tell it to refresh occasionally and stuff.
And then you can also just push up your own local ones so that you can share your own stuff around.
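The two behaviors mentioned here, passing requests through to the public index with caching and hosting your own packages alongside, can be illustrated with a small model. This is a conceptual sketch only and does not use devpi's API: `CachingIndex`, `fake_upstream`, and the package names are hypothetical, and serving private packages ahead of upstream ones is a choice made for this sketch.

```python
class CachingIndex:
    """Toy model of a local package index that mirrors an upstream on demand."""

    def __init__(self, upstream_fetch):
        self._upstream_fetch = upstream_fetch  # hypothetical: name -> bytes or None
        self._cache = {}     # packages mirrored from upstream so far
        self._private = {}   # packages uploaded locally

    def upload(self, name, payload):
        # Store an in-house package that never leaves the local server.
        self._private[name] = payload

    def get(self, name):
        if name in self._private:
            return self._private[name]
        if name not in self._cache:
            payload = self._upstream_fetch(name)
            if payload is None:
                raise KeyError(name)
            self._cache[name] = payload   # later requests stay local
        return self._cache[name]


upstream_calls = []


def fake_upstream(name):
    # Stand-in for a request to the public index.
    upstream_calls.append(name)
    public = {"requests": b"requests-wheel"}
    return public.get(name)


index = CachingIndex(fake_upstream)
index.upload("ourtool", b"ourtool-wheel")

first = index.get("requests")    # goes upstream once
second = index.get("requests")   # served from the local cache
internal = index.get("ourtool")  # never touches upstream
```

Only packages that someone actually asks for end up in the cache, which matches the "just the stuff you're using" behavior described above.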
I think that's a really great thing that probably not too many organizations are doing.
If you have different teams working on different packages,
like you can actually publish it to like your company
through these things, which is pretty awesome.
You could also have a PyPI whitelist.
So that might be really positive
given some of the recent security scares we've had there,
right, depending on how paranoid you are.
Part of the article is talking about user management.
For me, I'd probably set up things for all my local dev team
plus the build to be able to get things.
But he was having it locked down to just the build server being able to do it,
which is an interesting idea as well.
Nice.
So the last thing I want to cover this week is what I think a lot of people who are developers
or work for a company building a product that are kind of new to it, sort of a technical
company, maybe miss, which is the whole marketing side of software, right?
Like the hardest thing about making something successful, if it's a web app,
or it's a regular app, or it's a SaaS thing, or whatever, is not building it. Building it may be
challenging, but that is not the hardest thing. The hardest thing is getting people to notice it
in a busy world and getting the word out. The whole marketing side of stuff that most of us
developers are not super good at. So there's this GitHub repository called Marketing for Engineers.
And it's a curated collection
of marketing articles
and tools to grow your product.
That's nice.
Yeah, isn't that cool?
So these guys,
they created some kind of iOS app
and they're like,
it took us almost two years
to learn how to market our project.
It was painful.
So we're trying to help that.
So they said, look,
we're going to come up
with a bunch of resources
that help you solve practical marketing tasks, such as finding better users,
growing your first user base, advertising your product without a budget, all those different
things. So they have a whole bunch of different areas that if you're new to this, you know,
you can really learn a lot from like how to market on social media, where are the right places,
how to leverage Quora, how to leverage Product Hunt, business models, all kinds of stuff. So I thought that might be
useful. There's about 4,000 people who have starred it on GitHub. They probably also thought
it was useful. It's a huge list. Yeah, it's massive. Yeah. One of the things on there that
I saw, it's near the top, is doing things that don't scale, which I love that advice.
Yeah, I do. I like that as well. Yeah, definitely do things that don't scale.
As I was writing the pytest book, I tried to help out as many people as possible on the
Slack channel. A couple of times, I just asked people,
"Hey, are you available? Can I just call you on the phone?" And I just talked to people about their
issues with pytest and with testing.
Now, clearly you can't do that on a huge scale, but when you don't have any end users at all yet, it's pretty easy.
Yeah, for sure.
And the behavior creates super advocates for you.
And it also lets you realize some of the challenges.
So like maybe in the final version of your book, it reflects some of those challenges
that that one person had, but maybe there's a thousand or more people who actually have it. They didn't call you; they just read your book, because you
already got it right. I love this because a lot of us nerds didn't become nerds because we really
like talking with people. I used to laugh at the people in business school. Now I'm kind of like,
huh, they probably know something, don't they? Yeah. Oh, those guys don't know calculus like
nothing. Oh, I see how it's going for them.
All right.
Anyway.
Awesome.
So that's it for this week.
Those are tons of fun things.
Thanks for sharing them, Brian.
You have one more bit of crazy sort of American flavored shopping madness around Python for us, right?
Yeah, I guess I forget that.
Yeah, there's plenty of listeners outside of America.
But one of the traditions we have is a Black Friday sale, which has spilled over into online things as well.
So starting the day after Thanksgiving, usually, but we're doing it, I think, a little early here.
Maybe not.
If anybody doesn't know, I wrote a book.
I've been talking about it for a year, so you probably do. But Python Testing with pytest is published through Pragmatic, and Pragmatic has a book sale going on from the 22nd through December 1st, and you get 40%
off all eBooks. That is awesome. Yeah. So get in there and get it. The reviews are awesome for that
book. Is this a global thing, even though it's the sort of terminology and date is US inspired?
Can people all over the world come and get it for 40% off, whatever it is?
Yeah.
To get the discount, just use coupon code TURKEYSALE2017.
Awesome.
All right.
Well, go and get that book.
If you've been on the fence.
One more thing that just came up.
I had somebody, somebody actually from the Testing Slack channel again, asked me if I could mention PyCon Colombia.
So tickets are available.
They're going to have their first Colombia PyCon in Medellin on February 9, 10, and 11 of 2018.
So we'll put a link in, but it's pretty easy to find.
So that'll be fun.
Yeah, awesome.
Check it out if you're down in South America.
It could be a good time.
Or if you want to go visit there, right?
How about you?
Do you have any news to share with us?
I have no news.
There's no news for me.
I'm actually working on some stuff.
I don't want to, I don't want to announce it yet, but absolutely got some cool things
that I'm working on.
Always trying to like juggle too much, which is kind of the curse of my personality, but
it's fun.
You're doing a lot of cool stuff though.
I can't wait to see.
Oh yeah.
Thanks.
Back on the PyCon Colombia thing,
they have a really cool logo.
So if anybody's going to that,
if you could snag me a t-shirt, that would be cool.
Yeah, order the t-shirt.
They come with a logo.
Well, thanks for talking to me this year.
You bet.
Great to chat with you, Brian.
And everyone as well, thank you for listening.
See you later.
Thank you for listening to Python Bytes.
Follow the show on Twitter via at Python Bytes.
That's Python Bytes as in B-Y-T-E-S.
And get the full show notes at PythonBytes.fm.
If you have a news item you want featured,
just visit PythonBytes.fm and send it our way.
We're always on the lookout for sharing something cool.
On behalf of myself and Brian Okken, this is Michael Kennedy.
Thank you for listening and sharing this podcast with your friends and colleagues.