Python Bytes - #25 Could we have more in-database machine learning please?

Episode Date: May 12, 2017

Topics covered in this episode: Python in SQL Server 2017: enhanced in-database machine learning Stack Overflow Trends tool We asked 20,000 people who they are and how they’re learning to code Be...eware: A request for your help Extras Joke See the full show notes for this episode on the website at pythonbytes.fm/25

Transcript
Starting point is 00:00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is episode 25, recorded May 10th, 2017. I'm Michael Kennedy. And I'm Brian Okken. And we've gathered up a bunch of cool Python things to share with you this week. So Brian, I want to start with some news coming out of Microsoft's biggest developer conference this year. There's some actually Python news, which is kind of cool. You wouldn't expect that, right?
Starting point is 00:00:27 Right. Yeah, they actually did a whole section on machine learning and AI and some very cool things. But what I want to talk about is one of the biggest databases in the corporate space, the most popular ones, is Microsoft SQL Server. So the thing that I want to point out or talk about is they've just announced a very
Starting point is 00:00:47 interesting feature. And I'm kind of hoping other database providers copy this like straight away. So what they've announced is in-database machine learning. Wow. So yeah, that's crazy, right? Like, wait a minute, what does in-database machine learning even mean? So here's the idea. Like, if you're going to transfer a lot of data, machine learning or otherwise, and you've got one server over here with your data and another server that's executing it, then you've got the network latency,
Starting point is 00:01:16 you've got the crossing process boundaries, you've got all sorts of latency working there. So especially if you have a chatty API, this can be problematic. But what this new feature is, they have now built the ability to run CPython 3.5 in process in SQL Server. And you can install external packages.
Starting point is 00:01:36 It comes built in with some of the machine learning packages already there. It runs a subset of the Anaconda distribution included right there. So inside your database, you can basically install Python scripts and do full-on machine learning with zero latency to your data. That's pretty cool. I think it's really cool. You might have to go back to teaching people about Microsoft products.
Starting point is 00:01:59 Yeah, I'm not so sure I'm going that far. In fact, what I would really like to see is other database providers take this on and go, this is a cool idea, can we put this in other places? Like on MySQL, on MongoDB. I think it would be super cool to see it there. I mean, you kind of have that with SQLite in reverse, where your database runs in your machine learning process rather than your machine learning running in your database process. But if you're already, for some reason, using SQL Server and you want to do machine learning, check this out.
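The "SQLite in reverse" point can be sketched with the standard library alone. This is a minimal illustration, not anything shown in the episode: the `purchases` table, its columns, and the sample rows are made up for the example. The database engine is loaded into the Python process, so rows reach the analysis code without a network hop.

```python
import sqlite3
import statistics

# An in-memory SQLite database: the engine runs inside this Python process.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented purely for illustration.
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("ann", 12.5), ("ann", 14.0), ("bob", 9.0), ("bob", 11.0), ("bob", 10.0)],
)

# Rows come straight back as Python tuples; no server round trip is involved.
rows = conn.execute("SELECT customer, amount FROM purchases").fetchall()
conn.close()

# Group the amounts per customer, then compute a simple statistic in Python.
amounts_by_customer = {}
for customer, amount in rows:
    amounts_by_customer.setdefault(customer, []).append(amount)

mean_spend = {c: statistics.mean(a) for c, a in amounts_by_customer.items()}
print(mean_spend)
```

The SQL Server feature described above goes the other way: the Python code is sent to where the data lives instead of the data being pulled to where the code runs.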
Starting point is 00:02:31 This is a pretty cool feature. Yeah, that's pretty neat. All right. Awesome. Okay. That's really neat. I want to talk about some real fake stuff and actually a tool called Faker. So there's an article to introduce.
Starting point is 00:02:43 Faker has been around for a while, but there's a new article on the Semaphore blog called generating fake data for Python unit tests with Faker. I had heard of it and I hadn't played with it before. So the article is pretty neat, but I played around with it just this afternoon. And what Faker is, is a way to, you know, basically generate data for you, just random stuff but in the right format. The list of stuff that you would want to do to be able to fill out a set of data to make it look real and to test a system. That's cool.
Starting point is 00:03:33 Without having to... Yeah, I see two major uses for this. I agree, like Faker is awesome. It's no joke. So basically you install Faker and you can go to it and say, give me some words, give me a name. And if you say Faker dot name, it'll be like Joshua Wheeler. Give me a month,
Starting point is 00:03:50 give me a sentence, and I'll give you a sentence. Give me a state, Michigan, give me a random number, like you can ask for all these different things. One of the really good uses for this is if you're doing web development, and you don't have any data yet, it is super hard to even write the code to process the sequences, but also very hard to do the design of like, well, how is this supposed to look? And having real-ish data makes that process so much easier. And it's really easy to go, give me a month, give me a year, give me a state, things like that, and generate fake data with this. The other one obviously is with testing, right? Like instead
Starting point is 00:04:25 of having like all the trouble of coming up with these things for the fake pieces of data you're going to pass, and you don't necessarily want to hard code it. Maybe that's going to put some dependency on that hard coded value in your test. Like just run Faker across your objects and fill them up. It has some things in it that you don't really think about. Like I ran the phone number a few times and it listed phone numbers with extensions, phone numbers with dashes, phone numbers without, phone numbers with parentheses and stuff that you probably should deal with, but might not come up with on your own. And then I was looking through and one of the neat things is it has
Starting point is 00:05:00 py structures too. Under the py section, you can generate a py dictionary, or basically get a dictionary or a tuple or a set. And it just comes up with random tuples and random dictionaries. It's pretty cool. Oh, wow. How cool. I didn't even know about the py section.
Starting point is 00:05:17 You can also switch it to multilingual. So US English, Japanese, Italian, Russian. And so if you were like doing localization, like what would it be like if I got a Russian name in here? Would my system still work? Like, well, try it. So that's pretty cool. I like it. Yeah.
Starting point is 00:05:35 If you need fake data, check out Faker. Seems funny to say, but, you know. Yeah, indeed it does. So, Brian, I totally skipped over your first one with Stack Overflow Trends. That's pretty exciting. Oh yeah. So let's go ahead and talk about it. Stack Overflow came out with a tool called Stack Overflow Trends, and in the article that they have to introduce it, the first example that they show is Python overtaking PHP for questions asked per programming language. Of course, they only compared PHP,
Starting point is 00:06:15 Perl, and Python. And Perl, apparently nobody asks questions about Perl. Yeah. Perl is not a growing area of study, is it? I think, you know, the closest analogy to this would be like, what does Google Trends do? That does that for searches. This is the same type of tool, but for Stack Overflow popularity. Yeah, I think it's neat to look at what kind of questions people are asking and how that grows. And there was definitely a steep... so Python was fairly flat from like 2008 through 2012. Yep. And then a sharp curve up just starts taking off.
Starting point is 00:06:56 So, it's cool. Yeah, it's really great. It's just like somebody flipped a switch in 2012 and, you know, Python is growing. It's awesome. Yeah, so if you want to study things, definitely this is a place to go do it. You know, maybe you're looking at, like, what should we base our next project on? What are the future trends in programming technologies? This is a good tool for that. And it's great to see that highlighting Python, the growth and popularity. Yeah. So normally we would have like a sponsor spot right now.
Starting point is 00:07:25 Yeah. But there's like this quiet period, right? Yeah. Yeah. So no sponsor this week, you guys. We have upcoming sponsors. They're kind of playing stuff out in sort of sparsely. But if you're out there and you're like, hey, my company wants to get the word out to Python developers, send us a message. Just go to the contact page on pythonbytes.fm and we'd love to talk to you. All right. I get questions all the time from people who are learning to code. And one of the guys on Twitter, Alan Jones, sent us a message about a pretty cool Medium article that really is very data-driven about people learning to code.
Starting point is 00:07:59 So this article is called, We Asked 20,000 People Who They Are and How They're Learning to Code. So that's a lot of people. Yeah. Now, they said, all right. They probably did Skype or something because that would be a really big phone bill. Yeah, I'm going to mail you a letter.
Starting point is 00:08:14 No. So they said, all right, who participated? Well, there's 20,000 people who did this survey, and most of them have been coding for less than five years. 62% live outside the US. This is interesting. Their average age of people learning to code is 28. So I get messages all the time like, hey, I'm 30.
Starting point is 00:08:32 There's no way I can learn to code. Like you're with these 20,000 other people, right? It's not that uncommon. That's actually the average age. And if you're over it, right, you know, it's still a lot of definitely an age range. There's many interesting pictures in this article and graphs. It's a data analysis type thing. And they've got average age to learn to code by country.
Starting point is 00:08:54 So you look at like France and the UK, and those guys are in their 30s on average. You look at India and they're in their teens on average, which, I don't know what that means, but that's interesting. Another interesting stat that I thought we could pull out is 19% are women. While obviously that is super low compared to where it should be, right, that should be 50%, still, 19% is, I guess, higher than I expected. And it kind of made me happy because I feel like it's a positive trend, even if it's not where it should be. Yeah. The average person learning to code has been coding for 21 months,
Starting point is 00:09:30 and 25% of them already have the first job. So there's a bunch of cool stats like this that you can go and pull out. So check out that article. We asked 20,000 people who they are and how they're learning to code. And almost 59% wanted to become full stack web developers. Yeah, it's interesting, right? Like the web definitely factors heavy with data science being number two. So you can imagine, this is not a Python only study, right?
Starting point is 00:09:56 These are just people learning to code, but you can imagine Python is playing a heavy role in those two areas. They also have a podcast section, which is kind of cool. What do you mean? They have a section of what podcasts people who are learning to code listen to. Okay. Are we on there?
Starting point is 00:10:10 TalkPython is. TalkPython is. But I didn't find Python Bytes, unfortunately. But that's because we're still letting them know. Yeah. Well, I'm glad that TalkPython is on there. That's pretty cool. Congratulations.
Starting point is 00:10:22 Thank you. Would you say that it's an anomaly that Python Bytes wasn't on there? I think it's just because we're new. We don't really teach people how to code though. No, no, no. I think this is... Oh, you were trying to do a transition. Oh, that's so cool. I was trying to. So our next item is about anomaly detection. Yeah. Anomaly detection. You have to forgive me. It's almost midnight here in Munich. That's right. You're still on your German tour. Yeah.
Starting point is 00:10:46 Two more days. There was a really great article, and I should have written the person's name down, called Introduction to Anomaly Detection. And it's kind of a link to Emmanuel Ruf. Emmanuel, I can pronounce that part. But using Python, but using it for an interesting piece of a need for data analysis is anomaly detection. Basically looking at a whole bunch of data from something and finding the ones that you don't really know what the trend is going to
Starting point is 00:11:21 be, but the ones that don't fit, whatever the trend is for everything else. And it's actually just a fascinating couple of pages on here. And there's code samples. I'm not doing it justice talking about it, but it's definitely a well thought out, well studied article from datascience.com. Yeah. They have a couple of areas that they focus on. They've got the types of categories of anomalies, like the ones you might think of, which they call point anomalies. So detecting credit card fraud based on amounts spent. Like, I live in the US.
Starting point is 00:11:57 Somebody tried to buy $1,000 worth of lumber in Mexico with my card. No, that's probably not OK. Real story. It is, yeah. So then they have contextual anomalies. So they say, like, sometimes these things make sense, but only within a context. So for example, spending a hundred dollars on food every day is totally reasonable on a vacation, but it's odd if you're not on vacation. So can you determine, are they on vacation, right? Or collectively, like copying
Starting point is 00:12:25 tons of data off network servers might look like you're trying to steal data if it knows that you're doing this all over the place, but copying one big file would mean nothing, right? Yeah. So they basically break it down by those three categories. It's pretty interesting, all the machine learning based approaches and stuff. Yeah, and the math behind it, like the moving averages and the k-nearest neighbor and k-means algorithms. Oh, nice. Things like that. Yeah, absolutely. Very, very cool. I think I'm going to use the k-nearest neighbor just in random conversation tomorrow, just to make me sound smarter. Where should we go to eat? I don't know, but we're going to have to apply the k-nearest neighbor to these restaurant choices and figure it out. Yes, definitely.
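A point anomaly of the credit-card kind can be sketched with a moving average. This is a minimal illustration using only the standard library, not the code from the article: the function name, the window size, the threshold, and the `daily_spend` figures are all assumptions made for the example.

```python
import statistics


def moving_average_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that sit more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent)
        if stdev == 0:
            # A perfectly flat window: any change at all counts as a deviation.
            if values[i] != mean:
                anomalies.append(i)
            continue
        if abs(values[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies


# Made-up daily card spend with one suspicious purchase at index 7.
daily_spend = [20, 22, 19, 21, 20, 23, 21, 1000, 22, 20, 19, 21]
flagged = moving_average_anomalies(daily_spend)
print(flagged)  # → [7]
```

One limitation of this simple version is that the outlier itself enters the window for the next few days and inflates the standard deviation, so a second unusual purchase right after it could go unnoticed. The k-nearest neighbor and k-means approaches mentioned above are not shown here.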
Starting point is 00:13:07 So I want to close this out with a message from the BeeWare guys. So BeeWare is a cool project that really does a bunch of fairly unique things. So it supports running Python apps on things like iOS and Android, macOS apps that are native .app files in Python, two alternate Python implementations, some cross-platform widgets, and a couple of other things.
Starting point is 00:13:33 So it's done by Russell Keith-Magee, and it's been going on for about four years, so really great. And he posted a thing that said, a request for your help. So basically he's been working for a company that's largely funded the development or the furthering of these projects, right? So they've got like, extensive improvements for like this cross code compiler, an Android backend, a Django backend for like these Toga apps that can be run as web apps, or a local Windows Forms .NET UI for Toga. So you can have like a Windows app that has a modern
Starting point is 00:14:07 natural appearance on Windows, all sorts of cool stuff, right? So cool project. And so obviously, with the request for help, you know, what is up, right? Well, his contract ended. So now he's like, I don't have all this time and energy I can put in here, I got to go back to work. And the reason I'm bringing it up is we've got a lot of projects that are looking at different funding models to allow people to work on it, right? There's the pretty standard, I'll create a project and then try to sell consulting on top of the project. There's more interesting like platform as a service type things that people are doing. So the Redash guys that I talked about last week on Talk Python have like hosted versions of their open source thing. We've got the Scrapinghub guys
Starting point is 00:14:51 with Scrapy doing their infrastructure as a service or platform, well, web scraping as a service, right? All these are very interesting. So basically Russell says, hey, could you sponsor my project? And, you know, one, check out his page. You can become a member and give him $10 and help keep this moving because he's doing a bunch of cool stuff. But also, if you have a project and it needs funding, think about what he's up to. Does it
Starting point is 00:15:16 make sense for your project? Things like that. I think it's a neat idea. I do too. So I definitely think the BeeWare project has huge possibilities for where it could help people. And certainly if you just want to work on an open source project, people ask me all the time, like, hey, can you recommend a project I could work on? Because I just want to get started on something. I don't really know enough to pick something myself.
Starting point is 00:15:39 I think BeeWare is a really good one. They have a very welcoming, explicit way of onboarding people who are new to open source. So that's also a way to help them out. Definitely. All right. Well, good luck to all of them. Yeah, good luck. That would be cool to see that keep growing because it's doing cool stuff over there in that project.
Starting point is 00:15:54 All right. Well, I have one shout out for us out of my own personal news, Brian. So there's a brand new PyCon, not major main PyCon, but a regional PyCon. And it's in a pretty sweet place. So I'm starting to think I might have to attend this. So if you check out PyCascades.com, it's in Vancouver, BC in January this year, the next January. So if you want to go up to the Pacific Northwest, one of the more beautiful cities around here,
Starting point is 00:16:29 they have things like a PyCon hike, as well as all the talks and stuff. You can check it out at PyCascades.com. That sounds like a lot of fun. Yeah. Actually, being in the Northwest, you'd think I'd have been to Vancouver. I haven't ever been there. Here's your chance. So I might have to go up there. Yeah, we might have to just jump on the train and go up there.
Starting point is 00:16:42 Yeah. Sounds good. Well, I've been... The book is very close, so I've been working late evenings here getting it ready, working with my editor. Right now the beta is supposed to be available right before PyCon, or, I guess, technically at the beginning. Awesome. It's on the 17th, so next Wednesday. Yeah, so come check us out at our booth. Meet Brian, talk to him about his book,
Starting point is 00:17:09 and you should be ready by then. You'll be looking probably tired. Yeah, and my family's going to be a little irritated with me. I haven't slept for three days. So. Yeah, perhaps. All right, well, Brian, thanks for chatting with me and sharing all this news with everyone.
Starting point is 00:17:24 Yeah, thank you. You bet. Bye. Bye. Thank you for listening to Python Bytes. Follow the show on Twitter via at Python Bytes. That's Python Bytes as in B-Y-T-E-S. And get the full show notes at PythonBytes.fm.
Starting point is 00:17:39 If you have a news item you want featured, just visit PythonBytes.fm and send it our way. We're always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.
