Python Bytes - #235 Flask 2.0 Articles and Reactions

Episode Date: May 26, 2021

Topics covered in this episode: Flask 2.0 articles and reactions; Python 3.11 will be 2x faster?; 3 Tools to Track and Visualize the Execution of your Python Code; DuckDB + Pandas; Extras; Joke. See the full show notes for this episode on the website at pythonbytes.fm/235

Transcript
Starting point is 00:00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is episode 235, recorded May 26, 2021. And I'm Brian Okken. I'm Michael Kennedy. And I'm Vincent Warmerdam. We talked about Vincent a while ago and got his name wrong. And he told us a good story: we accidentally pronounced his name, what, Wanderman. Yes.
Starting point is 00:00:27 So sorry about that. That's fine. It's fine. I was bragging to my wife that I was on the podcast and then I was announced as Vincent Wanderman and she's still kind of philosophical about the whole thing. But it was a fun introduction. It's the best mispronunciation of my life. Let me put it that way.
Starting point is 00:00:42 It's your alter ego. It's like your spy name. I'll take it. Well, thanks for joining us today. My pleasure. Should we jump into the first topic? Sure. Okay. Well, I think we mentioned last time that Flask 2.0 was out. And then, Michael, you talked with somebody, didn't you? I did. I had David Lord and also Philip Jones on Talk Python to basically announce Flask 2.0 and talk about all the features. Yeah. And that was a great episode.
Starting point is 00:01:19 I listened to both of those. I listened to that. It was great. What I wanted to cover was an article and a video. So first off, we've got a link to the change list. Actually, I lost the change list. Yeah, there it is. So you can read through that, and maybe that's exciting to you, but I like a couple of other ways. So there's an article by Patrick Kennedy, Async in Flask 2.0. And I really like this article.
Starting point is 00:01:50 It goes through kind of describing what it means to have Async in Flask and how it works with some nice little diagrams. Diagrams are always nice. Yeah. Oh, yeah. Pictures. Yes. And then a description of the ASGI and why we don't need it yet.
Starting point is 00:02:10 And I'm not sure what the timeline is for Flask, if they're going to do it more. But there is a discussion that it's not completely async yet. There was a lot of discussion with David and Philip that they may be leaving Quart to take the place of a full-on ASGI Flask. And the idea being that there's a lot of stuff that kind of has to change, especially around the extensions, and you get nearly that, but not exactly that, by using the gevent async stuff that's in regular Flask. And that integrates in if you just do an async def method in your regular Flask. But if you want true asyncio integration,
Starting point is 00:02:53 then they basically were saying, for the foreseeable future, instead of import Flask and going with that, just import Quart, and wherever you see Flask, replace it with the word Quart. Okay. But there's other cool stuff other than the async that's coming into Flask 2.0.
Starting point is 00:03:10 So I appreciate it. There's also a video, we don't want it to play, from Miguel Grinberg, talking about some of the new stuff in Flask. And I really like this. One of the things that he covers right away is the new route decorators. Yeah, those are nice.
Starting point is 00:03:31 It might be just a syntax thing, but it's really nice. So you used to have to say app.route and then methods equals POST, or list the methods. And now you can just say app.post. That's nice. And then a really clean discussion of the WebSocket support with Flask.
Starting point is 00:03:48 And then he goes on to talk about the async. And with that he also does a little demo timing it. And I was actually surprised at how easy it was to set up this timing demo. He showed that you could increase the users and still, it doesn't really increase your response time, or how many requests per second you get doesn't increase, because of the way that Flask 2.0 is done. But it was nice. And then he also talked about some of the extensions that he
Starting point is 00:04:25 wrote that work with Flask 2.0 and stuff. So it was definitely worth the listen. Well, that's always cool. That's always the thing: Flask is a pretty big project, so when there's a new upgrade of that, one of the things that people sometimes forget is, oh, all the plugins, do they kind of still work? So it's nice if someone does a little bit of the homework there and says, here's a list of stuff that I've checked and that's at least compatible. Well, he's mostly doing some... So, for instance, one of the things is around which...
Starting point is 00:04:53 I don't know which... Some of the WebSocket stuff has changed and some of the other things have changed. And he had some different shims, he was recommending some things before, but now you don't have to swap out some things. So, for instance, some of the extensions allowing for WebSockets required you to swap out the server for a different server, and you don't have to anymore.
Starting point is 00:05:17 Ah, like that, right. Okay, cool. Yeah, a couple of other big things that come to mind. One, they've dropped Python 2 support and even 3.5 and below. I mean, we're at this point where 3.5 is like old school legacy, which surprises me. It still feels new. Yeah, I remember when it came out. Yeah, yeah.
Starting point is 00:05:36 Well, that was when async and await arrived, right? So that was a big deal there. But it doesn't have f-strings, so it's... Yeah, that's the killer feature. Yeah. So there's that. And they also said that you are not going to need to change your deployment infrastructure if you want to run async Flask. You can just push a new version and it's good to go. So yeah, a lot of neat things there. Very good. Nice. What do we got next, Michael?
Starting point is 00:06:02 Well, what if Python were faster? That would be nice. That's always good. We actually talked about Cinder. Remember Cinder? Yeah. From the Facebook world. So that's one really interesting thing that is happening around Python, and there's a lot of cool stuff here. But remember, this is not supported. It's not meant to be a new runtime. It's just there to give ideas and motivation and examples and basically to run Instagram. On the other hand, Mike Driscoll tweeted out, hey, Python might get a two-times speedup in the next version of Python. And you might want to check out Guido's slides from the Python Language Summit at the Virtual PyCon. That's exciting, right?
Starting point is 00:06:45 Yes. I mean, if Guido is saying it, then, you know, odds of it happening increase, right? Exactly. Exactly. So a while ago, we actually covered what has now become known as the Shannon plan for making Python faster a little bit each time over five years, over the next four, at least, I guess, four years at that point, and how to make that happen. So some of these ideas come from there. And so here I'm pulling up the slides. And it says, can we make CPython faster? If so, by how much? Could it be a factor of two? Could it be a factor of 10? And do we break people if we do things like this? So the Shannon plan, which was posted last October and we covered, talks about how do we make it 1.5 times faster each year, but do that four times. And because of compounding performance, I guess, yeah, it's five times faster.
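The compounding arithmetic mentioned here is easy to check. A minimal sketch, assuming a constant 1.5x gain per release and four releases, which are the figures quoted in the discussion:

```python
# Figures quoted in the discussion: 1.5x per release, repeated four times.
yearly_speedup = 1.5
releases = 4

# Cumulative speedup after each release: 1.5, 1.5^2, 1.5^3, 1.5^4.
cumulative = [yearly_speedup ** n for n in range(1, releases + 1)]
total = cumulative[-1]

for n, factor in enumerate(cumulative, start=1):
    print(f"after release {n}: {factor:.4f}x")

print(f"total: about {total:.1f}x")
```

The product comes out just over five, which is where the "five times faster" figure comes from.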
Starting point is 00:07:37 All right. So there's that. Guido said, thank you to the pandemic. Thank you to boredom. I decided to apply at Microsoft. And shocker, they hired him. So as part of that, it's kind of just like, hey, we think you're awesome. Why don't you just pick something to work on that will contribute back? That'd be really cool. So his project at Microsoft
Starting point is 00:07:57 is around making Python faster, which I think is great. Cool. So yeah. So there's a team of folks, Mark Shannon, Eric Snow, and Guido, and possibly others who are working with the core devs at Microsoft to make it faster. It's really cool. Everything will be done on the public GitHub repo. There's not like a secret branch that will be then dropped on it. So it's all just going to be PRs to github.com slash the Python slash CPython, whatever the URL is, the public spot. And one of the main things they want to do is not break compatibility. So that's important. Also said, what things could we change? Well, you can't change the base object, like
Starting point is 00:08:38 Py, what is it? PyObj, basically the base class, right? PyObject pointer. That's it, the PyObject class. So that thing has to stay the same and it really needs to keep reference counting semantics because so much is built on them. But they could change the bytecode that exists, the stack frame layout, the compiler, the interpreter, maybe make it a JIT compiler to JIT compile the bytecode, all of those types of things. So that's pretty cool. And they said, how are we going to reach a two-times speedup in 3.11? An adaptive, specializing bytecode interpreter that will be more performant around certain operations, optimized frame stacks, faster calls, zero-overhead exception handling, and things like integer internals. So maybe treating numbers differently,
Starting point is 00:09:25 changing how PYC files work. So there's a lot of stuff going on. Also putting the dunder dict (__dict__) for a class always at a certain known location, because anytime you access a field, you have to go to the dunder dict, get the value out and then read it. And I suspect the first thing that happens is,
Starting point is 00:09:42 well, go find the dunder dict pointer and then go get the element out of it. So if every access could just go, nope, it's always, you know, one certain byte offset in memory from where the class starts, that would save that sort of traversal there. So some pretty neat things. Yeah, I'm glad you explained that because I read it before and I'm like, why would that help at all? I think you can traverse one fewer pointer. Yeah. And in general it doesn't matter, but for literally everything you ever touch, ever, if you could cut in half the number of pointers you've got to follow, that'd be good.
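To make the dunder dict point concrete, here is a small sketch of what it looks like from the Python side. The `Point` class is made up for illustration. This only shows that ordinary instance attributes are stored in, and read from, the per-instance `__dict__`; the C-level memory layout being discussed is not visible from Python code.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)

# Ordinary instance attributes are stored in the per-instance dictionary.
print(p.__dict__)

# Writing to the dictionary directly is visible through normal attribute access,
# which shows that `p.x` is resolved by looking the name up in `p.__dict__`.
p.__dict__["x"] = 10
print(p.x)
```

Every `p.x` therefore involves finding the dictionary first and then looking up the key, which is the extra hop the talk wants to make cheaper.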
Starting point is 00:10:14 Yeah, yeah. This is always one of those things that always struck me: when you're using Python, you don't think about these sorts of things. It's when you're doing something in Rust or something that you are confronted with the fact that you really have to keep track of where the pointer is pointing in memory and all that. You take a lot of this stuff for granted, so it's great that people are still sort of going at it and looking for things to improve there. Yeah, absolutely. You know, in C, you do
Starting point is 00:10:36 the arrow, you know, dash greater than sort of thing every pointer, so you're like, I'm following a pointer, I'm following a pointer. You know it, right? Here, you just write nice, clean code and magic happens. So let me round this out with who will benefit. So who will benefit? If you're running CPU intensive pure Python code, that will get faster because the Python execution should be faster. Websites should be faster because a lot
Starting point is 00:11:00 of that code is running in the Python space. And who will happen to use Python? Who will not benefit so much? NumPy, TensorFlow, Pandas, all the code that's written in C, things that are IO bound. So if you're waiting on something else, speeding up the part that goes to wait, doesn't really matter. Multithreaded code because of
Starting point is 00:11:17 the GIL at this point. But Eric Snow is also working on the subinterpreters, which may fix that and so on. I like the last bullet. Pretty neat stuff. There's some PEPs out there. I'll link to the tweet by Mike Driscoll, but that'll take you straight to the GitHub repo, which has the PDF of the slides. And people can check that out if they're interested.
Starting point is 00:11:38 I like the last bullet on the previous slide, of things that will not benefit: code that's algorithmically inefficient. In other words, if your code already sucks, it's not going to be better. It may be better, but it could be better. I was about to say, theoretically, it actually would go faster. Just not as much better as it could, right?
Starting point is 00:11:59 Yeah, it would still be like n to the power of 3 or something like that, but it would be a faster n to the power of 3. Yeah, yeah. It won't change the big O notation, but it might make it run quicker on wall time. That's right. Yeah. Yeah.
Starting point is 00:12:18 And Christopher Tyler out there in the live stream says, I know I still need to improve my code, but this would be great, right? I mean, it used to be that we could just wait six months. A new CPU would come out that's like twice as fast as what we ran on before. Like, oh, now it's fast enough. We're good. That doesn't happen as much these days. So it's cool that the runtimes are getting faster. Yeah, and I mean, let's be honest,
Starting point is 00:12:32 Python is also still used for like just lots of script tasks. Like, hey, I just need this thing on the command line that does the thing and I put that in cron. And like a lot of that will be nice if that just gets a little bit faster. And it sounds like this will just
Starting point is 00:12:43 be right up that alley. Yeah. And one of the things that I know has been holding certain types of changes back has been concern about slowing down the startup time. Because if all you want to do is run Python to make a very small thing happen, but there's a big JIT overhead and all sorts of stuff, and it takes two seconds to start and a nanosecond, microsecond to run, right? They don't want to put in those kinds of limitations and kill that use case either. So yeah, it's good to
Starting point is 00:13:10 point that out. All right, Vincent, you're up next. Cool. Yeah. So I dabble a little bit in fairness algorithms. It's a big, important thing. So I get a lot of questions from people like, hey, if I want to do machine learning and fairness, where should I start? And I don't think you should start with algorithms. Instead, what you should do is go check out this Python project called Deon. And the project's really minimal. The main thing that it really does is give you a checklist of stuff to check before you do a big data science project at a big company or an enterprise or something like that. And they're sensible things, sort of grouped together. So like, hey, can I check off that I have informed consent, and the collection bias, can I check all these things off? The main items are literally checkboxes. You
Starting point is 00:13:54 can check them off in the page to sort of get the feel, like, oh yeah, these are good. It goes further. So the thing is, this is an actual Python project. You can generate this as YAML for your GitHub project, so you actually have this checklist that has to be checked into Git, so you know that people signed off on it. You can actually see the checklist. You can even maybe see in your git log who checked it off. But what's really cool is two things: one, you can generate this checklist; two, you can also customize the checklist. So if you are at a specific company with certain legal requirements, this tool actually kind of makes it easy to customize this very specific checklist for data projects. But the
Starting point is 00:14:28 real killer feature, if you ask me, and again, all of these comments are good: is the data security well done, is the analysis reproducible, how do we do deployment, all of these things that usually go wrong and were obvious in hindsight. But the real killer feature is, usually you have to convince people to take this seriously. So what the website offers is an example list. So for every single item that is on this checklist, they have one or two examples. Typically, these are newspaper articles of places where this has actually gone wrong in the past. So if you need a really good argument for your boss, like, hey, we've got to take this seriously, there's a newspaper article you can just send along as well.
Starting point is 00:15:07 Oh, that's interesting. Yeah, I like it. And there's the fact that you can also generate Jupyter notebooks with this, and you can customize it a little bit. The people that made this, the company I think is called DrivenData. They host Kaggle competitions for good causes. That's sort of a thing that they do there.
Starting point is 00:15:22 But Deon is just a really cool project. I think if more people would just start with a sensible checklist and work from there, a lot of projects would immediately be better for it. Yeah, this is really cool. So things are, can you go to the very bottom of that page that you're on? Yeah. Sorry, just the checklist. Oh, right. Yeah. Yeah. So there's some examples like, make sure that you've accounted for unintended use. Have you taken steps to identify and prevent unintended uses and abuse?
Starting point is 00:15:49 So like you created a find-my-friends-in-pictures feature. So like, I want to find pictures my friends have taken of me. You could put it up and it would show you all the pictures your friends took. But maybe someone else is going to use that to, I don't know, try to phish you. Like, here's the picture of us together. Or I don't know, some weird thing. Use it for facial recognition and tracking when it had no such intent, right? Things like that.
Starting point is 00:16:11 I think for, and I might be, so it doesn't have this example. The best example of unintended use: there used to be this geo lookup company where you could give an IP address and it would give you an actual address. However, sometimes you don't know where the IP address actually is, so you just give the center point of, like, a U.S. state or the country.
Starting point is 00:16:27 So there used to be this house in the middle of Kansas, I think it was like the center point. But the thing is, they would get FBI trucks driving by and doing raids and stuff, because they thought there were criminals there, because the geo lookup service would always say, ah, the crooks at that IP address, that's this latitude-longitude place. Right, right. We had a cyber attack, it was from this IP address, raid them, boys. And of course it was just some poor farmer in the Midwest going, you know, yeah, no, just the geographic center, please stop raiding my farm. Yeah, but the story was actually quite serious. I think the person who lived there got death
Starting point is 00:17:05 threats at some point as well because of the same mistake. So this is stuff to take serious. The one thing that I did like
Starting point is 00:17:11 is the solution. I think now instead of it pointing to the house in Kansas, I think it points to the center of the three
Starting point is 00:17:20 big lakes in Michigan. I think it's just the middle of a puddle of water basically just to make it obvious to the FBI squ like no it's not a person living there
Starting point is 00:17:28 Yeah, but like, darn, these submarines, they've moved underwater, or whatever. But that's why you want to have a checklist like this. The thing with unintended use is, it's unintended, so you cannot really imagine it, but you at least should do the exercise. And that's what this list does in a very sensible way, and more people should just do it. And there's interesting examples too, you just have a look. There's also a little community around it as well, collecting these examples, and they have a wiki page with examples that didn't make the front page cut. So I definitely recommend anyone interested in
Starting point is 00:18:00 fairness start here. I was curious, you brushed by it fairly quickly: fairness analysis. Is that what you do? I just don't know what that means, so could you... Yeah. So, oh man, this topic deserves more time than I'll give it, but the idea is that we know that models aren't always fair, right? It can be that you have models that, for example, Amazon was a nice example. So they had a resume parsing algorithm that basically favored men because they hired more men historically. So the algorithm would prefer men. Oh, okay. That kind of fairness.
Starting point is 00:18:40 Okay. Historically, these have been our good employees. Let's find more like them. Exactly. And the thing is, you don't want an algorithm that's unfair. So there are these machine learning techniques and there's this community of researchers that try to look for ways like, can we improve the fairness of these systems? So you don't just optimize for accuracy. You also say, well, we want to make sure that subgroups are treated fairly and equally and stuff like that. So I dabble a little bit in this. There's this project I like to collaborate with.
Starting point is 00:19:06 I open source a couple of things with these people. It's called Fairlearn. The main thing that I really like about the package is that it starts by saying fairness of AI systems is more than just running a few lines of code. It starts by acknowledging that. But they have mitigation techniques and algorithms and tools to help you measure the unfairness. It's scikit-learn compatible as well. But having said all that, start here. Start with a checklist. Don't worry about the machine learning stuff just yet. Start here. But yeah,
Starting point is 00:19:34 pretty cool before we move on connor first there in the live chat says i'm glad the conversation of ethics and data science is enlarging i think it's important about what we make. Yeah, I agree. Totally. Now, before we do move on though, let me tell you all about our sponsor for this episode, Sentry. So this episode is brought to you by Sentry. Thank you, Sentry. How would you like to remove
Starting point is 00:19:55 a little bit of stress from your life? Do you worry that users may be having difficulties or encountering errors with your app right now? And would you even know it until they sent you that support email? How much better would it be to have the errors and performance details immediately sent to you, including the call stack and values of local variables
Starting point is 00:20:12 and the active user recorded right in the report? With Sentry, it's not only possible, it's simple. We actually use Sentry on our websites. It's on pythonbytes.fm. It's on Talk Python Training, all those things. And we've actually fixed a bug triggered by a user and had the upgrade ready to roll out as we got the support email.
Starting point is 00:20:30 They said, hey, I'm having a problem with the site. I can't do this or that. I said, actually, I already saw the error. I just pushed the fix to production. So just try it again. Imagine their surprise. Surprise and delight your users: get your Sentry account
Starting point is 00:20:42 at pythonbytes.fm slash Sentry. And when you sign up, there's a "got a promo code? Redeem it" section. Make sure you put Python Bytes in that section or you won't get two months of free Sentry team plans and other features, and they won't know it came from us. So use a promo code at pythonbytes.fm slash Sentry.
Starting point is 00:20:58 Yeah. Thanks. Thanks for supporting the show. Brian. Yeah. I like this one that you picked here. You like this? I like it a lot. It's very good. It has pictures, little animated things, and great looking tools. Yeah. So there's a, it was an article that was sent to us. I can't remember who sent it. So apologies.
Starting point is 00:21:14 But it's an article called 3 Tools to Track and Visualize the Execution of your Python Code. And, I don't know why, executing your code just seems funny to me. I know it just means run it, but, you know, chop its head off or something. Anyway, the three tools he covers are Loguru, and we don't cover this very much because I don't know how to pronounce it. L-O-G-U-R-U. It's log-uru or low-guru. Not sure.
Starting point is 00:21:45 And so Loguru is a pretty printer with better exceptions. So let's go and look at that. So it does exceptions like this. It breaks out your exceptions into colors. And it's just kind of a really great way to visualize it. And I would totally use this if I was teaching a class or something. This might be a good way to teach people how to look at trace logs and error logs.
Starting point is 00:22:12 This is fantastic. And if you're out there listening, not seeing it, you should definitely pull up this site because the pictures really are what you need to tell quickly. Yeah. That's one of the things I like about this article is that lots of great pictures. One thing out of curiosity so what I'm seeing here is that for example it says return number one divided by number two and then you actually see the numbers that were in those variables do you have to add like a decorator
Starting point is 00:22:35 or something to get this output or how does that work that's explained later maybe I don't remember where it's explained later I think yeah I think you just pull it in and it just does it, but I'm not sure. Okay. Interesting. Anyway.
Starting point is 00:22:51 So that's Loguru. Then there's Snoop, which is kind of fun, that has... Hold on to Snoop. Should have had this already. Anyway, with Snoop, you can see it prints lines of code being executed in a function. So it just runs your code and then prints out each line in real time as it's going through it. You would hardly ever want this, I think, but when you do want it, I think it might be kind of cool to watch it go along. And it's a,
Starting point is 00:23:23 you could also do this in a debugger. But if you didn't want to do a debugger, you can do this on the command line. One of the things that most debuggers have that is a little challenging is you'll see the state, and you'll see the state change, and you'll see it change again. But in your mind, you've got to remember, okay, that was a 7, and then it was a 5, and then it was a 3. Oh, right, yeah. Right?
Starting point is 00:23:50 And here, it'll actually reproduce each line, each block of code with the values over. If you're in a loop three times, it'll show going through the loop three times with all the values set. And that's pretty neat. Yeah. I would also argue, just for teaching recursion, I think this visualization is kind of nice. Because you actually see the indentation and the depth appear. So you can actually see this function is called inside of this other function, and there's a timestamp. So I would also argue this one's pretty good for teaching. I like it. In fact, Connor on the live stream says, I'm teaching my first Python course tomorrow, so
Starting point is 00:24:17 yeah, thanks for the timely article. And a real-time follow-up for Loguru: you have to import logger, and then you've got to put a decorator on the function, and then it'll capture that super detailed output. And that's probably exactly what you want because you don't really want to do that for everything, probably. So it'll be something you're working on that you want to trace. So Heartrate is the last tool that we want to talk about.
Starting point is 00:24:44 It's a way to visualize the execution of a Python program in real time. So this is something we have not covered before, but I thought there was a little video. Yeah. It kind of goes through and does a little heat map sort of thing on the side of your code. So when it's running, you can kind of see that different things get hit more than others. So that's almost like a profiler, sort of. Not speed though, it's just number of hits. Yeah. Okay. Yeah, I'm kind of on the fence about this, but it's pretty. So, yeah, same. But the Loguru one: amazing. I thought Loguru was also a general logging tool.
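The "number of hits" idea can be illustrated with nothing but the standard library. This is not how Heartrate is implemented and does not use its API; it is a minimal sketch using `sys.settrace`, and the helper name `count_line_hits` and the sample function are made up for illustration.

```python
import sys
from collections import Counter


def count_line_hits(func, *args, **kwargs):
    """Run func and count how often each of its own source lines executes."""
    hits = Counter()
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames that belong to other functions
        if event == "line":
            hits[frame.f_lineno] += 1
        return tracer  # keep receiving events for this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return result, hits


def total_of_evens(n):
    total = 0
    for i in range(n):
        if i % 2 == 0:
            total += i
    return total


if __name__ == "__main__":
    result, hits = count_line_hits(total_of_evens, 6)
    print("result:", result)
    for lineno, count in sorted(hits.items()):
        print(f"line {lineno}: {count} hits")
```

The lines inside the loop collect more hits than the lines around it, which is the same information a heat map in the margin would convey.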
Starting point is 00:25:26 Like it does more, I think, than just things for debugging. Yeah, I think it's a general logging tool as well. Okay. Okay. But I guess it logs errors really well. The logger.catch decorator. Okay. Could probably do other things with the logger then as well.
Starting point is 00:25:44 But having a good logging debugger catcher is always welcome. Yeah, absolutely. All right. Let's talk about ducks. I mean, Brian, you and I are in Oregon. Go ducks. Well, I know your daughter goes there. My daughter goes to OSU.
Starting point is 00:25:59 So go bees, I guess. Whatever, ducks. We're going to talk duck databases anyway. And data science. So Alex Monaghan sent over to us saying, hey, you should check out this article about DuckDB, which is a thing I'm now learning about. And its integration,
Starting point is 00:26:14 its direct integration with pandas. So instead of taking data from a database, loading it into a pandas data frame, doing stuff on it, and then getting the answer out, you basically put it into this embedded database, DuckDB, which is SQLite-like. And then, you know, sorry, you put it into a pandas data frame, but then the query engine of DuckDB can query it directly without any data
Starting point is 00:26:37 exchange, without transferring it back and forth between the two systems or formats. That's pretty cool, right? So let me pull this... Oh, that's Hannes. I know him. Nice. Yeah, he's from Amsterdam. Yeah, very cool. So here's the idea. We've got SQL on pandas, basically. If we had a data frame here, they have a really simple data frame, just a single array, but it could be a very complex data frame. And then what you can do is you can import DuckDB and you can say duckdb.query, and then you write something like,
Starting point is 00:27:08 so one of the columns is called A in the data frame and you could say select sum of A from the data frame. How cool is that? I don't know. Is it cool? It's very cool. So then there's also a to_df on the result.
Starting point is 00:27:24 So what happens here is this is parsed by DuckDB, which has an advanced query optimizer for things like joins and filtering and indexes and all that kind of stuff. And then it says, oh, okay. So you said there's a thing called mydf, which I'll just go look for in the locals of my current call stack and see if I can find that. Oh yeah, that is neat. So you can write arbitrary SQL, and this one looks pretty straightforward. Yeah, yeah. Okay. Interesting. But you can come down here and do more interesting things. Let's see, I'll pull up some examples. So they do a select aggregation group by thing. So select these two things and then also do a sum, min, max, and average on some part of the data frame. And then you pull it out of the data frame and you group by two of the elements. Right.
Starting point is 00:28:15 And they show also what that would look like if you did that in true pandas format. That's cool. And they say, well, it's about two to three times faster in the DuckDB version. That is interesting. That's interesting, right? But then they say, well, what if we wanted not to just group by, but we wanted a filter? Seems real simple, like where the ship date is less than 1998. No big deal. But because of the way that this can be really efficiently figured out by the query optimizer, it turns out to be much faster. So 0.6 seconds single-threaded, or it actually supports parallel execution as well. So
Starting point is 00:28:52 multi-threaded, they tested on a system that only had two cores, but it can be many, many cores. So it's faster 0.4 seconds when threaded versus 2.2 seconds, sorry, 3.5 seconds on regular pandas. But there's this more complicated, non-obvious thing you can do called a manual pushdown in pandas, which will help drive some of the efficiency before other work happens. And then they finally show one at the very end
Starting point is 00:29:18 where there's more stuff going on that the query optimizer does. So the threaded one's 0.5 seconds, regular pandas is 15 seconds. So all that's cool. And what's really neat is it all just happens like on the data frame. Yeah, there's two things about that that are pretty interesting. Like one is we shouldn't underestimate how many people are still new to pandas, but do understand SQL. So just for that use case, I can imagine, you know, you're going to get a lot of people on board. But the fact
Starting point is 00:29:42 there's a query optimizer in there that's able to work on top of pandas, that's also pretty neat, because I'm assuming it's doing clever things like, oh, I need to filter data, I should do that as early on as possible, and my query plan is doing some of that logic internally. And the fact is you can parallelize it, because pandas doesn't parallelize easily. Yeah, I don't know that it parallelizes at all. You've got to go to something like Dask. Yeah, I mean, so there are some tricks that you could do, but they're tricks. They're not really natively supported. Right, right.
Starting point is 00:30:11 But just having a SQL interface is neat. Yeah, yeah, this is pretty neat. And also, now I learned about DuckDB, so apparently that's a thing, which is pretty awesome. So it's in process, just like SQLite. It's written in C++11 with no dependencies, supposed to be super fast. So this is also a cool thing that, you know, maybe I'll check out unrelated to querying pandas. But the fact that you can, I think it's pretty cool. It's got a great
Starting point is 00:30:36 name. Yeah. You know, another database out there I hear a lot about, but I've never used, that I really have an opinion about, is CockroachDB. I'm not a huge fan of it, just on the name, although it has some interesting ideas. I think it's meant to communicate resiliency, that it can't be killed because it's geolocated and it's just going to survive. But yeah, ducks. I'll go with ducks. Yeah, I would agree. Yeah. And then out in the live stream chat, Christopher says, so DuckDB is querying on pandas data frames, or can you load the data, method chain with DuckDB, and reduce memory?
Starting point is 00:31:09 I believe you could do either. Like you could load data into it and then there's a to data frame option that probably could come out of it. But I think just very briefly. This is right on it. Yeah, go ahead. Doesn't, I might've just seen it briefly
Starting point is 00:31:22 while you were scrolling in the blog post, but I believe it also said that it supports the Parquet file format. It does. So the nice thing about Parquet is you can kind of index your data cleverly. Like you can index it by date on the file system. And then presumably, if you were to write the SQL query in DuckDB, it would only read the files of the appropriate date if you put a filter in there.
Starting point is 00:31:41 So I can imagine just because of that reason, DuckDB on its own might be more memory performant than pandas, I guess. Yeah, perhaps. That's very cool. Stuff like that you could do. Yeah. And then Nick Harvey also says, I wonder if it's read-only, if you can insert or update. I don't know for sure, but you can see in some of the places they are doing projections. So for example, they're doing a select sum min max average, like that's generating data that goes into it. And then the result is a data frame.
Starting point is 00:32:09 So you can just add into the data frame afterwards if you want to be more manual about it. Yeah. All right. Vincent, you got the last one? Yeah. So the thing is, I work for a company called Rasa.
Starting point is 00:32:20 We make software with Python to make virtual assistants easier to make in Python. And I was looking in our community showcase and I just found this project that just made me kind of feel hopeful. So this is a personal project, I think. So we have a name here, Amit and I'm hoping I'm pronouncing it correctly, Arvind. But what they did is they used Rasa kind of like a Lego brick, but they made this assistant, if you will, that you can send a text message to. Now, what it does,
Starting point is 00:32:48 I'll zoom in a little bit for people on YouTube that they might be able to see the GIF, but every 10 minutes, it scrapes the weather information, the fire hazard information, and I think evacuation information from local government in California meant to help people during wildfire season.
Starting point is 00:33:02 And they completely open-sourced this project as well. So there's a linked GitHub project where you can just see how they implemented it. And it's a fairly simple implementation. They use Rasa with a Twilio API. They're doing some neat little clever things here with like, if you misspelled your city, they're using like a fuzzy string matching library
Starting point is 00:33:22 to make sure that even if you misspell your city, they can still try to give you accurate information. But what they do is they just have this endpoint where you can send a text message to give me the update of San Francisco. And then it will tell you all the weather information, air quality information, and that sort of thing. And if you need to evacuate, it will also be able to tell you that. And what I just loved about this, if you look at the way that they described it, this was just two people who knew Python who were a little bit disappointed with the communication that was happening, but because the APIs were open, they just built their own solution. And thousands of people
Starting point is 00:33:54 use this. And what's even greater is that if your mobile coverage isn't great, watching a YouTube video or trying to get audio in can be tricky, but a text message is really low bandwidth. So for a lot of people, this is like a great way to communicate. And of course, I'm a little bit biased because I work for Rasa and I think it's awesome that they use Rasa to build this. But again, the whole thing is just open sourced. You can go to their GitHub and you can just, if I'm not mistaken, there's like the scraping job of the endpoints actually in here as well. But this is like exactly what you want, just a couple of open APIs and sort of citizen science building something
Starting point is 00:34:30 that's useful for the community. It's great. Yeah, I like it. And text message is probably a really good way to communicate for disasters, right? Yes. Possibly in a place where, you know, LTE has crashed, Wi-Fi is out, right? Like even if you're on EDGE, you know, text should still get there. Exactly. Unless you're on iMessage, then you're out of luck. No, I don't know, sort of. Well, yeah, I live in Europe, so I cannot comment on that, of course, but it's a little bit different here. But no, like the data service, you can just look in here. And this is like, again, I like these little projects that don't need anyone's permission to help people. Like that stuff, like, ah, this is good stuff.
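The misspelled-city handling mentioned a moment ago can be illustrated with a small sketch. This uses `difflib` from the Python standard library rather than whatever library the project itself uses, and the city list and the `match_city` helper are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical list of supported cities.
KNOWN_CITIES = ["San Francisco", "San Jose", "Sacramento", "Santa Rosa", "Oakland"]


def match_city(text, cutoff=0.6):
    """Return the best-matching known city, or None if nothing is close enough."""
    by_lower = {city.lower(): city for city in KNOWN_CITIES}
    # Compare case-insensitively and keep only the single best candidate.
    hits = get_close_matches(text.strip().lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


print(match_city("San Fransisco"))
print(match_city("sacremento"))
```

The `cutoff` value controls how forgiving the match is. A text-message bot would likely want to confirm the guess with the user when the match is weak.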
Starting point is 00:35:07 And the thing that I also really like about it is it's really just sending you a text message with like air quality information and like enough information. And that's good. It's not like they're trying to make like a giant predictive model on top of this or anything like that, but just really doing enough and enough is plenty. Like that's the thing I really love about this little demo. And of course using Rasa, which is great. But this is the kind of stuff that, this is why I get up in the morning, projects like this. That's fantastic. Yeah, I love it. That's a
Starting point is 00:35:36 really good one. Brian, is that it? Yeah, that's it. It's our six items. Any extras that you want to talk about? I might have one. Okay. I'm totally tooting my own horn here, but this is a project I made a little while ago. But I think people might like it. At some point, it kind of struck me that people were making these machine learning algorithms and they're trying to, on a two-dimensional
Starting point is 00:36:01 plane, trying to separate the green dots from the red ones from the blue ones. I just started wondering, well, why do you need an algorithm if you can just maybe draw one? So very typically, you got these like clusters of red points and clusters of blue points. And I just started wondering, maybe all we need is like this little user interface element that you can load from a Jupyter notebook. And maybe once you've made a drawing, it'd be nice if we can just turn it into a scikit-learn
Starting point is 00:36:24 model. So there's this project called human learn that does exactly this. It's a tool of little buttons and like widgets that I've made to just make it easier for you to like do your domain knowledge thing and turn it into a model. So one of the things that it currently features is like the ability to draw a model, which is great because domain experts can just sort of put their knowledge in here. It can do outlier detection as well because if a point falls outside of one of your drawn circles, that also means that it's probably an outlier.
Starting point is 00:36:52 But it also has a tool in there that allows you to turn any Python function, like any custom Python written function, into a scikit-learn compatible tool as well. So if you can just declare logic in a Python function, that can also just be a machine learning model from now on. There's an extra fancy thing, if people are interested. I just made a little blog post about that, where I'm using a very advanced coloring technique using parallel coordinates. Very fancy technique. I won't
Starting point is 00:37:20 go into too much depth there, but what's really cool is that you can basically show that a drawn model can outperform the model that's on the Keras deep learning blog, which I just thought was a very cool little feature as well. The project's called human-learn. It's just components for inside of your Jupyter notebook to make sort of domain knowledge and human learning and all that good stuff better. And also with the fairness thing in mind, I really like the idea that people sort of can do the exploratory data analysis bit and at the same time also work on their first machine learning model as a benchmark. That's what human-learn does.
Starting point is 00:37:54 So if people are sort of curious to play around with that, please do. It's open source, pip installable. Please use it. I'm impressed. This is cool. It is, right? It's really cool. Yeah. Maddy, out in the live stream, asks, how does it handle ND data?
Starting point is 00:38:07 And I guess it's three or larger. Yeah, so you can make, like, so if you have four columns, you can make two charts with two dimensions. That's one way of dealing with it. And there's, like, a little trick where you can combine all of your drawings into one thing. If you go to the examples, though, the parallel coordinates chart that you see here, that has 30 columns, and it works just fine. I do think 30 is probably the limit. But the parallel coordinates chart, I mean, you can make a subselection across multiple dimensions. That just works. It's really hard to explain a parallel coordinates chart on a podcast, though. I'm sorry. Yeah, so this is like a super interactive visualization thing with lots
Starting point is 00:38:43 of colors and stuff happening. Sorry, you have to go to the docs to fully experience that, I guess. But again, also, let's say you work for a fraud office and someone asks you like, hey, without looking at any data, can you come up with rules for what's probably fraud? And you can kind of go, yeah, if you're 12 and you earn over a million dollars, that's probably weird. Someone should just look at that. And the thing is, you can just write down rules that way. And that can already be turned into a machine learning model.
Starting point is 00:39:08 You don't always need data. And that's the thing I'm trying to cover here. Like, just make it easier for you to declare stuff like that. It's a more human approach. Nice. Brian, I cut you off.
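As a loose illustration of the idea of turning a hand-written rule into something that behaves like a model, here is a plain-Python sketch. It is not human-learn's actual API. The `RuleModel` class, the `suspicious` rule, and the sample records are all hypothetical, and the class only mimics the `fit`/`predict` shape that scikit-learn models have.

```python
class RuleModel:
    """Wrap a plain Python rule so it can be called like a fitted model."""

    def __init__(self, rule):
        self.rule = rule

    def fit(self, X, y=None):
        # Nothing is learned from data: the rule is the model.
        return self

    def predict(self, X):
        # One 0/1 label per input record.
        return [int(self.rule(row)) for row in X]


def suspicious(row):
    # Hypothetical rule from the conversation: very young and very high income.
    return row["age"] <= 12 and row["income"] > 1_000_000


people = [
    {"age": 12, "income": 2_000_000},
    {"age": 45, "income": 80_000},
]
flags = RuleModel(suspicious).fit(people).predict(people)
print(flags)
```

Because the rule needs no training data, a model like this can serve as a benchmark before any learned model exists.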
Starting point is 00:39:17 Were you going to say something? Oh, one of the things, I don't know if we've covered this already, but we've talked about calmcode.io a lot on this podcast, and you're the person behind it, right? Yeah, I am. Yeah, so it's been a fun little side project that I've been doing for a year now. Yeah. So nice videos. I like how short they are. So thanks. No, so I like to hear people tell me that, and that's also the thing that I was
Starting point is 00:39:41 kind of going for like i love the you know those when you watch a video, it's like a lightning talk and you learn something in five minutes. Yeah. Oh, that's an amazing feeling. Like that's the thing I'm trying to capture there a little bit. Like if, if it takes more than five minutes to get a point across, then I should go on to a different topic, but I'm happy to hear you like it. Cool.
Starting point is 00:39:59 Yeah. Very cool. How about you, Michael? Anything extras? Well, I had two. Now I have three because I was reading the source code of one of Vincent's projects there as we were talking. And I learned about Fuzzy Wuzzy. So Fuzzy Wuzzy was being used in that emergency disaster recovery awareness thing.
Starting point is 00:40:21 And it's fuzzy string matching in Python. And it says fuzzy string matching like a boss, which you got to love. So it was like slight misspellings and plural versus not plural and whatnot. And Brian even uses hypothesis, which is kind of interesting. Yeah. And PyTest. Yeah. And PyTest, of course. Yeah. Anyway, that's pretty cool. I just discovered that. So Fuzzy Wuzzy is a pretty cool tool. The only thing I don't like about it, and this is the one thing I do have to mention, it is my understanding that Fuzzy Wuzzy is a slur
Starting point is 00:40:50 in certain regions of the world. So in terms of naming a package, they could have done better there, but I think they only realized that in hindsight. Other than that, there's some cool stuff in there. Definitely. Just when I learned about this, I did make the comment to myself,
Starting point is 00:41:02 like, okay, I should always acknowledge it whenever I talk about the package. But yeah, it's definitely useful stuff in there. Quasi-string matching is a useful problem to have a tool for. Yeah, very cool. And PyCon, way out in the future, 2024, 2025 announcement is out.
Starting point is 00:41:18 So the next few PyCons are already theoretically in Salt Lake City. So hopefully we actually go to Salt Lake City and not just go and we'll virtually imagine it was there, right? Like this year. But last two years, because of the pandemic, Pittsburgh lost its opportunity to have PyCons. So not just once, but twice. So they are rescheduling the next one back into Pittsburgh.
Starting point is 00:41:41 So folks there will be able to go and be part of PyCon. That's pretty cool. Because of Corona, they've now been able to plan four years ahead of the way. Exactly. Everything's upside down now. And then also, I just want to give a quick shout out to an episode that I think is coming out this week on TalkPython. I'm pretty sure that's the schedule called CodeCarbon.io.
Starting point is 00:42:02 And it is, let me pull it up here. It is both a dashboard that lets you look at the carbon generation, the CO2 footprint of your machine learning models, specifically around the training of the models. So what you do is you pip install this emission tracker, and then you just say, start tracking, train, stop tracking. It uses your location, your data center, the local energy grid, the sources of energy from all that. And it'll say like, oh, if you actually switch to, say, the Oregon AWS data center from Virginia, you'd be using more
Starting point is 00:42:38 hydroelectric rather than, I don't know, gas or whatever. We were talking about some of the ethics and cool things that we should be paying attention to. And I feel like the sort of energy impact of model training might be worth looking at as well. So I totally agree with model training. I've been wondering about this other thing, though, and that's testing on GitHub. Like, if you think about some of these CI pipelines, they can be big, too.
Starting point is 00:43:02 Like, I've heard projects that take like an hour on every commit. I'd be curious to run this on that stuff as well. Yeah. Well, you could employ this as part of your CI/CD. It doesn't really have to do
Starting point is 00:43:15 with model training per se, but it does things like when you train models that use a GPU, it'll actually ask the GPU for the electrical current. Ah, right. So it goes down into the hardware.
Starting point is 00:43:27 That's a fancy feature. And it goes down to the CPU level voltage and all sorts of low... It's not just, well, it ran for this long, so it's this. It's really detailed. That said, I suspect you could actually answer the same question on a CI.
Starting point is 00:43:44 It would just say, well, it looks like you're training on a CPU. Yeah, yeah, true. No, but so it's a nice way to be conscious about compute times and stuff. So that's, yeah. And what's cool is it has the dashboard that actually lets you explore like, well, if I were to shift it to Europe rather than train in the U.S., which, who really cares where it trains, but then what difference would that have? Look at how green Paraguay is. We are hosting. Yeah, that's incredible. I suspect a lot of waterfalls. Yeah, countries down there have insane amounts of hydro, like Chile maybe, I can't remember exactly, but yeah, a lot of hydro. And you see Iceland as well, and it's probably because of the volcanoes and warmth and heat. Yeah, yeah, the geo. Yeah. Okay, interesting. All right, nice. Brian, you got anything? No, not this week. How about we do
Starting point is 00:44:29 a joke? Sounds good. So it's been a while since I've been to a strongest man competition, World's Strongest Man, you know, like maybe one of those things where you pick up like a telephone pole and you have to throw it as far as you can, or you lift like the heaviest barbells, or like you carry huge rocks some distance. So here's one of those things. There's like three judges, a bunch of people who look way over pumped. They're all flexing, getting ready. The first one is this person carrying a huge rock, sweating clearly.
Starting point is 00:45:00 And the judges are, they're not super impressed. They give a five, a two and a six. Then there's another one lifting this, you know, 500 pound barbell over his head, gets eight, seven and six as their score. And then there's this particularly not overly strong looking person here that says, I don't use Google when coding.
Starting point is 00:45:17 Wow, so strong. The judges give him straight tens. And he's also being like really sincere, like his hand over his heart. Oh yeah. Like it's very humble. Yeah, exactly. All right, well, that's what I got for you. Take it for what you will. That's pretty good. Just Stack Overflow. Yeah. Yeah. Well, I feel like Stack Overflow would be, we take it to 11, honestly. I don't use Stack Overflow. Now you have a winner, definitely. That's funny. Well, thanks for that. You're usually pretty good about finding our jokes. I appreciate it.
Starting point is 00:45:54 And thanks for coming on the show. Thanks for having me. It's fun. I think that's a wrap. Yeah, that is. Thanks, Brian. Thank you. Bye, Vincent. Y'all have a good one.
