Python Bytes - #235 Flask 2.0 Articles and Reactions
Episode Date: May 26, 2021
Topics covered in this episode:
Flask 2.0 articles and reactions
Python 3.11 will be 2x faster?
3 Tools to Track and Visualize the Execution of your Python Code
DuckDB + Pandas
Extras
Joke
See the full show notes for this episode on the website at pythonbytes.fm/235
Transcript
Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
This is episode 235, recorded May 26, 2021.
And I'm Brian Okken.
I'm Michael Kennedy.
And I'm Vincent Warmerdam.
We talked about Vincent a while ago and got his name wrong.
And he told us a good story about how we accidentally pronounced his name, what, Wanderman.
Yes.
So sorry about that.
That's fine.
It's fine.
I was bragging to my wife that I was on the podcast and then I was announced as Vincent
Wanderman and she's still kind of philosophical about the whole thing.
But it was a fun introduction.
It's the best mispronunciation of my life.
Let me put it that way.
It's your alter ego.
It's like your spy name.
I'll take it.
Well, thanks for joining us today.
My pleasure.
Should we jump into the first topic?
Sure.
Okay. Well, I think we mentioned last time that Flask 2.0 was out. And then, Michael, you talked with somebody, didn't you?
I did.
I had David Lord and also Philip Jones on Talk Python to basically announce Flask 2.0 and talk about all the features.
Yeah.
And that was a great episode.
I listened to both of those.
I listened to that.
It was great.
What I wanted to cover was a couple of articles, or rather an article and a video.
First off, we've got a link to the change list. I actually lost the change list. Yeah, there it is.
So you can read through that, and maybe that's exciting to you, but I like a couple of other ways.
So there's an article by Patrick Kennedy, Async in Flask 2.0.
And I really like this article.
It goes through kind of describing what it means to have Async in Flask
and how it works with some nice little diagrams.
Diagrams are always nice.
Yeah.
Oh, yeah.
Pictures.
Yes.
And then a description of the ASGI and why we don't need it yet.
And I'm not sure what the timeline is for Flask, if they're going to do it more.
But there is a discussion that it's not completely async yet.
There was a lot of discussion with David and Philip that they may be leaving Quart to take the place of a full-on ASGI Flask.
Okay.
And the idea being that there's a lot of stuff that kind of has to change, especially around the extensions, and you get nearly that, but not exactly that, by using the gevent async stuff that's in regular Flask.
And that integrates in if you just do an async def method in your regular Flask.
But if you want true asyncio integration, then they basically were saying that, for the foreseeable future, instead of import Flask and going with that, just import Quart, and wherever you see Flask, replace it with the word Quart.
Okay.
But there's other cool stuff other than the async
that's coming into Flask 2.0.
So I appreciate it.
There's also a video from, we don't want it to play,
from Miguel Grinberg,
and talking about some of the new stuff in Flask.
And I really like this.
One of the things that he covers right away
is the new route decorators.
Yeah, those are nice.
It might be just a syntax thing,
but it's really nice.
So you used to have to say app.route and then methods equals POST, or list the methods.
And now you can just say app.post.
That's nice.
And then a really clean discussion of the WebSocket support with Flask.
And then he goes in to talk about the async.
And with that also does a little demo timing it.
And I was actually surprised at how easy it was to set up this timing demo.
He showed that you could increase the number of users and it doesn't really increase your response time, and the number of requests per second doesn't increase either, because of the way that Flask 2.0 is done. But it was nice.
And then he also talked about some of the extensions that he wrote that work with Flask 2.0 and stuff. So it was definitely worth the listen.
Well, that's always cool. Flask is a pretty big project, so when there's a new upgrade of it, one of the things that people sometimes forget is, oh, all the plugins, do they still work?
So it's nice if someone does a little bit of the homework there and says, here's a list of stuff that I've checked and that's at least compatible.
Well, he's mostly doing some... So, for instance, some of the WebSocket stuff has changed, and some of the other things have changed.
And he has some different shims. He was recommending some things before, but now you don't have to swap out some things.
So, for instance, some of the extensions that were allowing for WebSockets required you to swap out the server for a different server, and you don't have to anymore.
Ah, like that, right.
Okay, cool.
Yeah, a couple of big other things that come to mind.
One, they've dropped Python 2 support, and even 3.5 and below.
I mean, we're at this point where 3.5 is like old school legacy, which surprises me.
It still feels new.
Yeah, I remember when it came out.
Yeah, yeah.
Well, that was when async and await arrived, right?
So that was a big deal there.
But it doesn't have f-strings, so it's...
Yeah, that's the killer feature.
Yeah, yeah. So there's that, and they also said that you are not going to need to change your deployment infrastructure if you want to run async Flask. You can just push a new version and it's good to go.
So yeah, a lot of neat things there.
Very good. Nice. What do we got next, Michael?
Well, what if Python were faster? That would be nice.
That's always good.
We actually talked about Cinder. Remember Cinder?
Yeah.
From the Facebook world. So that's one really interesting thing that is happening around Python, and there's a lot of cool stuff here. But remember, this is not supported. It's not meant to be a new runtime.
It's just there to give ideas and motivation and examples and basically to run Instagram.
On the other hand, Mike Driscoll tweeted out, hey, Python might get a 2x speedup in the next version of Python.
And you might want to check out Guido's slides from the Python Language Summit at the Virtual PyCon.
That's exciting, right?
Yes. I mean, if Guido is saying it, then, you know, odds of it happening increase, right?
Exactly. Exactly. So a while ago, we actually covered what has now become known as the Shannon
plan for making Python faster a little bit each time over five years, over the next four, at least, I guess,
four years at that point, and how to make that happen. So some of these ideas come from there.
And so here I'm pulling up the slides. And it says, can we make CPython faster? If so, by how
much? Could it be a factor of two? Could it be a factor of 10? And do we break people if we do
things like this? So the Shannon plan, which was posted last October and we covered, talks about how do we make it 1.5 times faster each year, but do that four times.
And because of compounding performance, I guess, yeah, it's five times faster.
All right.
So there's that.
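The compounding arithmetic mentioned above is easy to check. This small sketch uses only the figures quoted in the episode (1.5x per release, four releases); the variable names are illustrative.

```python
# Compounding a 1.5x speedup over four releases (the figures quoted above).
per_release_speedup = 1.5
releases = 4

total_speedup = per_release_speedup ** releases
print(f"{total_speedup:.2f}x")  # 1.5 ** 4 = 5.0625, i.e. roughly 5x
```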
Guido said, thank you to the pandemic.
Thank you to boredom.
I decided to apply at Microsoft.
And shocker, they hired him.
So as part of that, it's kind of just like, Hey, we think you're awesome. Why don't you just pick
something to work on that will contribute back? That'd be really cool. So his project at Microsoft
is around making Python faster, which I think is great. Cool. So yeah. So there's a team of folks,
Mark Shannon, Eric Snow, and Guido,
and possibly others who are working with the core devs at Microsoft to make it faster.
It's really cool. Everything will be done on the public GitHub repo. There's not like a secret
branch that will be then dropped on it. So it's all just going to be PRs to github.com
slash the Python slash CPython, whatever the URL is,
the public spot. And one of the main things they want to do is not break compatibility. So that's
important. Also said, what things could we change? Well, you can't change the base object, like
Py, what is it? Py OBJ, basically the base class, right? PyObject pointer. That's it, the PyObject class.
So that thing has to stay the same and it really needs to keep reference counting semantics because
so much is built on them. But they could change the bytecode that exists, the stack frame layout,
the compiler, the interpreter, maybe make it a JIT compiler to JIT compile the bytecode,
all of those types of things. So that's pretty cool.
And they said, how are we going to reach a two times speedup in 3.11? An adaptive specialized bytecode interpreter that will be more performant around certain operations, an optimized frame stack, faster calls, zero overhead exception handling, and things like integer internals. So maybe treating numbers differently, changing how .pyc files work.
So there's a lot of stuff going on.
Also putting the __dict__ for a class always at a certain known location.
Because anytime you access a field, you have to go to the __dict__, get the value out, and then read it.
And I suspect the first thing that happens is, well, go find the __dict__ pointer and then go get the element out of it.
So if every access could just go, nope, it's always, you know, at one certain byte offset in memory from where the class starts, that would save, you know, that sort of traversal there.
So some pretty neat things.
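To make the `__dict__` (the "dunder dict") point a little more concrete, here is a small sketch of what is visible from Python code. It only shows that plain instance attributes are stored in a per-object dictionary; the C-level pointer offset discussed above is not visible at this level. The `Point` class is made up for illustration.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)

# Instance attributes live in a per-object dictionary...
print(p.__dict__)

# ...so for a plain attribute like this one, `p.x` ends up reading that dict.
assert p.x == p.__dict__["x"]

# Writing through the dict is visible through normal attribute access.
p.__dict__["x"] = 10
print(p.x)
```

Every attribute read goes through some version of this lookup, which is why shaving one pointer hop off the path to the dictionary could matter.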
Yeah, I'm glad you explained that because I read it before and I'm like, why would that help at all?
I think you can traverse one fewer pointer, yeah. And in general it doesn't matter, but for literally everything you ever touch, if you could cut in half the number of pointers you've got to follow, that'd be good.
Yeah, this is one of those things that always struck me. When you're using Python, you don't think about these sorts of things. It's when you're doing something in Rust or something that you are confronted with the fact that you really have to keep track of where the pointer is pointing in memory and all that.
You take a lot of this stuff for granted, so it's great that people are still going at it and looking for things to improve there.
Yeah, absolutely. You know, in C, you do
the arrow, you know, dash
greater than sort of thing every pointer, so you're
like, I'm following a pointer, I'm following a pointer. You know it,
right? Here, you just
write nice, clean code and magic happens. So let me round
this out with who will benefit. So who will benefit? If you're running CPU intensive
pure Python code, that will get faster because the Python execution
should be faster. Websites should be faster because a lot
of that code is running in the Python space. And who will happen to use Python?
Who will not benefit so much?
NumPy, TensorFlow,
Pandas, all the code that's written in C,
things that are IO bound. So if you're
waiting on something else, speeding up
the part that goes to wait,
doesn't really matter. Multithreaded code because of
the GIL at this point. But
Eric Snow is also working on the subinterpreters, which may fix that, and so on.
I like the last bullet.
Pretty neat stuff.
There's some PEPs out there.
I link to the tweet by Mike Driscoll, but that'll take you straight to the GitHub repo, which has the PDF of the slides.
And people can check that out if they're interested.
I like the last bullet on the previous slide, of the things that will not benefit: code that's algorithmically inefficient.
Otherwise, if your code already sucks,
it's not going to be better.
It may be better, but it
could be better. I was about to say,
theoretically, it actually would go faster.
Just not
as much better as it could, right?
Yeah, it would still be like n to the power
of 3 or something like that, but it would be
a faster n to the power of 3.
Yeah, yeah.
It won't change the big O notation, but it might make it run quicker on wall time.
That's right.
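The distinction between growth rate and raw speed can be illustrated with a deliberately naive cubic loop. This is only a sketch: the function is made up, and the "2x faster interpreter" is modeled simply by halving the cost of each step.

```python
def cubic_steps(n):
    """Count the inner-loop iterations of a deliberately naive O(n^3) loop."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            for _ in range(n):
                steps += 1
    return steps


small = cubic_steps(10)
large = cubic_steps(20)

# Doubling the input multiplies the work by 8, whatever the interpreter speed.
print(large / small)

# A hypothetical 2x faster interpreter halves the cost of every step,
# but the ratio between the two input sizes stays the same.
print((large / 2) / (small / 2))
```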
Yeah.
Yeah.
And Christopher Tyler out there in the live stream says, I know I still need to improve my code, but this would be great, right?
I mean, it used to be that we could just wait six months.
A new CPU would come out that's like twice as fast as what we ran on before.
Like, oh, now it's fast enough.
We're good.
That doesn't happen as much these days.
So it's cool that the runtimes
are getting faster.
Yeah, and I mean, let's be honest,
Python is also still used
for like just lots of script tasks.
Like, hey, I just need this thing
on the command line that does the thing
and I put that in cron.
And like a lot of that will be nice
if that just gets a little bit faster.
And it sounds like this will just
be right up that alley.
Yeah.
And one of the things that I know has been holding certain types of changes back has
been concern about slowing down the startup time.
Because if all you want to do is run Python to make a very small thing happen, but like
there's a big JIT overhead and all sorts of stuff, and it takes two seconds to start and
a nanosecond, microsecond to run, right? They
don't want to put in those kinds of limitations and kill that use case either. So yeah, it's good to
point that out.
All right, Vincent, you're up next.
Cool. Yeah. So I dabble a little bit in fairness algorithms. It's a big, important thing. So I get a lot of questions from people like, hey, if I want to do machine learning and fairness, where should I start? And I don't think you should start with algorithms. Instead, what you should do is go check out this Python project called Deon.
And the project's really minimal. The main thing that it does is give you a checklist of stuff to check before you do a big data science project at a big company or an enterprise or something like that.
And they're sensible things, and they're grouped together. So like, hey, can I check off that I have informed consent, and the collection bias, can I check all these things off? The main things are literally checkboxes. You can check them off in the page to get the feel, like, oh yeah, these are good.
It goes further, though. The thing is, this is an actual Python project. You can generate this as YAML for your GitHub project, so you actually have this checklist that has to be checked into Git, so you know that people signed off on it. You can actually see the checklist. You can even, maybe, in your git log see who checked it off.
But what's really cool is two things. One, you can generate this checklist. Two, you can also customize the checklist. So if you are at a specific company with certain legal requirements, this tool actually makes it easy to customize this very specific checklist for data projects.
Again, all of these comments are good. Like, is the data security well done? Is the analysis reproducible? How do we do deployment? All of these things that usually go wrong and were obvious in hindsight.
But the real killer feature, if you ask me: usually you have to convince people to take this seriously. So what the website offers is an example list. For every single item that is on this checklist, they have one or two examples. Typically, these are newspaper articles of places where this has actually gone wrong in the past. So if you need a really good argument for your boss, like, hey, we've got to take this seriously, there's a newspaper article you can just send along as well.
Oh, that's interesting.
Yeah, I like it.
And the fact you can also generate
Jupyter notebooks with this, you can customize
it a little bit. The people that made this,
the company I think is called DrivenData.
They host Kaggle competitions for good
causes. That's sort of a thing that they do there.
But Deon is just a really cool project.
I think if more people would just start with a sensible checklist and work from there,
a lot of projects would immediately be better for it.
Yeah, this is really cool. So things are, can you go to the very bottom of that page
that you're on? Yeah. Sorry, just the checklist.
Oh, right. Yeah. Yeah.
So there's some examples like, make sure that you've accounted for unintended use.
Have you taken steps to identify and prevent unintended uses and abuse?
So like you created a find my friends in pictures.
So like I want to find pictures my friends have taken of me.
You could put it up and it would show you all the pictures your friends took.
But maybe someone else is going to use that to, I don't know, try to phish you.
Like, here's the picture of us together. Or, I don't know, some weird thing.
Use it for facial recognition and tracking
when it had no such intent, right?
Things like that.
I think, and I might be wrong, it doesn't have this example.
The best example of unintended use:
there used to be this geo lookup company
where you could give an IP address
and it would give you an actual address.
However, sometimes you don't know where the IP address actually is, so you just give the center point of, like, a U.S. state or the country.
So there used to be this house in the middle of Kansas, I think it was the center point. And the thing is, they would get FBI trucks driving by and doing raids and stuff, because they thought there were criminals there, because the geo lookup service would always say, ah, the crooks at that IP address, that's this latitude-longitude place.
Right, right. We had a cyber attack, it was from this IP address, raid them, boys. And of course it was just some poor farmer in the Midwest going, you know...
Yeah, no, it's just the geographic center. Please stop raiding my farm.
Yeah, but the story was actually quite serious. I think the person who lived there got death threats at some point as well because of the same mistake. So this is stuff to take seriously.
The one thing that I did like is the solution. I think now, instead of it pointing to the house in Kansas, I think it points to the center of the three big lakes in Michigan. I think it's just the middle of a puddle of water, basically, just to make it obvious to the FBI squad, like, no, it's not a person living there.
Yeah, but like, darn, these submarines, they've moved underwater, or whatever.
But I mean, that's why you want to have a checklist like this. The thing with unintended use is it's unintended, so you cannot really imagine it, but you at least should do the exercise. And that's what this list does in a very sensible way, and more people should just do it.
And there are interesting examples too, you just have a look. And there's a little community around it as well, collecting these examples, and they have a wiki page with examples that didn't make the front page cut. So I definitely recommend anyone interested in fairness start here.
I was curious. You brushed by it fairly quickly: fairness analysis. Is that what you do? I just don't know what that means, so could you...
Yeah, so, oh man, this topic deserves more time than I'll give it, but the idea is that we know that models aren't always fair, right? It can be that you have models that, for example, Amazon was a nice example.
So they had like a resume parsing algorithm that basically favored men because they hired more men historically.
So the algorithm would prefer men.
Oh, okay.
That kind of fairness.
Okay.
Historical, these have been our good employees.
Let's find more like them.
Exactly. And the thing is, you don't want an algorithm that's unfair. So there are these
machine learning techniques and there's this community of researchers that try to look for
ways like, can we improve the fairness of these systems? So we don't just optimize for accuracy.
You also say, well, we want to make sure that subgroups are treated fairly and equally and
stuff like that. So I dabble a little bit in this. There's this project I like to collaborate with.
I open source a couple of things with these people.
It's called Fairlearn.
The main thing that I really like about the package is that it starts by saying fairness
of AI systems is more than just running a few lines of code.
It starts by acknowledging that.
But they have mitigation techniques and algorithms and tools to help you measure the unfairness.
It's scikit-learn compatible as well, and stuff like that. But having said all that, start here. Start with a checklist. Don't worry about the machine learning stuff just yet. Start here.
But yeah, pretty cool.
Before we move on, Connor, out there in the live chat, says, I'm glad the conversation of ethics and data science is enlarging. I think it's important about what we make.
Yeah, I agree.
Totally.
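As a rough illustration of what "measuring unfairness" can mean in the discussion above, the sketch below compares how often a model says yes to each subgroup. This is plain standard-library Python, not Fairlearn's API, and the decisions, group labels, and function name are all made up.

```python
from collections import defaultdict

# Hypothetical model decisions as (group, selected) pairs. Made-up data.
decisions = [
    ("a", 1), ("a", 1), ("a", 0), ("a", 1),
    ("b", 1), ("b", 0), ("b", 0), ("b", 0),
]


def selection_rates(rows):
    """Return the fraction of positive decisions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, selected in rows:
        totals[group] += 1
        positives[group] += selected
    return {group: positives[group] / totals[group] for group in totals}


rates = selection_rates(decisions)
print(rates)

# The gap between the most and least favored group is one simple,
# and far from complete, fairness signal.
gap = max(rates.values()) - min(rates.values())
print(gap)
```

A number like this is only a starting point, which fits the advice in the episode to begin with a checklist before reaching for algorithms.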
Now, before we do move on though,
let me tell you all about our sponsor for this episode, Sentry.
So this episode is brought to you by Sentry.
Thank you, Sentry.
How would you like to remove
a little bit of stress from your life?
Do you worry that users may be having difficulties
or encountering errors with your app right now?
And would you even know it
until they sent you that support email?
How much better would it be to have the errors
and performance details immediately sent to you,
including the call stack and values of local variables
and the active user recorded right in the report?
With Sentry, it's not only possible, it's simple.
We actually use Sentry on our websites.
It's on pythonbytes.fm.
It's on Talk Python Training, all those things.
And we've actually fixed a bug triggered by a user
and had the upgrade ready to roll out
as we got the support email.
They said, hey, I'm having a problem with the site.
I can't do this or that.
I said, actually, I already saw the error.
I just pushed the fix to production.
So just try it again.
Imagine their surprise.
So surprise and delight your users. Get your Sentry account
at pythonbytes.fm/sentry.
And when you sign up, there's a "Got a promo code? Redeem it" link. Make sure you put
Python Bytes in that section or you
won't get two months of free Sentry team
plans and other features, and they won't
know it came from us. So use the promo code
at pythonbytes.fm/sentry.
Yeah. Thanks. Thanks for supporting the show.
Brian. Yeah. I like this one that you
picked here. You like this? I like it
a lot. It's very good. It has pictures, little animated things, and great looking tools.
Yeah.
So there was an article that was sent to us.
I can't remember who sent it.
So apologies.
But it's an article called 3 Tools to Track and Visualize the Execution of your Python Code.
And, I don't know why, executing your code just seems funny to me.
I know it just means run it, but, you know,
chop its head off or something.
Anyway, the three tools he covers are, first,
Loguru. We don't cover this very much because I don't know how to pronounce it.
L-O-G-U-R-U.
It's "low-guru" or "log-guru." Not sure.
And so Loguru is a pretty printer
with better exceptions.
So let's go and look at that.
So it does exceptions like this.
So it breaks out your exceptions into colors.
And it's just kind of a really great way to visualize it.
And I would totally use this for if I was teaching,
like if I was teaching a class or something, this might be a good way to teach people how to look at trace logs and error logs.
This is fantastic.
And if you're out there listening, not seeing it, you should definitely pull up this site because the pictures really are what you need to tell quickly.
Yeah.
That's one of the things I like about this article is that lots of great pictures.
One thing, out of curiosity. What I'm seeing here is that, for example, it says return
number one divided by number two, and then you
actually see the numbers that were in those variables.
Do you have to add a decorator
or something to get this output, or
how does that work?
That's explained later, maybe. I don't remember where.
I think you just pull it in and it just does it, but I'm not sure.
Okay. Interesting.
Anyway.
So that's Loguru.
Then there's Snoop, which is kind of fun, that has...
Hold on to Snoop.
Should have had this already.
Anyway, with Snoop, you can see it prints the lines of code being executed in a function. So it just runs your code and then prints out each line in real time as it's going through it.
You would hardly ever want this, I think, but when you do want it, I think it might be kind of cool to watch it go along.
You could also do this in a debugger. But if you didn't want to use a debugger, you can do this on the command line.
One of the things that most debuggers have that is a little challenging is you'll see the state,
and you'll see the state change, and you'll see it change again.
But in your mind, you've got to remember, okay, that was a 7, and then it was a 5, and then it was a 3.
Oh, right, yeah.
Right?
And here, it'll actually reproduce each line, each block of code with the values over.
If you're in a loop three times, it'll show going through the loop three times with all the values set.
And that's pretty neat.
Yeah.
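To give a feel for the kind of line-by-line trace described above, here is a rough standard-library sketch of the underlying idea: record each executed line of one function together with its local variables at that moment. This is not Snoop's implementation or API. The helper name `trace_lines` and the `countdown` function are made up for illustration.

```python
import sys


def trace_lines(func):
    """Call func() and record (line offset, local variables) for each line run."""
    events = []
    first_line = func.__code__.co_firstlineno

    def tracer(frame, event, arg):
        # Only follow frames that belong to the function we care about.
        if frame.f_code is not func.__code__:
            return None
        if event == "line":
            # Copy the locals so later changes do not rewrite history.
            events.append((frame.f_lineno - first_line, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func()
    finally:
        sys.settrace(previous)
    return result, events


def countdown():
    n = 3
    while n > 0:
        n -= 1
    return n


result, events = trace_lines(countdown)
for offset, local_vars in events:
    print(f"line +{offset}: {local_vars}")
```

Each pass through the loop shows up as its own entry with the value of `n` at that point, which is the "7, then 5, then 3" history a debugger makes you keep in your head.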
I would also argue, just for teaching recursion, I think this visualization is kind of nice. Because you actually see the indentation and the depth appear.
So you can actually see this function is called inside of this other function, and there's a timestamp. So I would also argue this one's pretty good for teaching.
I like it. In fact, Connor on the live stream says, I'm teaching my first Python course tomorrow, so thanks for the timely article.
And a real-time follow-up for Loguru: you have to import logger,
and then you've got to put a decorator on the function,
and then it'll capture that super detailed output.
And that's probably exactly what you want
because you don't really want to do that for everything, probably.
So it'll be something you're working on that you want to trace.
So Heartrate is the last tool that we want to talk about.
It's a way to visualize the execution of a Python program in real time.
So this is something we have not covered before, but it's, I thought there was a little video.
Yeah.
It kind of goes through and does a little, like a heat map sort of thing on the side of your code.
So when it's running, you can kind of see that different things
get hit more than others. So that's almost like a profiler, sort of. Not speed, though.
It's just number of hits.
Yeah. Okay. Yeah, I'm kind of on the fence about this, but it's pretty.
Yeah, same. But the Loguru one, amazing. I thought Loguru was also a general logging tool.
Like it does more, I think, than just things for debugging.
Yeah, I think it's a general logging tool as well.
Okay.
Okay.
But I guess it logs errors really well.
The logger.catch decorator.
Okay.
Could probably do other things with the Logger then as well.
But having a good logging debugger catcher is always welcome.
Yeah, absolutely.
All right.
Let's talk about ducks.
I mean, Brian, you and I are in Oregon.
Go ducks.
Well, I know your daughter goes there.
My daughter goes to OSU.
So go Beavs, I guess.
Whatever, ducks.
We're going to talk duck databases anyway.
And data science.
So Alex Monahan sent over to us saying,
hey, you should check out this article about DuckDB,
which is a thing I'm now learning about.
And its integration.
It's direct integration with pandas.
So instead of taking data from a database,
loading it into a pandas data frame,
doing stuff on it,
and then getting the answer out,
you basically put it
into this embedded database, DuckDB, which is SQLite-like. Sorry, you put it into a
pandas data frame, but then the query engine of DuckDB can query it directly, without any data
exchange, without transferring it back and forth between the two systems or formats. That's pretty
cool, right? So let me pull this up.
Oh, that's Hannes. I know him.
Nice. Yeah, he's from Amsterdam. Yeah, very cool. So here's the idea. We've got SQL on pandas,
basically. If we had a data frame here, they have a really simple data frame, just a,
you know, a single array, but it could be a very complex data frame. And then what you can do is
you can import DuckDB and you can say duckdb.query,
and then you write something like,
so one of the columns is called A in the data frame
and you could say select sum of A from the data frame.
How cool is that?
I don't know.
Is it cool?
It's very cool.
So then you can also,
there's also a to data frame on the result.
So what happens here is this is parsed by DuckDB, which has an advanced query optimizer for things like joins and filtering and indexes and all that kind of stuff.
And then it says, oh, okay. So you said there's a thing called myDF, which I'll just go look for in the locals of my current call stack and see if I can find that.
Oh yeah, that is neat. So you can write arbitrary SQL, and this one looks pretty straightforward.
Yeah, yeah.
Okay. Interesting.
But you can come down here and do more interesting things. Let's see, I'll pull up some examples. So they do a select
aggregation group by thing so select these two things and then also do a sum min max and average on some part of
the data frame.
And then you pull it out of the data frame and you group by two of the elements.
Right.
And they show also what that would look like if you did that in true pandas format.
That's cool.
And they say, well, it's about two to three times faster in the DuckDB version.
That is interesting.
That's interesting, right? But then they say, well, what if we wanted not to just group by,
but we wanted a filter? Seems real simple, like where the ship date is less than 1998.
No big deal. But because of the way that this can be really efficiently figured out by the query optimizer, it turns out to be much faster. So
0.6 seconds single-threaded, and it actually supports parallel execution as well. So
multi-threaded, they tested on a system that only had two cores, but it can be many, many cores.
So it's faster: 0.4 seconds when threaded versus 2.2 seconds, sorry, 3.5 seconds on regular pandas.
But there's this more complicated,
non-obvious thing you can do
called a manual pushdown in pandas,
which will help drive some of the efficiency
before other work happens.
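One plausible reading of "manual pushdown" in pandas is narrowing the data frame to the columns the query needs before filtering and aggregating, so that the later steps touch less data. The sketch below illustrates that reading only. It is an assumption about what the article does, and the table, column names, and cutoff date are made up.

```python
import pandas as pd

# Made-up stand-in for a much wider table; column names are illustrative only.
orders = pd.DataFrame({
    "ship_date": pd.to_datetime(["1997-05-01", "1998-03-15", "1997-11-30"]),
    "status": ["A", "B", "A"],
    "price": [10.0, 20.0, 30.0],
    "comment": ["x", "y", "z"],  # a column the query never needs
})

# Straightforward version: filter the full-width frame, then aggregate.
plain = orders[orders["ship_date"] < "1998-01-01"].groupby("status")["price"].sum()

# Manual pushdown: keep only the needed columns first, so the filter
# copies less data, then aggregate as before.
narrow = orders[["ship_date", "status", "price"]]
pushed = narrow[narrow["ship_date"] < "1998-01-01"].groupby("status")["price"].sum()

# Both versions produce the same answer; only the amount of work differs.
assert plain.equals(pushed)
print(pushed)
```

A query optimizer makes this kind of reordering automatically, which is the advantage being described for the DuckDB version.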
And then they finally show one at the very end
where there's more stuff going on
that query optimizer does.
So the threaded one's 0.5 seconds,
regular pandas is 15 seconds.
So all that's cool. And what's really neat is it all just happens like on the data frame.
Yeah, there's two things about that that are pretty interesting. Like one is we should
underestimate how many people are still new to pandas, but do understand SQL. So just for that
use case, I can imagine, you know, you're going to get a lot of people on board. But the fact
there's a query optimizer in there that's able to work on top of pandas that's also pretty neat because i'm assuming it's doing clever
things like oh i need to filter data i should do that as early on as possible and my query plan
is doing some of that logic internally. And the fact that you can parallelize it, because
pandas doesn't parallelize easily, is also... Yeah, I don't know that it parallelizes at all. You've got to
go to something like Dask. Yeah, I mean, so there are some tricks that you could do,
but they're tricks.
They're not really natively supported.
Right, right.
But just having a SQL interface is neat.
Yeah, yeah, this is pretty neat.
And also, now I learned about DuckDB,
so apparently that's a thing, which is pretty awesome.
So it's in process, just like SQLite.
It's written in C++11 with no
dependencies supposed to be super fast so this is also a cool thing that you know maybe i'll check
out unrelated to querying pandas but the fact that you can i think it's pretty cool it's got a great
name. Yeah, you know, another database out there I hear a lot about, but I've never used, so I don't really have
an opinion about it, is CockroachDB. I'm not a huge fan
of just on the the name although it has some interesting ideas i think it's like meant to
communicate resiliency and it can't be killed because it's like geolocated and it's just going
to survive. But yeah, ducks, I'll go with ducks. Yeah, I would agree. Yeah, and then, out in the
live stream chat, Christopher says, so DuckDB is querying on Pandas data frames
or can you load the data method chain
with DuckDB and reduce memory?
I believe you could do either.
Like you could load data into it
and then there's a to-data-frame option
that probably could come out of it.
But I think just very briefly.
This is right on it.
Yeah, go ahead.
Doesn't, I might've just seen it briefly
while you were scrolling in the blog post,
but I believe it also said that it supports the Parquet file format.
It does.
So the nice thing about Parquet is you can kind of index your data cleverly.
Like you can index it by date on the file system.
And then presumably, if you were to write the SQL query in DuckDB,
it would only read the files of the appropriate date
if you put a filter in there.
So I can imagine just because of that reason,
DuckDB on its own might be more memory performant than pandas, I guess.
Yeah, perhaps. That's very cool. Stuff like that you could do.
Yeah. And then Nick Harvey also says, I wonder if it's read-only, if you can insert or update.
I don't know for sure, but you can see in some of the places they are doing projections. So for
example, they're doing a select sum min max average,
like that's generating data that goes into it.
And then the result is a data frame.
So you can just add into the data frame afterwards
if you want to be more manual about it.
Yeah.
All right.
Vincent, you got the last one?
Yeah.
So the thing is,
I work for a company called Rasa.
We make software with Python
to make virtual assistants easier to make in Python.
And I was looking in our community showcase and I just found this project that just made me kind
of feel hopeful. So this is a personal project, I think. So we have a name here, Amit and I'm
hoping I'm pronouncing it correctly, Arvind. But what they did is they used Rasa kind of like a
Lego brick, but they made this assistant, if you will,
that you can send a text message to.
Now, what it does,
I'll zoom in a little bit for people on YouTube
that they might be able to see the GIF,
but every 10 minutes,
it scrapes the weather information,
the fire hazard information,
and I think evacuation information
from local government in California
meant to help people during wildfire season.
And they completely open-sourced this project as well.
So there's a linked GitHub project
where you can just see how they implemented it.
And it's a fairly simple implementation.
They use Rasa with a Twilio API.
They're doing some neat little clever things here
with like, if you misspelled your city,
they're using like a fuzzy string matching library
to make sure that even if you misspell your city,
they can still try to give you accurate information.
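To make the misspelled-city idea concrete, here is a small sketch of fuzzy matching a user's text against a list of known cities. It uses `difflib` from the Python standard library as a stand-in; the project discussed here uses a separate fuzzy string matching library, and the city list, function name, and cutoff below are assumptions for illustration only.

```python
from difflib import get_close_matches

# Hypothetical list of cities the assistant knows about.
KNOWN_CITIES = ["San Francisco", "San Jose", "Santa Rosa", "Sacramento", "Oakland"]


def resolve_city(user_text, cutoff=0.6):
    """Return the closest known city, or None if nothing is similar enough."""
    # Compare in lowercase so capitalization does not affect the score.
    lookup = {city.lower(): city for city in KNOWN_CITIES}
    matches = get_close_matches(
        user_text.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


print(resolve_city("San Fransisco"))  # → San Francisco
```

Returning `None` when nothing is close enough lets the assistant ask the user to try again rather than guess a city.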
But what they do is they just have this endpoint where you can send a text message to give me the update of San Francisco.
And then it will tell you all the weather information, air quality information, and that sort of thing.
And if you need to evacuate, it will also be able to tell you that.
And what I just loved about this, if you look at the way that they described it, this was just two people who knew
Python who were a little bit disappointed with the communication that was happening,
but because the APIs were open, they just built their own solution. And thousands of people
use this. And what's even greater is that if your mobile coverage isn't great,
watching a YouTube video or trying to get audio in can be tricky, but a text message is really
low bandwidth. So for a lot of people, this is like a great way to communicate. And of course,
I'm a little bit biased because I work for Rasa and I think it's awesome that they use Rasa to
build this. But again, the whole thing is just open sourced. You can go to their GitHub and you
can just, if I'm not mistaken, there's like the scraping job of the endpoints actually in here as
well. but this is
like exactly what you want just a couple of open apis and sort of citizen science building something
that's useful for the community it's great yeah i like it and text message is probably a really
good way to communicate for disasters right yes possibly in a place where you know lte is
crashed wi-fi is out right like if even if you're on edge you know text should still get
there exactly unless you're on iMessage then you're out of luck no i don't know sort of well
yeah i live in europe so i cannot comment on that of course but uh it's a little bit different here
but no but like the the data service you can just look in here and this is like again i like these
little projects that don't need anyone's permission to help people.
Like that stuff, like, ah, this is good stuff.
And the thing that I also really like about it is it's really just sending you a text
message with like air quality information and like enough information.
And that's good.
It's not like they're trying to make like a giant predictive model on top of this or
anything like that, but just really doing enough and enough is plenty.
Like that's the thing I really love about this little demo.
And of course using Rasa, which is great, but this is the kind of stuff that,
this is why i get up in the morning projects like this that's fantastic yeah i love it that's a
really good one brian is that it uh yeah that's it it's our six items um any extras that you want to talk about? I might have one. Okay.
I'm totally
tooting my own horn here, but this is a project
I made a little while ago.
But I think people might like it.
At some point, it kind of
struck me that people were making these machine learning algorithms
and they're trying to, on a two-dimensional
plane, trying to separate the green
dots from the red ones from the blue ones.
I just started wondering, well, why do you need an algorithm if you can just maybe
draw one?
So very typically, you got these like clusters of red points and clusters of blue points.
And I just started wondering, maybe all we need is like this little user interface element
that you can load from a Jupyter notebook.
And maybe once you've made a drawing, it'd be nice if we can just turn it into a scikit-learn
model. So there's this project called human learn that does exactly this. It's a tool of
little buttons and like widgets that I've made to just make it easier for you to like do your
domain knowledge thing and turn it into a model. So one of the things that it currently features
is like the ability to draw a model, which is great because domain experts can just sort of
put their knowledge in here.
It can do outlier detection as well
because if a point falls outside of one of your drawn circles,
that also means that it's probably an outlier.
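The "draw a model" idea can be sketched in plain Python: treat each hand-drawn shape as a polygon, label a point by the polygon it falls in, and call anything outside every polygon an outlier. This is only an illustration of the concept, not human-learn's actual API; the polygons, labels, and function names below are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's y value can be crossed
        # by a horizontal ray going to the right of the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical hand-drawn regions, one per class.
DRAWN_REGIONS = {
    "red": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "blue": [(6, 6), (10, 6), (10, 10), (6, 10)],
}


def predict(points):
    """Label each point by the first drawn region containing it."""
    labels = []
    for x, y in points:
        label = next(
            (name for name, poly in DRAWN_REGIONS.items()
             if point_in_polygon(x, y, poly)),
            "outlier",  # falls outside every drawn shape
        )
        labels.append(label)
    return labels


print(predict([(1, 1), (8, 8), (5, 5)]))  # → ['red', 'blue', 'outlier']
```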
But it also has a tool in there
that allows you to turn any Python function,
like any custom Python written function,
into a scikit-learn compatible tool as well.
So if you can just declare logic in a Python function,
that can also just be a machine learning model from now on. There's an extra fancy thing,
if people are interested. I just made a little blog post about that, where I'm using a very
advanced coloring technique using parallel coordinates. Very fancy technique. I won't
go into too much depth there, but what's really cool is that you can basically show that a drawn model can outperform the model that's on the Keras deep learning blog, which I just
thought was a very cool little feature as well um the project's called human learn it's just uh
components for inside of your jupyter notebook to make sort of domain knowledge and human learning
and all that good stuff better and also with the the fairness thing in mind, I really like the idea that people sort of
can do the exploratory data analysis bit
and at the same time also work
on their first machine learning model as a benchmark.
That's what human learn does.
So if people are sort of curious to play around with that,
please do.
It's open source, pip install.
Please use it.
I'm impressed.
This is cool.
This is really cool.
It is, right? Yeah. Matty, out in the live stream, asks, how does it handle ND data?
And I guess it's three or larger.
Yeah, so you can make, like, so if you have four columns, you can make two charts with two dimensions.
That's one way of dealing with it.
And there's, like, a little trick where you can combine all of your drawings into one thing.
If you go to the examples, though, the parallel coordinates chart that you see here, that has 30 columns, and it works just fine. I do think 30 is probably the limit. But the parallel coordinates chart,
I mean, you can make a subselection across multiple dimensions. That just works.
It's really hard to explain a parallel coordinates chart on a podcast, though.
I'm sorry. Yeah, so this is like a super interactive visualization thing with lots
of colors and stuff happening. Sorry, you have to go to the docs to fully experience that, I guess.
But again, also, let's say you work for a fraud office and someone asks you like, hey, without looking at any data, can you come up with rules that's probably fraud?
And you can kind of go, yeah, if you're 12 and you earn over a million dollars, that's probably weird.
Someone should just look at that.
And the thing is, you can just write down rules that way.
And that should already be,
can already be turned into
a machine learning model.
You don't always need data.
And that's the thing
I'm trying to cover here.
Like, just make it easier
for you to declare stuff like that.
It's a more human approach.
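As a sketch of the "write down a rule and treat it as a model" idea, the snippet below wraps a plain Python function in an object with scikit-learn-style `fit` and `predict` methods. It does not import scikit-learn or human-learn, and the class name, rule, and sample records are all hypothetical; the real library's interface may differ.

```python
class RuleClassifier:
    """Wrap a plain function in a fit/predict interface (duck-typed sketch)."""

    def __init__(self, rule):
        self.rule = rule

    def fit(self, X, y=None):
        # Nothing is learned from data: the hand-written rule is the model.
        return self

    def predict(self, X):
        # 1 means "flag for review", 0 means "looks fine".
        return [int(self.rule(row)) for row in X]


def looks_suspicious(row):
    """Domain rule from the discussion: very young and very high income."""
    return row["age"] <= 12 and row["income"] > 1_000_000


# Made-up records, purely for illustration.
records = [
    {"age": 12, "income": 2_000_000},
    {"age": 45, "income": 2_000_000},
    {"age": 10, "income": 500},
]

model = RuleClassifier(looks_suspicious).fit(records)
print(model.predict(records))  # → [1, 0, 0]
```

Because the rule needs no training data, a model like this can serve as a benchmark before any learned model exists.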
Nice.
Brian, I cut you off.
Were you going to say something?
Oh, one of the things,
I don't know if we've covered this already,
but we've talked about calmcode.io
a lot on this uh podcast
and you're the person behind it right yeah i am yeah so it's uh it's been a fun little side
project that i've been doing for a year now yeah yeah so nice videos i like how short they are so
thanks no so the the i like to hear like people tell me that and that's also the thing that i was
kind of going for like i love the you know those when you watch a video, it's like a lightning talk and you
learn something in five minutes.
Yeah.
Oh, that's an amazing feeling.
Like that's the thing I'm trying to capture there a little bit.
Like if, if it takes more than five minutes to get a point across, then I should go on
to a different topic, but I'm happy to hear you like it.
Cool.
Yeah.
Very cool.
How about you, Michael?
Anything extras?
Well, I had two.
Now I have three because I was reading the source code of one of Vincent's projects there as we were talking.
And I learned about Fuzzy Wuzzy.
So Fuzzy Wuzzy was being used in that emergency disaster recovery awareness thing.
And it's fuzzy string matching in Python.
And it says fuzzy string matching like a
boss, which you got to love. So it was like slight misspellings and plural versus not plural and
whatnot. And Brian, it even uses Hypothesis, which is kind of interesting. Yeah. And PyTest.
Yeah. And PyTest, of course. Yeah. Anyway, that's pretty cool. I just discovered that.
So Fuzzy Wuzzy is a pretty cool tool. The only thing I don't like about it,
and this is the one thing I do have to mention,
it is my understanding that Fuzzy Wuzzy is a slur
in certain regions of the world.
So in terms of naming a package,
they could have done better there,
but I think they only realized that in hindsight.
Other than that, there's some cool stuff in there.
Definitely.
Just when I learned about this,
I did make the comment to myself,
like, okay, I should always acknowledge it
whenever I talk about the package.
But yeah, it's definitely useful stuff in there.
Quasi-string matching is a useful problem
to have a tool for.
Yeah, very cool.
And PyCon, way out in the future,
2024, 2025 announcement is out.
So the next few PyCons are already
theoretically in Salt Lake City.
So hopefully we actually go to Salt Lake City and not just go and we'll virtually imagine
it was there, right?
Like this year.
But last two years, because of the pandemic, Pittsburgh lost its opportunity to have PyCons.
So not just once, but twice.
So they are rescheduling the next one back into Pittsburgh.
So folks there will be able to go and be part of PyCon.
That's pretty cool.
Because of Corona, they've now been able to plan four years ahead of the way.
Exactly.
Everything's upside down now.
And then also, I just want to give a quick shout out to an episode that I think is coming
out this week on TalkPython.
I'm pretty sure that's the schedule called CodeCarbon.io.
And it is a, let me pull it up here. It is a, both a dashboard that lets
you look at the carbon generation, the CO2 footprint of your machine learning models,
specifically around the training of the models. So what you do is you pip install
this emission tracker, and then you just say, start tracking, train, stop tracking.
It uses your location, your data center, the local energy grid, the sources of energy from all that. And it'll say like, oh, if you actually
switch to, say, the Oregon AWS data center
from Virginia, you'd be using more
hydroelectric rather than, I don't know, gas or whatever.
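The arithmetic behind this kind of estimate can be sketched as energy used multiplied by how carbon-intensive the local grid is. The snippet below is a back-of-the-envelope illustration only, not the tracker's actual implementation or API; the function name, the power draw, and the per-region intensity numbers are placeholders invented for the example, not real measurements.

```python
def estimate_emissions_kg(duration_s, avg_power_w, grid_g_per_kwh):
    """Estimate CO2 in kg from run time, average power draw, and grid intensity."""
    # watts * seconds = joules; 3.6 million joules = 1 kWh
    energy_kwh = avg_power_w * duration_s / 3_600_000
    # grams per kWh times kWh gives grams; divide by 1000 for kilograms
    return energy_kwh * grid_g_per_kwh / 1000


# Placeholder grid intensities in grams of CO2 per kWh (NOT real figures).
HYPOTHETICAL_GRIDS = {"mostly_gas_region": 400, "mostly_hydro_region": 100}

# A one-hour training run on hardware assumed to draw 250 W on average.
for region, intensity in HYPOTHETICAL_GRIDS.items():
    kg = estimate_emissions_kg(duration_s=3600, avg_power_w=250, grid_g_per_kwh=intensity)
    print(f"{region}: {kg:.3f} kg CO2")
```

The same run gives a different footprint purely because of where it executes, which is the comparison the dashboard is described as making.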
We were talking about some of the ethics and cool things that we should be paying
attention to.
And I feel like the sort of energy impact of model training might be worth looking at
as well.
So I totally agree with model training.
I've been wondering about this other thing, though, and that's testing on GitHub.
Like, if you think about some of these CI pipelines, they can be big, too.
Like, I've heard projects that take like an hour on every commit.
I'd be curious to run this
on that stuff as well.
Yeah.
Well, you could turn on,
you could employ this
as part of your CI CD.
It doesn't really have to do
with model training per se,
but it does things like
when you train models
that use a GPU,
it'll actually ask the GPU
for the electrical current.
Ah, right.
So it goes down into the hardware.
That's a fancy feature.
And it goes down to the CPU level
voltage and all sorts of low...
It's not just, well, it ran for this long, so it's
this. It's really
detailed. That said, I suspect
you could actually answer the same question
on a CI, right? It would just say, well, it looks like
you're training on a cpu yeah yeah true no but so it's a nice way to be conscious about compute
times and stuff so that's uh yeah and what's cool is it has the the dashboard that like actually
lets you explore like well if i were to shift it to europe rather than train in the u.s which
who really cares where it trains but then what difference would that have look at how green paraguay is we are hosting yeah that's incredible i suspect uh a lot
of waterfalls yeah countries down there have insane amounts of hydro uh like chile maybe i
can't remember exactly but yeah a lot of hydro and you see and you see iceland as well and it's
probably because of the volcanoes and warmth and heat and yeah yeah the geo yeah okay interesting all right nice Brian you got anything no not this week how about we do
a joke sounds good so uh it's been a while since I've been to a strongest man competition world
strongest man you know like maybe one of those things where you pick up like a telephone pole
and you have to carry this throw it as far as you can or you lift like the heaviest barbells or like
you carry huge rocks some distance.
So here's one of those things.
There's like three judges, a bunch of people who look way over pumped.
They're all flexing, getting ready.
The first one is this person carrying a huge rock, sweating clearly.
And the judges are, they're not super impressed.
They give a five, a two and a six.
Then there's another one lifting this, you know,
500 pound barbell over his head,
does eight, seven and six as their score.
And then there's this particularly not overly strong
looking person here that says,
I don't use Google when coding.
Wow, so strong.
The judges give him straight tens.
And he's also being like really sincere,
like his hand over his heart
oh yeah like it's very humble yeah exactly all right well that that's what i got for you
take it take it for what you will that's pretty good just stack overflow yeah yeah well i feel
like Stack Overflow would take it to 11, honestly. I don't use Stack Overflow. Now you have a winner. Definitely. That's funny.
well thanks for that um you're usually pretty good about finding our jokes i appreciate it
and uh thanks for coming on the show uh thanks for having me it's fun i think that's a wrap
yeah that is thanks brian thank you bye vincent y'all have a good one