Python Bytes - #276 Tracking cyber intruders with Jupyter and Python
Episode Date: March 23, 2022
Topics covered in this episode: gensim.parsing.preprocessing, DevDocs, The Right Way To Compare Floats in Python, Pypyr, Extras, Joke
See the full show notes for this episode on the website at python...bytes.fm/276
Transcript
Hello and welcome to Python Bytes, where we deliver Python news and headlines directly
to your earbuds.
This is episode 276, recorded March 22nd, 2022.
So many twos.
I'm Michael Kennedy.
And I'm Brian Okken.
And I'm Ian Hellen.
Hey, Ian.
Welcome to the show.
It's great to have you here.
Thank you very much.
I listen to the show a lot and feel very privileged to appear on it.
It's our privilege to have you here.
Thank you so much for listening.
And I know you got some cool stuff to share.
So we're looking forward to hearing about that.
Also, I do want to say thank you to FusionAuth for sponsoring the show.
I'll tell you more about them later.
Before we get into the topics, Ian, tell people a quick bit about yourself.
Sure.
I'm a developer at Microsoft, in the Microsoft Threat Intelligence
Center. I've been with Microsoft quite a long time, but only relatively recently, like four
years or so ago, got into Python coding with Jupyter notebooks. So I work on
Jupyter notebooks for the Microsoft Sentinel project and own a modest open
source package called MSTICPy, which we'll cover a little bit later.
It takes up most of my time.
Fantastic. The whole cybersecurity threat detection stuff, it's very interesting.
There's a lot of innovation there, but it's also, it's a challenging area to be working.
Yep. Yep. We're never short of stuff to do.
I'm sure you're not. Well, Brian, how about you kick us off here?
Well, so I'm going to start off with a problem.
So I had a problem and I have a cool solution for it.
So my problem is on test and code, I've got titles and I want to end a show on MP3 file.
But I want to create automated show notes. Or not show notes, transcripts. So one of the problems,
and there are a lot of problems in trying to automate this, is the title.
The title is something that's, you know,
normal English, with capitalization and all sorts of spaces and stuff.
Things that URLs hate.
Yeah, I want to turn that into a URL.
And one of the things is getting rid of stop words.
So there's a bunch of stuff like lowercasing.
I can do that easy.
But getting rid of stop words was a little hard.
So I ran across this thing called
gensim.parsing.preprocessing.
So Gensim is a larger sort of beast.
It's used for machine learning and stuff to generate models.
But I'm just really using one little piece of it,
the pre-processing part.
And it's really pretty cool.
I was looking, I actually found this article first.
There was an article called Removing Stop Words from Strings in Python.
And it has a discussion of NLTK and Gensim and spaCy.
I tried all of them out, actually.
And the one that really stuck best for me was the one it
talked about from Gensim, remove_stopwords, which is exactly what I wanted.
So I went ahead and tried that and it worked really well, but I'm like, wait, I'm pulling
this in from the preprocessing library. I wonder what else is in there? And there's all
sorts of really cool stuff in here. There's lower_to_unicode. It turns the text both
into lowercase and into Unicode.
That's pretty neat. Don't think I need it, but that's neat. But then there was one
that I thought maybe is exactly what I want, something called preprocess_string,
and it has a whole bunch of filters built into it.
Oh, nice. Like strip...
Yeah, strip whitespace, strip punctuation.
I love it.
Yeah. And it takes
away multiple whitespaces after it strips punctuation. Like, if I go back, I had a slash in my
title for one of the episodes. If it takes that out, I'm going to have a space before and a space after, so
I want to remove those. So it'll strip multiple whitespaces. It strips out numerics, because I probably
don't want numbers in there.
And then it removes stop words.
The one thing I don't want, so I'll have to customize how I'm calling this, is stem_text.
So stem_text,
I didn't know what that did without playing with it,
but what it does is it will take words like "twisted" and "turning" and reduce them
to a stem like "twist".
That's really not right.
So I definitely don't want that.
I don't want it to mess the words up, but I think I want everything else. So this Gensim
library, you know, if you're doing machine learning and coming up with models,
I think this is a great tool to look into. But I'm actually going to use it
just for removing stop words to create these titles for my podcast. But I think
it feels a little weird.
It feels like I'm using this really big hammer to do this little tiny problem.
I guess I'm okay with it, but you know, do you have any other ideas
for what I could use?
Well, I didn't know about this, so I wrote my own. Okay.
And it's, it's, it's kind of janky.
Like, it's a little bit recursive, iterative.
It's like, we'll take away all the punctuation now.
Turn all of your white spaces into single white spaces,
because there might have been, you know, dot space.
So now you've got two white spaces, but you've got to take away,
you know, there's like a bunch of weird steps and then put it back.
This looks cleaner.
It is a dependency, but it does look cleaner. I like this. I think it's, I'm glad I know about it. Ian, what do you think?
Is it a huge thing? I mean, it's a dependency, and I always think of ML stuff as big,
but this is just the preprocessing, right?
Well, I'm actually pulling in all of
Gensim to get this. I don't know if I can pull in little bits, but it's not really part of the
application that I'm shipping. It's just
a tool that I'm using on my laptop, so I guess downloading it once doesn't really bother me too
much, even if it's a big thing.
Pretty cool.
Yeah, that's a good
point. If it's running locally, it's like a dev dependency. Who cares, right? It's like worrying
about how big pytest is. It doesn't really matter.
Well, I kind of care about that,
because CI is going to pull it in all the time for pytest.
But they got fast networks.
It's not your bandwidth.
It'll be all right.
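For readers following along, the hand-rolled title cleanup described in this segment (lowercase, strip punctuation, drop numbers, collapse whitespace, remove stop words) can be sketched with only the standard library. This is a rough illustration and not the code either host wrote: the slugify name and the tiny STOP_WORDS set are made up for the example, and a real stop word list, such as the one Gensim ships, is much longer.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "with", "in", "on", "for"}

# Map every punctuation character to a space so "a/b" does not become "ab".
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def slugify(title):
    """Turn an episode title into a URL-friendly slug."""
    text = title.lower()
    text = text.translate(PUNCT_TO_SPACE)
    text = re.sub(r"\d+", " ", text)  # drop numbers
    # str.split() with no arguments also collapses runs of whitespace.
    words = [word for word in text.split() if word not in STOP_WORDS]
    return "-".join(words)


slug_one = slugify("Tracking Cyber Intruders with Jupyter and Python")
slug_two = slugify("pytest / unittest: the 2 Big Ones")
print(slug_one)
print(slug_two)
```

Replacing punctuation with a space rather than deleting it is what creates the doubled spaces mentioned in the discussion, which is why the whitespace collapse has to come afterwards.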
One of the things that struck me about this
that made me think of your situation
is like that lowercase to Unicode
and so many times in the security space,
it's about like, you're checking for this representation,
but what if there's another representation
that means the same thing?
Like you don't say go to this directory,
you say go dot, dot, and then over there,
you know, those kinds of non-canonical representations.
I wonder if there's any use of this kind of stuff for you.
Yeah, there's something I kind of touch on
in the Pygments section later on,
which is like the attackers typically write scripted attacks
and try to obfuscate code using a mix of kind of uppercase
and putting random dots.
I'm just thinking that would be a nice,
potentially a nice way of kind of cleaning some of that stuff up.
Yeah, for sure.
There's been some interesting supply chain vulnerability stuff.
Remember the guy with the colors and, I think, the faker libraries in JavaScript,
who sabotaged his own libraries? There was another one that was maybe well-intentioned, I don't know.
It was some open source library. I don't believe it was Python. I can't remember what it was.
It could have been, but I'm pretty sure it was in JavaScript, because that's where
most of the bad stuff was, it seems. Anyway, they taught their dependency to erase everybody's hard drive
who installed it, who was in Belarus and Russia, which, okay, maybe they're trying to contribute,
but like it ended up doing a bunch of bad things, even to places that were like trying to help,
say, people in the press and journalists do certain things and then like, you know,
connect with sources
and then erase like that database as well.
And what they did to make it
so that nobody would notice in the GitHub commit
before it went out to NPM
was base64 encode their changes.
So they basically put a base64 encoded string
and then like decode and then run that.
And, you know, it's like that kind of stuff.
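As a rough sketch of why a Base64 payload hides so well in a diff, and how a reviewer might surface it, the snippet below scans source text for long Base64-looking literals and decodes them. Everything here is hypothetical: the function name, the regular expression, and the sample script are invented for illustration and are not taken from the incident or from any particular tool.

```python
import base64
import binascii
import re

# Runs of 16 or more Base64 alphabet characters, optionally padded with "=".
B64_LITERAL = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_candidates(source):
    """Return the decoded text of every literal that is valid Base64 and valid UTF-8."""
    decoded = []
    for match in B64_LITERAL.finditer(source):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            # Long identifiers and binary blobs end up here and are skipped.
            continue
    return decoded


# A made-up one-line "commit" that hides a destructive command.
payload = base64.b64encode(b"rm -rf /home/user/data").decode("ascii")
script = f'eval(atob("{payload}"))'

findings = decode_candidates(script)
print(script)
print(findings)
```

The encoded line looks like harmless noise in a review, while the decoded text is immediately readable, which is the point being made about non-obvious representations.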
I know this won't solve that problem, but, you know, that sort of category of weird representations.
Yeah, you need MSTICPy for something like that. It's one of the things
we do. It's a common thing, Base64 decoding, deobfuscating.
Yeah. Interesting.
Yeah, I thought maybe using something like that with... Because one of the problems we have is every script is kind of slightly different.
If you could use something like that to essentially apply sentiment analysis to a script,
I mean, this is a big problem.
This is not something I've particularly solved, but that might be a useful thing,
just picking out certain things that indicate something malicious, like, you know,
format drive.
Exactly. Yeah, you could certainly flag that this one does hard drive stuff.
I thought it was parsing colors. Why is it doing things with the hard drive? This is odd.
Or with the network, stuff like that.
Cool. All right. Well, you know what you would really
want to check out if you were trying to research these things? Probably documentation. So I want to tell you all about DevDocs, devdocs.io. This is pretty cool. Now, when you get there,
it's an interesting, on my Firefox, it's just got like the mobile view, which is really odd.
If you go there with a full browser, or what it believes is a full browser, I guess, it's a
slightly different view that's pretty similar, but not the same. So you open it up and there's a whole bunch of programming technologies. Not just Python or JavaScript or something,
but there's also Vue.js, and there's Werkzeug, for example, some of the foundation of Flask.
And you can pick the particular versions and stuff. You can go and enable these different
things. So maybe I care about Vue. I can go over here and enable that one. We definitely want some Python.
Let me go find some Python.
And it gives you all the versions.
I'll take that.
And let's say I'm also working with Postgres.
So I'll enable that documentation.
And then I might be working with Nginx
for the front end, which is somewhere right here.
So you can go enable that.
And then it will be up near the top somewhere here.
You can see these are either the default ones
or the ones that I checked on.
So then you can open them up and say,
I want to go and see the Nginx guide about a debugging log.
And then it takes you to the documentation
for that technology.
So it's like a meta documentation repository
for all of these things all at once,
which is pretty cool, right?
So I can go up here and search. Let's say I want to know about media tags or
something. So you can see the stuff in HTML5. When you type media,
it also matches median, so you can see that in the statistics module for Python, plus some stuff for CSS,
or you could come over and say, look, I just want to search for CSS.
And then you get like using media queries and how to do that kind of stuff.
So it's kind of a, what you do is you turn on the pieces that are relevant to you, and then you can search across those technologies.
Cool, right?
Wow.
Yeah.
And then if you're on the move, you can come over here and turn on offline data,
and it'll download all of that
as an app, so that when you're at the coffee shop or you're on a plane, you now have all the
documentation for Python 3.10, Vue.js, Werkzeug, Nginx, et cetera, et cetera, that you can use,
which is pretty cool. And this is something that drives me crazy about Firefox. They had it and
they took it away, and I don't understand why, because I
feel like Firefox is all about the web. So they took away the ability to do progressive web apps
in Firefox, but all the Chromium browsers support it. So you can actually go and install this as a
dedicated application on your system. So you, if you have no web, you just click that open. It's
its own window. You can up, you know, alt tab, command tab between it.
It's super easy.
And then turn on the offline mode.
And you basically have an app that has offline documentation for all the programming technologies that you care about.
So this is my new coffee shop buddy.
Does the search go across the thing you've selected then?
So if I search for like replace or something, it's the things I've selected?
Yeah.
So if you turn on, like, JavaScript and Python,
it would look for that in both languages.
Oh, okay.
Yeah, so basically, there's a
ton of them, right? You pick the ones you turn on, you say these are interesting to me, and then search and stuff,
from what I can tell, only applies to the technologies you say you care about. Because
if you don't use Java, you really don't want to see the documentation for Java in search, right?
That would be useless.
Yeah. One of the things I like about this is it also has versions. So
if you're using an older version of Postgres, you can just enable that version.
Right. Sometimes it doesn't matter very much, but other times it matters massively. Like Bootstrap
3 and Bootstrap 5, they're fully incompatible, basically. They have totally
different keywords and grid systems, and you don't
want just the latest if you've got an old app you're working on, something like that. Python's
more forgiving about that kind of stuff. It doesn't break as often.
I was amused that the list,
though, has, like, 3.9 and 3.8 for Python, and it has 3.10 at the bottom,
obviously because it's alphabetically sorted.
How interesting.
Ian, what do you think of this?
That's very cool.
I'm amazed.
Is somebody at DevDocs kind of manually maintaining
all of the links to these,
like the original source documentation?
Yeah, where are they getting it from, right?
I mean, because they're super disparate.
It's like Matplotlib and Markdown and MariaDB.
These are all... It's unlikely they're all stored in the same basic system, right?
I don't know how they get them, actually.
Yeah, it's very cool.
I mean, I normally have solved the same problem by having like 130 tabs open to different bits of Python docs and pandas.
Exactly.
Exactly.
Yeah, I'm pretty sure they got pandas in here.
They got numpy as its own thing. We saw matplotlib, there's pandas,
and there's even versions of pandas across there.
Single-tab solution. Brilliant.
Yeah, it looks pretty good to me.
All right.
Ian, want to tell us about what you got for your first item?
Okay, sure.
So as I mentioned earlier, I own a package called MSTICPy.
And the first thing to sort out with it is the spelling, because I suffer from this on a
daily basis, mistyping it, even though I've owned it for like three or four years.
So it's M-S-T-I-C, standing for Microsoft Threat Intelligence Center.
There's no Y or anything like that in there.
So it's a tool set for cybersecurity investigations and hunting in
Python, mainly in Jupyter Notebooks. So there are a couple of questions to ask about that.
So firstly, what is cybersecurity hunting and investigation and why are Jupyter Notebooks useful?
So the first one, cybersec investigation is really responding to alerts or other kinds of
threat intelligence and trawling through typically large amounts of security logs from cloud services, hosts, account services to determine whether this is a real threat or not.
That's one of the huge problems, right?
Is you've got all these different systems.
How are you going to know if someone, if you don't have a tool like this, how are you going to know that someone's in there rooting around, right?
Yeah.
Yeah.
And there are a couple of things that usually trigger this kind of search.
So one of them is an alert, maybe coming from your SIEM.
And that stands for Security Information and Event Management.
So like a console, like ArcSight is a traditional one or Microsoft Sentinel is a cloud-based one.
So you get an alert based on a rule and you need to go in a fairly managed process.
Somebody needs to go and investigate, is this a real threat or is this just noise?
Or there might be something like SolarWinds a year ago, or Log4j, something
in the press, or something from a threat intel kind of alert that says this
kind of threat is around. And that's a more ad hoc process, kind of hunting: do we
see this in our organization?
So that's the kind of need that MSTICPy is trying to address.
And the second question is why Jupyter Notebooks?
Why would you do this in a Jupyter Notebook rather than in your existing SOC tools? I mean, I
think this kind of activity has a lot in common with
big data science. I mean, something like astronomy, where
hunting for adversary activity is a little bit like trying to find an exoplanet
in gigabytes of data, or a new quasar, or something like that.
Yeah. 100,000 stars or 100,000 lines of log file, and you're hunting for some patterns
and stuff.
And you've got a few photons you're trying to detect. Something
like adversary activity is a little bit like that: millions and millions of events,
and you're trying to find the bad stuff. Traditional SOC tools can be really excellent, and I
work with one that I think is really good, but they all have limitations.
What's a SOC tool?
A SOC tool, yeah. Security operations center. So something like a console that
fires alerts, and you have a bunch of analysts and engineers looking at the output of
this and deciding. And that's the trigger for their investigations.
Is it like a failed
login to SQL Server?
Yes, something like that. Or it could be a more sophisticated thing,
like something that
looks like it's trying to access password data on this host, or has made a weird configuration change to mailbox settings.
So all those things can trigger alerts and investigations. But you are limited in most
operations center environments. Notebooks allow you to break out of some of the constraints
of that. So firstly, you can get data from anywhere. You're not just limited
by what's in your logs. You could go to VirusTotal, for example.
You can use customized analysis, so write your own or get things from PyPI. Lots of
people have written this stuff. You control the workflow, so you don't have to follow
what the tool says. You can reorder things, you can backtrack, redo things, and the workflow is repeatable. So if you get a similar kind of
issue again or similar kind of alert, you can fish out an old notebook and
rerun the same kind of analysis. And you end up with a nice kind of shareable document
that describes your investigation a bit like the results of a
scientific investigation. It's like, here are all the steps I took and these are the results.
And this is what they, this is what we determined to be the bad, you know, the bad activity.
Right. The other thing that seems useful here is that Jupyter notebooks will often save the last
bit of computed information, and then you can go, you know, change a cell and ask the question again
without rerunning the whole thing. And if that's parsing tons of logs or pulling them over SSH or
whatever, not doing that again is nice.
Yeah, and it's brilliant if you don't like doing lots
of queries in different browser tabs, where if your browser crashes, they've all gone. What do you do?
Here, it's all in the Jupyter notebook, which is saved second by second as you do it.
You can just go back, and you can go back to things that you may have done months ago.
Yeah, absolutely.
Yeah.
So when I started all of this, I kind of thought a lot of this stuff for cyber
investigations would be available on PyPI.
I thought, great, Jupyter notebooks seem brilliant, and there's going to be a
process tree viewer and there's going to be an event timeline and all this kind of stuff.
And I found out there wasn't, at least I couldn't find it.
So I decided I'd need to start writing this stuff.
So it turns out that the visualizations you need for detecting exoplanets are a bit
different from the ones you need to detect bad actors.
So we started building
this thing. Originally it was just me, but there's now Pete Bryan and Ashwin Patil also working
on it, two of my colleagues, and a bunch of people in the community. It's got four main functional
sections. There's data querying, how you get data in, how you do templated queries. There's
enrichment. So for example, if you have something like an IP address,
you might have a bunch of questions about it as an analyst,
like which geographical location is this IP address from?
Are there any malware reports about it?
The third area is analysis, things like anomaly identification,
like the thing you were talking about, an
unusual spike in failed logon events, that kind of thing. The final layer is visualizations,
and these are more specialized. I've got a couple of examples in the show notes.
This is an anomaly identification pattern. This is one of the custom ones. We use Bokeh,
which is a really nice visualization package, to allow you to view data in the way that an analyst expects
to see it.
So they're more this kind of visualization
than more traditional graphs.
I would much rather look at this than log files,
or event logs, or whatever.
Yeah, that's the whole thing about you
may have thousands of events, and you
need to get down to the few that are the interesting thing.
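To make the "spike in failed logon events" idea concrete, here is a deliberately simple sketch using only the standard library. It is not how MSTICPy does anomaly identification; the function name, the two-sigma cutoff, and the sample data are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean, pstdev


def failed_logon_spikes(timestamps, threshold_sigmas=2.0):
    """Return the hours whose failed-logon count is far above the average hour.

    Hours with no events at all are not counted, which keeps the sketch short
    but would understate how unusual a spike is in quiet periods.
    """
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    counts = list(per_hour.values())
    if len(counts) < 2:
        return []
    cutoff = mean(counts) + threshold_sigmas * pstdev(counts)
    return sorted(hour for hour, count in per_hour.items() if count > cutoff)


# Made-up data: a handful of failures per hour, then 40 in the sixth hour.
start = datetime(2022, 3, 22, 0, 0)
failures_per_hour = [3, 4, 2, 5, 3, 40, 4]
sample_events = [
    start + timedelta(hours=hour, minutes=minute)
    for hour, count in enumerate(failures_per_hour)
    for minute in range(count)
]

spikes = failed_logon_spikes(sample_events)
print(spikes)
```

Out of 61 events, only one hour is returned, which is the "get down to the few that are interesting" step in miniature.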
So one of the areas that we've tried to focus on currently, because we wrote all this stuff,
and you have hundreds of functions that you could use, but it's kind of difficult to discover them.
And they all, because they evolved a little bit organically,
they all work in a little bit of a different way,
different set of parameters.
So the work we're currently doing
is trying to make this
all a bit more accessible.
So all of the functions
that relate to say an IP address,
all the questions you want to ask about it
are kind of dynamically attached
to a class called IpAddress.
So they're all things like...
Oh, interesting.
So you don't have to work
just with a raw string or just some raw IP representation, but you can ask it questions, like its location?
Well, it's not quite that intelligent. It's even a bit less intelligent than Alexa.
But it's more like, you know, there might be things like geolocation of an IP address,
threat intel lookups, different queries that take an IP address as a parameter.
And previously you'd have to go and find all of these things and import them separately and run
them. But now they're all dynamically attached as methods. The fact that they use an IP
address as a parameter means that you just have one object to import, and then you can do all of
these different operations on this single item.
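The general pattern of attaching standalone functions to an entity class at runtime can be sketched as follows. This is only an illustration of the idea and not MSTICPy's implementation: the attach decorator and the two example functions are hypothetical, and they use the standard library's ipaddress module rather than any geolocation or threat intel service.

```python
import ipaddress


class IpAddress:
    """Hypothetical entity wrapper around a single IP address."""

    def __init__(self, address):
        self.address = ipaddress.ip_address(address)

    @classmethod
    def attach(cls, func):
        """Register a function that takes an IpAddress as a method of the class."""
        setattr(cls, func.__name__, func)
        return func


# These could live in separate modules; importing IpAddress is enough to use them.
@IpAddress.attach
def is_private(entity):
    return entity.address.is_private


@IpAddress.attach
def reverse_pointer(entity):
    return entity.address.reverse_pointer


internal = IpAddress("10.1.2.3")
print(internal.is_private())
print(internal.reverse_pointer())
print([name for name in dir(IpAddress) if not name.startswith("_")])
```

Because the functions end up as attributes of the class, they show up in dir() and therefore in tab completion, which is the discoverability benefit being described.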
There's some things that don't work with that.
Some things like the visualizations, for example, they're not IP address or host or account
specific.
They work on big blocks of data.
So the other area we're working on is, for anything that takes a bunch of data as
an input, we're writing those as pandas accessors.
So they appear as methods on a DataFrame.
So you do something like dataframe.mp_plot.timeline, and it would produce your timeline as long
as it's the right kind of data.
So, yeah, one of the challenges of writing this kind of thing organically is you
end up with a lot of stuff, but nobody knows it's there and nobody knows how to import it.
So we're trying to make it accessible, so that it just becomes a very intuitive thing.
Oh, I have an IP address.
What functions can I do?
I can do this.
You know, it's all tab completable, that kind of thing.
Yeah.
I think it's really cool you've taken this Python data stack view of cybersecurity and
threat detection.
Yeah.
Yeah.
Brian, what do you think?
Well, it's definitely a complicated area.
And trying to,
one of the things I like about this story
is just talking about the complexities
in API design and discoverability
that applies to like lots of different fields.
But yeah.
Yeah, it's one of those things
you should have thought about at the beginning,
but even at the end,
you can tidy things up.
Yeah.
So, famous last words. So yeah, we're definitely open to
other people collaborating and contributing stuff, because there's a lot of ground to cover.
Yeah, for sure. It's on GitHub, I saw.
Yep.
One final question before we move on: is it just for Azure,
or is this a thing that more broadly works across different systems?
No, I think I should have mentioned that a little bit earlier on.
We originally built it for Microsoft Sentinel notebooks,
but it supports Splunk, Defender.
We're working on an Elastic provider.
So really, anything you can get into a Pandas data frame,
you can use most of the functionality.
So even if we don't have a provider ourselves, if you've got something like PySpark and you
can get a data frame, then all of our functions take data frame.
You know, we use pandas as our universal data interchange format.
Yeah, indeed, indeed.
Kim van Wyk out in the audience likes it.
It's a much nicer way to glean info than logs and complex grep.
I'm right there with you.
All right.
Now, before we move on, Brian, let me tell you about our sponsor for this episode.
This episode of Python Bytes is brought to you by FusionAuth.
FusionAuth is an authentication and authorization platform built by devs for devs.
It solves the problem of building essential user security
without adding risk or distracting from the primary application. FusionAuth has all the
features you need with great support and a price that won't break the bank. And you can either
self-host it or get the fully managed solution hosted in any AWS region. Do you have a side
project that needs custom login and registration, multi-factor authentication, social logins, or user management? Download FusionAuth Community Edition for free.
The best part is you get unlimited users and there's no credit card or subscription required.
Learn more and get started at pythonbytes.fm/fusionauth. The link's in your show notes.
Thank you to FusionAuth for supporting the show.
All right. What do you got for your next one, Brian?
Numbers. Something every computer scientist should
know. Yes. Floating point arithmetic is complicated. And so when I started working
professionally, one of the things that was recommended reading was an article called
What Every Computer Scientist Should Know About Floating-Point Arithmetic.
And don't worry, it's only a really long paper with lots of math. So I am not telling you to read this, although it is an interesting read. What I would like you to read is this article by
David Amos called The Right Way to Compare Floats in Python, because there's a few things that we
need to know about floats when we're using
them. And he covers all of this in the article without going through
tons of scary math. Floating point numbers have to be represented in a way that
the computer can store them and use them and manipulate them, even though some numbers
are huge and won't fit normally.
So we have to do things like accept that there's error and rounding.
So there's a little bit of a discussion there that he talks about.
One of the things that surprises people sometimes when they first come into Python,
but it's not just Python, it's most languages,
is somewhere there's going to be something obvious that doesn't work. Like in David's example, 0.1
+ 0.2 == 0.3, that comparison will show up as False, because they're not equal. And
this is weird. They obviously should be. It's crazy that that doesn't work. But it's not just equals.
You can also do comparisons like, you know, less than or greater than. So not only are they not equal, 0.1 + 0.2
is not even less than or equal to 0.3. It's weird. So what do you do?
The gist of it is, don't compare things with normal math comparisons if there are floating points involved.
So what you want to do instead is, and here's a little tiny bit of math, way less than the example.
The thesis, the dissertation.
Yeah.
So there's a whole bunch of stuff built into Python to work with comparisons.
And one of the most common ones, I'm trying to get there,
is math.isclose. So there's a math library with an isclose function
that's used to just say, hey, I've got two values, are these close enough?
If you have to compare floats, something like this is great. And
behind the scenes, what it does is it's taking the two values and subtracting them
and figuring out if the absolute value of the delta is below some tolerance, some
reasonable tolerance, like close enough. And what that tolerance is, is either a relative or absolute
tolerance. And most of the time
you can kind of get away with not caring about that, but if you do care about it, you can control
that. You can pass in what tolerance you expect things to be within. I use stuff like this all
the time with test equipment, because I definitely want control over the
tolerance levels.
Yeah, for sure.
So there's math.isclose, but then there's also,
I'm not going to scroll all the way down here,
but he also covers NumPy.
So NumPy's got a couple of these that are really great.
One of them is isclose also, but it works on arrays,
and it'll give you an array of True and False values.
But you can also use allclose, which just tells you, given
two arrays, whether all of the pairs are close enough. Also covered,
which we use during testing a lot, is pytest.approx, which is a little bit of a different beast. But
David covers that. So basically this is a semi-regular reminder
to anybody using floating point math in Python
that you should be careful with it
or any other language.
Yeah, it's not a Python thing.
It's just representing things that don't fit.
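The comparison problem and the math.isclose fix discussed in this segment look like this in practice. The roughly_equal helper is a hypothetical name; it spells out the check that the Python documentation gives for math.isclose, so the relative and absolute tolerances are visible. The voltage reading is invented for the example.

```python
import math

total = 0.1 + 0.2

exactly_equal = total == 0.3             # direct comparison
less_or_equal = total <= 0.3             # ordering comparisons are affected too
close_enough = math.isclose(total, 0.3)  # default rel_tol is 1e-09, abs_tol is 0.0


def roughly_equal(a, b, rel_tol=1e-9, abs_tol=0.0):
    """Written-out version of the documented math.isclose check."""
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)


# A relative tolerance is useless when comparing against zero,
# so an absolute tolerance has to be passed explicitly.
near_zero_default = math.isclose(1e-10, 0.0)
near_zero_abs = math.isclose(1e-10, 0.0, abs_tol=1e-9)

# Made-up test equipment reading: accept anything within 1 millivolt of 5 V.
reading_ok = math.isclose(4.9997, 5.0, abs_tol=0.001)

print(exactly_equal, less_or_equal, close_enough)
print(near_zero_default, near_zero_abs, reading_ok)
```

Passing abs_tol explicitly is the "control over the tolerance levels" case mentioned for test equipment.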
Now there's some things,
sometimes where you have to be very exact,
you need to be very precise.
And in those cases,
Python does have the Decimal and Fraction types.
And David covers these in the article,
which is cool.
They're cool things to know about,
definitely for people working with money
or other things that need very high precision.
So those are covered, though
you take some sort of a performance hit for those.
But if you really care about like the precision
and want to do things exactly right,
then you probably should read that larger article
because there's things that you have to do
like certain operations before other operations
to try to keep the error from accumulating too high.
So it gets messy.
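A short sketch of the exact types mentioned here, plus one small case of error accumulating across repeated additions. All of this uses the standard library; the readings list is made up, and math.fsum is shown only as one readily available way to reduce accumulated error, not as something taken from the article.

```python
import math
from decimal import Decimal
from fractions import Fraction

# Exact decimal and rational arithmetic: the comparison that fails for floats works.
decimal_equal = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
fraction_equal = Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

# Building a Decimal from a float carries the binary rounding error along with it,
# so construct Decimals from strings.
from_float_matches = Decimal(0.1) == Decimal("0.1")

# Error accumulates over repeated float additions.
readings = [0.1] * 10
naive_sum_exact = sum(readings) == 1.0
careful_sum_exact = math.fsum(readings) == 1.0

print(decimal_equal, fraction_equal, from_float_matches)
print(naive_sum_exact, careful_sum_exact)
```

The last two lines are a small instance of the ordering and accumulation issues that the longer paper covers in depth.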
I think I'm fundamentally disturbed
by the idea that zero isn't zero.
So my approach to floating point numbers is normally convert them to ints.
Yeah, I was thinking that sometimes that is the way to do it, right?
I was thinking this kind of stuff maybe applies a lot to the project that you're working on
if you're trying to come up with ratios that represent, you know, how risky something is and things like that.
Yeah. Yeah. Yeah. I mean, certainly a lot of, yeah,
I was being a bit flippant before.
It's just fun. I'm very Platonic at heart, I think.
So like zero and one should be zero and one, not nearly one or nearly zero.
There should be a perfect square and a perfect circle.
Like how can they not exist in our language?
Is it really zero or negative zero?
Henry in the audience, hey, Henry, also points out that pytest.approx also works on NumPy arrays as well.
Nice.
Which is pretty cool.
Cool.
You can put that all together.
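Putting the array versions together might look something like this sketch, which assumes numpy and pytest are installed. The sample values are invented for illustration.

```python
import numpy as np
import pytest

measured = np.array([0.1 + 0.2, 1.0, 2.5])
expected = np.array([0.3, 1.0, 2.6])

# Element-wise check: one boolean per pair of values.
per_element = np.isclose(measured, expected)

# Single answer: True only if every pair is close enough.
all_match = np.allclose(measured, expected)

# Loosening the absolute tolerance lets the 2.5 vs 2.6 pair through.
all_match_loose = np.allclose(measured, expected, atol=0.2)

# pytest.approx works on plain floats and on NumPy arrays.
scalar_ok = bool((0.1 + 0.2) == pytest.approx(0.3))
array_ok = bool(measured[:2] == pytest.approx(expected[:2]))
```

In a test you would normally write the last two comparisons directly inside an assert, so that pytest can report which elements were out of tolerance.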
All right.
Let me tell you all about pypyr.
I think "piper" might be
the way you pronounce it.
Everything needs its own description,
its own like little phonetic bit.
So this is a simple way to create scripts
that run and do stuff on your computer using Python.
And what's cool about it
is it has a real simple way to define the steps.
Some of those steps can be optional,
but then you can also piece together things like other programming. So you can combine commands,
different scripts and different languages and applications all into one sequence of events
that happens on your computer. So it's basically a task runner where you define stuff in YAML.
And probably the best way to see it is to go check out the docs. And there's a whole bunch of docs. The docs are really nice here, actually. So for
example, if you go to getting started and come down here and run your first pipeline, I really
like the way the docs here look. But the way you define it, here's a
one-step one: you just say the steps, and it's all YAML, and you give a step a name so you can refer to it.
And then you have inputs and outputs and, you know, you do the little curly string interpolation types of things.
Or you can have more complex ones like with different steps.
And you can even have little comments.
There's a way to put a comment in your YAML file as well.
So there's also conditional.
Let's see if I can find a good conditional one down here.
Here's one that goes and works with like,
this one is just an echo statement and the ping command.
But whatever you want to do,
you can basically pass command line arguments
to the YAML file or to the workflow, the pipeline.
And it'll take those and feed them into the steps.
So for example, when you call it,
you can say like count equals one and IP equals that.
And those will become the little string interpolated pieces
that go in there.
So you can just combine whatever,
basically whatever commands are available to the shell,
be that Python or POSIX or Windows or PowerShell
or whatever you're looking to do.
Pretty cool, huh?
That's pretty neat.
I might need this for my job of automating my show notes.
Oh, yeah, there you go.
If you can find this, go do that and so on.
Like here's one that sort of uses the truthiness.
So it says there's a bunch of different steps,
and you can use the run flag.
So here it says run if there's a value for A on this one. And this one says run
if there's a value for B. And then there's an example where it says, okay, we run it by itself.
Those don't run. But if you pass A, then it runs that A step. If you pass B, it does the B step,
or it can do both if you pass them both. And I like the simplicity of it. Like a lot of these
tools like this feel like they're pretty complicated. You know, you're sort of like
your example with Gensim, Brian, where you're like,
is this thing too heavy weight for what I'm trying to ask it to do?
You know, and this seems like a real simple thing and I don't have to learn about make
or any of those kinds of things.
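To make the idea concrete, here is a toy model in plain Python of the two behaviors described: key=value arguments being interpolated into step commands, and a run flag that skips a step when its argument is missing. This is not pypyr's code or API. The PIPELINE structure, parse_args, and plan are hypothetical names invented for this sketch, and a real pypyr pipeline is written in YAML rather than as a Python list.

```python
import shlex

# Hypothetical stand-in for a YAML pipeline: each step has a command template
# and an optional "run" key naming the argument that switches the step on.
PIPELINE = [
    {"name": "always", "cmd": "echo begin"},
    {"name": "step_a", "cmd": "echo A is {a}", "run": "a"},
    {"name": "step_b", "cmd": "echo B is {b}", "run": "b"},
]


def parse_args(pairs):
    """Turn ['a=1', 'ip=127.0.0.1'] into {'a': '1', 'ip': '127.0.0.1'}."""
    return dict(pair.split("=", 1) for pair in pairs)


def plan(pipeline, context):
    """Return the argv list for every step that should run, placeholders filled in."""
    commands = []
    for step in pipeline:
        flag = step.get("run")
        if flag is not None and not context.get(flag):
            continue  # skipped: the run flag is missing or falsy
        commands.append(shlex.split(step["cmd"].format(**context)))
    return commands


nothing_passed = plan(PIPELINE, parse_args([]))
only_a = plan(PIPELINE, parse_args(["a=1"]))
both = plan(PIPELINE, parse_args(["a=1", "b=2"]))
```

Each planned command could then be handed to subprocess.run. The sketch only builds the list, so the orchestration logic stays separate from whatever each step actually does.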
Yeah.
GitHub actions or.
Yeah.
Yeah.
Yeah.
It's got a bit of a GitHub Actions feel to it.
It seems like a nicer kind of declarative approach.
That's pretty cool.
Indeed.
Yeah, if you were not into programming or you didn't want your steps to be programming.
But of course, what happens at each step, you could call a Python app or script that's
going to do something complicated, right, if it needs to.
But the orchestration of that, you don't have to make complicated.
Is it just a command line tool or can you invoke it from Python?
It might be a bit interesting.
I'm sure there's a way to import it and make it do a thing.
It's probably just a Python package with an entry point in this package.
So I would think so.
Yeah, because it would be nice to be able to do that
rather than just using subprocess to invoke a lot of things.
Oh, interesting.
I hadn't really thought about it as a replacement for subprocess.
But yeah, because a lot of times when you're trying to orchestrate stuff,
like it talks about here being part of the shell or being another app or another language,
you would just use subprocess on it, right?
Yeah.
Cool.
Well, there it is.
pypyr.
pypyr.io.
And people can check that out.
It looks pretty interesting.
Nice.
All right.
Ian, you want to take us out with your final item here?
Ah, Pygments.
Okay, so this is a package.
I mean, if you're a developer, there's a very good chance that you have been using this for years like me without knowing about it.
You might have seen it being installed as like a dependency.
It's like, what is that thing?
That was my thought, Ian.
I'm like, I know I see this all the time in my dependencies, and I just never really bothered to look into what it does.
Yeah.
So I hadn't until recently. So if you use Jupyter Notebook Markdown, you
can do like three backticks and then a block of code. And you can actually put like Python
or Bash or something and it will intelligently highlight it. So the thing that's doing that
intelligent highlighting is Pygments.
GitHub Markdown does the same kind of thing,
although I'm not sure whether GitHub uses Pygments.
And if you do developer docs like Read the Docs and Sphinx,
that also uses Pygments to kind of color code your code samples.
And I know a lot of people are writing kind of blog posts
and stuff like that.
There are quite a few services out there where you can take a chunk of code and it will intelligently highlight it and give you a JPEG or a PNG back.
And that's kind of nice, but then you can't copy and paste the code from those samples.
So I don't like that really.
I think if you're going to put code in an article, you probably intend for people to be able to copy and paste it.
Yeah.
That's the most likely thing you are to copy and paste.
Yeah.
Right?
Because you want that code over here.
Yeah.
You don't want an image of it.
You could use OCR to reinterpret it, but it's all there.
Yeah.
And then maybe Brian's Gensim to tidy it up.
So with Pygments, you can use it as a standalone package and it can do this kind
of rendering and it can render to like HTML with like CSS style sheets for all of the
coding.
It can also render to like ANSI terminal, LaTeX, a few other kind of things.
So if you want a kind of nicely formatted piece of code in a document or you're doing developer docs,
it's certainly kind of useful.
I mean, I came across it, which I should say one thing, it also supports, maybe I can just
switch, supports lots and lots of languages.
So it's very simple to use.
It has a highlight function and then you import a lexer, which is like the thing that understands
the tokens in a language, and a formatter for the output type you want.
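A minimal sketch of that highlight, lexer, and formatter combination, assuming Pygments is installed. The one-line code sample is just a placeholder.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter, TerminalFormatter
from pygments.lexers import PythonLexer, get_lexer_by_name

code = 'print("hello")\n'

# HTML output: spans carry CSS class names, wrapped in <div class="highlight">.
html = highlight(code, PythonLexer(), HtmlFormatter())

# The matching stylesheet comes from the formatter itself.
css = HtmlFormatter().get_style_defs(".highlight")

# Same code, same lexer, different formatter: ANSI escape codes for a terminal.
ansi = highlight(code, PythonLexer(), TerminalFormatter())

# Lexers can also be looked up by a short alias, e.g. Python tracebacks.
traceback_lexer = get_lexer_by_name("pytb")
```

Because the HTML output is ordinary text in spans, readers can still select and copy the code, unlike an image of highlighted code.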
I think there's hundreds of these things.
There are a lot of languages in there. No kidding.
More than half of these I've never heard of.
It also supports as well as things like,
you'd expect Python.
It supports Python tracebacks,
so it has a separate lexer for color coding tracebacks.
All the usual languages you'd expect, but also some things like data formats, like TOML,
JSON, XML.
Okay.
Interesting.
Like a lot of the files that we might run across, you can syntax highlight them.
Yeah.
And so it's very easy to use.
And the reason I came across it is because I recently,
so a lot of attacker code tends to be deliberately obfuscated.
So it's kind of Base64 encoded, but then even once you decode it,
it's kind of munged in a way to make it as unreadable as possible.
So one of the things that we try to do is pull that code back,
like decode it, try to clean it, de-obfuscate it.
But if you can present it in as close to the way
a developer would write it as possible,
it makes it much quicker for an analyst
to determine what is this doing.
So we use it now in MSTICPy to kind of color display
things like malicious PowerShell script
or Bash or something like that. So that's how I
came across it: rather than just seeing it go past as part of a pip install, I actually had to invoke
it directly. So a kind of big shout out to the developers and maintainers of Pygments. It's one
of those packages that probably millions of people benefit from, but very few people kind of know
about it.
And it's just super easy to use.
They seem to be adding kind of lexers all the time.
So, great.
Yeah, this is amazing.
I didn't realize that it did all of this.
This is way more advanced than I thought.
Brian, did you know?
No, I just thought it was something
that magically syntax, did syntax highlighting,
so I didn't have to care about it.
Yeah, exactly. I've got a little example in the show notes as well that I pasted. It has a dark theme. Yeah. And you probably want to include this
nobackground equals true if you're using Jupyter notebooks, because if you select a theme, it just
flips the whole notebook's CSS theme. So that tells it
just not to mess with what's in the background. Okay. Yeah, that looks great. Thanks
for pointing out how useful that can be. That's cool. Like I said, I've seen it go by all
the time, I just never really paid that much attention to it. It's probably a pretty
minority use, but if you need it, it's great. Yeah, it's incredibly powerful. Fantastic.
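The effect of nobackground can be seen in a small sketch like the one below. It is not the example from the show notes or the code MSTICPy uses. The monokai style, the PowerShell one-liner, and the inline-style output (noclasses=True) are assumptions chosen to make the difference easy to see.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name
from pygments.styles import get_style_by_name

script = 'Write-Host "hello"\n'
lexer = get_lexer_by_name("powershell")

# The dark background color that the monokai style would normally apply.
background = get_style_by_name("monokai").background_color

# Default: the wrapping <div> gets the theme's background color.
themed = highlight(
    script, lexer, HtmlFormatter(style="monokai", noclasses=True)
)

# nobackground=True keeps the token colors but leaves the background alone.
no_bg = highlight(
    script, lexer, HtmlFormatter(style="monokai", noclasses=True, nobackground=True)
)
```

In a notebook, the no_bg string could be passed to IPython.display.HTML to show the highlighted script without restyling the rest of the page.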
Well, that's all of our main items.
Brian, you got any extras?
Just one extra.
Actually, one of the things when I was doing the first topic with Gensim, it doesn't have very many dependencies, but one of the dependencies is this library called smart_open.
And I'm like, what?
I open things and I want to be smart about it.
So I wanted to check this out and it's pretty neat.
I don't know if we've covered this before,
but
it basically mimics the interface of open, normal Python open,
but you can pass it really anything.
And it does like a transparent on-the-fly reading of things,
efficient streaming of large files
from like S3 or Azure or over the web.
Even straight, just HTTP.
Yeah.
If you just have a link to a large file on a web server.
Yeah.
And then the code for it is just super nice.
You know, you import open from smart_open
and you've got like for line in open this thing, and
you can just work from each line there. It's pretty cool. I love it. That's a great one. It'd be nice.
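A small sketch of that pattern, assuming smart_open is installed. To keep it runnable without cloud credentials or a network connection, it reads a gzip file from a temporary directory. The file name and contents are made up.

```python
import gzip
import os
import tempfile

# Write a small compressed file with the standard library first.
path = os.path.join(tempfile.mkdtemp(), "sample.txt.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    handle.write("first line\nsecond line\n")

# Imported after the write above so the built-in open is not shadowed earlier.
from smart_open import open  # same call shape as the built-in open()

# smart_open infers gzip from the .gz extension and streams decoded text.
with open(path, "r", encoding="utf-8") as stream:
    lines = [line.rstrip("\n") for line in stream]
```

The same for-line-in-open loop is meant to work when the path is replaced by an s3:// or https:// URL, provided the relevant optional dependencies and credentials are in place.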
Ian, you got any extras you want to shout out while we're here? I don't, I'm afraid.
I have two real quick ones to just quickly talk about.
Last time, Emily Morehouse spoke about using AutoSquash, which was really cool.
So Adam, let me get the attribution correct here.
Adam Parkin sent in a follow-up to say, hey, you should check out this article over here called Fixing Commits with git commit --fixup and git rebase --autosquash.
The long and the short of it is it talks about doing a lot of things that Emily said was
pretty cool.
But in the end, setting up your Git config to autosquash equals true and then adding
an alias so you can just type git space fixup.
And when you type that, it actually does git log and shows the last 50
items and then allows you to go back and work
with those. Basically it's just a real quick way to get back into the scenario where you mark
different elements for fixup. So people can check that out if they were following Emily's advice
but they want it to be like one line.
They don't have to remember. There you go. That's cool. And then Python 3.10.3 is out as of about a
week ago, I suppose. So there are many changes amongst here. You know, I would love, there's
like so many great changes here. I don't know. How many do you think that is? Probably a hundred,
maybe a little bit less. It would be great if there was like a, these are critically important at the front.
Like there's a security problem that was fixed
or there's a thing we've taken out is no longer here.
They're kind of all the same priority.
But nonetheless, there's a bunch of changes
that people can check out
and upgrade to the newer version of Python 3.10.
Different people care about different stuff, though.
I know.
I don't want to impose my importance
on other people's importance.
So it's funny, when I first came across Python,
you'd be like, why is it so slow
between the major versions coming out?
And then suddenly, as a Python developer,
it's like, why are the versions coming out so quickly?
I can't keep up.
Yeah, it's definitely true. There's a ton of change. This is just, you know, some minor
version change that has these, all these changes in here, which is pretty cool. Well, we also used
to be on an 18 month cycle and now we're on a yearly cycle. So yeah. Yeah. It's Łukasz Langa's
fault that we are 50% faster now. Thanks, Łukasz. All right. How about a joke to close out the show? That'd be great. Yeah.
So here's a good tweet.
And it's this sort of perplexed, I think in a good way, character wearing all these,
are these prizes?
I don't know.
Anyway, Python developers, when someone asks what their secret is, and this person just
says, I just keep writing pseudocode and it just keeps working.
It's a little bit like that joke where they have some code, pseudocode in a text file.
They're like, just rename it to .py and try to run it. See what happens.
Anyway, that's the joke.
Nice.
Thank you, Brian, as always.
And Ian, thanks for being part of the show.
Thank you.
Great to have you here.
Thank you very much, both.
Been a real pleasure.
Yeah, it sure has.
See y'all.