Python Bytes - #276 Tracking cyber intruders with Jupyter and Python

Starting point is 00:00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is episode 276, recorded March 22nd, 2022. So many twos. I'm Michael Kennedy. And I'm Brian Okken. And I'm Ian Hellen. Hey, Ian.

Starting point is 00:00:16 Welcome to the show. It's great to have you here. Thank you very much. I listen to the show a lot and feel very privileged to appear on it. It's our privilege to have you here. Thank you so much for listening. And I know you got some cool stuff to share. So we're looking forward to hearing about that.

Starting point is 00:00:32 Also, I do want to say thank you to FusionAuth for sponsoring the show. I'll tell you more about them later. Before we get into the topics, Ian, tell people a quick bit about yourself. Sure. I'm a developer at Microsoft, in the Microsoft Threat Intelligence Center. I've been with Microsoft quite a long time, but only relatively recently, like four years or so ago, got into Python coding with Jupyter notebooks. So I work on Jupyter notebooks for the Microsoft Sentinel project and own a modest open

Starting point is 00:01:01 source package called MSTICPy, which we'll cover a little bit later. It takes most of my time. Fantastic. The whole cybersecurity threat detection stuff, it's very interesting. There's a lot of innovation there, but it's also a challenging area to be working in. Yep. Yep. We're never short of stuff to do. I'm sure you're not. Well, Brian, how about you kick us off here? Well, so I'm going to start off with a problem. So I had a problem and I have a cool solution for it.

Starting point is 00:01:30 So my problem is on test and code, I've got titles and I want to end a show on MP3 file. But I want to create show notes, automated show notes, or not show notes, transcripts. So one of the problems, there's a lot of problems in doing this, trying to automate it, but one of them is the title. I want to turn that into something that's a little bit, so something like, you know, it's got normal English and capitalization and all sorts of spaces and stuff. I want to turn that into a- Things that URLs hate. Yeah, I want to turn that into a URL. And one of the things is getting rid of stop words. So there's a bunch of stuff like lowercasing.

Starting point is 00:02:14 I can do that easy. But getting rid of stop words was a little hard. So I ran across this thing called gensim.parsing.preprocessing. So Gensim is a larger sort of beast. It's used for machine learning and stuff to generate models. But I'm just really using one little piece of it,

Starting point is 00:02:38 the preprocessing part. And it's really pretty cool. I actually found this article first. There was an article called Removing Stop Words from Strings in Python. And it has a discussion of NLTK and Gensim and spaCy. I tried all of them out, actually. And the one that really stuck best for me, it talked about using remove_stopwords, which is exactly what I wanted, right? From Gensim.

Starting point is 00:03:07 So I went ahead and tried that and it worked really well, but I'm like, wait, I'm pulling this in from the preprocessing library. I wonder what else is in there? And there's all sorts of really cool stuff in here. There's lower_to_unicode. It turns it both into lowercase and into Unicode. That's pretty neat. Don't think I need it, but that's neat. But then there was one that I thought maybe is exactly what I want, something called preprocess_string, and it has a whole bunch of filters built into it. Oh, nice. Like strip... Yeah, strip whitespace, strip punctuation. I love it. Yeah, and take

Starting point is 00:03:45 away multiple... After it strips punctuation, like, if I go back, I had a slash in my title for one of the episodes. If it takes that out, I'm going to have a space before and a space after, so I want to remove those. So it'll strip multiple whitespace, and it strips out numerics, because I probably don't want numbers in there. And then remove stop words. The one thing I don't want, where I'll have to customize how I'm calling this, is stem_text. So stem_text, I didn't know what that did without playing with it,

Starting point is 00:04:17 but what it does is it will take things like "twisted" and "turning" and turn them into stems like "twist". That's really not right for this, so I definitely don't want that. I don't want it to mess the title up, but I think I want everything else. So this Gensim library, if you're doing machine learning and coming up with models, I think this is a great tool to look into. But actually, I'm going to use it

Starting point is 00:04:40 just for removing stop words to create these titles for my podcast. But it feels a little weird. It feels like I'm using this really big hammer for this little tiny problem. I guess I'm okay with it, but do you have any other ideas what I could use? Well, I didn't know about this, so I wrote my own. Okay. And it's kind of janky. Like, it's a little bit recursive, iterative.
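
The filter chain discussed here (lowercase, strip punctuation, strip numerics, collapse whitespace, remove stop words, and skip stemming) can be sketched with the standard library alone. This is a minimal illustration, not Gensim's implementation: `slugify_title` and the tiny `STOP_WORDS` set are hypothetical names made up for this example, and a real stop word list would be far longer. With Gensim itself, the equivalent would be to pass a custom `filters` list to `preprocess_string` that leaves out `stem_text`.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list for illustration only;
# libraries such as Gensim and NLTK ship much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "with", "for", "on"}

# Map every ASCII punctuation character to a space, so "CI/CD" becomes
# "CI CD" rather than "CICD".
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def slugify_title(title: str) -> str:
    """Turn an episode title into a URL-friendly slug (no stemming)."""
    text = title.lower().translate(PUNCT_TO_SPACE)
    # Drop runs of digits.
    text = re.sub(r"\d+", " ", text)
    # split() with no arguments also collapses repeated whitespace.
    words = [word for word in text.split() if word not in STOP_WORDS]
    return "-".join(words)


print(slugify_title("Tracking cyber intruders with Jupyter and Python"))
# tracking-cyber-intruders-jupyter-python
```

Replacing punctuation with a space and then splitting on whitespace avoids the separate "collapse double spaces" step that a hand-written version usually needs.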

Starting point is 00:05:08 It's like, we'll take away all the punctuation now. Turn all of your white spaces into single white spaces, because there might have been, you know, dot space. So now you've got two white spaces, but you've got to take away, you know, there's like a bunch of weird steps and then put it back. This looks cleaner. It is a dependency, but it does look cleaner. I like this. I think it's, I'm glad I know about it. Ian, what do you think? Is it a huge thing? I mean, dependency, but I always think of like ML like stuff,

Starting point is 00:05:33 but this is like just the preprocessing, right? Well, I'm actually pulling in all of Gensim to get this. I don't know if I can pull in little bits, but it's not really part of my application that I'm shipping. It's just a tool that I'm using on my laptop, so I guess downloading it once doesn't really bother me too much, even if it's a big thing. Pretty cool. Yeah, that's a good point. If it's running local, it's like a dev dependency. Who cares, right? It's like worrying about how big pytest is. Like, it doesn't really matter. And I'm not, well, I kind of care about that

Starting point is 00:06:05 because CI is going to pull it in all the time for pytest. But they got fast networks. It's not your bandwidth. It'll be all right. One of the things that struck me about this that made me think of your situation is like that lower_to_unicode, and so many times in the security space,

Starting point is 00:06:24 it's about like, you're checking for this representation, but what if there's another representation that means the same thing? Like you don't say go to this directory, you say go dot, dot, and then over there, you know, those kinds of non-canonical representations. I wonder if there's any use of this kind of stuff for you. Yeah, there's something I kind of touch on

Starting point is 00:06:41 in the Pygments section later on, which is that attackers typically write scripted attacks and try to obfuscate code using a mix of upper and lower case and putting in random dots. I'm just thinking that would be potentially a nice way of cleaning some of that stuff up. Yeah, for sure. There's been some interesting supply chain vulnerability stuff.

Starting point is 00:07:00 Remember the guy with the color and I think the faker stuff in JavaScript that sabotaged his libraries. There was another one that maybe well-intentioned, I don't know. It was some open source library. I don't believe it was Python. I can't remember what it was. It could have been, but I'm pretty sure it was in JavaScript because that's where all, most of the bad stuff was, it seems. Anyway, they wrote their, they taught their dependency to erase everybody's hard drive who installed it, who was in Belarus and Russia, which, okay, maybe they're trying to contribute, but like it ended up doing a bunch of bad things, even to places that were like trying to help, say, people in the press and journalists do certain things and then like, you know,

Starting point is 00:07:46 connect with sources and then erase like that database as well. And what they did to make it so that nobody would notice in the GitHub commit before it went out to NPM was base64 encode their changes. So they basically put a base64 encoded string and then like decode and then run that.
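
As a rough illustration of the "Base64 encoded string, then decode, then run" trick described above, the sketch below scans a script for substrings that look like Base64 and tries to decode them so a reviewer can read what is hidden. It is a simplified assumption-laden example, not how MSTICPy or any particular scanner does it: the name `decode_b64_candidates` and the 16-character minimum are choices made for this example only.

```python
import base64
import binascii
import re

# Runs of 16 or more Base64 characters with optional padding. The length
# threshold is an arbitrary choice for this sketch to reduce false positives.
B64_PATTERN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_b64_candidates(script: str) -> list[str]:
    """Return the decoded text of each substring that looks like Base64."""
    decoded = []
    for match in B64_PATTERN.finditer(script):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            # Not valid Base64, or not readable text once decoded; skip it.
            continue
    return decoded


# Hypothetical obfuscated line: the payload is only ever handled as a string.
payload = base64.b64encode(b"print('wipe the disk')").decode("ascii")
suspicious_line = "run(base64.b64decode('" + payload + "'))"
print(decode_b64_candidates(suspicious_line))
```

A long ordinary identifier can also match the pattern, so results from a check like this are leads for a human to review rather than verdicts.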

Starting point is 00:08:03 And, you know, it's like that kind of stuff. I know this won't solve that problem, but, you know, that sort of category of weird representations. Yeah, you need MSTICPy for something like that. It's one of the things we... yeah, it's a common thing, Base64 decoding and de-obfuscating. Yeah, interesting. Yeah, I thought maybe using something like that with... Because one of the problems we have is every script is kind of slightly different. If you could use something like that to essentially apply sentiment analysis to a script... I mean, this is a big problem. This is not something I've particularly solved, but that might be a useful thing, just picking out certain things that indicate something malicious, like format, you know,

Starting point is 00:08:45 format drive. Exactly. Yeah, you could certainly represent, like, this one does hard drive stuff. I thought it was parsing colors. Why is it doing things with the hard drive? This is odd, you know? Or with the network, stuff like that. Cool. All right. Well, you know what you would really want to check out if you were trying to research these things? Probably documentation. So I want to tell you all about DevDocs, devdocs.io. This is pretty cool. Now, when you get there, it's interesting: on my Firefox, it's just got like the mobile view, which is really odd. If you go there with what it believes is a full browser, I guess, it's a slightly different view that's pretty similar, but not the same. So if you open it up, there's a whole bunch of programming technologies, let's say, not just Python or JavaScript or something, but there's also Vue.js, there's Werkzeug, for example, like some of the foundation of Flask,

Starting point is 00:09:36 and you can pick the particular versions and stuff. You can go and enable these different things. So maybe I care about Vue. I can go over here and enable that one. We definitely want some Python. Let me go find some Python. And it gives you all the versions. I'll take that. And let's say I'm also working with Postgres. So I'll enable that documentation. And then I might be working with Nginx

Starting point is 00:09:56 for the front end, which is somewhere right here. So you can go enable that. And then it will be up near the top somewhere here. You can see these are either the default ones or the ones that I checked on. So then you can open them up and say, I want to go and see the Nginx guide about a debugging log. And then it takes you to the documentation

Starting point is 00:10:16 for that technology. So it's like a meta documentation repository for all of these things all at once, which is pretty cool, right? So I can go up here and search. I want to know about like, let's go about like media tags or something. So you can see the stuff in HTML5. You can see the stuff in, when you say media, it looks like median. So you can see that in the statistics module for Python, some stuff for CSS, or you could come over and say, look, I just want to search for CSS.

Starting point is 00:10:48 And then you get like using media queries and how to do that kind of stuff. So what you do is you turn on the pieces that are relevant to you, and then you can search across those technologies. Cool, right? Wow. Yeah. And then if you're on the move, you can come over here and turn on offline data, and it'll download all of that as an app so that when you're at the coffee shop or you're on a plane, you now have all the

Starting point is 00:11:11 documentation for Python 3.10, Vue.js, Werkzeug, Nginx, et cetera, et cetera, that you can use, which is pretty cool. And this is something that drives me crazy about Firefox. They had it and they took it away, and I don't understand why, because I feel like Firefox is all about the web. So they took away the ability to do progressive web apps in Firefox, but all the Chromium browsers support it. So you can actually go and install this as a dedicated application on your system. So if you have no web, you just click that open. It's its own window. You can, you know, Alt-Tab, Command-Tab between it. It's super easy.

Starting point is 00:11:46 And then turn on the offline mode. And you basically have an app that has offline documentation for all the programming technologies that you care about. So this is my new coffee shop buddy. Does the search go across the things you've selected then? So if I search for like replace or something, it's the things I've selected? Yeah. So if you turn on like JavaScript and Python, it would look for that in both languages. Oh, okay. Yeah, so basically the ones you turn on. There's a

Starting point is 00:12:11 ton of them, right? And you pick that, you say these are interesting to me, and then search and stuff, from what I can tell, only applies to the technologies you say you care about. Because like if you don't use Java, you really don't want to see the documentation for Java search, right? That would be useless. Yeah. One of the things I like about this is it also has versions. So if you're using, like, an older version of Postgres, you can just enable that version. Right. Sometimes it doesn't matter very much, but other times it matters massively. Like Bootstrap 3 and Bootstrap 5, they're fully incompatible, basically. Like they're totally different keywords and grid systems, and you don't

Starting point is 00:12:45 want just the latest if you've got an old app you're working on, something like that. Python's more forgiving about that kind of stuff. It doesn't break as often. I was amused that the list, though, has like 3.9, 3.8 for Python, and it has 3.10 at the bottom, obviously because it's alphabetically sorted. How interesting. Ian, what do you think of this? That's very cool. I'm amazed. Is somebody at DevDocs kind of manually maintaining

Starting point is 00:13:12 all of the links to these, like the original source documentation? Yeah, where are they getting it from, right? I mean, because they're super disparate. It's like Matplotlib and Markdown and MariaDB. These are all... It's unlikely they're all stored in the same basic system, right? I don't know how they get them, actually. Yeah, it's very cool.

Starting point is 00:13:31 I mean, I normally have solved the same problem by having like 130 tabs open to different bits of Python docs and pandas. Exactly. Exactly. Yeah, I'm pretty sure they got pandas in here. They got NumPy as its own thing. We saw Matplotlib, there's pandas, and there's even versions of pandas across there. Single-tab solution. Brilliant. Yeah, it looks pretty good to me.

Starting point is 00:13:55 All right. Ian, want to tell us about what you got for your first item? Okay, sure. So as I mentioned earlier, I own a package called MSTICPy. And the first thing to sort out with it is the spelling, because I suffer from this on a daily basis, mistyping it, even though I've owned it for like three or four years. So it's M-S-T-I-C, standing for Microsoft Threat Intelligence Center. There's no Y or anything like that in there.

Starting point is 00:14:21 So it's a tool set for cybersecurity investigations and hunting in Python, mainly in Jupyter Notebooks. So there are a couple of questions to ask about that. So firstly, what is cybersecurity hunting and investigation and why are Jupyter Notebooks useful? So the first one, cybersec investigation is really responding to alerts or other kinds of threat intelligence and trawling through typically large amounts of security logs from cloud services, hosts, account services to determine whether this is a real threat or not. That's one of the huge problems, right? Is you've got all these different systems. How are you going to know if someone, if you don't have a tool like this, how are you going to know that someone's in there rooting around, right?

Starting point is 00:15:05 Yeah. Yeah. And there are a couple of things that usually trigger this kind of search. So one of them is an alert may be coming from your SIEM. And that stands for Security Information Event Management. So like a console, like ArcSight is a traditional one or Microsoft Sentinel is a cloud-based one. So you get an alert based on a rule and you need to go in a fairly managed process. Somebody needs to go and investigate, is this a real threat or is this just noise?

Starting point is 00:15:34 Or there might be something like SolarWinds a year ago, or Log4j, like something in the press or something from a threat intel kind of alert that says this kind of threat is around, and that's a more ad hoc process, kind of hunting, like, do we see this in our organization? So that's what MSTICPy is trying to address, the needs of that. And the second question is why Jupyter notebooks? Why would you do this in a Jupyter notebook rather than in your existing SOC tools? I mean, I think this kind of activity has a lot in common with big

Starting point is 00:16:10 science data, sorry, big data science. I mean, something like astronomy, where hunting for adversary activity is a little bit like trying to find an exoplanet in gigabytes of data, or a new quasar or something like that. Yeah. 100,000 stars or 100,000 lines of log file and you're hunting for some patterns and stuff. And you've got a few photons you're trying to determine, or these different... Something like adversary activity is a little bit like that, like millions and millions of events and you're trying to find the bad stuff. Traditional SOC tools can be really excellent, and I

Starting point is 00:16:45 work with one that I think is really good, but they all have limitations. What's a SOC tool? A SOC tool, yeah, security operations center. So something like a console that fires alerts, and you have a bunch of analysts and engineers looking at the output of this and deciding, and that's the trigger for their investigations. Is it like a failed login to SQL Server? Yes, something like that. Or it could be a more sophisticated thing, like something that looks like it's trying to access password data on this host, or has made a weird configuration change to mailbox settings. So all those things can trigger alerts and investigations. But you are limited in most

Starting point is 00:17:35 operations center environments. Notebooks allow you to break out of some of the constraints of that. So firstly, you can get data from anywhere. You're not just limited by what's in your logs. You could go to VirusTotal, so you can bring data from anywhere. You can use customized analysis. So write your own or get things from PyPI. Lots of people have written this stuff. You control the workflow, so you don't have to follow what the tool says. You can reorder things, you can backtrack, redo things, and the workflow is repeatable. So if you get a similar kind of issue again or similar kind of alert, you can fish out an old notebook and rerun the same kind of analysis. And you end up with a nice kind of shareable document

Starting point is 00:18:21 that describes your investigation, a bit like the results of a scientific investigation. It's like, here are all the steps I took and these are the results. And this is what we determined to be the bad activity. Right. The other thing that seems useful here is Jupyter. Often the notebooks will save the last bit of computed information, and then you can go change a cell and ask the question again without rerunning the whole thing. And if that's parsing tons of logs or pulling them over SSH or whatever, not doing that again is nice. Yeah. And it's brilliant if you don't like doing lots of queries in different browser tabs. Your browser crashes, they've all gone. What do you do?

Starting point is 00:19:01 Yeah, it's all in the Jupyter notebook, which is saved second by second as you do it, so you can just go back, and you can go back to things that you may have done months ago. Yeah, absolutely. Yeah. So when I started all of this, I kind of thought a lot of this stuff for cyber investigations would be available on PyPI. I thought Jupyter notebooks seemed brilliant, and there's going to be a process tree viewer and there's going to be an event timeline and all this kind of stuff.

Starting point is 00:19:26 And I found out there wasn't, at least I couldn't find it. So I decided to stop everything and start writing this stuff. So it turns out that the visualizations you need for detecting exoplanets are a bit different from the ones you need to detect bad actors. So we started building this thing, originally me, but there's now Pete Bryan and Ashwin Patil also working on it, two of my colleagues, and a bunch of people in the community. It's got four main functional sections. There's data querying, how you get data in, how you do templated queries. There's

Starting point is 00:20:03 enrichment. So for example, if you have something like an IP address, you might have a bunch of questions about it as an analyst, like which geographical location is this IP address from? Are there any malware reports about it? Third area is analysis, things like anomaly identification, like the thing you were talking about, a spike in failed logon events, unusual spike in failed logon events, that kind of thing. The final layer is visualizations, and these are more specialized. I've got a couple of examples in the show notes.
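
To make the "unusual spike in failed logon events" idea concrete, here is a minimal standard-library sketch that flags hours whose event count sits well above the daily average. It is a generic z-score check under assumed inputs, not MSTICPy's anomaly detection: the function name `find_spikes`, the default threshold, and the sample data are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def find_spikes(event_hours: list[int], threshold: float = 2.0) -> list[int]:
    """Return the hours (0-23) whose event count is more than `threshold`
    standard deviations above the mean hourly count."""
    counts = Counter(event_hours)
    # Include hours with zero events so quiet periods count toward the baseline.
    hourly = [counts.get(hour, 0) for hour in range(24)]
    average = mean(hourly)
    spread = pstdev(hourly)
    if spread == 0:
        return []  # perfectly flat activity, nothing stands out
    return [hour for hour, n in enumerate(hourly) if (n - average) / spread > threshold]


# Hypothetical data: the hour of day of each failed logon event.
failed_logons = [9, 10, 11, 14, 16] + [3] * 40
print(find_spikes(failed_logons))
# [3]
```

Real log data usually needs a baseline built from more than one day, since a single large spike inflates the standard deviation it is measured against.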

Starting point is 00:20:34 This is an anomaly identification pattern. This is one of the custom ones. We use Bokeh, which is a really nice visualization package, to allow you to view data in the way that an analyst expects to see it. So they're more this kind of visualization than more traditional graphs. I would much rather look at this than log files, or event logs, or whatever. Yeah, that's the whole thing about you

Starting point is 00:21:00 may have thousands of events, and you need to get down to the few that are the interesting thing. So one of the areas that we've tried to focus on currently, because we wrote all this stuff, and you have hundreds of functions that you could use, but it's kind of difficult to discover them. And they all, because they evolved a little bit organically, they all work in a little bit of a different way, different set of parameters. So the work we're currently doing

Starting point is 00:21:28 is trying to make this all a bit more accessible. So all of the functions that relate to say an IP address, all the questions you want to ask about it are kind of dynamically attached to a class called IP address. So they're all like things like-

Starting point is 00:21:41 Oh, interesting. So you don't have to work just with a raw string or just some raw IP representation, but you can ask it questions like its location. Well, it's not quite that intelligent. It's even a bit less intelligent than Alexa. But it's more like, you know, there might be things like geolocation of an IP address, threat intel lookups, different queries that might have IP addresses as a parameter. And previously you'd have to go and find all of these things and import them separately and run them, but now they're all dynamically attached as methods. The fact they use an IP

Starting point is 00:22:17 address as a parameter means that you just have one object to import and then you can do all of these different operations uh on this single item. There's some things that don't work with that. Some things like the visualizations, for example, they're not IP address or host or account specific. They work on big blocks of data. So the other area we're working on is try to, anything that takes a bunch of data as an input, we're writing those as pandas accessors.

Starting point is 00:22:44 So they appear as methods on a data frame. So you do kind of dataframe.mp_plot.timeline, and it would produce your timeline as long as it's the right kind of data. So, yeah, that's one of the challenges of writing this kind of thing organically: you end up with a lot of stuff, but nobody knows it's there and nobody knows how to import it. So we try to make it as accessible as possible so that it just becomes a very intuitive thing. Oh, I have an IP address. What functions can I do?
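
The discoverability idea described above, where every function that takes an IP address gets attached to one IP address class, can be sketched in a few lines. This is a hypothetical illustration of the general technique, not MSTICPy's code: the `IpAddress` class, the `attach` helper, and the two lookup functions are invented for this example, with the standard library's `ipaddress` module standing in for real geolocation or threat intel lookups.

```python
import ipaddress


class IpAddress:
    """Hypothetical entity class; lookup functions are attached at runtime."""

    def __init__(self, address: str):
        self.address = address


def is_private(entity: IpAddress) -> bool:
    """Stand-in for an enrichment lookup such as geolocation or threat intel."""
    return ipaddress.ip_address(entity.address).is_private


def ip_version(entity: IpAddress) -> int:
    """Return 4 or 6 depending on the address family."""
    return ipaddress.ip_address(entity.address).version


def attach(entity_class: type, *functions) -> None:
    """Attach each function to the class so it can be called as a method."""
    for func in functions:
        setattr(entity_class, func.__name__, func)


attach(IpAddress, is_private, ip_version)

ip = IpAddress("192.168.1.10")
print(ip.is_private(), ip.ip_version())
# True 4
```

Because the functions end up as ordinary attributes of the class, they show up in `dir(IpAddress)` and in tab completion, which is the discoverability benefit being described.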

Starting point is 00:23:11 I can do this. You know, it's all tab completable, that kind of thing. Yeah. I think it's really cool you've taken this Python data stack view of cybersecurity and threat detection. Yeah. Yeah. Brian, what do you think?

Starting point is 00:23:24 Well, it's definitely a complicated area. And trying to, one of the things I like about this story is just talking about the complexities in API design and discoverability that applies to like lots of different fields. But yeah. Yeah, it's one of those things

Starting point is 00:23:40 you should have thought about at the beginning, but even at the end, you can tidy things up. Yeah. So, famous last words. So yeah, we're definitely open for other people collaborating, contributing stuff, because there's a lot of ground to cover. Yeah, for sure. It's on GitHub, I saw. Yep. One final question before we move on: is it just for Azure, or is this a thing that more broadly works across different systems? No, I think I should have mentioned that a little bit earlier on. We recently built it for Microsoft Sentinel notebooks,

Starting point is 00:24:13 but it supports Splunk, Defender. We're working on an Elastic provider. So really, anything you can get into a Pandas data frame, you can use most of the functionality. So even if we don't have a provider ourselves, if you've got something like PySpark and you can get a data frame, then all of our functions take data frame. You know, we use pandas as our universal data interchange format. Yeah, indeed, indeed.

Starting point is 00:24:39 Kim van Wyk out in the audience likes it. It's a much nicer way to glean info from logs than complex grep. I'm right there with you. All right. Now, before we move on, Brian, let me tell you about our sponsor for this episode. This episode of Python Bytes is brought to you by FusionAuth. FusionAuth is an authentication and authorization platform built by devs for devs. It solves the problem of building essential user security

Starting point is 00:25:05 without adding risk or distracting from the primary application. FusionAuth has all the features you need with great support and a price that won't break the bank. And you can either self-host it or get the fully managed solution hosted in any AWS region. Do you have a side project that needs custom login and registration, multi-factor authentication, social logins, or user management? Download FusionAuth Community Edition for free. The best part is you get unlimited users and there's no credit card or subscription required. Learn more and get started at pythonbytes.fm/fusionauth. The link's in your show notes. Thank you to FusionAuth for supporting the show. All right. What do you got for your next one, Brian? Numbers, numbers, something every computer scientist should know. Yes. Floating point arithmetic is complicated. And so when I started working

Starting point is 00:25:56 professionally, one of the things that was recommended reading was an article called What Every Computer Scientist Should Know About Floating-Point Arithmetic. And don't worry, it's only like a really long paper with lots of math. So I am not telling you to read this, although it is an interesting read. What I would like you to read is this article by David Amos called The Right Way to Compare Floats in Python, because there's a few things that we need to know about floats when we're using them. And he covers all of this in the article without going through tons of scary math: floating point numbers have to be represented in a way that the computer can store them and use them and manipulate them, even though some numbers

Starting point is 00:26:41 are huge and won't fit normally. So we have to do things like accept that there's error and rounding. So there's a little bit of a discussion there that he talks about. One of the things that surprises people sometimes when they first come into Python, but it's not just Python, it's most languages, is somewhere there's going to be something obvious that doesn't work. Like in Andy's, or David's, example, 0.1 + 0.2 == 0.3, and that will show up as False. And this is weird, because they obviously are equal. So crazy that that doesn't work. But it's not just equals.

Starting point is 00:27:21 You can also do comparisons like, you know, less than or greater than. So not only are they not equal, 0.1 plus 0.2 is not even less than or equal to 0.3. It's weird. So what do you do? The gist of it is: don't compare things with normal math comparisons if there's floating points involved. So what you want to do instead is, and here's a little tiny bit of math, way less than the example. The thesis, the dissertation. Yeah. So there's a whole bunch of stuff built into Python to work with comparisons. And one of the most common ones, I'm trying to get there,

Starting point is 00:28:05 is math.isclose. So there's a math library with an isclose function that's used to just say, hey, I've got two values, are these close enough? And if you have to compare floats, something like this is great. And behind the scenes what it does is it's taking the two values and subtracting them and figuring out if the absolute value of the delta is below some tolerance, some reasonable tolerance, like close enough. And what that tolerance is is either a relative or absolute tolerance. And most of the time

Starting point is 00:28:46 you can kind of get away with not caring about that, but if you do care about it, you can control that. You can pass in what tolerance you expect things to be close to. I use stuff like this all the time with test equipment, because I definitely want control over the tolerance levels. Yeah, for sure. So there's math.isclose, but then, I'm not going to scroll all the way down here, but he also covers NumPy. So NumPy's got a couple of these that are really great. One of them is isclose also, but it works on arrays,

Starting point is 00:29:18 and it'll give you an array of true and false values. But you can also use allclose, which just says you've got two arrays, and if all of the pairs are close enough, it returns true. Also covered, which we use during testing a lot, is pytest.approx, which is a little bit of a different beast. But David covers that. So basically this is a semi-regular reminder to anybody using floating point math in Python, or any other language, that you should be careful with it.
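To make the comparison problem concrete, here is a minimal sketch using only the standard library's math.isclose. The variable names are illustrative, and the tolerance passed as abs_tol is an arbitrary choice for the example, not a recommendation from the article.

```python
import math

total = 0.1 + 0.2

# Plain comparisons are thrown off by binary rounding error.
equal_plain = total == 0.3        # False
less_or_equal = total <= 0.3      # False, total is slightly above 0.3

# math.isclose compares the absolute difference against a tolerance.
# By default only a relative tolerance (rel_tol=1e-09) is applied.
close_default = math.isclose(total, 0.3)

# A relative tolerance can never match exactly zero, so comparisons
# against 0.0 need an explicit absolute tolerance.
near_zero_rel = math.isclose(1e-10, 0.0)
near_zero_abs = math.isclose(1e-10, 0.0, abs_tol=1e-9)

print(equal_plain, less_or_equal, close_default, near_zero_rel, near_zero_abs)
```

The comparison against zero is the case where the default tolerance tends to surprise people, which is why abs_tol exists as a separate argument.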

Starting point is 00:29:52 Yeah, it's not a Python thing. It's just representing things that don't fit. Now there's some things, sometimes where you have to be very exact, you need to be very precise. And in those cases, Python does have the decimal and fraction types. And David covers these in the article,

Starting point is 00:30:08 which are cool. They're cool things to know about, like definitely around people using money or other very high precision. But if you're also, so those are covered, you get some sort of a hit for those. But if you really care about like the precision

Starting point is 00:30:24 and want to do things exactly right, then you probably should read that larger article because there's things that you have to do like certain operations before other operations to try to keep the error from accumulating too high. So it gets messy. I think I'm fundamentally disturbed by the idea that zero isn't zero.
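As a rough illustration of the exact types mentioned above, the sketch below uses the standard library's decimal and fractions modules. It assumes the values are built from strings or integers, since building a Decimal from a float carries the float's rounding error along with it.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal built from strings represents 0.1 and 0.2 exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_exact = decimal_sum == Decimal("0.3")

# Fraction stores an exact numerator/denominator pair.
fraction_sum = Fraction(1, 10) + Fraction(2, 10)
fraction_exact = fraction_sum == Fraction(3, 10)

# Building a Decimal from a float keeps the binary approximation,
# so it is not the same value as Decimal("0.1").
from_float_differs = Decimal(0.1) != Decimal("0.1")

print(decimal_exact, fraction_exact, from_float_differs)
```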

Starting point is 00:30:44 So my approach to floating point numbers is normally convert them to ints. Yeah, I was thinking that sometimes that is the way to do it, right? I was thinking this kind of stuff maybe applies a lot to the project that you're working on, if you're trying to come up with ratios that represent, you know, how risky something is and things like that. Yeah. I mean, certainly a lot of, yeah, I was being a bit flippant before. It's just fun. I'm very Platonic at heart, I think. So zero or one should be zero or one, not nearly one or nearly zero.

Starting point is 00:31:19 There should be a perfect square and a perfect circle. Like how can they not exist in our language? Is it really zero or negative zero? Henry in the audience, hey, Henry, also points out that pytest.approx also works on NumPy arrays as well. Nice. Which is pretty cool. Cool. You can put that all together.

Starting point is 00:31:41 All right. Let me tell you all about pypyr. I think "piper" might be the way you pronounce it. Everything needs its own description, its own little phonetic bit. So this is a simple way to create scripts that run and do stuff on your computer using Python.

Starting point is 00:31:59 And what's cool about it is it has a real simple way to define the steps. Some of those steps can be optional, but then you can also piece together things like other programming. So you can combine commands, different scripts and different languages and applications all into one sequence of events that happens on your computer. So it's basically a task runner where you define stuff in YAML. And probably the best way to see is to go check out the docs. And there's a whole bunch of docs. The docs are really nice here, actually. So for example, if you go to getting started and come down here and run your first pipeline, I really

Starting point is 00:32:35 like the way the docs here look. But the way you define it, here's a one-step one: you just say the steps, and it's all YAML, and give a step a name so you can refer to it. And then you have inputs and outputs and, you know, you do the little curly string interpolation types of things. Or you can have more complex ones like with different steps. And you can even have little comments. There's a way to put a comment in your YAML file as well. So there's also conditional. Let's see if I can find a good conditional one down here.

Starting point is 00:33:04 Here's one that goes and works with like, this one is just an echo statement and the ping command. But whatever you want to do, you can basically pass command line arguments to the YAML file or to the workflow, the pipeline. And it'll take those and feed them into the steps. So for example, when you call it, you can say like count equals one and IP equals that.

Starting point is 00:33:28 And those will become the little string interpolated pieces that go in there. So you can just combine whatever, basically whatever commands are available to the shell, be that Python or POSIX or Windows or PowerShell or whatever you're looking to do. Pretty cool, huh? That's pretty neat.

Starting point is 00:33:44 I might need this for my job of automating my show notes. Oh, yeah, there you go. If you can find this, go do that and so on. Like here's one that sort of uses the truthiness. So it says there's a bunch of different steps, and you can use the run flag. So here it says run if there's a value for A on this one. And this one says run if there's a value for B. And then there's an example where it says, okay, we run it by itself.

Starting point is 00:34:10 Those don't run. But if you pass A, then it runs that A step. If you pass B, it does the B step, or it can do both if you pass them both. And I like the simplicity of it. Like a lot of these tools like this feel like they're pretty complicated. You know, you're sort of like your example with Gensim, Brian, where you're like, is this thing too heavyweight for what I'm trying to ask it to do? You know, and this seems like a real simple thing and I don't have to learn about make or any of those kinds of things. Yeah.
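The conditional-step idea described above can be sketched in a few lines of plain Python. This is not pypyr's implementation or its API: the plan_steps function, the step dictionaries, and the "run" and "cmd" keys are all hypothetical names made up for illustration. It only plans the commands (a dry run) instead of executing them.

```python
def plan_steps(steps, context):
    """Return the rendered commands for steps whose run condition is truthy.

    Hypothetical sketch: each step is a dict with a "cmd" template and an
    optional "run" key naming a context value that must be truthy.
    """
    planned = []
    for step in steps:
        run_key = step.get("run")
        # Skip the step when its condition names a missing or falsy value.
        if run_key is not None and not context.get(run_key):
            continue
        # Curly-brace placeholders are filled in from the context.
        planned.append(step["cmd"].format(**context))
    return planned


steps = [
    {"name": "always", "cmd": "echo starting"},
    {"name": "step A", "cmd": "echo A is {a}", "run": "a"},
    {"name": "step B", "cmd": "echo B is {b}", "run": "b"},
]

no_args = plan_steps(steps, {})
only_a = plan_steps(steps, {"a": "1"})
both = plan_steps(steps, {"a": "1", "b": "2"})

print(no_args)
print(only_a)
print(both)
```

Run with no arguments, only the unconditional step is planned; passing a value for a or b switches the matching step on, which mirrors the behavior described for the run flag.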

Starting point is 00:34:35 GitHub actions or. Yeah. Yeah. Yeah. It's got a bit of a GitHub actions feel to it. That's, it seems like a nicer kind of declarative. That's pretty cool. Indeed.

Starting point is 00:34:45 Yeah, if you were not into programming or you didn't want your steps to be programming. But of course, what happens at each step, you could call a Python app or script that's going to do something complicated, right, if it needs to. But the orchestration of that, you don't have to make complicated. Is it just a command line tool or can you invoke it from Python? It might be a bit interesting. I'm sure there's a way to import it and make it do a thing. It's probably just a Python package with an entry point in this package.

Starting point is 00:35:13 So I would think so. Yeah, because it would be nice to be able to do that rather than just using subprocess to invoke a lot of things. Oh, interesting. I hadn't really thought about it as a replacement for subprocess. But yeah, because a lot of times when you're trying to orchestrate stuff, like it talks about here being part of the shell or being another app or another language, you would just use subprocess on it, right?

Starting point is 00:35:35 Yeah. Cool. Well, there it is. pypyr. pypyr.io. And people can check that out. It looks pretty interesting. Nice.

Starting point is 00:35:42 All right. Ian, you want to take us out with your final item here? Ah, Pygments. Okay, so this is a package. I mean, if you're a developer, there's a very good chance that you have been using this for years, like me, without knowing about it. You might have seen it being installed as a dependency. It's like, what is that thing? That was my thought, Ian.

Starting point is 00:35:59 I'm like, I know I see this all the time in my dependencies, and I just never really bothered to look into what it does. Yeah. So I hadn't until recently. So if you use Jupyter Notebook Markdown, you can do like three backticks and then a block of code. And you can actually put like Python or Bash or something and it will intelligently highlight it. So the thing that's doing that intelligent highlighting is Pygments. GitHub Markdown does the same kind of thing, although I'm not sure whether GitHub uses Pygments.

Starting point is 00:36:30 And if you do developer docs with Read the Docs and Sphinx, that also uses Pygments to kind of color code your code samples. And I know there's a lot of people writing kind of blog posts and stuff like that. There are quite a few services out there where you can take a chunk of code and it will intelligently highlight it and give you a JPEG or a PNG back. And that's kind of nice, but then you can't copy and paste the code from those samples. So I don't like that really. I think if you're going to put code in an article, you probably intend for people to be able to copy and paste it.

Starting point is 00:37:05 Yeah. That's the most likely thing you are to copy and paste. Yeah. Right? Because you want that code over here. Yeah. You don't want an image of you. You could use OCR to reinterpret it, but it's all there.

Starting point is 00:37:15 Yeah. And then maybe Brian's Gensim to tidy it up. So with Pygments, you can use it as a standalone package and it can do this kind of rendering, and it can render to HTML with CSS style sheets for all of the color coding. It can also render to ANSI terminal, LaTeX, a few other kind of things. So if you want a kind of nicely formatted piece of code in a document or you're doing developer docs, it's certainly kind of useful.

Starting point is 00:37:47 I mean, I came across it, well, I should say one thing: it also supports, maybe I can just switch, lots and lots of languages. So it's very simple to use. It has a highlight function, and then you import a lexer, which is the thing that understands the tokens in a language, and a formatter for the output type you want. I think there's hundreds of these things. There are a lot of languages in there. No kidding. More than half of these I've never heard of.
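The highlight, lexer, and formatter pieces described above fit together roughly as in this sketch. It assumes Pygments is installed, and the sample source string is made up for illustration. The nobackground option is a documented HtmlFormatter setting that leaves the wrapping element's background color out of the generated CSS.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer

# Illustrative snippet to highlight; any Python source string works.
source = 'def greet(name):\n    return f"Hello, {name}"\n'

# The lexer turns source text into tokens, the formatter renders them.
formatter = HtmlFormatter(nobackground=True)
html = highlight(source, PythonLexer(), formatter)

# CSS rules for the token classes, scoped to the wrapping element.
css = formatter.get_style_defs(".highlight")

print(html)
print(css)
```

Swapping PythonLexer for another lexer class, or HtmlFormatter for a terminal or LaTeX formatter, changes the input language or output format without touching the rest of the code.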

Starting point is 00:38:15 As well as things like, you'd expect, Python, it supports Python traceback, so it has a separate lexer for color coding tracebacks. All the usual languages you'd expect, but also some things like data formats, like TOML, JSON, XML. Okay.

Starting point is 00:38:32 Interesting. Like a lot of the files that we might run across, you can syntax highlight them. Yeah. And so it's very easy to use. And the reason I came across it is because I recently, so a lot of attacker code tends to be deliberately obfuscated. So it's kind of base 64 encoded, but then even once you decode it, it's kind of munged in a way to make it as unreadable as possible.

Starting point is 00:38:58 So one of the things that we try to do is pull that code back, like decode it, try to clean it, de-obfuscate it. But if you can present it in as close to the way a developer would write it as possible, it makes it much quicker for an analyst to determine what is this doing. So we use it now in MSTICPy to kind of color display things like malicious PowerShell script

Starting point is 00:39:22 or Bash or something like that. So that's how I came to actually, rather than just seeing it go past as part of a pip install, actually have to invoke it directly. So a kind of big shout out to the developers and maintainers of Pygments. It's one of those packages that probably millions of people benefit from, but very few people kind of know about it. And it's just super easy to use. They seem to be adding kind of lexers all the time. So, great.

Starting point is 00:39:51 Yeah, this is amazing. I didn't realize that it did all of this. This is way more advanced than I thought. Brian, did you know? No, I just thought it was something that magically did syntax highlighting, so I didn't have to care about it. Yeah, exactly. Cool. There's a little example in the show notes as well that I pasted. It has a dark theme. Yeah. And you probably want to include this

Starting point is 00:40:17 nobackground=True if you're using Jupyter notebooks, because if you select a theme, it just flips the whole notebook's kind of CSS theme. So that tells it just not to mess with what's in the background. Okay. Yeah, that looks great. Thanks for pointing out how useful that can be. That's cool. Like I said, I've seen it go by all the time, I just never really paid that much attention to it. It's probably a pretty minority use, but if you need it, it's great. Yeah, it's incredibly powerful. Fantastic. Well, that's all of our main items. Brian, you got any extras?

Starting point is 00:40:48 Just one extra. Actually, one of the things when I was doing the first topic with Gensim, it doesn't have very many dependencies, but one of the dependencies is this library called smart_open. And I'm like, what? I open things and I want to be smart about it. So I wanted to check this out and it's pretty neat. I don't know if we've covered this before, but it basically mimics the interface of open, normal Python open,

Starting point is 00:41:16 but you can pass it really anything. And it does like a transparent on-the-fly reading of things, efficient streaming of large files from like S3 or Azure or over the web. Even straight, just HTTP. Yeah. If you just have a link to a large file on a web server. Yeah.

Starting point is 00:41:36 And then the code for it is just super nice. You know, you import open from smart_open and you've got, like, for line in open this thing, and you can just work with each line there. It's pretty cool. I love it. That's a great one. Nice. Ian, you got any extras you want to shout out while we're here? I don't, I'm afraid. I have two real quick ones to just quickly talk about. Last time, Emily Morehouse spoke about using autosquash, which was really cool.

Starting point is 00:42:13 So Adam, let me get the attribution correct here. Adam Parkin sent in a follow-up to say, hey, you should check out this article over here called Fixing commits with git commit --fixup and git rebase --autosquash. The long and the short of it is it talks about doing a lot of the things that Emily said, which is pretty cool. But in the end, setting up your Git config so autosquash equals true and then adding an alias so you can just type git fixup. And when you type that, it actually does git log and shows the last 50 items and then allows you to go back and work

Starting point is 00:42:51 with those and basically it's just a real quick way to get back into the scenario where you mark different elements for fix up so people can check that out if they were following emily's advice but they want it to be like one line. They don't have to remember. There you go. That's cool. And then Python 3.10.3 is out as of about a week ago, I suppose. So there are many changes amongst here. You know, I would love, there's like so many great changes here. I don't know. How many do you think that is? Probably a hundred, maybe a little bit less. It would be great if there was like a, these are critically important at the front. Like there's a security problem that was fixed

Starting point is 00:43:30 or there's a thing we've taken out that is no longer here. They're kind of all the same priority. But nonetheless, there's a bunch of changes that people can check out and upgrade to the newer version of Python 3.10. Different people care about different stuff, though. I know. I don't want to impose my importance

Starting point is 00:43:47 on other people's importance. So it's funny, when I first came across Python, I'd be like, why is it so slow between the major versions coming out? And then suddenly, as a Python developer, it's like, why are the versions coming out so quickly? I can't keep up. Yeah, it's definitely true. There's a ton of change. This is just, you know, some minor

Starting point is 00:44:08 version change that has all these changes in here, which is pretty cool. Well, we also used to be on an 18 month cycle and now we're on a yearly cycle. So yeah. Yeah. It's Łukasz Langa's fault that we are 50% faster now. Thanks, Łukasz. All right. How about a joke to close out the show? That'd be great. Yeah. So here's a good tweet. And it's this sort of perplexed, I think in a good way, character wearing all these, are these prizes? I don't know. Anyway, Python developers, when someone asks what their secret is, and this person just

Starting point is 00:44:40 says, I just keep writing pseudocode and it just keeps working. It's a little bit like that joke where they have some code, pseudocode in a text file. They're like, just rename it to .py and try to run it. See what happens. Anyway, that's the joke. Nice. Thank you, Brian, as always. And Ian, thanks for being part of the show. Thank you.

Starting point is 00:44:59 Great to have you here. Thank you very much, both. Been a real pleasure. Yeah, it sure has. See y'all.

Topics covered in this episode: gensim.parsing.preprocessing DevDocs The Right Way To Compare Floats in Python Pypyr Extras Joke See the full show notes for this episode on the website at python...bytes.fm/276