Python Bytes - #261 Please re-enable spacebar heating
Episode Date: December 3, 2021
Topics covered in this episode: Rclone, check-wheel-contents, xarray, JetBrains Remote Development, The XY Problem, kerchunk - Making data access fast and invisible, Extras, Joke
See the full show notes for this episode on the website at pythonbytes.fm/261
Transcript
Hello and welcome to Python Bytes, where we deliver Python news and headlines directly
to your earbuds. This is episode 261, recorded December 2nd, 2021, and I am Brian Okken.
I'm Michael Kennedy.
I'm Chelle Gentemann.
Welcome, Chelle. Could you let us know a little bit about yourself?
Yeah, I'm a research oceanographer, so I study the sea from space. And I've been doing
oceanographic research for NASA for a little over
20 years. I do almost everything using satellite data. So I never have to leave the comfort of my
what used to be my office, but is now my office at home. That sounds so fascinating. Is it fun?
Super fun. Cool. It's like math and physics and computers all mushed together. It's like all my favorite things.
And oceans. Yeah, it's fantastic. And oceans, yeah. So that sounds like such a cool job. Welcome to the
show. Well, Michael, what do you got for us to start? Well, let's talk about Rclone. So this one was
sent in to us by Mark Pender. Now Rclone itself, I believe it's written in Rust or something. It's not Python. So
the story here is not, oh, here's a cool thing created with Python, but it is a cool library
that I think will be useful for Python developers. Okay. So this Rclone thing syncs your files
to cloud storage. Let me see if I can basically summarize it. So imagine you wanted to put some
files in AWS S3, or you wanted to store something in Azure blob storage, or there's actually 40
different places where this can go. So like Backblaze B2, Box, Citrix ShareFile, Dropbox,
Google Drive, let's see, some stuff with OpenStack, pCloud, all these different places and formats,
even just WebDAV and whatnot. So if you want to either read or write files to that location,
what you can do with Rclone here is it will basically mount those different locations as
just something on your hard drive, right? So if you want to write to S3, you can just, you know, open a file, like a with open on /s3/ wherever it goes, and then write to it with Python
or set up some kind of cron job that moves stuff. So if you're trying to move like large data for
data analysis up to the cloud, so then you can connect it to a notebook, or you're trying to
move files that are the backend of your website or your API
through S3 or somewhere, then you can just copy files over, sync different locations, like I said,
mount it as a drive. And it has a lot of cool support for things like if the file transfer
gets interrupted, it'll fall back to the last one that was working and then continue uploading. So
the connection can be kind of interrupted and unstable and whatnot. This is so cool. This is like, you know, when I
first moved to the cloud, it was so frustrating having to figure out whether I was using S3 or
the, you know, all the Google commands or the Amazon commands. And all I wanted to do was
get my data to where I could use it on the cloud. I am so with you.
And sometimes it's like, well, how do I copy files here?
Here's our API.
I don't want an API.
I want to go to the Finder or to the Windows Explorer and draggy-droppy the file.
Can I do that?
They're like, no, no, no, you can't do that.
No way.
You can run our app maybe.
Yeah, so this is in theory.
This is so cool.
Yeah, I'm glad you like it.
It's awesome. I think it'll allow people to move data around. Especially, it seems relevant to scientists
who need to put a bunch of data in the cloud and run it, but then they might want that
data locally and keep it in sync and stuff.
And it's really frustrating when your expertise is in something else.
It's not in computer science.
And like everything I pick up is because I'm forced to learn it.
And I don't want to learn the Amazon API and I don't want to learn the Google
API. This gives me maybe one tool so that I can just be cloud agnostic and
move my data around in a way that I'm already comfortable with.
Yeah, I agree.
Yeah.
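The point about mounted storage can be sketched in a few lines of Python. This is a minimal illustration, not Rclone's own API: it assumes a remote has already been mounted somewhere on disk (for example with the `rclone mount` command), and the function name `write_report` and the `reports` folder are made up for the example. A temporary directory stands in for the mount point so the sketch runs anywhere.

```python
from pathlib import Path
import tempfile


def write_report(mount_root: Path, name: str, text: str) -> Path:
    """Write text under mount_root using ordinary file I/O.

    If mount_root is a mounted cloud remote, the same code writes to the
    cloud; nothing here is specific to S3 or any other provider.
    """
    target = mount_root / "reports" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "w", encoding="utf-8") as fh:
        fh.write(text)
    return target


# Stand-in for a real mount point such as /mnt/s3 (hypothetical path).
mount_root = Path(tempfile.mkdtemp())
written = write_report(mount_root, "summary.txt", "sea surface temperature: ok\n")
print(written.read_text(encoding="utf-8"))
```

The design point is the one made in the conversation: once the storage looks like a directory, the code does not need to know which cloud sits behind it.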
Here's the thing I was looking at.
Yeah.
So the virtual backends wrap local and cloud file systems to apply encryption,
compression, chunking, hashing, and joining.
And it looks after your data.
It preserves the timestamps and verifies checksums all the time.
Transfers over limited bandwidth or intermittent connections
can be restarted. It checks the integrity of your files, all those kinds of things.
So, you know, I know you don't leave the house anymore,
but if you're out doing research, like on a boat, and
you had this rickety connection, you know, maybe you could get stuff uploaded
this way. So I think it's neat.
How do you configure it? Do you have to put in all your cloud stuff?
I suspect when you set it up, you have to give it, like, let's see, if you're on Amazon.
That's Amazon Drive. I forgot that that existed. Okay.
Let's see.
Yeah, you've got to give it like your AWS keys and stuff, of course.
But yeah, they have a whole configuration section on what you give it here to set it up.
It looks like you create a config file for it, I think.
But yeah, pretty neat.
Excellent.
Yeah, Brian, what do you think?
Well, so I'm trying to figure out, even just for a mental model, is this like a Dropbox without version control, or
is that a completely different space? Well, I mean, it does have some tie-ins to there, right?
It's got like Backblaze and things like that, which is just a pure backup system. I think it's
just trying to answer, like, how do I move files around to the cloud?
And you can also,
you can move it between clouds, right?
You can mount two places and copy from one to the other.
Like I can copy from Citrix ShareFile over to Box,
neither of which I really know how to do.
Oh, it even has Dropbox as one of the configs.
So, yeah.
But different. This is actually pretty cool.
I like it.
Yep.
Very cool. Let's see. Yeah, very cool.
Let's see.
Kim out in the live audience says,
I like this.
Very few people really need to know or care that S3 doesn't have real files and directories, for example.
And Sam says, it's funny.
My group was just talking about how to transfer
a huge amount of training data to our compute resources
earlier today.
I'm guessing that's machine learning training.
Very cool.
But you still have to go to Amazon or Google
and set up the bucket, right?
So you're not spared that particular pain.
Just like try to click public until it's public,
but not too public.
That's my approach.
You still have to do that.
But this seems like a really nice solution.
Yeah, for sure it does.
I guess over to you, Brian. Yeah, so this has been suggested several times by several listeners, so thank you
everyone that sent this in. Oh, I'm on the wrong thing, aren't I? I wanted to talk about check-wheel-contents.
So the idea around it is this. I'm often
using Flit and it kind of does all this for me, but there are other backends that you can
use for building wheels, and if you configure something wrong, you might get the wrong
stuff in there. By wrong stuff, I mean you might have like a __pycache__ in there,
or you might deliver your tests with your wheels. And, you know, that's just extra space.
You don't necessarily need that. Maybe your documentation should be there, or
maybe it shouldn't be, depending. Actually, I went on a tangent with the
documentation. I don't think this checks for that. But it's just a pip installable tool. And then you can run check-wheel-contents and you can give it a
wheel, but wheel names are often long. So when I've been trying it out, I've just been giving it
my dist directory, and it just looks at all the wheels in there and checks things. So what
does it check for, though? It's checking for things like making sure that you don't have any
.pyc or .pyo files in there, because you shouldn't have those in your wheels. It checks for duplicate files, because maybe you've got,
I don't know, copies of your directories or something. And there's actually, I don't
know, like 10, 12, 13, 14, 15 checks or something like that. I'm counting really quickly. But
one of the things I really like about this is that it covers a lot of things.
Like if you configured it totally wrong and your wheel's empty, it'll check for things like that.
And yeah, you probably could test this and try it, but it'd be nice to actually have something in your pipeline
to automatically check for these things, and it's really fast.
The other thing I like is the readme for this project has a very good description
of all the checks and why something like that could go wrong.
So if, for instance, you happen to have your tests in there, but you don't want them in
there, how do you fix that?
Or it also says, if you actually do want your tests in there, how to go about putting them
in there
so the check passes.
So interesting project.
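To make the kinds of checks described here concrete, the following is a rough sketch using only the standard library. It is not how check-wheel-contents is implemented and it does not reproduce the tool's real check list; the function `inspect_wheel`, the finding names, and the demo package `pkg` are all invented for illustration. It relies only on the fact that a wheel is a zip archive.

```python
import hashlib
import io
import zipfile
from collections import defaultdict


def inspect_wheel(wheel_bytes: bytes) -> dict:
    """Report bytecode files, duplicate files, and an empty payload."""
    findings = {"bytecode": [], "duplicates": [], "empty": False}
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
        # Everything outside the .dist-info metadata folder is the payload.
        payload = [n for n in names if ".dist-info/" not in n]
        findings["empty"] = not payload
        findings["bytecode"] = sorted(
            n for n in names if n.endswith((".pyc", ".pyo"))
        )
        by_digest = defaultdict(list)
        for name in payload:
            data = zf.read(name)
            if data:  # empty files such as __init__.py are ignored here
                by_digest[hashlib.sha256(data).hexdigest()].append(name)
        findings["duplicates"] = sorted(
            sorted(group) for group in by_digest.values() if len(group) > 1
        )
    return findings


def build_demo_wheel() -> bytes:
    """Build a tiny in-memory archive that deliberately contains problems."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("pkg/__init__.py", "")
        zf.writestr("pkg/core.py", "x = 1\n")
        zf.writestr("pkg/core_copy.py", "x = 1\n")
        zf.writestr("pkg/__pycache__/core.cpython-310.pyc", b"\x00")
        zf.writestr("pkg-1.0.dist-info/METADATA", "Name: pkg\n")
    return buffer.getvalue()


report = inspect_wheel(build_demo_wheel())
print(report)
```

For real projects the tool itself is the better choice, since its readme documents each check and how to fix or silence it.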
Yeah, this looks really neat.
I think if you're going to be creating a package, you definitely don't want to be releasing things that are not intended to be in there.
I was looking through it.
I wonder if it's possible to say, you know, check for certain files, make sure that they don't get in there. Like,
I'm thinking like a settings file that has some sort of key, like an AWS key like we were talking
about, or something. But nice. So I don't make lots of packages, so what's the wheel? When
you're using that term, like, what does that mean? It's the thing that you pip install. They used to be just tarballs.
They used to be tar.gzs and whatever.
But what we do now, for the most part, or hopefully, is wheels.
If it's just pure Python, it'll be the same for everything.
And hopefully it will be.
But it can also specify that it runs on Python 2 or 3, and some of those sorts of things can be built into the name, and what
operating system. Because if you're building on, like, say, just simplifying the world,
a couple versions of Unix and Linux, and maybe Windows and Mac,
and then also the new Mac with the different architecture,
those will all be different wheels.
So when you pip install it,
PyPI and pip will download the correct wheel for your operating system.
And that makes it so that when you're installing something,
you don't have to compile anything.
It just brings it all down.
So it's a cool format.
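The remark that the Python version and operating system are "built into the name" refers to the wheel filename convention, `{distribution}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl`. A small sketch of reading those tags follows; the helper `parse_wheel_filename` and the package name `example_pkg` are hypothetical, and real tools should prefer a maintained parser over string splitting.

```python
def parse_wheel_filename(filename: str) -> dict:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components: {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
        # Pure-Python wheels carry no ABI and run on any platform.
        "pure": abi_tag == "none" and platform_tag == "any",
    }


pure = parse_wheel_filename("example_pkg-1.0-py3-none-any.whl")
native = parse_wheel_filename("example_pkg-1.0-cp310-cp310-macosx_11_0_arm64.whl")
print(pure["pure"], native["platform"])
```

The second filename is the "new Mac with the different architecture" case: same package and version, but a separate wheel per platform tag.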
Yeah, it's especially important for the scientific community
because there's so many weird libraries
that have to get compiled with things like Fortran,
as we were joking about.
And so wheels will basically contain the pre-compiled version
so you don't have to have a Fortran compiler on your machine
to pip install it or whatever.
It just downloads and unzips really quickly without all those steps.
I was told a simple mental model of the difference between old and new is the old style with setuptools
and stuff would often have a whole bunch of stuff that you download, and then you run
setup to build some things and redo things, whereas a wheel is closer to mostly just a zip file that just unpacks things and throws it in your site-packages.
Nice.
And Sam also adds, you can also package extension modules in wheels, which is their greatest strength.
Very cool.
Cool.
All right.
Brian, is that it for the check wheel contents?
Yeah, I'm done there.
Right on.
All right, Chelle, take it away.
All right.
So I thought we would talk a little bit about weather and climate data and Python.
And we're really trying to get more Python programmers involved in weather and climate research.
And the data, I think, it used to be really hard to get weather and climate data.
It was in these really weird, obscure formats that only scientists knew how to read.
And they only wrote Fortran routines to read them.
But now with Python, it's becoming really, really easy to get these data.
So the first thing is like, where do you get the data?
So I'm just going to show the open data at Amazon, at AWS. But really, you know, Google has the equivalent in Earth Engine,
and Google has all sorts of open data sets. And that means that they're free egress. So most of
these you can get, you know, you can access data for free. And Microsoft has the Planetary Computer,
and they're building up the same thing. And like, you can see lots of people are putting data on
here. Like NASA has a Space Act agreement. There's NOAA,
which is our weather agency, and its Big Data Program. And so you can look for data.
And one of the biggest data sets that I work with is ERA5. And if you just sort of type that in here,
it brings up the data set and you can click on that and see they have it in these two different formats. So one is Zarr and one is NetCDF.
And most people in sort of data science work with, you know, SQL databases, or maybe they're using CSV files or tabular data.
So weather and climate data is a little different because it's three dimensional.
And so there's these different data formats.
And really almost all of the weather and climate data now is currently in this NetCDF format.
The goal is let's just write a Python library and make it so you don't care about the format.
Right. The data formats, the people who produce the data should care about it.
But as a user, what we want is we want anybody to be able to use it and do anything they can think of.
And so that's sort of the idea of xarray. So xarray is a Python library
that is designed for sort of
three-dimensional structured data.
And all the data has labels
and it has these things called datasets
so that it organizes your data for you.
And to read it, you just sort of say open_dataset.
Nice. And it understands these formats?
Yeah.
And like, I'm going to bring up a little example here, but this ERA5, I mean, this is like,
I think it's 35 terabytes of data.
So I took this off of the AWS.
Why did I take it off?
I ran it on AWS and I sub-sampled it.
Because where are you going to put it, right?
Like, how are you going to get a hard drive that big?
I mean, it used to be that like to get this data set,
you had to write a script
and then you would download it for like three months.
And now it's just on AWS,
which is like mind-blowing, right?
Like I log on and a few minutes later,
I actually have access to all this data,
which is so cool.
So like with xarray, I'm going to run this cell.
And basically I just import xarray as xr. To read the
data, I just say, like, open_dataset. That's it. And it figures it out, and it'll read
a lot of different formats. And it just has your data. And so this is like a really big data
set, and it tells you all about it, and you can look at the different data that it has.
And, you know, sort of the goal with this is to make it really, really easy for anybody. Like, let's say
you want to look at, you know, sales patterns in San Francisco, or you want to look at ship traffic, or
you want to look at how weather is evolving at your location. Like, you don't need to know about
the data anymore. Yeah, fantastic. Just
know how to work with NumPy-like xarray stuff in your notebook and that's all you've got to know.
Yeah. Yeah. It's all built around pandas and NumPy. And like, if you want to, like, let me
find a really easy example. Like, what if I want to plot the data set? You know, I just type .plot, right?
Oh, wow.
And then it, like, labels everything and you understand what you're looking at and what
day it is. And you can use sel and isel, just sort of like pandas.
It almost looks like an ocean just right there. It's
Yeah.
Longitude and then I guess temperature, right?
Yeah. And so this is like you just typed plot
and it actually tells you exactly what you're doing
and what it's plotting and what the color bar is.
So what do these different colors mean?
And you could do a spatial plot like this,
or you could do it in time.
Let's just pick a particular latitude and longitude.
And the nice thing is that you can actually just tell it your latitude and
longitude,
and you can use Google Maps to look up your latitude and longitude and then
plot it. And it says, oh, I'll make a time series.
That's pretty cool.
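The label-based selection described above can be tried without downloading anything. The sketch below builds a tiny synthetic dataset instead of opening ERA5, so the variable name `sst`, the grid, and the values are all made up; only `xr.Dataset`, `sel`, and `isel` are real xarray calls. It assumes xarray, NumPy, and pandas are installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a real file opened with xr.open_dataset(...).
times = pd.date_range("2021-01-01", periods=4, freq="D")
lats = np.array([30.0, 40.0, 50.0])
lons = np.array([-130.0, -120.0, -110.0])
values = np.arange(4 * 3 * 3, dtype=float).reshape(4, 3, 3)

ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"), values)},
    coords={"time": times, "lat": lats, "lon": lons},
)

# Select by label: the grid point nearest to a latitude/longitude you
# looked up (here roughly Portland), giving a time series at that point.
series = ds["sst"].sel(lat=45.5, lon=-122.7, method="nearest")

# Select by position: the first time step, giving a lat/lon map.
first_day = ds["sst"].isel(time=0)

print(series.values)
print(first_day.shape)
```

With matplotlib installed, calling `.plot()` on `series` or `first_day` draws the labeled time series or map discussed in the episode.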
Wow. Yeah.
I remember just struggling so much getting into programming and having to
work with custom file formats out of like research projects.
You're like, what do you mean I have to read this binary file? This is going to be so hard. Okay, here we go.
Yeah. And then like you wanted to read a different binary file, like start from scratch,
write all that code again. And like xarray sort of took all of the backend work that all the people
at the data archives did with like getting everything in the same format and labeling
all the data nicely. It sort
of took all that work and just said, well, we'll write one library that builds on all of that and
can read anything. Yeah. Awesome. Great recommendation. A couple of pieces of real-time
follow-up. Sam Morley out in the live stream says, xarray is great. I did an example of using it to
open a NetCDF file in my book. I'm learning about his book, Applying Math with Python: Practical
Recipes for Solving Computational Math Problems Using Python Programming and Its Libraries. That's
awesome. That looks like fun, actually. Yeah. Yeah, and it's already linked to, like, SciPy, and it has
a lot of statistics and math built into it, so you can actually compute trends in one line and all
of that. Yeah. Nice. Great. I also have
one other piece of follow-up here, Brian. I don't want to panic you all, but right here in Portland
we have Panic, the software company, and I just want to give a quick shout out to this thing called
Transmit here. This is what I actually use to get stuff up into and out of S3, and it also will
let you talk to Backblaze, Box, Dropbox, Azure, Google Drive,
all these places as well. And it's basically like an old school FTP program where on one half
it has your computer and on the other half it has whatever cloud storage you're working
with there. And maybe you could even put on the other half not just your computer, but somewhere else
as well. So if you want just a UI, not something like Rclone, but just a UI,
I'd strongly recommend this thing.
They don't sponsor the show or anything,
but I definitely love it.
I use it all the time.
Neat.
Neat.
All right.
Am I up next, actually?
I guess I am.
Yeah, I think so.
I am.
I am.
Number four would be,
I want to talk about this announcement from JetBrains,
being one of the bigger tool companies, tool builders for the Python world.
They came up with this thing called JetBrains Remote Development.
And buried at the end of this is actually what I think is the lead.
Got quite buried here, but we'll see.
So they introduced something that I was not aware of called Remote Development.
So the whole idea of this is basically what if instead of running like PyCharm, this works
for any of the IntelliJ stuff, but let's say PyCharm, instead of running PyCharm locally
on your machine, you could just give it an SSH destination, let's say, and it will go
over there and run PyCharm, the server or the sort of logic bits over there, but just
have a light front end
to your computer here.
So like a lightweight, if you're on some really wimpy laptop and you wanted to access like
a better server at work or in the cloud or in like Shell's example, near some massive
data set instead of far away from some massive data set.
So you could just directly talk to it and so on. So yeah, it's super cool. You just basically give it some SSH thing.
They also say it's good for things like, if your laptop gets stolen, what data goes with it? You know,
if you just keep the data somewhere else, right, then you just revoke the SSH key and
nothing's bad. You can also set it up so that it'll create
pre-configured environments. Like when you connect to it, it'll automatically give you something with,
like, let's say conda set up and all the right libraries pre-installed and that one weird
C thing you've got to, you know, apt install to make sure it works. Like, it starts with that just all
configured from different things. So anyway, that seems all pretty cool to me. I thought it was pretty neat. That does look neat. I think it's free if you set up your own server,
but then I think it costs money if they provide you the server, right? So kind of just like firing
up a VM for you on your behalf. All right, you're ready for the buried lead? Scroll, scroll. So here
you can see as an example, you just connect over SSH, or you can go to JetBrains Space and
they'll create one for you. Right.
But here's the buried lead. They announced this thing called JetBrains Fleet, which is, as far as I can tell, unrelated. I think it'll connect to one of these things, but it is another
thing. So if you click down at the bottom, is there something about learn more? And if you go
to this, it is a complete rewrite of the whole IDE story over at JetBrains.
And basically think VS Code, but from JetBrains.
Yeah, I'm interested in watching this.
I just heard about this last week.
And they're doing it invite only, sort of a,
not invite only, but you have to like-
Early access, get approved sort of thing.
Yeah, get approved sort of thing.
They're trying to limit, basically limit the feedback
so that they can deal with the feedback.
Yeah, so it's like super fast to open. It doesn't have a project structure in the same sense that like PyCharm or IntelliJ would. It just opens files, and it doesn't even have the IDE features unless you click this little, like, make it smarter button, and then it'll fire up all the high-end stuff that takes, you know, five seconds to start.
The other thing that's cool is you can see on the screen right here is
there's like three people typing all at the same time.
Actually,
no,
there's five people typing.
So it's like Google Docs where you can all, like, collaborate on it in
parallel, right within it.
So I think those are all super neat developments in the whole editor space,
which,
you know,
we all write a lot of code and kind of deal with these tools. Editor as a service is something that is happening, and it
is a hard thing for me to wrap my head around because my brain thinks I want all my editor
stuff locally, but there's a lot of times where you don't. So yeah. Do you like the group coding?
Yeah, I know. I think that's really neat as well. I think that would be really valuable to some people
on teams. Instead of, you know, we've all been in those screen share meetings, like, no, could you go over there? Could you type this? No, no, no, no, not after that. Inside the parentheses. It's like, please, no.
That's exactly what you're doing.
No, no, no, to the left. No, a little more to the left.
Exactly.
Wait, not a pen. Exactly. And so, yeah, let's see, a bunch of people out there really like this, RJL and Sam and so on. But Kim has an interesting comment: We've come full circle-ish, back to talking to the one mighty mainframe over a lightweight terminal, circa 1985. Or, you know, for me, like '95, and like X11, X Windows. Like, is your X Windows set up so
you can talk to the server? Yeah. Yep, I was just thinking the same thing. Yeah, definitely. But
these are interesting ideas. You know, for me personally, I love to use PyCharm for working
on projects, but if I've got just a JSON file or even a Python file and I just want to look at the
file, I probably won't open it in PyCharm because it's
going to create all this project goo that's going to
be stuck in that folder, and it's going to
complain there's no interpreter.
I just want to look at it, you know? And so tools like this, I think,
are going to be really neat. Yeah.
Yeah. And Brandon's
suggesting something crazy out
there, like mobs might run in. And,
no, mob programming, where you're working as a
group. I think it's fun. Yeah. We should play with this, though. Yeah, I think it'd be fun to see
what all the interactions feel like and stuff. So I totally agree. Yep. All right, over to you.
You know, I'm trying to remember how I came across the XY problem. I was doing some
research last week, and I think I was
down some rabbit hole of link, follow link, follow link sort of thing. And I ran across this
XY problem, and probably everybody else knows about this already, but
the concept was new to me. I don't know the XY problem. Okay. And I studied math. Come on.
Well, so it isn't really that mathy.
So the XY problem is essentially you're trying to solve problem X, and you think of a solution Y that would help solve that.
And you get down to trying to solve all the details of Y,
and you get stuck.
So you ask about Y,
but what you're really trying to do is X.
And that's sort of nebulous.
An example kind of highlights it.
So, and we've got this example in the show notes that I pulled out of one of the links,
is how do I, if somebody asks,
how do I get the last three characters of a file name?
And somebody says, oh, you just like do,
and this is a shell command.
You just do like, if it's in the variable foo,
you just do dollar curly bracket foo
and then do a colon and then negative three,
just grabs the last three characters.
But also why do you want the last three characters?
Is it because you are trying to do, uh, trying to pull off the extension?
Somebody goes, yeah, that's what I'm trying to do.
And they're like, oh, well then you don't want the last three characters.
Cause it might be a two character or a four character extension.
So teach them how to do the real problem.
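The episode's example is a shell one, but the same trap is easy to show in Python. In this small sketch the function names and file names are invented for illustration; `pathlib.Path.suffix` is the standard-library way to get the final extension.

```python
from pathlib import Path


def last_three(filename: str) -> str:
    """Answer to the question as asked (Y): the last three characters."""
    return filename[-3:]


def extension(filename: str) -> str:
    """Answer to the real problem (X): the file's final extension."""
    return Path(filename).suffix


for name in ["photo.jpg", "data.json", "main.c"]:
    print(f"{name}: last three = {last_three(name)!r}, extension = {extension(name)!r}")
```

The slice only looks right for three-letter extensions such as `jpg`; for `data.json` it returns `son`, and for `main.c` it returns `n.c`, which is why the follow-up question about intent matters.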
And I'm going to link to a couple of forum answers and stuff in there,
because I think it's interesting. There's a lot of verbiage around the XY problem that sort
of blames the asker for asking a stupid question. And I think it's important to not do that because
we do this all the time. In software, we break problems down. If I want to do A, then I need to do B and C.
But to do B, I got to do D and E.
And then also F and G.
And then way down into the rabbit hole, I get to get into the X and Y problem.
But how far back do you back up to give enough context to somebody else?
So it's hard to avoid.
You'll run into it. And then I really like there was one
forum that had some great advice, both on asking questions and on answering questions. So when
asking questions, state the problem that you're trying to solve, but also state the higher level
thing that you're trying to achieve, if appropriate. And then also how that fits into the wider design. And then it also
brought up if you've thought of other solutions that you've eliminated for some reason or another,
go ahead and list those because somebody might give you one of those as an answer and you've
already eliminated that. So give the reason why. And then I think what's most important is giving answers to
XY problems, or giving answers to problems in general. Because I think everyone that's on this
podcast and also listening is probably an expert in some fields and a novice in other fields. So
we're going to be on both sides of the fence. So when answering questions and you think,
oh, somebody is just trying to get
the extension, I'll just tell them how to do that. That's not necessarily helpful. So
there's a great three-part thing to do, and our example follows it. First, go ahead and answer the
question directly, but also ask some questions about the problem. Say, just curious, why are
you trying to do this? Is it
because you're trying to do this other thing? If so, the thing I just told you might not be
appropriate. And then once you figure out really what the real problem is, then you can help and
give the final answer. So it isn't helpful to just say, oh, you're probably getting the extension.
Go ahead and just do that.
Anyway, I thought this was an interesting thought process around answering and asking questions.
Yeah, absolutely.
It seems to be very relevant to Stack Overflow type places.
Because you're looking for help.
You say, I'm trying to do this.
But a lot of times people will give you very specific answers.
And the answer could be, well, why don't you just use this library that already understands that format?
Like Chelle mentioned earlier,
like why don't you use xarray
instead of trying to understand
how to parse this thing?
Just use that.
Oh, well, that's way better.
Thank you.
I see that a lot on Stack Overflow,
that exact.
It reminds me also of my,
like when I went to school
and you're trying to ask a question to your professor or to get help on anything, right?
You're like, this is my problem.
They're like, what really is your problem?
Please tell me about it.
And like, that's what you're asking, right?
Like, tell me what the actual problem is.
And if you can do that, clearly, you're going to get a much better answer.
Yeah, absolutely.
And a lot of people just don't. I mean, it's also just a different
perspective thing. They have the toolbox of things they know how to solve
and ways they've solved them. And a related thing is people
sometimes don't even think that there's a really simple solution out there. Like, oh, that tool you're
using, it already has a flag that does exactly what you want, but
you didn't know the flag was there. So when I started learning Python, I was so used
to Fortran 77, where there was never any help, so you just don't even try, that when I started learning
Python, it took three or four months before I finally just said, anything I want to do, someone
has done better. Yes. And they are out there. I just have to
find out how to ask the question correctly to find them. Because it's true. Like, everyone has
worked, you know, most people have tried to solve the same problem. There's someone out there who's
worked on the same problem, in all likelihood. Yeah, there's so many libraries with pip or conda
that, if you knew they existed, would do it. If only you knew they
existed. Exactly. Yeah, exactly. All right. Okay, so I guess, am I next? You are next. Okay, so what I
wanted to show is this library that is called Kerchunk. It's a great name. Yeah, brand new. So
can you see my screen? Yeah, yeah, we see this now. So we
had this problem where, like, as NOAA and NASA and everyone are starting to throw all these NetCDF
files or all these different files onto the cloud, it turned out that access on S3 was
really, really slow. And so people got really frustrated, because, like, the cloud's supposed to be fast,
right? This is going to transform science. We're going to do it better now.
That's the promise. Yeah.
That's the promise. But the grass isn't always greener. So this is this library that I think
has really maybe some broad applications. It's being developed right now. And the idea behind
it is like we have all these data formats that we're sort of stuck with.
There's lots of data, but sometimes it's slow on S3.
So is there a way that we can fix this?
And the idea is that you create a reference file system.
And so you do this by going to each of your files and just taking the data that you need for that file, like just the metadata.
So like what size is it, what its dimensions and coordinates are, what variables does it contain?
So you just take those little bits and you pull them out into a JSON file. And so then you have
this reference file that just contains the important information, but it's really small.
And so that makes it faster to access. And then you construct this JSON file
and I have some benchmark tests in here, but then you construct a mega JSON file and you basically
virtually aggregate all of your data so that in one call, again, you could just get access to
everything. And because you might not need actually the data, you might need to know,
well, what timeframe is this? So I, do I need to read in that file or not? Right. Yeah. And in some
ways, because you're doing a lot of what, one of the things with xarray, back to that other library,
is it does the lazy loading. So like this is a 16 terabyte data set that I'm loading here,
but I'm just loading the data about
the file. I'm not actually loading any data until I need to touch it. And so I can load this giant
data set in a little bit over, you know, less than two minutes by doing this virtual aggregation with
Kerchunk. And so all it's doing is it's reading these aggregated JSON files. And right now it works for three or four
different types of data sets.
So if you have big collections of data
that are going on to S3,
they have lots of different little files.
This is a way to sort of virtually aggregate them
into one big data set that you can then subset.
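The reference idea described here can be sketched with nothing but the standard library. This is a toy illustration, not Kerchunk's actual API: the helper names (`build_references`, `aggregate`, `read_chunk`), the key layout, and the tiny local files standing in for S3 objects are all assumptions made for the example. Each reference records only where a chunk lives (path, byte offset, length), the per-file references are merged into one JSON index, and data is read only when a specific chunk is requested.

```python
import json
import os
import tempfile


def build_references(path, chunk_size):
    """Record where each chunk of one file lives, without copying any data."""
    size = os.path.getsize(path)
    name = os.path.basename(path)
    refs = {}
    for i, offset in enumerate(range(0, size, chunk_size)):
        length = min(chunk_size, size - offset)
        # Hypothetical key layout: "<file name>/<chunk number>"
        refs[f"{name}/{i}"] = [path, offset, length]
    return refs


def aggregate(ref_sets):
    """Merge many small per-file reference dicts into one combined index."""
    merged = {}
    for refs in ref_sets:
        merged.update(refs)
    return merged


def read_chunk(refs, key):
    """Lazily fetch one chunk: seek to its offset and read only its bytes."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Two small local files stand in for many data files in cloud storage.
tmpdir = tempfile.mkdtemp()
paths = []
for day in ("day1", "day2"):
    p = os.path.join(tmpdir, f"{day}.bin")
    with open(p, "wb") as f:
        f.write(f"{day}-AAAA{day}-BBBB".encode())  # two 9-byte chunks
    paths.append(p)

# One small JSON document now describes every chunk in every file.
combined = aggregate(build_references(p, chunk_size=9) for p in paths)
index_json = json.dumps(combined)

# Anyone holding the JSON can locate and read a single chunk on demand.
loaded = json.loads(index_json)
chunk = read_chunk(loaded, "day2.bin/1")
```

The index stays small because it holds only locations, so it can be inspected to decide which files matter before any data bytes are read.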
Oh, that's really cool.
It seems like this is one of those
that comes as part of the fsspec project,
which we talked about pretty recently as well.
Yeah, and so this is part of fsspec
and it's Kerchunk.
It was just released
and it's a unified way to represent
compressed data formats
and it creates this virtual data set.
So that's where it's located.
Yeah, super cool.
See, Kim has a question.
Do you keep the individual JSON files with the data?
You can.
So the nice thing about this, the data can be anywhere.
And again, this is the idea to make data invisible and easy to access so that you don't have to care what format it's in or where it's at.
You can, as long as they make the little, you can either create them yourself and just keep the little JSON files public. And then you just make the one aggregated
JSON file public. And then anybody could actually use that JSON file to access the data this way.
Yeah. Fantastic. This looks really helpful for working with large data.
Yeah. Yeah. I think it's cool.
Yeah. It looks awesome. All right. Brian, does that bring us to the extras? Yeah, I guess it does. How many, how many extras you got today? I just have one
entertaining extra. I thought, um, as, uh, some people have amusingly noticed, um, I am attempting
to grow my hair out. Um, and I went to Florida last week and it's very humid in Florida and I looked like a cotton swab.
It just like poofed.
Anyway, that's it was amusing to me.
But you should have sent us some pictures or something.
Yeah.
I mean, those are the pictures you don't really want out there.
But yeah.
Yeah.
So I wish I could have seen like that because I was I was at Disney World and we're doing like rides and stuff.
And I really wish I could have seen like the the flowing hair in the on the roller coaster or something like that so
perfect i love the hair nice nice uh uh let's see what Shel's got first okay so what are
extras just something that we did last week well just whatever you want to also just give a shout
out to uh we're here before we call it uh i think i'm pretty good
i'm really excited like uh nasa starting a big transformation to open science which is exciting
um they started a new they announced just last month a new 40 million dollar initiative to try
and help scientists move to open practices and python's a big part of that because and a lot of
this was the open community that Python helped develop over the last
decade and all of the tools that now is making, it's not just science easier, it's making it
easier for more people to participate in science. I think there's a lot of synergies and similarities
between the scientific goal of spreading knowledge and publishing your work and so on and open source.
Yeah. Because it used to be like
scientists, like you would share your knowledge, right? You'd publish paper and that was it. And
if you like, that's what graduate, like I remember in graduate school, you would go through and
they'd be like, okay, derive the equations in this paper. Cause they wouldn't show you all the steps
and you would do that. And then if you wanted to code it up, you would just open up a new window
and start coding.
And now, you know, people are starting to publish their code so that you can actually reproduce their results and then build on them and move faster.
The whole reproducible science thing as well. Fantastic.
Yeah.
Awesome. Sam in the audience says, yes, more open reproducible science is great for everyone.
Yeah.
All right. I got some extras as well, as you can imagine.
Surprise.
I don't remember when I was going on.
Maybe this was actually in TalkPython, but I was going on and on that Visual Basic 6, I want to drag a few things on the screen and write a little bit of code,
made it so much easier for people to build apps.
Robert Livingston out there said you know what, Xojo, X-O-J-O, or "zojo", i
don't know is this replacement thing so if you're trying to build some desktop apps and you want to
do a bunch of draggy droppy stuff boy if it worked with python or somebody could build a python
integrated thing behind those events there i would love to try to work on some integration between
those things but uh currently no there's a little demo where in like six minutes, seven minutes, they build
a web browser, which is kind of neat.
So very visual, basic feeling.
So is it Python?
It's not Python.
No, it's not Python.
It's more of VB6 feeling.
I don't know if it's actually VB6, which is even worse.
It's sort of kind of, but not exactly.
I just did a webcast 10 reasons you
love PyCharm even more in 2021 with JetBrains and uh Paul Everitt we just did five reasons so
i'll link to that people care about that and then who doesn't love a little good uh tech shock and
awe and um being um i don't know outrage i guess is the word I'm looking for. So Microsoft Edge is this browser that's sort of Chrome-based
and they just announced like a Linux version
and it runs on macOS, which all these things surprised me.
And there was getting a lot of traction
and there's this whole thing where Microsoft,
the team at Edge just added like a buy now, pay later thing
built into the browser from some third party company
not as an extension but like integrated into the browser that you can't not get when you go
shopping it says would you like to use this like for payment program it's almost like adding like
payday loans like baked into the browser it's insane that's so there's i know it's such a bad
idea so there's an Ars Technica article.
It says users revolt as Microsoft bolts on short-term financing app into Edge. That's like
30% borrowing. And one of the quotes is this all feels extremely unnecessary for a browsing
experience. And the comments, you go to the comments, there's 256 comments, which is an awesome number of comments for the moment. But there's just almost nothing but, like, why? This is unbelievable to me. I can't believe this. It just makes it feel so shady and trashy, right? Like the next thing you're gonna do is get bail bonds offerings inside your browser. Just weird stuff. So anyway, I thought people might enjoy just, uh, reading
through this and uh taking a little bit of that in it it must work right because we all have this
experience where you i mean there's been this has been going on for 20 years like with their
browser remember it used to install all this stuff on your machine you have
to delete it all and then that was ruled illegal so they had to take it they had to separate them
out and they just keep finding ways to get back in yeah there's some really interesting stuff you
know they're um they're now sort of putting ads in the start menu and stuff and then the ads are
forced to open in edge not your default browser it's just like there's layers of like really like why are you doing it?
It makes me happy that I'm not using Windows 11 at the moment.
Whereas I've been actually looking forward to using say like the new terminal and Oh
My Posh shell on Windows and stuff, which looks amazing.
So I think there's this sort of like different groups.
So this is definitely a different group than say the VS Code group of people.
This is again going to take us back to 1995 and we're just going to be using a terminal window to access anything so we don't
get annoyed by all of that. There's no ads in the Linux browser. There's no ads in the Linux browser.
Yeah, exactly. Now, if they could just get the ad companies to be able to just collect your credit
card information and then instead of showing you the ad just buy it for you
and stay up on a payment plan that was just shared like we already know who you are just
click here if you want it okay great or just send it to you anyway and just charge you later
exactly so i feel like this almost could be the joke but i've got a different joke for you okay
all right so the joke for this week comes from a solid source, XKCD, as you may know.
And this is about workflows and changing software.
So here's the one that says workflow, and it's just in the change log or some sort of
conversation flow, maybe a GitHub release or something.
It says, changes in version 10.17.
The CPU no longer overheats when you hold down the space bar. And then there's a
frustrated user comment. It says LongtimeUser4 writes, this update broke my workflow. My
control key is hard to reach. So I hold the space bar instead. And I've configured Emacs to interpret
a rapid temperature rise as pressing control. The admin writes, that's horrifying. The user
writes, look, my setup works for me just add an
option to re-enable spacebar heating oh i remember like enabling all the weird emacs things that only
you would know about exactly exactly and the subtitle is every change breaks someone's workflow
i love it yeah um actually and i it's interesting because python's
even like more so like that because of the introspection and everything's really open
unless you really work hard to make it i mean you can't really hide too much stuff with python
so somebody even if you tell people even if you have a comment around a function or an access
point to say um this is not part of the API.
This is subject to change.
You can change it and it will break somebody.
Because somebody has reached inside and used the thing you told them not to use.
Yep.
Those double underscores and single underscores,
they're just there to slow you down.
That's just there so you notice what you're not supposed to do.
Those are where the interesting parts are.
Exactly.
They wouldn't give me the feature, but I can just do it right here.
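The point about underscores can be shown with a short sketch. The `Thermostat` class and its attribute names are hypothetical, made up for this illustration. A single leading underscore is only a convention, and a double leading underscore triggers name mangling, which renames the attribute but does not hide it from anyone who looks.

```python
class Thermostat:
    """Hypothetical class used only to illustrate Python's naming conventions."""

    def __init__(self):
        self.target = 20          # public attribute
        self._calibration = 1.5   # single underscore: "internal, subject to change"
        self.__raw = 42           # double underscore: stored as _Thermostat__raw


t = Thermostat()

# Nothing stops a caller from reading the "internal" attribute.
calibration = t._calibration

# Name mangling only renames the attribute; it is still reachable.
raw = t._Thermostat__raw

# Introspection lists everything, including the mangled name.
attribute_names = sorted(vars(t))

# The unmangled spelling does not exist when accessed from outside the class.
has_plain_name = hasattr(t, "__raw")
```

Because every attribute remains reachable, a later rename of `_calibration` can still break a user who depended on it, which is the situation described in the conversation.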
Awesome.
All right.
Well, I think that's it, Brian.
Yeah.
It was a good episode.
So thanks, everybody, for showing up.
Yeah.
Thanks, everyone.
Yeah.
Thanks, Shel, for being here.
Great to have you on the show.
Thanks, Michael.
Thanks, Brian.
Take care.
Bye, everyone.