Python Bytes - #261 Please re-enable spacebar heating

Episode Date: December 3, 2021

Topics covered in this episode: Rclone; check-wheel-contents; xarray; JetBrains Remote Development; The XY Problem; kerchunk - Making data access fast and invisible; Extras; Joke. See the full show notes for this episode on the website at pythonbytes.fm/261

Transcript
Starting point is 00:00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is episode 261, recorded December 2nd, 2021, and I am Brian Okken. I'm Michael Kennedy. I'm Chelle Gentemann. Welcome, Chelle. Could you let us know a little bit about yourself? Yeah, I'm a research oceanographer, so I study the sea from space. And I've been doing oceanographic research for NASA for a little over 20 years. I do almost everything using satellite data. So I never have to leave the comfort of my
Starting point is 00:00:31 used to be office, but now office at home. That sounds so fascinating. Is it fun? Super fun. Cool. It's like math and physics and computers all mushed together. It's like all my favorite things. And oceans. Yeah, it's fantastic. And oceans, yeah. So that sounds like such a cool job. Welcome to the show. Well, Michael, what do you got for us to start? Well, let's talk about Rclone. So this one was sent in to us by Mark Pender. Now, Rclone itself, I believe it's written in Rust or something. It's not Python. So the story here is not, oh, here's a cool thing created with Python, but it is a cool library that I think will be useful for Python developers. Okay. So this Rclone thing syncs your files to cloud storage. Let me basically see if I can summarize it. So imagine you wanted to put some
Starting point is 00:01:25 files in AWS S3, or you wanted to store something in Azure Blob Storage, or there's actually 40 different places where this can go. So like Backblaze backup, Box, Citrix ShareFile, Dropbox, Google Drive, let's see, some stuff with OpenStack, pCloud, all these different places and formats, even just WebDAV and whatnot. So if you want to either read or write files to that location, what you can do with Rclone here is it will basically mount those different locations as just something on your hard drive, right? So if you want to write to S3, you can just, you know, do a file, like a with open slash S3 slash wherever it goes, and then write to it with Python or set up some kind of cron job that moves stuff. So if you're trying to move like large data for data analysis up to the cloud, so then you can connect it to a notebook, or you're trying to
Starting point is 00:02:21 move files that are the backend of your website or your API through S3 or somewhere, then you can just copy files over, sync different locations, like I said, mount it as a drive. And it has a lot of cool support for things like if the file transfer gets interrupted, it'll fall back to the last one that was working and then continue uploading. So it can be kind of interrupted and unstable and whatnot. This is so cool. This is like, you know, when I first moved to the cloud, it was so frustrating having to figure out whether I was using S3 or the, you know, all the Google commands or the Amazon commands. And all I wanted to do, get my data to where I could use it on the cloud. I am so with you.
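A minimal sketch of the "just write to it with Python" idea described above. It assumes a remote has already been mounted at a local directory with `rclone mount`; the mount path, the file name, and the helper `save_results` are hypothetical, and a temporary directory stands in for the mount so the snippet runs anywhere.

```python
import tempfile
from pathlib import Path


def save_results(mount_point: Path, name: str, text: str) -> Path:
    """Write text under mount_point using ordinary file APIs.

    Nothing here is cloud-specific: if mount_point is a mounted remote,
    the mount is what moves the bytes to the storage provider.
    """
    target = mount_point / "results" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "w", encoding="utf-8") as handle:
        handle.write(text)
    return target


# Stand-in for a real mount such as /mnt/s3 (a hypothetical path).
with tempfile.TemporaryDirectory() as fake_mount:
    written = save_results(Path(fake_mount), "run-001.csv", "lat,lon,sst\n")
    print(written.read_text(encoding="utf-8"))
```

The point of the design is that the Python code never learns which provider sits behind the directory, which is the cloud-agnostic workflow the hosts are describing.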
Starting point is 00:03:05 And sometimes it's like, well, how do I copy files here? Here's our API. I don't want an API. I want to go to the Finder or to the Windows Explorer and draggy-droppy the file. Can I do that? They're like, no, no, no, you can't do that. No way. You can run our app maybe.
Starting point is 00:03:19 Yeah, so this is in theory. This is so cool. Yeah, I'm glad you like it. It's awesome. I think it'll allow people to move data around, especially, it seems relevant to scientists who need to put a bunch of data in the cloud and run it, but then they might want that data locally and keep it in sync and stuff. And it's really frustrating when your expertise is in something else. It's not in computer science.
Starting point is 00:03:40 And like everything I pick up is because I'm only forced to learn it. And I don't want to learn the Amazon API and I don't want to learn the Google API. This like gives me maybe one tool so that I can just be cloud agnostic and move my data around in a way that I'm already comfortable with. Yeah, I agree. Yeah. Here's the, um, uh, the thing I was looking at. Yeah.
Starting point is 00:03:57 So the virtual backends wrap local and cloud file systems to apply encryption, compression, chunking, hashing, and joining. And it looks after your data. It preserves the timestamps, verifies checksums all the time, transfers over limited bandwidth and intermittent connections. It can be restarted, checks the integrity of your files, all those kind of things. So, you know, like if you're out, I know you don't leave the house anymore, but if you're out doing research and like on a boat and you needed,
Starting point is 00:04:24 you had this rickety connection, you know, maybe you could get stuff uploaded this way. So I think it's neat. How do you, like, configure it? Do you have to put in all your cloud stuff? I suspect when you set it up, you have to give it, like, let's see, if you're Amazon... That's Amazon Drive. I forgot that that existed. Okay. Let's see. Yeah, you've got to give it like your AWS keys and stuff, of course. But yeah, they have a whole configuration section on what you give it here to set it up.
Starting point is 00:04:58 It looks like you create a config file for it, I think. But yeah, pretty neat. Excellent. Yeah, Brian, what do you think? Well, so I'm trying to figure out, like, even just for a mental model, is this like a Dropbox without version control, or is that a completely different space? Well, I mean, it does have some tie-ins to there, right? It's got like Backblaze and things like that, which is just a pure backup system. I think it's just trying to match, like, how do I move files around to the cloud?
Starting point is 00:05:26 And you can also, you can move it between the cloud, right? You can mount two places and copy from one to the other. Like I can copy from Citrix ShareFile over to Box, neither of which I really know how to do. Oh, it even has Dropbox as one of the configs. So, yeah. But different. This is actually pretty cool.
Starting point is 00:05:43 I like it. Yep. Very cool. Let's see. Yeah, very cool. Let's see. Kim out in the live audience says, I like this. Very few people really need to know or care that S3 doesn't have real files and directories, for example. And Sam says, it's funny.
Starting point is 00:05:56 My group was just talking about how to transfer a huge amount of training data to our compute resources earlier today. I'm guessing that's machine learning training. Very cool. Well, you still have to go to Amazon or Google and set up the bucket, right? So you're not spared that particular pain.
Starting point is 00:06:12 Just like try to click public until it's public, but not too public. That's my approach. You still have to do that. But this seems like a really nice solution. Yeah, for sure it does. I guess over to you, Brian. Yeah, um, so this has been suggested several times by several listeners, so thank you, everyone that sent this in. Um, oh, I'm on the wrong thing, aren't I? I wanted to talk about check wheels.
Starting point is 00:06:38 Um, so check wheels is a, uh, or check-wheel-contents. So, um, the idea around it is that, so I'm often using Flit and it kind of does all this for me, but there are other backends that you can use for building wheels, and if you configure something wrong, it might get the wrong stuff in there. So by wrong stuff, you might have like a __pycache__ in there, or you might deliver your tests with your wheels. And, you know, that's just extra space, you don't necessarily need that. Maybe your documentation should be there, but maybe it shouldn't be, depending on that. I don't think that, actually, I went on a tangent with the documentation, I don't think this checks for that. But so it's just a pip-installable tool. And then you can run check-wheel-contents and you can give it a
Starting point is 00:07:28 wheel, but wheel names are often long. So when I've been trying it out, I've been just giving it my dist directory and it just looks at all the wheels in there and checks things. So what does it check for though? So it's checking for things like making sure that you don't have any PYC or PYO files in there, because you shouldn't have those in your wheels. It checks for duplicate files, because maybe you've got, I don't know, copies of your directories or something. Um, and there's actually, I don't know, like 10, 12, 13, 14, 15 checks or something like that. I'm counting really quickly. Um, but one of the things I like about this is there's a lot of things that, like if you configured it totally wrong and your wheel's empty, it'll check for things like that.
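To make the three checks mentioned above concrete (compiled files, duplicate files, an empty wheel), here is a rough sketch using only the standard library. It is not how check-wheel-contents is implemented and does not use its API; the function name `inspect_wheel`, the messages, and the demo wheel are all made up for illustration. It relies only on the fact that a wheel is a zip archive.

```python
import hashlib
import tempfile
import zipfile
from collections import defaultdict
from pathlib import Path


def inspect_wheel(wheel_path: Path) -> list[str]:
    """Return human-readable problems found in a wheel (a zip archive)."""
    problems = []
    by_digest = defaultdict(list)
    with zipfile.ZipFile(wheel_path) as archive:
        names = [n for n in archive.namelist() if not n.endswith("/")]
        # Metadata alone does not count as package content.
        payload = [n for n in names if ".dist-info/" not in n]
        if not payload:
            problems.append("wheel contains no files outside .dist-info")
        for name in payload:
            if name.endswith((".pyc", ".pyo")):
                problems.append(f"compiled file: {name}")
            data = archive.read(name)
            if data:  # empty files such as __init__.py are not duplicates
                by_digest[hashlib.sha256(data).hexdigest()].append(name)
    for group in by_digest.values():
        if len(group) > 1:
            problems.append("duplicate content: " + ", ".join(sorted(group)))
    return problems


# Build a deliberately messy, hypothetical wheel and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    wheel = Path(tmp) / "demo-0.1-py3-none-any.whl"
    with zipfile.ZipFile(wheel, "w") as archive:
        archive.writestr("demo/__init__.py", "")
        archive.writestr("demo/core.py", "VALUE = 1\n")
        archive.writestr("demo/core_copy.py", "VALUE = 1\n")
        archive.writestr("demo/__pycache__/core.cpython-310.pyc", b"\x00")
        archive.writestr("demo-0.1.dist-info/METADATA", "Name: demo\n")
    for problem in inspect_wheel(wheel):
        print(problem)
```

In practice the real tool is the better choice for a build pipeline; the sketch only shows why such checks are cheap to run on every build.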
Starting point is 00:08:11 And yeah, you probably could test this and try it, but it'd be nice to actually have something in your pipeline to automatically check for these things, and it's really fast. The other thing I like is the readme for this project has a very good description of all the checks and why something like that could go wrong. So if, for instance, you happen to have your tests in there, but you don't want them in there, how do you fix that? Or it also says if you actually do want your tests in there, how to go about putting it in there.
Starting point is 00:08:44 So the check passes. So interesting project. Yeah, this looks really neat. I think if you're going to be creating a package, you definitely don't want to be releasing things that are not intended to be in there. I was looking through it. I wonder if it's possible to say, you know, check for certain files, make sure that they don't get in there. Like I'm thinking like a settings file that has some sort of key, like an AWS key like we were talking about or something. But nice. Can you, so I don't make lots of packages, so what's the wheel? When
Starting point is 00:09:17 you're using that term, like, what does that mean? Um, it's the thing that you pip install. They used to be just tarballs. They used to be tar.gzs and whatever. But what we do now, for the most part, or hopefully, is wheels. If it's just pure Python, it'll be the same for everything. And hopefully it will be. But it can also specify that it runs on Python 2 or 3, and some of those sorts of things can be built into the name, and what operating system. Because if you're building on, like, say, just simplifying the world, a couple versions of Unix, um, and Linux, and maybe Windows and Mac,
Starting point is 00:10:08 and then also the new Mac with the different architecture, those will all be different wheels. But when you, so when you pip install it, PyPI and pip will download the correct wheel for your operating system. And that makes it so that when you're installing something, none of, you don't have to compile anything. It just brings it all down. So it's a cool format.
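The "built into the name" point can be illustrated by splitting a wheel filename into its tags. The layout used here (distribution, version, optional build tag, Python tag, ABI tag, platform tag, separated by hyphens) is the standard wheel naming convention; the parser `parse_wheel_filename` and the two example filenames are hypothetical, and real installers use a more careful implementation than this sketch.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into the tags an installer matches against."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        distribution, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        distribution, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return WheelName(distribution, version, build, python_tag, abi_tag, platform_tag)


# Hypothetical filenames: one pure-Python wheel, one compiled for Apple silicon.
for name in (
    "example_pkg-1.0.0-py3-none-any.whl",
    "example_pkg-1.0.0-cp310-cp310-macosx_11_0_arm64.whl",
):
    print(parse_wheel_filename(name))
```

A pure-Python wheel carries the generic `none` and `any` tags, while a compiled one names the interpreter, ABI, and platform, which is how the right file gets picked for each machine.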
Starting point is 00:10:27 Yeah, it's especially important for the scientific community because there's so many weird libraries that have to get compiled with things like Fortran, as we were joking about. And so wheels will basically contain the pre-compiled version so you don't have to have a Fortran compiler on your machine to pip install it or whatever. It just downloads and unzips really quickly without all those steps.
Starting point is 00:10:47 I was told a simple mental model of the difference of old and new is the old style with setuptools and stuff would often have a whole bunch of stuff that you download, and then you run setup to build some things and redo things, whereas a wheel is closer to mostly just a zip file that just unpacks things and throws it in your site-packages. Nice. And Sam also adds, you can also package extension modules in wheels, which is their greatest strength. Very cool. Cool. All right.
Starting point is 00:11:20 Brian, is that it for the check wheel contents? Yeah, I'm done there. Right on. All right, Chelle, take it away. All right. So I thought we would talk a little bit about weather and climate data and Python. And we're really trying to get more Python programmers involved in weather and climate research. And the data, I think, it used to be really hard to get weather and climate data.
Starting point is 00:11:44 It was in these really weird, obscure formats that only scientists knew how to read. And they only wrote Fortran routines to read them. But now with Python, it's becoming really, really easy to get these data. So the first thing is like, where do you get the data? So I'm just going to show the open data at Amazon, at AWS. But really, you know, Google has the equivalent in Earth Engine, and Google has all sorts of open data sets. And that means that they're free egress. So most of these you can get, you know, you can access data for free. And Microsoft has the Planetary Computer,
Starting point is 00:12:17 and they're building up the same thing. And like, you can see lots of people are putting data on here. Like NASA has a Space Act agreement. There's NOAA, which is our weather agency, with the Big Data Program. And so like you can look for data. And one of the biggest data sets that I work with is ERA5. And if you just sort of type it in here, it brings up the data set and you can click on that and see they have it in these two different formats. So one is Zarr and one is NetCDF. And most people in sort of data science work with, you know, SQL databases, or maybe they're doing CSV files or tabular data. So weather and climate data is a little different because it's three dimensional. And so there's these different data formats.
Starting point is 00:13:02 And really almost all of the weather and climate data now is currently in this NetCDF format. The goal is let's just write a Python library and make it so you don't care about the format. Right. The data formats, the people who produce the data should care about it. But as a user, what we want is we want anybody to be able to use it and do anything they can think of. And so that's the sort of xarray. So xarray is a Python library that is designed for sort of three-dimensional structured data. And all the data has labels
Starting point is 00:13:33 and it has these things called data sets so that it organizes your data for you. And to read it, you just sort of say open_dataset. Nice. And it understands these formats? Yeah. And like, I'm going to bring up a little example here, but this ERA5, I mean, this is like, I think it's 35 terabytes of data. So I took this off of the AWS.
Starting point is 00:13:57 Why did I take it off? I ran it on AWS and I sub-sampled it. Because where are you going to put it, right? Like, how are you going to get a hard drive that big? I mean, it used to be that like to get this data set, you had to write a script and then you would download it for like three months. And now it's just on AWS,
Starting point is 00:14:13 which is like mind-blowing, right? Like I log on and a few minutes later, I actually have access to all this data, which is so cool. So like with xarray, I'm going to run this cell. And basically I just import xarray as xr. To read the data, I just say, like, open_dataset. That's it. And it figures it out, and it'll read a lot of different formats, and it just has your data. And so this is like a really big data
Starting point is 00:14:40 set and it tells you all about it and you can look at the different data that it has. And, you know, sort of the goal with this is to make it really, really easy for anybody. Like, let's say you want to look at, you know, sales patterns in San Francisco, or you want to look at ship traffic, or you want to look at how weather is evolving at your location. Like, you don't need to know about the data anymore. Yeah, fantastic. Just know how to work with NumPy-like xarray stuff in your notebook and that's all you got to know. Yeah. Yeah. It's all built around pandas and NumPy. And like, if you want to, like, let me find a really easy example. Like what if I want to plot the data set? You know, I just type dot plot, right?
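A small sketch of the labeled, three-dimensional data model being described. It does not touch the real ERA5 archive: the values are random numbers, and the variable name, coordinates, and dates are invented so the example runs offline. With a real NetCDF file, `xr.open_dataset` on the file path would take the place of the hand-built dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a weather data set: (time, lat, lon) temperatures.
times = pd.date_range("2021-01-01", periods=4, freq="D")
lats = np.array([30.0, 35.0, 40.0])
lons = np.array([230.0, 235.0, 240.0])
rng = np.random.default_rng(0)
temps = 285.0 + rng.normal(size=(times.size, lats.size, lons.size))

ds = xr.Dataset(
    {"air_temperature": (("time", "lat", "lon"), temps)},
    coords={"time": times, "lat": lats, "lon": lons},
)
ds["air_temperature"].attrs["units"] = "K"

# Printing the dataset lists its dimensions, coordinates, and variables.
print(ds)

# Positional selection: the first time step, as a 2-D (lat, lon) field.
first_day = ds["air_temperature"].isel(time=0)

# Label-based selection: the grid point nearest a latitude/longitude,
# which leaves a 1-D time series.
series = ds["air_temperature"].sel(lat=37.8, lon=237.6, method="nearest")
print(series.values)
```

Calling `.plot()` on `first_day` or `series` would draw a labeled map or time series, provided matplotlib is installed.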
Starting point is 00:15:27 Oh, wow. And then it like labels everything and you understand what you're looking at and what day it is. And you can use sel and isel and just sort of, like pandas. It almost looks like an ocean just right there. Yeah. Longitude and then I guess temperature, right? Yeah. And so this is like you just typed plot and it actually tells you exactly what you're doing
Starting point is 00:15:48 and what it's plotting and what the color bar is. So what do these different colors mean? And you could do a spatial plot like this where you do it in time, or let's just pick a particular latitude and longitude. And the nice thing is that you can actually just tell it your latitude and longitude, and you can use Google Maps to look up your latitude and longitude and then
Starting point is 00:16:10 plot it. And it says, oh, I'll make a time series. That's pretty cool. Wow. Yeah. I remember just struggling so much getting into programming and having to work with custom file formats out of like research projects. You're like, what do you mean I have to read this binary file? This is going to be so hard. Okay, here we go. Yeah. And then like you wanted to read a different binary file, like start from scratch, write all that code again. And like xarray sort of took all of the backend work that all the people
Starting point is 00:16:39 at the data archives did with like getting everything in the same format and labeling all the data nicely. It sort of took all that work and just said, well, we'll write one library that builds on all of that and can read anything. Yeah. Awesome. Great recommendation. A couple of pieces of real-time follow-up. Sam Morley out in the live stream says, xarray is great. I did an example of using it to open a NetCDF file in my book. I'm learning about his book, Applying Math with Python: Practical recipes for solving computational math problems using Python programming and its libraries. That's awesome. That looks like fun, actually. Yeah. Yeah. And it's already linked to, like, SciPy, and it has all
Starting point is 00:17:17 a lot of statistics and math built into it, so you can actually compute trends in one line and all of that. Yeah. Nice. Great. I also have one other piece of follow-up here. Brian, I don't want to panic you all, but, um, right here in Portland we have Panic, the software company, and I just want to give a quick shout out to this thing called Transmit here. This is what I actually use to get stuff up into and out of S3, and it also will let you talk to Backblaze, Box, Dropbox, Azure, Google Drive, all these places as well. And it's basically like an old school FTP program where like on one half it has your computer and the other half it has whatever cloud storage it is that you're working
Starting point is 00:17:58 with there. And maybe you could even put the other half, not just your computer, but somewhere else as well. So if you want just like a UI, not something like Rclone, but just a UI, I'd strongly recommend this thing. They don't sponsor the show or anything, but I definitely love it. I use it all the time. Neat. Neat.
Starting point is 00:18:12 All right. Am I up next, actually? I guess I am. Yeah, I think so. I am. I am. Number four would be, I want to talk about this announcement from JetBrains,
Starting point is 00:18:23 being one of the bigger tool companies, tool builders for the Python world. They came up with this thing called JetBrains Remote Development. And buried at the end of this is actually what I think is the lead. Got quite buried here, but we'll see. So they introduced something that I was not aware of called Remote Development. So the whole idea of this is basically what if instead of running like PyCharm, this works for any of the IntelliJ stuff, but let's say PyCharm, instead of running PyCharm locally on your machine, you could just give it an SSH destination, let's say, and it will go
Starting point is 00:18:58 over there and run PyCharm, the server or the sort of logic bits over there, but just have a light front end on your computer here. So like a lightweight, if you're on some really wimpy laptop and you wanted to access like a better server at work or in the cloud or, like in Chelle's example, near some massive data set instead of far away from some massive data set. So you could just directly talk to it and so on. So yeah, it's super cool. You just basically, um, give it some SSH thing. They also say it's good for things like, if your laptop gets stolen, what data goes with it? You know,
Starting point is 00:19:36 if you just keep the data somewhere else, right, then like just revoke the SSH key and nothing's bad. You can also set it up so that it'll create pre-configured environments. Like when you connect to it, it'll automatically give you something with, like, let's say Conda set up and all the right libraries pre-installed and that one weird C thing you've got to, you know, apt install to make sure it works. Like it starts with that just all configured from different things. So anyway, that seems all pretty cool to me. I thought it was pretty neat. That does look neat. I think it's free if you set up your own server, but then I think it costs money if they provide you the server, right? So kind of just like firing up a VM for you on your behalf. All right, you're ready for the buried lead? Scroll, scroll. So here
Starting point is 00:20:18 you can see as an example of just like connect over SSH, or you can go to JetBrains Space and they'll create one for you. Right. But here's the buried lead. They announced this thing called JetBrains Fleet, which is, as far as I can tell, unrelated. I think it'll connect to one of these things, but it is another thing. So if you click down at the bottom, there's something about learn more. And if you go to this, it is a complete rewrite of the whole IDE story over at JetBrains. And basically think VS Code, but from JetBrains. Yeah, I'm interested in watching this. I just heard about this last week.
Starting point is 00:20:54 And they're doing it invite only, sort of a, not invite only, but you have to like- Early access, get approved sort of thing. Yeah, get approved sort of thing. They're trying to limit, basically limit the feedback so that they can deal with the feedback. Yeah, so it's like super fast to open. It doesn't have a project structure in the same sense that like PyCharm or IntelliJ would. It just opens files and it doesn't even have the IDE features unless you click this little like make it smarter button and then it'll like fire up all the high-end stuff that takes, you know, five seconds to start. The other thing that's cool is you can see on the screen right here is
Starting point is 00:21:27 there's like three people typing all at the same time. Actually, no, there's five people typing. So it's like Google Docs where you can all like collaborate on it in parallel, like right within it. So I think those are all super neat developments in the whole editor space, which,
Starting point is 00:21:43 you know, we all write a lot of code and kind of deal with these tools. Editor as a service is something that is happening, and it is a hard thing for me to wrap my head around because my brain thinks I want all my editor stuff locally, but there's a lot of times where you don't. So yeah, you just like the group coding? Yeah, I know, I think that's really neat as well. I think that would be really valuable to some people on teams instead of, you know, we've all been in those screen share meetings like, no, could you go over there? Could you type this? No, no, no, no, not after that. Inside the parentheses. It's like, please, no. That's exactly what you're doing. No, no, no, to the left. No, a little more to the left.
Starting point is 00:22:20 Exactly. Wait, not a pen. Exactly. And so, yeah, let's see, a bunch of people out there really like this, uh, RJL and Sam and so on. But Kim has an interesting comment. We've come full circle-ish, back to talking to the one mighty mainframe over a lightweight terminal, circa 1985. Or, you know, for me, like '95 and, like, X11, X Windows. Like, is your X Windows set up so you can talk to the server? Yeah. Yep, I'm just thinking the same thing. Yeah, definitely. But these are interesting ideas. You know, for me personally, I love to use PyCharm for working on projects, but if I've got just a JSON file or even a Python file I just want to look at, I probably won't open it in PyCharm because it's going to create all this project goo that's going to be stuck in that folder and it's going to expect
Starting point is 00:23:10 it's going to complain there's no interpreter. I just want to look at it, you know? And so tools like this, I think, are going to be really neat. Yeah. Yeah. And Brandon's sort of suggesting something crazy out there, like mobs might run in. And, no, mob programming, where you're like working as a group. I think it's fun. Yeah. And we should play with this, though. Yeah, I think it'd be fun to see, uh,
Starting point is 00:23:31 what all the interactions feel like and stuff. So I totally agree. Yep. All right, over to you. Um, you know, I'm trying to remember how I came across the XY problem. I was doing some research last week, and, uh, I think I was down some rabbit hole of link, follow link, follow link sort of thing. And I ran across this, uh, problem, the XY problem, and probably everybody else knows about this already, but the concept was new to me. I don't know the XY problem. Okay. And I studied math. Come on. Well, so it isn't really that mathy. So the XY problem is essentially you're trying to solve problem X, and you think of a solution Y that would help work to solve that.
Starting point is 00:24:26 And you get down to trying to solve all the details of Y, and you get stuck. So you ask about Y, but what you're really trying to do is X. And that's sort of nebulous. An example kind of highlights it. So, and we've got this example in the show notes that I pulled out of one of the links, is how do I, if somebody asks,
Starting point is 00:24:43 how do I get the last three characters of a file name? And somebody says, oh, you just like do, and this is a shell command. You just do like, if it's in the variable foo, you just do dollar curly bracket foo and then do a colon and then negative three, just grabs the last three characters. But also why do you want the last three characters?
Starting point is 00:25:04 Is it because you are trying to do, uh, trying to pull off the extension? Somebody goes, yeah, that's what I'm trying to do. And they're like, oh, well then you don't want the last three characters. Cause it might be a two character or a four character extension. So teach them how to do the real problem. And, uh, in one of the, uh, I'm going to link to a couple, a couple like forum answers and stuff in there, because I think it's interesting to it's, there's a lot of verbiage around the XY problem that sort of blames the asker for asking a stupid question. And I think it's important to not do that because
Starting point is 00:25:38 we do this all the time. We break problems in software. We break problems down. If I want to do A, then I need to do B and C. But to do B, I got to do D and E. And then also F and G. And then way down into the rabbit hole, I get to get into the X and Y problem. But how far back do you back up to give enough context to somebody else? So it's hard to avoid. You'll run into it. And then I really like there was one forum that had some great advice, both on asking questions and on answering questions. So when
Starting point is 00:26:13 asking questions, state the problem that you're trying to solve, but also state the higher level thing that you're trying to achieve, if appropriate. And then also how that fits into the wider design. And then it also brought up if you've thought of other solutions that you've eliminated for some reason or another, go ahead and list those because somebody might give you one of those as an answer and you've already eliminated that. So give the reason why. And then I think what's most important is giving answers to what XY problems or giving answers to problems. Because although I think everyone that's on this podcast and also listening is probably an expert in some fields and a novice in other fields. So we're going to be on both sides of the fence. So when answering questions and you think,
Starting point is 00:27:03 oh, somebody is just trying to get the extension, I'll just tell them how to do that. That's not necessarily helpful. So a great, there's a great three-part thing to do. And our example follows those is go ahead and answer the question directly, but also ask some questions about the problem. Say, just curious, why are you trying to do this? Is it because you're trying to do this other thing? If so, the thing I just told you might not be appropriate. And then once you figure out really what the real problem is, then you can help and give the final answer. So it isn't helpful to just say, oh, you're probably getting the extension.
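The file-extension example from this segment, redone in Python rather than shell as a short sketch. The function names and sample filenames are made up for illustration: `last_three` answers the question as asked (Y), while `extension` answers the underlying goal (X) using `pathlib` from the standard library.

```python
from pathlib import Path


def last_three(filename: str) -> str:
    """The literal answer to 'how do I get the last three characters?'"""
    return filename[-3:]


def extension(filename: str) -> str:
    """The answer to the real question: what is the file's extension?"""
    return Path(filename).suffix


# Compare both answers on extensions of different lengths.
for name in ("report.txt", "script.py", "index.html", "README"):
    print(f"{name!r}: last three = {last_three(name)!r}, extension = {extension(name)!r}")
```

The two agree only when the extension happens to be two characters plus the dot; for `index.html` the slice gives `tml`, and for a file with no extension it returns part of the name. That mismatch is why asking about the goal matters.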
Starting point is 00:27:43 Go ahead and just do that. Anyway, I thought this was an interesting thought process around answering and asking questions. Yeah, absolutely. It seems to be very relevant to Stack Overflow type places. Because you're looking for help. You say, I'm trying to do this. But a lot of times people will give you very specific answers. And the answer could be, well, why don't you just do this library that already understands that format?
Starting point is 00:28:07 Like Chelle mentioned earlier, like why don't you use xarray instead of trying to understand how to parse this thing? Just use that. Oh, well, that's way better. Thank you. I see that a lot on Stack Overflow,
Starting point is 00:28:18 that exact. It reminds me also of my, like when I went to school and you're trying to ask a question to your professor or to get help on anything, right? You're like, this is my problem. They're like, what really is your problem? Please tell me about it. And like, that's what you're asking, right?
Starting point is 00:28:34 Like, tell me what the actual problem is. And if you can do that, clearly, you're going to get a much better answer. Yeah, absolutely. And a lot of people just don't. I mean, it's also just a different perspective thing. They know they have the toolbox of things they know how to solve and ways they've solved them. And a related thing is people sometimes don't even think that there's a really simple solution out there. Like, oh, that tool you're using, it already has a flag that does exactly what you want, but
Starting point is 00:29:05 you didn't know the flag was there. So when I started learning Python, and I was so used to Fortran 77 where there was never any help, they just don't even try, um, when I started learning Python, it took three or four months before I finally just said, anything I want to do, someone has done better. Yes. And they are out there. I just have to find out how to ask the question correctly to find them. Because it's true, like, you know, most people have tried to solve the same problem. There's someone out there who's worked on the same problem, in all likelihood. Yeah, there's so many libraries with pip or Conda that, if you knew it existed, it would do it. No one knew it
Starting point is 00:29:45 existed. Exactly. Yeah, exactly. All right. Okay, so I guess, am I next? You are next. Okay. So what I wanted to show is this library that is called Kerchunk. It's a great name. Yeah, brand new. So can you see my screen? Yeah, yeah, we see this now. So we had this problem where, like, as NOAA and NASA, everyone's starting to throw all these NetCDF files, or all these different files, onto the cloud, and then it turned out that access in S3 was really, really slow. And so people got really frustrated, because, like, the cloud's supposed to be fast, right? This is going to transform science. We're going to do it better now. That's the promise. Yeah.
Starting point is 00:30:30 That's the promise. But the grass isn't always greener. So this is this library that I think has really maybe some broad applications. It's being developed right now. And the idea behind it is like we have all these data formats that we're sort of stuck with. There's lots of data, but sometimes it's slow on S3. So is there a way that we can fix this? And the idea is that you create a reference file system. And so you do this by going to each of your files and just taking the data that you need for that file, like just the metadata. So like what size is it, what its dimensions and coordinates are, what variables does it contain?
Starting point is 00:31:11 So you just take those little bits and you pull them out into a JSON file. And so then you have this reference file that just contains the important information, but it's really small. And so that makes it faster to access. And then you construct this JSON file and I have some benchmark tests in here, but then you construct a mega JSON file and you basically virtually aggregate all of your data so that in one call, again, you could just get access to everything. And because you might not need actually the data, you might need to know, well, what timeframe is this? So I, do I need to read in that file or not? Right. Yeah. And in some ways, because you're doing a lot of what, one of the things with xarray, back to that other library,
Starting point is 00:31:57 is it does the lazy loading. So like this is a 16 terabyte data set that I'm loading here, but I'm just loading the data about the file. I'm not actually loading any data until I need to touch it. And so I can load this giant data set in a little bit over, you know, less than two minutes by doing this virtual aggregation with Kerchunk. And so all it's doing is it's reading these aggregated JSON files. And right now it works for three or four different types of data sets. So if you have big collections of data that are going on to S3,
Starting point is 00:32:33 they have lots of different little files. This is a way to sort of virtually aggregate them into one big data set that you can then subset. Oh, that's really cool. It seems like this is one of those that comes as part of the fsspec project, which we talked about pretty recently as well. Yeah, and so this is part of fsspec
Starting point is 00:32:52 and it's Kerchunk. It was just released and it's a unified way to represent compressed data formats and it creates this virtual data set. So that's where it's located. Yeah, super cool. See, Kim has a question.
Starting point is 00:33:07 Do you keep the individual JSON files with the data? You can. So the nice thing about this, the data can be anywhere. And again, this is the idea to make data invisible and easy to access so that you don't have to care what format it's in or where it's at. You can, as long as they make the little, you can either create them yourself and just keep the little JSON files public. And then you just make the one aggregated JSON file public. And then anybody could actually use that JSON file to access the data this way. Yeah. Fantastic. This looks really helpful for working with large data. Yeah. Yeah. I think it's cool.
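The reference-file idea described above can be sketched in plain Python. This is only an illustration of the concept, not Kerchunk's actual API or JSON layout: the record fields, the bucket paths, and the helper names (make_reference, aggregate, files_overlapping) are all hypothetical. The point is that a small JSON index can answer "which files do I need?" without opening any of the large data files.

```python
import json

# Hypothetical helper: in practice the metadata would be extracted once
# from each file's header; here it is simply passed in by hand.
def make_reference(path, variables, time_start, time_end):
    """Build a small metadata record describing one data file."""
    return {
        "path": path,
        "variables": sorted(variables),
        "time_start": time_start,  # ISO dates compare correctly as strings
        "time_end": time_end,
    }

def aggregate(references):
    """Combine the per-file records into one index, serialized as JSON."""
    return json.dumps({"files": list(references)})

def files_overlapping(index_json, start, end):
    """Return paths whose time range overlaps [start, end].

    Only the small JSON index is read; no data file is touched.
    """
    index = json.loads(index_json)
    return [
        f["path"]
        for f in index["files"]
        if f["time_start"] <= end and f["time_end"] >= start
    ]

# Made-up example files (the bucket and names are placeholders).
refs = [
    make_reference("s3://example-bucket/sst_2020_01.nc", ["time", "sst"], "2020-01-01", "2020-01-31"),
    make_reference("s3://example-bucket/sst_2020_02.nc", ["time", "sst"], "2020-02-01", "2020-02-29"),
    make_reference("s3://example-bucket/sst_2020_03.nc", ["time", "sst"], "2020-03-01", "2020-03-31"),
]
index_json = aggregate(refs)
needed = files_overlapping(index_json, "2020-02-10", "2020-03-05")
print(needed)
```

With an index like this published next to the data, a reader only fetches the files whose time range it actually needs, which is the same reason the aggregated JSON makes the large collection quick to open.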
Starting point is 00:33:42 Yeah. It looks awesome. All right. Brian, does that bring us to the extras? Yeah, I guess it does. How many, how many extras you got today? I just have one entertaining extra. I thought, um, as, uh, some people have amusingly noticed, um, I am attempting to grow my hair out. Um, and I went to Florida last week and it's very humid in Florida and I looked like a cotton swab. It just like poofed. Anyway, that's it was amusing to me. But you should have sent us some pictures or something. Yeah. I mean, those are the pictures you don't really want out there.
Starting point is 00:34:16 But yeah. Yeah. So I wish I could have seen, like, that, because I was at Disney World and we're doing, like, rides and stuff. And I really wish I could have seen, like, the flowing hair on the roller coaster or something like that. So perfect. I love the hair. Nice, nice. Let's see what Shel's got first. Okay. So what are extras? Just something that we did last week? Well, just whatever you want to also just give a shout out to. We're here, before we call it. I think I'm pretty good. I'm really excited, like, NASA is starting a big transformation to open science, which is exciting.
Starting point is 00:34:51 They announced just last month a new 40 million dollar initiative to try and help scientists move to open practices, and Python's a big part of that, because a lot of this was the open community that Python helped develop over the last decade and all of the tools that now are making, it's not just science easier, it's making it easier for more people to participate in science. I think there's a lot of synergies and similarities between the scientific goal of spreading knowledge and publishing your work and so on and open source. Yeah. Because it used to be like scientists, like you would share your knowledge, right? You'd publish paper and that was it. And
Starting point is 00:35:30 if you like, that's what graduate, like I remember in graduate school, you would go through and they'd be like, okay, derive the equations in this paper. Cause they wouldn't show you all the steps and you would do that. And then if you wanted to code it up, you would just open up a new window and start coding. And now, you know, people are starting to publish their code so that you can actually reproduce their results and then build on them and move faster. The whole reproducible science thing as well. Fantastic. Yeah. Awesome. Sam in the audience says, yes, more open reproducible science is great for everyone.
Starting point is 00:36:03 Yeah. All right. I got some extras as well, as you can imagine. Surprise. I don't remember when I was going on. Maybe this was actually in Talk Python, but I was going on and on that Visual Basic 6, "I want to drag a few things on the screen and write a little bit of code," made it so much easier for people to build apps. Robert Livingston out there said, you know what, Xojo, X-O-J-O, or "zojo," I don't know, is this replacement thing. So if you're trying to build some desktop apps and you want to
Starting point is 00:36:33 do a bunch of draggy-droppy stuff, boy, if it worked with Python, or somebody could build a Python-integrated thing behind those events there, I would love to try to work on some integration between those things. But currently, no. There's a little demo where in like six minutes, seven minutes, they build a web browser, which is kind of neat. So very Visual Basic feeling. So is it Python? It's not Python. No, it's not Python.
Starting point is 00:36:56 It's more of VB6 feeling. I don't know if it's actually VB6, which is even worse. It's sort of kind of, but not exactly. I just did a webcast, 10 reasons you'll love PyCharm even more in 2021, with JetBrains and Paul Everitt. We just did five reasons. So I'll link to that, people care about that. And then who doesn't love a little good tech shock and awe and, I don't know, outrage, I guess, is the word I'm looking for. So Microsoft Edge is this browser that's sort of Chrome-based and they just announced like a Linux version
Starting point is 00:37:29 and it runs on macOS, which all these things surprised me. And there was getting a lot of traction and there's this whole thing where Microsoft, the team at Edge just added like a buy now, pay later thing built into the browser from some third-party company, not as an extension, but like integrated into the browser, that you can't not get. When you go shopping, it says, would you like to use this, like, for payment program? It's almost like adding, like, payday loans, like, baked into the browser. It's insane. That's so, there's, I know, it's such a bad
Starting point is 00:38:04 idea. So there's an Ars Technica article. It says users revolt as Microsoft bolts on short-term financing app into Edge. That's like 30% borrowing. And one of the quotes is, this all feels extremely unnecessary for a browsing experience. And the comments are, you go to the comments, they are really, there's 256 comments, which is an awesome number of comments for the moment. But there's just almost nothing but like, why? Why is it, this is unbelievable to me. I can't believe this is so, it just makes it feel so shady and trashy, right? Like the next thing you're gonna do is get like bail bonds offerings inside your browser, if you get your browser, just weird stuff. So anyway, I thought people might enjoy just reading through this and taking a little bit of that in. It must work, right? Because we all have this experience where you, I mean, this has been going on for 20 years, like with their browser. Remember, it used to install all this stuff on your machine, you have to delete it all, and then that was ruled illegal, so they had to take it, they had to separate them
Starting point is 00:39:09 out. And they just keep finding ways to get back in. Yeah, there's some really interesting stuff. You know, they're now sort of putting ads in the start menu and stuff, and then the ads are forced to open in Edge, not your default browser. It's just like there's layers of, like, really, like, why are you doing it? It makes me happy that I'm not using Windows 11 at the moment. Whereas I've been actually looking forward to using, say, like the new terminal and Oh My Posh shell on Windows and stuff, which looks amazing. So I think there's this sort of like different groups. So this is definitely a different group than say the VS Code group of people.
Starting point is 00:39:43 This is again going to take us back to 1995 and we're just going to be using a terminal window to access anything so we don't get annoyed by all of that. There's no ads in the Linux browser. There's no ads in the Linux browser. Yeah, exactly. Now, if they could just get the ad companies to be able to just collect your credit card information and then instead of showing you the ad, just buy it for you and stay up on a payment plan that was just shared. Like, we already know who you are, just click here if you want it. Okay, great. Or just send it to you anyway and just charge you later. Exactly. So I feel like this almost could be the joke, but I've got a different joke for you. Okay. All right. So the joke for this week comes from a solid source, XKCD, as you may know.
Starting point is 00:40:28 And this is about workflows and changing software. So here's the one that says workflow, and it's just in the change log or some sort of conversation flow, maybe a GitHub release or something. It says, changes in version 10.17. The CPU no longer overheats when you hold down the space bar. And then there's a frustrated user comment. It says LongtimeUser4 writes, this update broke my workflow. My control key is hard to reach. So I hold the space bar instead. And I've configured Emacs to interpret a rapid temperature rise as pressing control. The admin writes, that's horrifying. The user
Starting point is 00:41:03 writes, look, my setup works for me. Just add an option to re-enable spacebar heating. Oh, I remember, like, enabling all the weird Emacs things that only you would know about. Exactly, exactly. And the subtitle is: every change breaks someone's workflow. I love it. Yeah. Actually, it's interesting because Python's even, like, more so like that, because of the introspection, and everything's really open unless you really work hard to make it, I mean, you can't really hide too much stuff with Python. So even if you tell people, even if you have a comment around a function or an access point to say this is not part of the API.
Starting point is 00:41:46 This is subject to change. You can change it and it will break somebody. Because somebody has reached inside and used the thing you told them not to use. Yep. Those double underscores and single underscores, they're just there to slow you down. That's just there so you notice what you're not supposed to do. Those are where the interesting parts are.
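The point about underscores can be seen in a few lines of Python. This is a small sketch with a made-up class (Sensor and its attribute names are hypothetical): a single leading underscore is only a naming convention, and a double leading underscore just triggers name mangling, so both stay reachable from outside the class.

```python
class Sensor:
    """Hypothetical class used only to illustrate underscore conventions."""

    def __init__(self):
        self._calibration = 1.5  # single underscore: private by convention only
        self.__raw = 42          # double underscore: stored as _Sensor__raw

    def reading(self):
        # Inside the class body, self.__raw is rewritten to self._Sensor__raw.
        return self.__raw * self._calibration


s = Sensor()
print(s._calibration)       # nothing prevents this access
print(s._Sensor__raw)       # the mangled name is still reachable
print(hasattr(s, "__raw"))  # the unmangled name does not exist on the instance
```

Nothing here is enforced by the interpreter, which is why code that depends on such attributes can break when a library changes its internals.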
Starting point is 00:42:04 Exactly. They wouldn't give me the feature, but I can just do it right here. Awesome. All right. Well, I think that's it, Brian. Yeah. It was a good episode. So thanks, everybody, for showing up.
Starting point is 00:42:14 Yeah. Thanks, everyone. Yeah. Thanks, Shel, for being here. Great to have you on the show. Thanks, Michael. Thanks, Brian. Take care.
Starting point is 00:42:20 Bye, everyone.
