Embedded - 285: A Chicken Getting to the Other Side

Episode Date: April 11, 2019

Carlos Maltzahn joined us to talk about graduate studies in open source software, research incubators, and how software development tools can be used to aid the reproduction of scientific results. Carlos is the founder and director of the Center for Research in Open Source Software (CROSS). He is also an adjunct professor of computer science and engineering at UC Santa Cruz. Some projects we spoke about: Jeff LeFevre's Skyhook, using programmable storage in Ceph to make Postgres and other databases more scalable and elastic (skyhookdm.com); Ivo Jimenez's Black Swan, using DevOps techniques and strategies to speed up the systems research delivery life cycle (falsifiable.us); Kate Compton's Tracery2 and Chancery, using open source software to support artists and poets (tracery.io).

Transcript
Starting point is 00:00:00 Welcome to Embedded. I am Elecia White, alongside Christopher White. And in studio with us this week is Carlos Maltzahn. We are going to talk about graduate research in open source software. Hi, Carlos. That's right. Welcome. Thank you for having me.
Starting point is 00:00:25 Could you tell us about yourself as though we met at, I don't know, the Bang Bang conference during a break? Yeah. So I'm Carlos Maltzahn. I'm an adjunct professor of computer science and engineering at UC Santa Cruz. And my specialty is storage systems, large storage systems. An easy way to explain that is that they communicate data over time, as opposed to computer networks, which communicate data over space. So we're looking at very large systems. And then more recently we are also looking at making science more reproducible using new kinds of software technologies like DevOps. We're going to talk about that later. Making science more reproducible would be nice. Okay, well, let's do lightning round where we ask you short questions and we want short
Starting point is 00:01:25 answers. And if we're behaving ourselves, we won't ask how and why until later in the show. Christopher, I think you have the first question. How many banana slugs have you seen? Oh my God. I like to hike in the redwoods, and so I probably saw a lot. What is the biggest one? I don't know. Maybe four inches, I guess. Hiking or surfing? Hiking. I tried to surf, but I really failed.
Starting point is 00:02:00 It's hard. Research or teaching? Research. A little bit of teaching. Favorite closed source software product? Google Documents. What about open source? I think I like Linux.
Starting point is 00:02:22 That's a big one. Sure. Still on the fence. Favorite conference to either go to or speak at? I think, you know, BangBangCon was awesome. That was like an unconference, but it is definitely one of my favorite conferences. There are lots of other conferences.
Starting point is 00:02:45 They're also fun to go to, but probably not as cool. And do you have a tip everyone should know? Get enough sleep. It's really true. I tell that to my students, and it's very different from what other people tell them: work, work. And I say, no, get enough sleep, then you work better. As somebody dealing with insomnia right now, I resent that answer. Okay, let's go on to the regular questions. How do you teach open source programming? It seems like an odd thing. Yes, and there's actually a big debate on how to do that. Nobody actually really knows how to do that. And this is an experimental course, but that actually
Starting point is 00:03:36 experiment turned out pretty well. So this is actually an idea that came from a former student of mine, Andrew Schumacher: to make everyone in the class contribute to the Linux kernel. And, you know, these are undergraduate students. Some of them have not even declared their computer science major, so they don't really know much about operating systems, and especially not about the Linux kernel. And so then you might ask, how can they possibly contribute? And so it turns out really big open source software projects have lots of little things to do,
Starting point is 00:04:14 like documentation, like fixing little things, cleaning things up, making sure that programming styles are followed. And that's a really great way to get involved. And what you have to learn in the process is how to interact with a large open source software community. And that is the biggest challenge, actually, even for people who have been in computer science for a long time: to understand the protocol, and how do you
Starting point is 00:04:47 constructively contribute to a very large community, and how do you interact, and what tools to use. And in the process, almost by accident, they learn a lot, just by using these tools, about how to be productive. And there's sort of a saying, and I forgot who said this, and maybe I have to look this up, but in teaching open source software, the most important thing is to teach people how to be productively lost. And it's because you don't know, even the teacher doesn't know, right?
Starting point is 00:05:21 It's a very different kind of teaching, because I don't know the Linux kernel. Nobody really knows the Linux kernel, right? It's just these big systems. And so you're more like, you help people navigate this complexity. And so that's really what this class is about. They learn concretely how to compose and put together Linux patches, which are emails, but they're in a certain format, and how do you find out what should be patched, and then do it, and how to
Starting point is 00:05:56 test it, and then who to send it to, right? So all of these things you need to learn. And then at the end, they have an exam, and they basically submit patches and get evaluated on that. That sounds really useful, as somebody who sometimes wants to contribute to open source. But especially the Linux kernel is notoriously unfriendly. But that's largely because they get a lot of junk. And so why be friendly to the new person
Starting point is 00:06:33 when you're already inundated with so much? Yeah, and actually that's a really interesting point, right? There was this really interesting discussion, because they didn't have a code of ethics, or they had sort of this code of conflict, right? And very recently they switched, and actually Linus Torvalds stepped down because of his very abrasive nature. And my theory is, you know, when I saw the discussions about the maintainers and how they sort of react and almost threaten the contributors, like, you'd better contribute something good, because I don't want to see junk.
Starting point is 00:07:13 And they're very aggressive about getting false stuff, or stuff that's not really good and causes a lot of work and maybe even bugs. And I was sort of reminded of Milgram, those experiments at Stanford that were looking at prisoners and guards. And those experiments were designed so that the guards were not quite sure themselves whether they were protected enough from their prisoners, and so they became very aggressive. And I thought, I mean, this is just my intuition, but that's a little bit what's going on here. They're really scared, you know, that they get a lot of submissions and they can't really review things so thoroughly that they're very sure that there's not a bug in there, right? And so they try to sort of threaten. But should they? Yeah, I mean, software bugs come along with software. Yeah, it's
Starting point is 00:08:12 a trade-off, right? Yeah. But on the other hand, you also have to look at how that played out, and it was not very nice, especially for newcomers, right, who just try things out and then they suddenly get blasted, you know, in the worst possible way. And that's also part of the class, to sort of have that discussion, you know, explore why that actually happens, or why that could happen, and what that actually means, right? And once students can understand that that's maybe more a form of protection than actually directed at them personally, they can better understand how to deal with it. I wish I had your class, because I had to do this once professionally, and it was a major, major down part of my career, because I'd spent years on a patch. Yeah. A large one. It was for a kernel module, so in my brain it's like, it's a
Starting point is 00:09:20 module, it's optional, right? Shouldn't be a big deal. And I just started a firestorm. And, you know, the networking maintainer basically killed it, even though some of the other reviewers were like, oh, this is fine, just make these changes and we're good. And then, you know, it started an argument between the maintainer and Cisco Systems, who I was working for at the time, going back and forth, and finally it didn't go in, and I just gave up. I actually stopped that contract, I was so frustrated. But yeah. Yeah, no, and you just put all... But I didn't, yeah, I didn't expect any of this. I mean, I heard about Linus and stuff, but yeah, I didn't expect the pushback. And I also feel like we should have primed them: hey guys, we're working on this, do you have comments ahead of time, before we
Starting point is 00:09:59 get going? And that was our big mistake, right, just kind of dumping it: here it is, but you'll love this, right? And that's exactly what we're discussing. We're going through multiple, you know, case studies. I mean, the nice thing about open source is everything is documented, right? You can actually go through those archives and you can just point to certain things. And that's really what I offer to the students in open source programming: to actually find those useful cases and then study them. Yeah, but thanks for sharing. Are the Linux maintainers, have they put a hit out on your life?
Starting point is 00:10:34 Because your class is going to be submitting. I mean, they're supposed to be submitting things to the Linux kernel, which generates work for them. Actually, not quite. So that is, it's a really good point, though, because there is, you know, there are other people also teaching open source, and they have chosen to use small projects, and they get regularly overwhelmed when there's a class of newbies, right? And they suddenly all submit and... Single maintainer places.
Starting point is 00:11:04 Yes, yeah, yeah. And so Linux is one of the biggest open source software projects. And, you know, there's a lot of noise anyways. And the community is sort of set up for that. On the other hand, I also don't actually ask them to immediately submit, right? So basically the patches get submitted to me and I review them. And there's this thing where, you know, one of the exercises
Starting point is 00:11:30 is: who do you CC? CC lists are incredibly important in Linux patches. But then it's a fake CC, so you don't actually put it in the email, right? But every so often... it actually happened the first time I taught it. Somebody accidentally left the CC list intact, and it actually went out to the Linux maintainers and got accepted. So it was like an amazing success. Accidental, but yeah, very cool. Okay, you also do research in open source software. So that's, yeah, that's actually,
Starting point is 00:12:06 so it's, we could, I was really struggling with this name and couldn't really figure out how to come up with a good acronym that everybody could remember. And, but what this means is actually introducing research in open source software.
Starting point is 00:12:26 So it's not research on open source software. That sounds like psychology. Yeah. So introducing research in open source software is actually this idea that a lot of open source software community efforts are outside of research. And the research communities themselves use open source software more like a tool to share some code that they did as part of an experiment and make their experiments reproducible. But they don't actually embrace the very cool technologies and strategies and techniques that open source software communities developed inside the way they do research. And, you know, the funny thing is both are heavily peer-reviewed, right?
Starting point is 00:13:13 Open source software is heavily peer-reviewed and science is heavily peer-reviewed. So there's actually a number of really interesting parallels. And then plus, you know, students really expect to be, you know, taught more in these open source software tools, right? So it's expected for students when they go interviewing, the first question is, do you have a GitHub repository, right?
Starting point is 00:13:39 And so we don't teach that in the university, not until recently. And it's brand new. I mean, five years ago, even five years ago, that was pretty much, okay, that's nice, cool, extra thing. But now it's, you know, if you're coming out of some, if you're a new grad, certainly. Yeah. And it's actually, you know, and actually there is this history also, which is, I think, relevant here. So I was actually co-founder of a storage systems project called Ceph.
Starting point is 00:14:14 And Ceph is a storage system that is now very popular. It's, I think, probably the… C-E-P-H. C-E-P-H. Yes. Okay. Yeah. And Ceph is actually one of the most popular open source software storage, distributed storage systems now.
Starting point is 00:14:28 I mean, it's the dominant storage system in OpenStack, which is used by the telecoms in every town. You know, all the little local phone data centers are running OpenStack. And so the student who ended up working on Ceph, this was originally only supposed to be a summer project, and then it turned into a PhD project. He was in a situation that he just continued working on it after he graduated, and founded a startup, and then sold the startup to Red Hat, and was suddenly very wealthy. And we asked him, can you, you know, give back to the university? And he said, yes, I will give you two million dollars if you can create a structure in the university that will allow other students to have a similar career as I did. And the key thing was,
Starting point is 00:15:28 how do you enable students who created a piece of infrastructure to continue working on it after they graduated? Because normally when students graduate, they need to go get a job, and that context switch means basically everything they did before is kind of thrown away. And so this is actually what precipitated CROSS, the Center for Research in Open Source Software. It is basically an incubator that allows PhD students to become open source software leaders. And then not to throw away all the software that they wrote
Starting point is 00:16:08 and the systems that they built, but to actually build a community around it so it lives on. Okay, I have so many follow-up questions from that. Let's start with the first one I came up with, which was the acronym.
Starting point is 00:16:24 You mentioned the acronym and you mentioned the name Center for Research in Open Source Software, which is CROSS. And I wanted to know whether that was cranky, like a Linux kernel maintainer, a crucifix, or a chicken getting to the other side. Chicken getting to the other side. Chicken getting to the other side. We are bridging the gap between student work and open source software communities. So that's what CROSS does, right? And then also, you know, we're from UC Santa Cruz, right? So it's Santa Cruz. That's right. It's multi-layered. Yeah, it's multi-layered. Almost like a pun. Yeah. And then you talked about the overlap between research and open source approaches.
Starting point is 00:17:16 And you mentioned peer reviews. Are there other things that are similar between them? Oh, yeah. So we're actually going to get to one of the incubator projects, but very briefly: there are a lot of tools. And actually a lot of the research is software. So all the experiments are software. This is not only for computer science. It's for bioengineering, I mean, almost any engineering field. You have software to do science. And so, you know, you share the software, you produce the software, you package it. So you do everything that normal software engineers do for production software systems. But somehow, it's never really viewed that way.
Starting point is 00:18:10 And so actually applying open source software production tools in science is incredibly valuable. And that's something that we just discover right now. Well, you mentioned that a lot of people create open source tools, but they do it as part of their research and maybe it shows up somewhere as a Jupyter notebook or in a GitHub repository that only their advisor looks at.
Starting point is 00:18:38 And I'm still like, okay, so a lot of open source projects end up that way. A lot of research ends up as an open source project. And I totally see what you're saying about physics, bioinformatics, weather, everything. There's a computational side to science. Yeah. And so, maybe instead of asking how and why, I should ask you about Black Swan.
Starting point is 00:19:07 Because I think that's where we're headed. This is one of the projects at CROSS. Yes. So could you tell us about it? Yeah. So this is by an amazing graduate student who's working for me, Ivo Jimenez. He is from Murcio, Mexico, and he's been suddenly showing up in my class a few years ago and was like this amazing guy
Starting point is 00:19:36 who was just really into storage systems and distributed systems. And then he said, oh, I want to work with you. But then while he was sort of looking around for a theme, he got totally into the weeds, and it was not storage systems, nothing, right? He was, like, really into reproducibility. And I at first thought, what's reproducibility, right?
Starting point is 00:20:01 I mean, it's important, but I didn't really think about it. And it turns out he had this amazing software engineering background from his undergraduate years and had been following the DevOps community. DevOps is this movement in software engineering whose mission essentially is to accelerate the software delivery cycle. And so what does that mean in practice? When you look at your laptop today, let's say, right? A few years ago, maybe 10, 15 years ago, in order to get new software, you had to go to a store and you'd get, like, a shrink-wrapped box. I miss that so much. You unpack that, you know, it smells all new, and that was new software. It usually
Starting point is 00:20:55 came with a book, or at least a booklet. And then there were CDs and later DVDs, right? And it took some time to actually load all that software, and you had to sit there and feed it slowly, right? 30 floppies. Yes. And today, you know, it almost just happens, right? And that's just the consumer side. But I mean, in industry, the change is even more dramatic, right? In the financial industry, they can now install new software in three days instead of, you know, a year or more, right? With all the certification and everything that goes with it. A successful community of people who really figured out how to make changing and updating software, and debugging software, and the whole process, really fast, very efficient, remove
Starting point is 00:21:56 all the friction, and create a lot of very aggressive automation in it. And so when you then take that over and say, okay, what happens in science? You have experimentation pipelines, you have scripts, you gather software to do your experiments. Your experiments don't all work out right from the beginning. You have bugs, then you have to redo an experiment, and then it sort of looks like what software engineers do. They have to reproduce bugs, right? So the scientist also has to reproduce something, and they have to be really careful about those conditions and about all the context of what they've done. Kind of like my build environment. Yeah, exactly. And so they haven't actually had the education, right? A bioengineer doesn't have the education of a software engineer.
Starting point is 00:22:53 And so this is sort of Ivo's insight: by essentially creating tools that allow a wider community to use the DevOps tools and to really leverage those insights on how to progress, you can actually accelerate the science delivery cycle. And that's actually new. Nobody actually really thought of science as a science delivery cycle. But once you kind of look at this, then it says, oh, that's like the science life cycle. And people actually knew about that for a while. And so basically, that's progress of science. And as it turns out, a lot of friction is not the scientific insight itself.
Starting point is 00:23:45 It's just all that other stuff that you have to do in order to figure out what went wrong in your experiments. And a lot of it is the same as what the DevOps community really figured out how to make fast. You say the DevOps community, and that's an approach to software. But do you, I mean, isn't it any software, major software methodology would be applicable in this case that research has to understand how to, I don't know, I don't want to say release and deliver and productize, but to some extent that's true. Your dissertation is a product. Right.
Starting point is 00:24:29 And I think that the DevOps community is heavily vested in open source. If you look at what tools they use, they use Git mostly, or GitHub. They have container technology, like Docker. Yeah. And then they have these wonderful things like Ansible, which is, you know, 250 pieces of really complex Python code that implement really simple verbs, like install this, right? It turns out, when you look at the details of installing something, to just say install, right, that's an awesome verb to have. And then there's this big complex thing that you never look at that just somehow manages to
Starting point is 00:25:20 install your stuff right and so all you have to say in your script, install this or uninstall this, right? And so that's like an amazing tool. And it turns out that, you know, if you have 250 verbs like that, you're pretty much set. I mean, it makes everything so much easier, so much easier. It's an abstraction level that just wasn't there before.
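To illustrate the "verbs" idea described above, here is a minimal Python sketch. It is not Ansible's API: the `VERBS` registry, the `run` helper, and the use of a plain set to model a machine's installed packages are all hypothetical, chosen only to show how a short script of simple verbs can hide the details behind each one.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical registry mapping a verb name to the code that implements it.
# This is an illustration of the abstraction, not how Ansible is built.
VERBS: Dict[str, Callable[[Set[str], str], str]] = {}


def verb(name: str):
    """Register a function as the implementation of a named verb."""
    def register(fn):
        VERBS[name] = fn
        return fn
    return register


@verb("install")
def install(installed: Set[str], package: str) -> str:
    # A real implementation would hide all the messy details here.
    if package in installed:
        return "unchanged"
    installed.add(package)
    return "changed"


@verb("uninstall")
def uninstall(installed: Set[str], package: str) -> str:
    if package not in installed:
        return "unchanged"
    installed.remove(package)
    return "changed"


def run(script: List[Tuple[str, str]], installed: Set[str]) -> List[Tuple[str, str, str]]:
    """Apply each (verb, argument) step in order and report what happened."""
    return [(name, arg, VERBS[name](installed, arg)) for name, arg in script]


if __name__ == "__main__":
    machine: Set[str] = set()  # assumed stand-in for a machine's state
    for step in run([("install", "gcc"), ("install", "gcc"), ("uninstall", "gcc")], machine):
        print(step)
```

The script itself only says what to do ("install this", "uninstall this"); reporting "unchanged" for a repeated step is a design choice made for this sketch so that running the same script twice is harmless.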
Starting point is 00:25:44 And that's sort of an outcome of open source software itself. And that's why it's called DevOps: it's basically this funny mixture of development and operations. And so looking at how can we make operations much faster, you know, much more efficient, using open source development, that's essentially what the DevOps community did. And then there's, yes, the Agile methods, you know, continuous integration, all these things that kind of play into this. But somehow, I think mostly because all these Fortune 500 companies want to make this process faster, they poured a lot of money into this.
Starting point is 00:26:30 Now science can sort of look at that and say, well, we can leverage that. And there's huge communities. We can actually also do that and it would be very beneficial. I don't know whether that answers your question, but basically... I think it's hard because we're both device software engineers, and I've only started to see concepts like this where I'm at. And I'm not sure if it was intentionally following on from the DevOps approach or if we just discovered it in a similar way. We have scripts and tools. We use
Starting point is 00:27:03 conda just installs your entire development environment, and it's all versioned properly. And we have Python scripts. It's not called Ansible, but it's got many verbs. It does lots of things, installs things, does things to the device for testing and such. But it's all very standardized, and everybody gets the same blob when they check out the source tree and install that. But beyond that, I'd never seen anything like this, except for like the FDA saying, well, you know, every version of your software, you have to have everything packaged up so you can reproduce it. But that was more for a, when we come to arrest you, we want to know, you know, we want to be able to reproduce the bug
Starting point is 00:27:39 and figure out why. It wasn't for, you know, rapid release, and we didn't use it that way. So it sounds like it's DevOps to me. When I hear that word, I think, okay, that's back-end people in the website world, but it's definitely applicable beyond that. Yeah, yeah. I think what's so interesting is that there's so much pressure now in science to have reproducible science, right? Reproducibility is a big deal. And it's kind of gotten out of hand because all the science has gotten so complex, right?
Starting point is 00:28:13 The systems have gotten much more complex. So at some point in some communities, and I know that's the case for computer systems, people kind of gave up. They said, you know, we can't reproduce this stuff, it's too complex. And in fact, I was invited to a panel session at one point, and the panel moderator got really excited because I was the only one in favor of reproducibility. Which was like, we are a science, we're supposed to be scientific, how can you not be in favor of reproducibility? It's like the... Not that kind of reproducible.
Starting point is 00:28:52 time I came down strongly in favor of reading. Yeah. And this is basically why I think a lot of communities kind of gave up. And then Ivo came along and said, you know, there are all these tools out there, right? You don't have to reinvent everything. There's a lot of help. You just have to repurpose these tools in the right way. And then you actually... And so the cool thing is when he actually started developing this, my lab suddenly just switched. So all my students suddenly used his tools to do their own research and to publish papers and felt that their productivity
Starting point is 00:29:38 went through the roof. It was really amazing to watch. And I was dumbfounded, right? That's when I knew, okay, he's on to something. We can change the workflow of my students. And when... I only recently came across the term
Starting point is 00:29:57 DevOps. As Chris says, I live in a different world. And it seemed to me like this was build tools, build maintenance, configuration tools like Git. Yeah, it's more than that, though. It's explicitly addressing the entire cycle: building, creating software, releasing software, keeping software running, testing software, installing software,
Starting point is 00:30:38 collecting bug reports, managing all the issues with software, and then it kind of goes over, you know, it's like a cycle. And so DevOps is this idea that in traditional teams, all these different phases were different teams, and you had these walls between the teams. And then you were just like, okay, this release is ready for testing, and you throw this piece of code over the wall, and then the test team kind of looks at it, right? And so the idea here is that everyone does everything. And so you have a much more integrated sense of why something happens, right? So the programmer is also responsible for testing.
Starting point is 00:31:31 But there's lots of automation, so that testing doesn't really get in the way. It's actually much more efficient, because the programmer then sees, oh, I know immediately what went wrong, you know, and the tester doesn't have to make a case for why this is wrong. So it's a matter of being more integrated. You don't have the walls between these different functions of software development. And I like that. I like that a lot. I like test-driven development. I like testing. I test my code, but I hate working on these tools.
Starting point is 00:32:07 I just want to write the software so that my robot does what it needs to do or so that I get the data analysis back. I don't want to spend all the time dorking with containers and installing 97 packages and even my virtual environments, making sure they're consistent across systems. I do it because I have to, but I hate it. I just want to write the stuff I'm there to do. Right. And I think a lot of work that you see in those communities is based on that. People really hate using these tools. I don't hate using them.
Starting point is 00:32:45 I hate working with them. You shouldn't have to. If you are, there's something wrong. With me or the tool? With whoever put it all together. That's usually me. I'll cut that part out. ... the most important part. If you don't share your software, or, you know, how you run your software, then it's sort of pointless, right? But we actually do this within the university all the time.
Starting point is 00:33:32 So I have, you know, PhD students who mentor master's students or undergraduate students, and they have to share the infrastructure they just built themselves with the students. And the students come in, and often, you know, maybe they do a term project that's like 12 weeks, right? In order for them to be productive in 12 weeks, they can't spend 10 of those weeks just to, you know, get stuff running. Yeah. And so even though it's slightly more work for the PhD student to prepare this, once it's done, it actually enables the PhD student to share their work in a much more productive way. And so that's where it really pays off. So the problem is you're spending all the time doing that kind of work to share with yourself.
Starting point is 00:34:26 Yes. So you're not getting all the benefits. I share it with other people, but they don't appreciate it. Yeah, they don't appreciate it. So going back to Ivo, his program is called Black Swan and his website is falsifiable.us. Practical falsifiable research. Yeah. How does that really... So it also has to do with the name of his framework, which is called Popper, right? And it's named after Karl Popper, who basically said any science that's not falsifiable is no science. And so you essentially have to show how you can prove or disprove, you know, what would disprove your hypothesis.
Starting point is 00:35:15 And if you don't have that, then your hypothesis is basically… Yeah, yeah. Yeah, exactly. And that's the thing, right? So for a lot of the software that we write, or the things that we try to prove in computer science, computer systems, and other places as well, it's not very clear what actually tests the hypothesis, right? And we did this, we looked at a lot of papers, and you see all these graphs, you know,
Starting point is 00:35:54 and then we would ask ourselves, what would reproducibility actually mean, right? Like reproducing this graph, how exact does it have to be? And the real question is actually, what was the goal of this graph? And when you know the goal of the graph, then you can sort of say, okay, you know, if it's linear, then you proved something. If it's not linear, it disproved it or something. But that doesn't mean that you have to have exactly that graph. You just have linearity or not, as an example, right?
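The point about stating a graph's goal can be made concrete with a small sketch. It assumes the stated goal is "throughput grows linearly with node count" and uses the R² of a least-squares line as the test; the threshold of 0.99, the function names, and all the numbers are illustrative assumptions, not anything from Popper or Black Swan.

```python
def r_squared(xs, ys):
    """Coefficient of determination of a least-squares line through (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("data has no spread, so linearity is undefined")
    return (sxy * sxy) / (sxx * syy)


def supports_linear_claim(xs, ys, min_r2=0.99):
    """True if the data is consistent with the goal 'y grows linearly in x'."""
    return r_squared(xs, ys) >= min_r2


# Hypothetical numbers: node counts vs. throughput in a paper and in reruns.
nodes = [1, 2, 4, 8, 16]
published = [10.2, 19.8, 41.0, 79.5, 161.0]
rerun = [12.0, 25.1, 47.9, 99.0, 195.0]      # different values, same shape
quadratic = [1.0, 4.0, 16.0, 64.0, 256.0]    # a rerun that would disprove the claim

print(supports_linear_claim(nodes, published))
print(supports_linear_claim(nodes, rerun))
print(supports_linear_claim(nodes, quadratic))
```

The rerun does not match the published values point for point, yet it passes the same check, which is the sense in which reproducing the goal is easier than reproducing the exact graph.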
Starting point is 00:36:25 So when you know the goal, then reproducibility is a lot easier. The problem is that if you look through these papers, the goals are never stated. And, you know, these are like top conferences. So there's a whole set of habits that kind of snuck into these communities that actually makes reproducibility really hard. And so that's sort of what now a number of communities want to roll back. Conferences kind of wake up to that and say, you know, we need to have the goals of your experiments. We need to have the artifacts. We need to have all that. And they make it actually mandatory. So Supercomputing, for instance, this year, for the first time,
Starting point is 00:37:09 you have to have the reproducibility artifacts somehow mentioned in your paper and for people how to get them. You know, that's what I should have called that last chapter of the paper I wrote last week, reproducibility artifacts instead of links and data. What you're saying about goals makes me think a lot about goals in software and knowing what you're building and requirements. You mentioned DevOps and Agile, and that Agile
Starting point is 00:37:43 isn't known for setting down your requirements ahead of time. It's more known for not doing that. Is there any discrepancy there, or is it just different tools at different points? Different tools at different points, because I think that, you know, in research it's often not even clear what the requirements are when you start a project, a research project, right? It's this nebulous thing. I wonder if Agile development just does not apply to research, because one of the tenets of Agile is having something deliverable and working at regular intervals, and research is like, a lot of it's sitting and staring at the paper going, what have I done? Yeah, what have I done? I mean, I spent at least one summer doing that.
Starting point is 00:38:30 Some experience. But I think that, you know, research in practice is paper deadlines, right? And you have to sort of figure out what, and so, yeah, there's always this part that, you know, explores pie-in-the-sky kind of stuff. But then you also have to bring it down to something concrete and actually deliver, right? And that's where sort of the more formal aspects come into play. Okay, so I don't know. So Ivo was a student and then he joined Cross as your graduate student? Or did he come into the school as part of the Cross system? So he came into the school as a PhD student.
Starting point is 00:39:15 So he was basically accepted into the PhD program. He was before that an undergraduate student. So he became a PhD student when he came to UC Santa Cruz, and then he actually thought he might just graduate with a master's. And then I convinced him, no, you should really continue doing a PhD, and luckily I was able to convince him. And he was, you know, looking for a PhD topic, and then, just independently of that, I started Cross in 2015. And, you know, it just happened to be a really good project, because it was strategically important for Cross to actually have a platform where we could make sure that the open source software that we produce was available to people. And so he kind of helped that with his project.
Starting point is 00:40:24 But he basically did his thesis, and this Black Swan project is an incubator project. So he's now an incubator fellow, which means even though he graduated, he actually stays at UC Santa Cruz to build a community around Black Swan. So that's the key thing that we're providing at Cross, right? So he is actually now able to turn what he did as a PhD student into open source software leadership by creating a project that gains in popularity and becomes a real thing, right? It's like, there's Black Swan,
Starting point is 00:41:09 right now it's fairly unknown, but it's growing. So there's more and more. Yes, he's already, I think, given eight tutorials at various conferences, and people are starting to really take notice. How do you define success for an open source project? That's a really good question. And I actually asked Doug Cutting, who's also part of our advisory committee. Doug Cutting, you know, he's like the inventor of Hadoop and a number of other big data projects that really defined the big data industry, the open source part of the big data industry. Embedded software over here. Yeah, yeah.
Starting point is 00:41:55 No idea. So basically, you know, he was very much involved in the Apache Foundation, was a board director or director of the board of the Apache Foundation. And sort of had to deal with a lot of open source software projects and trying to figure out, they have also an incubator, you know, are they ready for the incubator? Are they ready to graduate from the incubator? And he said the key metric is the diversity of contributors. So it's not the total number of contributors, but the contributors have to come from different organizations. And that sort of gives the project robustness. And for a number of reasons, simply by redundancy, there's always multiple people contributing to a project, and if one contributor somehow moves on, there are others who continue
Starting point is 00:42:56 to contribute. The other thing, though, is that, you know, it's also from a legal aspect. Open source software licenses are based on copyright, and if the copyright is mixed among different, you know, administrative domains, then it's much harder to kind of pull it back, because you would have to get everyone to agree to give the copyright to one person so that they can change the license. And so if you have diverse copyright ownership,
Starting point is 00:43:35 then there's much more of a guarantee that the open source software license will always stay and will remain the same license. And so that gives you, you know, much more of a guarantee that this is going to stick around. But I think also, you know, having contributors from different organizations means there's wide interest and that it's likely to grow. You don't worry about the number of people actually using the product? That's much harder to measure. The number of people using the product, I mean, if you have a number of contributors who spend significant time on this project, then something must compel them.
Starting point is 00:44:26 I kind of worry less about that. You know, I think that this seems to be strategically important enough that, you know, they spend time on it. And I think, yeah, I don't know of a project where you have diverse contributors but then nobody really knows what to use it for. I don't know. That seems to be the number one metric.
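As a rough illustration of the contributor-diversity metric described here, the sketch below counts how many distinct organizations contributors come from and how dominant the largest one is. The input format, the function name, and the sample names are all hypothetical; this is not the Apache Foundation's or CROSS's actual graduation criterion.

```python
from collections import Counter


def org_diversity(contributions):
    """Summarize how contributors are spread across organizations.

    `contributions` is an iterable of (contributor, organization) pairs,
    one per commit or patch. Each contributor is counted once, under the
    organization of their most recent entry.
    """
    org_of = {}
    for contributor, org in contributions:
        org_of[contributor] = org  # later entries overwrite earlier ones
    if not org_of:
        raise ValueError("no contributions given")
    per_org = Counter(org_of.values())
    top_org, top_count = per_org.most_common(1)[0]
    return {
        "contributors": len(org_of),
        "organizations": len(per_org),
        "top_org": top_org,
        "top_org_share": top_count / len(org_of),
    }


# Hypothetical commit log: four people, three organizations.
log = [
    ("alice", "UniA"),
    ("bob", "UniA"),
    ("carol", "CorpB"),
    ("alice", "UniA"),
    ("dave", "CorpC"),
]
summary = org_diversity(log)
print(summary)
```

A project where one organization holds nearly all contributors would show a `top_org_share` close to 1.0, which is the fragile case the redundancy and copyright arguments are about.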
Starting point is 00:44:56 Of course, when you find out that they're just using it as a hobby and it's just very entertaining to work on it, then, yeah, you have to also wonder what the project does. But maybe some professor is using it as a school project. Yeah, I don't know. For me, that seems to be sort of making sense, that, you know, having contributors from different organizations, the more the better, right? And with the three incubator projects that we have right now, you can already see that you have different contributors, and there are more coming. So as Ivo or as any of your students, they come in with the idea of a PhD, they join Cross, Cross pays them a bit like a graduate student, and then they become incubator people and they get paid more and their ideas get sold.
Starting point is 00:45:57 How does this work? So, we have basically, so the whole, you know, cross is basically run as what's called an industry university collaborative research center. And it's a very well-known structure that, you know, the National Science Foundation has created many, many years ago. And it's, you know, very, the industry really knows the structure. And so we basically get sustained by industry members. And we have currently five industry members and they give us each $100,000 a year. And so we have currently a budget of $500,000.
Starting point is 00:46:41 And we basically fund research projects and incubator projects. Every six months we review every one of those fellows, the research fellows and the incubator fellows of those projects. And it's
Starting point is 00:47:00 you know, the research projects are just research projects. They don't necessarily produce software. They just have a plausible path to become a successful software project. But actually, their metric is just to do research papers and do all the things that a PhD student is supposed to do. That switches in the moment when you actually get accepted into an incubator project. And it's, by the way, not every research fellow gets
Starting point is 00:47:29 accepted as an incubator fellow. The incubator fellows are higher salary, so much more expensive, but they also have to make this tough decision. Okay, they graduated with a PhD, should they go into industry
Starting point is 00:47:45 make lots of money, or have a fairly low salary at the university, but then basically they can do exactly what they want to do, and then maybe work on a startup company or something, right? And they still get paid, which is kind of unusual when you work on a startup. And so this evaluation every six months kind of ensures that the industry is sort of still with those projects. They have only recommendation power, so they can't really decide what to fund. But they give us recommendations, and we want to keep them happy so they can continue to fund the center. And then there's an advisory committee. And then we have faculty. James Davis is, for instance, on that advisory committee.
Starting point is 00:48:37 As well as Sage Weil and Doug Cutting and Karen Sandler. She is the executive director of the Software Freedom Conservancy. And so they kind of helped me decide what we should fund, what is strategically important, and so forth. And so we have to sort of see what are resources, what are the most promising projects. And we kind of select those, you know, what are the best ones. And so it's very competitive. But yeah, so the incubator projects are basically strictly people who've graduated with a PhD and have something that they already built and they want to really spend time growing a community around it.
Starting point is 00:49:30 So another of the projects in the list you suggested I look at was Skyhook by Jeff Lefevre. Lefevre. Lefevre. The fever. And it's about using programmable storage in Ceph to make Postgres more scalable and elastic. So I knew the word database. Yeah. Could you parse that down for, say, a public audience? Yeah. So there's a lot of...
Starting point is 00:50:05 So I think the best way to describe it is there are certain commercial databases. They're very expensive. That's cute. Oracle. Yeah. I wasn't going to name Oracle, but yeah. And then there's open source software databases like Postgres. And Postgres has a very good community.
Starting point is 00:50:28 It's very strong. But they're kind of, you know, when we actually... So one of the things that incubator fellows who apply for an incubator fellowship have to do is they have to sort of show us that their project has real traction. That there's an open source software community out there that is really interested in what they do or what they propose to do. And so his job was when he was applying for the fellowship was to find this community and he identified the Postgres community
Starting point is 00:50:57 as something that, you know. And so what Postgres is, is basically it is great. It's an open source software database. People are using it. But then they kind of hit a wall when they go up to a certain scale. And there's a big gap between Postgres and a commercial database. And so what he's trying to do by using the Ceph open source software storage system is to narrow that gap.
Starting point is 00:51:30 And that's really what the project is about. And he found that databases have this very static kind of hardwired relationship between a storage system and a database, right? The database and the storage system are kind of hardwired. And he's kind of breaking that so that you can now expand the storage system and shrink it, you know, just like in a cloud, to make it elastic. But to do that, he had to basically use some advanced storage systems abstractions that we call programmable storage. And what that actually means is that some of the functionality of a database, like a filter or a projection or something, you can actually offload to the
Starting point is 00:52:20 storage system. And so you can program the storage system to do that for you. And so now you have a database, and then you have lots of storage nodes out there. And you can sort of farm out those projections and filters to those storage nodes. And then things are a lot more scalable. So that's really what the project is about. And how far along is this? You said incubator, so this is beyond research project.
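The offloading idea described above, pushing filters and projections down to the storage nodes, can be sketched with a toy model. Everything here is an assumption made for illustration: the `StorageNode` class, the `scan` and `query` names, and the sample rows are hypothetical and are not the Ceph, Skyhook, or Postgres APIs.

```python
class StorageNode:
    """Toy storage node that holds rows and can run simple work locally."""

    def __init__(self, rows):
        self.rows = list(rows)

    def scan(self, predicate, columns):
        # Filter and projection run next to the data, so only matching
        # rows, trimmed to the requested columns, leave the node.
        return [
            {col: row[col] for col in columns}
            for row in self.rows
            if predicate(row)
        ]


def query(nodes, predicate, columns):
    """Coordinator side: farm the scan out to every node and merge results."""
    results = []
    for node in nodes:
        results.extend(node.scan(predicate, columns))
    return results


# Hypothetical table split across two storage nodes.
nodes = [
    StorageNode([{"id": 1, "temp": 18, "site": "a"}, {"id": 2, "temp": 31, "site": "a"}]),
    StorageNode([{"id": 3, "temp": 35, "site": "b"}, {"id": 4, "temp": 12, "site": "b"}]),
]
hot = query(nodes, lambda row: row["temp"] > 30, ["id", "temp"])
print(hot)
```

In this model, growing or shrinking the storage layer is just adding or removing entries in `nodes`; the coordinator code does not change, which loosely mirrors the elasticity argument in the conversation.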
Starting point is 00:52:49 Yes. So he's been on this for two years. And he is right now working very hard to create a business case. I mean, not business case, sorry, a business model, and also working hard to actually create the community. And one of the things that I want to sort of mention: this is our first incubator project, right? So this is the first one of the three. And we haven't actually graduated a single incubator project, right? All we know is Ceph was a huge success, but that was before Cross.
Starting point is 00:53:29 And so we haven't really closed the loop of like, well, is Cross really working, right? And one of the things we discovered is that when a PhD student worked for years on their thing, right, they have no idea what it means to actually grow a community. They have never done community work, right? They have not even distributed software before. And so one of the big learning processes is to basically say, okay, what if I had a big community? And that's sort of the chicken and egg problem, because you don't have one yet. But there are always like all these students out there that really need work. They need term projects.
Starting point is 00:54:14 They need master's project topics. You know, even PhD students need a topic idea. And so it turns out these incubator projects are really good sources for things where they have sort of discovered, that's a really hard problem, this is a really small problem that can be done in 12 weeks. And so they can constantly farm out these things to the students in the university. And in the process, the incubator fellow actually learns how to manage that and how to build the infrastructure to actually do that, to have 10, 12 students working for them all on the same software. How do you coordinate all this, right? And that's where DevOps comes back in.
Starting point is 00:54:57 It's like, how do you do that, right? And so in the process, that's sort of – and then some of these students actually graduated and then continued to contribute. So they are now actual contributors, right? And so it's a really fantastic way to seed these communities using essentially the riches within the universities of all these students who need to get something done, right? And so that actually turned out to be really good. So I don't actually know how this is going to go forward. We will sort of see when is it at the point when he gets some joining because of this project, you know, and they want to sort of know what's happening next, right? And so we're still figuring this out. That's a scary place to be.
Starting point is 00:55:53 Does he have a budget to pay the other students? Yes. So basically the way we do this is most of this, actually most of the work is actually free in some sense, right? So you don't get paid for doing a master's project. You don't get paid for doing a class project, which is actually interesting by itself because if you don't get paid, then whatever you did, you own, right? And so it's truly an open source project because the copyright is actually with the student.
Starting point is 00:56:29 And so that's sort of a beginning of this development that we talked about earlier is that, you know, you're starting to sort of distribute the copyright of open source. Do you have a preferred open source copyright license? We do have a few. It depends on the community. We like
Starting point is 00:56:53 the Apache license, we like the BSD license, we like the LGPL license. Noting one you haven't mentioned yet. The GPL license. So actually, yeah, you know, it depends on the community, really. It's actually a complex process, because, you know, UC Santa Cruz is a public university, so it's bound by certain legal requirements on what IP the university is allowed to give away. And actually, it's never allowed to give anything away. And so we can't actually
Starting point is 00:57:37 guarantee even that whatever comes out of the university is under an open source license, strictly speaking, right? But we can promise. And so it's this interesting thing where, when you look at the membership agreement of companies who are members of Cross, it says we will open source everything that comes from the center, subject to the restrictions that the university has. And so that is actually, you know, strictly there's like a lot of legal stuff here that we could not guarantee. But then the industry is looking at my record and looking at the record of the university and says, you know, yeah, we believe you that you want to do this. And so we're okay with it.
Starting point is 00:58:27 Right. And so, and so they are willing to, to go for it. BSD came out of UC Berkeley. So what is the, did they do something special for that? Sorry, what would. BSD came out of UC Berkeley. Yeah. Yeah.
Starting point is 00:58:43 So BSD was a rewrite of Unix, right, basically. And I think back then, you know, the idea of licensing open source was very immature. All the GPL stuff came in later, right? And I think, yeah, I think. Yeah, it did. And so I think that, you know, but in many ways, Berkeley continues to be a pioneer here, right?
Starting point is 00:59:14 What Cross gets mostly compared with by the industry is the RAD Lab, the RISE Lab, right? Now, before that was the AMP Lab. And so this is a lab, the systems lab at UC Berkeley that created enormous successful software. And Spark is maybe one of the more later software. That basically is an example of somebody who wrote a PhD thesis and then turned it to a startup company around an open source software project. Moving on to the third project, Tracery by Kate Compton,
Starting point is 00:59:51 which I don't want to talk too much about because she's going to be on the show in a few weeks. But I guess we should at least introduce it. Will you describe what Tracery is? Yeah. So Kate Compton was like one of those applications. Twice a year we have a call for proposals, and people send in applications about incubator projects. And Kate's proposal came in and we were sort of like, whoa, really cool stuff, but very different, right? And I didn't know whether this would actually be interesting to any of the members, the industry members, because they're all like storage device makers. So here's what the project is. Kate is looking at casual creativity. That's what she's interested in. And one of the examples of casual creativity is this idea of making it
Starting point is 01:00:59 very easy for people to create bots, Twitter bots in particular. They're little programs that insert conversations or conversational pieces into the Twitter stream. And it turns out that it caused a huge followership, lots of followers who actually are artists and poets and people who emphatically don't think of themselves as programmers, but somehow thought this was compelling enough as an art form that they just threw everything into actually creating something new. And so she created this programming environment for those people who have no background in computer science to actually program these little bots. And that's really what Tracery is: this environment for building these little bots that, you know,
Starting point is 01:02:06 do things. And there's, it turns out, a huge creative space. And it has this really cool feedback, because they do unexpected things, they react to conversations. And you can sort of see, you know, how your creation behaves in very unexpected ways because of this complex environment of tweets.
Starting point is 01:02:30 And so it's an ingenious way of engaging people in something that they would never think they could do, programming. But there's so much more to this. You know, her thesis defense was like this really eye-opening event for me, to see what casual creativity really is. And I really look forward to hearing her on your show. She sent me a link to her defense. I'm looking forward to watching it. I'm familiar with Kate and Tracery because she gave a presentation on it. And I wrote a little Twitter bot.
Starting point is 01:03:15 So I suspect there'll be more about that. It was easy to write, but I have a programming background. Let me add a little bit also about why her incubator project actually got accepted. That was actually very interesting to me. I was really concerned, as I said, because all the industry members are device makers. The thing that they really emphasized, after I asked them, you know, why did you like this project, was that, first of all, she has this amazing track record of getting things done, of making things successful, right? And that is, for companies who fund Cross, actually the most important thing: to show that this works, right? To show there are examples where, when industry supports universities to do things the open source software way, really amazing things come out of it. And the second thing was that Kate's project is actually a wonderful example of a community service.
Starting point is 01:04:23 And the companies really like to point out that they are part of such a thing. So it's a very positive thing that they want to be part of. You mentioned that many of the funders are device companies, but you haven't talked about any device projects. Yeah, that's interesting. So the reason why they joined Cross is because of the history we have with Ceph. And Ceph was a huge deal or is a huge deal for these companies. Ceph is a storage system. So you need to actually, in order to use a disk drive or a flash device or, you know, any of the storage devices. You need to have a storage system. You can't just put a disk on a table and expect it to work. You have to actually have something around it.
Starting point is 01:05:12 And so the industry, how the industry worked out before you had a storage system, open source software storage system, was that these device makers had to wait for the vendors to essentially allow them to sell into the market. Because only through the storage system's vendors they could actually sell storage devices. And so,
Starting point is 01:05:47 an open-source software storage system like Ceph is suddenly opening this market directly to them. You can just get the storage system for free and then you just have to buy those storage devices for yourself and put them into the system that you run this open source software system on. And so they could then see, okay, you know,
Starting point is 01:06:10 when we come up with a new storage device, we just have to make sure that it actually works really well on Ceph. And so they could actually put their own software engineers and contribute to this open source software storage system so that these devices work well in that system. And so it's a very different dynamic
Starting point is 01:06:34 that essentially broke open the market and increased the market for them. And so it's a very great advantage for storage device makers to sort of support open source software systems and universities, in the hope that will happen again, right? And so one project that I didn't mention is basically a new way of devices and how they interact with computers. And they're very interested in that.
Starting point is 01:07:07 So we create sort of an ecosystem of open source software as part of CROSS and part of the research we're doing at the university that's very promising for them, to explore new markets for them, right? Markets that are not blocked by existing systems vendors. That makes sense. But it seems like you're reaping the benefits of Ceph going forward with open source projects
Starting point is 01:07:37 that are more varied. And it's because companies are hoping something will come out that will be interesting or useful to them. Yeah, I think it's kind of like playing the same lottery numbers. Yeah, a little bit. But on the other hand, you know, I think the Ceph project kind of demonstrated that students build amazing things as part of their PhD projects. And most of the time it's thrown away. And, you know, nationwide you have 45,000 or, yeah, I think that was right, 45,000 PhD students graduating every year just in the STEM fields. And, you know, there's likely going to be quite a few cases where they build amazing systems and they get thrown away, right?
Starting point is 01:08:34 And so, if you think about how much innovation would be possible if you would just give those students a little bit more time to actually bring those infrastructures into the world and make them usable for other students or for other people, for the industry, you know, how much innovation do you get from that, right? And I think that's really what we're out for, and that's what the industry kind of is hoping for, right? And then there's all this other traditional stuff, right? So the university fundamentally, research universities produce students and papers, right? That's what we do.
Starting point is 01:09:15 And so even if those projects don't turn out well, they get to know really good students and there can be potential recruits in the future. Failed research is not bad. Right. There's goodness there. It doesn't feel good, but definitely figuring out which paths not to go down is very
Starting point is 01:09:36 useful. Right. Well, I think we have kept you for quite a while and it is a sunny day in Santa Cruz. So, let me close it up with, do you have any thoughts you'd like to leave us with? Yeah, I think that, you know, it's important for I think students to realize that they have done a lot of
Starting point is 01:10:01 really new and cool things when they are at the university. And often, a lot of students, when they graduate, they have a piece of paper, right, a master's or a PhD even. And they have sort of lived for a very long time to actually get there, right, to have a master's or have a PhD. But then they often haven't really thought about what comes next, you know, maybe a job or whatever. I see this again and again. And so coming out of, you know, and putting yourself into a situation where you really leverage what you've done, I think it's done far too little, right?
Starting point is 01:10:47 And one of the problems is it's actually hard. It's hard to do. I mean, a very small fraction of students, of PhD students go actually into academia. It's very competitive. And that's really the only path where you can really leverage the work you've done. When you go in industry, yeah, they basically just know you're very smart, right?
Starting point is 01:11:08 And maybe they know your specialty area, but you have very little control of where you're actually going to work. I mean, and what kind of part of whatever field making, giving, you know, students more opportunity to sort of, you know, become an open source software leader. And there's many examples where when you're an open source software leader, you get a higher salary if you get hired at the companies. And you just are in a very different kind of position of power than if you just graduate with a PhD and like, okay, where do I go next, right? And so I think that's sort of my, encourage people to sort of take the destiny in their own hands
Starting point is 01:12:15 and sort of leverage what you've done and appreciate what you've done and try to build on it. That seems like good advice even for podcast people. Thank you. Our guest has been Carlos Maltzahn, adjunct professor of computer science and engineering at UC Santa Cruz. He's also the founder and director of the Center for Research in Open Source Software.
Starting point is 01:12:42 That's CROSS. I'm sure you can find it online, but of course there will be links in our show notes. Thank you for being with us. Thank you. Thanks, Carlos. Thank you also to Christopher for producing and co-hosting. And thank you for listening.
Starting point is 01:12:57 You can always contact us at show at embedded.fm or hit that contact link on embedded.fm when you're checking out the show notes. And now a quote to leave you with from Eric Raymond, who wrote The Cathedral and the Bazaar. It may well turn out that one of the most important effects of open source's success will be to teach us that play is the most economically efficient mode of creative work. Embedded is an independently produced radio show that focuses on the many aspects of engineering. It is a production of Logical Elegance, an embedded software consulting company in California.
Starting point is 01:13:40 If there are advertisements in the show, we did not put them there and do not receive money from them. At this time, our sponsors are Logical Elegance and listeners like you.
