Disseminate: The Computer Science Research Podcast - Harry Goldstein | Property-Based Testing | #55

Episode Date: June 25, 2024

In this episode, we chat with Harry Goldstein about Property-Based Testing (PBT). Harry shares insights from interviews with PBT users at Jane Street, highlighting PBT's strengths in testing complex code and boosting developer confidence. Harry also discusses the challenges of writing properties and generating random data, and the difficulties in assessing test effectiveness. He identifies key areas for future improvement, such as performance enhancements and better random input generation. This episode is essential for those interested in the latest developments in software testing and PBT's future.

Links: ICSE'24 Paper | Harry's website | X: @hgoldstein95

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast. As usual, Jack here. Welcome to another installment of our Cutting Edge series. And today we are going to be talking about property-based testing with Harry Goldstein. Harry recently finished his PhD at the University of Pennsylvania. And in the fall, he will start as a postdoc at the University of Maryland. Welcome to the show, Harry. Thank you. Great to be here. Great. So as I mentioned there, we're going to be talking about property-based testing today. So this work is obviously...
Starting point is 00:00:54 The reason why I invited you is you have a paper that was published at the International Conference on Software Engineering, and it actually won a Distinguished Paper Award, so congratulations for that as well. Thank you. And cool. So before we do talk about that, let's learn a little bit more about yourself. Can you tell us how you got interested in research, and why property-based testing? Yeah. So I'm interested in research because, basically, I wanted to keep doing school for as long as I could keep doing school. I thought I was going to be a software engineer at one point during college, and that was sort of the path I was on. But my undergrad research advisor, Adrian Sampson,
Starting point is 00:01:37 who's just fantastic, by the way, suggested that I try to get a PhD, and I'm really glad I did. It's been a really fun process. And then why property-based testing? So my advisor, Benjamin Pierce, was working on this before I got to Penn. My soon-to-be postdoc advisor, Leo Lampropoulos, was doing property-based testing with Benjamin before I got there. And I was originally a little bit resistant to it. I thought, I don't know, is this for me? I want to do type systems. I want to do other more PLE stuff. But I ended up really enjoying working on property based testing. And it's been really fun and rewarding because there's some cool applications and stuff like that. So yeah, that's my journey. Yeah. I mean, I did the same thing once where I did a PhD. I just wanted to stay in school
Starting point is 00:02:29 as long as possible and just stay away from the real world as long as I can. But yeah, that's cool. So yeah, property-based testing then. So we've teased it for long enough. Tell the listeners, what is property-based testing? Yeah. So when you go to test code, if you're a software engineer generally, you're probably going to do some kind of unit testing, and it's probably example-based unit testing. So you write an input, and you write an output, and you check that your program actually, when it takes that input, produces the output. This works, but it's kind of time-consuming and error-prone. So it's time-consuming because you have to write down each example that you want to test.
Starting point is 00:03:09 And it's error-prone because if you forget examples, you've got nothing checking your work. You've got no way to know that you've covered all the important cases. And so property-based testing takes this idea and kind of generalizes it. Instead of having examples, you have some logical specification. So for example, if you're trying to do like a sorted list, so you've written your sort function, and the property that you might want to make sure is true is that when I sort any list, the thing I get out is ordered properly. This is a property that works for any input list. So you could take any
Starting point is 00:03:47 list, sort it, and check that it's ordered when you get it out. So that's the property part of property-based testing. And then the testing part is we generate lots and lots of random inputs to that property to check that the property's true or to get confidence that the property is true. So we'll generate lots and lots of lists, an empty list, a list with five elements, a list with that's in reverse order, you know, all sorts of lists and check that for each one, the property holds. So the result is ordered. And this technique allows us to basically check higher level specifications a lot more thoroughly than sort of standard example-based unit testing would let you do. Nice. Yeah, I can really agree with you on the statement of that. Writing kind of the example-based unit test is very, very time consuming.
Starting point is 00:04:40 In the day-to-day job, it's like, oh, no, I've got to write all these tests. Then the fun implementation, I have to test it. I it i mean technically we should probably do that in reverse right i should do the tests and then write the implementation but hey come on i mean that's what they're saying anyway cool that's awesome so kind of given that context and given that background what was the goal with your paper and your paper was called just i don't think i said it at the top of the show property property based testing in practice. So yeah, what was the elevator pitch for this work and what were you trying to achieve with it? Yeah. So this work, my research world is split between programming languages, human computer interaction, and software engineering. I sort of work on all three.
Starting point is 00:05:22 And I started off just doing programming languages stuff and, you know, with applications and testing. But I got to Penn was focused on property-based testing for proof assistants. And so this was a very different context than software engineering where I was interested. So the idea was basically, let's do some human factors work. Let's get out there, talk to some people who use property-based testing to understand how they use it and what they struggle with when they're using it. Because that's both a good source of new research questions and good justification of research questions that we had already been going after. So that's the elevator pitches that really we went out and talked to users of property-based testing and understood how they use it and what they need from it. Yeah, I love the way that you've kind of got your research kind of crossing those three main areas.
Starting point is 00:06:34 And because you've kind of got that sort of, it's quite easy, I guess, to almost do your research if you're kind of quite low level in like in your own little bubble and you've kind of away from the real world. And then, but then you lose that kind of impact aspect of it, right? Like, yeah, I can make this tiny thing a little faster or whatever, or quicker, but you're losing that sort of connection with the people actually need this for one, like, is this even a real problem? So I love the way you've kind of, you've got that kind of a foot across those three camps there. I think that that's awesome. Okay, cool. So you kind of went out and you tried to kind of find out what people were doing with property-based testing and what maybe some interesting future research questions could be. So to do that, you did a study, right? And it's there. I have it written down here as the Jane Street study. So tell us about the Jane Street study. Yeah. So Jane Street is very supportive of programming languages research in particular.
Starting point is 00:07:26 So they've sort of always been on our radar as a company to keep in mind when going out into industry to do studies and things like that. And when we chatted with some folks at Jane Street, Jane Street, by the way, is a financial technology company based in New York. Jane Street was using property-based testing pretty significantly, but they weren't using it everywhere for everything. And so we thought this was a good opportunity to talk to folks who were pretty expert or at least experienced users of this tool, but not folks who thought that this was like the end-all be-all and the only thing that they wanted to use. And so we thought we could get a kind of
Starting point is 00:08:12 balanced perspective from them. Plus it was a population that we had relatively easy access to. So that was how we chose Jane Street in the first place. And then the actual research methods we used are a kind of thematic analysis, qualitative interview style of research. So we got on Zoom with a developer, spoke with them for about an hour, asked them a bunch of questions, kind of went through a script, but let them guide our discussion. And then we took those transcripts and did a process which is kind of annoyingly called coding, which isn't programming. It's a different thing. It's where you kind of go through and you pull up passages and you attach codes to them, which kind of map to themes that you think are
Starting point is 00:09:07 kind of important in the research. Okay. Is that a manual process as well, or is the tooling to help you with the coding? Yeah. So I use a tool called Delve. Shout out to Delve. It's basically like it takes the transcript and lets you select on a sort of sentence by sentence basis. And then it lets you manage your codes on the transcript and lets you select on a sort of sentence by sentence basis. And then it lets you manage your codes on the side and assign them. But like it is a manual process in the sense that like you are you have the tools to help, but you're the one that has to go through every sentence that the user said that the participant said and like assign themes or codes if there's something interesting.
Starting point is 00:09:45 It is a very time consuming and labor intensive process. And that's what I was going to say, but that took a while because how many people were in this? How many engineers did you interview as part of this process? So we interviewed, it was 31 people, 30 interviews. One interview had two people and an hour each. So that was, you know, 30 ish hours of content yeah that's uh that's a long time that's a few few late nights and long weekends i think of uh of getting through that right yes absolutely um but i think but i think it's worth it. I think, first of all, the things we learned, I think, are worth the trouble. But also, I think good qualitative research, this is the thing. People think that quantitative do their jobs and about human aspects of the work you're doing. So yes, it is time consuming, but as a result, you get some really cool data out of it. Yeah. You kind of, I guess,
Starting point is 00:11:05 pulled on this thread a little bit more of this sort of approach and this way of doing research. You have a section in your paper that you call threats to validity. So maybe we can talk about some of those and sort of the pros and cons of, I guess, the cons now
Starting point is 00:11:16 of doing this type of approach, of this type of study, sorry. Yeah. So, right. I've talked about the pros. Let's talk about some cons. First of all, cons of the methodology. So there's two big things that can happen that are a shame in this situation. So the first is quantitative research is important for some things. For example, you get to say, you know, if we did our recruiting and sampling correctly, then this generalizes,
Starting point is 00:11:47 right? This, we, you know, with probability, vanishingly small probability, this thing is true with a capital T. You can't quite say that with these kinds of methods. We have made observations which we think are certainly robust with respect to Jane Street and probably robust with respect to people like the folks that we talked to at Jane Street. But there's no good measurement to know exactly how far this goes. It takes a little bit of extrapolation and intuition to decide, do you really think this is going to go farther? And that's okay. I think doing this well still does tell you things that can be generalizable, but you just have to be a little bit careful about how true you believe these things to be. And then another threat to
Starting point is 00:12:40 validity is just when you're doing an interview study like we did and not observing people working, they can unintentionally lie to you about how they think about things. I'm not calling any of our participants liars, by the way. I'm just saying that when you ask someone, what were you thinking about when you wrote this test? They may not remember and they may guess at what they were thinking. So that's another kind of unfortunate threat. Yeah. Like retrofit and narrative. Oh yeah, I was thinking this, thinking that really, I was thinking about lunch probably. Yes, exactly. Cool. So let's, let's get onto some results then. So let's start off with the,
Starting point is 00:13:24 the kind of key findings from, from the on to some results then. So let's start off with the kind of key findings from the study then. Yeah. So we split our results into really two categories. We have observations, which are mostly meant to contextualize current and future research and just kind of clarify how PBT is being used. And then we had research opportunities, which were about either projects that we plan on embarking on or that we think others should, or like very concrete, like here's a thing that we can do based on what we learned. I think going through all of the opportunities and all of the observations is probably not worth doing. But when I give talks on this, I isolate a few. So I'll talk about a few now. And then
Starting point is 00:14:12 if you want to go back and ask about a few more, we can do that. Is that good? Yeah, that sounds great. Okay. So one really interesting observation, one important baseline observation is like, were people getting benefit out of this? And the answer is like, absolutely, yes. The folks we talked to were overall pretty happy with property-based testing as a tool. We even, we tried to find people who weren't. We asked our contact at Jane Street, who's a co-author on the paper, Daniel Dickstein.
Starting point is 00:14:44 He's great. We asked, can you find us folks who don't like property-based testing that we can talk to as well? And we found a couple, and they mostly said, yeah, this is a situational tool. This is a tool that I don't use everywhere. And that's fine by me. That doesn't sound like a... That's not disliking it in my mind i think all software tools are situational in one way it's part of a toolkit right like it's it's not kind of if i mean if you've got a hammer and everything's a nail right but if you want you want
Starting point is 00:15:16 a kind of a nice variety of tools therefore it's like a different what's the word i'm looking for anyway sorry continue the phrase has gone from my code. Now, how do I do property-based testing? Instead, what they said was, oh, I've got this thing that I want to make sure is true, and property-based testing is a good way to help me figure that out. So for example, sometimes developers had function and an optimized version of that function, and they wanted to make sure that the optimized version still did the same thing as before. So you can use the tools that property-based testing provides to write a property that says the outputs of this are equal to the outputs of that. And you can test that your optimization didn't change anything.
Starting point is 00:16:20 This is sometimes called differential testing or model-based testing. The testing literature has been talking about this for a long time. I'm not claiming that property-based testing invented it, but the tools that property-based testing provide make this easy to apply. So that was one of the kind of situations. There's more that we discuss in the paper of like, just we call them high leverage scenarios. These times when a developer notices something and all of a sudden now they can check this thing
Starting point is 00:16:48 that they noticed they wanted to check using PBT. I have a quick question on the actual context because obviously I work a lot with distributed systems. And do you find that PBT is useful in that scenario or it's kind of useful more in sort of I'm writing some nice simple application that I can kind of reason about quite easily in my head. It's the state space isn't as large in the sense of this value, various different components interacting with each other. So, yeah, it is. I guess what were the contexts that they're using it in at Jane Street?
Starting point is 00:17:20 Was it for that kind of simpler cases in a way? It's a great question. So there's a quote from the study, which I really like, where one of our participants says that property-based testing is good when you have a clean abstraction with a complicated implementation. And I think that this happens in all kinds of systems, right? In a distributed system, yes, you've got a lot of state, you've got a lot of communication, you've got a lot of complexity. But on a single node, you've got some core logic. And that core logic is probably pretty easy to understandT is really the most exciting, is when you sort of drill down to the core of an application, and you've got some code that lots of things are depending on, but they're depending on a pretty constrained API that you can say
Starting point is 00:18:15 things about. This is both an upside and a downside, because on one hand, it means that you can, you have tools to test these kind of nicely abstracted cores of applications. On the other hand, people don't always put in the trouble, put in the effort to nicely abstract those cores of their applications. So I see this, you know,
Starting point is 00:18:39 I see it as good incentive to have nice abstractions, but some folks don't think it's enough incentive right cool the the other thing that i kind of says from kind of my own experience of playing around with with pbts is the actual the writing of specifications is sometimes i find it really hard to express it in a way sometimes of kind of like what i want and that actually that process of actually writing a specification is in itself like time consuming and hard right so kind of what were your findings around that sort of area as well yeah so this is this is kind of gets back to what i was saying about the high leverage scenarios is that we didn't find so much of jane street folks going
Starting point is 00:19:23 out of their way to specify some piece of code that they didn't have a specification in mind for. They were using property-based testing to test a specification when one was obvious to them. I don't necessarily think this is a good thing. I think that we should have, and I'm thinking about working on, better tools to help developers come up with specifications when they want a specification for their code. But I think that it's a really hard problem because, first of all, just coming up with the specifications hard, but also the specification is really a core part of the trusted computing base of an application, they call it, like the part that we really need to believe in. Because if your specification is wrong, then, you know, all bets are off. Right. You really need that to be right.
Starting point is 00:20:16 And so we can't do things or folks might try, but I don't really love the idea of like make chat GPT give me a specification. Like even if that worked some of the time i don't i don't buy it i don't buy that it's robust enough and so yeah yeah on that i can never remember which we are on these false negatives and false positives are but either way you don't want one of them at all you want to be 100 like correct and yeah but exactly cool let's let's let's talk about uh the, the other kind of another key component of PBT and the, the, the, the test generation, the generation of test generation of test data.
Starting point is 00:20:52 There we go. Yes. Yes. So tell us what you found out about that. And, and yeah, what, what did the Jane Street folks, how did they find that part of PBT? Yeah. So this is, this is what I have worked on most on the kind of PL side of my PhD. And I think the thing we found, which I don't think is terribly surprising, but is kind of important to know, is that at the point when developers are trying to apply PBT, they don't have the time or energy to think about the generators. They write their spec that was maybe hard and took a lot of thought. And now they want to test the spec. They want to just do the thing.
Starting point is 00:21:33 But in property-based testing, you do need to be a little bit careful about how you do that random generation. Because suppose, so that the sorting property I gave you before, that one's fine. But maybe you have one that says, if I have a sorted list and I insert using, you know, kind of insertion sort function, if I insert an element into that list, then the new list is also sorted. Well, in order to test that, you need sorted lists. You need to start with a sorted list in the first place. And coming up with a generator of only sorted spoke with kind of reiterated, like, you know, when I'm in those situations, it's tricky. It's tricky to take the time and energy to step away from what I consider the testing process and, like, write this extra infrastructure. Yeah. I just, there's a quote in there, in this section of the paper that made me want to first read it.
Starting point is 00:22:43 Do you know which one I'm going to say? I think I do do yeah anyway well it's got some little stars asterisks in the word yeah yeah yeah so this is the ppx quick check it's on pay for the interested listener oh there's no the page numbers are there's no i say it there it's on page seven i believe first paragraph yeah i don't know if i if i say the word out loud maybe i have to i can't click pg on the podcast anymore so maybe i don't want to get so i don't want if I say the word out loud, maybe I can't click PG on the podcast anymore. So maybe I don't want to get silly. I don't want to say it, but like it says fucking amazing. Yeah. We're all out.
Starting point is 00:23:11 We can say the naughty words. I love that quote. I think, yes. So to give a little bit more background on sort of what happened there. So when you want a generator and you have a data type for the thing you want to generate, there's a PPX, a kind of macro in OCaml, which will just give you a generator for things of that type. And this means the developer doesn't have to think at all about the generator. They just get one for free. And one of the
Starting point is 00:23:46 participants said, yeah, that it was effing amazing. And I'm so, so proud that I got that past my co-authors. I was like a little bit worried that they were going to veto it, but they didn't. And none of the reviewers said anything. And I just thought it was great. I do have to say, though, there is one thing that isn't effing amazing about these things, which is that in the case that I said before about if there's kind of a precondition of the property, if the property says, if a list is sorted, then something, those PPXs don't take that into account. And so if you're not careful, you can generate a million lists,
Starting point is 00:24:26 but only 10% of them are going to be sorted. And that 10% is going to be probably very short, right? Because it's easier for a short list to be sorted by accident. So one thing we highlighted in the paper is that developers were often overconfident about the quality of the testing they were getting from the generators that came out of these PPXs. And I think this is a big problem when it comes to evaluating and understanding the kind of impact of a test suite with property-based testing, because it really, really depends on the performance of the generator. Yeah, kind of, we were kind of joking about chat GPT being not the best use for specifications. How about the test data side of things? More applications there maybe,
Starting point is 00:25:17 or that type of sort of machine learning approach could maybe be more useful in there? Yeah, this is a great question. So there's been work by a bunch of folks, including Caroline Lemieux, and I forget who else is working on that right now. But there's a group that has worked on this before where they're trying to get LLMs to generate inputs directly. My understanding of that work is that it's expensive to get the LLM to give you inputs. And it's also the LLM is trying to give you things that are predictable, like that's how LLMs work. And so there's a risk that it will give you inputs that are too predictable, whereas you really want
Starting point is 00:26:03 something random. You want edge cases, right you really want something random. Edge cases, right? You want to test the full space. If it's going to give you the middle every time, then yeah, I want the edges of the bell curve, right? Exactly, exactly. I am interested in exploring and have been thinking about using the LLM to produce the generator
Starting point is 00:26:22 because if we could validate that the generator it produced was good, then there's a little bit more flexibility there. But that's really a program synthesis problem and has all of the complexities of a program synthesis problem. So it's a thing that I think certainly could be a thing down the line, but not yet. Cool. Nice. Whilst we're kind of running through the sort of results areas, there's two last bits, understanding failure and understanding succession, your paper. So let's start off with the failure case and debugging. So tell us what the GNS3 engineers thought about that sort of when things go wrong with your PBT, how the hell do I debug what's gone wrong? Yes. So there's a process that PBT, that within PBT
Starting point is 00:27:12 called shrinking, they'll also call it test case reduction. This idea that, well, okay, I'm doing all this work to generate these crazy random inputs that are going to cause my test to fail. What if those crazy random inputs are really big? Debugging with that is really hard. And so what you do is you write or obtain, through some other ways, a shrinker, which can take the big input and get you to a much smaller one that still fails the property. This process is difficult for some kinds of inputs, especially if you have complex preconditions, but it's really, really worthwhile. It's worth doing. And the developers that we spoke with at Jane Street were impressed upon us that in some cases, this was really important because they wouldn't be able to understand what had gone wrong
Starting point is 00:28:02 if they got a megabyte worth of input that they didn't know able to understand what had gone wrong if they got you know a megabyte worth of input that they didn't know what to do with yeah cool i guess this maybe is there's a there's a an area for future research and sort of making these shrinkers better as well maybe and kind of handling kind of larger and larger amounts of data maybe i'm not i'm not massively up to speed on the shrinkers they sound really cool though it makes a cool name for something but yeah yeah i yes there is there's absolutely room for so so the downside that the developers were talking about was that often in pbt frameworks you have to write these by hand or if you want one that can be robust with a precondition you have to write it by hand. I have some work that appeared at ICFP in 2023 that tries to improve the process of shrinking by – there's work basically that uses generators to do shrinking. And so there's some work by Dave McIver that I built on in an ICFP paper.
Starting point is 00:29:04 And so there's work on that but i think i think the story around automated shrinking is probably not done yeah i mean this is a great is the obvious paper title in there as well honey i shrunk the something right i mean someone's probably already done it right but it's it's someone's got to do it if they haven't already but yeah i don't think i've seen one i have to i have to check on that because you're right if if no one's done that someone has to do that yeah cool right so yeah the last the next thing understanding success then so often a lot of the time even though we've kind of we've had success does that mean the thing's correct and yeah what do what did the Jane Street engineers think about this? Yeah. So, you know, you get a hundred inputs run test passed in your terminal and you're really
Starting point is 00:29:52 excited for a second, but what you should do is think like, what does that mean? And what we found was that the Jane Street developers admitted that, you know, a good chunk of them admitted that, they don't always take that extra second to think about what that test passing means. And like I was saying before, the quality of your input generator really decides, you know, how confident you should be after your property passes. And there were some unfortunate cases that we heard about where, you know, bugs made it into production because the developer saw, okay, 100 tests passed, I'm happy, and didn't realize that their generator was missing an important part of the input space of the input distribution. So there is, I think, some really good room, which I have been,
Starting point is 00:30:49 there's room for work, which I have been exploring on helping developers to evaluate once they've done their testing, like how well did that actually go and helping them kind of visualize the input space that they're actually testing with. Yeah, there's something kind of, obviously, it's related to this sort of line of work, I believe, around verifying isolation levels in databases. And it being, and having the, it's kind of, I think, I'm not exactly sure. It's very related to property-based testing. Essentially, you want some, like, I want this isolation level to hold,
Starting point is 00:31:27 and you fire a transaction history, and then you say, okay, yeah, did it pass or did it violate? Did I see this anomaly? And the problem there is that it's really hard to say that you have, the system will always kind of hold in that it'll always hold because a different set of transactions you might see the anomaly all you can say is kind of yeah we can it's hard to prove a negative right so like it's it's i guess i don't know whether that specific problem is more general across property-based testing as well like sometimes it's really hard to like okay yeah i've tested on 100 inputs, but that 101 might fail.
Starting point is 00:32:07 Yes. Yes. And this is a problem. People will always ask, is it possible to get an estimate of how likely there's still a bug given that you've done your testing. And Marcel Boma has a really cool paper about using coverage to do some of those estimates. But coverage is also a kind of incomplete measure of how good, of how well testing does. The fundamental problem is that without knowing the distribution of bugs that you might have written there's no way to really know for sure how likely there are it likely it is that there's bugs after you've done your testing and if you knew the distribution of bugs you should just go fix them because apparently you know where they are yeah yeah but definitely compared to sort of like the example-based unit tests like you you have a lot more confidence with this approach.
Starting point is 00:33:05 And even though if you can't still have 100% confidence, right? But it's always good to never be 100% confident in most things, right? That's correct. And if you see like, oh, I ran a thousand inputs through this program and they look good. They look like a diverse set of inputs. They look like they're hitting different corner and edge cases. You can have some confidence, right? It's not doing nothing for you.
Starting point is 00:33:31 Awesome. Let's get on to future research directions then. So yeah, from the study, what are the sort of kind of key directions for future research you think they are out there for PBT? Yeah. So, so I think the stuff that we were talking about with generators is probably the, the biggest and hardest one is like continuing to make it so that developers have access to automatically produced high quality data generators is the thing I've been working on for the last five years. And the thing I've been working on for the last five years and the thing I will continue working on probably for the next five years. But I think it's worth doing because once you have that, there's kind of no reason not to do property-based testing, at least sometimes. I think there's good future work in helping developers to find properties and to understand, be able to kind of pattern match
Starting point is 00:34:29 and see like, oh, I'm in a situation where property-based testing would be great, where property-based testing could really tell me if my application is doing what I think it's doing. So that's like both education and tooling to kind of help with these sorts of things. And then I think the big third point is evaluation, is working on. So I had a demo at WIST in 2023, an HCI conference of a tool that helps developers to evaluate what happened during testing. And I've continued working on that. And I think there's also a big opportunity in kind of understanding metrics
Starting point is 00:35:14 for how developers should be measuring testing effectiveness. There are lots of cool ways that people try to estimate like how good a suite of tests is and i think some of those just haven't been applied to pbt yet and there are also probably new ones that work better for pbt than in other situations that that folks could come up with now on the on the tooling um question really quick so it kind of feels to me i don't know this is this is true because i've not got a kind of a view of everything but certain languages and ecosystems have this sort of stuff as more of a first class citizen in terms of that it's easier to sort of get it in your past part of your workflow i'm thinking kind
Starting point is 00:35:58 of some of the the more functional programming languages like haskell and or camel and i think even i think rust is where i can i think there's a few nice cool libraries. But how is it the same for something like Java? Because I kind of, I use day-to-day for my sins. But yeah, how do you feel like, could there be more work there on some of the sort of, and other programming languages to sort of make, in those ecosystems to make it kind of easier
Starting point is 00:36:19 for software engineers to reach for? Yes, yes. As part of my work on the stuff I demoed at UIST, I went through the PBT frameworks in like lots of major programming languages to understand sort of what they provide in terms of tooling and that kind of thing. By the way, almost every major programming language does have a framework for property-based testing. I can't guarantee they're great, but like JUnit QuickCheck, I think, is the one you'd want to use in Java if you were interested. The problem is that because they're so inconsistent and because they're implemented in sort of different styles, some, you know, some just try to clone QuickCheck from Haskell, which is like the first, the original property-based testing framework.
Starting point is 00:37:04 Some try to do other things. quick check from Haskell, which is like the first, the original property-based testing framework. Some try to do other things, but because of this inconsistency, it is hard to guess for a particular language, like how good the PPT support will be. And I got around this in the demo I had at UIST by trying to come up with something that was generic, trying to come up with a thing that each framework can implement a hook for once, and then we can use this user interface in other, you know, just across all of them. Because I think we should, as a PBT, as PBT researchers and developers, start to try to pull this space together a little bit and and and present a little bit more of a unified front because i think the easier it is for developers to transfer those skills between languages the better chance people continue
Starting point is 00:37:57 to feel like they have this as a tool they can reach for yeah for sure that's sort of like standardization i guess in some way is definitely helps kind of i don't know i mean you can look at even from like from query language or anything having a standardized or a kind of a go-to way of doing things just helps adoption massively right and once you've got you say once you've got that new toolkit wherever you go you can use it and that's only going to kind of yeah the benefits they're going to compound over time more for sure so yeah that would be really nice to to see cool yeah, the next sort of talking point I have is about impacts. And obviously, PBT has had a lot of impact. But what impact do you think your
Starting point is 00:38:33 work can have? Obviously, this paper and the previous work you've done in this area and going forward as well can have on, would you like it to have on sort of software engineers' day-to-day lives? Yeah. So, so I think, I think there's kind of two places that I really want to impact from this. One is getting the word out about property-based testing, getting more people, making more people aware of it and giving people better recipes for how they might be able to use it in their development workflow.
Starting point is 00:39:03 So does the high leverage scenarios are one thing. And just kind of understanding the benefits and drawbacks of property-based testing, I think, will help people make better decisions about how and when to incorporate it into their workflows. And then I also want, and I hope this has impact on the research of property-based testing. I hope this provides good motivation to folks for some problems that are worth pursuing, some ideas about how property-based testing is being used that may run counter to the way that current research in property-based testing is thinking about it being used. And sort of getting those things better aligned so that the research we're doing has a better chance of having impact kind of down the line. Will there be a, would you think there'll be a sort of a, this sort of study again in maybe five, 10 years time,
Starting point is 00:40:01 and to see how some type of study like this to sort of see how the adoption's kind of going or yeah, some sort of, what's the word I'm looking for? Some epidemiological study of kind of the spread of PPT throughout the industry. I don't know. I mean, if it was a big enough thing that it was worth an epidemiological study,
Starting point is 00:40:18 I'd love that. That would be great. I do think more qualitative work in this space would be good. I would love to study a company that is adopting property-based testing actively, because we got established PBT users at Jane Street. And understanding the growing pains as it's being adopted is another thing I think would be really worth doing. But yeah, I'd like to see more studies like this. I users if there were more studies like this that were bringing those user experiences into the programming languages community yeah i agree with that that point completely i mean even sort of when i was kind of doing my phd and i was in
Starting point is 00:41:20 davis i found the papers that were kind of when an asked actual end users questions were so great because you see what people but the pain point people have so you can go and solve those and you can have that nice thing i'm actually solving someone's actual problem cool let's um let's talk about surprises now harry so whilst working on on pbt what has been the sort of the most surprising thing you've thought of maybe maybe on this maybe on this project or kind of on pbt as a whole and what's been the most surprising thing you've sort of maybe maybe on this maybe on this project or kind of on ppt as a whole and what's been the most surprising thing it's a good question i i think i'm i'm surprised that it's still a niche thing is is that is that too self-promotional i don't know i'm by the way just to be totally clear property-based testing was absolutely not
Starting point is 00:42:04 my idea right this is a thing i'm studying and working on, but the original library for it was by John Hughes and Kuhn Klassen in 2001. But I think that it does feel to me when I write code that there are these times when I'm just like, oh, this is the thing I need. This is how I make sure that this program is doing what I think it should be doing. And I think that I'm surprised by the pushback, both when it comes to people not adopting it broadly, but also pushback on how important correctness is. It sometimes feels like you're telling people to eat their vegetables,
Starting point is 00:42:46 but like, it's really important that, that software that we write is correct. Sometimes like it's impacting people's daily lives and their finances and all this stuff. And, and correctness is really key. And there are so many good tools out there for helping people make sure their software is
Starting point is 00:43:05 correct that just like don't get used because they're like a little frustrating and i i was surprised by that and continue to be surprised and a little disappointed by that but you know there's there's room for that i think yeah it's funny it's funny you said because i was at um just kind of off on a tangent here but i was at an air show the other weekend and i was looking inside the what kind of which what plane it was it was like a stratophosphorus or something this giant big bomber thing we're in the under carriage and there's all these electrical wires and everything obviously of the computer system on those things as well and thinking like how do they have the confidence that this is does this thing carried nukes probably
Starting point is 00:43:47 like how do they have the confidence that this is gonna be like i don't know i agree i think correctness is so so important especially when you think it's like plays such an important role in every you'd hear if you're if your airplane started coming out the sky right because it's someone had not tested something properly so uh yeah plus one for me on the correctness yeah i think we should think about it more but anyway yes so yeah origin story so this specific paper how did it come around and so was it sort of like what was the original idea who are they okay let's do this study who was the kind of the creator of that, I guess. Yeah. So it really started in, I think, the third year of my PhD. And Professor Andrew Head had just come to Penn.
Starting point is 00:44:31 He had just started. And he was one of the two first HCI professors, Danai Mataksa being the other one. And when Andrew got here, this was at a point where I was a little bit lost in my PhD. I was like, I had done a little work on random data generation, but I wasn't sure where it was going, and I wasn't sure if it was having impact. And my advisor, Benjamin, said, go talk to Andrew and ask him if HCI people have ways of measuring impact in the way that you want to have impact.
Starting point is 00:45:03 And Andrew had a whole bunch of ideas. And so I took his class. We wrote a little paper during class called Some Problems with Properties, which we published at HATRA in 22. And then we were off to the races with the study. We had some conversations with Ron Minsky at Jane Street and it just made sense. So I would say the idea was kind of Andrew, Benjamin and me kind of in a bunch of meetings thinking about how to measure impact and this seemed like the study to run. Nice. That's a good segue to hurry into my next question and this is my always my favorite question and it's about the creative process and generating ideas and then selecting which ones are actually worthwhile working on so how do you approach that do you have a systematic a systematic
Starting point is 00:45:55 way of doing it or is it let's just get in an office and just chat and see what comes up that's such a good question so i i don't think i have a very rigid process. I think I do have like a document of ideas, which just kind of seems to always grow and never shrink, and I never get to any of them. But I think the ideas that I really end up pursuing are ones that I feel like I can get my collaborators excited about. I think collaboration is critical in the research process. I would not know how to do research alone. And a lot, I think, of how I guide what I think is worth working on is like, can I get people excited about it? And is it, you know, it doesn't seem possible, obviously. So, so, you know, I'll, I'll, I'll play around with an idea for a while, see if it works. And if it seems like it's starting to work, I'll bring it to someone and say like, hey, I did this thing. What do you think? And if they say, then maybe I abandon that, right? Maybe I don't keep going. But if my collaborators seem excited, then full steam ahead. So I think that's the process that has gotten me through my PhD. Although I think this
Starting point is 00:47:07 is still a thing I'm very much building. I think that the creative process is probably the hardest part of research and I don't think it ever becomes very easy. Yeah. I think it's an always evolving process for sure. But that's a nice answer to that question. Cool. We've arrived at the time for the last word now, Harry. So what's the one takeaway you want the listener to get from this podcast episode today? Try property-based testing. Pick it up. Whatever language you are working in probably has a library for property-based testing. Get familiar with it so that if the time comes when it seems like, oh, property-based testing, get familiar with it so that if the time comes when it seems like,
Starting point is 00:47:46 oh, property-based testing would be useful for this, you have a really great way to get assurance for your software. Fantastic stuff. Well, thank you very much, Ari. It's been a lovely chat today. I'm sure the listener will have thoroughly enjoyed it as well. We'll put links to everything you've mentioned
Starting point is 00:48:00 across the course of the podcast in the show notes as well. And where can we find you on socials are you twitter ix sorry linkedin i'm i'm still technically on twitter at h goldstein 95 i'm also like theoretically on macedon which is at harrison goldstein sorry me at harrison goldstein with a dot before the i n My website is the main place to reach me, which is my name, Harrison Goldstein, but with a dot before the I-N. And if I'm allowed a little shameless plug,
Starting point is 00:48:34 I will be on the faculty job market in the fall. So if folks are looking for tenure track faculty that does PL, HCI, and or software engineering, please get in touch. Awesome stuff. Great. We'll leave it there. And we'll see you all next time for some more awesome computer science research. Thank you.
