Disseminate: The Computer Science Research Podcast - Shrey Tiwari | It's About Time: A Study of Date and Time Bugs in Python Software | #64
Episode Date: September 23, 2025

In this episode, Bogdan Stoica, Postdoctoral Research Associate in the SysNet group at the University of Illinois Urbana-Champaign (UIUC), steps in to guest host. Bogdan sits down with Shrey Tiwari, a PhD student in the Software and Societal Systems Department at Carnegie Mellon University and member of the PASTA Lab, advised by Prof. Rohan Padhye. Together, they dive into Shrey’s award-winning research on date and time bugs in open-source Python software, exploring why these issues are so deceptively tricky and how they continue to affect systems we rely on every day.

The conversation traces Shrey’s journey from industry to research, including formative experiences at Citrix and Microsoft Research, and how those shaped his passion for software reliability. Shrey and Bogdan discuss the surprising complexity of date and time handling, the methodology behind Shrey’s empirical study, and the practical lessons developers can take away to build more robust systems. Along the way, they highlight broader questions about testing, bug detection, and the future role of AI in ensuring software correctness. This episode is a must-listen for anyone interested in debugging, reliability, and the hidden challenges that underpin modern software.

Links:
It’s About Time: An Empirical Study of Date and Time Bugs in Open-Source Python Software 🏆 ACM SIGSOFT Distinguished Paper Award
Shrey's homepage

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Disseminate the computer science research podcast.
Hi, and welcome to Disseminate, the computer science research podcast.
I'm your host... no, not Jack Waudby this time.
My name is Bogdan Stoica, and I'll be guest hosting today's episode.
Today I'm joined by Shrey Tiwari, who is midway through his PhD at CMU, advised by Rohan Padhye.
Shrey has recently published a paper on date and time bugs called It's About Time,
an empirical study of date and time bugs in open-source Python software.
This paper appeared in the International Conference on Mining Software Repositories this year
and won a best paper award there.
It's my pleasure to have Shrey here with us.
Shrey, welcome to the show.
Hi, Bogdan.
Thank you so much for having me here.
I'm excited.
So Shrey, do you want to tell us a little bit about your research?
And tell us a bit about your PhD journey so far.
Yeah.
So I can start off by telling you how I ended up at the PASTA Lab in general.
So I graduated in the year 2020 and started my job as a software developer at this company called Citrix.
And Citrix is one of the leading players in the VPN Solutions market.
And that's important because when I started my job, COVID was at its peak.
Work from home was becoming the norm.
So the demand for VPN solutions was growing rapidly.
So around the year that I spent at Citrix, it was extremely intense.
I had a lot of feature developments, multiple production deployments, customer issues, and so many bugs.
The battle scars that I gained at Citrix are what make me care about software reliability so much.
While I was at Citrix, I started to wonder, you know, is there something
that I can do to make software more reliable in general rather than just making the product
that I'm working on more reliable. And that's when I started looking at more opportunities
outside of Citrix. And mind you, this is not when I had a PhD on my mind at all. But
serendipitously, around that time, an opening at Microsoft Research showed up at the cloud reliability
team led by Dr. Akash Lal. And I was lucky enough to join as a research fellow. For those of
you who don't know, a research fellowship is essentially a pre-doctoral position. So I spent the next two
years at MSR working on some important programming language problems that were, you know, crucial
to Azure. And that's what made me realize that I enjoy research a lot. And I should probably
consider PhD as a career option. And around the time, I started applying for PhDs. But as you
may know, there are just so many incredible labs across the world.
and it's really hard to pick and choose a handful.
But I chose the PASTA Lab for a couple of reasons.
Firstly, you have to admit that it's one of the coolest names a lab can have, right?
And the slogan also goes like,
we are not afraid of spaghetti code.
And I think that's quite fitting.
But apart from that, I think the PASTA Lab was probably one of the few labs
that actually had a very explicit ethos listed on their website.
And let me see if I can pull it up right now. It goes something like: for science to make rapid and sustainable progress,
we believe that academic research should be made openly accessible, reproducible, and where
artifacts exist, also reusable. And that was something that really stuck with me, because I've
always been driven by impact and change in the real world. So that's how I ended up at the PASTA Lab
doing software reliability research. And the past couple of years have been great. Yeah.
Cool. Great, great, great stuff. So you had a grown-up job and then decided that, you know, it's time to make some broader impact. I like it. I think this is becoming the norm. I don't know. I graduated recently, but I started my PhD back in 2019. But now when I talk with more PhD students that started more recently,
they all have the same arc.
They went to industry.
They gained some experience.
They found a problem they're excited about and started working towards it as a PhD student.
Great stuff.
So speaking of that, what actually drew you to time-related bugs?
I know you have been working on this topic for quite some time.
You have a paper.
You have some ongoing research.
Or maybe I did jump a little bit ahead.
Would you mind telling us a little bit about your research
and how you ended up working on this particular type of bugs?
Yeah.
So it's actually an interesting story.
So back in November 2023, when I just started my PhD,
CMU was sending out invitations for summer REU projects,
and essentially that stands for Research Experiences for Undergraduates.
And Rohan was like, hey, do you want to mentor a few REU students for the upcoming summer?
And I was like, yeah, sure, I'd be excited and I'd love to.
But at the time, I was working on this concurrency project called Fray,
and we really couldn't find anything that could wait until the next summer.
So that's when me and Rohan got to brainstorming about different ideas, and Rohan was like,
hey, have you ever thought about date and time bugs?
And I was like, yeah, I faced quite a few when I was at Citrix.
And ever since different kinds of YouTube videos on confusing date and time concepts
or developer blogs keep popping up in my feed.
So since this was like an area of common interest,
me and Rohan just wrote up a naive project draft for the summer and submitted it.
It got accepted. But then later, as in the next semester, we continued looking into this problem
more, and we were surprised to see that there are just so many different cases of date and time
bugs that are out there. And this was quite interesting because we're like,
this is such an understudied area, there are so many problems, it affects so many people, can we do
something about it? And Rohan went ahead and actually wrote an NSF grant for this whole project,
which later got accepted,
and that was nice.
then come summer
I had three incredible
REU students who joined the lab
and we set out on a project
to build a tool
to find date and time bugs,
but we soon realized that
hey, what even are date and time bugs?
like we couldn't even define the problem
and the project
kind of pivoted into studying
what's out there
understanding the landscape
and that's what resulted in the paper
I'm quite glad how things turned out
and I really hope that this research kind of benefits
the academic world
and also the broader community in general.
Cool.
That's awesome.
So you had encountered this type of bugs
while you were at Citrix, so in industry.
So this is not just an academic exercise.
These bugs are plaguing real
software that has, you know, a wide base of users, right?
So why do they happen in the first place?
Because when I read your paper, which is a great paper, by the way,
I invite the listeners to go and read it.
It's super accessible.
It's great, great stuff.
When I was reading the paper, the only thing that I could think of is, okay,
these are very straightforward patterns.
So why do developers keep making those mistakes?
They have been studied extensively in the past.
They've plagued software extensively in the past.
We know how to fix them,
or at least we think we know how to fix them.
Why are they still there?
And why are they so hard to avoid and later detect?
Yeah. Yeah.
That's a great question.
So I would say that date and time bugs occur in software
when developers bake some kind of assumptions about date and time concepts
that may not hold in production.
So let's take a simple example.
Let's do a fun exercise where I ask you a few questions.
Let's assume that you are a developer who is trying to develop a calendar application.
Now, assuming that, let me ask you a few simple questions.
Don't overthink them, but just tell me the first thing that comes to your mind.
How many hours do you think are in a day?
24.
Okay.
And how many seconds are in a minute?
60.
Okay.
And the last one may sound a little crazy, but please bear with me.
Do you think you can time travel, at least as of now?
Are you asking me if I believe in time travel or if I can time travel right now?
Oh, we can dedicate another podcast.
Let me rephrase the question.
Do you think time always increases monotonically?
As a human being, yes.
As a computer scientist, not really.
Great, great, great.
That's actually the right answer, right?
So first question, you said 24 hours, that's maybe not really true because there's at least one day in a year that has 23 hours and one day in a year that has 25 hours if you're living in a time zone that observes daylight savings.
And not every minute in a year needs to have 60 seconds, right?
Maybe there's this concept of leap seconds where, you know, people around the world add one more second to a year just to make sure that solar
time aligns with, you know, local time.
So there's these, there's so many fundamentally complex things about date and time that
people do not understand.
I'm glad that you know that time does not always monotonically increase on computer systems.
But a lot of people would just say, yeah, it always does.
And these kind of things can lead to a lot of problems in production.
So for example, recently there was like an incident at Cloudflare,
and essentially what Cloudflare was trying to do
was it was trying to measure the RTT of their packets,
and they were doing so by taking the timestamps.
But because Cloudflare is so fast,
once when there was a leap second,
their RTT went negative
because of the clock syncing,
and their system was not designed to handle
an RTT that was negative.
They didn't expect responses to come before requests,
and that's crazy,
because this was a proper production
outage and they had to release a post-mortem report on this.
So these are like things that developers actually bake into production.
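For listeners who want the concrete pattern, here is a minimal sketch in Python (my own illustration, not Cloudflare's actual code) of why measuring durations with the wall clock is risky and what the safer alternative looks like:

```python
import time

# Wall-clock timestamps can step backwards (NTP corrections, leap seconds),
# so an elapsed time computed from time.time() can come out negative,
# which is exactly the "response arrived before the request" situation.
start_wall = time.time()
work = sum(range(100_000))                  # stand-in for the real request
elapsed_wall = time.time() - start_wall     # may be negative after a clock step

# time.monotonic() is guaranteed never to go backwards, so it is the safer
# choice for measuring durations such as round-trip times.
start_mono = time.monotonic()
work = sum(range(100_000))
elapsed_mono = time.monotonic() - start_mono   # always >= 0

print(f"wall-clock elapsed: {elapsed_wall:.6f}s, monotonic elapsed: {elapsed_mono:.6f}s")
```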
And there are also other simple concepts that people just forget to handle, right?
Like there was an incident in Microsoft Azure where they were trying to generate
certificates with expiration dates, and the whole system just crashed on February 29 because all that
they were trying to do was assign an expiry date of one year from today.
And one year from February 29 does not
exist. So the whole of Azure came crashing down. And this was like again another public post-mortem
report. So these bugs are quite important. They can cost you a lot of money.
But there are also bugs that can cost you lives, right? Like if you look at safety critical systems
like aircraft, there have been bugs in the F-22 Raptors where, you know, systems
have crashed when the plane has crossed the international date line. And it's put the
pilots' lives in danger.
Or there have also been bugs in Epic Systems,
which is like the software that's run in almost all United States hospitals.
And that prevents people from getting surgery on time.
So date and time bugs do plague a lot of systems.
They can cost you a lot of money.
They can harm human lives.
They are really important.
And why are they so deceptive?
Basically because there are so many various date and time concepts that people don't understand.
maybe I can walk you through some of the ones that we talk about in the paper, right?
Sure, yeah, please.
So one is just string representations.
Data in times can be represented in so many different formats.
So for example, in Python, just 29th February in itself is a valid representation of a date time object.
Now, you may ask, if I pass in the input as 29th Feb, what should be the year?
should the default year be the beginning of the Linux epoch, 1970, in which case it is not a leap year,
so parsing such a date would always fail? Or should the year be the current year, in which case
parsing such a date would only succeed once every four years? And these are real problems,
these have been discussed in Python PEPs. There is also a lot of date and time arithmetic
that people need to understand about
date and time computations.
So basic math rules don't apply to date and times.
Let's take a very simple example of reversibility, right?
Let's say you have X, you add a delta
and subtract the same delta.
You get X back in math.
Not so much in date times.
Let's say you have an object that represents
31st January.
You add one month, you either get 29th or 28th
and then you subtract one month
and you don't get Jan 31st back.
so basic arithmetic rules don't apply
and you have to be cognizant of these things
when you're performing computations.
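Both of these gotchas are easy to reproduce; here is a minimal sketch in Python (the relativedelta helper from the third-party dateutil package stands in for "add one month" here, since the standard library has no month arithmetic):

```python
from datetime import date, datetime
from dateutil.relativedelta import relativedelta  # pip install python-dateutil

# 1. Parsing a day-and-month string: strptime fills in 1900 as the default
#    year, and 1900 is not a leap year, so February 29 refuses to parse.
try:
    datetime.strptime("29 February", "%d %B")
except ValueError as exc:
    print(f"parsing failed: {exc}")

# 2. Date arithmetic is not reversible: adding and then subtracting one
#    month does not bring you back to where you started.
start = date(2024, 1, 31)
forward = start + relativedelta(months=1)
round_trip = forward - relativedelta(months=1)
print(start, "->", forward, "->", round_trip)
# 2024-01-31 -> 2024-02-29 -> 2024-01-29
```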
There are also just weird geographic aspects
and there's this one funny story
that I always like sharing.
So there's this island called Samoa
that's near the international dateline
and at some point
Samoa decided that it wanted to jump
from one side of the international dateline
to the other.
So it wanted to move from a time zone
of GMT minus 11 hours to GMT plus 13 hours.
So that's essentially a difference,
a jump of 24 hours.
And that's what Samoa did, right?
Back in 2012, I guess,
Samoa went from December 30th to Jan 1st
and totally skipped December 31st.
Now, imagine you are a developer
who's writing a calendar application
or even better, a payroll application.
Where do you even begin to handle these kind of things, right?
So I think it's fair for me to say that there is a lot of knowledge out there that needs to be assimilated, organized, and, if I may, disseminated through articles, blogs and podcasts to spread awareness.
I think this is one of the things that makes these simple bugs really hard to catch.
This is very, very insightful.
So just to see if I understood correctly, you're saying that,
although we all understand dates because we're working with computers,
this may not well represent our human understanding of dates
because computers can do operations on dates that need to be mindful of various rules.
You can cross different time zones and even the fact that a day, an hour or a minute
might not have the number of units we as humans think we have.
And, yeah, so developers are human, at least for now.
Yeah.
AI will replace us all.
So these are valid, valid concerns.
And I can see how developers can easily get tricked and tripped by these.
How did you, I'm so sorry, please go ahead.
So I was just going to add that, you know,
For any kind of bug, there are essentially two things, right?
One is, or essentially maybe two steps.
First, you need to have an idea of what kind of a bug you're looking for,
and then you've got to have a plan on how to generate a workload
that could potentially trigger that bug if it exists in your program.
If you take the example of null point of exception bugs,
it's pretty simple to answer both of these questions.
A, you're looking for null point.
or exceptions. And B, you generate a lot of corner case inputs and check if any null pointer
exception is thrown by a program. This is pretty straightforward to do with existing techniques
today. But for date and time bugs, both of these questions are kind of hard to answer, right?
Hopefully with our paper, we are able to better articulate what kind of bugs we are looking
for but still it's really hard to answer the second question on how do we test our program
because there are no tools that exist today that can allow you to run your software in multiple
time zones or there is no tool that can allow you to simulate a dST change or even like test
your application with different kinds of dates. So I think that's another reason why, you know,
even simple patterns escape software testing, because there's just
no tool support for developers
out there. Right.
This is great, because actually
what you just described segues perfectly into my next
question. Somewhere in your paper
you talk about these bugs being obscure.
You introduced a concept of bug obscurity
for time-related bugs.
Can you tell us a little bit more about what you mean by that?
And maybe you can briefly explain
how did you classify those bugs?
Did you find any similarities and differences
that can eventually help,
that eventually helped you detect them?
And maybe they will serve as lessons learned
for future software engineers.
Yeah, great question.
So as part of our study,
we analyzed around seven different
factors. Three of them were specific to understanding the bug and the code. And the remaining
four were in some sense error agnostic, where we just wanted to understand the nature of
date and time bugs without worrying about any specific error pattern. The idea that we had was
these two sets of factors can help us understand date and time bugs holistically and come up
with hypotheses on how to prevent them.
So, yeah, let's talk about the four factors that we used to understand the nature of
data and time bugs.
So firstly, we had something that was known as bug obscurity, and that was essentially
a factor that had to do with, or that describes, how difficult it is to detect or trigger a bug.
And we stuck to scales such as low, medium high for all four of these factors,
just to make it easy to derive insights from the data.
For bug obscurity, any bug that was readily detectable with any kind of input would have a low obscurity.
Example of that would be making use of any outdated API.
If you pass in any input, it's going to throw an error or a warning.
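One concrete instance of that low-obscurity category (my own illustration, not necessarily one of the study's examples) is the deprecated utcnow constructor in recent Python versions:

```python
from datetime import datetime, timezone

# Deprecated since Python 3.12: any call emits a DeprecationWarning and
# returns a naive datetime with no tzinfo attached, which is easy to misuse.
naive_utc = datetime.utcnow()

# The recommended replacement constructs an aware datetime in UTC.
aware_utc = datetime.now(timezone.utc)

print(naive_utc.tzinfo, aware_utc.tzinfo)   # None UTC
```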
Medium obscurity bugs were bugs that would manifest only under certain inputs, so let's say maybe with a leap day as an input, or would occur only during a DST change, but these bugs again would throw explicit errors. And finally, we had bugs which we classified as really high obscurity, and these were bugs that would arise very rarely on edge cases, and also they would never
throw any explicit warning.
So these could be bugs where they silently produce a wrong output
and there's no way for you to know if anything has gone wrong.
And we do a similar exercise for bug severity.
We manually look at all the developer reports
and classify bugs as low severity, medium severity, high severity,
low representing some non-critical task being affected,
medium representing that
a critical task is affected
but there's a workaround
and high severity bugs are bugs
where a critical task is affected and there's just
no way to get around it.
We also wanted to
understand what it took to fix
these bugs, hopefully
trying to understand what it would
take to catch these bugs, and for that
we looked at the fix size and fix complexity.
The fix size essentially
tried to look at how big was the fix?
Was it in one line?
was it limited to a single function or was it spread across multiple files?
And the complexity essentially looked at how logically involved was the fix.
Did the fix involve changing the API?
Did it involve redesigning your whole software architecture and re-implementing multiple methods, things like that?
So one thing that I would like to note here is that we did not blindly pick numbers for all of these factors.
So we did not mine PRs and then look at the number of lines added or removed because there could be changes that may not be specific to the date and time bugs.
So we manually analyzed each of the issues and tried to isolate all the changes that were specific only to the date and time bugs, or the bugs that were under study.
So what did we find?
We found that nearly 60% of the bugs demonstrated low obscurity,
around 74% of the bugs
had medium to high impact,
82% of the bugs were localized to a single function,
and lastly, only 10% of the bugs required
extremely complicated fixes.
And these are really interesting results,
because for each factor we have a result that's not so intuitive.
This is also super interesting because understanding
the nature of these bugs can help us figure out
the right approaches to catch them
and hopefully I'll get to talk about them in the future.
But yeah, so these were like the four factors that we looked into
for understanding the nature of bugs.
There were also three other factors that we looked at
to understand more about code-specific changes.
They were mainly: what were the concepts that the developers got wrong,
what were the programmatic operations involved
in the bugs that occurred,
and what were the error patterns.
So you can think of this as like a tree structure
where at the highest level you have a date and time concept
that developer got wrong.
There are many different ways in which
the developer could get that one concept wrong.
So for example, if the developer got a concept like time zones wrong,
maybe the developer got it wrong
because he was parsing strings incorrectly
and dropping time zone information
or he got it wrong because he was doing
conversions of time zones and he made a mistake in converting. And for each of the different
programmatic operations that the developers got wrong, there are multiple ways in which they could
have failed. Let's say the developer is parsing a date time object; so maybe he went wrong
in parsing by not accounting for time zone information, or maybe the time zone information never
existed and he invented one. So this tree-like structure helped us analyze different bugs
there too. I hope I answered your question, but if not, please feel free to ask any specific
follow-ups if you have them. No, great. This perfectly describes, you know, I think the structure of your
study and the patterns. You mentioned that you don't use any kind of pattern matching
or mechanized way of analyzing these bug reports.
You are analyzing them by hand,
trying to understand what the developers did,
what the users encountered,
what the root cause,
according to the developer eventually was,
and how did they fix it?
Yes, that is kind of.
Right.
So how do you, what's your secret?
How do you analyze such a large amount of bug reports in such an efficient way?
And you distill them into concrete patterns that are easily explainable,
not only in the paper, but in a one-on-one conversation like the one we were having now.
What's your secret?
Yeah, that's a great question.
And I'm glad you ask it because in my experience,
a lot of researchers in the field of bug finding,
are more concerned with results and insights rather than the methodology.
But I believe that when it comes to studies like this,
it's absolutely crucial that you understand the methodology
and convince yourself that you agree with it
before you look at any of the results, right?
And I say this because getting our high-quality data set
was not particularly easy.
We had to perform multiple steps of mining and filtering
to get the most likely date and time issues,
right? So the way we approached the problem was: first, we went to GitHub and we looked at all the
relevant software projects in Python that would help us out. So that meant looking at projects
that were created in the past 10 years, projects that had a certain number of stars, and also
projects that made use of date and time code. And once we performed that filtering step,
we were left with around 22,000 Python projects, which is a big
number, and that's not even looking at the issues, right? So for each of these 22,000 projects,
we mined all of their GitHub issues and we looked for keywords that are related to time,
like, you know, nanoseconds or milliseconds, or we looked at any date and time Python methods that
were being called in the code, like strftime or utcnow or fromtimestamp, and we were
looking for, like, general date and time concepts being mentioned in the issue description, like
time zones, so UTC, DST.
And after performing this keyword matching step,
I think we had around 26,000 issues,
which is like a very big number.
So again, we had to go back to the drawing board
and perform another set of filtering.
And what we did was we were like,
hey, let's look at issues that are resolved
so that we can conclusively derive some insights from each issue.
Let's look at issues that have a PR.
so that we can look at a code change and
concretely say what the code bug was.
And also let's look at issues that have sufficient information in them, right?
Many times, issues just have a title and a question and nothing else.
So after doing all of this, we were left with a set of around 9,000 issues,
which was much better in quality, but still not manageable to analyze manually.
And that's when we pulled out a trick from,
something that's quite popular
in information retrieval systems
and that was the idea of applying
TF IDF or term frequency
inverse document frequency
to essentially find out
among these 9,000 issues
what issues are most likely to be
date and time issues, right?
Because even in the 9,000 issues that we had
we could have a performance bug that mentioned something
like elapsed time and that would get pulled into our
data set because of keyword search.
And for those of you who don't know TF-IDF, I would say, please go check it out.
It's a very interesting technique.
But at a high level, what TFIDF tries to do is it tries to prioritize issues with multiple
keywords that are present more often in date and time bugs.
And after applying this, we were able to sort our issues and go top down to find all the
interesting date and time bugs.
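To make the idea concrete, here is a minimal sketch of TF-IDF-based ranking (my own toy illustration using scikit-learn, not the study's actual pipeline): issues whose text shares more, and rarer, vocabulary with a date-and-time keyword list float to the top.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

issues = [
    "strptime fails on 29 Feb because the default year is not a leap year",
    "Elapsed time regression: benchmark is twice as slow after the refactor",
    "Timezone conversion drops the DST offset when reading UTC timestamps",
]
datetime_vocabulary = "timezone utc dst leap second strptime strftime timestamp datetime"

# Vectorize the issues together with the keyword "document" ...
matrix = TfidfVectorizer().fit_transform(issues + [datetime_vocabulary])

# ... and rank each issue by its similarity to that keyword document.
scores = cosine_similarity(matrix[:-1], matrix[-1]).ravel()
for score, text in sorted(zip(scores, issues), reverse=True):
    print(f"{score:.2f}  {text}")
```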
And because we're doing this manually, we were only able to create a data set of 150 bugs,
but because it's such high quality, we are able to derive really good insights.
And one question that you may have right now is, hey, Shrey,
did you try using AI to automate the classification of issues?
And that's a fair question.
We actually did perform a systematic study of AI's performance.
And what we actually did was we took the first 100 or so bugs that we had manually classified
and we used GPT-4o to try and classify the same bugs
and compare the labels generated by GPT-4o against ours.
We found that it was only 60% accurate at the time
and even with advanced prompting techniques
and breaking down the problem,
it was barely able to achieve 70% accuracy.
And we just deemed that unacceptable.
In my opinion, if you ever want to do automated analysis of data,
the accuracy has to be really high.
or errors can compound over time.
But in short, that's how we ended up
with a data set of 152 true positives
that anybody can look at today.
Yeah.
Awesome.
And this data set is public, right?
Yes.
Yes.
Great, great stuff.
You mentioned AI and using AI to classify them.
Do you think, and I know this kind of
goes into the part where you talk about how to detect them.
But maybe before we jump to that, do you think AI can help in any way mitigate some of
those issues while developers are designing or writing their code?
Have you thought about this kind of use case?
Yeah.
It's very interesting that you ask that.
Because in some sense, that's the project that I'm working on currently
in my PhD,
and as it turns out
AI has been trained on
a lot of code that's out there
so it means that it has been trained on
multiple different date and time libraries
that exist
that may be similar but subtly different
and what I'm trying to do in my current project
is that essentially the goal
of the project was to see if we can
determine or find bugs in date and time
libraries. So we have two key insights here. All date and time libraries have functions that
accept inputs that are easily generatable using generators. And the second key insight is that
for any library API call foo that accepts two date times, DT1 and DT2, we can generate DT1 and
DT2, but we do not necessarily know what the output of foo should be. But what we know is that
if there is another library that offers the same function foo,
both of them have to give similar outputs, right?
Because at the end of the day, the function is the same.
So in some sense, we started performing something known as AI-assisted differential fuzz testing
on all of this code.
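To give a flavor of what differential testing over date and time code looks like, here is a minimal sketch of my own (not the project's actual harness, and both helper functions are hypothetical): a hand-rolled "add one month" is fuzzed with random dates and cross-checked against dateutil's relativedelta, which plays the role of the second implementation.

```python
import random
from datetime import date, timedelta
from dateutil.relativedelta import relativedelta  # pip install python-dateutil

def add_month_naive(d: date) -> date:
    # A common hand-rolled version that forgets months have different lengths.
    return d.replace(month=d.month % 12 + 1, year=d.year + (d.month == 12))

def add_month_reference(d: date) -> date:
    return d + relativedelta(months=1)

random.seed(0)
for _ in range(1_000):
    d = date(2000, 1, 1) + timedelta(days=random.randrange(20_000))
    try:
        naive = add_month_naive(d)
    except ValueError as exc:
        print(f"naive version crashed on {d}: {exc}")
        break
    reference = add_month_reference(d)
    if naive != reference:
        print(f"disagreement on {d}: {naive} vs {reference}")
        break
```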
And what we started to realize is that AI just assumes things on behalf of users.
So you could say that, hey, give me a function
that computes the number of days between two dates.
Now, the AI is going to go ahead and give you a piece of code.
And that piece of code is going to work well 99% of the times.
But there might be a subtle edge case bug that does not handle daylight savings.
And it's very hard for users to know that that bug exists because there's no tool to help test the code.
So in my opinion, AI can help you understand APIs, but if you were to
blindly use AI to write date and time code, you're very likely to run into
edge-case behaviors in production. And that's scary territory for me.
Great. This is actually great. You heard it here on Disseminate first, folks: AI
cannot help you write bug-free code. In fact, it introduces more bugs, which, I don't know,
is bad for the users
but great for
you, Shrey, because you're
doing research in this area
and that's your bread and butter
I'm also doing
debugging by choice so I'm a
software reliability person
and yeah sounds like
with AI we will not
run out of a job or at least we will
not run out of bugs to fix
that's
that's good to hear
at least we're not replaceable yet.
There's still a need for bug finders like yourself, right?
I want to bring the conversation back to the date and time bugs that you studied in this project, in the paper.
And I would love to get back to future work and how you can use AI as a cog in a larger mechanism to find
broader types of date and time bugs.
But going back to the paper, you have identified all these patterns.
You have identified, I'm assuming, some common or similar root causes or at least some
classes of root causes.
How did you fix them?
That's a great question.
So in our research, we did not necessarily focus on fixing
these bugs; rather, we focused on finding them so that developers could go fix them based
on their application context.
There are a few things.
Let's try to break it down.
So firstly, we, like I said, right, we analyzed seven different factors and we identified
the different kinds of concepts that developers got wrong.
We looked at different kinds of programmatic operations that developers got wrong.
we identified error patterns.
We also had four other factors
that were describing the nature of the bugs.
So even before going out and building rules,
we developed some hypothesis.
We tried to study the correlation
between bug severity and bug obscurity.
And what we found was that
there was a slight negative correlation,
so around 0.11.
And this meant that severe bugs were not necessarily hard to detect.
Also, we found that a lot of bugs were localized to a single function.
So this meant that even simple static analysis tools could potentially help you catch
date and time bugs.
And not just catch any date and time bugs, they could help you catch date and time bugs that
potentially had high impact.
and this was like a great hypothesis to have,
but again, it's just an idea, right, until you show it.
So that's what we did next:
we went ahead and we developed like a few static analyses.
Most specifically, we developed static analyses for three patterns.
In our study, we found that
the thing that tripped up the developers most
was construction of date and time objects,
which in itself is a very surprising thing,
and maybe we can talk about that later.
But we developed three analyses to focus on this specific problem
of date and time construction.
So firstly, we looked at use of outdated APIs
for constructing date and times.
And this is like a simple problem, right?
It's not dependent on application context.
You could potentially just use string grep in the code base
to find use of outdated APIs,
but we made use of CodeQL.
And for those of you who don't know CodeQL,
it's a framework that helps you develop
static analyses in the form
of queries. The next analysis that we developed looked at, you know, use of custom fixed time zones
in your code. So if your code is making use of a custom time zone that has a fixed offset,
it's not necessary that your code is correct. If you're running your code in a time zone that
observes daylight savings, then your offsets have to change with change in daylight savings.
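The distinction is easy to see with the standard library's zoneinfo module; a minimal sketch:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# A hard-coded fixed offset is only correct for part of the year.
fixed_eastern = timezone(timedelta(hours=-5))

# A named IANA zone tracks daylight saving transitions automatically.
eastern = ZoneInfo("America/New_York")

winter = datetime(2024, 1, 15, 12, 0, tzinfo=eastern)
summer = datetime(2024, 7, 15, 12, 0, tzinfo=eastern)
print(winter.utcoffset())  # UTC-5 in winter
print(summer.utcoffset())  # UTC-4 in summer, which the fixed offset misses
```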
And lastly, we developed another query that looked at partially replacing fields of date and time objects, right?
So let's say you had an object that represented 29th February and you only replaced the year.
You might get an object that represents an imaginary date, and when you operate on it, your program could be buggy.
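Depending on the library, that kind of partial replacement either produces a shifted date or, as with the standard-library datetime, crashes outright on a leap day; a minimal sketch of my own, where the helper name is hypothetical:

```python
from datetime import date

def expiry_one_year_later(issued: date) -> date:
    # Partial field replacement: only the year is swapped out.
    return issued.replace(year=issued.year + 1)

print(expiry_one_year_later(date(2024, 3, 1)))    # 2025-03-01, fine
print(expiry_one_year_later(date(2024, 2, 29)))   # ValueError: day is out of range for month
```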
So we ran these three queries on a relatively small set of projects out there and we were able to find quite a few bugs,
and I was happy that the developers actually acknowledged these bugs
and we were able to successfully file them on GitHub.
It's always good to see your research go find bugs in the real world that way.
And essentially, what they showed was this:
this is evidence that all the hypotheses and theories
that we developed in the paper actually hold true.
And I hope this is incentive for more people to go build more tools out there.
And I hope these things help spark
more research in this area.
But coming back to your question, right?
What more did we do about finding bugs?
I'm sorry, could you repeat your, like, remind me again what your question was?
Sure, sure.
I was wondering, actually, you answered most of the question.
I was wondering how you went about finding the bugs.
And I was also, in my mind, I was also going to ask if you tried to fix some of them,
you kind of, you know, answered that question by saying you reported those bugs to the developers
and hopefully they're on the way of being fixed. You mentioned that your approach to finding
those bugs is to rely on static analysis and CodeQL, which is a great, straightforward way
of finding many interesting bugs in software. I guess one question that I have,
and you may have answered this in the paper
is, did you find any limitations
of your approach,
your static analysis,
bug-finding approach?
Are there any time-related bugs
that you could not detect?
Or maybe you had
a little bit more false positives
than you'd have liked to have with this technique?
Can you tell us a little bit about that?
Yeah, yeah.
So that's a great
question actually. So one of the insights that we had in our paper, which I briefly alluded to in
my previous answer, is that there are actually quite a lot of different error patterns that
exist. And they can be as simple as making use of outdated APIs, or they could be as
complex as, you know, naively using date time objects where you should have been using
time zone aware date time objects. And what we realized is that for each of these different
patterns, you need to have a lot more information than just the code to be able to discern
true positives from false positives. So this information could come in the form of documentation,
it could be code comments, or it could be project specifications, trying to understand what the
project is trying to do. If the project is designed to run locally on your system and that's it,
probably it's okay to not use time zone aware date time objects. And that's like one of the
biggest limitations with static analysis. I mean, you have experience with static analysis,
right? It's always a trade-off. We want analysis that are both sound and complete,
but more often than not, we can't have both. And many times we can't have either. So it becomes
a game of balancing false positives and true positives.
So while we showed in our paper that static analysis is capable of finding bugs,
it produces a lot of false positives.
For example, the query that was detecting fixed offset time zones in code,
returned 24 results to us for a set of thousand projects,
and only one among the 24 was a true positive,
and we had to, like, manually sift through all the others.
So one learning here is that, of course,
we can improve static analysis and make it better to reduce the false positive rate.
But code context is quite important when it comes to date and time applications.
And that's one of the limitations of static analysis.
Maybe AI can help us there.
You know, maybe AI can process documentation, code comments, or project specifications
to, you know, tell false positives and true positives in code.
and although I do not have, like,
concrete numbers to back my claim about AI,
I'm confident, because
AI has been able to help
with different kinds of bugs in other domains
and has been able to rightly identify bug patterns.
And I guess this has been your experience too, right?
I read your paper on retry bugs at Microsoft,
where you developed CodeQL queries
and also prompted LLMs to find bugs,
and from your paper, like,
the LLMs did as good a job as
CodeQL in finding true positives
at Microsoft.
So, yeah, I guess that's one key takeaway for us
that any date and time tool,
bug-finding tool that is going to be developed in the future
has to be able to deal with not just the code,
but more about the project and other code contexts.
Because without that, it's just extremely hard to tell true positives
from false positives.
Maybe you can do that for simpler patterns,
but if you want to tackle the serious big ones,
for sure, you're going to need more than just code.
Right. Yes.
You mentioned our previous work on retry bugs.
This is exactly what we discovered: that right now
it's no longer enough to reason about
how individual program statements are interacting
with memory, with each other,
whether or not they're in order or not or whether that matters.
But now you have to go one level of abstraction higher
and think about these functionality or mechanism bugs
where, as you mentioned, you have to take into account policies,
you have to take into account developer intent,
you have to take into account documentation
in addition to what the actual implementation is.
And yes, I, you know, I'm skeptical about
AI being able to replace us completely, but I'm really excited about AI helping us as a tool,
like you mentioned, to find patterns, or distill this developer intent you were talking about.
Yes, I'm right there with you.
And I think, you know, this is a great use case.
As long as you can validate what these tools are outputting or answering,
right? I think this is a great use case to, as you mentioned, mitigate false positives.
So if I am a Python developer today and I have a service that uses any kind of, or deals with any kind of date and time and calendars,
and I am to read your paper, what should I be mindful of?
How can I leverage your study to improve my implementation, my design, or at least my awareness of these kinds of bugs?
I guess what I'm getting at is whether you have any advice and lessons learned for software engineers when they are dealing with this type of code.
Yeah, that's a great question.
In fact, this was something that my advisor would also ask time and again
when writing the paper.
It's always satisfying to read a paper that provides insights,
but also follows up with actionable advice
that can benefit people reading the paper.
So I'm just going to tell you at a high level what we have in the paper.
So if people are interested, you should like go learn more.
But like at a high level, what we found was, you know,
one of the things that tripped up developers the most was
concepts that were related to time zones.
And more specifically, developers made mistakes
while constructing date and time objects.
Also, our analysis revealed that
there are different patterns
and it's not always easy to know
what you should be testing for.
But if there is something that you can do
as a developer today, that's probably testing.
because there are no bug-finding tools that exist out there.
So the best option that you have is to just thoroughly test your code.
In our study, we found that around 60% of the bugs can be detected by just executing the code with any input.
Of course, like the projects that we studied had very poor code coverage,
and hence they could not find these bugs.
The solution for finding, you know, like low-hanging fruit,
is to just implement a very comprehensive test suite that covers all critical date and time computations in your code.
You have to ensure that your testing framework can control as many randoms in your code as possible.
So make random number generation deterministic.
If your code reads the system clock, then mock out those APIs; try to control the time zone in which your tests run.
There are some useful libraries out there that can help you do this, like freezegun or libfaketime
for Python and just using a testing framework for deterministically executing code with like
controlled environments and deterministic inputs can go a long way in ensuring correctness.
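As an illustration, a minimal sketch assuming the third-party freezegun package (the tomorrow helper is hypothetical): pinning the clock lets a test exercise a specific calendar edge case deterministically.

```python
from datetime import date, timedelta
from freezegun import freeze_time  # pip install freezegun

def tomorrow() -> date:
    return date.today() + timedelta(days=1)

# Pin "today" to the last day of a leap year so the test is reproducible
# no matter when, or in which time zone, the suite actually runs.
@freeze_time("2024-12-31")
def test_tomorrow_crosses_year_boundary():
    assert tomorrow() == date(2025, 1, 1)
```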
Of course, this is like just one class of bugs, right, where you get an error as soon as you
execute the code.
But we can go further.
In our study, we found out that around 32% of the bugs can be caught easily when the code
is executed with the right inputs.
And again, by right I mean
bug-triggering inputs,
and this is important
because these kind of bugs
are very evident
like you provide the right input
right edge case
and the program is going to crash
or it's going to throw an error message
and that's easily catchable
in a testing environment
so what can you do about that
maybe you know you have code
that deals with
date and time strings
and tries to store them in a database
what you can do
is you can try to test for reversibility.
Maybe generate a lot of date and time strings,
store them in the database and try to read them back.
And assert: is the value that you stored and read the same?
If not, maybe there is some mistake in your code
where it's trying to parse or store incorrectly.
And there could be many other properties about your code
that you could encode as assertions in your testing framework
and then perform property-based testing.
And as you may know,
there are popular property-based testing tools
like hypothesis that can help you
detect bugs.
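A minimal sketch of that round-trip property with Hypothesis (my own illustration, not code from the study; the store and load helpers are hypothetical): serialize a datetime, parse it back, and assert nothing was lost. Hypothesis quickly finds inputs where this naive format silently drops information, for example sub-second precision, which is exactly the kind of high-obscurity bug discussed earlier.

```python
from datetime import datetime
from hypothesis import given
from hypothesis import strategies as st

ISO_LIKE = "%Y-%m-%dT%H:%M:%S%z"

def store(dt: datetime) -> str:
    return dt.strftime(ISO_LIKE)

def load(s: str) -> datetime:
    return datetime.strptime(s, ISO_LIKE)

@given(st.datetimes(min_value=datetime(1970, 1, 1), timezones=st.timezones()))
def test_store_load_round_trip(dt):
    # Fails for datetimes with microseconds, because the format drops them.
    assert load(store(dt)) == dt
```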
Also, sometimes you have simple
bugs, right? Like, maybe you
are providing inputs to your
program, but your APIs are not designed
to handle this. For example,
there was this bug that I ran into
where the software was trying to
account for different buildings in a city.
And the buildings could be older
than the 1900s.
Maybe they were in the 1800s.
but the Python API strftime
does not support any dates that go before the 1900s.
And the program just fails.
The developers didn't know this when they published the code.
So what do you do about these bugs?
And the answer is simple, right?
Like if you develop a testing framework that can make your system deterministic
and that does property-based testing,
then you already have most of the pieces of the puzzle.
All you need to do is just enable fuzz testing
and generate random inputs.
maybe these inputs could be extreme cases,
and you don't even need an oracle to find these bugs, right,
because your program is just going to crash
and that in itself is enough for you to see that
hey there's a bug in my code because my code should never crash
and I don't know if
I know you know this but I don't know if the listeners know this
but in all of software engineering research,
fuzzing has proven to be the most effective
bug finding strategy in the real world, period.
So you should really take
the advantages of fuzzing
seriously and invest in it,
because it has a very high return on investment.
So I would like to conclude by saying
if you're a developer today,
you have to continue learning about different date
and time concepts.
Read our paper, try to understand, get a feel
for what this domain is like,
understand more about the library functions
that you're using, their designs, their limitations.
And most importantly, to ensure that your code runs
the way it's supposed to run, perform thorough testing because that's the best you can do today.
Until, you know, hopefully somebody from academia like us or you come up with better tools
for finding date and time bugs.
Awesome.
I think these are all great, great, great lessons that any developer working in this space
should be mindful and should pay attention to.
And hopefully, you know, next time we'll have this conversation with your next project,
the number of bugs that you observe,
at least as bug reports on GitHub, shrinks considerably and, yeah, makes it more manageable
for future PhD students working in this space. Because as you mentioned, going from 26,000 reports to
just a couple of hundred bug reports that you analyzed has been quite a task. So I'm still,
you know, in my mind, that's one of the things that I'd like.
I love the most about this paper,
the methodology of distilling and filtering
and getting to the core of this problem.
So this is, this is great.
And it's been a fun paper to read.
And which brings me to,
and I know you just published
and it's at least in my,
you know, when people are talking,
when I talk with people about my work
and then they ask, okay, you did all this great work and what's next.
You know, I feel a bit of a brain freeze like, oh, okay, now I have to come up with the next project, right?
That's the life of a researcher.
So I'm going to turn the tables and ask you this question.
And after all this great research, this award-winning research and with awesome insights, what's next?
have you thought about expanding the work or the technique or maybe the methodology of filtering
out and weeding out reports that are uninformative?
Yeah.
Firstly, thank you so much.
Yeah, I think your first best paper award or distinguished paper award is always special and
I'm quite happy about it.
About broader implications.
So in our study, we focused only on Python.
and what we observed was that Python captures a lot of different bug patterns
that also exist in other languages.
Sure, the distribution of the bug patterns may vary,
but we believe that most of these patterns that we describe generalized to other languages.
It would be interesting if people perform like in-depth studies
for other popular programming languages like Java or JavaScript
because there are some crazy ways in which those libraries behave too.
In fact, funny story, I recently came across this website
whose whole purpose was dedicated to quizzing people
on what they would think
would be the output of a specific JavaScript code
that involved the JavaScript Date library,
and I'm ashamed to say that,
despite all this research, I was only able to get
two out of the 15 questions that I had tried
correct. And this is really hard stuff,
what can I say. But other than that,
I really hope that this study sparks
an interest in the bugfinding community
to develop new tools
that help developers catch bugs early on.
Like I mentioned, there's a lot of scope
for developing static and dynamic analysis tools,
static tools with AI, dynamic analysis tools
in the form of testing.
I mean, there's even scope for, you know,
adding date and times as first-class citizens
in SMT theories to enable formal verification.
So that's some of the work that we are doing at the PASTA Lab, right?
one of the projects that I already mentioned
was trying to find bugs in libraries themselves
because if a library has a bug,
then potentially all software that depends on that library
also has a bug in it.
And I think research of this kind
has been pretty successful in other domains,
like compiler testing or database testing,
and we are trying to do the same for date and time libraries.
That's an ongoing project.
A new project that we are looking
at is: can we support
date and time types
in SMT theories to enable
formal verification of software that deals with
date and time in safety critical systems.
So there are, there's a lot of work
left to be done.
I love finding bugs.
I love making software more reliable.
And I am very sure that
just one person or one lab
can't achieve everything that's out there.
So I really hope that
this kind of study sparks an interest
in the community to look at this
understudied yet very important
domain of bugs and I
hope it serves as a stepping stone for more
future research work. Yeah.
I truly
hope that that's going to be the case
because this is again such
an interesting topic
and I may be
mistaken and you please correct me if
I'm wrong but I don't
remember this type of
bugs being studied in the past
and which brings me to my next question,
how did you come up with this project
or not necessarily the work itself,
but I'm interested in how do you think about starting a new project,
how you go a new research project,
how do you go about it, what's your creative process,
how do you generate ideas?
Yeah, yeah, that's actually,
really good question. And I think I almost have like a philosophical answer to this question
because I strongly believe that people do their best work when they have a sense of purpose or
calling. And for me, I am very clear about what I want. I have always wanted to have impact
through my work. When I started my PhD, I knew that I did not want my research to end up in
an academic paper, but rather in the real world. And in some sense,
this sense of purpose is what has driven my research career.
Let me explain how.
Like, a lot of people start by reading research papers and finding gaps in existing literature.
Somehow that never really clicked for me.
Of course, it's important to build a strong foundation of existing methods by reading papers,
but that's where it stops for me.
I usually start by looking at problems rather than limitations.
How do I find problems?
I talk to developers.
I look at bug reports from companies.
I read post-mortems of outages,
and I see what kind of bugs exist out there that cost millions of dollars.
To me, any bug that has the potential to bring down production
and cost millions of dollars automatically becomes a bug that I'm interested in.
I guess that's how I'm wired.
But once I have my eyes set on the problem,
I then go read research papers to see what exists.
and what are the gaps.
But as you may know, as a fellow bug-finding researcher,
that's not where it stops,
especially because once you have a bug,
it's important to familiarize yourself with the bug.
So that involves reading a lot of bug reports,
looking at a lot of code fixes,
and trying to understand the different ways in which the bug can manifest
and what it would take to catch this bug.
And the rest is just trusting the process, right?
You do the work, you lay the dots,
and hopefully they'll connect as you go along.
that's predominantly what my research or creative process looks like. Oh, I think there's another aspect of
it, which is just talking to your peers and discussing ideas. I've seen wonders happen when
people from different backgrounds share ideas and discuss research, because it could be as small as
finding a simpler way of implementing a piece of code or discovering a whole new approach to solve
the problem. But in either case, always talking to your peers is something that has helped me a lot.
So in a nutshell, yeah, that's how I operate
and that's how I'm wired to look at research problems.
Awesome.
So you tackle a real-world problem or bugs that affect real system
that have severe impact.
You talk with your peers and then you, once you found the bug
and once you generated some initial ideas,
you drill down and find the gaps in the existing literature that could, you know, help you fix
those bugs that haven't been fixed before.
You, I think you summed it up perfectly.
That's the recipe for award-winning research as this research is.
So for listeners, you know, please take notes from Shrey.
which brings me to, you know, the last kind of last question on the menu today.
And I think it's about time, if I may make a pun,
to tell listeners a key takeaway that you want them to remember from your work,
from this paper in particular, from your transition from industry
to research, or your creative process,
and the way you approach finding new bugs to tackle.
Yes, sounds good.
So maybe let me first talk about my research in general,
which is about software reliability, right?
To state the obvious software is ubiquitous
and is growing at an exponential rate,
but also software is very buggy,
especially with, like, the advent of AI code generators becoming the norm,
it is increasingly important to ensure software correctness.
Like, while we still can't prove correctness of industry-scale software,
the next best thing that we have to ensure our software is free of bugs,
is to make use of bug-finding tools.
And in my experience, the use of bug-finding tools in industry
is not as widespread as I had hoped for.
For example, recently I attended this developer conference called Bugbash that was hosted by this company antithesis.
And there was a question that was asked of the audience: how many of you use fuzzing to test your code at your company?
And only 10% of the people raised their hands. That feels like a big gap.
Like, apart from big tech, or maybe even parts of big tech, engineers continue to ensure software correctness through rudimentary techniques like stress testing or unit testing with
hard-coded values or manual testing.
This is not good.
I would strongly urge developers
to invest more time
in learning and adopting
well-established bug-finding techniques
or at least improving the existing
test suites that they have.
It may seem like a waste of time at first.
In fact, a developer once called it
a distraction from a distraction
because the goal of developers is to write features.
Writing tests is a distraction
and investing in harnesses for fuzzing
is a distraction from a distraction.
But trust me, it's worth it.
And talking specifically about date and time bugs,
if you're writing software that deals with date and time,
I would say just assume it's wrong.
That's the best mindset to be in when you're writing
any code that's related to date and time.
There are just so many nuances and edge cases
that there's always likely some assumption that you made
that may not hold in production.
And since there are no date and time related bug finding tools,
the best you can do is just thoroughly test your code.
Fuzz test with random inputs.
Try using property-based testing.
Execute your code in sandboxes
where you can control time zones and date times.
And as our paper shows,
many of the severe bugs can be caught
with simple testing techniques.
You don't have to invest a lot.
So the return on investment is quite significant.
So yeah, that's like the biggest takeaway that I have.
Please be mindful while writing date and time code,
thoroughly test it and continue to learn about the different nuances of date and time systems.
So yeah, that's it from my end.
That's, I think that's a great tagline to add to our entire bug finding field.
If you're a developer and deploy code in production, please, please use bugfinding tools
before doing that.
It's worth it.
Shrey, thank you so much for being here with us.
And I cannot wait to hear about and read about your next project.
I know you have a few things cooking, and I'll be following software engineering conferences to spot your name there.
Thank you so much for joining us today.
Thank you so much for having me. I had a great time.