Disseminate: The Computer Science Research Podcast - Shrey Tiwari | It's About Time: A Study of Date and Time Bugs in Python Software | #64
Episode Date: September 23, 2025

In this episode, Bogdan Stoica, Postdoctoral Research Associate in the SysNet group at the University of Illinois Urbana-Champaign (UIUC), steps in to guest host. Bogdan sits down with Shrey Tiwari, a PhD student in the Software and Societal Systems Department at Carnegie Mellon University and member of the PASTA Lab, advised by Prof. Rohan Padhye. Together, they dive into Shrey’s award-winning research on date and time bugs in open-source Python software, exploring why these issues are so deceptively tricky and how they continue to affect systems we rely on every day.

The conversation traces Shrey’s journey from industry to research, including formative experiences at Citrix and Microsoft Research, and how those shaped his passion for software reliability. Shrey and Bogdan discuss the surprising complexity of date and time handling, the methodology behind Shrey’s empirical study, and the practical lessons developers can take away to build more robust systems. Along the way, they highlight broader questions about testing, bug detection, and the future role of AI in ensuring software correctness. This episode is a must-listen for anyone interested in debugging, reliability, and the hidden challenges that underpin modern software.

Links:
It’s About Time: An Empirical Study of Date and Time Bugs in Open-Source Python Software 🏆 ACM SIGSOFT Distinguished Paper Award
Shrey's homepage

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Disseminate the computer science research podcast.
Hi, and welcome to Disseminate, the computer science research podcast.
I'm your host... no, not Jack Waudby this time.
My name is Bogdan Stoica, and I'll be guest hosting today's episode.
Today I'm joined by Shrey Tiwari, who is midway through his PhD at CMU, advised by Rohan Padhye.
Shrey has recently published a paper on date and time bugs called It's About Time,
an empirical study of date and time bugs in open-source Python software.
This paper appeared in the International Conference on Mining Software Repositories this year
and won a best paper award there.
It's my pleasure to have Shrey here with us.
Shrey, welcome to the show.
Hi, Bogdan.
Thank you so much for having me here.
I'm excited.
So Shrey, do you want to tell us a little bit about your research?
And tell us a bit about your PhD journey so far.
Yeah.
So I can start off by telling you how I ended up at the PASTA Lab in general.
So I graduated in the year 2020 and started my job as a software developer at this company called Citrix.
And Citrix is one of the leading players in the VPN Solutions market.
And that's important because when I started my job, COVID was at its peak.
Work from home was becoming the norm.
So the demand for VPN solutions was growing rapidly.
So around the year that I spent at Citrix, it was extremely intense.
I had a lot of feature developments, multiple production deployments, customer issues, and so many bugs.
The battle scars that I gained at Citrix are what make me care about software reliability so much.
While I was at Citrix, I started to wonder, you know, is there something
that I can do to make software more reliable in general rather than just making the product
that I'm working on more reliable. And that's when I started looking at more opportunities
outside of Citrix. And mind you, this is not when I had a PhD on my mind at all. But
serendipitously, around that time, an opening at Microsoft Research showed up at the cloud reliability
team led by Dr. Akash Lal. And I was lucky enough to join as a research fellow. For those of
you who don't know, a research fellowship is essentially a pre-doctoral position. So I spent the next two
years at MSR working on some important programming language problems that were, you know, crucial
to Azure. And that's what made me realize that I enjoy research a lot. And I should probably
consider PhD as a career option. And around the time, I started applying for PhDs. But as you
may know, there are just so many incredible labs across the world.
and it's really hard to pick and choose a handful.
But I chose the PASTA Lab for a couple of reasons.
Firstly, you have to admit that it's one of the coolest names a lab can have, right?
And the slogan also goes like,
we are not afraid of spaghetti code.
And I think that's quite fitting.
But apart from that, I think the PASTA Lab was probably one of the few labs
that actually had a very explicit ethos listed on their website.
And let me see if I can pull it up right now. It goes something like: for science to make rapid and sustainable progress,
we believe that academic research should be made openly accessible, reproducible, and where
artifacts exist, also reusable. And that was something that really stuck with me, because I've
always been driven by impact and change in the real world. So that's how I ended up at the PASTA Lab
doing software reliability research. And the past couple of years have been great. Yeah.
Cool. Great, great, great stuff. So you had a grown-up job and then decided that, you know, it's time to make some broader impact. I like it. I think this is becoming the norm. I don't know. I graduated recently, but I started my PhD back in 2019. But now when I talk with more PhD students that started more recently,
they all have the same arc.
They went to industry.
They gained some experience.
They found a problem they're excited about and started working towards it as a PhD student.
Great stuff.
So speaking of that, what actually drew you to time-related bugs?
I know you have been working on this topic for quite some time.
You have a paper.
You have some ongoing research.
Or maybe I did jump a little bit ahead.
Would you mind telling us a little bit about your research
and how you ended up working on this particular type of bugs?
Yeah.
So it's actually an interesting story.
So back in November 2023, when I just started my PhD,
CMU was sending out invitations for summer REU projects,
and essentially that stands for Research Experiences for Undergraduates.
And Rohan was like, hey, do you want to mentor a few REU students for the upcoming summer?
And I was like, yeah, sure, I'd be excited and I'd love to.
But at the time, I was working on this concurrency project called Fray,
and we really couldn't find anything that could wait until the next summer.
So that's when me and Rohan got to brainstorming about different ideas, and Rohan was like,
hey, have you ever thought about date and time bugs?
And I was like, yeah, I faced quite a few when I was at Citrix.
And ever since different kinds of YouTube videos on confusing date and time concepts
or developer blogs keep popping up in my feed.
So since this was like an area of common interest,
me and Rohan just wrote up a naive project draft for the summer and submitted it.
It got accepted. But then later, as in the next semester, we continued looking into this problem
more, and we were surprised to see that there are just so many different cases of date and time
bugs that are out there. And this was quite interesting because we're like,
this is such an understudied area, there are so many problems, it affects so many people, can we do
something about it? And Rohan went ahead and actually wrote an NSF grant for this whole project,
which later got accepted,
and that was nice.
then come summer
I had three incredible
REU students who joined the lab
and we set out on a project
to build a tool
to find date and time bugs,
but we soon realized that
hey, what even are date and time bugs?
like we couldn't even define the problem
and the project
kind of pivoted into studying
what's out there
understanding the landscape
and that's what resulted in the paper
I'm quite glad how things turned out
and I really hope that this research kind of benefits
the academic world
and also the broader community in general.
Cool.
That's awesome.
So you had encountered this type of bugs
while you were at Citrix, so in industry.
So this is not just an academic exercise.
These bugs are plaguing real
software that has, you know, a wide base of users, right?
So why do they happen in the first place?
Because when I read your paper, which is a great paper, by the way,
I invite the listeners to go and read it.
It's super accessible.
It's great, great stuff.
When I was reading the paper, the only thing that I could think of is, okay,
these are very straightforward patterns.
So why do developers keep making those mistakes?
They have been studied extensively in the past.
They've plagued software extensively in the past.
We know how to fix them,
or at least we think we know how to fix them.
Why are they still there?
And why are they so hard to avoid and later detect?
Yeah. Yeah.
That's a great question.
So I would say that date and time bugs occur in software
when developers bake some kind of assumptions about date and time concepts
that may not hold in production.
So let's take a simple example.
Let's do a fun exercise where I ask you a few questions.
Let's assume that you are a developer who is trying to develop a calendar application.
Now, assuming that, let me ask you a few simple questions.
Don't overthink them, but just tell me the first thing that comes to your mind.
How many hours do you think are in a day?
24.
Okay.
And how many seconds are in a minute?
60.
Okay.
And the last one may sound a little crazy, but please bear with me.
Do you think you can time travel, at least as of now?
Are you asking me if I believe in time travel or if I can time travel right now?
Oh, we can dedicate another podcast.
Let me rephrase the question.
Do you think time always increases monotonically?
As a human being, yes.
As a computer scientist, not really.
Great, great, great.
That's actually the right answer, right?
So first question, you said 24 hours, that's maybe not really true because there's at least one day in a year that has 23 hours and one day in a year that has 25 hours if you're living in a time zone that observes daylight savings.
And not every minute in a year needs to have 60 seconds, right?
Maybe there's this concept of leap seconds where, you know, people around the world add one more second to a year just to make sure that solar
time aligns with, you know, local time.
So there's these, there's so many fundamentally complex things about date and time that
people do not understand.
I'm glad that you know that time does not always monotonically increase on computer systems.
But a lot of people would just say, yeah, it always does.
And these kind of things can lead to a lot of problems in production.
So for example, recently there was like an incident at Cloudflare,
and essentially what Cloudflare was trying to do
was it was trying to measure the RTT of their packets,
and they were doing so by taking the timestamps.
But because Cloudflare is so fast,
once when there was a leap second,
their RTT went negative
because of the clock syncing,
and their system was not designed to handle
an RTT that was negative.
They didn't expect responses to come before requests,
and that's crazy,
because this was a proper production
outage and they had to release a post-mortem report on this.
So these are like things that developers actually bake into production.
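For listeners who want the concrete pattern, here is a minimal sketch in Python (my own illustration, not Cloudflare's actual code) of why measuring durations with the wall clock is risky and what the safer alternative looks like:

```python
import time

# Wall-clock timestamps can step backwards (NTP corrections, leap seconds),
# so an elapsed time computed from time.time() can come out negative,
# which is exactly the "response arrived before the request" situation.
start_wall = time.time()
work = sum(range(100_000))                  # stand-in for the real request
elapsed_wall = time.time() - start_wall     # may be negative after a clock step

# time.monotonic() is guaranteed never to go backwards, so it is the safer
# choice for measuring durations such as round-trip times.
start_mono = time.monotonic()
work = sum(range(100_000))
elapsed_mono = time.monotonic() - start_mono   # always >= 0

print(f"wall-clock elapsed: {elapsed_wall:.6f}s, monotonic elapsed: {elapsed_mono:.6f}s")
```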
And there are also other simple concepts that people just forget to handle, right?
Like there was an incident in Microsoft Azure where they were trying to generate
certificates with expiration dates, and the whole system just crashed on February 29 because all that
they were trying to do was assign an expiry date of one year from today.
And one year from February 29 does not
exist. So the whole of Azure came crashing down. And this was like again another public post-mortem
report. So these bugs are quite important. They can cost you a lot of money.
But there are also bugs that can cost you lives, right? Like if you look at safety critical systems
like aircraft, there have been bugs in the F-22 Raptors where, you know, systems
have crashed when the plane has crossed the international date line. And it's put the
pilots' lives in danger.
Or there have also been bugs in Epic Systems,
which is like the software that's run in almost all United States hospitals.
And that prevents people from getting surgery on time.
So date and time bugs do plague a lot of systems.
They can cost you a lot of money.
They can harm human lives.
They are really important.
And why are they so deceptive?
Basically because there are so many various date and time concepts that people don't understand.
maybe I can walk you through some of the ones that we talk about in the paper, right?
Sure, yeah, please.
So one is just string representations.
Data in times can be represented in so many different formats.
So for example, in Python, just 29th February in itself is a valid representation of a date time object.
Now, you may ask, if I pass in the input as 29th Feb, what should be the year?
should the default year be the beginning of the Linux epoch, 1970, in which case it is not a leap year,
so parsing such a date would always fail? Or should the year be the current year, in which case
parsing such a date would only succeed once every four years? And these are real problems,
these have been discussed in Python PEPs. There is also a lot of date and time arithmetic
that people need to understand about
date and time computations.
So basic math rules don't apply to date and times.
Let's take a very simple example of reversibility, right?
Let's say you have X, you add a delta
and subtract the same delta.
You get X back in math.
Not so much in date times.
Let's say you have an object that represents
31st January.
You add one month, you either get 29th or 28th
and then you subtract one month
and you don't get Jan 31st back.
so basic arithmetic rules don't apply
and you have to be cognizant of these things
when you're performing computations.
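Both of these gotchas are easy to reproduce; here is a minimal sketch in Python (the relativedelta helper from the third-party dateutil package stands in for "add one month" here, since the standard library has no month arithmetic):

```python
from datetime import date, datetime
from dateutil.relativedelta import relativedelta  # pip install python-dateutil

# 1. Parsing a day-and-month string: strptime fills in 1900 as the default
#    year, and 1900 is not a leap year, so February 29 refuses to parse.
try:
    datetime.strptime("29 February", "%d %B")
except ValueError as exc:
    print(f"parsing failed: {exc}")

# 2. Date arithmetic is not reversible: adding and then subtracting one
#    month does not bring you back to where you started.
start = date(2024, 1, 31)
forward = start + relativedelta(months=1)
round_trip = forward - relativedelta(months=1)
print(start, "->", forward, "->", round_trip)
# 2024-01-31 -> 2024-02-29 -> 2024-01-29
```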
There are also just weird geographic aspects
and there's this one funny story
that I always like sharing.
So there's this island called Samoa
that's near the international dateline
and at some point
Samoa decided that it wanted to jump
from one side of the international dateline
to the other.
So it wanted to move from a time zone
of GMT minus 11 hours to GMT plus 13 hours.
So that's essentially a difference,
a jump of 24 hours.
And that's what Samoa did, right?
Back in 2012, I guess,
Samoa went from December 30th to Jan 1st
and totally skipped December 31st.
Now, imagine you are a developer
who's writing a calendar application
or even better, a payroll application.
Where do you even begin to handle these kind of things, right?
So I think it's fair for me to say that there is a lot of knowledge out there that needs to be assimilated, organized, and, if I may, disseminated through articles, blogs and podcasts to spread awareness.
I think this is one of the things that makes these simple bugs really hard to catch.
This is very, very insightful.
So just to see if I understood correctly, you're saying that,
although we all understand dates because we're working with computers,
this may not well represent our human understanding of dates
because computers can do operations on dates that need to be mindful of various rules.
You can cross different time zones and even the fact that a day, an hour or a minute
might not have the number of units we as humans think we have.
And, yeah, so developers are human, at least for now.
Yeah.
AI will replace us all.
So these are valid, valid concerns.
And I can see how developers can easily get tricked and tripped by these.
How did you, I'm so sorry, please go ahead.
So I was just going to add that, you know,
For any kind of bug, there are essentially two things, right?
One is, or essentially maybe two steps.
First, you need to have an idea of what kind of a bug you're looking for,
and then you've got to have a plan on how to generate a workload
that could potentially trigger that bug if it exists in your program.
If you take the example of null point of exception bugs,
it's pretty simple to answer both of these questions.
A, you're looking for null point.
or exceptions. And B, you generate a lot of corner case inputs and check if any null pointer
exception is thrown by a program. This is pretty straightforward to do with existing techniques
today. But for date and time bugs, both of these questions are kind of hard to answer, right?
Hopefully with our paper, we are able to better articulate what kind of bugs we are looking
for but still it's really hard to answer the second question on how do we test our program
because there are no tools that exist today that can allow you to run your software in multiple
time zones or there is no tool that can allow you to simulate a dST change or even like test
your application with different kinds of dates. So I think that's another reason why, you know,
even simple patterns escape software testing, because there's just
no tool support for developers
out there. Right.
This is great, because actually
what you just described segues perfectly into my next
question. Somewhere in your paper
you talk about these bugs being obscure.
You introduced a concept of bug obscurity
for time-related bugs.
Can you tell us a little bit more about what you mean by that?
And maybe you can briefly explain
how did you classify those bugs?
Did you find any similarities and differences
that can eventually help,
that eventually helped you detect them?
And maybe they will serve as lessons learned
for future software engineers.
Yeah, great question.
So as part of our study,
we analyzed around seven different
factors. Three of them were specific to understanding the bug and the code. And the remaining
four were in some sense error agnostic, where we just wanted to understand the nature of
date and time bugs without worrying about any specific error pattern. The idea that we had was
these two sets of factors can help us understand date and time bugs holistically and come up
with hypotheses on how to prevent them.
So, yeah, let's talk about the four factors that we used to understand the nature of
data and time bugs.
So firstly, we had something that was known as bug obscurity, and that was essentially
a factor that had to do with, or that describes, how difficult it is to detect or trigger a bug.
And we stuck to scales such as low, medium high for all four of these factors,
just to make it easy to derive insights from the data.
For bug obscurity, any bug that was readily detectable with any kind of input would have a low obscurity.
Example of that would be making use of any outdated API.
If you pass in any input, it's going to throw an error or a warning.
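One concrete instance of that low-obscurity category (my own illustration, not necessarily one of the study's examples) is the deprecated utcnow constructor in recent Python versions:

```python
from datetime import datetime, timezone

# Deprecated since Python 3.12: any call emits a DeprecationWarning and
# returns a naive datetime with no tzinfo attached, which is easy to misuse.
naive_utc = datetime.utcnow()

# The recommended replacement constructs an aware datetime in UTC.
aware_utc = datetime.now(timezone.utc)

print(naive_utc.tzinfo, aware_utc.tzinfo)   # None UTC
```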
Medium obscurity bugs were bugs that would manifest only under certain inputs, so let's say maybe with a leap day as an input, or would occur only during a DST change, but these bugs again would throw explicit errors. And finally, we had bugs which we classified as really high obscurity, and these were bugs that would arise very rarely on edge cases, and also they would never
throw any explicit warning.
So these could be bugs where they silently produce a wrong output
and there's no way for you to know if anything has gone wrong.
And we do a similar exercise for bug severity.
We manually look at all the developer reports
and classify bugs as low severity, medium severity, high severity,
low representing some non-critical task being affected,
medium representing that
a critical task is affected
but there's a workaround
and high severity bugs are bugs
where a critical task is affected and there's just
no way to get around it.
We also wanted to
understand what it took to fix
these bugs, hopefully
trying to understand what it would
take to catch these bugs, and for that
we looked at the fix size and fix complexity.
The fix size essentially
tried to look at how big was the fix?
Was it in one line?
was it limited to a single function or was it spread across multiple files?
And the complexity essentially looked at how logically involved was the fix.
Did the fix involve changing the API?
Did it involve redesigning your whole software architecture and re-implementing multiple methods, things like that?
So one thing that I would like to note here is that we did not blindly pick numbers for all of these factors.
So we did not mine PRs and then look at the number of lines added or removed because there could be changes that may not be specific to the date and time bugs.
So we manually analyzed each of the issues and tried to isolate all the changes that were specific only to the date and time bugs, or the bugs that were under study.
So what did we find?
We found that nearly 60% of the bugs demonstrated low obscurity,
around 74% of the bugs
had medium to high impact,
82% of the bugs were localized to a single function,
and lastly, only 10% of the bugs required
extremely complicated fixes.
And these are really interesting results,
because for each factor we have a result that's not so intuitive.
This is also super interesting because understanding
the nature of these bugs can help us figure out
the right approaches to catch them
and hopefully I'll get to talk about them in the future.
But yeah, so these were like the four factors that we looked into
for understanding the nature of bugs.
There were also three other factors that we looked at
to understand more about code-specific changes.
They were mainly: what were the concepts that the developers got wrong,
what were the programmatic operations involved
in the bugs that occurred,
and what were the error patterns.
So you can think of this as like a tree structure
where at the highest level you have a date and time concept
that developer got wrong.
There are many different ways in which
the developer could get that one concept wrong.
So for example, if the developer got a concept like time zones wrong,
maybe the developer got it wrong
because he was parsing strings incorrectly
and dropping time zone information
or he got it wrong because he was doing
conversions of time zones and he made a mistake in converting. And for each of the different
programmatic operations that the developers got wrong, there are multiple ways in which they could
have failed. Let's say the developer is parsing a date time object; so maybe he went wrong
in parsing by not accounting for time zone information, or maybe the time zone information never
existed and he invented one. So this tree-like structure helped us analyze different bugs
there too. I hope I answered your question, but if not, please feel free to ask any specific
follow-ups if you have them. No, great. This perfectly describes, you know, I think the structure of your
study and the patterns. You mentioned that you don't use any kind of pattern matching
or mechanized way of analyzing these bug reports.
You are analyzing them by hand,
trying to understand what the developers did,
what the users encountered,
what the root cause,
according to the developer eventually was,
and how did they fix it?
Yes, that is kind of.
Right.
So how do you, what's your secret?
How do you analyze such a large amount of bug reports in such an efficient way?
And you distill them into concrete patterns that are easily explainable,
not only in the paper, but in a one-on-one conversation like the one we were having now.
What's your secret?
Yeah, that's a great question.
And I'm glad you ask it because in my experience,
a lot of researchers in the field of bug finding,
are more concerned with results and insights rather than the methodology.
But I believe that when it comes to studies like this,
it's absolutely crucial that you understand the methodology
and convince yourself that you agree with it
before you look at any of the results, right?
And I say this because getting our high-quality data set
was not particularly easy.
We had to perform multiple steps of mining and filtering
to get the most likely date and time issues,
right? So the way we approached the problem was: first, we went to GitHub and we looked at all the
relevant software projects in Python that would help us out. So that meant looking at projects
that were created in the past 10 years, projects that had a certain number of stars, and also
projects that made use of date and time code. And once we performed that filtering step,
we were left with around 22,000 Python projects, which is a big
number, and that's not even looking at the issues, right? So for each of these 22,000 projects,
we mined all of their GitHub issues and we looked for keywords that are related to time,
like, you know, nanoseconds or milliseconds, or we looked at any date and time Python methods that
were being called in the code, like strftime or utcnow or fromtimestamp, and we were
looking for, like, general date and time concepts being mentioned in the issue description, like
time zones, so UTC, DST.
And after performing this keyword matching step,
I think we had around 26,000 issues,
which is like a very big number.
So again, we had to go back to the drawing board
and perform another set of filtering.
And what we did was we were like,
hey, let's look at issues that are resolved
so that we can conclusively derive some insights from each issue.
Let's look at issues that have a PR.
so that we can look at a code change and
concretely say what the code bug was.
And also let's look at issues that have sufficient information in them, right?
Many times, issues just have a title and a question and nothing else.
So after doing all of this, we were left with a set of around 9,000 issues,
which was much better in quality, but still not manageable to analyze manually.
And that's when we pulled out a trick from,
something that's quite popular
in information retrieval systems
and that was the idea of applying
TF IDF or term frequency
inverse document frequency
to essentially find out
among these 9,000 issues
what issues are most likely to be
date and time issues, right?
Because even in the 9,000 issues that we had
we could have a performance bug that mentioned something
like elapsed time and that would get pulled into our
data set because of keyword search.
And for those of you who don't know TF-IDF, I would say, please go check it out.
It's a very interesting technique.
But at a high level, what TFIDF tries to do is it tries to prioritize issues with multiple
keywords that are present more often in date and time bugs.
And after applying this, we were able to sort our issues and go top down to find all the
interesting date and time bugs.
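To make the idea concrete, here is a minimal sketch of TF-IDF-based ranking (my own toy illustration using scikit-learn, not the study's actual pipeline): issues whose text shares more, and rarer, vocabulary with a date-and-time keyword list float to the top.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

issues = [
    "strptime fails on 29 Feb because the default year is not a leap year",
    "Elapsed time regression: benchmark is twice as slow after the refactor",
    "Timezone conversion drops the DST offset when reading UTC timestamps",
]
datetime_vocabulary = "timezone utc dst leap second strptime strftime timestamp datetime"

# Vectorize the issues together with the keyword "document" ...
matrix = TfidfVectorizer().fit_transform(issues + [datetime_vocabulary])

# ... and rank each issue by its similarity to that keyword document.
scores = cosine_similarity(matrix[:-1], matrix[-1]).ravel()
for score, text in sorted(zip(scores, issues), reverse=True):
    print(f"{score:.2f}  {text}")
```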
And because we're doing this manually, we were only able to create a data set of 150 bugs,
but because it's such high quality, we are able to derive really good insights.
And one question that you may have right now is, hey, Shrey,
did you try using AI to automate the classification of issues?
And that's a fair question.
We actually did perform a systematic study of AI's performance.
And what we actually did was we took the first 100 or so bugs that we had manually classified
and we used GPT-4o to try and classify the same bugs
and compare the labels generated by GPT-4o against ours.
We found that it was only 60% accurate at the time
and even with advanced prompting techniques
and breaking down the problem,
it was barely able to achieve 70% accuracy.
And we just deemed that unacceptable.
In my opinion, if you ever want to do automated analysis of data,
the accuracy has to be really high.
or errors can compound over time.
But in short, that's how we ended up
with a data set of 152 true positives
that anybody can look at today.
Yeah.
Awesome.
And this data set is public, right?
Yes.
Yes.
Great, great stuff.
You mentioned AI and using AI to classify them.
Do you think, and I know this kind of
goes into the part where you talk about how to detect them.
But maybe before we jump to that, do you think AI can help in any way mitigate some of
those issues while developers are designing or writing their code?
Have you thought about this kind of use case?
Yeah.
It's very interesting that you ask that.
Because in some sense, that's the project that I'm working on currently
in my PhD,
and as it turns out
AI has been trained on
a lot of code that's out there
so it means that it has been trained on
multiple different date and time libraries
that exist
that may be similar but subtly different
and what I'm trying to do in my current project
is that essentially the goal
of the project was to see if we can
determine or find bugs in date and time
libraries. So we have two key insights here. All date and time libraries have functions that
accept inputs that are easily generatable using generators. And the second key insight is that
for any library API call foo that accepts two date times, DT1 and DT2, we can generate DT1 and
DT2, but we do not necessarily know what the output of foo should be. But what we know is that
if there is another library that offers the same function foo,
both of them have to give similar outputs, right?
Because at the end of the day, the function is the same.
So in some sense, we started performing something known as AI-assisted differential fuzz testing
on all of this code.
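To give a flavor of what differential testing over date and time code looks like, here is a minimal sketch of my own (not the project's actual harness, and both helper functions are hypothetical): a hand-rolled "add one month" is fuzzed with random dates and cross-checked against dateutil's relativedelta, which plays the role of the second implementation.

```python
import random
from datetime import date, timedelta
from dateutil.relativedelta import relativedelta  # pip install python-dateutil

def add_month_naive(d: date) -> date:
    # A common hand-rolled version that forgets months have different lengths.
    return d.replace(month=d.month % 12 + 1, year=d.year + (d.month == 12))

def add_month_reference(d: date) -> date:
    return d + relativedelta(months=1)

random.seed(0)
for _ in range(1_000):
    d = date(2000, 1, 1) + timedelta(days=random.randrange(20_000))
    try:
        naive = add_month_naive(d)
    except ValueError as exc:
        print(f"naive version crashed on {d}: {exc}")
        break
    reference = add_month_reference(d)
    if naive != reference:
        print(f"disagreement on {d}: {naive} vs {reference}")
        break
```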
And what we started to realize is that AI just assumes things on behalf of users.
So you could say that, hey, give me a function
that computes the number of days between two dates.
Now, the AI is going to go ahead and give you a piece of code.
And that piece of code is going to work well 99% of the times.
But there might be a subtle edge case bug that does not handle daylight savings.
And it's very hard for users to know that that bug exists because there's no tool to help test the code.
So in my opinion, AI can help you understand APIs, but if you were to
blindly use AI to write date and time code, you're very likely to run into
edge-case behaviors in production. And that's scary territory for me.
Great. This is actually great. You heard it here on Disseminate first, folks: AI
cannot help you write bug-free code. In fact, it introduces more bugs, which, I don't know,
is bad for the users
but great for
you, Shrey, because you're
doing research in this area
and that's your bread and butter
I'm also doing
debugging by choice so I'm a
software reliability person
and yeah sounds like
with AI we will not
run out of a job or at least we will
not run out of bugs to fix
that's
that's good to hear
at least we're not replaceable yet.
There's still a need for bug finders like yourself, right?
I want to bring the conversation back to the date and time bugs that you studied in this project, in the paper.
And I would love to get back to future work and how you can use AI as a cog in a larger mechanism to find
broader types of date and time bugs.
But going back to the paper, you have identified all these patterns.
You have identified, I'm assuming, some common or similar root causes or at least some
classes of root causes.
How did you fix them?
That's a great question.
So in our research, we did not necessarily focus on fixing
these bugs; rather, we focused on finding them so that developers could go fix them based
on their application context.
There are a few things.
Let's try to break it down.
So firstly, we, like I said, right, we analyzed seven different factors and we identified
the different kinds of concepts that developers got wrong.
We looked at different kinds of programmatic operations that developers got wrong.
we identified error patterns.
We also had four other factors
that were describing the nature of the bugs.
So even before going out and building rules,
we developed some hypothesis.
We tried to study the correlation
between bug severity and bug obscurity.
And what we found was that
there was a slight negative correlation,
so around 0.11.
And this meant that severe bugs were not necessarily hard to detect.
Also, we found that a lot of bugs were localized to a single function.
So this meant that even simple static analysis tools could potentially help you catch
date and time bugs.
And not just catch any date and time bugs, they could help you catch date and time bugs that
potentially had high impact.
and this was like a great hypothesis to have,
but again, it's just an idea, right, until you show it.
So that's what we did next:
we went ahead and we developed like a few static analyses.
Most specifically, we developed static analyses for three patterns.
In our study, we found that
the thing that tripped up the developers most
was construction of date and time objects,
which in itself is a very surprising thing,
and maybe we can talk about that later.
But we developed three analyses to focus on this specific problem
of date and time construction.
So firstly, we looked at use of outdated APIs
for constructing date and times.
And this is like a simple problem, right?
It's not dependent on application context.
You could potentially just use string grep in the code base
to find use of outdated APIs,
but we made use of CodeQL.
And for those of you who don't know CodeQL,
it's a framework that helps you develop
static analyses in the form
of queries. The next analysis that we developed looked at, you know, use of custom fixed time zones
in your code. So if your code is making use of a custom time zone that has a fixed offset,
it's not necessary that your code is correct. If you're running your code in a time zone that
observes daylight savings, then your offsets have to change with change in daylight savings.
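The distinction is easy to see with the standard library's zoneinfo module; a minimal sketch:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# A hard-coded fixed offset is only correct for part of the year.
fixed_eastern = timezone(timedelta(hours=-5))

# A named IANA zone tracks daylight saving transitions automatically.
eastern = ZoneInfo("America/New_York")

winter = datetime(2024, 1, 15, 12, 0, tzinfo=eastern)
summer = datetime(2024, 7, 15, 12, 0, tzinfo=eastern)
print(winter.utcoffset())  # UTC-5 in winter
print(summer.utcoffset())  # UTC-4 in summer, which the fixed offset misses
```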
And lastly, we developed another query that looked at partially replacing fields of date and time objects, right?
So let's say you had an object that represented 29th February and you only replaced the year.
You might get an object that represents an imaginary date, and when you operate on it, your program could be buggy.
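Depending on the library, that kind of partial replacement either produces a shifted date or, as with the standard-library datetime, crashes outright on a leap day; a minimal sketch of my own, where the helper name is hypothetical:

```python
from datetime import date

def expiry_one_year_later(issued: date) -> date:
    # Partial field replacement: only the year is swapped out.
    return issued.replace(year=issued.year + 1)

print(expiry_one_year_later(date(2024, 3, 1)))    # 2025-03-01, fine
print(expiry_one_year_later(date(2024, 2, 29)))   # ValueError: day is out of range for month
```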
So we ran these three queries on a relatively small set of projects out there and we were able to find quite a few bugs,
and I was happy that the developers actually acknowledged these bugs
and we were able to successfully file them on GitHub.
It's always good to see your research go find bugs in the real world that way.
And essentially, what they showed was this:
this is evidence that all the hypotheses and theories
that we developed in the paper actually hold true.
And I hope this is incentive for more people to go build more tools out there.
And I hope these things help spark
more research in this area.
But coming back to your question, right?
What more did we do about finding bugs?
I'm sorry, could you repeat your, like, remind me again what your question was?
Sure, sure.
I was wondering, actually, you answered most of the question.
I was wondering how you went about finding the bugs.
And I was also, in my mind, I was also going to ask if you tried to fix some of them,
you kind of, you know, answered that question by saying you reported those bugs to the developers
and hopefully they're on the way of being fixed. You mentioned that your approach to finding
those bugs is to rely on static analysis and CodeQL, which is a great, straightforward way
of finding many interesting bugs in software. I guess one question that I have,
and you may have answered this in the paper
is, did you find any limitations
of your approach,
your static analysis,
bug-finding approach?
Are there any time-related bugs
that you could not detect?
Or maybe you had
a little bit more false positives
than you'd have liked to have with this technique?
Can you tell us a little bit about that?
Yeah, yeah.
So that's a great
question actually. So one of the insights that we had in our paper, which I briefly alluded to in
my previous answer, is that there are actually quite a lot of different error patterns that
exist. And they can be as simple as making use of outdated APIs, or they could be as
complex as, you know, naively using date time objects where you should have been using
time zone aware date time objects. And what we realized is that for each of these different
patterns, you need to have a lot more information than just the code to be able to discern
true positives from false positives. So this information could come in the form of documentation,
it could be code comments, or it could be project specifications, trying to understand what the
project is trying to do. If the project is designed to run locally on your system and that's it,
probably it's okay to not use time zone aware date time objects. And that's like one of the
biggest limitations with static analysis. I mean, you have experience with static analysis,
right? It's always a trade-off. We want analysis that are both sound and complete,
but more often than not, we can't have both. And many times we can't have either. So it becomes
a game of balancing false positives and true positives.
So while we showed in our paper that static analysis is capable of finding bugs,
it produces a lot of false positives.
For example, the query that was detecting fixed offset time zones in code,
returned 24 results to us for a set of thousand projects,
and only one among the 24 was a true positive,
and we had to, like, manually sift through all the others.
So one learning here is that, of course,
we can improve static analysis and make it better to reduce the false positive rate.
But code context is quite important when it comes to date and time applications.
And that's one of the limitations of static analysis.
Maybe AI can help us there.
You know, maybe AI can process documentation, code comments, or project specifications
to, you know, tell false positives and true positives in code.
and although I do not have, like,
concrete numbers to back my claim about AI,
I'm confident, because
AI has been able to help
with different kinds of bugs in other domains
and has been able to rightly identify bug patterns.
And I guess this has been your experience too, right?
I read your paper on retry bugs at Microsoft,
where you developed CodeQL queries
and also prompted LLMs to find bugs,
and from your paper, like,
the LLMs did as good a job as
CodeQL in finding true positives
at Microsoft.
So, yeah, I guess that's one key takeaway for us
that any date and time tool,
bug-finding tool that is going to be developed in the future
has to be able to deal with not just the code,
but more about the project and other code contexts.
Because without that, it's just extremely hard to tell true positives
from false positives.
Maybe you can do that for simpler patterns,
but if you want to tackle the serious big ones,
for sure, you're going to need more than just code.
Right. Yes.
You mentioned our previous work on retry bugs.
This is exactly what we discovered: that right now
it's no longer enough to reason about
how individual program statements are interacting
with memory, with each other,
whether or not they're in order or not or whether that matters.
But now you have to go one level of abstraction higher
and think about these functionality or mechanism bugs
where, as you mentioned, you have to take into account policies,
you have to take into account developer intent,
you have to take into account documentation
in addition to what the actual implementation is.
And yes, I, you know, I'm skeptical about
AI being able to replace us completely, but I'm really excited about AI helping us as a tool,
like you mentioned, to find patterns, or distill this developer intent you were talking about.
Yes, I'm right there with you.
And I think, you know, this is a great use case.
As long as you can validate what these tools are outputting or answering,
right? I think this is a great use case to, as you mentioned, mitigate false positives.
So if I am a Python developer today and I have a service that uses any kind of, or deals with any kind of date and time and calendars,
and I am to read your paper, what should I be mindful of?
How can I leverage your study to improve my implementation, my design, or at least my awareness of these kinds of bugs?
I guess what I'm getting at is whether you have any advice and lessons learned for software engineers when they are dealing with this type of code.
Yeah, that's a great question.
In fact, this was something that my advisor would also ask time and again
when writing the paper.
It's always satisfying to read a paper that provides insights,
but also follows up with actionable advice
that can benefit people reading the paper.
So I'm just going to tell you at a high level what we have in the paper.
So if people are interested, you should like go learn more.
But like at a high level, what we found was, you know,
one of the things that tripped up developers the most was
concepts that were related to time zones.
And more specifically, developers made mistakes
while constructing date and time objects.
Also, our analysis revealed that
there are different patterns
and it's not always easy to know
what you should be testing for.
But if there is something that you can do
as a developer today, that's probably testing.
because there are no bug-finding tools that exist out there.
So the best option that you have is to just thoroughly test your code.
In our study, we found that around 60% of the bugs can be detected by just executing the code with any input.
Of course, like the projects that we studied had very poor code coverage,
and hence they could not find these bugs.
The solution for finding, you know, like low-hanging fruit,
is to just implement a very comprehensive test suite that covers all critical date and time computations in your code.
You have to ensure that your testing framework can control as many randoms in your code as possible.
So make random number generation deterministic.
If your code reads the system clock, then mock out those APIs; try to control the time zone in which your tests run.
There are some useful libraries out there that can help you do this, like freezegun or libfaketime
for Python and just using a testing framework for deterministically executing code with like
controlled environments and deterministic inputs can go a long way in ensuring correctness.
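As an illustration, a minimal sketch assuming the third-party freezegun package (the tomorrow helper is hypothetical): pinning the clock lets a test exercise a specific calendar edge case deterministically.

```python
from datetime import date, timedelta
from freezegun import freeze_time  # pip install freezegun

def tomorrow() -> date:
    return date.today() + timedelta(days=1)

# Pin "today" to the last day of a leap year so the test is reproducible
# no matter when, or in which time zone, the suite actually runs.
@freeze_time("2024-12-31")
def test_tomorrow_crosses_year_boundary():
    assert tomorrow() == date(2025, 1, 1)
```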
Of course, this is like just one class of bugs, right, where you get an error as soon as you
execute the code.
But we can go further.
In our study, we found out that around 32% of the bugs can be caught easily when the code
is executed with the right inputs.
And again, by right I mean
bug-triggering inputs,
and this is important
because these kind of bugs
are very evident
like you provide the right input
right edge case
and the program is going to crash
or it's going to throw an error message
and that's easily catchable
in a testing environment
so what can you do about that
maybe you know you have code
that deals with
date and time strings
and tries to store them in a database
what you can do
is you can try to test for reversibility.
Maybe generate a lot of date and time strings,
store them in the database and try to read them back.
And assert: is the value that you stored and read the same?
If not, maybe there is some mistake in your code
where it's trying to parse or store incorrectly.
And there could be many other properties about your code
that you could encode as assertions in your testing framework
and then perform property-based testing.
And as you may know,
there are popular property-based testing tools
like hypothesis that can help you
detect bugs.
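A minimal sketch of that round-trip property with Hypothesis (my own illustration, not code from the study; the store and load helpers are hypothetical): serialize a datetime, parse it back, and assert nothing was lost. Hypothesis quickly finds inputs where this naive format silently drops information, for example sub-second precision, which is exactly the kind of high-obscurity bug discussed earlier.

```python
from datetime import datetime
from hypothesis import given
from hypothesis import strategies as st

ISO_LIKE = "%Y-%m-%dT%H:%M:%S%z"

def store(dt: datetime) -> str:
    return dt.strftime(ISO_LIKE)

def load(s: str) -> datetime:
    return datetime.strptime(s, ISO_LIKE)

@given(st.datetimes(min_value=datetime(1970, 1, 1), timezones=st.timezones()))
def test_store_load_round_trip(dt):
    # Fails for datetimes with microseconds, because the format drops them.
    assert load(store(dt)) == dt
```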
Also, sometimes you have simple
bugs, right? Like, maybe you
are providing inputs to your
program, but your APIs are not designed
to handle this. For example,
there was this bug that I ran into
where the software was trying to
account for different buildings in a city.
And the buildings could be older
than the 1900s.
Maybe they were in the 1800s.
but the Python API strftime
does not support any dates that go before the 1900s.
And the program just fails.
The developers didn't know this when they published the code.
So what do you do about these bugs?
And the answer is simple, right?
Like if you develop a testing framework that can make your system deterministic
and that does property-based testing,
then you already have most of the pieces of the puzzle.
All you need to do is just enable fuzz testing
and generate random inputs.
maybe these inputs could be extreme cases,
and you don't even need an oracle to find these bugs, right,
because your program is just going to crash
and that in itself is enough for you to see that
hey there's a bug in my code because my code should never crash
and I don't know if
I know you know this but I don't know if the listeners know this
but in all of software engineering research,
fuzzing has proven to be the most effective
bug finding strategy in the real world, period.
So you should really take
the advantages of fuzzing
seriously and invest in it,
because it has a very high return on investment.
So I would like to conclude by saying
if you're a developer today,
you have to continue learning about different date
and time concepts.
Read our paper, try to understand, get a feel
for what this domain is like,
understand more about the library functions
that you're using, their designs, their limitations.
And most importantly, to ensure that your code runs
the way it's supposed to run, perform thorough testing because that's the best you can do today.
Until, you know, hopefully somebody from academia like us or you come up with better tools
for finding date and time bugs.
Awesome.
I think these are all great, great, great lessons that any developer working in this space
should be mindful and should pay attention to.
And hopefully, you know, next time we'll have this conversation with your next project,
the number of bugs that you observe,
at least as bug reports on GitHub, shrinks considerably and, yeah, makes it more manageable
for future PhD students working in this space. Because as you mentioned, going from 26,000 reports to
just a couple of hundred bug reports that you analyzed has been quite a task. So I'm still,
you know, in my mind, that's one of the things that I'd like.
I love the most about this paper,
the methodology of distilling and filtering
and getting to the core of this problem.
So this is, this is great.
And it's been a fun paper to read.
And which brings me to,
and I know you just published
and it's at least in my,
you know, when people are talking,
when I talk with people about my work
and then they ask, okay, you did all this great work and what's next.
You know, I feel a bit of a brain freeze like, oh, okay, now I have to come up with the next project, right?
That's the life of a researcher.
So I'm going to turn the tables and ask you this question.
And after all this great research, this award-winning research and with awesome insights, what's next?
have you thought about expanding the work or the technique or maybe the methodology of filtering
out and weeding out reports that are uninformative?
Yeah.
Firstly, thank you so much.
Yeah, I think your first best paper award or distinguished paper award is always special and
I'm quite happy about it.
About broader implications.
So in our study, we focused only on Python.
and what we observed was that Python captures a lot of different bug patterns
that also exist in other languages.
Sure, the distribution of the bug patterns may vary,
but we believe that most of these patterns that we describe generalized to other languages.
It would be interesting if people perform like in-depth studies
for other popular programming languages like Java or JavaScript
because there are some crazy ways in which those libraries behave too.
In fact, funny story, I recently came across this website
whose whole purpose was dedicated to quizzing people
on what they would think
would be the output of a specific JavaScript code
that involved the JavaScript Date library,
and I'm ashamed to say that,
despite all this research, I was only able to get
two out of the 15 questions that I had tried
correct. And this is really hard stuff,
what can I say. But other than that,
I really hope that this study sparks
an interest in the bugfinding community
to develop new tools
that help developers catch bugs early on.
Like I mentioned, there's a lot of scope
for developing static and dynamic analysis tools,
static tools with AI, dynamic analysis tools
in the form of testing.
I mean, there's even scope for, you know,
adding date and times as first-class citizens
in SMT theories to enable formal verification.
So that's some of the work that we are doing at the PASTA Lab, right?
one of the projects that I already mentioned
was trying to find bugs in libraries themselves
because if a library has a bug,
then potentially all software that depends on that library
also has a bug in it.
And I think research of this kind
has been pretty successful in other domains,
like compiler testing or database testing,
and we are trying to do the same for date and time libraries.
That's an ongoing project.
A new project that we are looking
at is: can we support
date and time types
in SMT theories to enable
formal verification of software that deals with
date and time in safety critical systems.
So there are, there's a lot of work
left to be done.
I love finding bugs.
I love making software more reliable.
And I am very sure that
just one person or one lab
can't achieve everything that's out there.
So I really hope that
this kind of study sparks an interest
in the community to look at this
understudied yet very important
domain of bugs and I
hope it serves as a stepping stone for more
future research work. Yeah.
I truly
hope that that's going to be the case
because this is again such
an interesting topic
and I may be
mistaken and you please correct me if
I'm wrong but I don't
remember this type of
bugs being studied in the past
and which brings me to my next question,
how did you come up with this project
or not necessarily the work itself,
but I'm interested in how do you think about starting a new project,
how you go a new research project,
how do you go about it, what's your creative process,
how do you generate ideas?
Yeah, yeah, that's actually,
really good question. And I think I almost have like a philosophical answer to this question
because I strongly believe that people do their best work when they have a sense of purpose or
calling. And for me, I am very clear about what I want. I have always wanted to have impact
through my work. When I started my PhD, I knew that I did not want my research to end up in
an academic paper, but rather in the real world. And in some sense,
this sense of purpose is what has driven my research career.
Let me explain how.
Like, a lot of people start by reading research papers and finding gaps in existing literature.
Somehow that never really clicked for me.
Of course, it's important to build a strong foundation of existing methods by reading papers,
but that's where it stops for me.
I usually start by looking at problems rather than limitations.
How do I find problems?
I talk to developers.
I look at bug reports from companies.
I read post-mortems of outages,
and I see what kind of bugs exist out there that cost millions of dollars.
To me, any bug that has the potential to bring down production
and cost millions of dollars automatically becomes a bug that I'm interested in.
I guess that's how I'm wired.
But once I have my eyes set on the problem,
I then go read research papers to see what exists.
and what are the gaps.
But as you may know, as a fellow bug-finding researcher,
that's not where it stops,
especially because once you have a bug,
it's important to familiarize yourself with the bug.
So that involves reading a lot of bug reports,
looking at a lot of code fixes,
and trying to understand the different ways in which the bug can manifest
and what it would take to catch this bug.
And the rest is just trusting the process, right?
You do the work, you lay the dots,
and hopefully they'll connect as you go along.
that's predominantly what my research or creative process looks like. Oh, I think there's another aspect of
it, which is just talking to your peers and discussing ideas. I've seen wonders happen when
people from different backgrounds share ideas and discuss research, because it could be as small as
finding a simpler way of implementing a piece of code or discovering a whole new approach to solve
the problem. But in either case, always talking to your peers is something that has helped me a lot.
So in a nutshell, yeah, that's how I operate
and that's how I'm wired to look at research problems.
Awesome.
So you tackle a real-world problem or bugs that affect real system
that have severe impact.
You talk with your peers and then you, once you found the bug
and once you generated some initial ideas,
you drill down and find the gaps in the existing literature that could, you know, help you fix
those bugs that haven't been fixed before.
You, I think you summed it up perfectly.
That's the recipe for award-winning research as this research is.
So for listeners, you know, please take notes from Shrey.
which brings me to, you know, the last kind of last question on the menu today.
And I think it's about time, if I may make a pun,
to tell listeners a key takeaway that you want them to remember from your work,
from this paper in particular, from your transition from industry
to research, or your creative process,
and the way you approach finding new bugs to tackle.
Yes, sounds good.
So maybe let me first talk about my research in general,
which is about software reliability, right?
To state the obvious software is ubiquitous
and is growing at an exponential rate,
but also software is very buggy,
especially with, like, the advent of AI code generators becoming the norm,
it is increasingly important to ensure software correctness.
Like, while we still can't prove correctness of industry-scale software,
the next best thing that we have to ensure our software is free of bugs,
is to make use of bug-finding tools.
And in my experience, the use of bug-finding tools in industry
is not as widespread as I had hoped for.
For example, recently I attended this developer conference called Bugbash that was hosted by this company antithesis.
And there was a question that was asked of the audience: how many of you use fuzzing to test your code at your company?
And only 10% of the people raised their hands. That feels like a big gap.
Like, apart from big tech, or maybe even parts of big tech, engineers continue to ensure software correctness through rudimentary techniques like stress testing or unit testing with
hard-coded values or manual testing.
This is not good.
I would strongly urge developers
to invest more time
in learning and adopting
well-established bug-finding techniques
or at least improving the existing
test suites that they have.
It may seem like a waste of time at first.
In fact, a developer once called it
a distraction from a distraction
because the goal of developers is to write features.
Writing tests is a distraction
and investing in harnesses for fuzzing
is a distraction from a distraction.
But trust me, it's worth it.
And talking specifically about date and time bugs,
if you're writing software that deals with date and time,
I would say just assume it's wrong.
That's the best mindset to be in when you're writing
any code that's related to date and time.
There are just so many nuances and edge cases
that there's always likely some assumption that you made
that may not hold in production.
And since there are no date and time related bug finding tools,
the best you can do is just thoroughly test your code.
Fuzz test with random inputs.
Try using property-based testing.
Execute your code in sandboxes
where you can control time zones and date times.
And as our paper shows,
many of the severe bugs can be caught
with simple testing techniques.
You don't have to invest a lot.
So the return on investment is quite significant.
So yeah, that's like the biggest takeaway that I have.
Please be mindful while writing date and time code,
thoroughly test it and continue to learn about the different nuances of date and time systems.
So yeah, that's it from my end.
That's, I think that's a great tagline to add to our entire bug finding field.
If you're a developer and deploy code in production, please, please use bugfinding tools
before doing that.
It's worth it.
Shrey, thank you so much for being here with us.
And I cannot wait to hear about and read about your next project.
I know you have a few things cooking, and I'll be following software engineering conferences to spot your name there.
Thank you so much for joining us today.
Thank you so much for having me. I had a great time.