CoRecursive: Coding Stories - Story: briffa_sep98_e.pro - The File That Sparked a Storm
Episode Date: April 2, 2025 Can a single line of code change the way we see science, policy, and trust?  In this episode we explore the "Climategate" scandal that erupted from leaked emails and code snippets, fueling doubts... about climate science. What starts as an investigation into accusations of fraud leads to an unexpected journey through the messy reality of data science, legacy code struggles, and the complex pressures scientists face every day.  Along the way, we uncover stories of hidden errors and misunderstood phrases taken out of context, revealing a world where science, software engineering, and human complexity intertwine. This story doesn't just challenge assumptions—it shows the power and importance of transparency in science and technology.  Join Adam as he digs deep into Climategate, uncovering what really happened when code got thrust into the spotlight, and what it means for trust, truth, and open science.
Transcript
In Athens, 2013, Maria sits in a cafe. She's 32 and she's been jobless for three years.
Like many Greeks, she feels stuck. Each month tougher than the one before.
Newspapers drone on about austerity. There's cuts, there's layoffs, there's pensions slashed.
Life is closing in on her. That morning, though, she spots an unusual headline. It's a strange story from across the ocean about two Harvard economists,
Carmen Reinhart and Kenneth Rogoff.
Their 2010 paper claimed that when a country's debt tops 90 percent of its GDP, economic growth takes a hit.
This idea became a key reason for austerity policies worldwide, including those imposed on Greece.
But the news article Maria reads is an update.
A graduate student with his professors found a critical error in Reinhart and Rogoff's analysis,
a simple spreadsheet mistake.
A miscalculated formula left out significant data, producing factual inaccuracies. Instead of economies shrinking
when debt topped 90% of GDP as the original paper had claimed, the corrected figures show average
growth rates of around 2%. This wasn't just an academic blunder, it had real world fallout.
Governments, misled by the flawed study where the authors had simply not selected all
the rows, rolled out austerity measures. And because of that, there were prolonged recessions.
There was soaring unemployment. There was social unrest. In Greece, it was a national
crisis, unemployment over 27%. Public services falling apart, the lives of people like Maria
in chaos. It's unsettling, this idea that a simple spreadsheet error,
a coding mistake could steer global economic policy
could change the lives for millions of people.
Maria's story is in fact fictional, just a composite,
but the people affected by this error were real.
And it makes you wonder what other unseen mistakes
or unintentional deviations in code are quietly shaping our world?
And what happens when those lines of code are thrust into the spotlight?
Welcome to Co-Recursive. I'm Adam Gordon Bell.
Today we're exploring the invisible code that quietly shapes the world around us.
Code most of us never think about at all.
Here's an example.
My mom doesn't own a computer unless you count her flip phone.
She's nowhere near the Silicon Valley bro stereotype.
But she did write code once in university.
She studied psychometrics, measuring intelligence and cognitive skills.
And back then writing her research up meant running statistical calculations,
which meant writing programs on punch cards and submitting them to batch
processors to calculate correlations.
A lot of the world's most important code is like that or like that GDP spreadsheet.
It's just some simple calculations tucked away somewhere in academia sitting on a co-author's
machine that's only pulled up when a diagram needs to be regenerated or a
constant needs to be tweaked or when somebody requests it. It's invisible code but it's
powerful. It affects policies. Often this hidden code stays unnoticed until
something goes wrong or until a single line
out of context gets thrust into the spotlight.
That's what today's story is about.
A story about more than just scientific data.
A story about the human side of data analysis and the pressures on those who do it.
I'm talking about Climategate.
Does anyone remember Climategate? It was all over
the news back in 2009-2010, about 15 years ago, and I vaguely recall it being a really big deal.
I remembered something about leaked emails and climate change. It was one of those scandals that
just happened in my past, and if I thought hard about it, I recall hearing about scientists getting
caught red-handed fudging data to make global warming look worse than it was. They were supposed to be
truth seekers, but they were twisting the numbers to fit their agenda. I think there was a hack or
a leak. At least that's what I remember. But then I looked into it and I realized it all boiled down to a single file, a single piece of code.
That file: briffa_sep98_e.pro.
Rolls off the tongue, doesn't it?
Climategate was like that spreadsheet error, but on a massive scale because it shifted
how people saw scientists.
And it sparked distrust in science for some, maybe for many.
And that trend continues today, but here's the thing,
we can find the truth ourselves.
Today, I'm gonna go download
the actual leaked Climategate files,
open up the controversial code,
and dig through it step by step.
All to answer one big question,
was Climategate evidence of scientific fraud?
Or was it something else entirely?
To answer this, we're gonna take some detours.
We'll explore strange files with cryptic names,
decipher obscure programming languages like IDL,
venture into unrelated scandals,
like the Alzheimer's research scandal,
and at times it might feel
like I'm getting lost in the details, but trust me,
we're always chasing the same goal,
uncovering exactly what happened in Climategate.
Because I think it matters.
Because I think we live in a world where science itself is increasingly under attack, where
misinformation spreads faster than actual explanations, and where trust in experts is
super, super fragile.
So no matter what we uncover,
the act of careful investigation itself
is an essential skill.
It will help us figure out what and who to trust
in a moment when it feels like the stakes
on the truth have never been higher.
It all started on November 17, 2009. Something was wrong at the Climate Research Unit at the University of East Anglia in Norwich,
England, a city of about 150,000 people.
A backup server holding years of emails and research data had been breached.
The university called it a sophisticated and carefully orchestrated attack.
160 megabytes of data were copied, emails, documents, code, everything.
And then there were some whispers online, and a curious upload to real climate, and
then anonymous posts hinting at secrets, suggesting that climate science was too important to
be kept under wraps.
By November 19th, whispers became a roar.
An archive file with everything was copied across the internet and spreading fast.
Suddenly thousands of private emails and documents were out there.
Climate change denial blogs jumped on it, claiming the truth was finally being revealed.
In just a few days, the media picked up the story and headlines ran about leaked emails
about a brewing scandal.
And this was all just weeks before
the Copenhagen Climate Summit.
The University of East Anglia confirmed the breach
and the police got involved
and the world watched as Climategate erupted.
No one knew the full impact yet,
but it was clear something big had just happened
and the world of climate science was about to be shaken. When these files first leaked, James Delingpole
at the Telegraph reported that global warming was based on one massive lie. Now I just want
to say, I believe in global warming and I believe that it's caused by humans. My intent
here is not to give a platform to the science deniers, but I do want to
explore how we can move beyond just trusting the experts, how we can look at
things ourselves, how we can investigate what is the truth using our own minds
and you know, some effort.
That's why I found these leaked files and I downloaded them. It's a zip
file, foia.zip, and it's packed with documents. It's split into two folders,
documents and mail. The mail folder is like 11,060 text files with names like
125423285.txt. If you open one, it's just a plain text email
between two researchers,
usually talking about a paper they're working on.
The key file of our story,
briffa_sep98_e.pro,
is in the documents folder
in a directory called Harris Tree.
And this file is considered the smoking gun
that triggered
the controversy and led to an entire university lab being investigated by the
UK House of Commons. It led to eight official inquiries, articles in the New York
Times, articles in the Washington Post that claimed the climate scientists were
lying, that claimed that they were hiding things, that the world was actually
getting cooler. And it's just one file.
It's just a small file.
It's 150 lines of what turns out to be IDL, a programming language that's kind of like
MATLAB or like NumPy, but Fortran based.
I guess IDL is mainly used in science for number crunching and graphing.
It's imperative code.
It's like set this variable, then load this one, loop over these.
And it's pretty heavily commented.
Although in IDL, the comments start with a semicolon,
which I find a bit confusing, but I got used to it.
Anyways, in this file, right at the top, in all caps with asterisks before and after
to set it off as a heading.
It says applies a very artificial correction for decline.
Artificial correction.
It's, it's right there in the code.
Just two lines down from the top of the file and then a list of values and the
values are labeled fudge factor, fudge factor, artificial correction.
This wasn't sophisticated climate modeling jargon, right?
This sounds like they were just making stuff up.
But to get to why this artificial correction
stirred things up, you kind of need to know
what was going on at the time.
What was happening in the 90s, the late 90s,
and the early 2000s, and about the hockey stick graph.
In the late 90s, climate scientist Michael Mann,
along with some others, Raymond Bradley, Malcolm Hughes,
introduced the hockey stick graph.
It showed global temperatures holding steady
for a thousand years, and then shooting up sharply
in the late 19th and early 20th centuries. Picture a hockey stick lying
flat on the ground and then suddenly at the end the blade curving upward. That's the shape.
That was temperature, worldwide temperature, for the globe and for the Northern Hemisphere.
The graph wasn't just scientific trivia, it exploded into the public view.
It became this shorthand for the urgency of climate change.
Al Gore held it up, it was a big moment in an inconvenient truth.
And suddenly this image, this shape was everywhere, a symbol of the crisis that was going on.
But that power made it a target.
Almost immediately, it faced fierce scrutiny.
Skeptics didn't just question it. They attacked, claiming the data was manipulated to exaggerate
warming. To them, this wasn't science revealed. It was a political weapon that had been forged to
institute drastic policies. So when Climategate erupted, what was really happening is these
phrases like artificial correction and fudge factor popped up in the leaked code and skeptics thought they'd hit the jackpot.
Right? They had proof of fraud. They had proof that they didn't have to worry.
Here's the critical question. Was the hockey stick graph genuinely compromised? Was somebody misrepresenting things? Or was this controversy more about
misunderstanding? Were people misunderstanding the scientific process?
Thankfully, right, we have the code. Now I just need to figure out how IDL works
and I need to find the data and understand what's going on here. The
fudge factor is actually pretty straightforward. It's a series of numbers
from 1400 to 1992. It starts at zero. So we have a zero value from 1400 to 1904.
Then it dips negative into the 30s. And then it shoots up in the 50s all the way through the 70s.
And then finally leveling off. I couldn't actually figure out how to run IDL,
so I did what any developer would do.
I just converted it to Python.
If you graph those values, you see a long flat line, the shaft of the hockey stick, and then from 1950 onwards a blade that tilts sharply upward.
The code does more than just
graph that fudge factor though. It reads in climate data and it applies a low
pass filter to it, basically smoothing it out, and then it applies that fudge
factor over top. So I did the same thing. I made up random climate data from 1400
to now, and then I applied the very artificial correction. Then I graphed it both with the correction and without. Without the correction, it's a fairly flat line; with it, the line turns into a hockey stick. The fudge factor completely overshadows the real data. I can see why the skeptics were concerned.
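Here's a minimal Python sketch of that process, assuming numpy and matplotlib. The adjustment values below are illustrative stand-ins shaped like the ones I described, zero until the early 1900s, a dip mid-century, then a sharp rise. They are not the exact numbers from briffa_sep98_e.pro.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
years = np.arange(1400, 1995)

# Made-up "proxy" series: a flat climate signal plus noise.
proxy = rng.normal(0.0, 0.3, size=years.size)

# Crude low-pass filter: a 20-year moving average to smooth out the noise.
smoothed = np.convolve(proxy, np.ones(20) / 20, mode="same")

# Illustrative "very artificial correction": a few anchor points, interpolated yearly.
anchor_years = [1400, 1904, 1930, 1950, 1970, 1994]
anchor_vals = [0.0, 0.0, -0.2, 0.3, 1.5, 2.0]
correction = np.interp(years, anchor_years, anchor_vals)

plt.plot(years, smoothed, label="smoothed proxy only")
plt.plot(years, smoothed + correction, label="with artificial correction")
plt.legend()
plt.show()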
When this surfaced, Eric S. Raymond, a well-known open source advocate, the guy who wrote The Cathedral and the Bazaar, and also a well-known social conservative, saw it too. He did the same process and found some of the same issues.
This is blatant data cooking, plain and simple.
It flattens the warm temperatures of the 1930s and 40s.
See those negative coefficients?
Then it adds a positive multiplier
to create a dramatic hockey stick.
This isn't just a smoking gun.
This is a siege cannon with the barrel still hot.
Eric Raymond's a vivid blogger, right?
Siege cannon, barrel still hot.
It's powerful imagery.
And it was coming from an expert in software.
So it was hard to dismiss.
He wasn't just some random internet crank.
Eric at the time, he had a big book out.
He was a respected figure in the tech world,
and he definitely understood code.
His vivid take on the situation
helped shape how people first saw the code.
He posited that this was an error cascade.
The people at CRU had manipulated climate data with this hockey stick fudge factor, and that led many people to believe a false narrative about the climate.
And soon the world was buying into this big lie until this leak happened and the deception
came to light.
Some claimed that climate change was fake and this fudge factor in this code was proof. Climate change of course wasn't fake, but that
didn't mean that scientists weren't nudging the numbers. Both could be true. So what was really
happening? With any good investigation, you can't stop at the first piece of evidence that fits the
narrative. You have to keep digging, especially when the accusations are this big.
And the deeper I dug, the more I kept seeing
another infamous phrase, one that seems like a direct confession: hide the decline.
And in reference to the original hockey stick graph
published in Nature, there was an email in this leak
that talked about Mike's nature trick.
To bloggers and to the mainstream media,
this felt like a confession.
Some thought this hack was an inside job.
Maybe someone at CRU was fed up with all the lies,
and so they leaked this data out.
But before jumping to conclusions,
we need to understand what this code is really doing.
You see, climate science is actually pretty complicated.
You can't just read the file.
You need to understand the context.
So heads up, we're about to do a deep dive, but stick with it.
I think it's worth it.
All right.
Imagine this: you're on call, it's 2 a.m., and the pager goes off.
The main transaction system is throwing errors.
Latency is spiking.
You dive in, but something's off.
The detailed performance logs, the granular stuff you need,
they only go back six hours.
Before that, you just have daily averages.
Nothing useful for debugging this spike.
You can see the system is acting weird now. But the crucial question is: is this spike completely unprecedented, or is this just Tuesday? Maybe that's when the batch jobs run, and they always throw some alerts like this, and you should ignore them. Without
that historical context, without those older logs, you're flying blind, trying to figure
out the root cause. Climate science is exactly like this. But the system is planet Earth, and the stakes are considerably higher. We have
solid detailed data on the Earth's climate, thermometer readings from
weather stations, from ships, from lots of places going roughly back 150 years.
This is the instrumental record, and it tells a clear story: the planet's
average surface temperature has risen by about one and a half degrees Celsius
over the last century. Just like that production system with only six hours of
logs, 150 years is a blink of an eye in climate terms. So is that degree and a half
of warming normal for the earth or is it abnormal? Is it outside of the natural
variability? We've got this huge blind spot before the late 1800s. We just don't know.
So to answer this question, scientists need to become data detectives. They need to find ways
to reconstruct climate history from before the time of widespread measurement. But this isn't like restoring logs from archives.
Nature doesn't keep clean, standardized, you know, JSON files.
The data log scientists have to work with were things like the width of tree rings,
or the chemistry of ancient ice layers drilled up from Greenland, or the
skeletons of corals, or even the temperature
profiles found deep underground in boreholes. These are called climate proxies, and they're
imperfect, they're noisy, they measure climate indirectly. They're sparsely located around the
globe, and they sometimes record things other than temperature. And also they have gaps,
and they come in completely different formats.
Piecing together the Earth's climate history
from fragmented and messy data is a huge challenge.
Climate science is actually a lot like data archeology.
You're using complex statistical modeling
and a painstaking process to try to figure out
if the picture you're assembling
is an accurate representation using
all this proxy data.
So let's look at some of the main types of data used.
It's really the only way to get an understanding of what's happening in that file.
The most famous temperature proxies are the tree rings.
This is central to the story because this is actually what CRU focuses on.
Trees grow a new layer each
year and how thick or dense that layer is often depends on the conditions
during the growing season. Maybe how warm the summer was or how much rain fell. So
you find some really old trees and you drill out a core and you count the rings
back in time measuring their properties. It sounds simple but it's actually messy.
Trees only grow in the mid latitudes of the globe,
and you won't find any trees in the ocean or in Antarctica.
And even where they do grow,
it's not just climate affecting them.
Younger trees grow faster, trees get diseases,
maybe a nearby tree falls,
giving the tree you're measuring more sunlight.
It's like a performance metric that's being affected by random GC pauses or network hiccups that you
weren't tracking or the amount of work available to do and a million other
factors. But there's actually so many trees so you get lots of data and
hopefully with that much data the individual noise can cancel out and you
can find the signal: the aggregate growth rate, year upon year, for the area that
the trees are in, going back as far as those trees do.
And actually even further, we'll get to that.
Next up are ice cores.
You drill deep into an ice sheet on Greenland or Antarctica or in a high mountain glacier.
And you can get a lot of data out, because as snow falls and compresses into ice
year after year, it traps tiny bubbles of atmosphere
from that specific year.
And scientists can measure the CO2 concentration
from hundreds or thousands of years ago.
Ice cores are how we know
that today's CO2 levels are unprecedented.
The ice itself, the frozen water molecules, also hold clues.
The ratio of heavy oxygen isotopes to light ones changes depending on the temperature when
the snow originally formed. So that's another proxy. But it's not perfect. The isotope
ratio can be thrown off by where the snow came from, not just the local temperature.
And the deeper you go,
the more the ice layers get compressed together.
So the yearly resolution gets fuzzier and fuzzier.
It's like a log file where older entries
are being aggressively compressed.
For oceans, and especially in the tropics,
scientists look at corals.
Corals build their skeletons out of calcium carbonate,
and they add layers year by year, sort of like tree rings.
So corals give us this precious data that we were missing from the vast ocean areas where trees don't grow.
And then there are other types of proxies.
There's layers of sediment washed into lakes each year that can tell you about the levels of snowmelt.
You can use that to infer temperature.
Fossils and deep ocean mud give
clues about temperature over millennia, though often really fuzzy in terms of what year it
is. And you can even measure temperature down in boreholes that are drilled deep into the
earth's crust. I don't totally get how that one works.
But the point is, you've got all these different types of proxies. Tree rings measure summer
temperatures in North America,
coral skeletons record sea surface temperatures in the tropical Pacific, ice cores log polar
temperatures, lake mud will tell you about the spring snow melt. So they're all recording
something about climate, but they're all indirect and they're all noisy. And they all have different
time resolutions, some annual,
some spanning decades, some spanning centuries. And the dating isn't always perfect. Someone
is piecing this data together by hand. Also, they all cover different parts of the globe
and different seasons. Some stop abruptly. Some end up with weird glitches in them. So
how do you take this mix of messy, scattered, imperfect data and turn it into a clear picture
of the climate over time?
How do you pull together data from systems
that are so different and that are barely documented
and that are only sometimes reliable, and get out of that
a reliable view of the system's past, of the Earth's past?
The first problem you have to overcome is uneven data
distribution. You may have hundreds of tree ring records from North America, but only a few crucial
ice core records from the Arctic, and also a few coral records from the tropics. So you pick a year,
you have hundreds of values from different proxies and locations, but most of it's tree rings. If you just toss all this raw data into a model, the tree rings would dominate, skewing
the results to reflect only the mid-latitude forests and ignoring all this vital polar
and ocean data.
That's not ideal for a global temperature view.
So before we put together a model, we need to pre-process the data.
We have to transform that chaotic mix
of raw proxy measurements into smaller and more structured and representative sets of
features. We do this with principal component analysis. It works like this. Imagine you're
monitoring again a massive microservice deployment. You've got hundreds, maybe thousands of metrics
streaming in, CPU load, memory usage, request latency,
error counts, database connections.
For every single service instance.
So at one moment, you capture a snapshot.
You got 500 CPU metrics from your web tier.
You have 10 latency metrics from your database cluster,
five error rate metrics from your authentication service.
So you have 515 numbers describing your system state
at one particular moment in time.
But looking at all 500 of these raw numbers
is overwhelming and not helpful.
And many of those 500 CPU metrics
are probably telling you the exact same thing.
If the cluster's under heavy load,
most of these CPUs will be high.
In other words, they're all highly correlated variables.
And you don't necessarily care about tiny variations between CPU 101 and CPU 102.
You care about the overall pattern of the load on that web tier.
So principal component analysis, PCA, is the algorithm that spots these patterns or themes in your sea of metrics.
It would scan all 500 CPU metrics and say,
the biggest variation is here.
The main signal is whether the whole group
is generally high or low.
And we'll call that PC1,
principal component one, for the web tier.
It might capture another pattern,
like front end servers are busy,
but backend servers are idle,
as principal component two, PC2.
PCA creates these new synthetic variables,
principal components,
which are each made up of mixes
of the underlying metrics.
The cool thing about principal component analysis
is it figures out patterns
without needing to know what's what.
It's an unsupervised
learning method that extracts correlated information from the data. Crucially, these principal
components are ordered by how much of the total information in the original data they explain,
and each principal component is uncorrelated with the last.
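To make that concrete, here's a small Python sketch of the PCA idea using scikit-learn on made-up monitoring metrics. The names and numbers are invented for illustration; this is not code from the leak.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_snapshots = 200

# One shared "load" signal drives 500 correlated CPU metrics,
# each with a little noise of its own on top.
load = rng.normal(size=(n_snapshots, 1))
cpu_metrics = load + 0.2 * rng.normal(size=(n_snapshots, 500))

pca = PCA(n_components=5)
components = pca.fit_transform(cpu_metrics)

# PC1 should capture most of the variance: the overall load pattern.
print(pca.explained_variance_ratio_)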
Back to the climate data, for a given year, you have these 500 tree
ring measurements and a few ice cores and coral values. Instead of tossing all 500 noisy correlated
tree ring values in the main model, you first extract the principal components. PCA finds the
main shared patterns of tree growth across that network. The first few principal components might capture 80 or 90%
of the meaningful variation.
The first component could literally represent
the overall good growing conditions of the season,
while the hundredth might just reflect something
like rainfall in one very small area of North America.
PCA allows you to zero in on the big consistent patterns in tree growth,
cutting through the noise of the individual trees. So PCA doesn't give us a final temperature map
from our tree rings. Instead, it gives us a neat simplified data set, gives us just a couple data
points to look at. And the cool thing is, it's all here in the data leak. While many climate models mix
various metrics together for the most accuracy, our briffa file is just based on tree ring data.
And if you look around, it's not too hard to find the PCA file. It's in Documents Osborne Tree 6
in a file that starts with PCA. It's another IDL file.
But getting that data ready for principal component analysis
is no small feat either,
because there's another file,
documents-osborn-tree-6-rrd-all-mdx1.pro,
that does a lot of the heavy lifting
to process this raw data.
It's nice though, that it's all here.
Now that I kind of am starting to understand IDL and how these climate models work, I can
look through the files and kind of see what they're doing.
So now that we've got our refined proxy features for each year, we can focus on calibration.
Calibration depends upon the overlap in time between when we have actual temperature readings
and when we have tree core measurements. In our data, this overlap period is from 1856 to 1990.
That's when our tree rings overlap with temperature data, although that's not quite true,
and you'll see why as we go. But yeah, that is the period where we both have
processed proxy features and reliable thermometer
temperatures.
That overlap is our ground truth for our climate model.
We're building a statistical model to link patterns
in our proxy features with those in the known
temperature records from this overlap period.
Think of it like training a machine learning model. I mean, in this case,
it's actually not a machine learning model.
It's more simple statistics.
The idea is the same.
You give it the processed proxy features as inputs
and the instrumental temperatures as known outputs.
The algorithm figures out the complex correlations
and the weights, the best way to basically map
from those inputs to the
output temperatures during that time period.
In our data leak, this process is done alongside the principal component analysis.
Ian Harris, known as Harry, throughout this leak, checks the principal components that
are extracted against rainfall records.
Rainfall being the strongest non-temperature signal
that we have records for. This lets him extract the temperature component, which
is the non-rainfall component, which is then used in the graph in our file in question, briffa_sep98_e. Now here's where it gets interesting. I feel like this is the part
that the skeptics missed. Harry calibrated his statistical model using the overlapping data, and the PCA helped him
pull out the signal.
So when you feed your trained model only the proxy data from before the thermometers existed,
from like 1000 AD to the start of our measurement era, the model, using the relationships it learned during calibration,
gives its best estimate of the temperature for those years. And just like that you have a curve
stretching back centuries showing the estimated ups and downs of the past temperature. You might
ask, as I did and I had to look into this, how can you have tree rings that go back to 1000 AD? Well, this tree ring
data set is the MXD data set and it actually uses live very old trees but also dead preserved trees
that can be exactly dated. And they can be exactly dated via their correlation to live trees.
It's more detective work, but basically high altitude, very old dead wood
can be found and can be precisely dated. But yeah, building and running the algorithm is just the
start. The next question is, does this work? Is this reconstruction solid or did we just create
a complicated statistical illusion? That's where the verification step comes in.
The verification step uses holdout validation.
Remember that overlap period where we have both proxy data and thermometer
readings, instead of using all of that to train the statistical model, you
deliberately hold back a chunk of the thermometer data, and then you can test
against that to see if your model's working.
If the reconstruction can successfully predict
the temperatures in the period that you held out,
it boosts your confidence
that the relationships it learned are real.
It's like using a separate validation data set
in machine learning.
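As a rough sketch of what calibration plus holdout verification looks like, here's a toy Python version with synthetic data and a simple linear regression. The real CRU code is IDL and far more involved; every number and name here is invented.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
years = np.arange(1856, 1991)          # the overlap period mentioned above
temp = 0.004 * (years - 1856) + rng.normal(0, 0.1, years.size)

# Pretend proxy features (say, leading principal components of tree ring data)
# that track temperature with some extra noise.
proxies = np.column_stack([
    temp + rng.normal(0, 0.15, years.size),
    rng.normal(0, 1, years.size),
])

# Calibrate on the early chunk, hold out the late chunk for verification.
train = years < 1950
model = LinearRegression().fit(proxies[train], temp[train])

held_out_pred = model.predict(proxies[~train])
rmse = np.sqrt(np.mean((held_out_pred - temp[~train]) ** 2))
print("held-out RMSE:", round(rmse, 3))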
Model validation is the key.
And we have a lot of files in this data leak,
cal_pr_bandtemp.pro, calibrate_bandtemp.pro,
and so on and so forth,
many files in this leak all aimed at validating the data.
And it's actually in this validation step
that we find the answer to hide the decline,
the controversial phrase that led to the reporting
that climate scientists were hiding the truth.
But before we dive into those emails
and what hide the decline is,
there's another layer to consider
because the past climate data
isn't just about pinning down a single global temperature.
It's a complicated web.
The Earth's climate isn't a simple thermostat that slowly goes up or down.
It's a chaotic system that's fluctuating
on multiple timescales that are all layered
on top of each other.
You have events like El Nino and La Nina
that pop up every few years,
and they warm or cool big parts of the Pacific
and shake up weather patterns around the world.
You have big volcanic eruptions
that send aerosols into the atmosphere,
and these particles reflect sunlight,
and they cause global cooling for a year or two.
And that's just two of the timescales at play.
There's many more, and the big challenge
for climate scientists is pulling apart
all these overlapping signals.
It's much more complicated
than just a global yearly average temperature.
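Just to picture what those layered timescales mean, here's a tiny Python sketch that stacks a slow trend, an El Nino-like cycle, occasional volcanic cooling, and noise. It's purely illustrative, not a real climate model.

import numpy as np
import matplotlib.pyplot as plt

years = np.arange(1850, 2000)
rng = np.random.default_rng(5)

trend = 0.005 * (years - 1850)                   # slow warming trend
enso = 0.2 * np.sin(2 * np.pi * years / 4.5)     # El Nino-ish cycle, every few years
volcano = np.zeros(years.size)
volcano[[33, 97, 141]] = -0.4                    # brief cooling after big eruptions
noise = rng.normal(0, 0.1, years.size)

plt.plot(years, trend + enso + volcano + noise)
plt.title("layered signals: trend + ENSO + volcanic cooling + noise")
plt.show()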
But okay, all right, we've circled back.
Hopefully you made it through all my background.
With all that proxy data, with all those proxies, and with all that data complexity in mind,
let's tackle these infamous phrases. Let's break them down.
First, let's break down Mike's nature trick.
This sparked huge controversy, right?
Was Mike Mann publishing something incorrect?
Was he hiding things?
Then we'll cover hide the decline, the so-called smoking gun that caused ABC and CBC and the New York Times and the Washington Post all to accuse climate scientists
of misleading the public.
But yeah, first, Mike's nature trick.
Mike Mann is the man behind the iconic climate change graph.
He's the one behind the original hockey stick graph, the one from Al Gore's Inconvenient
Truth.
And while Mike's nature trick sounds like something from a spy novel, it's not about secret manipulation.
It's about taking all this complex data and turning it into a simple graph.
Mike had these proxy reconstructions, right? The proxy data and what it implied.
And he also had real temperature data.
Thermometer readings, the straightforward stuff where no crazy stats are needed.
You just check the thermometer.
His trick was to put both types of data on one graph.
Mike used two separate lines,
one for real measured temperatures from 1860 to today,
and another for proxy temperatures,
reaching far back in time,
which he also added error bars to.
Mike's trick was putting both sets of data on one graph.
The proxy data is interesting on its own, but it's the real temperature data, which shoots up like the hockey stick's blade, that gives the graph its punch.
The thing is that blade was never in doubt.
It's just the yearly average temperature; any weather station could tell you that.
Now, the folks at the CRU made a somewhat intentionally misleading choice. Instead of using two separate lines, they combined them into one line: the instrumental record and the reconstruction. Now, climatologists would understand, when the line hits modern times and the error bars go to zero,
that it's showing real data and not a projection.
But not everybody would get that.
So that is a little bit misleading, but there's no lies involved.
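To see the difference the presentation makes, here's a toy Python sketch with made-up numbers, not the actual reconstruction or instrumental data. Two labeled lines are the transparent choice; splicing them into one is where the confusion comes in.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
years = np.arange(1400, 2000)

proxy_recon = rng.normal(0, 0.1, years.size)     # flat, noisy reconstruction
instrumental = np.full(years.size, np.nan)       # thermometers only after 1860
modern = years >= 1860
instrumental[modern] = 0.01 * (years[modern] - 1860)

# The transparent choice: two clearly labeled lines.
plt.plot(years, proxy_recon, label="proxy reconstruction")
plt.plot(years, instrumental, label="instrumental record")

# The confusing choice: splice them into a single line.
spliced = np.where(modern, instrumental, proxy_recon)
plt.plot(years, spliced, linestyle="--", label="spliced into one line")

plt.legend()
plt.show()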
But the real kicker, the real thing that upset people was emails that said hide the decline.
You know, you would have a cold winter or a snowstorm
and politicians would show up trying to cast
a suspicious light on global warming with snowballs.
Where's global warming now?
So when somebody said hide the decline,
they're like, yes, I get it.
They were hiding the fact that it's actually getting cold.
But as I said, it's easy to verify that the world
wasn't getting colder. The world was in fact warming. The year this data came from, 1999, was among the hottest years on record. So here's the deal. Hide the decline wasn't about covering up a drop
in global temperatures. It was about a decision to leave out unreliable post-1960s data.
You see, for centuries, tree ring data matched up well with temperature. Warmer conditions meant
denser wood formed late in the growing season. But around 1960, this relationship broke down.
This is known as the divergence problem,
but it does seem like a real issue.
We have this temperature data, and this tree ring data is being used as a proxy to project temperature backwards 1,000 years, and yet it doesn't even work in known periods, like from 1960 to the present. How solid is our past reconstruction if these proxies seem flawed? And here's the thing: I actually found an answer for that. Me, just somebody who downloaded this data leak, started poking through, and read a book or two to fill in some information. I figured it out.
It was pretty exciting for me and it involved reading lots of this IDL code.
But first, before I share what I find, I want to say that questioning this data, looking
carefully at this code, even if I assume that climate change is a given, it's still a good thing.
It's not anti-science to check their work.
Critical examination, that impulse that I feel to look closer, is a vital thing, even
when it's uncomfortable.
No field is immune to bad intentions.
Sometimes even foundational work warrants a second look, somebody needs to check it.
And a big reminder of this is a major ongoing investigation in a completely different field, Alzheimer's research.
So before I tell you what I found in the data, let me tell you about Alzheimer's research.
In it, the dominant theory for decades was this amyloid hypothesis.
It's the idea that this sticky beta plaque in the brain were what caused the disease.
In 2006, Sylvain Lesné and his team published a paper in Nature that seemed to back the amyloid hypothesis.
They identified this protein, A beta star 56, and suggested that it
caused memory issues in rats. And this paper became a cornerstone. It was
cited thousands of times and it ended up directing billions of dollars in
research funding and drug development towards targeting these amyloid plaques.
But over the years things didn't quite add up.
Top Alzheimer's labs tried to replicate his findings,
but often they couldn't do it consistently.
Now it's a big warning sign,
but yet some labs managed to replicate the results,
and then they led to more research.
And then there was drug development
based on those findings.
Then enters Matthew Schrag. He wasn't digging through
emails or private messages. He wasn't trying to read IDL files like me. He was focused on the
science. He was scrutinizing published papers in Alzheimer's research and he spotted some anomalies,
especially in the images included in the papers. It started with some offshoot papers, but the more he dug, the more it led back to Lesné's 2006 Nature paper. Basically, he was able to tell that the
images were photoshopped. Somebody had used a cloning tool and you could see mismatched
backgrounds or lines that appeared too clean. And this wasn't just online talk that he posted on his blog.
No, he was a serious investigator, and his work led to a major investigation that was published in Science magazine in 2022.
It wasn't just misunderstood jargon or internal debates.
In this case, it was actually the integrity of visual evidence in peer reviewed studies.
It had a huge follow up.
The follow up is actually still ongoing.
Lesné's university launched an investigation.
Nature issued a cautionary editors note to the original paper.
All these things feel pretty mild.
But what's now known is that these results don't hold up.
This was fraud. The process of retraction is messy and slow because no one wants to admit they've been chasing a lie. It's huge
damage done to the field, but there's also a chance for science to self-correct. Scientists
are human, right, and some will cheat. And Schrag's investigation shows the danger of a real error cascade.
That 2006 paper wasn't just a study, it was a foundation.
Thousands of studies built on it.
Billions of funding followed.
Patients took drugs based on faulty research.
Drugs that were costly, drugs that had side effects and that even led to deaths.
Drugs that ultimately failed to cure or help with Alzheimer's.
An entire field pouring resources down a path that led nowhere all because of some fraud in a key study.
I just mention this because this investigation reminds us that the skepticism is vital.
Questioning these findings, even influential ones, is crucial.
This impulse to dig deeper is sound. That's why I think I need to apply this skeptical spirit
to Climategate and to this briffa_sep98_e.pro file. But yeah, I think we can now understand what's happening in that file.
The startling comment that caused such a stir applies a very artificial correction for the
decline, followed by the fudge factor array.
We can now explain what those are.
At first glance, skeptics like Eric Raymond said that this was a siege cannon.
It seemed super damaging.
It looked like clear evidence of data manipulation to force that hockey stick shape.
But now we know the decline is not about global temperatures dropping.
It's about certain proxies like the tree ring data no longer being reliable indicators.
Here's how I know.
Here's what I found.
Remember those calibration files I mentioned,
like calibrate_bandtemp.pro? They're really crucial. When you run the whole process, PCA,
correlate, and then validate on this tree ring data, the predictions that come out are pretty
noisy. There's something in the data, especially from the overlap period, that's causing noise and making the predictions inaccurate.
So Harry, or the team, or whoever, dug into the data, and the issue became clear.
The post-1960 tree data.
For centuries, these rings matched up with the temperature readings.
Warmer summers meant denser rings.
But after 1960, that link broke. The thermometers
showed warming, but the rings suggested cooling. Something changed. Something changed with
how trees were growing on Earth. Maybe the extra CO2 from global warming. Maybe the trees
just don't grow the same forever. Maybe pollution. Maybe chemicals, we don't know. But the trees weren't matching
predictions. But they found a way to overcome this. They would skip the post-1960s data
for principal component analysis. By focusing on the data before 1960, they could better extract the signal. If they removed that post-1960 data, they could better estimate the temperatures going backwards.
So that gave them a better ability to project backwards,
but it led to a problem, right?
When they feed that data forward to the post 1960s,
the model predicted lower temperatures.
So if the global temperature was 14 degrees in 1972,
the model would say 12.
They found a way to build a model
that predicts past temperatures well,
but shows a decline just as the world heats up.
That is the divergence, right?
That's the failure of the specific proxy data post 1960.
That's the decline that they are hiding.
The reason it diverges is because of the way they built the model to ignore whatever changed
post-1960.
It's actually all in the leak.
If you look through the calibration attempts, you can find them performing these calibrations. They used the data from 1911 to 1960 to build the model and then verified it backwards using data
from 1856 to 1910.
And that worked better than if they used 1911 to present day.
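Here's a toy Python demonstration of that divergence, with synthetic numbers rather than the CRU data or method: a proxy that tracks temperature until 1960 and then stops responding. Calibrated on the early period, the reconstruction looks fine going backwards, but run it forward past 1960 and it shows a decline while the instrumental record keeps warming.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
years = np.arange(1856, 1995)
temp = 0.006 * (years - 1856) + rng.normal(0, 0.05, years.size)   # warming trend

# The proxy follows temperature until 1960, then flatlines (the divergence).
proxy = np.where(years < 1960, temp, temp[years < 1960].mean())
proxy = proxy + rng.normal(0, 0.05, years.size)

# Calibrate only on pre-1960 data, where the relationship still holds.
pre = years < 1960
model = LinearRegression().fit(proxy[pre].reshape(-1, 1), temp[pre])

recon = model.predict(proxy.reshape(-1, 1))
print("post-1960 instrumental mean:", round(temp[~pre].mean(), 2))
print("post-1960 reconstructed mean:", round(recon[~pre].mean(), 2))
# The reconstruction comes out lower: that is the "decline" in question.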
This wasn't a secret.
They in fact published a paper on the divergence problem
in Nature back in 1998.
It was a known issue.
But it's fascinating to me that you can dive into the code
and you can see how they derived this.
It doesn't clear up everything, right?
As I said, when a key proxy method goes wonky,
just as we have better tools to check it,
it does raise real questions
about how reliable the method is.
But the puzzle here is just about the
limits of this specific proxy. It's not about a lie. And then going further, if we look
at our file, our briffa_sep98_e, the file name is telling: the underscore e is actually
some old school version control. There's actually A through D as well.
And these are all found in a personal folder
named Harris Tree for Ian Harry Harris, the programmer.
And that fudge factor, those hard-coded numbers,
they look like a hockey stick graph
in the context of the divergence problem
that's pretty clear.
This is actually Harry manually mixing in
the instrumental data, the real
world temperature data. As I said, ideally you'd show these as two separate lines,
but Harry was just trying to manually hack in the instrumental data into his
graph. But here's the real kicker, this wasn't the code that was used for the
paper. In the linked files, there's a whole different directory where the actual published data is.
There's briffa_sep98_decline1 and briffa_sep98_decline2.
These files are quite similar, but they tackle the divergence problem differently.
They don't have a hard-coded fudge factor. They don't mention an artificial decline.
Instead, they read the actual instrumental data from files.
There's no fudge factor. There's just reading in the temperature and adding it to the graph. The actual methods used later just
use temperature data from a real public source.
So the core accusation that scientists were literally inventing numbers to fake warming doesn't hold up when you actually look at what the files were. It's also crucial to just zoom out
and remember what this data set is. This is the CRU high latitude tree ring
density data. This is the stuff with the divergence problem. It's just one single
thread in a vast tapestry of climate science. The overall conclusion that the Earth is warming
and that humans are the primary cause
doesn't rest on this file, or in fact, on this leak.
It comes from the convergence of many independent lines
of evidence gathered and analyzed
by all kinds of scientists worldwide.
In fact, the graph that Al Gore used was based on
ice core samples, not this data at all. So there's no error cascade here. The CRU
data matters, especially for reconstructing detailed temperature maps of
the northern hemisphere land temperatures over the last millennium.
But that's just a part of the story. The attackers who leaked these files and the
bloggers who spread the story weren't actually doing a thorough review of the CRU's work. Perhaps
that's not surprising. Likely they just ran keyword searches for terms like trick or hide or
artificial. And in this massive dump of emails and files, they found some juicy snippets, including in one file that was never used for a published paper, and they took them out of context and claimed that they had found a lie and that they had found
a conspiracy. Here's where the Alzheimer's story stands out as being quite different, right? Matthew
Schrag wasn't sifting through stolen emails for dirt. He was carefully examining published
scientific evidence one by one. He was questioning
its integrity through complicated visual analysis. This was skepticism aimed at the science itself
leading to potential corrections for the field. In fact, he did it because he wanted to get
the field back on track. Climategate was driven by a specific code file and out-of-context chatter.
It used experimental code to target scientists and to sow doubt rather than engaging with the full body of the published work.
And in fact, it was timed for this all to happen right before the Copenhagen climate conference.
So there's some pretty strong hints that there was a political agenda here.
Find a lie and then you can say that they're lying about everything.
But here's the cool part. Maybe the real story in the Climategate files isn't about conspiracy or fraud at all.
Maybe it's about something far more mundane, yet I think profoundly important.
The unglamorous, often frustrating reality of being a programmer trying to make sense of messy scientific data.
Because Ian Harry Harris, the CRU programmer whose name is on this folder, HarrisTree,
in the leak there's another file, a massive text document, 15,000 lines long, called HARRY_READ_ME.txt. And it's basically Harry's personal log
stretching over the years,
documenting his day-to-day struggles
to maintain and update and debug these climate datasets
and to work on the code that's used to process them.
And reading it is like, well,
if you've ever worked on a legacy code base
or if you've ever tried to integrate data from dozens of different inconsistent sources, I think you can feel a deep sense
of empathy for Harry.
Harry wasn't writing about grand conspiracies, he was writing about the grind of data wrangling
and the challenges of software archaeology.
He writes about an Australian data set being a complete mess, that so many stations have been introduced and he can't keep track of it.
He complains a lot about Tim and Tim's code, and I assume that Tim is somebody who came before him and didn't sufficiently document what he did.
Sometimes he just writes, oh fuck this, all in caps. As in, oh fuck this, it's Sunday night and I've worked all weekend and
just when I thought it was done, I'm hitting yet another problem and it's the hopeless
state of our databases. There's no data integrity. It's just a catalog of issues that keep growing
on and on. Reading Harry's log, you don't see this cunning manipulator working to hide
inconvenient truths. You see just an overworked programmer,
likely under-resourced, grappling with complex,
messy real-world data and imperfect legacy code.
And he's just, he's doing his best to make sense of it all.
He's dealing with inconsistent formats.
He's dealing with missing values, undocumented changes.
It's just the kind of stuff that data scientists
and that legacy software engineers everywhere deal with daily.
And he leaves all these exasperated comments and they don't sound like admissions of fraud,
just like the slightly cynical remarks of someone deep in the trenches, doing the difficult data work of climate science.
Maybe this is the real story of Climategate.
It's not a scientific scandal, but a human one. A story about the immense and
often invisible technical labor required to turn noisy observations into
scientific understanding. And the pressures faced by those tasked with
doing it often without recognition or even the resources they need.
And then after all that, they get attacked and their private work files become the hot
topic on ABC News.
So where does all this leave us?
After all the sound and the fury, the investigations, the accusations, I mean,
what did Climategate really reveal?
At first, the media jumped on this idea that this was a smoking gun.
Nobody wanted to deal with global warming.
I mean, nobody still wants to deal with it.
Al Gore called it an inconvenient truth.
So there was hope.
There was hope that it was all a mistake or a fraud.
And people ran with that.
Newspapers churned out stories of
deception for weeks after the leak and the investigations came much slower.
There were eight official inquiries, yes, eight, and all came to the same conclusion: no fraud, no scientific misconduct. Climate science's core
findings stood firm. The hockey stick graph could be
debated for some of its statistical details. You can debate the limits of some of these
proxies but it's backed by many other studies that use different methods and use different
data. The trick wasn't a deception, it was just a graphical choice. And hiding the decline
wasn't hiding a
global cooling trend. It was about dealing with a known issue. Climategate
wasn't proof that climate change was a hoax. It was more like a case study in
how internal scientific discussions and informal language and experimental
messy code can be twisted when leaked into a charged climate where people
are looking to create doubt.
If I were to take a lesson from the Climategate saga, it would be about the necessity
of transparency in science, especially things like climate science.
What if from the start all the raw data and code and statistical methods were out there?
What if they were publicly
accessible to begin with? I imagine them on GitHub, ready for anyone to run and critique.
And actually, as a result of all this, CRU now has the instrumental data available under
an open government license. And while Eric Raymond's initial take on the code file is what caused a big stir, he was
right about one thing.
He demanded that they open source the data.
I feel like that's a principle I can agree with him on.
Climate science, with its global stakes and complexities, should embrace open source,
should embrace open access as much as possible.
Science isn't always neat.
It's a human process full of debate and messy data and evolving methods.
But like software development, it gets stronger and more robust and more trustworthy when
the process is open, when the data is shared, when the code is available for review.
That's my takeaway from the whole affair. It's not about a conspiracy
revealed but a powerful argument for doing science in the open. We live in a world in
which science is more than ever under attack and underfunded and being questioned and being
politicized. I think the best defense against that is to be open.
That was the show. How many people made it this far? I don't know.
Honestly, I started by diving into this Climategate code and it got more interesting as I went along,
but I'm still pretty unsure about how interesting it is for others.
There's like a lot of interesting tangents I went on that I had to cut as well,
but I came away with one big idea. Climate science is kind of interesting. It's a little bit
like data science, except in climate science you're dealing with messier data and you often have
to gather it and label it yourself. But you get to work with a community that's striving
for shared knowledge. Climategate makes it sound like it's all about global warming
models and politics, but really it's more about diving deep into specific issues like
how the layers of sediment in this certain data set can affect the
feedback cycle in the Atlantic Ocean temperatures. Harry's exasperated cynical
grievances notwithstanding, it actually sounds pretty interesting. But yeah, let
me know what you think of this episode and until next time, thank you so much
for listening.