CoRecursive: Coding Stories - Story: briffa_sep98_e.pro - The File That Sparked a Storm

Episode Date: April 2, 2025

  Can a single line of code change the way we see science, policy, and trust? In this episode we explore the "Climategate" scandal that erupted from leaked emails and code snippets, fueling doubts about climate science. What starts as an investigation into accusations of fraud leads to an unexpected journey through the messy reality of data science, legacy code struggles, and the complex pressures scientists face every day. Along the way, we uncover stories of hidden errors and misunderstood phrases taken out of context, revealing a world where science, software engineering, and human complexity intertwine. This story doesn't just challenge assumptions; it shows the power and importance of transparency in science and technology. Join Adam as he digs deep into Climategate, uncovering what really happened when code got thrust into the spotlight, and what it means for trust, truth, and open science.

Transcript
Starting point is 00:00:00 In Athens, 2013, Maria sits in a cafe. She's 32 and she's been jobless for three years. Like many Greeks, she feels stuck. Each month tougher than the one before. Newspapers drone on about austerity. There's cuts, there's layoffs, there's pensions slashed. Life is closing in on her. That morning, though, she spots an unusual headline. It's a strange story from across the ocean about two Harvard economists, Carmen Reinhart and Kenneth Rogoff. Their 2010 paper claimed that when a country's debt tops 90 percent of its GDP, economic growth takes a hit. This idea became a key reason for austerity policies worldwide, including those imposed on Greece. But the news article Maria reads is an update.
Starting point is 00:00:51 A graduate student with his professors found a critical error in Reinhart and Rogoff's analysis, a simple spreadsheet mistake. A miscalculated formula left out significant data, leading to factual inaccuracies. Instead of economies shrinking when debt topped 90% of GDP as the original paper had claimed, the corrected figures show average growth rates of around 2%. This wasn't just an academic blunder, it had real world fallout. Governments misled by the flawed study, where the authors had simply not selected all the rows, rolled out austerity measures. And because of that, there were prolonged recessions, soaring unemployment, social unrest. In Greece, it was a national
Starting point is 00:01:35 crisis, unemployment over 27%. Public services falling apart, the lives of people like Maria in chaos. It's unsettling, this idea that a simple spreadsheet error, a coding mistake, could steer global economic policy, could change the lives of millions of people. Maria's story is in fact fictional, just a composite, but the people affected by this error were real. And it makes you wonder what other unseen mistakes or unintentional deviations in code are quietly shaping our world?
Starting point is 00:02:08 And what happens when those lines of code are thrust into the spotlight? Welcome to CoRecursive. I'm Adam Gordon Bell. Today we're exploring the invisible code that quietly shapes the world around us. Code most of us never think about at all. Here's an example. My mom doesn't own a computer unless you count her flip phone. She's nowhere near the Silicon Valley bro stereotype. But she did write code once in university.
Starting point is 00:02:43 She studied psychometrics, measuring intelligence and cognitive skills. And back then writing her research up meant running statistical calculations, which meant writing programs on punch cards and submitting them to batch processors to calculate correlations. A lot of the world's most important code is like that, or like that GDP spreadsheet. It's just some simple calculations tucked away somewhere in academia, sitting on a co-author's machine, that's only pulled up when a diagram needs to be regenerated or a constant needs a tweak or when somebody requests it. It's invisible code but it's
Starting point is 00:03:19 powerful. It affects policies. Often this hidden code stays unnoticed until something goes wrong or until a single line out of context gets thrust into the spotlight. That's what today's story is about. A story about more than just scientific data. A story about the human side of data analysis and the pressures on those who do it. I'm talking about Climategate. Does anyone remember Climategate? It was all over
Starting point is 00:03:46 the news back in 2009-2010, about 15 years ago, and I vaguely recall it being a really big deal. I remembered something about leaked emails and climate change. It was one of those scandals that just happened in my past, and if I thought hard about it, I recall hearing about scientists getting caught red-handed fudging data to make global warming look worse than it was. They were supposed to be truth seekers, but they were twisting the numbers to fit their agenda. I think there was a hack or a leak. At least that's what I remember. But then I looked into it and I realized it all boiled down to a single file, a single piece of code. That's briffa_sep98_e.pro. Rolls off the tongue, doesn't it?
Starting point is 00:04:32 Climategate was like that spreadsheet error, but on a massive scale, because it shifted how people saw scientists. And it sparked distrust in science for some, maybe for many. And that trend continues today. But here's the thing, we can find the truth ourselves. Today, I'm gonna go download the actual leaked Climategate files, open up the controversial code,
Starting point is 00:04:56 and dig through it step by step. All to answer one big question, was Climategate evidence of scientific fraud? Or was it something else entirely? To answer this, we're gonna take some detours. We'll explore strange files with cryptic names, decipher obscure programming languages like IDL, venture into unrelated scandals,
Starting point is 00:05:16 like the Alzheimer's research scandal, and at times it might feel like I'm getting lost in the details, but trust me, we're always chasing the same goal, uncovering exactly what happened in Climategate. Because I think it matters. Because I think we live in a world where science itself is increasingly under attack, where
Starting point is 00:05:35 misinformation spreads faster than actual explanations, and where trust in experts is super, super fragile. So no matter what we uncover, the act of careful investigation itself is an essential skill. It will help us figure out what and who to trust in a moment when it feels like the stakes on the truth have never been higher.
Starting point is 00:06:10 It all started on November 17, 2009. Something was wrong at the Climate Research Unit at the University of East Anglia in Norwich, England, a city of about 150,000 people. A backup server holding years of emails and research data had been breached. The university called it a sophisticated and carefully orchestrated attack. 160 megabytes of data were copied, emails, documents, code, everything. And then there were some whispers online, and a curious upload to RealClimate, and then anonymous posts hinting at secrets, suggesting that climate science was too important to be kept under wraps.
Starting point is 00:06:46 By November 19th, the whispers became a roar. An archive file with everything was copied across the internet and spreading fast. Suddenly thousands of private emails and documents were out there. Climate change denial blogs jumped on it, claiming the truth was finally being revealed. In just a few days, the media picked up the story and headlines ran about leaked emails, about a brewing scandal. And this was all just weeks before the Copenhagen Climate Summit.
Starting point is 00:07:12 The University of East Anglia confirmed the breach and the police got involved and the world watched as Climategate erupted. No one knew the full impact yet, but it was clear something big had just happened and the world of climate science was about to be shaken. When these files first leaked, James Dellingpole at the Telegraph reported that global warming was based on one massive lie. Now I just want to say, I believe in global warming and I believe that it's caused by humans. My intent
Starting point is 00:07:42 here is not to give a platform to the science deniers, but I do want to explore how we can move beyond just trusting the experts, how we can look at things ourselves, how we can investigate what is the truth using our own minds and, you know, some effort. That's why I found these leaked files and I downloaded them. It's a zip file, foia.zip, and it's packed with documents. It's split into two folders, documents and mail. The mail folder is like 11,060 text files with names like 125423285.txt. If you open one, it's just a plain text email
Starting point is 00:08:27 between two researchers, usually talking about a paper they're working on. The key file of our story, BRFAA-CEP98E, is in the documents folder in a directory called Harris Tree. And this file is considered the smoking gun that triggered
Starting point is 00:08:45 the controversy and led to an entire university lab being investigated by the UK House of Commons. It led to eight official inquiries, articles in the New York Times, articles in the Washington Post that claimed the climate scientists were lying, that claimed that they were hiding things, that the world was actually getting cooler. And it's just one file. It's just a small file. It's 150 lines of what turns out to be IDL, a programming language that's kind of like MATLAB or like NumPy, but Fortran based.
Starting point is 00:09:18 I guess IDL is mainly used in science for number crunching and graphing. It's imperative code. It's like set this variable, then load this one, loop over these. And it's pretty heavily commented. Although in IDL, the comments start with a semicolon, which I find a bit confusing, but I got used to it. Anyways, in this file, right at the top, in all caps with asterisks before and after to set it off as a heading.
Starting point is 00:09:46 It says applies a very artificial correction for decline. Artificial correction. It's, it's right there in the code. Just two lines down from the top of the file and then a list of values and the values are labeled fudge factor, fudge factor, artificial correction. This wasn't sophisticated climate modeling jargon, right? This sounds like they were just making stuff up. But to get to why this artificial correction
Starting point is 00:10:15 stirred things up, you kind of need to know what was going on at the time. What was happening in the 90s, the late 90s, and the early 2000s, and about the hockey stick graph. In the late 90s, climate scientist Michael Mann, along with some others, Raymond Bradley, Malcolm Hughes, introduced the hockey stick graph. It showed global temperatures holding steady
Starting point is 00:10:38 for a thousand years, and then shooting up sharply in the late 19th and early 20th centuries. Picture a hockey stick lying flat on the ground and then suddenly at the end the blade curving upward. That's the shape. That was temperature, worldwide temperature, for the globe and for the Northern Hemisphere. The graph wasn't just scientific trivia, it exploded into the public view. It became this shorthand for the urgency of climate change. Al Gore held it up, it was a big moment in An Inconvenient Truth. And suddenly this image, this shape was everywhere, a symbol of the crisis that was going on.
Starting point is 00:11:20 But that power made it a target. Almost immediately, it faced fierce scrutiny. Skeptics didn't just question it. They attacked, claiming the data was manipulated to exaggerate warming. To them, this wasn't science revealed. It was a political weapon that had been forged to institute drastic policies. So when Climategate erupted, what was really happening is these phrases like artificial correction and fudge factor popped up in the leaked code and skeptics thought they'd hit the jackpot. Right? They had proof of fraud. They had proof that they didn't have to worry. Here's the critical question. Was the hockey stick graph genuinely compromised? Was somebody misrepresenting things? Or was this controversy more about
Starting point is 00:12:05 misunderstanding? Were people misunderstanding the scientific process? Thankfully, right, we have the code. Now I just need to figure out how IDL works and I need to find the data and understand what's going on here. The fudge factor is actually pretty straightforward. It's a series of numbers from 1400 to 1992. It starts at zero. So we have a zero value from 1400 to 1904. Then it dips negative into the 30s. And then it shoots up in the 50s all the way through the 70s. And then finally leveling off. I couldn't actually figure out how to run IDL, so I did what any developer would do.
Starting point is 00:12:49 I just converted it to Python. If you graph those values, you see a long flat line, the shaft of the hockey stick, and then starting in 1950 a blade that tilts sharply upward. The code does more than just graph that fudge factor though. It reads in climate data and it applies a low pass filter to it, basically smoothing it out, and then it applies that fudge
Starting point is 00:13:15 factor over top. So I did the same thing. I made up random climate data from 1400 to now, and then I applied the very artificial correction. And then I can graph both with the correction and without. And without it's a very straight line, but with it turns into a hockey stick. The fudge factor completely overshadows the real data. I can see why the skeptics were concerned. When this surfaced, Eric S. Raymond, a well-known open source advocate, the guy who wrote The Cathedral and the Bazaar, and also a well-known social conservative, he saw it too. He did the same process and found some of the same issues.
Starting point is 00:14:00 This is blatant data cooking, plain and simple. It flattens the warm temperatures of the 1930s and 40s. See those negative coefficients? Then it adds a positive multiplier to create a dramatic hockey stick. This isn't just a smoking gun. This is a siege cannon with the barrel still hot. Eric Raymond's a vivid blogger, right?
Starting point is 00:14:20 Siege cannon, barrel still hot. It's powerful imagery. And it was coming from an expert in software. So it was hard to dismiss. He wasn't just some random internet crank. Eric at the time, he had a big book out. He was a respected figure in the tech world, and he definitely understood code.
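For what it's worth, my Python recreation of that process looked roughly like this. To be clear, this is a sketch: the numbers are invented stand-ins shaped like the values described above, not the actual coefficients from the leaked file, and a simple moving average stands in for its low pass filter:

```python
import numpy as np

# Invented stand-in for the file's hard-coded "fudge factor" list:
# zero for centuries, a dip around the 30s and 40s, then a sharp rise.
years = np.arange(1400, 1993)
fudge = np.zeros(years.size)
fudge[(years >= 1930) & (years < 1950)] = -0.2          # the dip
blade = years >= 1950
fudge[blade] = np.linspace(0.0, 0.8, blade.sum())       # the rise

# Made-up "climate" data: pure noise, no trend at all.
rng = np.random.default_rng(0)
raw = rng.normal(0.0, 0.3, size=years.size)

# A moving average as a crude low pass filter, to smooth the series.
window = np.ones(21) / 21
smoothed = np.convolve(raw, window, mode="same")

# Apply the "very artificial correction" on top of the smoothed data.
corrected = smoothed + fudge
```

Graph corrected next to smoothed and you see exactly what I described: the uncorrected line stays flat, and the corrected one bends up after 1950, driven entirely by the added values.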
Starting point is 00:14:39 His vivid take on the situation helped shape how people first saw the code. He posited that this was an error cascade. The people at CRU had manipulated climate data with this hockey stick fudge factor and that led many people to believe this false narrative about climate gate. And soon the world was buying into this big lie until this leak happened and the deception came to light. Some claimed that climate change was fake and this fudge factor in this code was proof. Climate change of course wasn't fake, but that
Starting point is 00:15:10 didn't mean that scientists weren't nudging the numbers. Both could be true. So what was really happening? With any good investigation, you can't stop at the first piece of evidence that fits the narrative. You have to keep digging, especially when the accusations are this big. And the deeper I dug, the more I kept seeing another infamous phrase, one that seems like a direct confession that kept coming up, hide the decline. And in reference to the original hockey stick graph published in Nature, there was an email in this leak
Starting point is 00:15:40 that talked about Mike's nature trick. To bloggers and to the mainstream media, this felt like a confession. Some thought this hack was an inside job. Maybe someone at CRU was fed up with all the lies, and so they leaked this data out. But before jumping to conclusions, we need to understand what this code is really doing.
Starting point is 00:16:01 You see, climate science is actually pretty complicated. You can't just read the file. You need to understand the context. So heads up, we're about to do a deep dive, but stick with it. I think it's worth it. All right. Imagine this: you're on call, it's 2 a.m. and your pager goes off. The main transaction system is throwing errors.
Starting point is 00:16:24 Latency is spiking. You dive in, but something goes off. Main transaction system is throwing errors. Latency is spiking. You dive in, but something's off. The detailed performance logs, the granular stuff you need, they only go back six hours. Before that, you just have daily averages. Nothing useful for debugging this spike. You can see the system is acting weird now.
Starting point is 00:16:44 But the crucial question is, is this spike. You can see the system is acting weird now. But the crucial question is, is this spike completely unprecedented, or is this just Tuesday? And that's when the batch jobs run, and then it throws some alerts like this, and you should ignore them. Without that historical context, without those older logs, you're flying blind, trying to figure out the root cause. Climate science is exactly like this. But the system is planet Earth. The mistakes are considerably higher. We have solid detailed data on the Earth's climate, thermometer readings from weather stations, from ships, from lots of places going roughly back 150 years. This is the instrumental record and it tells a clear story of the planet's
Starting point is 00:17:25 average surface temperature has risen by about one and a half degrees Celsius over the last century. Just like that production system with only six hours of logs, 150 years is a blink of an eye in climate terms. So is that degree and a half of warming normal for the earth or is it abnormal? Is it outside of the natural variability? We've got this huge blind spot before the late 1800s. We just don't know. So to answer this question, scientists need to become data detectives. They need to find ways to reconstruct climate history from before the time of widespread measurement. But this isn't like restoring logs from archives. Nature doesn't keep clean, standardized, you know, JSON files.
Starting point is 00:18:13 The data log scientists have to work with were things like the width of tree rings, or the chemistry of ancient ice layers drilled up from Greenland, or the skeletons of corals, or even the temperature profiles found deep underground in boreholes. These are called climate proxies, and they're imperfect, they're noisy, they measure climate indirectly. They're sparsely located around the globe, and they sometimes record things other than temperature. And also they have gaps, and they come in completely different formats. Piecing together the earth climate history
Starting point is 00:18:48 from fragmented and messy data is a huge challenge. Climate science is actually a lot like data archeology. You're using complex statistical modeling and a painstaking process to try to figure out if the picture you're assembling is an accurate representation using all this proxy data. So let's look at some of the main types of data used.
Starting point is 00:19:11 It's really the only way to get an understanding of what's happening in that file. The most famous temperature proxies are the tree rings. This is central to the story because this is actually what CRU focuses on. Trees grow a new layer each year and how thick or dense that layer is often depends on the conditions during the growing season. Maybe how warm the summer was or how much rain fell. So you find some really old trees and you drill out a core and you count the rings back in time measuring their properties. It sounds simple but it's actually messy.
Starting point is 00:19:43 Trees only grow in the mid latitudes of the globe, and you won't find any trees in the ocean or in Antarctica. And even where they do grow, it's not just climate affecting them. Younger trees grow faster, trees get diseases, maybe a nearby tree falls, giving the tree you're measuring more sunlight. It's like a performance metric that's being affected by random GC pauses or network hiccups that you
Starting point is 00:20:10 weren't tracking, or the amount of work available to do, and a million other factors. But there's actually so many trees, so you get lots of data, and hopefully with that much data the individual noise can cancel out and you can find the signal: the aggregate growth rate, year upon year, for the area the trees are in, going back as far as those trees do. And actually even further, we'll get to that. Next up are ice cores. You drill deep into an ice sheet on Greenland or Antarctica or in a high mountain glacier.
Starting point is 00:20:42 And you can get a lot of data out because as snow falls and compresses into ice year after year, it traps tiny bubbles of atmosphere from that specific year. And scientists can measure the CO2 concentration from hundreds or thousands of years ago. Ice cores are how we know that today's CO2 levels are unprecedented. The ice itself, the frozen water molecules, also hold clues.
Starting point is 00:21:07 The ratio of heavy oxygen isotopes to light ones change depending on the temperature when the snow originally formed. So that's another proxy. But it's not perfect. The isotope ratio can be thrown off by where the snow came from, not just the local temperature. And the deeper you go, the more the ice layers get compressed together. So the yearly resolution gets fuzzier and fuzzier. It's like a log file where older entries are being aggressively compressed.
Starting point is 00:21:36 For oceans, and especially in the tropics, scientists look at corals. Corals build skeletons out of calcium carbonate, and they add layers year by year, sort of like tree rings. So corals give us this precious data that we were missing from the vast ocean areas where trees don't grow. And then there are other types of proxies. There's layers of sediment washed into lakes each year that can tell you about the levels of snowmelt. You can use that to infer temperature.
Starting point is 00:22:03 Fossils and deep ocean mud give clues about temperature over millennia, though often really fuzzy in terms of what year it is. And you can even measure temperature down in boreholes that are drilled deep into the earth's crust. I don't totally get how that one works. But the point is, you've got all these different types of proxies. Tree rings measure summer temperatures in North America, coral skeletons record sea surface temperatures in the tropical Pacific, ice cores log polar temperatures, lake mud will tell you about the spring snow melt. So they're all recording
Starting point is 00:22:37 something about climate, but they're all indirect and they're all noisy. And they all have different time resolutions, some annual, some spanning decades, some spanning centuries. And the dating isn't always perfect. Someone is piecing this data together by hand. Also, they all cover different parts of the globe and different seasons. Some stop abruptly. Some end up with weird glitches in them. So how do you take this mix of messy, scattered, imperfect data and turn it into a clear picture of the climate over time? How do you pull together data from systems
Starting point is 00:23:12 that are so different and that are barely documented and that are sometimes reliable and get out of that a reliable view of the system's past, of the Earth's past? The first problem you have to overcome is uneven data distribution. You may have hundreds of tree ring records from North America, but only a few crucial ice core records from the Arctic, and also a few coral records from the tropics. So you pick a year, you have hundreds of values from different proxies and locations, but most of it's tree rings. If you just toss all this raw data into a model, the tree rings would dominate, skewing the results to reflect only the mid-latitude forests and ignoring all this vital polar
Starting point is 00:23:54 and ocean data. That's not ideal for a global temperature view. So before we put together a model, we need to pre-process the data. We have to transform that chaotic mix of raw proxy measurements into smaller and more structured and representative sets of features. We do this with principal component analysis. It works like this. Imagine you're monitoring again a massive microservice deployment. You've got hundreds, maybe thousands of metrics streaming in, CPU load, memory usage, request latency,
Starting point is 00:24:26 error counts, database connections. For every single service instance. So at one moment, you capture a snapshot. You got 500 CPU metrics from your web tier. You have 10 latency metrics from your database cluster, five error rate metrics from your authentication service. So you have 515 numbers describing your system state at one particular moment in time.
Starting point is 00:24:49 But looking at all 500 of these raw numbers is overwhelming and not helpful. And many of those 500 CPU metrics are probably telling you the exact same thing. If the cluster's under heavy load, most of these CPUs will be high. In other words, they're all highly correlated variables. And you don't necessarily care about tiny variations between CPU 101 and CPU 102.
Starting point is 00:25:13 You care about the overall pattern of the load on that web tier. So principal component analysis, PCA, is the algorithm that spots these patterns or themes in your sea of metrics. It would scan all 500 CPU metrics and say, the biggest variation is here. The main signal is whether the whole group is generally high or low. And we'll call that PC1, principal component one, for the web tier.
Starting point is 00:25:40 It might capture another pattern, like front end servers are busy but backend servers are idle, as principal component two, PC2. PCA creates these new synthetic variables, principal components, which are each weighted mixes of the underlying metrics.
Starting point is 00:25:58 The cool thing about principal component analysis is it figures out patterns without needing to know what's what. It's an unsupervised learning method that extracts correlated information from the data. Crucially, these principal components are ordered by how much of the total information in the original data they explain, and each principal component is uncorrelated with the others. Back to the climate data, for a given year, you have these 500 tree
Starting point is 00:26:26 ring measurements and a few ice cores and coral values. Instead of tossing all 500 noisy correlated tree ring values in the main model, you first extract the principal components. PCA finds the main shared patterns of tree growth across that network. The first few principal components might capture 80 or 90% of the meaningful variation. The first component could literally represent the overall good growing conditions of the season, while the hundredth might just reflect something like rainfall in one very small area of North America.
Starting point is 00:27:01 PCA allows you to zero in on the big consistent patterns in tree growth, cutting through the noise of the individual trees. So PCA doesn't give us a final temperature map from our tree rings. Instead, it gives us a neat simplified data set, gives us just a couple data points to look at. And the cool thing is, it's all here in the data leak. While many climate models mix various metrics together for the most accuracy, our briffa file is just based on tree ring data. And if you look around, it's not too hard to find the PCA file. It's in documents, in osborn-tree6, in a file that starts with PCA. It's another IDL file. But getting that data ready for principal component analysis
Starting point is 00:27:48 is no small feat either, because there's another file, documents-osborn-tree-6-rrd-all-mdx1.pro, that does a lot of the heavy lifting to process this raw data. It's nice, though, that it's all here. Now that I'm kind of starting to understand IDL and how these climate models work, I can look through the files and see what they're doing.
Starting point is 00:28:13 So now that we've got our refined proxy features for each year, we can focus on calibration. Calibration depends upon the overlap in time between when we have actual temperature readings and when we have tree core measurements. In our data, this overlap period is from 1856 to 1990. That's when our tree rings overlap with temperature data, although that's not quite true, and you'll see why as we go. But yeah, that is the period where we both have processed proxy features and reliable thermometer temperatures. That overlap is our ground truth for our climate model.
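The gist of calibration can be sketched in a few lines of Python. Everything here is invented toy data, and plain least squares stands in for the actual statistical machinery in the leaked files, but the flow, fit on the overlap window, then apply the fit to pre-instrumental proxies, is the same:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented overlap-period data: an "instrumental" temperature record
# and one proxy feature (think: a principal component) that tracks it.
years = np.arange(1856, 1991)
temperature = 0.005 * (years - 1856) + rng.normal(0, 0.1, years.size)
proxy = 2.0 * temperature + 0.5 + rng.normal(0, 0.1, years.size)

# Calibration: least-squares fit mapping proxy -> temperature
# over the overlap period.
slope, intercept = np.polyfit(proxy, temperature, 1)

# Reconstruction: apply that mapping to (invented) proxy values from
# before thermometers existed, covering 1400 through 1855.
old_proxy = rng.normal(0.5, 0.2, size=1856 - 1400)
reconstruction = slope * old_proxy + intercept
```

The real work, of course, is in everything this sketch waves away: choosing the model, weighting the proxies, and deciding whether the fitted relationship can be trusted outside the window it was fit on.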
Starting point is 00:28:53 We're building a statistical model to link patterns in our proxy features with those in the known temperature records from this overlap period. Think of it like training a machine learning model. I mean, in this case, it it like training a machine learning model. I mean, in this case, it's actually not a machine learning model. It's more simple statistics. The idea is the same.
Starting point is 00:29:11 You give it the process proxy features as inputs and the instrumental temperatures as known outputs. The algorithm figures out the complex correlations and the weights, the best way to basically map from those inputs to the output temperatures during that time period. In our data leak, this process is done alongside the principal component analysis. Ian Harris, known as Harry, throughout this leak, checks the principal components that
Starting point is 00:29:39 are extracted against rainfall records. Rainfall being the strongest non-temperature signal that we have records for. This lets him extract the temperature component, which is the non-rainfall component, which is then used in the graph in the question BRFAA-98 file. Now here's where it gets interesting. I feel like this is the part that the skeptics missed. Harry calibrated his statistical model using the overlapping data, and the PCA helped him pull out the signal. So when you feed your trained model only the proxy data from before the thermometers existed,
Starting point is 00:30:18 from like 1000 AD to the start of our measurement era, the model, using the relationships it learned during calibration, gives its best estimate of the temperature for those years. And just like that you have a curve stretching back centuries showing the estimated ups and downs of the past temperature. You might ask, as I did and I had to look into this, how can you have tree rings that go back to 1000 AD? Well, this tree ring data set is the MXD data set and it actually uses live very old trees but also dead preserved trees that can be exactly dated. And they can be exactly dated via their correlation to live trees. It's more detective work, but basically high altitude, very old dead wood can be found and can be precisely dated. But yeah, building and running the algorithm is just the
Starting point is 00:31:14 start. The next question is, does this work? Is this reconstruction solid, or did we just create a complicated statistical illusion? That's where the verification step comes in. The verification step uses holdout validation. Remember that overlap period where we have both proxy data and thermometer readings? Instead of using all of that to train the statistical model, you deliberately hold back a chunk of the thermometer data, and then you can test against that to see if your model's working. If the reconstruction can successfully predict
Starting point is 00:31:47 the temperatures in the period that you held out, it boosts your confidence that the relationships it learned are real. It's like using a separate validation data set in machine learning. Model validation is the key. And we have a lot of files in this data leak, calpr_bandtemp.pro, calibrate_bandtemp.pro,
Starting point is 00:32:08 and so on and so forth, many files in this leak all aimed at validating the data. And it's actually in this validation step that we find the answer to hide the decline, the controversial phrase that led to the reporting that climate scientists were hiding the truth. But before we dive into those emails and what hide the decline is,
Starting point is 00:32:31 there's another layer to consider because the past climate data isn't just about pinning down a single global temperature. It's a complicated web. The Earth's climate isn't a simple thermostat that slowly goes up or down. It's a chaotic system that's fluctuating on multiple timescales that are all layered on top of each other.
Starting point is 00:32:52 You have events like El Nino and La Nina that pop up every few years, and they warm or cool big parts of the Pacific and shake up weather patterns around the world. You have big volcanic eruptions that send aerosols into the atmosphere, and these particles reflect sunlight, and they cause global cooling for a year or two.
Starting point is 00:33:11 And that's just two of the timescales at play. There's many more, and the big challenge for climate scientists is pulling apart all these overlapping signals. It's much more complicated than just a global yearly average temperature. But okay, all right, we've circled back. Hopefully you made it through all my background.
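Before we do, it might help to make that calibrate-then-reconstruct background concrete with a tiny sketch. Everything here is invented for illustration, the numbers, the single proxy feature, and the simple linear fit; the actual CRU pipeline is IDL code with PCA and many proxies, but the shape of the idea is the same:

```python
# Toy sketch of proxy calibration (all data invented, not CRU data):
# fit a linear map from a proxy series to instrumental temperatures
# over the overlap period, then apply it to proxy values from before
# thermometers existed.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Overlap period: years where we have both tree-ring density and thermometers.
proxy_overlap = [0.1, 0.3, 0.5, 0.7, 0.9]       # e.g. ring-density anomalies
temps_overlap = [13.2, 13.6, 14.0, 14.4, 14.8]  # instrumental temps (deg C)

a, b = fit_linear(proxy_overlap, temps_overlap)

# Project backwards: feed the trained map proxy-only values, say medieval ones.
proxy_medieval = [0.2, 0.4]
reconstruction = [a * p + b for p in proxy_medieval]
```

The punchline is that once the map is fitted on the overlap, applying it to pre-instrumental proxy values is just function application; all the hard work is in whether the fitted relationship is trustworthy.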
Starting point is 00:33:31 With all that proxy data, with all those proxies, and with all that data complexity in mind, let's tackle these infamous phrases. Let's break them down. First, let's break down Mike's nature trick. This sparked huge controversy, right? Was Mike Mann publishing something incorrect? Was he hiding things? Then we'll cover hide the decline, the so-called smoking gun that caused ABC and CBC and the New York Times and the Washington Post all to accuse climate scientists of misleading the public.
Starting point is 00:34:07 But yeah, first, Mike's nature trick. Mike Mann is the man behind the iconic climate change graph. He's the one behind the original hockey stick graph, the one from Al Gore's Inconvenient Truth. And while Mike's nature trick sounds like something from a spy novel, it's not about secret manipulation. It's about taking all this complex data and turning it into a simple graph. Mike had these projections from climate models, right? The proxy data and what they implied. And he also had real temperature data.
Starting point is 00:34:41 Thermometer readings, the straightforward stuff where no crazy stats are needed. You just check the thermometer. His trick was to put both types of data on one graph. Mike used two separate lines, one for real measured temperatures from 1860 to today, and another for proxy temperatures, reaching far back in time, which he also added error bars to.
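To see why the one-line-versus-two-lines choice matters, here's a toy illustration. The series are invented, and the real merge was done in IDL with smoothing and error bars, but the core move can be as simple as letting the instrumental values win wherever both exist:

```python
# Invented toy series: a proxy reconstruction and instrumental readings.
proxy = {1940: 13.8, 1950: 13.9, 1960: 13.7}         # reconstructed temps
instrumental = {1960: 14.0, 1970: 14.2, 1980: 14.5}  # thermometer readings

# Two-line version: plot `proxy` and `instrumental` as separate series, so
# a reader can see exactly where reconstruction ends and measurement begins.

# One-line version: merge them, letting instrumental values overwrite the
# proxy wherever both exist. The seam disappears from the resulting series.
spliced = {**proxy, **instrumental}
```

In the merged dict there is no marker left saying which values were measured and which were reconstructed, which is exactly the readability problem with the single combined line.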
Starting point is 00:35:07 Mike's trick was putting both sets of data on one graph. The proxy data is complex, but it's the real temperature data, which shoots up like the hockey stick's blade, that gives the graph its punch. The thing is, that blade was never in doubt. It's just the yearly average temperature; any weather station could tell you that. Now, the folks at the CRU made a somewhat intentionally misleading choice.
Starting point is 00:35:33 Instead of using two separate lines, they combined them into one line, the instrumental and the projections. Now, climatologists would understand that when the line hits modern times and the error bars go to zero, it's showing real data and not a projection. But not everybody would get that. So that is a little bit misleading, but there's no lies involved. But the real kicker, the real thing that upset people, was emails that said hide the decline. You know, you would have a cold winter or a snowstorm
Starting point is 00:36:05 and politicians would show up trying to cast a suspicious light on global warming with snowballs. Where's global warming now? So when somebody said hide the decline, people were like, yes, I get it. They were hiding the fact that it's actually getting cold. But as I said, it's easy to verify that the world wasn't getting colder. The world was in fact warming. The year 1999 this data came from was
Starting point is 00:36:34 the hottest year on record. So here's the deal. Hide the decline wasn't about covering up a drop in global temperatures. It was about a decision to leave out unreliable post-1960s data. You see, for centuries, tree ring data matched up well with temperature. Warmer conditions meant denser wood formed late in the growing season. But around 1960, this relationship broke down. This is known as the divergence problem, and it does seem like a real issue. We have this temperature data, and this tree ring data is being used as a proxy
Starting point is 00:37:15 to project the temperature backwards 1000 years, and yet it doesn't even work in known periods, like from 1960 to the present. How solid is our past reconstruction if these proxies seem flawed? And here's the thing. I actually found an answer for that. Me, just somebody who downloaded this data leak and started poking through, and read a book or two to fill in some information. I figured it out.
Starting point is 00:37:47 It was pretty exciting for me, and it involved reading lots of this IDL code. But first, before I share what I found, I want to say that questioning this data, looking carefully at this code, even if I assume that climate change is a given, is still a good thing. It's not anti-science to check their work. Critical examination, that impulse that I feel to look closer, is a vital thing, even when it's uncomfortable. No field is immune to bad intentions. Sometimes even foundational work warrants a second look; somebody needs to check it.
Starting point is 00:38:25 And a big reminder of this is a major ongoing investigation in a completely different field, Alzheimer's research. So before I tell you what I found in the data, let me tell you about Alzheimer's research. In it, the dominant theory for decades was the amyloid hypothesis, the idea that these sticky amyloid-beta plaques in the brain were what caused the disease. In 2006, Sylvain Lesné and his team published a paper in Nature that seemed to back the amyloid hypothesis. They identified a protein, A-beta star 56, and suggested that it caused memory issues in mice. And this paper became a cornerstone. It was cited thousands of times, and it ended up directing billions of dollars in
Starting point is 00:39:17 research funding and drug development towards targeting these amyloid plaques. But over the years, things didn't quite add up. Top Alzheimer's labs tried to replicate his findings, but often they couldn't do it consistently. Now, that's a big warning sign, but yet some labs managed to replicate the results, and those led to more research. And then there was drug development
Starting point is 00:39:40 based on those findings. Then enters Matthew Schrag. He wasn't digging through emails or private messages. He wasn't trying to read IDL files like me. He was focused on the science. He was scrutinizing published papers in Alzheimer's research, and he spotted some anomalies, especially in the images included in the papers. It started with some offshoot papers, but the more he dug, the more it led back to Lesné's 2006 Nature paper. Basically, he was able to tell that the images were photoshopped. Somebody had used a cloning tool, and you could see mismatched backgrounds or lines that appeared too clean. And this wasn't just online talk that he posted on his blog.
Starting point is 00:40:27 No, his work led to a major investigation that was published in Science magazine in 2022. It wasn't just misunderstood jargon or internal debates. In this case, it was actually the integrity of visual evidence in peer-reviewed studies. It had a huge follow-up. The follow-up is actually still ongoing. Lesné's university launched an investigation. Nature issued a cautionary editor's note on the original paper.
Starting point is 00:40:58 All these things feel pretty mild. But what's now known is that these results don't hold up. This was fraud. The process of retraction is messy and slow, because no one wants to admit they've been chasing a lie. There's huge damage done to the field, but there's also a chance for science to self-correct. Scientists are human, right, and some will cheat. And Schrag's investigation shows the danger of a real error cascade. That 2006 paper wasn't just a study, it was a foundation. Thousands of studies built on it. Billions of funding followed.
Starting point is 00:41:36 Patients took drugs based on faulty research. Drugs that were costly, drugs that had side effects and that even led to deaths. Drugs that ultimately failed to cure or help with Alzheimer's. An entire field poured resources down a path that led nowhere, all because of some fraud in a key study. I just mention this because this investigation reminds us that skepticism is vital. Questioning findings, even influential ones, is crucial. This impulse to dig deeper is sound. That's why I think I need to apply this skeptical spirit to Climategate and to this briffa_sep98_e.pro file. But yeah, I think we can now understand what's happening in that file.
Starting point is 00:42:26 The startling comment that caused such a stir, applies a very artificial correction for the decline, followed by the fudge factor array. We can now explain what those are. At first glance, skeptics like Eric Raymond said that this was a smoking gun. It seemed super damaging. It looked like clear evidence of data manipulation to force that hockey stick shape. But now we know the decline is not about global temperatures dropping. It's about certain proxies, like the tree ring data, no longer being reliable indicators.
Starting point is 00:43:01 Here's how I know. Here's what I found. Remember those calibration files I mentioned, like calibrate_bandtemp.pro? They're really crucial. When you run the whole process, PCA, correlate, and then validate on this tree ring data, the predictions that come out are pretty noisy. There's something in the data, especially from the overlap period, that's causing noise and making the predictions inaccurate. So for Harry, or the team, or whoever, after digging into the data, the issue became clear. The post-1960 tree data.
Starting point is 00:43:36 For centuries, these rings matched up with the temperature readings. Warmer summers meant denser rings. But after 1960, that link broke. The thermometers showed warming, but the rings suggested cooling. Something changed. Something changed with how trees were growing on Earth. Maybe the extra CO2 from global warming. Maybe the trees just don't grow the same forever. Maybe pollution. Maybe chemicals. We don't know. But the trees weren't matching predictions. But they found a way to overcome this. They would skip the post-1960s data for the principal component analysis. By focusing on the data before 1960, they could better
Starting point is 00:44:21 extract the signal. If they removed that post-1960s data, they could better estimate the temperatures going backwards. So that gave them a better ability to project backwards, but it led to a problem, right? When they fed that model forward to the post-1960s period, it predicted lower temperatures. So if the global temperature was 14 degrees in 1972, the model would say 12.
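Seen as code, the decision itself is mundane. Here's a hedged toy version of it, with invented numbers standing in for the leaked IDL: drop the diverging post-1960 values before calibrating.

```python
# Invented tree-ring density anomalies by year. After 1960 the proxy
# diverges: it suggests cooling while thermometers show warming.
rings = {1940: 0.6, 1950: 0.7, 1955: 0.8, 1965: 0.3, 1975: 0.2}
temps = {1940: 13.8, 1950: 13.9, 1955: 14.0, 1965: 14.1, 1975: 14.3}

CUTOFF = 1960  # calibrate only on pre-divergence years

usable = sorted(y for y in rings if y < CUTOFF)
calib_proxy = [rings[y] for y in usable]
calib_temps = [temps[y] for y in usable]
```

That one filter is the whole "decline" decision: the post-1960 proxy values never enter the calibration, so any model trained on `calib_proxy` will predict a decline if you later feed it those excluded years.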
Starting point is 00:44:49 They found a way to build a model that predicts past temperatures well, but shows a decline just as the world heats up. That is the divergence, right? That's the failure of the specific proxy data post 1960. That's the decline that they are hiding. The reason it diverges is because of the way they built the model to ignore whatever changed post-1960.
Starting point is 00:45:17 It's actually all in the leak. If you look through the calibration attempts, you can find them performing these. They used the data from 1911 to 1960 to build the model and then verified it backwards using data from 1856 to 1910. And that worked better than if they used 1911 to present day. This wasn't a secret. They in fact published a paper on the divergence problem in Nature back in 1998.
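That backwards verification is just holdout validation with the holdout placed in the past. A sketch, again with invented data standing in for the leaked calibration runs: fit on the 1911-to-1960-style chunk, then score predictions against held-out earlier years.

```python
# Toy backwards-verification sketch (invented data, not the CRU runs).

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Calibration chunk (think 1911-1960) and held-out earlier chunk (1856-1910).
train_proxy, train_temp = [0.0, 0.5, 1.0], [13.0, 14.0, 15.0]
hold_proxy, hold_temp = [0.2, 0.8], [13.4, 14.6]

a, b = fit_linear(train_proxy, train_temp)
predictions = [a * p + b for p in hold_proxy]

# Mean absolute error on the held-out years; a small error suggests the
# calibrated relationship is real rather than a statistical illusion.
mae = sum(abs(p - t) for p, t in zip(predictions, hold_temp)) / len(hold_temp)
```

The interesting part is the direction of the split: because the divergence lives in the recent data, holding out the oldest instrumental years gives a cleaner check than holding out 1960 to the present.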
Starting point is 00:45:43 It was a known issue. But it's fascinating to me that you can dive into the code and you can see how they derived this. It doesn't clear up everything, right? As I said, when a key proxy method goes wonky, just as we have better tools to check it, it does raise real questions about how reliable the method is.
Starting point is 00:46:03 But the puzzle here is just about the limits of this specific proxy. It's not about a lie. And then going further, if we look at our file, our briffa_sep98_e, the file name is telling. The underscore e is actually some old-school version control. There's actually A through D as well. And these are all found in a personal folder named harris-tree, for Ian Harry Harris, the programmer. And that fudge factor, those hard-coded numbers, they look like a hockey stick graph
Starting point is 00:46:38 in the context of the divergence problem. That's pretty clear now. This is actually Harry manually mixing in the instrumental data, the real-world temperature data. As I said, ideally you'd show these as two separate lines, but Harry was just trying to manually hack the instrumental data into his graph. But here's the real kicker: this wasn't the code that was used for the paper. In the leaked files, there's a whole different directory where the actual published data is.
Starting point is 00:47:10 There's briffa_sep98_decline1 and briffa_sep98_decline2. These files are quite similar, but they tackle the divergence problem differently. They don't have a hard-coded fudge factor. They don't mention an artificial decline. Instead, they read the actual instrumental data from files. There's no fudge factor. There's just reading in the temperature and adding it to the graph. The actual methods used later just read temperature data from a real public source. So the core accusation, that scientists were literally inventing numbers to fake warming, doesn't hold up when you actually look at what the files were. It's also crucial to just zoom out and remember what this data set is. This is the CRU high-latitude tree ring
Starting point is 00:47:55 density data. This is the stuff with the divergence problem. It's just one single thread in a vast tapestry of climate science. The overall conclusion that the Earth is warming and that humans are the primary cause doesn't rest on this file, or in fact, on this leak. It comes from the convergence of many independent lines of evidence gathered and analyzed by all kinds of scientists worldwide. In fact, the graph that Al Gore used was based on
Starting point is 00:48:25 ice core samples, not this data at all. So there's no error cascade here. The CRU data matters, especially for reconstructing detailed temperature maps of the northern hemisphere land temperatures over the last millennium. But that's just a part of the story. The attackers who leaked these files and the bloggers who spread the story weren't actually doing a thorough review of the CRU's work. Perhaps that's not surprising. Likely they just ran keyword searches for terms like trick or hide or artificial. And in this massive dump of emails and files, they found some juicy snippets in one file that was never used for a published paper,
Starting point is 00:49:06 and they took them out of context and claimed that they found a lie and that they found a conspiracy. Here's where the Alzheimer's story stands out as being quite different, right? Matthew Schrag wasn't sifting through stolen emails for dirt. He was carefully examining published scientific evidence, paper by paper. He was questioning its integrity through complicated visual analysis. This was skepticism aimed at the science itself, leading to potential corrections for the field. In fact, he did it because he wanted to get the field back on track. Climategate was driven by a specific code file. It used out-of-context chatter. It used experimental code to target scientists and to sow doubt rather than engaging with the full body of the published work.
Starting point is 00:49:53 And in fact, it was all timed to happen right before the Copenhagen climate conference. So there are some pretty strong hints that there was a political agenda here. Find a lie, and then you can say that they're lying about everything. But here's the cool part. Maybe the real story in the Climategate files isn't about conspiracy or fraud at all. Maybe it's about something far more mundane, yet I think profoundly important. The unglamorous, often frustrating reality of being a programmer trying to make sense of messy scientific data. Because Ian Harry Harris, the CRU programmer whose name is on this folder, harris-tree, in the leak there's another file, a massive text document, 15,000 lines long, called HARRY_READ_ME.txt. And it's basically Harry's personal log
Starting point is 00:50:46 stretching over the years, documenting his day-to-day struggles to maintain and update and debug these climate datasets and to work on the code that's used to process them. And reading it is like, well, if you've ever worked on a legacy code base or if you've ever tried to integrate data from dozens of different inconsistent sources, I think you can feel a deep sense of empathy for Harry.
Starting point is 00:51:11 Harry wasn't writing about grand conspiracies, he was writing about the grind of data wrangling and the challenges of software archaeology. He writes about an Australian data set being a complete mess, that so many stations have been introduced and he can't keep track of it. He complains a lot about Tim and Tim's code, and I assume that Tim is somebody who came before him and didn't sufficiently document what he did. Sometimes he just writes, oh fuck this, all in caps. As in, oh fuck this, it's Sunday night and I've worked all weekend and just when I thought it was done, I'm hitting yet another problem and it's the hopeless state of our databases. There's no data integrity. It's just a catalog of issues that keep growing on and on. Reading Harry's log, you don't see this cunning manipulator working to hide
Starting point is 00:52:01 inconvenient truths. You see just an overworked programmer, likely under-resourced, grappling with complex, messy real-world data and imperfect legacy code. And he's just, he's doing his best to make sense of it all. He's dealing with inconsistent formats. He's dealing with missing values, undocumented changes. It's just the kind of stuff that data scientists and that legacy software engineers everywhere deal with daily.
Starting point is 00:52:27 And he leaves all these exasperated comments and they don't sound like admissions of fraud, just like the slightly cynical remarks of someone deep in the trenches of doing the difficult work of climate change. Maybe this is the real story of climate gate. It's not a scientific scandal, but a human one. A story about the immense and often invisible technical labor required to turn noisy observations into scientific understanding. And the pressures faced by those tasked with doing it often without recognition or even the resources they need. And then after all that, they get attacked and their private work files become the hot
Starting point is 00:53:10 topic on ABC News. So where does all this leave us? After all, the sound and the fury and the investigations and the accusations, I mean, what did Climategate really reveal? At first, the media jumped on this idea that this was a smoking gun. Nobody wanted to deal with global warming. I mean, nobody still wants to deal with it. Al Gore called it an inconvenient truth.
Starting point is 00:53:36 So there was hope. There was hope that it was all a mistake or a fraud. And people ran with that. Newspapers churned out stories of deception for weeks after the leak, and the investigations came much slower. There were eight official inquiries, yes, eight, and all came to the same conclusion. No fraud, no scientific misconduct. Climate science's core findings stood firm. The hockey stick graph could be
Starting point is 00:54:07 debated for some of its statistical details. You can debate the limits of some of these proxies but it's backed by many other studies that use different methods and use different data. The trick wasn't a deception, it was just a graphical choice. And hiding the decline wasn't hiding a global cooling trend. It was about dealing with a known issue. Climategate wasn't proof that climate change was a hoax. It was more like a case study and how internal scientific discussions and informal language and experimental messy code can be twisted when leaked into a charged climate where people
Starting point is 00:54:46 are looking to create doubt. If I were to take a lesson from the Climategate saga, it would be about the necessity of transparency in science, especially things like climate science. What if, from the start, all the raw data and code and statistical methods were out there? What if they were publicly accessible to begin with? I imagine them on GitHub, ready for anyone to run and critique. And actually, as a result of all this, CRU now has the instrumental data available under an open government license. And while Eric Raymond's initial take on the code file is what caused a big stir, he was
Starting point is 00:55:26 right about one thing, because he demanded that they open source the data. I feel like that's a principle I can agree with him on. Climate science, with its global stakes and complexities, should embrace open source, should embrace open access as much as possible. Science isn't always neat. It's a human process full of debate and messy data and evolving methods. But like software development, it gets stronger and more robust and more trustworthy when
Starting point is 00:55:56 the process is open, when the data is shared, when the code is available for review. That's my takeaway from the whole affair. It's not about a conspiracy revealed, but a powerful argument for doing science in the open. We live in a world in which science is more than ever under attack and underfunded and being questioned and being politicized. I think the best defense against that is to be open. That was the show. How many people made it this far? I don't know. Honestly, I started by diving into this Climategate code and it got more interesting as I went along, but I'm still pretty unsure about how interesting it is for others.
Starting point is 00:56:50 There's like a lot of interesting tangents I went on that I had to cut as well, but I came away with one big idea. Climate science is kind of interesting. It's a little bit like data science, except in climate science you're dealing with messier data and you often have to gather it and label it yourself. But you get to work with a community that's striving for shared knowledge. Climategate makes it sound like it's all about global warming models and politics, but really it's more about diving deep into specific issues like how the layers of sediment in this certain data set can affect the feedback cycle in the Atlantic Ocean temperatures. Harry's exasperated cynical
Starting point is 00:57:30 grievances notwithstanding, it actually sounds pretty interesting. But yeah, let me know what you think of this episode and until next time, thank you so much for listening.
