The Science of Everything Podcast - Episode 79: Basic Concepts in Statistics

Starting point is 00:00:33 You're listening to The Science of Everything podcast, episode 79, key concepts in statistics. I'm your host, James Fodor. So in this episode, we're going to look at some of the key conceptual ideas in the discipline of statistics. Needless to say, since this is an audio-only podcast, there aren't going to be any explicit mathematical formulas here. I mean, it is essentially mathematics, but a lot of mathematics can be described conceptually, even if the calculations can't be performed. What I'm interested in for this show is explaining the concepts of statistics in a way that's hopefully fairly clear and straightforward.

Starting point is 00:01:14 So statistics is the study of the collection, analysis, interpretation, and presentation of data. There are many different types of data, and we'll talk a little bit about some of those differences shortly. But the basic purpose of statistics is to take numbers, so data, information of some form that we've gathered in some way, and use that to produce useful information that we can utilize for some particular purpose: to make a decision, or to increase our knowledge about something. Because data is not the same as knowledge. Data is just numbers or categories; we have to put it into a form that is useful and allows us to extract knowledge from the data. That's essentially what statistics is about, loosely speaking.

Starting point is 00:02:03 Now, statistics is in use everywhere; wherever you look, there's statistics. The natural and social sciences, business, government, medicine, engineering, advertising, and the military: wherever you go there is statistics, whether you're aware of it or not. So it's useful to have at least some understanding of the key concepts in statistics. Statistics is increasingly finding application even in things like law, elections, campaign promises and things like that. So regardless of where you are or what you're doing, it's useful to have at least some basic grasp of these concepts. Now, many people find statistics confusing, whether they first encounter it in an introductory business course or at high school or in psychology or engineering or wherever else. Now, it's certainly true that some of the calculations can be complex or tedious, depending on the level of detail that you're going to and how sophisticated the models are that you use.

Starting point is 00:03:00 That being said, I think the basic core concepts of statistics are relatively straightforward. And one of the big problems is just that they're often not explained very well, or perhaps even at all, in favour of just going through the motions of doing the calculations. You can do the calculations without understanding what you're doing. I mean, that's what computers do, right? They don't understand statistics. They just do number crunching. But what we're interested in here is not number crunching, it's understanding the use and abuse of statistics. So in this episode we're going to talk a little bit about the basics of probability to set us on a good grounding. Then I'll talk a bit about some of the

Starting point is 00:03:35 different types of data, sampling methods that we use to get that data, and then some of the statistical operations and tests that we perform on that data, including a look at descriptive and then inferential statistics. And then to conclude, I'll talk about a few of the most commonly used statistical tests, focusing on the chi-square test, the t-test, and linear regression. So, that being said, let's make a start. Statistics begins with probability. So what's probability? Well, probability is essentially the study of chance.

Starting point is 00:04:12 Modern probability theory originated actually in the study of gambling, but it can be applied much more broadly than that. A probability is just a number, essentially. It's a number between zero and one. Zero indicates that something cannot happen, whereas one indicates that it certainly will happen, certainty versus impossibility. So probabilities are just a number,

Starting point is 00:04:31 always between zero and one. The higher the probability, the more likely it is, or the more certain, that the thing will occur or that it will actually happen. In probability, we often talk about events. An event is just something that can happen, or another way that you can say it is it's a set of outcomes. So, for example, an event could be rolling an even number on a six-sided die, or tossing two heads in a row in a series of coin tosses. That's an event. You notice that an event's not just a single outcome. It's not just one thing that happens, but it's some collection of outcomes that we're interested in. So rolling an even number, for example, that could be a two, four, or a six. There's three different outcomes that correspond to that event. And obviously in the real

Starting point is 00:05:16 world, we can have much more complex events. But it's just something that can happen and you can assign a probability to that event. And often in probability we're interested in what is the probability of a certain event occurring? How likely is it, for example, that there'll be a hurricane of a certain size within the next 10-year period? That's an event that we assign a probability to, and then we make decisions on that basis. Another important concept in probability is the idea of statistical independence. Now, two events are independent if whether one event occurs has no effect or no influence or no bearing on the probability that the other event occurs. So, for example, if we have a coin toss, a common example of probability

Starting point is 00:05:55 theory, two successive tosses of the coin should be independent of one another. That is, whether I get a head or a tail on the first toss, that should have no bearing on what I get on the second toss. There's no reason that those should be connected in any way. Unless there's something weird going on with the coin, but in general, coin tosses should be independent of each other. Many events in the real world are not independent of each other. So, for example, the weather today is not independent of the weather yesterday, because the weather yesterday affects the weather today. And similarly, the weather today will affect the weather tomorrow.

Starting point is 00:06:26 So those events are not independent. When two events are independent, you can calculate the probability of both events occurring by simply multiplying their probabilities. So, for example, the chance or probability of tossing heads with a fair coin is 50%, or 0.5. And since two coin tosses are independent, the probability of tossing two heads in a row is just 0.5 times 0.5, which is 0.25, or 25%. So a very simple calculation there. It's often tempting to assume that events are independent of each other because it makes the calculations much simpler. But in practice, events are often not independent, and therefore we should not make that assumption too readily.
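
For readers following along in text, the ideas of an event as a set of outcomes and of multiplying probabilities for independent events can be sketched in a few lines of Python. This is only an illustration: the helper name `probability` is made up for this sketch, and it assumes every outcome in the list is equally likely (a fair die, a fair coin).

```python
from fractions import Fraction
from itertools import product

def probability(event, outcomes):
    """Probability of an event (a set of outcomes), assuming all outcomes are equally likely."""
    favourable = [o for o in outcomes if o in event]
    return Fraction(len(favourable), len(outcomes))

# Event: rolling an even number on a six-sided die (three of six outcomes).
die = [1, 2, 3, 4, 5, 6]
p_even = probability({2, 4, 6}, die)

# Two tosses of a fair coin: list every equally likely pair of results.
coin = ["H", "T"]
two_tosses = list(product(coin, repeat=2))
p_head = probability({"H"}, coin)
p_two_heads = probability({("H", "H")}, two_tosses)

print(p_even)                          # 1/2
print(p_two_heads)                     # 1/4
print(p_two_heads == p_head * p_head)  # True: the multiplication rule holds
```

Counting outcomes directly and multiplying 0.5 by 0.5 give the same answer here precisely because the two tosses are independent; for dependent events, such as weather on successive days, the shortcut would not be valid.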

Starting point is 00:07:06 And I'll give some examples of where this can be important later on. So, just one last conceptual point before we move on. Probability theory is the study of events, so things that can happen, whose outcome we don't know beforehand. Now, beforehand doesn't necessarily mean before in time. It can just mean before we know what the information is; even though it might have already happened, we don't know what it is. So before we find out, in a sense. Probability doesn't depend on whether the events are truly random in some sort of philosophical sense, whether they're non-deterministic or not. That actually doesn't matter. All that matters

Starting point is 00:07:43 is if we can treat the events as if we don't know what they will be, as if we don't know what will happen until that information is given to us. So, for example, you might think that the weather tomorrow, or the weather in a week's time, is completely determined by the weather today. If we knew the position of all the particles in the atmosphere and their velocities and the amount of radiation coming in from the sun and all of that stuff, if we knew everything about it, then we could forecast with complete exactness what the weather would be in a week's time, a year's time, whatever. But the fact is we can't do that. We don't know all of those things. We don't have access to all the information. Also, we don't have the ability to perform

Starting point is 00:08:25 the computations that would be needed to make those determinations. So in practice, we can analyze weather in probabilistic terms, the probability of it raining tomorrow or seven days from now, even if we think that at some deeper level that event is fully determined. So just bear in mind that probability theory is not a philosophical exercise in what is determined or not. You can think of it as an epistemic exercise relating to what we know, what information we have about what's going to happen.

Starting point is 00:08:57 As long as you don't know what's going to happen, we can apply probability theory to it. Obviously, if you do know, then you can't assign probabilities because the probability of something that we know is one. That's certain, right, because we already know it. There's another important concept in probability theory, which links us directly to a lot of statistical analyses. That is, we often describe probability outcomes in terms of what's called a probability distribution. In essence, a probability distribution is just a list,

Starting point is 00:09:22 or slightly more technically, it's a mathematical function. But for the moment, we can think of it just as a list of each possible outcome and a number associated with how likely it is to occur. There are more details to it if you get into the mathematics, but this will do for us at the moment. Now, the most widely used distribution in probability and statistics is called the normal distribution. It's also sometimes called the Gaussian distribution after the guy who came up with it

Starting point is 00:09:49 or first popularized it, I'm not sure exactly, but it has a characteristic bell shape. So if you've heard of a bell curve or seen something that looks like that, you should know what I'm talking about. Basically, it's high in the middle and then quickly tapers out to either side. What that means, essentially, is that numbers near the middle are much more likely than numbers further away from the middle, and the probability decreases very rapidly

Starting point is 00:10:13 as you move further away from whatever the center is. That's the normal distribution. It describes the relative likelihood or probability of different outcomes. Now, many phenomena in physics, chemistry, biology, psychology, economics and elsewhere are normally distributed or close to normally distributed, and so the normal distribution is very commonly used in all sorts of applications. Many statistical tests, or at least the most basic forms of many statistical tests, assume that the variables in question are normally distributed.
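
As a rough illustration of that bell shape, here is a minimal sketch using `NormalDist` from Python's standard `statistics` module. The mean of 178 and standard deviation of 7 are made-up values chosen only for the example, not measurements of anything.

```python
from statistics import NormalDist

# Hypothetical values for illustration: centre 178, standard deviation 7.
curve = NormalDist(mu=178, sigma=7)

# The density is highest at the centre and falls off symmetrically on both sides.
for x in (157, 164, 171, 178, 185, 192, 199):
    print(x, round(curve.pdf(x), 5))

# Share of outcomes within one and two standard deviations of the centre.
within_one_sd = curve.cdf(185) - curve.cdf(171)
within_two_sd = curve.cdf(192) - curve.cdf(164)
print(round(within_one_sd, 3))  # 0.683
print(round(within_two_sd, 3))  # 0.954
```

So roughly two thirds of outcomes land within one standard deviation of the middle, and about 95% within two, which is what "most likely near the middle" means in numbers.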

Starting point is 00:10:42 So that just means that the possible outcomes of the events or the variables that we're measuring follow the normal distribution, that is, they're most likely near the middle and increasingly unlikely as you move further and further away, according to that particular bell curve shape. That's all it means for a variable to be normally distributed. Its possible values take that shape. Okay, now let's move on from probability and talk a bit about data, because statistics is built on analyzing data, but we have to know what data is before we can analyze it, right? So data is just some information that we have. Usually we think of it as being in a raw form. So it could be a list of people's heights and weights, for example, or a list of people's

Starting point is 00:11:22 incomes, or even a list of categorizations. You know, if you have a bunch of countries, you could list what international organizations they're members of, all of that type of data. You can have data about pretty much anything. Now, there are different types of data, and the type of statistical analysis that we can conduct depends upon the type of data that we're dealing with, so we need to know what type of data we're dealing with, obviously. The two main types of data that people usually talk about are categorical and numerical data. Categorical data can only have certain values. It can't just have any old value.

Starting point is 00:11:55 The values can be numbers or they can be just categories. It doesn't matter so much. But the point is there are only certain ones that it can be. For example, country of birth. There's only a certain number of countries in the world. You can't be born half in one country and half in another country. Now, numerical data, by contrast, can take any value over some range. Now, income is a good example of numerical data,

Starting point is 00:12:18 because your income can essentially be anything from zero up until, well, whatever the highest income is, tens of millions a year. There is some ambiguity in the boundary between categorical and numerical data, because usually there's a smallest unit in which you can meaningfully measure something. For example, you can't measure income, or even meaningfully talk about income, in increments of less than one cent. It doesn't really make any sense. So you could say that, well, income is actually categorical, because you can't have an income between $10.00 and $10.01.

Starting point is 00:12:52 So there's some ambiguity there, but usually it's pretty clear which type a given data set is closest to, whether it's categorical or numerical. If I'm looking at survey data, for example, and people mark in their gender, that's almost certainly going to be categorical: male or female, neither, perhaps. Age, that's numerical, even though you can probably only choose a whole number of years; nevertheless we would usually regard that as numerical because there's enough variability across the numbers. And so on, you get the idea. Usually it's relatively clear whether something's categorical or numerical. Now, there is a third type of data (there are others that have been proposed as well), and that's ordinal data, which is sometimes important. Ordinal data can be placed in, well, order, hence the name, but the size of the difference between categories is not meaningful. So an example of ordinal data would be if I were to rank athletes

Starting point is 00:13:48 in a sprint in terms of the order that they pass the finish line. So first place, second place, third place, and so on. Now, the order of those possible categories means something. First came before second, which came before third, and so on. However, if I just tell you the order, that doesn't tell you anything about how far in front of the second place, first place was, and so on. I would need to tell you the times that they took for the sprint, which would be numerical data. Okay, so that's a brief overview of different types of data, and as I said, the different types of data that you deal with has a big effect on what types of analysis you do. So that's generally the first step in any statistical analysis is figuring out what type of data you have, what types of question you're trying to answer, and then working out what statistical test is most appropriate to use. If you use the wrong statistical test, you'll get misleading or even completely uninterpretable results.
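
To make the sprint example concrete, here is a small sketch of how turning numerical times into ordinal places throws away the size of the gaps. The runner names and times are invented purely for illustration.

```python
# Hypothetical sprint times in seconds (numerical data).
times = {"Ana": 10.02, "Ben": 10.03, "Cho": 10.91}

# Ranking the runners keeps only the order (ordinal data).
ranking = sorted(times, key=times.get)
places = {name: place for place, name in enumerate(ranking, start=1)}
print(places)  # {'Ana': 1, 'Ben': 2, 'Cho': 3}

# Each place is "one apart", but the gaps in time are very different.
gap_1_2 = round(times["Ben"] - times["Ana"], 2)
gap_2_3 = round(times["Cho"] - times["Ben"], 2)
print(gap_1_2, gap_2_3)  # 0.01 0.88
```

From the places alone there is no way to tell that first and second were almost level while third was well behind, which is why averaging or subtracting ranks as if they were measurements can mislead.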

Starting point is 00:14:40 So that's a very important step in any analysis, or in interpreting an analysis if you're looking at someone else's results. Now, another important aspect of a statistical analysis is knowing where the data came from and how it was collected, and this relates to a field called sampling methods. So, before we conduct a statistical test, we obviously need data to analyze, but where does the data come from? How did we get it? Statistics, like pretty much everything else, very much follows the GIGO principle, that is, garbage in, garbage out,

Starting point is 00:15:12 meaning that no matter how sophisticated the statistical tests we use and no matter how well we do it and how careful we are in reporting our results, if the data that we use is rubbish or was poorly collected, then our results are going to be rubbish. Simple. Your results can't be any better than the data that you put in, essentially. They can be worse, but they can't be any better.

Starting point is 00:15:34 So before we talk about sampling methods and where the data comes from, I need to explain the difference between a sample and a population. In statistics, the population is just the group of things that we're interested in, the whole group. It doesn't have to be a population of animals or people, although it could be. It could be the total number of people in a country, for example, the population of the country. Or it could be the total population of rabbits that live in a particular field, or the total set of bacteria species that live in your colon. Or it could be the set of people in a school, the set of countries in the world,

Starting point is 00:16:06 the set of transactions that happened last year in a given business, the set of all the atoms in a particular crystal, the set of proteins in the human body, and so on and so on. So a population can be pretty much anything; it just depends on what you're interested in looking at. But the key thing about a population is that it's always inclusive, that is, it includes everything within the confines of what you're interested in.

Starting point is 00:16:26 Now, a sample differs from a population in that a sample is a subset of a population. It's selected from the population, but does not include all members of the population. So, for example, if I want to learn about the crystalline structure of a crystal, I could examine every single atom in the crystal. That's probably not going to work because there are too many.

Starting point is 00:16:48 So I might just look at a few atoms in a particular region of the crystal, or maybe just pick atoms at random over the crystal structure and examine them in detail, likewise with proteins in the human body or people in the population. Now, a telephone survey is a very good example of a sample that is drawn from a population. When people are forecasting elections, they don't ask every single person in the population who they're going to vote for. Instead, they ask maybe a few hundred or a few thousand people

Starting point is 00:17:13 in the population who they're going to vote for, and that constitutes their sample. So a sample is simply a subset taken from the population, from which data is recorded. Now, why, you might be asking, don't we just record it from the whole population? Then we wouldn't need a sample. We would just have the population, right? Well, in some sense, recording from an entire population would be easier, because then we wouldn't have to do a lot of statistics. A lot of it would be rendered unnecessary if we just had data from the entire population. However, the trick is that in nearly all cases, it's impractical to gather data from an entire population, and possibly may even be impossible

Starting point is 00:17:52 in some cases. So, for example, if we're conducting a survey for an election, you can't go and ask X number of million people who they're going to vote for. That would just be holding the election again, right? Likewise, if we're dealing with populations of animals or of humans or even of chemicals or types of molecules or whatever, there are just too many of them to possibly analyze all of them. So we have to use subsets or samples in order to do anything meaningful. The point is that in nearly all cases it's impractical, and sometimes impossible, to measure entire populations. So that's why we take samples. But that then gives rise to the question, what can we say about the population on the basis of a sample? If I've only asked a thousand

Starting point is 00:18:33 people who they're going to vote for, then what can I say about the 10 million who are going to vote next Tuesday? It seems I shouldn't be able to say anything about the whole population on the basis of such a small subset, but it turns out that using statistics we actually can, and if we do our job properly as statisticians, we can make very reliable claims about populations

Starting point is 00:18:52 based on subsets from those populations. There are many different ways of collecting a sample from a population, and I won't discuss them all here. That's really a whole show unto itself. Probably the gold standard, in quotes, is a random sample. That is, a sample selected on the basis that each member of the population has an equal chance of being selected. And then I just select a bunch of them at random. You know, I can roll a die

Starting point is 00:19:18 or use a random number generator or whatever. And then I only take data on that little subset that I randomly select. Now, again, in practice, it's often difficult, if not impossible, to use purely random samples. And the reason is because often we don't have a way of truly randomly selecting from the population. You think about animals, for example, we might want to track the population of an endangered species

Starting point is 00:19:45 in a particular habitat. And one way that they do that is by randomly capturing a subset of the population, or at least tagging them. They capture them, tag them, and release them or something, and measure their age and their weight and other things like that to see how the population is doing.

Starting point is 00:20:01 Trouble is, though, if we try and capture animals, for example, we could use traps or whatever other mechanism we would like, but that might have a greater probability of catching, say, smaller animals, or maybe larger or slower ones. Some animals will have a greater chance of being captured than others, and therefore we won't necessarily get a truly random sample of the population. Likewise, if I'm doing a survey for an election, even if I use random digit dialing,

Starting point is 00:20:27 which is one of the most common ways of conducting a survey now, that is very good at collecting a random sample of telephone numbers, but it's not necessarily great at getting a random sample of people on the end of those telephones, because not everyone has a telephone or answers their telephone. Not everyone has only one. Some people have multiple numbers, some people have none, some people are more likely to answer than others. So again, not everyone in the population has an equal chance of being selected,

Starting point is 00:20:53 so it's not truly random. Nevertheless, though, we can try to sample randomly from populations, and usually that's the best method. What's most important about sampling is not really whether it's random, because often you can't get truly random selection, but rather that the sample is what we call representative of the population.

Starting point is 00:21:13 Now this means that the sample looks like the population as a whole. It doesn't look different in some way. And that's actually crucial, because obviously if the sample looks different, even just slightly different from the population as a whole, then we're going to make incorrect inferences about the population. This is how statistical inferences often go wrong. Classic example was, I think, back in the 20s or 30s, there was a famous poll

Starting point is 00:21:35 conducted, I think it was by the Literary Digest, for the presidential election, and they said one of the candidates was going to win by a landslide, and they said that they were very confident in this because they had a huge sample, like a couple of million people, I think, in their sample, but they were very wrong. The other candidate won, and they were off by a significant margin. So why did that happen? Well, it wasn't because their sample was too small. In fact, their sample was far larger than it needed to be. That wasn't the issue. The issue was that their sample was not representative of the population as a whole. Because I think the way they did it was to use telephone directories, or it might have been people who subscribed to their magazine.

Starting point is 00:22:11 I've forgotten exactly how they selected their sample, but it doesn't matter. The point is, it wasn't a representative sample. Some people in the population were more likely to be selected than others, and in particular, I believe it was people who subscribed to their magazine, who tended to be wealthier than the average American, this was in America, so their sample looked wealthier than the population as a whole. And if it looks wealthier, it may well vote differently to the population as a whole. So their sample may well have voted for one candidate, but the population didn't, and because their sample was different to the population, they made an incorrect inference. So that's why it's very important for your sample to be representative. And one of the best ways

Starting point is 00:22:47 to make it representative is just to choose randomly, because then it really should be representative as long as you pick a big enough sample. Often it's not possible to choose randomly, so other methods are constructed to try to ensure that samples are representative of the population. So, for example, if I know that X percent of the population is male, Y percent is of a certain age, Z percent is of a certain race and so on, I can try to deliberately survey people so that my sample looks like the population. That's not the best way of doing it, because then, of course, what if there are other things that you haven't considered, other ways that the sample could differ from the population? But nonetheless, there are

Starting point is 00:23:25 various ways of doing this to try and get samples that are as representative as possible. As I also mentioned, sample size does matter. Usually for any sort of meaningful statistics to be done, you want to have sample sizes of at least 20 or 30. I mean, you can do statistical analysis with a smaller sample size, but the smaller it gets, the more iffy it gets. Sample sizes of a few hundred are quite good. Sample sizes of a few thousand, say, in election surveys, are about as big as you usually see. You can, of course, have much larger samples if the data are collected from sources of big data, so internet data sets, for example, or maybe a cosmological data set if you're surveying stars or something like that.
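
The point that a representative sample beats a merely large one can be sketched with a small simulation. Everything here is invented for illustration: the population of 100,000, the 30% "wealthy" share, and the voting rates are assumptions, not figures from any real poll, and the fixed seed is only there to make the run repeatable.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 30% are "wealthy"; wealthy people back candidate A
# 70% of the time, everyone else 40% of the time. All numbers are made up.
population = []
for _ in range(100_000):
    wealthy = rng.random() < 0.30
    votes_a = rng.random() < (0.70 if wealthy else 0.40)
    population.append((wealthy, votes_a))

def share_for_a(people):
    """Fraction of the given people who vote for candidate A."""
    return sum(votes_a for _, votes_a in people) / len(people)

true_share = share_for_a(population)

# Random sample: every member has the same chance of being picked.
random_sample = rng.sample(population, 2_000)

# Biased sample: ten times larger, but drawn only from the wealthy.
wealthy_only = [person for person in population if person[0]]
biased_sample = rng.sample(wealthy_only, 20_000)

print("population:   ", round(true_share, 3))
print("random 2,000: ", round(share_for_a(random_sample), 3))
print("biased 20,000:", round(share_for_a(biased_sample), 3))
```

Under these assumptions the small random sample should land close to the true share, while the much larger wealthy-only sample settles confidently on the wrong answer, because more data from an unrepresentative group only measures that group more precisely.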

Starting point is 00:24:05 It very much depends on the source how many data points you're getting. The rarer the thing that you're trying to observe is, obviously the larger your sample needs to be, because if, on average, only 1 in 100,000 people or stars or molecules or whatever exhibits the behavior that you're interested in, then you're probably going to have to survey at least 100,000 to see one, and then if you want to see at least a handful of them, then you need to survey several hundred thousand.
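
The arithmetic behind that rule of thumb can be sketched as follows. It assumes each member of the sample independently has the same 1-in-100,000 chance of showing the behaviour, which is a simplification; the variable names are just for this example.

```python
one_in = 100_000        # the behaviour shows up in 1 of every 100,000 members
sample_size = 100_000

# Average number of cases you would expect to find in the sample.
expected_cases = sample_size / one_in
# Chance of finding at least one case, assuming independent members.
p_at_least_one = 1 - (1 - 1 / one_in) ** sample_size
print(expected_cases)             # 1.0
print(round(p_at_least_one, 3))   # 0.632

# Sample size needed to expect a handful of cases, say five.
wanted_cases = 5
needed = wanted_cases * one_in
print(needed)                     # 500000
```

Note that surveying 100,000 gives one case on average but only about a 63% chance of seeing any at all, so "at least 100,000" really is a minimum.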

Starting point is 00:24:29 So the sample size you need depends upon how rare the thing is that you're looking for. Okay, so that's a bit about sampling methods. That is where we get our sample from and how we get our data. So once we've got our sample, and hopefully it's representative of the population, otherwise we're not going to get very much traction out of this, so we've got our representative sample, and again, hopefully it's large enough so that we can make meaningful claims. What happens next?

Starting point is 00:24:52 Well, there are two main branches of statistical analysis: one is called descriptive statistics and the other is called inferential statistics. Descriptive statistics is in some sense the easier, but in another sense also the more important, because it comes up probably more often, particularly for everyday people. It's more likely you're going to encounter descriptive rather than inferential statistics, and although it's quote-unquote simpler, it's still very easy to abuse. You've probably heard the phrase "lies, damned lies, and statistics". Well, all too true, unfortunately, because statistics can be so easily abused, and that starts even with fairly simple descriptive statistics. So descriptive statistics simply describe things that we measure in the sample. Again, sample, not population. That's very important. So there are a number of things that you might want to describe. You might want to describe what are called measures of central tendency. So that is, where's most of the data sitting? Another thing you might want to do is describe spread. So what's the range? How does it vary? What are the biggest

Starting point is 00:25:51 and smallest values and so on. So there are some key concepts here that are relevant. There are three main measures of central tendency, or averages, as they're sometimes called. Now, I don't like the word average because it's ambiguous. There are three different types of average. If you ever hear someone talking about the average in any sense where they're actually giving you numbers, you should always ask them what measure of average they're using. You've probably heard of these three: the mean, the median, and the mode. They're all sometimes called the average. They're all different ways of measuring the average or the middle, what's the central tendency, as it says, of the data.
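
As a quick illustration of why it matters which average is meant, here is a minimal sketch using Python's standard `statistics` module. Both lists are made-up numbers chosen only to show how the three measures can disagree.

```python
from statistics import mean, median, mode

# Hypothetical yearly incomes: four ordinary values and one very large one.
incomes = [35_000, 40_000, 48_000, 50_000, 10_000_000]
print(mean(incomes))    # 2,034,600: add everything up, divide by the count
print(median(incomes))  # 48,000: the middle value once the list is sorted

# Hypothetical answers to "how many children do you have?"
children = [0, 1, 2, 2, 3, 2, 1]
print(mode(children))   # 2: the value that occurs most often
```

With these numbers the mean is about forty times the median, so a report that just says "the average income" could describe either figure.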

Starting point is 00:26:27 The mean is the most common form of average that's used. You calculate the mean simply by taking all the values, adding them up, and dividing by the number of values. So, for example, if I say the average height of Western males is 178 centimeters or whatever it is, that's taken essentially just by measuring the height of a whole bunch of people, adding it up and dividing by the number of people. The median is the middle value. So if I measure the heights of a whole bunch of people, put them in order, and find the middle one

Starting point is 00:26:54 or, if it's an even number, I just take the two middle ones and average those, that is, take the mean of those two, then the middle one is the median. It's the one that sits in the middle. And the mode is the value that occurs most often. The mode is more useful for categorical analysis. So, for example, I could ask,

Starting point is 00:27:11 "How many children do you have?" of, say, couples between the ages of 25 and 45. And I don't know what the mode is these days, maybe one or two or something like that. And that's going to differ between countries. But it's the value that occurs most often. Now, if the data is normally distributed, remember, if it follows that nice bell curve

Starting point is 00:27:28 that I mentioned before, mean, median and mode will all be identical. It will all be the same. However, if the data is not normally distributed, if it's skewed, then mean median and mode will not be the same. So one example of a variable that is normally distributed is heights.

Starting point is 00:27:46 So there, the mean, median, and modal height should all be the same, pretty much, because they do follow that normal distribution. It's just as likely to have a very high height as it is to have a very low height, and most people sit in the middle. A good example of a variable that does not follow the normal distribution is income, and that's because incomes are skewed. That is, it's true that most people have incomes sort of around the middle,

Starting point is 00:28:09 but there are some people who have very, very high incomes. You can't correspondingly have very, very low incomes, because essentially you can't have less than zero income. It is possible to have negative incomes if you have a business, say, but often that's not measured. But there are a small number of people who have exceptionally high incomes, and that skews the data. So if you add up all of the incomes and divide it by the number of people, you actually get a higher number than if you take the middle value. That's very intuitive if you just imagine it. If I have five people, you know, one of them earns 40,000 a year, one of them earns 50,000 a year, one

Starting point is 00:28:42 earns 35,000 a year, one earns 48,000, and one earns 10 million dollars a year. Now what's the average of that going to be? It works out to a bit over two million, because it's pulled up by that very high number. The median, of course, is going to be 48,000, because most people earn around that value. But the mean is pulled way up into the millions by that one outlier, essentially, one very high value. So that's called a skewed distribution, and it results in a median and mean that are very different from each other. So you've got to watch out for that. Skewed distributions will have different means and medians and

Starting point is 00:29:19 modes, and if it's not reported which type of average is being utilized, then you should try and find out, because they can differ substantially. Especially when it comes to anything that's measured in dollars, that's very important to bear in mind. So those are some measures of central

Starting point is 00:29:35 tendency. They're things we measure about the sample that we've collected to describe where the middle of it sits. Measures of spread describe, well, the spread, or how the data lies around the mean. A common measure of spread is range. The range is simple. You just take the biggest number in your sample

Starting point is 00:29:51 and subtract the smallest number, so it just tells you the difference between the biggest and the smallest. The interquartile range is essentially similar to range, except it restricts itself to the middle half of the data, if you like. So it's a way of cutting off outliers. So never mind where the very highest and the very lowest are. What's the difference between the 25th and the 75th percentile?

Starting point is 00:30:13 or, basically, for the middle half of the data, what is the difference between the highest and lowest, considering only the middle half of the data? That's the interquartile range. So it's really similar to range. Variance is another very common measure, or standard deviation, they're very similar. These measure the typical variation or typical difference between each value and the mean. Strictly speaking, the variance is the average squared deviation from the mean. On average, how far away from the mean is each observation, or each value that we've measured? That's what the variance and the standard deviation tell you. The difference between standard deviation and variance is that one of them is the square root of the other.
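The measures of spread just described can be sketched in a few lines of Python. The eight data values are invented for illustration. Two choices here are assumptions rather than anything stated in the episode: the `"inclusive"` quartile convention (there are several in common use), and dividing by the number of values rather than by one less than it when computing the variance.

```python
import math
import statistics

# Hypothetical sample of eight measurements (made up for illustration).
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: biggest value minus smallest value.
value_range = max(data) - min(data)

# Interquartile range: 75th percentile minus 25th percentile.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Variance: the average squared difference between each value and the mean.
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

print(value_range, iqr, variance, std_dev)
```

The standard deviation is in the same units as the data, which is why it is usually the easier of the two to read as a typical distance from the mean.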

Starting point is 00:30:50 But for our purposes, they measure the same thing. The average distance away from the mean, basically. That's a very useful measure of spread. High variance means data is very spread out. Small variance means they're all clustered closely together. So human height has a relatively low variance. You know, most people are between, say, 180, give or take 20 centimeters or something like that, 20, 30 centimeters. Whereas things like income have a much higher variance, because there are people just all over the map from zero up to many millions.
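To make the effect of skew concrete, here is a small sketch using the five incomes from the example earlier in the episode. It compares the mean and the median, and then drops the one very high income to show how much a single value moves the mean. The cut-off of one million used to drop the outlier is an arbitrary choice for this illustration.

```python
import statistics

# The five incomes from the example in the episode (dollars per year).
incomes = [40_000, 50_000, 35_000, 48_000, 10_000_000]

mean_income = statistics.mean(incomes)      # pulled up by the one huge value
median_income = statistics.median(incomes)  # middle value, barely affected
income_spread = statistics.pstdev(incomes)  # standard deviation, also inflated

# Dropping the outlier shows how much a single value moves the mean.
without_outlier = [x for x in incomes if x < 1_000_000]
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)

print(mean_income, median_income, income_spread)
print(mean_without, median_without)
```

With the outlier included, the mean sits in the millions while the median stays among the typical incomes; without it, the two are close together.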

Starting point is 00:31:21 Okay, so those are some measures that we use to find out information about the sample that we've collected from our population. Now, there's a few other key concepts that I want to talk about here before we move on to inferential statistics. First, the difference between a dependent and an independent variable. Now, a variable is just something that we measure whose value changes. So, again, income, height, weight of an animal, whatever. The dependent variable is usually the variable, or the number that we're most interested in. We're interested in learning about the dependent variable. So perhaps it's health outcome, for example, the health status of a person.

Starting point is 00:32:00 The independent variable is usually some other variable, or variables. You can have more than one independent variable, you can have as many as you like, really: other variables that can change and that affect the dependent variable. The idea is that the dependent variable is dependent upon the independent variables, whereas the independent variables are generally not dependent upon anything, or that's at least the usual idea. In practice, they can be dependent on all sorts of things, including each other, but that just complicates the analysis. The independent variables just do their thing. The dependent variable depends upon the independent variables. So, for example, if I wanted to measure a health outcome, say obesity, for example, the dependent variable could be the BMI, the body mass index. That's the thing I'm interested

Starting point is 00:32:50 in. That's my measure of health. That's not actually a very good measure of health, but whatever. We'll use it for now. Independent variables could be anything that you think might affect that dependent variable. So, for example, the amount of exercise that they get in an average week, or their age, whether they're diabetic, whether they have certain genetic predispositions, whether they're a smoker, how much alcohol they drink. All these sorts of things could be the independent variables that can affect the dependent variable. Now, the purest form of this is when you actually manipulate, or can manipulate, the independent variables, so that you can change them at will. So in an experimental setup, you're able to do that. So I can give one person

Starting point is 00:33:27 a drug and I give another person a placebo, a fake version of the drug. And so the independent variable is whether they got the real drug or not, or maybe even what dosage of the real drug they got. And then the dependent variable is some health outcome that we think will depend upon the dose of the drug that they got. But in many circumstances, we can't actually set the independent variable. You know, you can't decide how much people smoke or how often they exercise. That's just given to you. You measure it. But nevertheless, it's independent in the sense that it's not the main thing we're interested in, but we think that it might affect that main thing we're interested in. So, as I described it, it sort of does its own thing. It's given to us.

Starting point is 00:34:06 Either we set it or it's given to us, and then we look at what effect it has on the thing I really care about, which is the dependent variable. Another important concept is correlation. Correlation measures the relationship between two different variables. Specifically, it measures the linear relationship between them, which just means a straight-line relationship. If two variables are positively correlated, it means that when one goes up, the other goes up in some linear relationship. So there's a straight line going up that relates the two. So for example, there's a positive correlation between the amount of sunlight that a location gets every year and the skin cancer rates in that area. I actually haven't looked that up, but I assume that's the

Starting point is 00:34:47 case. It makes sense because they would go together, right? The more of one, the more of the other. A negative correlation is when there's a negative relationship between the two, so more of one means less of the other. So you'd expect there'd be a negative correlation between weekly exercise and body fat, or the proportion of body fat. Generally, if people do more exercise, they probably burn more energy and have lower body fat. Now, a correlation isn't perfect, obviously. It's usually not perfect. A correlation doesn't mean that one variable is the only thing that matters for the other, but it means that on average there's a relationship. So, correlations can be stronger or weaker. Weak correlations mean that, you know, there's only

Starting point is 00:35:24 a small relationship between the two variables. With a strong correlation, there's a very tight relationship between them. Or in other words, most of the variation in one variable can be explained by the variation in the other. Correlations come all over the map in terms of strong or weak, or you can, of course, have no correlation, which means one variable doesn't tell you anything about the other. Eye color and IQ, for example, as far as I know, are not at all correlated, nor would you expect them to be. Now, it's important to understand that, first of all, correlation only measures linear relationships, so variables can have all sorts of other relationships between them. You can have a quadratic relationship, for example, which

Starting point is 00:35:58 is that when a variable first starts increasing, then the other variable increases. So it's sort of linear at first, but then it sort of flattens out and actually decreases if you get too much of something. So, I mean, pretty much any substance that an animal ingests is, well, not pretty much any, but many substances are like this. For example, I could measure

Starting point is 00:36:16 some health outcome against, say, fluid intake per day. Now, if I give hardly any fluid intake, the animal's going to die. If I give just a little bit, it's going to scrape by, but it's not going to do very well. If I give a bit more, then it's going to do okay. A certain amount of fluid intake, like water I'm talking about here, will be optimal. But then if I give too much, it's going to start to have detrimental effects,

Starting point is 00:36:37 and eventually it will die if I give it too much. So there's going to be a non-linear relationship there, at least if I vary the amount of fluid intake over a wide enough range. So in that case, the correlation might actually be zero. The correlation can be zero if you have a non-linear relationship, even though, if you plot the data, you can see clearly there's a relationship between these variables. It's just more complicated; it's not a straight-line relationship. So you've got to watch that.
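The point about non-linear relationships can be sketched with a small Python example. It computes the usual Pearson correlation coefficient by hand for two made-up data sets: one where the outcome rises in a straight line with fluid intake, and one where the outcome follows an inverted U with an optimum in the middle. The intake levels and both outcome formulas are assumptions chosen purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient: strength of the linear relationship."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical fluid intake levels, symmetric around an optimum of 5.
intake = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# A straight-line relationship: more of one always means more of the other.
linear_outcome = [2 * x + 1 for x in intake]

# An inverted-U relationship: best at the optimum, worse on either side.
curved_outcome = [25 - (x - 5) ** 2 for x in intake]

r_linear = pearson(intake, linear_outcome)
r_curved = pearson(intake, curved_outcome)

print(round(r_linear, 3), round(r_curved, 3))
```

The curved outcome is completely determined by the intake, yet its correlation comes out at essentially zero, because the rising half and the falling half cancel out when you try to fit a single straight line.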

Starting point is 00:36:58 Just because two variables aren't correlated doesn't mean there's no relationship between them. It just means there's no linear relationship between them. So that's the first point. Second point is that correlation does not imply causation. You've probably heard this before. It's a little bit of a catchphrase these days. And that's okay if it helps people to remember. But it is important to remember what that actually means.

Starting point is 00:37:17 It's a bit unfortunate because imply has different meanings. Imply, in this sense, just means the strict logical sense. That is, it doesn't follow logically that one variable causes another merely because they're correlated with each other. However, correlation does imply causation in the sense that if two things are correlated, that is an indication that there may be a causal connection between them. So it indicates causation, depending on the context, of course, but it doesn't prove it.

Starting point is 00:37:44 You need to go and look at some more details if you want to get stronger evidence of causation. Now, the reason correlation doesn't imply causation in this sense is because there could be other confounding factors. For example, there could be a third variable which is affecting both of the variables in question. So one of the famous examples of this is there's a negative correlation between global temperatures over the past couple of centuries

Starting point is 00:38:08 and the number of pirates in the world. So does that mean that there's a causal connection, that pirates cause global cooling or that hot weather prevents pirates or something? No, of course not. There's no direct causal relationship there. The reason for this correlation is that there's a third factor, namely industrialization.
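Here is a rough simulation of how a third factor can manufacture a correlation. Everything in it is invented: an "industrialization" index is drawn at random, a temperature value is built to rise with it, and a pirate count is built to fall with it, with no direct link between temperature and pirates. The sketch then removes the straight-line effect of industrialization from both variables and correlates what is left over, which is one simple way (not the only one) of adjusting for a confounder.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def residuals(ys, zs):
    """What is left of ys after removing its straight-line dependence on zs."""
    n = len(ys)
    mean_y, mean_z = sum(ys) / n, sum(zs) / n
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
             / sum((z - mean_z) ** 2 for z in zs))
    return [(y - mean_y) - slope * (z - mean_z) for y, z in zip(ys, zs)]

random.seed(0)  # fixed seed so the sketch is repeatable

# Made-up confounder that drives both of the other variables.
industrialization = [random.gauss(0, 1) for _ in range(2000)]
# Temperature rises with industrialization; pirate numbers fall with it.
temperature = [z + random.gauss(0, 0.5) for z in industrialization]
pirates = [-z + random.gauss(0, 0.5) for z in industrialization]

raw_r = pearson(temperature, pirates)
adjusted_r = pearson(residuals(temperature, industrialization),
                     residuals(pirates, industrialization))

print(round(raw_r, 2), round(adjusted_r, 2))
```

The raw correlation is strongly negative, but once the shared dependence on the third factor is taken out, the leftover correlation is close to zero.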

Starting point is 00:38:31 Industrialization pumps greenhouse gases into the atmosphere, which warms the planet, but it also makes it easier to police piracy, through communications and other technologies that allow us to more easily get rid of pirates, and it probably reduces the returns on piracy as well. So there's a third factor here that explains the correlation without there being any direct causal relationship between the two variables in question, and that's quite common, especially when we're looking at complex correlations in social and political arrangements. You've always got to look for confounding factors and confounding variables. If you don't do that, then there's no way of knowing whether the correlation that you've identified is causal or not. Okay, so having outlined those basic

Starting point is 00:39:11 concepts of descriptive statistics, now I want to move on and talk about inferential statistics. Inferential statistics is probably the harder aspect of statistics, and it's the bread and butter that you would study if you do a statistics course. Most of it's probably going to be inferential statistics, at least if you do it at a university level. So what is inferential statistics? Well, remember, descriptive statistics is just about describing my data in a sample. I've got all the data from my sample.

Starting point is 00:39:40 I can just look at it and do whatever calculations I like, and it's all there. I'm just describing what I see in different ways to help tease out interesting things about it. Inferential statistics is different because it goes beyond the data that I actually have and tries to say things about the population as a whole. Remember that the only numbers I actually have are numbers about my sample. That is my sample data. And I can produce descriptive statistics to describe the sample. But in inferential statistics, I want to move beyond that and say things about the population.

Starting point is 00:40:08 Because remember, it's the population that we actually care about. I don't actually care about the hundred or so people I just telephoned. I don't really care who they're going to vote for. I care about who the country as a whole is going to vote for. I don't really care what effect the drug has on these few dozen people I tested in the clinical trial. I care about what effect the drug is going to have on the population of people who are actually going to use it

Starting point is 00:40:29 if I were to release it. So the point of inferential statistics is to take the information from the sample and use it to say meaningful things about the population that we actually care about. Now, the trouble with doing this, of course, is that we don't have information directly about the population as a whole.

Starting point is 00:40:46 We only have information about the sample. So inferential statistics is all about going from that sample information to population information, and what we can and can't say about the population based on the sample. And this is why it's so important, as I said before, to have a representative sample.

Starting point is 00:41:00 Because if your sample doesn't look like the population, if it's not representative, then you can't really say anything about the population as a whole based on that sample. Because it doesn't look like the population. You can only really say meaningful things about the population if the sample is reasonably representative. And even there, there are strict limitations

Starting point is 00:41:15 about what you can say. So how do we do that? We use these things that are called estimators. These are basically formulas. So if you do statistics, a lot of the formulas that you see are actually estimators, or formulas related to the estimators.

Starting point is 00:41:27 They use information from the sample to calculate likely values of population parameters. So, for example, we might be interested in the proportion of people who vote for one party in an election. Now, we can calculate that in the sample, but what we actually care about is the population

Starting point is 00:41:43 parameter of how many people in the whole country vote for this party. So I can use an estimator to go from the sample mean to estimate the population mean. The estimator is just a formula that does that. Some estimators are very simple and some are much more complicated, it depends on what we're doing. The best estimate of the population mean for many sorts of data is just to use the sample mean. That's called a point estimator. You just use one value. Of course, that's usually not enough because it's very unlikely

Starting point is 00:42:11 that the population mean is going to be exactly the same as the sample mean. There's going to be some variability, right? So we usually want a range of values. So we give what are called confidence intervals, where we put limits on what the population value is likely to be. That's why when they're reporting survey results, they'll typically say plus or minus a certain number of percentage points, because that's essentially the error bars, the confidence interval that they've constructed. Another thing that we often want to do, and this is what I want to focus mostly on here in inferential statistics, is compare two things, or two groups, two or more groups, actually, but two is the simplest case. So in an experiment we'll often have a treatment and a control group.

Starting point is 00:42:46 one group that gets the real medicine, this is the treatment or the intervention, whatever it is, and the other group that does not. Or we might want to compare wages of workers in different countries or environmental measures, rainfall, for example, in one year compared to another year, returns on different types of investments, one portfolio and the other portfolio. So wanting to compare different things is very, very common in inferential statistics. We want to say in the population as a whole, is there a difference between these two cases? Again, not just in the sample, we can easily make comparisons between samples by just looking at the number.

Starting point is 00:43:16 in one and in the other. But we want to infer, is there a real difference in the real populations, or is the difference I'm seeing here just due to chance? Because, of course, when I take a sample out of a population, I could just by chance happen to get slightly different or even very different values. For example, if I'm measuring the heights of people in a population, I could by chance happen to pick a slightly taller sample than the average population. There's a probability of doing that. The bigger the sample is, the less likely it is that all of them are sort of above average in terms of tallness, but it is possible. So, in a lot of inferential statistics, we want to compare two different groups and see if there's a

Starting point is 00:43:56 difference. This is the crux, really, the crucial thing that's done in a lot of statistical tests, which is a lot of what people study in statistics. They study these tests. Statistical tests use what are called test statistics to make this decision about whether the difference between two groups is big enough. So here's the basic idea. Here's the crucial idea behind like 90% of inferential statistics. First of all, you take your samples and you calculate means and medians and so on, whatever you need from your samples. But then what you do is you compare the difference between those, say treatment and control. You take the difference in those groups, or country A and country B, the difference in the average between those two groups.

Starting point is 00:44:34 And you calculate what's called a test statistic. So again, usually that's just the difference in the means or something like that between the two samples. That's our test statistic. Then I compare that test statistic to some critical value, as it's called. And I ask, is this test statistic big enough? That is, is the difference between these two groups big enough to say that it's meaningful, or is it so small that it's probably due to chance? The smaller the difference, the more likely it is that it's just due to chance. If I only have a very small difference between my two groups, it's quite possibly due to chance. But if the difference is very big, it's much less likely to be due to chance. So the bigger the difference, the less likely

Starting point is 00:45:15 it is due to chance, and therefore the more likely there is a real difference between the two groups in the population as a whole. So when we have a big enough difference, when this test statistic that we use is big enough, we say that we have a statistically significant result. And that's the basis

Starting point is 00:45:33 of inferential statistics. It's just comparing whether a difference is big enough or not, where big enough is determined relative to some critical value. Now, the critical value that we pick depends upon what's called the level of significance that we're interested in, or the significance value, as it's called. Usually 5% is used, or at least in many applications

Starting point is 00:45:52 it's used, but of course that's rather arbitrary. You can pick whatever value you want. Basically, the smaller the significance value that you use, the less likely that you're going to falsely declare something to be different when, in fact, it is not. So it depends on how cautious you want to be, basically. The basic idea is just to understand what the point of test statistics and critical values is. The point is to see whether this test statistic is big compared to the critical value. If it is, then we say it's statistically significant. Why does that matter? Because it means, if a result is statistically significant, that the difference between two groups, two or more groups, is probably not due to chance. It's probably not due to us having happened to

Starting point is 00:46:32 have picked unlikely samples from the population. It's probably, rather, due to a real difference between the populations. And that's the fundamental idea of all, or nearly all, statistical inference. It's about the size of the difference in groups compared to what we would expect due to chance. Okay, so that's the crucial idea that I want to get across. Now, let me give a couple of examples of statistical tests that you may have heard of. If not, it doesn't matter too much, but this might be helpful to some people. So one is called the chi-square test. It's used when the dependent and independent variables are both categorical,

Starting point is 00:47:08 so they can only take on a few set values. So, for example, I might classify my workforce into college-educated and not college-educated. And then I could ask whether their incomes are greater or less than $50,000 a year, for example. So there are only four possibilities there. College or no college, greater or less than $50,000, as an example. Or it could be male or female, smokers and non-smokers. That's another example. Now, the basic idea of this chi-square test is that you just calculate how many observations you would expect to fall in each class if the variables were independent of each other, that is, if, you know,

Starting point is 00:47:40 there was no relationship between them, and then compare this to the actual frequencies. So, for example, I think, what, 25% of the population in Western countries smokes, let's say that, and 50% male, 50% female, roughly. So that means if I divided my population on the basis of smoking and non-smoking, male and female, then I would expect to see 12.5% in the male smoking, 12.5% in the female smoking, and then, let's see, what's left, 75%. So 37.5% in the male non-smoking and 37.5% in the female non-smoking, or thereabouts. So that's what we would expect to see if there was no relationship between gender and rates of smoking. But maybe there is, and maybe, in fact, that's what I want to test.
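The expected percentages above come from multiplying the two overall proportions together, which is what "no relationship" means here. A minimal sketch of that arithmetic, using the rough figures quoted in the episode (they are the speaker's guesses, not measured data):

```python
# Figures quoted in the episode as rough guesses, not measured data.
p_smoker = 0.25
p_male = 0.50

sexes = {"male": p_male, "female": 1 - p_male}
habits = {"smoker": p_smoker, "non-smoker": 1 - p_smoker}

# If the two variables are unrelated, the share in each cell is simply
# the product of the two overall shares.
expected_share = {
    (sex, habit): p_sex * p_habit
    for sex, p_sex in sexes.items()
    for habit, p_habit in habits.items()
}

for cell, share in expected_share.items():
    print(cell, f"{share:.1%}")
```

The four shares add up to 100%, and they reproduce the 12.5% and 37.5% figures from the example.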

Starting point is 00:48:28 Are males or females more or less likely to smoke? So I could do a chi-square test, and what I would do is simply say, okay, well, I know what I would expect if there's no difference, if it's just the same for males and females. Now I take a sample. Hopefully, again, it should be representative of the population. I ask them whether they smoke or not, and then put them in one of these tables and work out what frequencies I actually observe, and compare them to what I would expect. And then I ask, is that difference big, or is it not very big? That's the fundamental idea of a chi-square test.

Starting point is 00:49:02 Is the difference between the expected values and the actual observed values big or small? As I mentioned earlier, we call this the test statistic, so that's the actual thing that you calculate. I'm not giving you the exact formula here, but essentially you just take the difference between the observed and the expected in all of the cells. And that forms your test statistic, and then you just compare it to the critical value of your test statistic, which depends upon your significance level. So let's say you set the significance level at 0.05, which is the standard.

Starting point is 00:49:34 So then you just look up on a table or find online what the value is. Now, the critical value is actually determined by complex calculations that you will not have to do, right? Because it depends on the type of distribution that you expect your data to follow. In this case, it's a chi-squared distribution, and that has a particular equation for it, and you have to calculate what proportion of the curve fits under a certain value. You don't have to worry about that. These have all been calculated before. So all you have to do when you're actually conducting this is just,

Starting point is 00:50:05 or even looking at someone else's results, analyzing results, right? You calculate your test statistic. So how big is the difference between what I expect to see, if there's no difference between males and females in terms of smoking, and what I actually do see? So how big is that total difference? And then compare it to my critical value, which I look up from somewhere. If my test statistic is less than the critical value,

Starting point is 00:50:30 then I say, well, there's probably no difference. If, on the other hand, the test statistic is bigger than the critical value, I say I reject the null hypothesis. The null hypothesis is just what you start with, the baseline assumption. Usually it's that there's no difference between the two groups. So my null hypothesis would be that males and females smoke at an equal rate. And then I look at my data and I perform the chi-square test, and then I ask, well, can I reject that null hypothesis? If the test statistic is not very big compared to the critical value, then I can't reject the null hypothesis. In other words, there's not enough evidence to say that there's any difference between males and females in terms of smoking rates. On the other hand, suppose the test statistic is big compared to the critical value. This means that there's quite a lot of difference between what I would expect to see if there was actually no difference and what I actually do see in terms of the frequencies. In that case, I say, yes, I do reject the null hypothesis, and I infer that there is a difference between men and women in terms of smoking rates.
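The whole procedure can be sketched end to end as follows. The survey counts are invented for illustration. The episode only describes the test statistic loosely, so this sketch assumes the standard Pearson form, where each squared gap between observed and expected is divided by the expected count, and it assumes the expected counts are worked out from the sample's own row and column totals. The critical value of 3.841 is the tabulated chi-squared value for one degree of freedom at the 0.05 significance level.

```python
# Hypothetical survey of 200 people (counts invented for illustration).
observed = {
    ("male", "smoker"): 35, ("male", "non-smoker"): 65,
    ("female", "smoker"): 15, ("female", "non-smoker"): 85,
}

total = sum(observed.values())
row_totals, col_totals = {}, {}
for (sex, habit), count in observed.items():
    row_totals[sex] = row_totals.get(sex, 0) + count
    col_totals[habit] = col_totals.get(habit, 0) + count

# Expected count in each cell if sex and smoking were unrelated.
expected = {
    (sex, habit): row_totals[sex] * col_totals[habit] / total
    for (sex, habit) in observed
}

# Pearson's chi-square statistic: squared gaps, scaled by the expected count.
chi_square = sum(
    (observed[cell] - expected[cell]) ** 2 / expected[cell] for cell in observed
)

# Tabulated critical value: 1 degree of freedom, 0.05 significance level.
CRITICAL_VALUE = 3.841

reject_null = chi_square > CRITICAL_VALUE
print(round(chi_square, 2), reject_null)
```

In this made-up sample the statistic comes out well above the critical value, so the null hypothesis of equal smoking rates would be rejected. With counts closer to the expected 25, 75, 25, 75 split, the statistic would shrink and the null hypothesis would stand.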

Starting point is 00:51:31 That's the basic idea of how statistical inference works. So that's the chi-squared case. That's when both your dependent and independent variables are categorical. You may also have heard of a t-test, which is just a similar idea, except in this case, the dependent variable is continuous, so it can take on all sorts of different values. And you might have, for example, one or two types of categorical independent variables. So, for example, it could be that I've treated a bunch of rats with some sort of potential carcinogen, and I want to know how long they live after the treatment. So the dependent variable would be lifespan and the two groups would be treated with the carcinogen

Starting point is 00:52:10 and treated with the placebo. So again, I need to get a sample of rats. I give half of them the carcinogen, half of them a placebo. Then I measure how long it takes them to die. So now I have two samples, treatment and control. I calculate, first of all, the descriptive statistics for each of these groups, each of these sample groups. So the sample mean in each case.

Starting point is 00:52:32 Now, I can't just ask, is there a difference in sample means, because I don't have the populations. I haven't measured every rat in the universe. I've just measured a small number of rats. So I need to use inferential statistics to find out, is there likely a real difference in the populations, or is it probably just due to a fluke, just due to chance, that I've observed these differences that I have in the samples? So to do that, I need a test statistic. That is, I need a formula that tells me how I go from the sample statistics I have, the sample means, to figure out whether the difference between those means is big enough or not. The way it's done in the case of the

Starting point is 00:53:07 test, the t-test, is you just take the difference between the two means, so subtract them, one from the other, and then divide the result by essentially the standard deviation of the samples. There's a couple of extra numbers in there if you look at the formula that I'm going to ignore, but conceptually this is what you do. You take the difference between the average lifespan of the group of rats treated with a carcinogen and the average lifespan of the group that was not treated with the carcinogen. So you take that difference, divide it by the standard deviation, and then see whether the resulting number,

Starting point is 00:53:42 the test statistic, is bigger or less than the critical value. Now, why would you divide by the standard deviation like this? Well, the basic intuition is very simple. What I'm asking when I do this, when I calculate this test statistic, is, is the difference in the treatment and control groups large or small compared to the variation in lifespan within each of the groups. So say my rats, within each group, there's a variation of between,

Starting point is 00:54:10 I don't even know how long rats live, let's say five to ten months, just make up a time span. So one rat dies within five months, the oldest rat dies within ten months, and then the rest vary between there. So that's a five months variation. So let's say the standard deviation is five months. That's quite a lot of variation in lifespan. Now then suppose that the difference between the treatment and control groups in terms of the average lifespan, is only half a month, so two weeks. So if the variation within each group is five months, but the difference between the two groups is only half a month, then the difference between the two groups is tiny compared to the variation within each group.
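As a rough sketch of this idea, here is a two-sample t statistic worked out in Python. The lifespans are invented. The formula used is the common pooled-variance version, which is an assumption on my part since the episode deliberately skips the exact formula; the "extra numbers" it mentions correspond here to the sample sizes. The critical value of 2.228 is the tabulated two-sided value for 10 degrees of freedom at the 0.05 level, matching two groups of six.

```python
import math
import statistics

def t_statistic(group_a, group_b):
    """Pooled two-sample t statistic (assumes similar spread in both groups)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error

# Tabulated two-sided critical value: 10 degrees of freedom, 0.05 level.
CRITICAL_VALUE = 2.228

# Made-up lifespans in months: a half-month gap, lots of variation within groups.
control = [5, 6, 7, 8, 9, 10]
treated = [4.5, 5.5, 6.5, 7.5, 8.5, 9.5]
t_noisy = t_statistic(control, treated)

# Same half-month gap, but almost no variation within each group.
control_tight = [7.4, 7.5, 7.6, 7.4, 7.5, 7.6]
treated_tight = [6.9, 7.0, 7.1, 6.9, 7.0, 7.1]
t_tight = t_statistic(control_tight, treated_tight)

print(round(t_noisy, 2), t_noisy > CRITICAL_VALUE)
print(round(t_tight, 2), t_tight > CRITICAL_VALUE)
```

Both pairs of groups differ by the same half a month on average. Only the second pair produces a test statistic above the critical value, because there the gap is large relative to the variation within each group.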

Starting point is 00:54:47 And what does that tell me? It tells me that it's quite likely that I got the result I did just due to chance, because there's so much variation. That small difference is just not big compared to that variation. Here's another example that might help with that intuition. Say I get two students, and they're quite unreliable students. Sometimes they do really well on a test, another time they do not so well on a test. So they'll have big variation in terms of their test scores, a big standard deviation.

Starting point is 00:55:11 Now, suppose I wanted to ask which is the better student. If I took the average of their tests, say over a year, and one student got an 86 average and the other one got an 84 average, that's a tiny difference compared to the big variation that I've hypothesized that they have because they're unreliable. So I can't say on that basis that the 86 student is better than the 84 student, because it's probably just due to chance, because they're both very inconsistent. If they're varying between 64s one week and 100s the next week, a difference of two between them is just not meaningful. That's a tiny difference compared to how much each of them varies on their own.

Starting point is 00:55:48 On the other hand, if I had a different group of two students, one of whom is pretty consistent, they generally get between, say, a 75 and an 85, so it's a pretty small variation. The other one pretty consistently gets between, say, a 55 and a 65. There's a very clear difference in those. The difference between those two students is much bigger than the variation within each student. So I can pretty confidently say that one is a stronger student than the other. So you see the clear difference here is whether the difference in the groups is big compared to the variation within each group. If it is, then my test statistic will be big, and then it will likely be larger than the critical value.
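The two pairs of students can be compared with a rough sketch like the following. The individual scores are hypothetical, invented only so that the averages and ranges match the ones described (86 versus 84 for the unreliable pair, roughly 75 to 85 versus 55 to 65 for the consistent pair), and `welch_t` is an illustrative helper name.

```python
import math
from statistics import mean, variance

def welch_t(scores_a, scores_b):
    """Difference in mean scores divided by the standard error of that difference."""
    se = math.sqrt(variance(scores_a) / len(scores_a)
                   + variance(scores_b) / len(scores_b))
    return (mean(scores_a) - mean(scores_b)) / se

# Hypothetical test scores, invented to match the averages described in the example.
unreliable_a = [64, 100, 78, 96, 92]   # averages 86, swings widely
unreliable_b = [66, 98, 74, 100, 82]   # averages 84, swings widely
consistent_a = [75, 85, 80, 78, 82]    # averages 80, stays within 75-85
consistent_b = [55, 65, 60, 58, 62]    # averages 60, stays within 55-65

t_unreliable = welch_t(unreliable_a, unreliable_b)
t_consistent = welch_t(consistent_a, consistent_b)
print(f"unreliable pair: t = {t_unreliable:.2f}")
print(f"consistent pair: t = {t_consistent:.2f}")
```

For the unreliable pair the statistic is a small fraction of 1, while for the consistent pair it is many times larger, which mirrors the intuition about between-group difference versus within-group variation.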

Starting point is 00:56:23 The critical value is just sort of the threshold, the cutoff, right, which is arbitrary, as I said, but you do need to pick one. And then if the test statistic is bigger than the critical value, we say, well, we reject the null hypothesis, we reject the idea that there's no difference between those groups of rats or those students or those populations of people, whoever we're talking about, and conclude that there is, in fact, a difference. That's the basis of the T-test. The final example of inferential statistics that I wanted to briefly mention is called linear regression.

Starting point is 00:56:52 Linear regression is a very fancy title for drawing lines through dots, because that's essentially what it is. You've probably seen scatterplots before where I draw an X-axis and a Y-axis and plot data on them. Heights and weights of students in a class, for example, might be one that you've seen, or maybe it's years of education and income of a person, population and land area of a country, all sorts of things. Now I can ask if these two variables are correlated, and we talked about correlation, but another thing I can do is ask specifically what relationship exists, what linear relationship exists between the two.

Starting point is 00:57:27 And so I can do that by drawing a line through those dots. You know, the dots, they're sort of scattered all over the place, but I can sort of eyeball it and draw a line through it and say, well, roughly this is the relationship I would expect. So, you know, for every one centimeter taller that I get, I expect the student to be, you know, however many grams heavier. On average, obviously there's variability around that. Now, regression is just all about saying,

Starting point is 00:57:48 how do you draw that line? Intuitively, you can just say, well, it looks like this is a good fit through the dots. Regression has more sophisticated and more exacting methods for determining where that line goes. Usually what's used is what's called the least squares method, that is, it tries to minimize the sum of squared residuals, which essentially means you pick the line such that the average squared distance between each point and the line at that value on the x-axis is minimized. Now, don't worry if you can't understand

Starting point is 00:58:18 exactly what that means. It's very hard to explain that without a diagram. But the basic point is simply that linear regression attempts to fit the best line through a set of points according to that criterion of minimizing the average distance between the point and the line, basically. Minimizing how wrong you are on average. And then once we take that linear regression, that fitted line, then we ask, well, is there a positive or negative or no relationship between these variables? One of the advantages of linear regression is that it can allow you to control for different variables. So to take a more realistic example,

Starting point is 00:58:54 if I want to show that smoking cigarettes causes lung cancer, it's not enough just to say that there's a correlation between the number of cigarettes you smoke and the incidence of lung cancer. And this is what the tobacco companies pointed out in the early days of this research. Correlation is not enough, because maybe smokers don't exercise as much,

Starting point is 00:59:11 and that can correlate with cancer. I'm not sure about lung cancer specifically, but definitely other types of cancer. Maybe they have unhealthy diets. Maybe they drink more alcohol, which is quite plausible because people often drink and smoke together. Maybe they're poorer or richer on average, and maybe that makes a difference. It could be all sorts of other things that make a difference.

Starting point is 00:59:26 So what you need to do, or what you could do, is conduct a linear regression. So basically you get data from a variety of people concerning all of these things. So diet, alcohol consumption, income, race, anything that you think could have any bearing on lung cancer, put all those variables in and then ask, after controlling for all these other variables, after already including their impacts or their effects on the dependent variable, does smoking, or does the number of cigarettes smoked, still have an effect by itself on the incidence of lung cancer? And the answer is yes, even controlling for everything else, it still has an effect.
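What "controlling for" another variable means can be sketched with a toy least-squares example. Everything here is an assumption for illustration: the numbers are made up and noise-free, not real health data, and the names `fit_line` and `residuals` are hypothetical helpers. The sketch uses a partialling-out approach (fit a line on the control variable first, then compare what is left over), which for a case like this gives the same coefficient that a regression including both variables would.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares line: slope and intercept minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return slope, y_bar - slope * x_bar

def residuals(x, y):
    """What is left of y after removing the part predicted by a line in x."""
    slope, intercept = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Toy, noise-free numbers (not real data): risk is built as 2*cigarettes + 3*alcohol,
# and cigarettes and alcohol tend to rise together.
alcohol = [0, 1, 2, 3, 4, 5]
cigarettes = [1, 0, 3, 2, 5, 4]
risk = [2 * c + 3 * a for c, a in zip(cigarettes, alcohol)]

# Ignoring alcohol: the slope on cigarettes also absorbs alcohol's effect.
naive_slope, _ = fit_line(cigarettes, risk)

# Controlling for alcohol: remove alcohol's linear contribution from both
# variables, then fit a line through the leftovers.
controlled_slope, _ = fit_line(residuals(alcohol, cigarettes),
                               residuals(alcohol, risk))

print(f"ignoring alcohol: {naive_slope:.2f}")
print(f"controlling for alcohol: {controlled_slope:.2f}")
```

In this constructed case the naive slope overstates the effect of cigarettes, while the controlled slope recovers the value of 2 that was built into the toy data, and it is still not zero once alcohol is accounted for.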

Starting point is 01:00:06 Now, that's pretty strong evidence that smoking does cause lung cancer. So, to sum up, the main thing I wanted you to take away from this episode was an understanding of what we're doing when we're doing statistics. Fundamentally, we're making statements about probability, what's likely to have happened, what's likely to be the case in the population as a whole. That is, the set of all the things we're interested in, especially on the basis of samples,

Starting point is 01:00:30 subsets of that population that we collect information about. Inferential statistics is the process of making inferences about the population from a sample selected from that population, a sample which hopefully should be representative of the population as a whole. The basic core idea of inferential statistics is simply to say, is the difference between these two groups, or two or more groups, bigger or less than what we would expect by chance? If it is bigger, then we can, as we say, reject the null hypothesis and conclude that the result is statistically significant. If not, then we can't reject

Starting point is 01:01:02 the null hypothesis, and we can't say anything about whether there's likely to be a difference. So it's all about fundamentally measuring whether the difference is big enough. And often that's on the basis of, as I talked about in the case of the T-test, is the difference between groups big compared to the variation within each group. If it's not, then probably the difference between groups is just due to chance. On the other hand, if it is big compared to the variation within each group, then it's probably not due to chance. And we can therefore reject the null hypothesis and conclude there probably is a real difference in the populations here. That's the fundamental basis of inferential statistics. And if you're ever studying statistics or trying to understand or interpret them, just

Starting point is 01:01:39 ask yourself this question. What is it that they're trying to compare the difference between, and how are they deciding whether this difference is big or small? And essentially that constitutes asking what is their test statistic and what is their critical value, what's their null hypothesis and so on. But you don't always have to formulate it in such explicit terms. Unfortunately, statistics can often be used to obfuscate more than it's used to clarify, depending on the purpose. So it's very easy to make statistical analysis very opaque.

Starting point is 01:02:05 That is, it's very unclear exactly what is being done or why or what they're looking for, or what they did, or where the data came from, and what the data means and so on and so on. So you should always be cautious of results like this that are not reported clearly and explained clearly. But anyway, hopefully this episode has given you some tools to better understand statistics, and more critically evaluate statistical results that are given to you.

Starting point is 01:02:31 I'm thinking that there's quite a few things that I didn't get to discuss in this episode, and so it might be good to do a follow-up, maybe looking at common statistical fallacies and pitfalls or common abuses of statistics or something like that, because there's some very interesting cases that I could focus on. But anyway, hopefully you found that interesting. You may have noticed that this has been the first episode for a little while now.

Starting point is 01:02:50 I have been fairly busy with actually a book that I've been writing over the past, actually 18 months, really. However, that's mostly done now, so I'm hoping that I might have more time to devote to the podcast, because I really would like to return to a more regular episode schedule. I know I have said that before, but, well, before I hadn't finished my book, so maybe this will actually happen this time. So anyway, I've got quite a few topics that I'm thinking of preparing episodes on, so stay tuned for that. As always, if you enjoy the show, please take the time to rate or review the podcast on the aggregator of your choice. iTunes is still one of the most popular, I believe. If you'd like to contact me, you can do so at FODs12 at gmail.com.

Starting point is 01:03:31 That's FODS12 at gmail.com. Thank you for listening, and I'll talk to you next time.
