Naked Statistics: Stripping the Dread from the Data | Charles Wheelan

Saurav Poudel

Introduction:

Let’s be honest. Math isn’t always easy, and not everyone finds it fun. Yet, math surrounds us, especially statistics, which we use in our daily conversations without even realising it. Beneath the complex-looking equations and formulas, the underlying ideas of math and statistics are not only relevant but also interesting. The key is to separate the important ideas from arcane mathematics and technical jargon. Hence the name, Naked Statistics.

Chapter 1: What’s the Point?

This is key to learning any concept. Our first question should always be, “What’s the point of this?” Statistics exist to make our lives easier by helping us navigate the information around us. Let’s say you’re traveling to a distant tropical island and encounter people of varying heights. The first use of statistics is to help you describe your observation using a few summarised metrics, such as the average height of the people. Next, you can infer the average height of the entire island population based on what you observed. Then, you can even investigate if there’s a relationship between height and social hierarchy on the island. As we see, statistics can be used for description, inference, and even detective work, depending on our needs.

Chapter 2: Descriptive Statistics

As mentioned above, an important application of statistics is describing information using simplified metrics. For example, the average height on the island might be 5'6". You could also use the median, which separates the data into two equal halves (e.g., a median of 5'4" means half the data points are below 5'4"). Alternatively, you could use the mode, the most frequently occurring data point. These three metrics are collectively known as measures of central tendency.

Sometimes, absolute scores (like a height of 5'6") aren’t as informative as we’d like. For example, it’s hard to gauge if a score of 75 out of 100 is good without context. In this case, relative scores, like percentiles, can be more informative. A score in the 75th percentile means it’s higher than 75% of other scores.

The spread of the data, or variance, also provides valuable information. Its close relative, the standard deviation (which you might remember from high school), tells us how far data points typically sit from the mean.
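If you'd like to see these metrics in code, here's a minimal Python sketch using made-up heights (in inches), relying only on the standard library's statistics module:

```python
import statistics

# Hypothetical island heights in inches (66 inches = 5'6")
heights = [62, 64, 66, 66, 67, 68, 70, 71, 72, 75]

print(statistics.mean(heights))    # average height (68.1)
print(statistics.median(heights))  # half the heights fall below this value (67.5)
print(statistics.mode(heights))    # the most frequent height (66)
print(statistics.stdev(heights))   # how far heights typically stray from the mean

# A relative score: the percentile rank of a 70-inch islander
rank = sum(h < 70 for h in heights) / len(heights) * 100
print(f"70 inches is taller than {rank:.0f}% of the sample")
```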

Chapter 3: Deceptive Description

Using different scores to describe the same information can also lead to the deceptive use of statistics. As Mark Twain famously said, “There are lies, damned lies, and statistics.” Depending on the intention, people can use statistically correct but misleading metrics.

For example, using the average wealth in a room with Bill Gates gives a distorted picture, where everyone seems to be a multimillionaire. You could use the median to protect against these outliers. But this very feature of the median — being immune to outliers — can be problematic if the outliers contain important information. For example, if you run a course to improve the scores of weak students, and the course significantly helps the bottom 5 or 10 percent, using the median won’t show this improvement (which the mean would). The secret is to use these metrics appropriately. Mathematics alone cannot replace human judgment, a fact that will become even more important in an AI-driven world.
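Here's a quick sketch of the Bill Gates effect with made-up incomes; one extreme value drags the mean into the stratosphere while the median barely moves:

```python
import statistics

# Hypothetical annual incomes in a room, in dollars
regulars = [40_000, 45_000, 50_000, 55_000, 60_000]
with_billionaire = regulars + [10_000_000_000]  # Bill Gates walks in

print(statistics.mean(regulars), statistics.median(regulars))
# 50000, 50000 -- mean and median agree
print(statistics.mean(with_billionaire), statistics.median(with_billionaire))
# ~1.67 billion "average" income, yet the median only moves to 52500
```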

Other examples of deceptive description include comparing nominal values, like the non-inflation-adjusted box office collection of Citizen Kane (1941) with that of Avatar (2009). Or metrics like success rates, which can be manipulated by only taking on cases with a high chance of success. This can have serious repercussions in fields like medicine, where seriously ill patients might be turned away to maintain high success rates.

Percentages can also be used deceptively, often in a clickbait manner. For instance, saying the use of "X" increases cancer risk by 400% (from 0.005 to 0.025), or that fines have increased by 300% (from $0.10 to $0.40). The reverse trick works too: a small percentage of a massive amount, like a "measly 2%" increase in a large defence budget, can hide an enormous absolute sum.
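To see how the same change can be framed in two very different ways, here is the cancer-risk example in a few lines of Python (treating the figures as probabilities, which is my assumption):

```python
# Relative vs absolute framing of the same change in risk
old_risk, new_risk = 0.005, 0.025
relative_increase = (new_risk - old_risk) / old_risk * 100  # 400% -- sounds alarming
absolute_increase = new_risk - old_risk                     # 0.02 -- two chances in a hundred
print(f"{relative_increase:.0f}% relative, {absolute_increase:.3f} absolute")
```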

Chapter 4: Correlation

Often in life, we are interested in more than one variable and the association between them. In our island example, we might be interested in the relationship between height and weight. One way to capture this association is to see how one variable varies with respect to another, calculated as their covariance. To make this metric scale-invariant (so it doesn’t matter if you measure height in meters or feet), we use a slight modification known as correlation. Yes, the same correlation that we use in our daily conversations.

Correlation is easy to use (a score from -1 to +1, where close to -1 means a strong negative association and close to +1 means a strong positive association). Unfortunately, it is also easier to misuse (as we will see in later chapters).
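For the curious, here's a small numpy sketch with invented height and weight data showing that covariance depends on units while correlation does not:

```python
import numpy as np

# Hypothetical heights (inches) and weights (pounds) for ten islanders
height_in = np.array([62, 64, 65, 66, 67, 68, 69, 70, 72, 75])
weight_lb = np.array([120, 128, 131, 140, 142, 150, 154, 160, 170, 185])

print(np.cov(height_in, weight_lb)[0, 1])       # covariance: units are inch-pounds
print(np.corrcoef(height_in, weight_lb)[0, 1])  # correlation: unit-free, between -1 and +1

# Switch height to metres: the covariance changes, the correlation doesn't
height_m = height_in * 0.0254
print(np.cov(height_m, weight_lb)[0, 1])
print(np.corrcoef(height_m, weight_lb)[0, 1])
```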

Chapter 5: Probability

When we think of probability, the classic coin toss example often comes to mind. The same coin toss that appears in real life only during cricket matches. So, it’s worth asking again, “What’s the point of it?” Here’s a secret: the coin toss example is used in many important daily life applications. You can turn “What is the probability of getting 70 heads in 100 tosses?” into “What is the probability of having 70 deaths out of 100 virally affected individuals?”
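If you want to put a number on that coin-toss question, here's a one-liner using the binomial formula (a sketch; the virus analogy would need the same independence assumptions to hold):

```python
from math import comb

# Probability of exactly 70 heads in 100 tosses of a fair coin
p = comb(100, 70) * 0.5**70 * 0.5**30
print(p)  # about 0.000023 -- very unlikely if the coin is really fair
```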

In life, we often deal with uncertainties and need to make the most of our known information to quantify our beliefs about the unknown. This is essentially what probability does for us. If you want to infer the average of an entire population from a sample, it's important to quantify that belief, for example by coming up with a 95 percent confidence interval. Probability is the tool that lets us do that.

Another important application of probability is in calculating the expected value, which is the weighted average of each outcome based on its probability. For example, if someone offers you two options: an 80 percent chance of winning 1000 dollars or a 95 percent chance of winning 800 dollars, which should you pick? Expected value helps in these situations. (To make it interesting, let’s add a third option: a 100 percent chance of winning 700 dollars. Our thinking can often conflict with the rules of probability, as explained in the book Thinking, Fast and Slow.)
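Here's the expected-value arithmetic for those three options, spelled out in Python:

```python
# Expected value = probability of winning x payout
options = {
    "80% chance of $1000": 0.80 * 1000,   # 800
    "95% chance of $800":  0.95 * 800,    # 760
    "100% chance of $700": 1.00 * 700,    # 700
}
for name, ev in options.items():
    print(f"{name}: expected value ${ev:.0f}")
# The first option has the highest expected value, even though many of us
# would happily take the guaranteed $700.
```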

Chapter 6: Potential Mistakes in Using Probability

The use of probability in dealing with uncertainty can easily lead to trouble, sometimes with grave consequences. The misuse of probability is often cited as one cause of the 2008 financial crash. In math, we can use complex equations to arrive at neat, precise numbers, which gives us a false sense of confidence. But remember, precision has nothing to do with accuracy. (You can be absolutely precise and horribly inaccurate by saying the square root of 50 is 13.875.) When calculating VaR (Value at Risk), a calculation similar in spirit to expected value, people failed to account for the risk of seemingly unlikely outcomes, known as fat-tail risks. For a deeper understanding of fat-tail risks, read The Black Swan by Nassim Nicholas Taleb (no, not that Natalie Portman movie!). In dealing with unknown unknowns, it's important to consider how costly that 1 percent unknown could be. This is why you should never play Russian roulette, no matter how favourable the odds seem.

Other potential mistakes in using probability include assuming independence when it doesn’t exist. For example, if the probability of a jet engine failing is 1 in 10,000, that’s a significant risk given the number of flights every day. If the safety engine also has a failure probability of 1 in 10,000, you might wrongly conclude that the probability of both engines failing is 1 in 100 million, a seemingly safe number. But the two events are not independent; conditions causing one engine to fail (like adverse weather) could also cause the other to fail.
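Here's the arithmetic of that independence trap, with the conditional probability being a number I've made up purely for illustration:

```python
# Joint failure probability if we (wrongly) assume independence
p_fail = 1 / 10_000
print(p_fail * p_fail)  # 1e-08, i.e. 1 in 100 million -- reassuringly tiny

# If the second engine is far more likely to fail *given* the first has failed
# (say 1 in 50, because both sit in the same storm), the joint risk explodes
p_second_given_first = 1 / 50  # hypothetical conditional probability
print(p_fail * p_second_given_first)  # 2e-06, i.e. 1 in 500,000
```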

The opposite mistake happens when we treat independent events as dependent. For example, my innocent gambler friend thinks the next spin will be red because the last two have been black (known as the gambler’s fallacy).

Chapter 7: Garbage In, Garbage Out

When generating insights from data, the most important thing is the quality of the data itself. Statistics is just a tool to extract insights, much like how a recipe is a tool for cooking. Even the best recipe will fail if the ingredients are rotten.

This is especially important when inferring beyond your sample to a population. It highlights the importance of careful sampling techniques and awareness of biases. There are many biases to watch out for, such as selection bias (asking opinions from your bubble) and recall bias (relying on people’s memories, which can be inaccurate). Make sure your sample is representative of the population you want to infer about. This means investing more time and resources in your data.

Chapter 8: Central Limit Theorem

Time to discuss a powerful technique in statistics, which the author calls the LeBron James of statistics. One important use of statistics is to infer something about the unknown (the population) using the known (a sample). The Central Limit Theorem (CLT) gives us the power to do this. In simple terms, it says that the means of sufficiently large samples are approximately normally distributed around the population mean. Okay, too much jargon there, so let's go back to our island for a simple example.

Let’s say the population average height is 5'7", which we can’t know directly from our sample. But if you take a sample of 100 people, it’s intuitive to think that the sample average would be around 5'7", let’s say 5'6". If we take another sample of 100, what do you think is more likely: a sample average of 5'6" or something far off like 5'3" or 5'10"? As you might infer, repeatedly calculated sample means will be different but close to the population mean, with most values lying just below or above the mean. This is the “normal distribution,” a bell-shaped curve where 68% of values lie within one standard deviation of the mean and 95% within two standard deviations.

The coolest thing about the CLT? Your original distribution (of the population or the sample) doesn't need to be normal for it to work. Let's consider a Utopian Island where incomes are spread evenly from the poorest to the richest, so the population distribution is uniform (flat) rather than bell-shaped. Even in this case, if you repeatedly take samples of 100 and calculate their mean income, these sample means will be approximately normally distributed around the population mean. Using the 68–95 rule, you can infer powerful insights, like whether a sample you observed plausibly comes from that population, based on how far its mean is from the population mean.

If you're worried about taking repeated samples, here's a secret: you don't need to. One reasonably large sample (at least 30 or so, the more the merrier) lets you stand in for the whole thought experiment. Your sample mean is already the best estimate of the population mean. To get the standard deviation of the many hypothetical sample means, divide the standard deviation of your one sample by the square root of n. Intuitively, sample means vary less than individual data points, hence the division. This standard deviation of sample means is commonly known as the standard error. (I know, the names can get confusing!)
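If you'd like to see the CLT in action, here's a small simulation sketch in Python (numpy assumed, numbers invented): it draws samples of 100 from a very non-normal income distribution and checks that the sample means cluster around the population mean with a spread of roughly sigma divided by the square root of n.

```python
import numpy as np

rng = np.random.default_rng(42)

# A decidedly non-normal population: incomes spread uniformly between 20k and 80k
population = rng.uniform(20_000, 80_000, size=100_000)

# Take many samples of n = 100 and record each sample's mean
n = 100
sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

print(np.mean(sample_means), population.mean())  # the two are very close
print(np.std(sample_means))                      # spread of the sample means...
print(population.std() / np.sqrt(n))             # ...roughly sigma / sqrt(n), the standard error
```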

Chapter 9: Hypothesis Testing

Phew, that was a lot of information. So, let’s take a little break and a huge sigh of relief! Because guess what? We have already laid the groundwork for one of the most important concepts in any research: Hypothesis Testing.

Let’s go back to our island again. Let’s say you were told that the average height on the island was 7 feet (probably to scare you!). This would be your default hypothesis before going there, which we call the Null Hypothesis in statistics. Then you start seeing data points (people, of course) and find that most of them are around 5'6". Naturally, you take more time and collect more data. But at some point, you will compare what you find with that default Null Hypothesis and ask a simple question: How likely is my observation if the Null Hypothesis were true? If it’s highly unlikely, like in your case here, you have two options. The first option is to distrust your eyes and reject what you have seen as some illusion. The second option is to reject the original Null Hypothesis. (I know what you’d do!)

You can calculate that "how likely" part (the quantification of belief) using the Central Limit Theorem and probability. Say your sample mean comes out to be 5'6" with a standard error of 0.5 feet; your 95% confidence interval would be 4'6" to 6'6" (i.e., 5'6" ± 2 * 0.5 feet). Since the 7-foot Null Hypothesis falls outside this range, you can relax with a coconut in your hand and reject the Null Hypothesis.
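In code, that test is just a few lines (working in inches to keep the units consistent):

```python
# Testing the 7-foot null hypothesis with the numbers from the text
null_mean = 7 * 12           # 84 inches
sample_mean = 5 * 12 + 6     # 66 inches (5'6")
standard_error = 0.5 * 12    # 0.5 feet, expressed in inches

lower = sample_mean - 2 * standard_error  # 54 inches = 4'6"
upper = sample_mean + 2 * standard_error  # 78 inches = 6'6"
print(f"95% confidence interval: {lower:.0f} to {upper:.0f} inches")
print("Reject the null" if not (lower <= null_mean <= upper) else "Cannot reject the null")
```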

Of course, the very nature of making predictions about the unknown means there's always a chance of making an error. There are two types of errors we could make. The first is rejecting the Null when it is true, which we call a Type I Error. This is like predicting cancer for a patient who doesn't have it, hence also known as a False Positive. The second is failing to reject the Null when it is false, which we call a Type II Error. This is like predicting no cancer for a patient who has it, hence also known as a False Negative. While both errors are costly, one may be more costly than the other, depending on the context. (Reminds me of the Hindi movie dialogue: "It's better to let ten criminals go free than imprison one innocent person.") This is another situation where judgment trumps math.

And remember, there's no free lunch. With the same evidence in hand, reducing one type of error means increasing the other; think through the criminal-innocent example above and you'll see why.

Lastly, this seemingly simple technique of hypothesis testing goes beyond statistics. It is the bedrock of scientific knowledge in general, as it provides a method to test and potentially falsify hypotheses using empirical data. (Read Karl Popper’s principle of falsifiability to delve into this in detail!)

Chapter 10: Polling

Let’s talk about polling, an interesting concept that often pops up during elections (or any clickbait articles about what 80 percent of people think about a certain topic). Even with these polls, if you go into the details, there is always some confidence interval or margin of error (like 60% ± 3%). This is almost the same as the 95 percent interval we calculated above for height. The only difference is that instead of a mean, we have a percentage (proportion).

Just as we calculated the standard error using the sample standard deviation, we will do the same here, with one caveat: the standard deviation is calculated a little differently since we are dealing with proportions. The standard error here is the square root of p * (1 - p) / n, with p being the probability of the event happening (in practice, the observed proportion). I know, it has popped up out of nowhere, so give me a minute to explain.

Let’s say you have three different coins: the first one having a 50–50 head/tail probability, the second one having a 90–10 probability, and the third one having a 10–90 probability. If we toss all three coins 100 times each, which one do you think will have more variation? Obviously, the 50–50 one, because the other two will mostly be Head, Head, Head, or Tail, Tail, Tail. As you see, the variation depends on the probability of Head (or Tail; usually, we care about Head and call it the probability of success). The closer this probability is to the extremes (0 or 1), the less the variance, and the closer it is to the middle (0.5), the more the variance. To ensure the formula gives the highest variance at 0.5, we calculate variance as p * (1 - p). Try using a calculator and see if you can beat 0.5 * 0.5 with any other probability!

Now try a simple task: Let’s say you have a sample of 2,000, and 75 percent of your sample thinks statistics is fun. Now calculate the 95 percent interval for this, so that you can be confidently cool about this cool discovery!
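Here's one way to check your answer (and the claim about 0.5) in Python:

```python
import math

# 75% of a sample of 2,000 think statistics is fun
p, n = 0.75, 2_000
standard_error = math.sqrt(p * (1 - p) / n)  # about 0.0097
margin = 2 * standard_error                  # about 1.9 percentage points
print(f"95% interval: {p - margin:.3f} to {p + margin:.3f}")  # roughly 0.731 to 0.769

# And the claim that p * (1 - p) peaks at 0.5
best_p = max((i / 100 for i in range(101)), key=lambda q: q * (1 - q))
print(best_p, best_p * (1 - best_p))  # 0.5 gives the largest variance, 0.25
```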

Chapter 11: Regression

So far, we have only briefly touched on the relationship between variables, through the concept of correlation discussed above. But correlation only gives us the strength and direction of a relationship. What if we want to go a level up and quantify that relationship, that is, find out how much change we see in one variable for a unit change in another? This is basically the concept of the slope that we dealt with in high school (or the derivative in calculus). Regression is the tool that gives us that slope-type relationship. You even get a neat y = mx + c equation.

Let’s say you want to compare income with years of education. So, linear regression gives you: Income = Coefficient * Years of Education + Intercept

The coefficient we estimate gives that slope-like association, and the intercept gives the average income when years of education is zero. I wrote "estimate" above and not "calculate" because, unlike the neat slope equation where we could calculate an exact slope, here we find the slope that best fits all the data points (typically by minimising the squared distances between the line and the points).

Okay, time to drop a truth bomb here. Just like correlation, even linear regression doesn’t give us causality. Even with the greatest linear regression equation, we can still only find the association among variables. So, again, you might ask: “what’s the point of this?” The truth is, causality is difficult to capture in the real world (more on this next), and in most cases, the best we can do is find the association among variables.

But even if you only care about association, you can easily find other pitfalls in the income and years of education example above. What if years of education have very little to do with income, and it's mostly about university reputation? Or what if it's really the professional field (like being a doctor, which takes more years of study) and not the years of education themselves? Well, this is where you encounter the true power of regression. You can find the association between years of education and income while controlling for other factors. All you need to do is add these other variables to your equation. This type of regression with multiple variables is called multiple regression.
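Here's a minimal sketch of that idea with entirely made-up data (the "experience" variable and all the coefficients are invented for illustration). Because education and experience are correlated in this fake data, the simple regression misleads, while the multiple regression recovers something close to the effect we built in:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: more education tends to mean fewer years of experience
education = rng.integers(10, 20, size=n).astype(float)
experience = np.clip(40 - 2 * education + rng.normal(0, 3, size=n), 0, None)
income = 30_000 + 4_000 * education + 1_500 * experience + rng.normal(0, 10_000, size=n)

# Simple regression: income on education only
X1 = np.column_stack([np.ones(n), education])
coef1, *_ = np.linalg.lstsq(X1, income, rcond=None)
print("education coefficient, no controls:", round(coef1[1]))

# Multiple regression: control for experience as well
X2 = np.column_stack([np.ones(n), education, experience])
coef2, *_ = np.linalg.lstsq(X2, income, rcond=None)
print("education coefficient, controlling for experience:", round(coef2[1]))
# The first estimate is dragged down by the omitted experience variable;
# the second is much closer to the 4,000 used to generate the data.
```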

Chapter 12: Mistakes with Regression

Conducting regression in statistical software is super easy, which is usually a good thing. But it also makes regression the most abused statistical tool of all. Like an electrical appliance that comes with warnings (do not use in water), linear regression comes with its own disclaimers. The problem arises when people use it everywhere, without reading them.

The first common mistake is using it for a non-linear relationship, which should be obvious given the name (linear) but sadly is not.

The second common mistake comes from reverse causation, where the output you want to explain also affects the input. For example, say you want to capture how changes in unemployment are associated with changes in GDP. The problem is that a change in GDP itself changes a country's unemployment rate. This is where the concept of dependent and independent variables matters: the explanatory (independent) variables should drive the outcome (dependent) variable, not the other way around.

The third common mistake arises due to a strong correlation between the input variables, a condition known as multicollinearity in regression. Remember, multiple regression gives the association of a variable while keeping other variables constant. This is difficult to achieve when you have two variables that are strongly correlated with each other.
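A tiny illustration of the problem, again with invented data: when one input is nearly a copy of another, the individual coefficients become unstable, even though their combined effect is still estimated fine.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost a copy of x1: strong multicollinearity
y = 3 * x1 + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)
# The individual coefficients on x1 and x2 are highly unstable (try another seed),
# but their sum stays close to the true combined effect of 3.
```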

There are several other mistakes in the use of regression. One way to avoid them is to do your due diligence in understanding the variables before putting them into the equation. This is where domain knowledge and common sense come in.

With the intermingling of big data and statistics, it's easy to just use whatever variables you have and make predictions. If all you care about is prediction, and you're getting better predictions with whatever variables are at hand (like using your phone's battery charge to predict your credit score), then maybe that's not much of an issue for you. But if you also care about the associations among variables and need to make decisions based on them, then that due diligence matters.

Chapter 13: Different Research Techniques

Finally, let’s talk about causality. As you have already seen, causality is a difficult nut to crack. To capture it, the surest route is a controlled experiment: you divide people into two groups, give the treatment to only one group (called the treatment group), and compare the result with the other (called the control group). The idea is simple but, unfortunately, not feasible in many situations.

First, unless the treatment is expected to be beneficial (like a medical drug), it's unethical to administer it (imagine trying to measure the causal effect of cocaine on mental health by handing people cocaine). Second, the two groups might differ in other ways, making it difficult to pinpoint whether our treatment, in particular, is making the difference. Fortunately, for this second problem, we have the easiest solution of all: randomisation.

The idea is that with randomly assigned people, these other factors get distributed across both groups, leaving the treatment as the only systematic difference. This Randomised Controlled Trial (RCT) is often held up as the gold standard in research and is used across disciplines, from medicine to A/B testing to economics (read the work of Abhijit Banerjee and Esther Duflo for the latter).

But conducting RCTs is expensive and sometimes not even possible (as in our example of a harmful treatment). In that case, the trick is to find natural experiments in the world. If there are two nearby villages with almost all features in common (geography, culture, economy, etc.) but one particular difference (say, a policy on education), then we can use them as our treatment and control groups. This is still difficult for our cocaine and mental health example, as various other factors could be at play (maybe the same habits and mindset drive people to both cocaine use and poor mental health).

As you have seen, it often requires dedication, creativity, and interdisciplinary knowledge to find these natural experiments. Thus, the most important thing in research is not to create some fancy statistical or machine learning model. Rather, it is to frame the proper research question and then look for these counterfactuals.

Conclusion

For a long time in research, the biggest problem was the lack of data. People made important decisions based on limited data, "expert knowledge," whims, and instincts. Today, the situation is the opposite. But with great power comes great responsibility, and this is where we need to be more careful than ever in using statistics. Otherwise, we would not only make wrong decisions, but make them with the false confidence (and arrogance) that comes from "using data."

Lastly, as the author stresses in the book, statistics, at the end of the day, is like a knife. It can cut either way, especially in today's data-driven (read: data-mad) world. It's up to us how we use this tool!

My Takeaway

I just love this book and often recommend it to my friends and students. So, the entire summary is my takeaway.

Originally published at http://nepaliwanderer.com on July 8, 2024.
