Which measure is best used when data are skewed? Understanding the Median and Beyond

When we look at data, we often want to find a single number that represents the "typical" or "central" value. For roughly symmetrical data, this is usually straightforward. But what happens when your data isn't so neat? What if it's skewed?

Skewed data is data whose distribution is not symmetrical: it has a longer tail on one side. Imagine a bell curve – that's symmetrical. Now imagine stretching one side of that bell out, making it longer. That's skewed data. This asymmetry can throw off some of our most common measures of central tendency, like the mean (the average).

So, when your data is skewed, which measure is best used to represent its center? The answer, in most cases, is the median.

Understanding the Mean, Median, and Mode

Before we dive deeper into skewed data, let's quickly review the three main measures of central tendency:

Mean: This is what most people think of as the "average." You add up all the numbers and divide by how many numbers there are. Example: If your scores are 70, 80, 90, the mean is (70 + 80 + 90) / 3 = 80.
Median: This is the middle value in a dataset when the numbers are arranged in order from smallest to largest. If there's an even number of data points, the median is the average of the two middle numbers. Example: If your scores are 70, 80, 90, the median is 80. If your scores are 70, 80, 90, 100, the median is (80 + 90) / 2 = 85.
Mode: This is the number that appears most frequently in your dataset. A dataset can have one mode (unimodal), multiple modes (multimodal), or no mode at all. Example: If your scores are 70, 80, 80, 90, the mode is 80.
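The three definitions above can be checked with a short sketch. It assumes Python 3.8 or later and uses only the standard-library statistics module; the variable names are illustrative, and the scores are the ones from the examples above.

```python
import statistics

# Scores taken from the examples above.
scores_odd = [70, 80, 90]
scores_even = [70, 80, 90, 100]
scores_repeat = [70, 80, 80, 90]

# Mean: sum of the values divided by how many there are.
mean_odd = statistics.mean(scores_odd)

# Median: middle value of the sorted data; with an even count,
# the average of the two middle values.
median_odd = statistics.median(scores_odd)
median_even = statistics.median(scores_even)

# Mode: most frequent value. multimode returns every value tied
# for the highest count, so it also covers the "no single mode" case.
modes_repeat = statistics.multimode(scores_repeat)
modes_all_unique = statistics.multimode(scores_odd)

print(mean_odd, median_odd, median_even)
print(modes_repeat, modes_all_unique)
```

When every value appears once, multimode returns all of them, which is one way to detect the "no mode at all" situation described above.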

Why the Mean Struggles with Skewed Data

The mean is highly sensitive to extreme values, including outliers. In skewed data, the values far out in the long tail are exactly such extreme values. Let's look at an example.

Imagine you're looking at the salaries of people in a small company:

Salaries: $30,000, $35,000, $40,000, $45,000, $50,000, $250,000 (the CEO)

Let's calculate the measures of central tendency:

Mean: ($30,000 + $35,000 + $40,000 + $45,000 + $50,000 + $250,000) / 6 = $450,000 / 6 = $75,000
Median: To find the median, we order the data: $30,000, $35,000, $40,000, $45,000, $50,000, $250,000. Since there are 6 numbers (an even set), we take the average of the two middle numbers: ($40,000 + $45,000) / 2 = $42,500.
Mode: In this specific example, there is no mode, as each salary appears only once.

Notice how the mean ($75,000) is higher than five of the six salaries. This is because the single high salary of $250,000 pulled the average up. This means the mean is not a good representation of a "typical" salary in this company.
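As a quick check of the salary arithmetic, here is a minimal sketch using Python's standard-library statistics module. The variable names are illustrative; the figures are the six salaries listed above.

```python
import statistics

# The six salaries from the example, in dollars.
salaries = [30_000, 35_000, 40_000, 45_000, 50_000, 250_000]

mean_salary = statistics.mean(salaries)      # 450,000 / 6
median_salary = statistics.median(salaries)  # average of 40,000 and 45,000

# Count how many employees earn less than the mean.
below_mean = sum(1 for s in salaries if s < mean_salary)

print(f"mean:   ${mean_salary:,.0f}")
print(f"median: ${median_salary:,.0f}")
print(f"{below_mean} of {len(salaries)} salaries are below the mean")
```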

The Power of the Median with Skewed Data

The median, on the other hand, is much more robust to outliers. It simply looks at the middle position, unaffected by how extreme the values at the ends are. In our salary example, the median of $42,500 is a much better reflection of what a typical employee earns than the mean of $75,000.
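One way to see this robustness is to make the extreme value even more extreme and compare the results. The sketch below reuses the salary list and swaps in a hypothetical CEO salary of $2,500,000; that figure and the helper name summarize are assumptions made for illustration only.

```python
import statistics


def summarize(values):
    """Return (mean, median) for a list of numbers."""
    return statistics.mean(values), statistics.median(values)


base = [30_000, 35_000, 40_000, 45_000, 50_000, 250_000]

# Hypothetical variation: the top salary is ten times larger.
bigger_ceo = base[:-1] + [2_500_000]

base_mean, base_median = summarize(base)
new_mean, new_median = summarize(bigger_ceo)

print(f"mean:   {base_mean:,.0f} -> {new_mean:,.0f}")
print(f"median: {base_median:,.0f} -> {new_median:,.0f}")
```

The mean moves with the outlier, while the median stays where it was because the two middle salaries have not changed.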

Types of Skewness and Their Impact

There are two main types of skewness:

Positive Skew (Right Skew): The tail of the data is on the right side. This means there are some very high values pulling the mean to the right. In positively skewed data, you'll typically see: Mean > Median > Mode. Our salary example above is a classic case of positive skew.
Negative Skew (Left Skew): The tail of the data is on the left side. This means there are some very low values pulling the mean to the left. In negatively skewed data, you'll typically see: Mode > Median > Mean. An example might be test scores where most students do very well, but a few perform very poorly.

In both types of skewness, the median remains the most reliable measure of central tendency because it is not distorted by extreme values in either direction.
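The direction of skew can also be checked numerically. The sketch below computes a moment-based skewness coefficient (the third central moment divided by the second central moment raised to the power 1.5), which is one common definition among several. The test scores are made-up values chosen to be left-skewed, and the function name skewness is illustrative.

```python
import statistics


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Right-skewed: the salary example from earlier in the article.
salaries = [30_000, 35_000, 40_000, 45_000, 50_000, 250_000]

# Left-skewed: hypothetical test scores, most high and one very low.
test_scores = [35, 78, 85, 88, 90, 92, 95, 95]

for name, data in [("salaries", salaries), ("test scores", test_scores)]:
    print(
        name,
        "skewness sign:", "+" if skewness(data) > 0 else "-",
        "mean:", round(statistics.mean(data), 2),
        "median:", statistics.median(data),
    )
```

A positive coefficient goes with the mean sitting above the median, and a negative one with the mean sitting below it, matching the two cases described above.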

When to Consider Other Measures

While the median is generally best for skewed data, there are nuances:

Understanding the Distribution is Key: Sometimes, knowing the mean and the median together can tell you more about the data's shape than either measure alone. A difference between the mean and median that is large relative to the spread of the data suggests skewness.
Specific Applications: In some specialized fields or for certain types of analysis, the mean might still be used, but with careful consideration and often alongside other statistical techniques to account for the skewness. For instance, if you are trying to understand the total impact of income in a population, the mean might be more relevant, even if it's skewed. However, for understanding a "typical" experience, the median is superior.
The Mode: The mode can be useful if you are interested in the most common category or value. However, it's not always a good indicator of the center, especially in continuous data or when there isn't a clear dominant value.
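To make "mean and median together" concrete, one option is Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, which scales the gap between the two measures by the spread of the data. The sketch below is one possible implementation; using the sample standard deviation (statistics.stdev) rather than the population version is a choice made here, and the function name is illustrative.

```python
import statistics


def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.stdev(values)  # sample standard deviation
    return 3 * (mean - median) / spread


salaries = [30_000, 35_000, 40_000, 45_000, 50_000, 250_000]
symmetric_scores = [70, 80, 90]

skewed_coef = pearson_median_skewness(salaries)
symmetric_coef = pearson_median_skewness(symmetric_scores)

print(f"salaries:         {skewed_coef:.2f}")
print(f"symmetric scores: {symmetric_coef:.2f}")
```

A value near zero means the mean and median roughly agree; a clearly positive or negative value points to a right or left skew, respectively.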

In summary, for general understanding of a typical value in skewed datasets, the median is the go-to measure. It provides a more accurate representation of the center by ignoring the influence of extreme values.

When your data looks like a lopsided bell, don't let the outliers fool you. The middle ground – the median – is your most honest guide to the center.

Frequently Asked Questions (FAQ)

How does skewness affect the mean?

Skewness, which is the asymmetry in data, pulls the mean towards the tail of the distribution. In positively skewed data (a tail to the right), the mean is pulled higher than the median. In negatively skewed data (a tail to the left), the mean is pulled lower than the median. This makes the mean less representative of the typical value in skewed datasets.

Why is the median less affected by outliers than the mean?

The median is calculated by finding the middle value after sorting the data. It depends on the rank order of the data points, not on how large or small the values at the ends are. Therefore, making an extremely high or low value (an outlier) even more extreme does not change the median at all, whereas it can significantly shift the mean.

When would you still use the mean even if data is skewed?

You might still use the mean if your goal is to understand the total sum or average contribution across the entire dataset, especially in contexts like calculating total economic impact or average group performance where the extreme values do contribute to the overall picture. However, it's crucial to report the skewness and often the median alongside the mean in such cases to provide a complete understanding.
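The link between the mean and the total can be shown with the salary figures from earlier in the article. This is a minimal sketch with illustrative variable names: multiplying the mean by the number of values recovers the total, while doing the same with the median does not.

```python
import statistics

salaries = [30_000, 35_000, 40_000, 45_000, 50_000, 250_000]
n = len(salaries)

total_payroll = sum(salaries)

# The mean times the count gives back the total exactly.
from_mean = statistics.mean(salaries) * n

# The median times the count understates the total for right-skewed data.
from_median = statistics.median(salaries) * n

print(f"actual total: {total_payroll:,}")
print(f"mean * n:     {from_mean:,.0f}")
print(f"median * n:   {from_median:,.0f}")
```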

Can a dataset be both skewed and have a clear mode?

Yes, a dataset can be skewed and still have a mode. For example, in a positively skewed dataset representing customer purchase amounts, the mode might be a common small purchase amount, while the median would be higher, and the mean would be even higher due to a few very large purchases.