Learn the fundamental metrics required to interpret your data like a pro.
Measures of central tendency
The three main measures of central tendency, meaning the ways of describing the typical or central value of your data, are the mean, the median, and the mode. While they all tell you about the central point of your dataset, each one defines "central" differently. So, how can your data have three different centers?
In short, the mean is what most people call the average, the median is the middle value, and the mode is the most commonly or frequently observed value.
Mean and Median
The mean is what most people call the average. It is the sum of all values divided by the number of observations. For example, add up everybody’s height (185 + 175 + 194 = 554) and then divide by the number of people (3), which gives a mean of about 184.7.
The median is the measurement or value in the exact middle of your data when you order your data from low to high. For example, by lining all your friends up from shortest to tallest, and then taking the person in the middle – their height is your median.
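As a minimal sketch, the height example can be checked in Python. The variable names are illustrative, and the three heights are the ones used in the example above.

```python
import statistics

heights = [185, 175, 194]  # the three heights from the example above

# Mean: sum of all values divided by the number of observations.
mean_height = sum(heights) / len(heights)

# Median: the middle value once the data is ordered from low to high.
median_height = statistics.median(heights)

print(round(mean_height, 2))  # 184.67
print(median_height)          # 185
```

Here the mean and the median happen to be close, because none of the three heights is extreme.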
Calculating the Median
The rule for finding the median depends on whether the number of observations in your dataset, n, is odd or even. In both cases, first sort your values from low to high.
If n is odd, the median is the value at position (n + 1) / 2 in the sorted list. With 5 observations, for example, that is the 3rd value.
If n is even, there is no single middle value, so the median is the mean of the two middle values, the ones at positions n / 2 and (n / 2) + 1. With 6 observations, that is the average of the 3rd and 4th values. In practice, in data science, your statistical program will handle these calculations for you.
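To make the odd and even cases concrete, here is a small sketch of a median function. The name `median_by_position` is hypothetical. The even case uses the standard convention of averaging the two middle values of the sorted data.

```python
def median_by_position(values):
    """Return the median by picking the middle position(s) of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        # Odd n: the value at position (n + 1) / 2, counting from 1.
        # Python lists count from 0, so subtract 1 to get the index.
        return ordered[(n + 1) // 2 - 1]
    # Even n: the mean of the values at positions n / 2 and n / 2 + 1.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


print(median_by_position([194, 175, 185]))       # 185
print(median_by_position([194, 175, 185, 180]))  # 182.5
```

In everyday work you would normally call `statistics.median` from the standard library instead, which follows the same convention.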
Mode
The mode is the most popular value. If 80 percent of customers rate your store as 4 out of 5 stars, then 4 is the mode – because more people chose that value than any other value.
But it doesn’t have to be a majority. Perhaps you have 100 people, and you asked each to choose their favourite number. Let’s say that 95 people all chose a completely unique number, in that nobody else chose the number that they did.
But 5 people all chose 44. Only 5% of people chose that number, so it’s certainly not a majority. But, more people chose that number than any other number, so it is the plurality.
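The favourite-number example can be reproduced with a quick sketch. The 95 unique picks are made-up placeholder values. The only thing that matters is that none of them repeats and none of them is 44.

```python
from collections import Counter

# Hypothetical survey: 95 people each pick a different number,
# and 5 people all pick 44.
unique_picks = list(range(100, 195))  # 95 distinct values, none equal to 44
responses = unique_picks + [44] * 5

counts = Counter(responses)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)        # 44 5
print(mode_count / len(responses))   # 0.05
```

Only 5% of respondents chose 44, yet it is still the mode because every other value appears just once.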
Advantages of using the median
The mean is heavily influenced by extreme values, which matters for things like income and wealth. For example, if Elon Musk walked into a small cafe, then at his 2023 net worth the mean net worth of everyone in the cafe would be billions of dollars.
This makes the mean a misleading summary for datasets with a few very large or very small values at the extreme ends.
In contrast, the median would be much less affected by Elon’s presence. This is why when we report income we tend to use the median, not the mean. If Elon entered the cafe, his wealth would skew the distribution, pushing the mean far above the median.
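The cafe scenario can be sketched with invented numbers. Every figure below is hypothetical, including the extreme value standing in for a billionaire. The point is only to show how differently the mean and the median react to one outlier.

```python
import statistics

# Hypothetical annual incomes (in dollars) for nine people in a cafe.
cafe_incomes = [32_000, 38_000, 41_000, 45_000, 47_000,
                52_000, 55_000, 61_000, 70_000]

# A made-up extreme value standing in for a billionaire walking in.
extreme_income = 2_000_000_000
with_outlier = cafe_incomes + [extreme_income]

mean_before = statistics.mean(cafe_incomes)
median_before = statistics.median(cafe_incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"before: mean {mean_before:,.0f}, median {median_before:,.0f}")
# before: mean 49,000, median 47,000
print(f"after:  mean {mean_after:,.0f}, median {median_after:,.0f}")
# after:  mean 200,044,100, median 49,500
```

One extra person moves the mean from about 49 thousand to about 200 million, while the median shifts by only 2,500.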
On the other hand, the mean is the most commonly used measure when a distribution is not skewed. When you see an average reported, it is most likely the mean. The mean is also needed for use in some statistical tests, where the median cannot be used. And the mean, not the median, is used to calculate standard deviation – a measure of how spread out your data is.
Fortunately it is easy to calculate both using statistical software or data science programming languages such as Python. So you’re not limited to using one or the other. It is however important to understand the difference, and when it is best to use the median.
Skew
Skewness measures the asymmetry of a distribution. For example, the normal distribution, otherwise known as the Gaussian or bell curve, is a symmetrical distribution. This means it has zero skew, and a sample drawn from it will have a skew very close to zero. So you are just as likely to find a value 30 points above the mean as you are 30 points below the mean. Approximately normal distributions are found throughout nature and daily life: birth weight, job satisfaction and IQ are all commonly treated as roughly normal.
But many distributions are not symmetrical, meaning that they can skew to the left or the right. This can often be seen by the naked eye during data visualization, and there are other easy rules to test whether your distribution is skewed or not.
In a symmetrical distribution, the mean should be equal to the median – or at least pretty close to it. However, in a non-symmetrical distribution, the two things are likely to be very different.
The direction of skew
How do you know which direction your dataset is skewed in? A good way to remember is to look at where the longer tail is pointing. If it’s pointing to the left, then your distribution is left skewed; if it’s pointing to the right, then it is right skewed.
Another way is to look at the mean and median. If the mean is greater than median, then your distribution is right skewed. If the mean is less than the median, your distribution is left skewed.
An example of right skewed data in real life is income, because really rich people like Bill Gates and Elon Musk skew the distribution.
On the other hand, scores on an easy test might be left skewed if most people pass the test with a high score, and only a few people fail it with a score of less than 50%.
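The mean-versus-median rule can be written as a short helper. This is a rough sketch: the function name `skew_direction` and both datasets are invented for illustration, and the rule itself is a rule of thumb rather than a formal test.

```python
import statistics


def skew_direction(values):
    """Rule-of-thumb skew direction from comparing the mean and the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right skewed"
    if mean < median:
        return "left skewed"
    return "no skew indicated"


# Hypothetical data: one very high income pulls the mean upward.
incomes = [30, 35, 40, 45, 500]
# Hypothetical data: most people score high, one person fails.
easy_test_scores = [35, 88, 90, 92, 94, 95, 96, 98]

print(skew_direction(incomes))           # right skewed
print(skew_direction(easy_test_scores))  # left skewed
```

For the test scores the mean is 86 and the median is 93, so the single low score drags the mean below the median.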
Calculating skewness
While you can generally visualize when your distribution is skewed, and use simple rules like checking if the mean is greater than the median, or vice versa, to find which direction the skew is in, that still doesn’t help you properly quantify skew. You don’t know how skewed your data really is.
One of the easiest ways to actually quantify skew – for interval and continuous data only – is Pearson’s median skewness. It’s an easy to understand measure, and you can even calculate it yourself. The equation is as follows:
Pearson’s median skewness = 3 × (Mean - Median) / Standard Deviation
Pearson’s median skewness expresses the gap between the mean and the median in units of standard deviation, multiplied by 3. If the value is close to 0 (as a rule of thumb, between -0.4 and 0.4), then you can treat the distribution as roughly symmetrical and not meaningfully skewed.
If your result is greater than 0 that means your data is positively skewed – right skewed. If it is less than 0 then it’s negatively skewed – left skewed.
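A minimal sketch of Pearson’s median skewness in Python follows. It assumes the sample standard deviation (dividing by n - 1). The function name and the example data are hypothetical.

```python
import statistics


def pearson_median_skewness(values):
    """3 * (mean - median) / standard deviation, using the sample SD."""
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return 3 * (statistics.mean(values) - statistics.median(values)) / sd


symmetric = [1, 2, 3, 4, 5]
right_skewed = [30, 35, 40, 45, 500]

print(pearson_median_skewness(symmetric))               # 0.0
print(round(pearson_median_skewness(right_skewed), 2))  # about 1.3
```

The second result is well above 0.4, so under the rule of thumb above the data would count as meaningfully right skewed.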
Sample variance
Variance, as the name might suggest, measures the amount of variation in your data. Specifically, it is the average of the squared distances between your values and the mean. Variance shows you how ‘spread out’ your data is.
A dataset with a high variance has a wide range of values, whereas a dataset with a low variance has a narrow range of values. If you take the age of everyone in a primary school class, then the variance will likely be low. However if you take the age of everyone in a company it’ll likely have higher variance.
Calculating sample variance
To calculate the variance of a data set, you need to:
1. Calculate the mean of the data set by adding all the values together and dividing by the number of values.
2. Subtract the mean from each value in the data set and square the result.
3. Add up all the squared differences.
4. Divide the sum by the number of values in the data set minus 1.
This gives you the sample variance of the data set. The equation looks like this:
s^2 = sum((x - mean)^2) / (n - 1)
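The four steps above can be sketched directly in Python. The function name `sample_variance` and the example data are illustrative. The result is compared with `statistics.variance` from the standard library, which also divides by n - 1.

```python
import statistics


def sample_variance(values):
    """Sample variance, following the four steps one at a time."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n                             # step 1
    squared_diffs = [(x - mean) ** 2 for x in values]  # step 2
    total = sum(squared_diffs)                         # step 3
    return total / (n - 1)                             # step 4


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

print(sample_variance(data))      # 32 / 7, about 4.571
print(statistics.variance(data))  # the same value from the standard library
```

For this data the mean is 5 and the squared differences add up to 32, so the sample variance is 32 divided by 7.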
Sample Standard Deviation
The standard deviation is another measure of the spread of a dataset, and it is calculated from the variance of your data. Roughly speaking, the standard deviation tells you how far a typical value is from the mean. In short, it measures the dispersion of your data.
The key difference between standard deviation and variance is that the result of your variance calculation is in squared units, whereas the standard deviation is in the same units as your data. Once you understand one, it’s easy to understand the other.
Calculating Sample Standard Deviation
So, the sample variance is calculated by adding up the squared distances of all your values from the mean and dividing by n - 1. The standard deviation is then calculated by taking the square root of the variance.
So, if the equation for variance looks like this:
s^2 = sum((x - mean)^2) / (n - 1)
then, because the standard deviation is the square root of the variance, its equation looks like this:
s = sqrt(sum((x - mean)^2) / (n - 1))
Hopefully you recognize a similar version of this formula was used to calculate the variance. All we do is find the square root of the variance calculation to find the standard deviation.
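As a short sketch, the square-root relationship can be checked in Python. The giraffe heights are invented for illustration, and the result is compared with `statistics.stdev`, which also uses n - 1.

```python
import math
import statistics

heights_m = [5.0, 5.5, 6.0, 5.2, 5.8]  # hypothetical giraffe heights in meters

n = len(heights_m)
mean = sum(heights_m) / n

# Sample variance: squared distances from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in heights_m) / (n - 1)  # square meters

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)  # meters, the same unit as the data

print(round(variance, 3))  # about 0.17
print(round(std_dev, 3))   # about 0.412
```

The variance comes out in square meters, which is hard to picture, while the standard deviation is back in meters.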
The difference between variance and standard deviation
Variance and standard deviation are basically siblings. It’s just that one of them, the standard deviation, is a lot easier to interpret, which makes preparing and reporting results much easier for you and for everyone reading them.
The standard deviation is just the square root of the variance. You might be wondering: ‘Well, what is the point of that? If you have one, why do you need the other, when they both tell you the spread of the distribution?’
The answer is that by taking the square root of the variance you get a value, the standard deviation, that is in the same measurement units as your original values, for example minutes, seconds, centimeters, or inches.
So, if you want a measure that speaks your language, you should use the standard deviation. It makes interpretation and reporting much easier! For example, it will enable you to report that the mean height of giraffes in Africa was 5.5 meters, with a standard deviation of 0.5 meters. This gives us a quantifiable number on how the data is distributed. Plus, having everything in the same measurement units just makes things easier.
That doesn’t mean you don’t need variance, though. As an example, it is used in some statistical tests to test whether two samples might come from different populations.
The empirical rule
The empirical rule tells you that for a normal distribution, about 68% of all values are within 1 standard deviation of the mean, about 95% of all values are within 2 standard deviations of the mean, and about 99.7% of all values are within 3 standard deviations of the mean.
The empirical rule is helpful because if you know that your data is normally distributed, and you know your sample mean, and your standard deviation, you can begin making predictions about outcome probability.
For example, if you have a herd of zebras at the zoo, and they live 20 years on average, with a standard deviation of 5 years, you can begin to understand the probability that a zebra will live beyond a certain age.
Because about 95% of values fall within two standard deviations of the mean, you can subtract 2 times the standard deviation from the mean (20 - 10 = 10) and add 2 times the standard deviation to the mean (20 + 10 = 30) to find that, based on your data, about 95% of zebras are expected to live between 10 and 30 years.
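The zebra calculation can be sketched in Python as follows. The helper name `empirical_rule_ranges` is hypothetical, and the check with `statistics.NormalDist` assumes that lifespans really are normally distributed with a mean of 20 and a standard deviation of 5.

```python
from statistics import NormalDist


def empirical_rule_ranges(mean, sd):
    """Ranges covering roughly 68%, 95% and 99.7% of a normal distribution."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


ranges = empirical_rule_ranges(20, 5)
print(ranges[1])  # (15, 25), about 68% of zebras
print(ranges[2])  # (10, 30), about 95% of zebras
print(ranges[3])  # (5, 35), about 99.7% of zebras

# Cross-check the 95% figure against an exact normal distribution.
lifespan = NormalDist(mu=20, sigma=5)
within_two_sd = lifespan.cdf(30) - lifespan.cdf(10)
beyond_30_years = 1 - lifespan.cdf(30)

print(round(within_two_sd, 4))    # 0.9545
print(round(beyond_30_years, 4))  # about 0.0228
```

Because the normal distribution is symmetrical, the roughly 5% of zebras outside the 10 to 30 year range split evenly, so a little over 2% would be expected to live beyond 30 years.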
Questions such as these are popular on statistics exams.