Correlation measures the relationship between two variables. More precisely, it describes how much change you can expect to see in one variable when another variable changes.
Imagine a scatter graph where your independent variable is “time spent studying” and your dependent variable is “number of questions answered correctly”.
Do you think that the amount of time you spend studying is related to the number of questions you answer correctly in Kinnu? It probably is!
That means that ‘the amount of time spent studying is correlated with the number of questions answered correctly’. The more you study, the more questions you answer correctly.
The more X you have, the more Y you also have.
If one value gets higher as the other one does, then you have a positive correlation. An example of a positive correlation is height and weight. As people get taller, they also tend to weigh more.
But it works both ways: if one value gets lower and the other one does too, that is also a positive correlation. So, a positive correlation is when both variables move in the same direction.
On a scatter plot, a positive correlation slopes up and to the right.
A negative correlation is when one value moves in one direction and the other moves in the opposite direction: as one gets higher, the other gets lower. For example, as you climb up a mountain and get higher above sea level, the temperature gets lower.
On a scatter plot, a negative correlation slopes down and to the right.
What does no correlation look like? When there is no correlation between variables, a scatter plot looks like somebody has just randomly thrown darts at it. There is no real pattern to be seen in the data. This shows that your data is not correlated. For example, there is no correlation between the amount of tea you drink and how long my commute is.
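To make the three cases concrete, here is a minimal Python sketch. The data sets are made up for illustration, and the helper names co_movement and direction are hypothetical, not part of any library. The idea is to check whether the two variables tend to sit above and below their own averages at the same time (positive), at opposite times (negative), or neither.

```python
def co_movement(xs, ys):
    """Sum of the products of each pair's deviations from the two means.

    Positive when x and y tend to be above/below average together,
    negative when they tend to be on opposite sides of their averages.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def direction(xs, ys, tol=1e-9):
    """Classify the direction of a relationship as positive, negative or none."""
    score = co_movement(xs, ys)
    if score > tol:
        return "positive"
    if score < -tol:
        return "negative"
    return "none"


# Made-up illustrative data, not real measurements.
hours_studied = [1, 2, 3, 4, 5]
correct_answers = [3, 5, 6, 8, 10]

altitude_km = [0, 1, 2, 3, 4]
temperature_c = [20, 14, 7, 1, -6]

cups_of_tea = [1, 2, 3, 4, 5]
commute_minutes = [32, 31, 34, 31, 32]

print(direction(hours_studied, correct_answers))  # positive
print(direction(altitude_km, temperature_c))      # negative
print(direction(cups_of_tea, commute_minutes))    # none
```

Real data will almost never score exactly zero, so in practice "no correlation" means a score close to zero relative to the spread of the data, not a perfect zero as in this tidy example.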
The line of best fit in scatterplots
Often when viewing scatterplots you will see a line running through the centre of the mass of data points. This is called the line of best fit, and it represents a linear estimate of the dependent variable based on the value of the independent variable. It enables a visualization of the general trend in your data.
When data is as tightly clustered as it is in the image above, it’s relatively easy to visualize the general trend in your data without the line of best fit. However, in cases where your data is messier, the line serves as a useful visualization tool. It is also used in predictive statistical models to represent the relationships between variables mathematically, specifically algebraically.
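As a rough illustration of how such a line can be found, the sketch below assumes ordinary least squares, which is the most common method but is not named in the text above. The function name best_fit_line and the study data are hypothetical and made up for this example.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative data: hours studied vs questions answered correctly.
hours_studied = [1, 2, 3, 4, 5]
correct_answers = [3, 5, 6, 8, 10]

slope, intercept = best_fit_line(hours_studied, correct_answers)

# The algebraic form of the line lets you estimate y for an x you have not observed.
predicted = intercept + slope * 6
print(f"slope={slope:.2f}, intercept={intercept:.2f}, estimate for 6 hours={predicted:.1f}")
```

For this made-up data the line rises by about 1.7 correct answers per extra hour of study, which is the kind of algebraic relationship a predictive model would use.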
Does the slope matter when visualizing correlation?
When you look at a scatter plot, you will be able to visualize the strength of a correlation. Often on a scatter plot, you will also see a line of best fit: the line that runs through the middle of the data points.
When visualizing a correlation, the steepness of this line does not affect the strength of the correlation. What affects the strength of a correlation is how closely the data points cluster around the line, which represents how reliably a certain change in one variable predicts a change in the other.
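The sketch below illustrates this point with made-up numbers. It uses Pearson’s r, introduced in the next paragraph, as the measure of strength, and the helper name slope_and_r is hypothetical. One data set has a gentle slope with points hugging the line, another has a steep slope with scattered points, and a third is the first data set with every y value multiplied by 10.

```python
import math


def slope_and_r(xs, ys):
    """Return (least squares slope, Pearson's r) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
gentle_tight = [2.1, 3.9, 6.2, 7.8, 10.0]      # shallow line, points close to it
steep_scattered = [5, 30, 10, 45, 35]          # steep line, points far from it
stretched = [10 * y for y in gentle_tight]     # same pattern, y axis rescaled

gentle_slope, gentle_r = slope_and_r(x, gentle_tight)
steep_slope, steep_r = slope_and_r(x, steep_scattered)
stretched_slope, stretched_r = slope_and_r(x, stretched)

print(f"gentle but tight:    slope={gentle_slope:.2f}, r={gentle_r:.3f}")
print(f"steep but scattered: slope={steep_slope:.2f}, r={steep_r:.3f}")
print(f"stretched:           slope={stretched_slope:.2f}, r={stretched_r:.3f}")
```

The steep data set has the larger slope but the weaker correlation, and stretching the y axis makes the line ten times steeper without changing r at all.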
Pearson’s Correlation Coefficient, otherwise known as Pearson’s r, is a common way to calculate the correlation between two quantitative variables. It was developed by Karl Pearson in the 1890s, building on earlier work by Francis Galton. The Pearson Correlation Coefficient tells you in which direction two variables are correlated, positively or negatively, as well as the strength of that correlation. Pearson’s Coefficient is a key tool for data scientists to quantify the strength of a correlation, instead of guessing based on visual representations.
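For reference, the standard textbook formula for Pearson’s r can be written in LaTeX as follows. The notation is an assumption, since the text above does not define any symbols: there are n paired observations (x_i, y_i), and \bar{x} and \bar{y} are the means of the two variables.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator captures whether the two variables move together or in opposite directions, and the denominator rescales the result so that it does not depend on the units of either variable.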
Correlation coefficient interpretation
Pearson’s correlation coefficient ranges from -1 to +1.
A correlation coefficient of less than 0 signifies a negative correlation, while a coefficient greater than 0 signifies a positive correlation. A coefficient of exactly 0 means there is no linear correlation.
But the strength of a correlation is also important: the closer the coefficient is to -1 or +1, the stronger the correlation. The table below shows you how to define the strength of your correlation.
Correlation is not causation
If you calculated the correlation coefficient – the strength of relationship – for your two continuous variables and saw that the more people studied, the better test scores they got, you could say that there was a correlation between time studying and test scores.
However, you can’t ever say that one caused the other from the correlation coefficient alone. This is true no matter how intuitive or obvious it might seem.
‘But of course studying causes better test scores,’ you say.
What if I told you that per capita cheese consumption was correlated with the number of people who died by getting tangled in their bedsheets? Would you be so sure that cheese causes this?
What about the fact that the number of films Nicolas Cage appears in is correlated with the number of people who drown in a pool? Would you tell me that Nicolas Cage films cause drownings?