Advanced Properties of Your Data

More complicated distributions and methods of analysis.

Kurtosis

Kurtosis is a statistical measurement that tells us about the shape of a distribution. Specifically, it tells us how “peaked” or “flat” a distribution is compared to a normal distribution. A normal distribution has a kurtosis of 3, and therefore an “excess kurtosis” (kurtosis minus 3) of zero. Many software packages report excess kurtosis by default, which is why you will sometimes see the normal distribution described as having a kurtosis of zero.

If a distribution has a positive excess kurtosis, it is more peaked than a normal distribution. This is often referred to as a “fat-tailed” distribution because the tails (or extremes) of the distribution are “fatter” than in a normal distribution. An example of a fat-tailed distribution is a distribution of stock returns.

On the other hand, if a distribution has a negative excess kurtosis, it is flatter than a normal distribution. This is often referred to as a “thin-tailed” distribution because the tails are “thinner” than in a normal distribution. An example of a thin-tailed distribution is the uniform distribution.

A normal distribution is a symmetric, bell-shaped curve with a kurtosis of 3 (an excess kurtosis of 0). It has the same amount of data on both sides of its centre, where the mean, median, and mode coincide.

In conclusion, kurtosis helps you identify the shape of a distribution and tell whether it is fat-tailed, normal-tailed, or thin-tailed.

Labelling Kurtosis

If kurtosis is greater than 3, then the distribution is Leptokurtic. A Leptokurtic distribution has a high peak, declines rapidly as you move away from the mean, and has heavy tails – more outliers.

If kurtosis is less than 3, then it is Platykurtic. It will have a flatter top – not always as flat as the uniform distribution – and it will be mostly body, no long tails.

And what if kurtosis is exactly 3? Then it is Mesokurtic. It has a moderate peak, and it’s best represented by the normal distribution.
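To make these labels concrete, here is a minimal sketch using NumPy and SciPy (assuming both are installed). Note that scipy.stats.kurtosis returns excess kurtosis (normal ≈ 0) by default; pass fisher=False to get the raw kurtosis (normal ≈ 3) used in the labels above.

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(42)
    samples = {
        "normal (mesokurtic)": rng.normal(size=100_000),
        "uniform (platykurtic)": rng.uniform(size=100_000),
        "laplace (leptokurtic)": rng.laplace(size=100_000),
    }

    for name, data in samples.items():
        raw = kurtosis(data, fisher=False)   # Pearson definition: normal is about 3
        excess = kurtosis(data)              # Fisher definition: normal is about 0
        print(f"{name:24s} kurtosis = {raw:5.2f}, excess = {excess:+5.2f}")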

As a real-world application, kurtosis is often used as a measure of financial risk: the higher the kurtosis, the higher the risk, because the asset is more volatile. It can deliver high returns, but it can also generate large losses.

Clearing up the confusion about Kurtosis and fat tails

While a Platykurtic distribution might look like it has fatter tails, it is actually a thin-tailed distribution because outliers are infrequent. Many people get confused because the tails can look thicker – the curve can sit higher above the x-axis away from the mean. However, a Platykurtic distribution is like an elephant – a very small proportion of its weight is in the tail.

In contrast, a Leptokurtic distribution is fat-tailed because there are a lot of outliers – not to mention these outliers can be very large and far away from the mean. A Leptokurtic distribution is like a leaping Kangaroo – a large proportion of its weight is in the tail.

Line of best fit

The line of best fit is drawn on scatter plots and represents the best prediction of the dependent variable that could be made, based on the value of the independent variable.

Consider, for example, that we have two different dependent-variable values for the exact same value of the independent variable across two observations; any estimate must therefore fall somewhere between the two points.

When we only have two values, we can estimate simply by taking their average. For example, suppose we have age on the x-axis (our independent variable) and length of commute on the y-axis (our dependent variable), with two data points for age 30: one person commutes 30 minutes, the other 60 minutes. Our estimate must fall somewhere between these two values – the average would be 45 minutes.

However, when we have many values, we need to create a reliable rule for estimation and prediction. That reliable rule is the line of best fit. In a regression analysis, it is called the regression line.

Regression analysis

Regression analysis is a statistical technique that is used to model the relationship between a dependent variable and one or more independent variables. The dependent variable is the variable that is being predicted, while the independent variable is the variable that is used to make the prediction.

The goal of regression analysis is to find the best fitting model to describe the relationship between the dependent and independent variables.

For example, regression analysis can be used to understand how the price of a house (the dependent variable) is influenced by multiple independent variables, such as the size of the house, its location, its age, and the number of bedrooms.

Simple and multiple linear regression analysis

Simple linear regression is used when we want to predict a single dependent variable using a single independent variable. For example, we might use the number of hours studied to predict a student’s test score. In this case, the number of hours studied would be the independent variable, and the test score would be the dependent variable.

Multiple linear regression is used when we want to predict a single dependent variable using multiple independent variables. For example, we might use a student’s number of hours studied, their class attendance, and their previous test scores to predict their next test score. In this case, the number of hours studied, class attendance, and previous test scores would be the independent variables, and the next test score would be the dependent variable.

In both simple and multiple linear regression, we use statistical analysis to find the best-fit line (or equation) that describes the relationship between the independent variables and the dependent variable. This line can then be used to make predictions about the dependent variable, given a set of values for the independent variables.
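Here is a minimal sketch of both cases using NumPy’s least-squares solver; the student data values below are invented for demonstration.

    import numpy as np

    # Hypothetical data: hours studied, attendance (%), previous score, next score.
    hours      = np.array([2, 4, 5, 7, 9], dtype=float)
    attendance = np.array([60, 70, 80, 85, 95], dtype=float)
    previous   = np.array([55, 62, 70, 74, 88], dtype=float)
    score      = np.array([58, 66, 73, 79, 92], dtype=float)

    # Simple linear regression: score ~ hours.
    # The column of ones lets lstsq estimate an intercept.
    X_simple = np.column_stack([np.ones_like(hours), hours])
    coef_simple, *_ = np.linalg.lstsq(X_simple, score, rcond=None)
    print("simple:   intercept = %.2f, slope = %.2f" % tuple(coef_simple))

    # Multiple linear regression: score ~ hours + attendance + previous.
    X_multi = np.column_stack([np.ones_like(hours), hours, attendance, previous])
    coef_multi, *_ = np.linalg.lstsq(X_multi, score, rcond=None)
    print("multiple: coefficients =", np.round(coef_multi, 2))

    # Predict a new student's next score (hypothetical inputs).
    new_student = np.array([1.0, 6.0, 75.0, 72.0])  # [1, hours, attendance, previous]
    print("predicted score: %.1f" % (new_student @ coef_multi))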

Residuals - how the line of best fit is found in regression analysis

Residuals, in statistics, are the difference between the actual value of a data point and the predicted value of that data point. The line of best fit is the line that minimizes the sum of the squared residuals.

You may hear the term ‘error’ when discussing residuals. The error is, as you might have guessed, the difference between the actual and predicted value – otherwise called the residual. (Strictly speaking, the error is the deviation from the true underlying relationship and the residual is the deviation from the fitted line, but in practice the terms are often used interchangeably.)

The residuals are illustrated by the red and green lines in the image above: each is the vertical distance between our line of best fit and an observed value.

You will notice the blue line intercepts the y-axis at 3. This is called our ‘intercept’, and the steepness of the line is our ‘slope’. These are represented in a regression equation as follows:

y = mx + b

where:

m = slope

b = intercept

x = the value of our independent variable

y = the predicted value of our dependent variable

In minimizing the residuals via the ‘least squares method’, as it is called, we are finding the values of m and b that minimize the sum of squared residuals.

Our line of best fit must be straight – it cannot curve. That is why it is further away from some points than from others. But overall, its position and slope are those that minimize the sum of the squared errors across all points.
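To show what the least squares method computes, here is a minimal sketch that derives the slope and intercept from the standard closed-form formulas; the data points are invented for illustration.

    import numpy as np

    # Hypothetical data points.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([4.1, 5.9, 8.2, 9.8, 12.1])

    # Closed-form least squares estimates:
    #   m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    #   b = mean(y) - m * mean(x)
    x_mean, y_mean = x.mean(), y.mean()
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b = y_mean - m * x_mean

    predicted = m * x + b
    residuals = y - predicted  # actual minus predicted
    print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
    print(f"sum of squared residuals = {np.sum(residuals ** 2):.4f}")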

Heteroscedasticity

In statistics, homoscedasticity – also called homogeneity of variance – means constant variance within groups. In other words, it describes how spread out your data is – how far values typically fall from the mean. It is a requirement of some statistical tests for their results to be reliable; if the variances are not homogeneous, the results of these tests may be biased.

In heteroscedastic data, some data points are close to the mean, whilst others are far away.

As shown in the image above, in our heteroscedastic example, for lower values of the independent variable the spread in values is very small. But for higher values of the independent variable, the spread of values is very large – the data seems to ‘fan out’. That is an indicator that the data is not homoscedastic.
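To make the ‘fan out’ pattern concrete, here is a minimal sketch that simulates heteroscedastic data by letting the noise grow with the independent variable; all values are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 10, 200)

    # Homoscedastic: noise has a constant standard deviation.
    y_homo = 2 * x + rng.normal(scale=1.0, size=x.size)

    # Heteroscedastic: the noise standard deviation grows with x,
    # so points "fan out" at higher values of the independent variable.
    y_hetero = 2 * x + rng.normal(scale=0.5 * x, size=x.size)

    # Compare the spread of deviations in the lower and upper halves of x.
    half = x.size // 2
    for name, y in [("homoscedastic", y_homo), ("heteroscedastic", y_hetero)]:
        dev = y - 2 * x
        print(f"{name:16s} std (low x) = {dev[:half].std():.2f}, "
              f"std (high x) = {dev[half:].std():.2f}")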

Homoscedasticity


Homoscedasticity, or homogeneity of variance, is when your data exhibits equal variances. This means the data is equally spread out and, on average, similarly close to the mean. It can refer to within-group variance, as in the images above, where it describes the spread of values within a single group.

But it can also refer to between-groups variance, in which case it describes whether different groups have a similar spread of values – how far they fall from their mean, as measured by each distribution’s variance.
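When checking homogeneity of variance between groups, a common diagnostic is Levene’s test. Here is a minimal sketch using SciPy, with invented group data; a small p-value suggests the groups do not have equal variances.

    import numpy as np
    from scipy.stats import levene

    rng = np.random.default_rng(7)

    # Two hypothetical groups with the same mean but different spreads.
    group_a = rng.normal(loc=50, scale=5, size=100)
    group_b = rng.normal(loc=50, scale=15, size=100)

    stat, p_value = levene(group_a, group_b)
    print(f"Levene statistic = {stat:.2f}, p-value = {p_value:.4f}")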

Causes or sources of heteroscedasticity

Heteroscedasticity can be caused by several different factors. It can result from patterns in time series data – like seasonal fluctuations – or from inaccuracies in your measurement tool; for example, a tool might become less and less accurate over time due to changes in the external environment.

It could also be that your measurement tool exhibits greater variance as the inputs it is supposed to measure become greater. For example, a device might measure the wattage of batteries, but be less accurate and exhibit higher variance in readings for higher wattages.

Implications of heteroscedasticity in predictive statistics

Heteroscedasticity occurs when the spread or dispersion of the residuals differs systematically from one part of the dataset to another.

When conducting predictive statistics, for example by using a regression analysis, this means that your model may provide more accurate predictions at one end of the data range, while at the other end of the data range, the predictions are less accurate.

You can still perform a regression analysis on such data, but the standard errors of your estimates – and therefore your confidence intervals and significance tests – may be unreliable, and predictions will be less accurate in the high-variance part of the range.
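One common formal check for heteroscedasticity in a regression is the Breusch-Pagan test. Here is a minimal sketch using statsmodels (assuming it is installed), applied to simulated fan-out data like the example above; a small p-value suggests the residual variance is not constant.

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.default_rng(1)
    x = np.linspace(1, 10, 200)
    y = 2 * x + rng.normal(scale=0.5 * x, size=x.size)  # noise grows with x

    X = sm.add_constant(x)  # add an intercept column
    model = sm.OLS(y, X).fit()

    # Breusch-Pagan regresses the squared residuals on the predictors.
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
    print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")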
