Working with data can be difficult – avoid common data traps in your analysis
What is missing data?
Missing data is, well, exactly what it sounds like. It’s data that should be there, but isn’t! In data science, a single observation might have many different variables, be they continuous, count, nominal, or ordinal.
It’s all well and good if your dataset has all the data points you need, but what do you do when some of them are missing? It’s an important question because, for a given observation, you will often have data for one variable but not another. Maybe you have somebody’s height but not their weight. Or perhaps one of the people you sampled refused to answer your survey questions.
You can see missing data represented by the ‘NaN’ values in the image below.
In general, you either remove missing data or you impute it, which means you replace it with a value like the mean, median, or mode for that variable.
In practice, it actually gets very nuanced and complicated. You wouldn’t believe how much trouble pesky missing data can cause.
For now, just remember that the two most common ways of dealing with missing data are removal and imputation.
Missing Completely at Random Data
Missing Completely at Random (MCAR) data is when there is no pattern in your missing data. Whether a value is missing is unrelated to any observed or unobserved factor.
Missing Completely at Random data is pretty rare in reality. An example might be data that simply got lost before it could make it into your final dataset. Perhaps your colleague accidentally lost the only USB with the file at the restaurant after work.
Generally, if your data is MCAR, you don’t need special methods to correct for it. The missing values don’t bias your results, so you can proceed with your analysis on the data you have.
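To see why MCAR gaps are fairly harmless, here is a minimal simulation sketch. The income distribution and the 20% missingness rate are assumptions made up for illustration, not figures from any real survey.

```python
import random
import statistics

random.seed(0)

# Hypothetical population of 100,000 incomes (log-normal, so skewed like real incomes).
incomes = [random.lognormvariate(10.5, 0.5) for _ in range(100_000)]

# MCAR: every value has the same 20% chance of going missing,
# no matter what the value is or who it belongs to.
observed = [x for x in incomes if random.random() > 0.2]

true_mean = statistics.fmean(incomes)
observed_mean = statistics.fmean(observed)

print(f"true mean:     {true_mean:,.0f}")
print(f"observed mean: {observed_mean:,.0f}")
```

The two means should land very close to each other: losing values at random shrinks the sample, but it doesn’t tilt it in any direction.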
Missing Not at Random Data
Missing Not at Random (MNAR) data occurs when there is a pattern behind the data you are missing, and that pattern is related to the very values that are missing. Consider, for example, a housing survey in which low-income households are less likely to report their income. Your survey would be biased.
This is no small problem. In fact, some estimates say around 10-15% of income data is missing from surveys because people don’t answer the question.
Missing Not at Random data can present problems because your sample may not be representative of your population. If you have Missing Not at Random data in your dataset, then you might need to use methods like removing data or imputing values.
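The income example can be sketched as a small simulation. Everything here is assumed for illustration: the income distribution, and the idea that below-median earners skip the question 40% of the time while everyone else skips it 10% of the time.

```python
import random
import statistics

random.seed(1)

# Hypothetical population of incomes.
incomes = [random.lognormvariate(10.5, 0.5) for _ in range(100_000)]
median_income = statistics.median(incomes)


def is_reported(income: float) -> bool:
    """MNAR: the chance of a missing answer depends on the income itself."""
    p_missing = 0.4 if income < median_income else 0.1  # assumed rates
    return random.random() > p_missing


reported = [x for x in incomes if is_reported(x)]

true_mean = statistics.fmean(incomes)
reported_mean = statistics.fmean(reported)

print(f"true mean:     {true_mean:,.0f}")
print(f"reported mean: {reported_mean:,.0f}")
```

Because lower incomes go missing more often, the reported mean comes out higher than the true mean. Nothing in the reported values alone tells you that this has happened, which is what makes MNAR data so awkward.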
Missing at Random Data
Missing at Random (MAR) data is when there is a pattern or a cause behind your missing data, but it’s not directly related to the variable you’re missing data for.
As an example, young people might fail to answer questions related to their income on surveys. This is not necessarily because they’re ‘low income’, but because, as a cohort, they value their privacy and are less likely to discuss these things.
Alternatively, men might skip income questions because they simply don’t like to answer them. In both cases, your data is systematically missing based on another variable in your dataset, like age or gender.
What should be done with Missing At Random data?
Missing at Random data can generally be left as it is because the pattern behind the missing data doesn’t have anything to do with the variable of interest itself, for example income. Instead, the pattern is based on an unrelated variable, like age or gender.
However, even if that variable is theoretically unrelated, it might still be correlated with the variable of interest in some way. As an example, older men might be more likely to earn higher salaries. As a result, we need to be careful when handling MAR data.
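Here is a rough sketch of why that care matters. It assumes a made-up population in which older people earn more on average and young people skip the income question more often. The correction at the end (averaging the group means, weighted by each group’s share of the full sample) is just one simple option chosen for illustration, not a method this article prescribes.

```python
import random
import statistics

random.seed(2)

# Hypothetical population: older people earn more on average (assumed parameters).
people = []
for _ in range(100_000):
    age_group = random.choice(["young", "older"])
    mu = 10.2 if age_group == "young" else 10.8
    people.append((age_group, random.lognormvariate(mu, 0.4)))

# MAR: missingness depends only on the observed age group, not on income itself.
p_missing = {"young": 0.5, "older": 0.1}  # assumed rates
observed = [(g, x) for g, x in people if random.random() > p_missing[g]]


def mean_income(rows, group=None):
    """Mean income over all rows, or over one age group if given."""
    return statistics.fmean([x for g, x in rows if group is None or g == group])


true_mean = mean_income(people)
naive_mean = mean_income(observed)  # ignores the missingness pattern

# Within each age group the observed incomes are still representative, so we
# average the group means using each group's share of the full sample.
adjusted_mean = sum(
    mean_income(observed, g) * sum(1 for h, _ in people if h == g) / len(people)
    for g in ("young", "older")
)

print(f"true mean:     {true_mean:,.0f}")
print(f"naive mean:    {naive_mean:,.0f}")
print(f"adjusted mean: {adjusted_mean:,.0f}")
```

The naive mean overshoots because the lower-earning young group is under-represented among the answers, while the adjusted mean lands close to the truth. The fix only works because age is recorded for everyone, which is exactly what separates MAR from MNAR.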
Imputing and Removing Missing Data
If your data is Missing Completely at Random (MCAR) or Missing at Random (MAR) then you generally don’t need to remove or impute, and can proceed with your analysis, keeping in mind the caution about correlated variables for MAR data.
However, if you have Missing Not at Random (MNAR) data, then it’s possible your sample isn’t representative of the population. In that case you need to either remove data, using listwise or pairwise deletion, or impute it, by calculating the mean or median and using that in place of any missing values.
Imputing just means that you fill the empty space for that variable with a value like the mean or median.
In the example above, we take the average value of each column and add it to the missing data cells, which are indicated by the ‘NaN’ value. Consider column three, ‘col3’: the mean of 3 and 9 is 6. Therefore, the number 6 gets imputed into the cell with the missing value in the bottom row of col3.
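As a minimal sketch, assuming the table lives in a pandas DataFrame, mean imputation could look like this. The col3 values follow the example in the text, while col1 is a made-up column added only to show that each column is filled with its own mean.

```python
import numpy as np
import pandas as pd

# col3 follows the example in the text (3, 9, and a missing bottom row);
# col1 is hypothetical.
df = pd.DataFrame({
    "col1": [2.0, np.nan, 6.0],
    "col3": [3.0, 9.0, np.nan],
})

# DataFrame.mean() skips NaN by default, so col3's mean is (3 + 9) / 2 = 6.
# fillna() then puts each column's mean into that column's empty cells.
imputed = df.fillna(df.mean())

print(imputed)
```

Swapping `df.mean()` for `df.median()` gives median imputation instead.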
Listwise deletion is the most common method to use when removing data. In listwise deletion, you delete every single observation that has missing data.
Imagine your data is laid out in a table, with the rows being the people you’ve surveyed and the columns being the data categories you want to collect. Listwise deletion would delete the entirety of any row that has an empty box.
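In pandas, that row-by-row deletion can be sketched with `dropna`. The survey table below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey table: one row per person, one column per question.
survey = pd.DataFrame(
    {
        "age": [34, 51, 27, 45],
        "income": [52_000.0, np.nan, 38_000.0, 61_000.0],
        "height_cm": [171.0, 180.0, np.nan, 166.0],
    },
    index=["person_1", "person_2", "person_3", "person_4"],
)

# Listwise deletion: any row with at least one empty cell is dropped whole.
complete_cases = survey.dropna(how="any")

print(complete_cases)
print(f"kept {len(complete_cases)} of {len(survey)} rows")
```

Notice that person_2 and person_3 each lose their perfectly good age value along with the gap, which is the price of this method.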
Listwise Deletion – Bias and Power
The problem with listwise deletion is that it can create bias in our results. This is because the individuals or observations that are missing data may be different in some way from those that are not missing data. For example, imagine we are studying the relationship between a person’s height and their income. If taller people are more likely to leave their income blank on a survey, then using listwise deletion would make our sample of people with income data shorter on average, leading to bias in our results.
Power refers to the ability of a statistical analysis to detect a real effect if one exists. When we use listwise deletion, we are throwing away a lot of data, which can decrease the power of our analysis. This means that even if there is a real relationship between height and income, our analysis may not be able to detect it because we have less data to work with.
The bias described above only arises when there is a pattern behind the missing data. If the data is Missing Completely At Random (known as MCAR), then listwise deletion won’t bias your results, although it still costs you power.
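To put a rough number on the loss of power, here is a small sketch using the standard error of a mean, which is the sample standard deviation divided by the square root of the sample size. The sample sizes and the standard deviation are assumptions chosen for illustration.

```python
import math


def standard_error(sample_sd: float, n: int) -> float:
    """Standard error of a sample mean."""
    return sample_sd / math.sqrt(n)


# Assumed survey: 1,000 respondents, 600 of whom have at least one missing
# answer, so listwise deletion leaves 400 complete rows.
sd = 10.0
se_all_rows = standard_error(sd, 1000)
se_complete_rows = standard_error(sd, 400)

print(f"standard error with all rows:  {se_all_rows:.3f}")
print(f"standard error after deletion: {se_complete_rows:.3f}")
print(f"ratio: {se_complete_rows / se_all_rows:.2f}")
```

A larger standard error means wider confidence intervals, so a real effect becomes harder to tell apart from noise.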
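Pairwise deletion is less drastic. Instead of discarding every incomplete row, it leaves an observation out only of the analyses that need the values it is missing. It might be helpful to think of pairwise deletion as ‘Available Case Analysis’, because when you analyse a relationship between variables, you take every observation that has available data for all of your variables of interest, and leave the rest.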
Using the example above, to analyse the relationship between Weight and Lung Capacity, you can use observations 1 and 2 because only these observations have data for both those variables. But to analyse Height and Lung Capacity you can use only observations 1 and 3. And for analysis of Height and Weight, only observation 1 has data for both those variables.
When you have datasets with missing observations scattered across different columns, the number of observations in each analysis can vary greatly. Moreover, every sample used is different, as in the example above.
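The example can be sketched in pandas as follows. The missingness pattern matches the one described (observation 1 is complete, observation 2 lacks height, observation 3 lacks weight), but the numbers themselves are illustrative placeholders.

```python
import numpy as np
import pandas as pd

# Same missingness pattern as the example; the values are illustrative only.
df = pd.DataFrame(
    {
        "height_cm": [198.0, np.nan, 175.0],
        "weight_kg": [100.0, 80.0, np.nan],
        "lung_capacity_l": [11.86, 6.0, 5.5],
    },
    index=[1, 2, 3],
)

# For each pair of variables, count the observations that have both values.
present = df.notna().astype(int)
pair_counts = present.T @ present
print(pair_counts)

# Pairwise deletion for one analysis: keep the rows that are complete on
# just the two columns being analysed.
weight_lung = df.dropna(subset=["weight_kg", "lung_capacity_l"])
print(list(weight_lung.index))
```

The off-diagonal entries of `pair_counts` show how the sample size changes from one analysis to the next. For what it’s worth, pandas’ own `DataFrame.corr()` excludes missing values pairwise in this way.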
Fun fact: observation 1 is based on British Olympic rower Peter Reed OBE, said to have the largest recorded lung capacity (at least as of 2022) at 11.86 liters. Given that the average lung capacity is six liters, that’s quite impressive. But you can see how having a superhuman like Peter in one analysis but not in another could affect your data and add bias to your results.
When it comes to your data, you can never be too curious, because inaccurate data can pop up in the most unexpected ways. One such example is ‘truncated data’, which is data that has been cut off from your dataset. It’s hard to see, because, well, it’s not there!
Truncating data means that values above or below a cutoff have been excluded. For example, if you are collecting data on salary ranges within a company but only record people earning above $30,000, your data would be truncated at $30,000.
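Here is a small sketch of what that cutoff does to a summary statistic. The salary distribution is an assumption made up for illustration; only the $30,000 cutoff comes from the example.

```python
import random
import statistics

random.seed(3)

# Hypothetical company salaries (log-normal, values are illustrative).
salaries = [random.lognormvariate(10.6, 0.4) for _ in range(10_000)]

# Truncation at $30,000: anyone at or below the cutoff never enters the dataset.
CUTOFF = 30_000
recorded = [s for s in salaries if s > CUTOFF]

true_mean = statistics.fmean(salaries)
recorded_mean = statistics.fmean(recorded)

print(f"employees:        {len(salaries)}")
print(f"recorded:         {len(recorded)}")
print(f"true mean salary: {true_mean:,.0f}")
print(f"recorded mean:    {recorded_mean:,.0f}")
```

The recorded mean overstates the true mean, and nothing in the recorded rows hints at how many people were left out. You would only notice by comparing the row count against an outside source, such as the company headcount.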
Inaccurate and Censored Data
What exactly is ‘inaccurate data’? Imagine that you’re gathering information on car owners and their pets. Your hypothesis is that people who drive Teslas are more likely to own Labradors. But, inadvertently, your ‘dog breed’ column seems to contain a lot of some weird new dog breed called ‘Model 3’, which is a model of Tesla. That’s inaccurate data. Inaccurate data can be caused by poor data entry, poor measurement, or the unconscious biases of the person collecting the data.
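One way to catch this kind of slip is to check a column against a list of values you expect. The sketch below uses made-up records and a hypothetical reference list of breeds; in practice that list would need to come from a source you trust.

```python
import pandas as pd

# Hypothetical records; "Model 3" has slipped into the dog breed column.
pets = pd.DataFrame({
    "car": ["Tesla Model 3", "Tesla Model Y", "Toyota Corolla"],
    "dog_breed": ["Labrador", "Model 3", "Poodle"],
})

# Assumed reference list of valid breeds (illustrative, far from complete).
KNOWN_BREEDS = {"Labrador", "Poodle", "Beagle"}

# Flag every row whose breed is not on the reference list.
suspect = pets[~pets["dog_breed"].isin(KNOWN_BREEDS)]

print(suspect)
```

A check like this only flags rows for a human to review. A real but rare breed would be flagged too, so it is safer to inspect the suspects than to delete them automatically.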
Censored data is a form of inaccurate data. It shows up in your dataset as a range or a bound rather than an exact value. For example, you could have a recorded height in centimetres listed as ‘>200’. This can happen because your measurement instrument can’t measure any higher than that. Alternatively, maybe your measuring tape ran out.
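If censored values arrive as text like ‘>200’, a reasonable first step is to separate the number from the fact that it is only a bound. This is a minimal sketch with a made-up height column, assuming every censored entry is marked with a leading ‘>’.

```python
import pandas as pd

# Hypothetical height column where the instrument tops out at 200 cm.
raw = pd.Series(["172", "185", ">200", "168", ">200"], name="height_cm")

# Flag the censored entries, then strip the marker and convert to numbers.
censored = raw.str.startswith(">")
height = pd.to_numeric(raw.str.lstrip(">"))

heights = pd.DataFrame({"height_cm": height, "censored": censored})

print(heights)
```

Keeping the `censored` flag matters: a stored 200 in a flagged row means ‘at least 200’, so treating it as an exact measurement would understate the tallest people in your data.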