Uncover hidden gems in your data, and learn how to visualize relationships between variables
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) involves locating and correcting missing values, checking relationships between variables, and extracting the most important or relevant variables to use in later statistical analysis or machine learning models.
When conducting EDA, you will do things like create graphs and plots to see the distribution of continuous variables, or locate and potentially remove outliers: values that differ markedly from your average or typical observations and that can distort your statistical test results or machine learning models.
What is the goal of exploratory data analysis?
The goal of exploratory data analysis is to mine golden nuggets of insight from your dataset, and to minimize any potential errors your dataset might cause when you run statistical tests or pass it into machine learning models.
If building a machine learning model is like making a fruit salad, you don’t want rotten fruit, in this case bad data, in your salad. You also don’t want the wrong ingredients, like tuna, in your fruit salad. These would be the wrong data points altogether.
As we say in data science, ‘garbage in, garbage out’: if you don’t properly explore, clean, and select your variables (your inputs), your model performance (your output) will suffer.
Data Cleaning
Data cleaning is an essential process in any statistics, data analytics, or data science workflow. You may also hear it referred to as data cleansing or data scrubbing.
During data cleaning, you fix or remove problematic data to give yourself a clean dataset to work with. Examples of data that need cleaning include incorrect data, missing values, duplicate values, and incorrectly formatted data.
You need to clean your data because if it is incorrect, the results from your statistical tests or machine learning algorithms will be unreliable.
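To make this concrete, here is a minimal sketch of a cleaning pass using pandas. The DataFrame, its column names, and its values are all invented for illustration:

```python
import pandas as pd

# Hypothetical messy data: a duplicate row, a missing age, a badly formatted date
df = pd.DataFrame({
    "age": [25, 25, None, 34, 29],
    "signup_date": ["2023-01-05", "2023-01-05", "2023-02-10",
                    "2023-03-01", "not a date"],
})

df = df.drop_duplicates()       # remove exact duplicate rows
df = df.dropna(subset=["age"])  # drop rows with a missing age

# Fix incorrectly formatted data: unparseable dates become NaT (missing)
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

print(df)
```

Whether you drop, impute, or flag problem rows depends on your data and your analysis; dropping is simply the easiest option to show here.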
During data cleaning, you may also check for and remove outliers. These are values that lie unusually far from the rest of your observations. They might be due to measurement error, or you might simply have a freakishly tall person in your small sample. Either way, it is often best to remove them, because otherwise your results might not generalize to the population.
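There is no single rule for what counts as “unusually far”. One common convention, sketched below with invented height data, flags anything more than 1.5 interquartile ranges beyond the quartiles (a z-score cutoff, which measures distance from the mean in standard deviations, is a popular alternative):

```python
import pandas as pd

heights = pd.Series([165, 170, 172, 168, 175, 171, 169, 230])  # one freakishly tall value

q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1                                   # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the 1.5 * IQR fences

print(heights[(heights < lower) | (heights > upper)])       # flagged outliers (230)
heights = heights[(heights >= lower) & (heights <= upper)]  # keep only typical values
```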
It is estimated that 26% of the time spent on data science workflows goes to data cleaning.
Descriptive Statistics
Descriptive statistics take all of your complex data and provide you with a simple and easy way to understand the most relevant points. For your continuous variables, you will be able to quickly see the average for each variable, the minimum and maximum values, and how many observations you have in your dataset for each variable.
With descriptive statistics, you can succinctly describe your dataset in a way that enables comparison between variables and other datasets. And while descriptive statistics won’t let you draw statistically significant conclusions, they will point you to where you should explore further.
Descriptive statistics will show you things like the mean, median, or mode for your data. Alternatively, they might show how spread out your data is through things like the variance or standard deviation.
Descriptive statistics can also show you the shape of your data, giving you an idea of whether it clusters around an average result, or whether it has a different distribution, which might cause problems for some statistical tests.
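In pandas, most of these summaries come from a single call, as in this sketch on invented height data:

```python
import pandas as pd

heights = pd.Series([165, 170, 172, 168, 175, 171, 169, 190], name="height_cm")

print(heights.describe())  # count, mean, std, min, max, and quartiles (including the median)
print(heights.mode())      # the most common value(s)
print(heights.skew())      # shape: positive skew means a long right tail
```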
Graphical analysis
Much like descriptive statistics enable you to get an overview of your data with simple numerical summaries, graphical analysis enables you to get a visual overview of your data.
Graphical analysis is an important first step in data analysis, statistics, and data science. This is because without looking at your data, you don’t really know what ingredients you’re putting in your fruit salad, and you might reach for the hot chilli if you’re cooking in the dark.
Visual representations you can use include box plots, bar charts, pie charts, scatter plots, and histograms.
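As a sketch of how you might produce a few of these with pandas and matplotlib (the column names and values here are invented):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "height_cm": [165, 170, 172, 168, 175, 171, 169, 190],
    "weight_kg": [60, 68, 70, 65, 74, 69, 66, 85],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
df["height_cm"].plot.hist(ax=axes[0], title="Histogram")  # distribution of one variable
df["height_cm"].plot.box(ax=axes[1], title="Box plot")    # spread, quartiles, and outliers
df.plot.scatter(x="height_cm", y="weight_kg", ax=axes[2],
                title="Scatter plot")                     # relationship between two variables
plt.tight_layout()
plt.show()
```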
Spotting errors in your nominal data
Nominal categorical variables, things like “car model” or “dog breed”, require special attention when conducting exploratory data analysis, because their values have no numerical order: you cannot calculate the mean, the standard deviation, or the minimum and maximum.
So how do you conduct EDA for nominal variables? What should you look for? The first thing to do is check all of your unique values. Imagine that you’re doing a survey of what pets people have and you get these answers:
{“Cat”, “cat”, “Dog”, “parrot”, “Tesla”, nan}
You have a few problems. Well, actually, you have a lot of problems.
First, you have both “Cat” and “cat” as options. Presumably these are the same thing. So you need to transform all instances of “cat” to “Cat” so that the two will rightfully be grouped together in any analyses.
Next, you have a “Tesla” in with your animals. Assuming Elon Musk has not started creating self-walking robot dogs (though we don’t exclude the possibility), that means there’s some mistaken data.
Lastly, you have a missing value which in many statistical programs will show as ‘NaN’ or ‘null’.
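A sketch of how you might find and fix all three problems with pandas, using the survey answers above:

```python
import pandas as pd
import numpy as np

pets = pd.Series(["Cat", "cat", "Dog", "parrot", "Tesla", np.nan])

print(pets.unique())  # surfaces all three problems at once

pets = pets.str.capitalize()           # "cat" -> "Cat", so cats are grouped together
pets = pets.replace("Tesla", np.nan)   # a Tesla is not a pet, so treat it as missing
pets = pets.dropna()                   # or impute/flag missing values, depending on your analysis
```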
Why you can’t calculate the mean for nominal data
It is not possible to calculate the mean, median, or percentile values for categorical variables such as car model or day of the week. However, for continuous variables like stock price or height, you can calculate those metrics.
Even if a nominal variable is represented numerically, like a zip code, you still can’t calculate mean, median, or percentile values. Zip codes are labels rather than quantities, so averaging them produces a number, like 7023.5, that doesn’t convey any useful information.
The mode, however, the zip code most commonly found in your dataset, does tell you something useful.
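A quick sketch with invented zip codes makes the contrast obvious:

```python
import pandas as pd

zip_codes = pd.Series([10001, 10001, 94103, 60601, 10001])

print(zip_codes.mean())  # 36941.4 -- a number, but a meaningless one
print(zip_codes.mode())  # 10001 -- the most common zip code, which is genuinely informative
```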
Exploratory data analysis for nominal variables
When conducting EDA for a nominal variable, after exploring and cleaning the data, you run a frequency analysis to show both the gross count for each category and the percentage of the dataset that each category makes up.
This can give you useful information, like understanding which models of car are most popular for your customers, or where visitors of an international festival have come from.
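In pandas, a frequency analysis is a single call, sketched here with invented car models:

```python
import pandas as pd

car_models = pd.Series(["Corolla", "Civic", "Corolla", "Model 3", "Corolla", "Civic"])

print(car_models.value_counts())                      # gross count per category
print(car_models.value_counts(normalize=True) * 100)  # percentage of the dataset per category
```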
Just remember that when reporting on information from your frequency analysis, you can only make statements about your data set. You cannot estimate the frequency of different car models in the broader population just from your sample data, for example. Exploratory analysis can only draw conclusions about the dataset it is working with.
Ordinal variables
During analysis, it’s important not to fall into the trap of treating ordinal variables such as the numbers on a die exactly as you would treat a continuous variable like height.
As an example, let’s say you have a die, and you’re playing a game with your friends. You want to know which number you are most likely to roll. Maybe you suspect that the die is loaded, and your friend has been cheating.
So you roll the die lots of times, and record each value. Now you need to check which value is most common.
Do you calculate the mean? No, because that doesn’t tell you which value is most common, it tells you the average. Moreover, it will give you a decimal value, like 2.5, which is not a number that exists on the die.
So while a die roll can be represented numerically, you can’t treat it like continuous data.
The analysis you need is a frequency analysis, which counts how many times each discrete value appears in your dataset. If you see the number 6 in 60 of your 120 rolls, equivalent to 50% of the time, then you might start to get suspicious that the die is in fact loaded, because you would expect to see it in only about 20 rolls, or 16.7% of the time.
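Here is a small sketch of that frequency analysis, using simulated rolls of a fair die so the code is self-contained:

```python
import random
from collections import Counter

random.seed(42)
rolls = [random.randint(1, 6) for _ in range(120)]  # 120 simulated rolls of a fair die

counts = Counter(rolls)
for face in sorted(counts):
    print(f"{face}: {counts[face]} rolls ({counts[face] / len(rolls):.1%})")
# A fair die should land near 20 rolls (about 16.7%) per face;
# a face at 50% over many rolls would be strong evidence of loading.
```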
Presenting count data as a rate
There are instances where you will be able to present your frequencies of observations as useful summary statistics in the form of rates. Let’s look at a grim example: the murder rate, which is sometimes called the homicide rate.
The homicide rate is calculated as follows: (Number of murders / total population)*100,000
Why is it multiplied by 100,000? Because that gives us the number of murders per 100,000 people. You could scale by a different number, such as per 1,000 or per 1,000,000, but for homicides the convention is per 100,000.
While the input to this equation is a discrete count, the number of murders, the result is a rate, for example “0.2 murders per 100,000 people”. That is roughly Japan’s homicide rate as of 2022, one of the lowest in the world.
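The calculation itself is trivial; the sketch below uses illustrative figures consistent with the rate quoted above (roughly 250 homicides in a population of about 125 million):

```python
def rate_per_100k(event_count: int, population: int) -> float:
    """Convert a raw count into a rate per 100,000 people."""
    return event_count / population * 100_000

print(round(rate_per_100k(250, 125_000_000), 2))  # 0.2 murders per 100,000 people
```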