Learn how to visually summarise your data for analysis and reporting
Histograms
Histograms show the distribution of a continuous variable in your sample. They help you visualize how frequently different values of that variable occur. Each observation in your sample should have a numerical value attached to it.
For example, a histogram could show the weights of the dogs in your neighborhood. Weight is a continuous variable, each dog is an observation in your sample, and the dogs in the neighborhood are your population of interest.
How does a histogram work?
A histogram groups your data into buckets and counts how many of your observations fall into each bucket. First, you set the ranges for the buckets. They might be 0-5kg, 5-10kg, 10-15kg and so on. Then you count the number of dogs with a weight within each range.
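The bucketing step can be sketched in a few lines of Python. This is only a minimal illustration: the dog weights are made up, the function name `bucket_counts` is hypothetical, and it assumes that a weight sitting exactly on a boundary (like 10.0kg) belongs to the higher bucket.

```python
from collections import Counter

# Hypothetical dog weights in kg (made-up illustration data).
weights_kg = [3.2, 4.8, 7.5, 9.9, 10.0, 12.4, 14.1, 6.3, 8.8, 22.5]
BUCKET_WIDTH = 5  # each bucket spans 5 kg: 0-5, 5-10, 10-15, ...

def bucket_counts(values, width):
    """Count the values in each bucket [start, start + width).

    Assumes non-negative values. Empty buckets are kept with a count of 0,
    so gaps in the data stay visible.
    """
    counts = Counter(int(value // width) * width for value in values)
    highest_start = max(counts)
    return {start: counts.get(start, 0)
            for start in range(0, highest_start + width, width)}

counts = bucket_counts(weights_kg, BUCKET_WIDTH)

# Print a rough text histogram, one row per bucket.
for start, count in counts.items():
    print(f"{start:>2}-{start + BUCKET_WIDTH:<2} kg | {'#' * count} ({count})")
```

Here the 15-20kg bucket is empty, which is exactly the kind of gap you would see in the finished histogram.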
These buckets need not be of equal width. If they are, then the number on the y-axis is equal to the frequency. If they are not, then the number on the y-axis is not the raw frequency, but the frequency density, which is the frequency divided by the width of the bucket. Keep that in mind when interpreting your histograms.
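To see why unequal buckets need frequency density, here is a small sketch. The bucket boundaries and dog counts are invented for illustration, and `frequency_density` is a hypothetical helper name.

```python
# Hypothetical unequal-width buckets: (lower bound in kg, upper bound in kg, number of dogs).
buckets = [(0, 5, 4), (5, 10, 10), (10, 20, 10), (20, 40, 4)]

def frequency_density(lower, upper, frequency):
    """Frequency density = frequency divided by bucket width."""
    return frequency / (upper - lower)

densities = [frequency_density(lower, upper, n) for lower, upper, n in buckets]

for (lower, upper, n), density in zip(buckets, densities):
    print(f"{lower:>2}-{upper:<2} kg: frequency {n:>2}, density {density:.1f} dogs per kg")
```

The 5-10kg and 10-20kg buckets hold the same number of dogs, but the wider bucket gets a bar half as tall. With frequency density it is the area of each bar, not its height, that matches the count.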
With a histogram you can easily visualize the average, how spread out the data is, and even whether it seems to be normally distributed, or skewed. These are all very important things you need to know about your data!
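The things you eyeball on a histogram can also be checked with numbers. The sketch below uses made-up dog weights and Python's standard `statistics` module. Comparing the mean with the median is only a rough rule of thumb for skew, not a formal test.

```python
import statistics

# Hypothetical dog weights in kg (made up): mostly small dogs plus one very large one.
weights_kg = [4, 6, 7, 8, 8, 9, 10, 12, 35]

mean_weight = statistics.mean(weights_kg)      # the average
median_weight = statistics.median(weights_kg)  # the middle value
spread = statistics.pstdev(weights_kg)         # population standard deviation

print(f"mean = {mean_weight}, median = {median_weight}, spread = {spread:.2f}")

# Rule of thumb: a mean well above the median hints at a long right tail.
if mean_weight > median_weight:
    print("The data may be skewed to the right.")
```

On a histogram of these weights, the single 35kg dog would show up as a lone bar far to the right of the others.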
Bar charts
Bar charts are the distant cousin of the histogram. However, bar charts are used for categorical data like the number of ginger cats in your neighborhood, and not continuous data like their weight.
Bar charts are a nice visual way to represent the frequency of each category in your categorical data, meaning how many times each one occurred in your dataset. You can see which categories were most common and which were least common, and you can see how many observations each one had.
Your data can be either nominal – where there’s no hierarchy, like car color – or ordinal – where there is a hierarchy, like educational attainment.
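Counting categories for a bar chart is simple to sketch with `collections.Counter`. The car colors below are invented for illustration.

```python
from collections import Counter

# Hypothetical observations of car colors (nominal data, made up for illustration).
car_colors = ["blue", "red", "white", "red", "white", "white", "blue", "white"]

color_counts = Counter(car_colors)

# Nominal categories have no natural order, so sorting by frequency is one common choice.
for color, count in color_counts.most_common():
    print(f"{color:<6} | {'#' * count} ({count})")
```

For ordinal data, such as educational attainment, you would usually keep the categories in their natural order instead of sorting them by count.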
Horizontal bar charts
Generally, on a bar chart the y-axis shows you how many observations you counted within that category. On the x-axis are the different categories in your data.
However, you can also have it the opposite way around to make a horizontal bar chart, with the categories on the y-axis and the counts on the x-axis. Histograms, which are used for continuous data rather than categorical data, are conventionally drawn with the frequency counts on the y-axis. With a bar chart, both horizontal and vertical layouts work equally well.
Differences between bar charts and histograms
Bar charts are used to count frequencies of things, like the number of blue, red, and white cars you see on the highway. A histogram doesn’t count members of a category in the same way that a bar chart does. It counts observations that have been measured first, like the weight of each dog in your neighborhood.
You’ll only ever see a gap in a histogram if there are no observations counted for that range. Otherwise, the bars of a histogram are always right up against one another and, by convention, vertical.
Values on a histogram are always ordered from lowest to highest. On the other hand, the bars on a bar chart can be ordered any way you please.
The basics of boxplots
A boxplot is a type of chart that is often used in data science to visually display a dataset’s distribution. It consists of a “box” that is defined by the upper and lower quartiles of the dataset.
The “whiskers” extending from the box show how far the rest of the data spreads beyond the box, not counting outliers, while the line through the center of the box represents the median. The Interquartile Range (IQR) is the distance between the upper and lower quartiles.
One of the key things that boxplots are used for is identifying outliers. Outliers are data points that are unusually high or low compared to the rest of the dataset. Boxplots usually depict outliers as points outside the whiskers, and it’s important to take note of them as they can skew your analysis.
Reading a boxplot
When reading a boxplot, it’s important to pay attention to the different components and what they represent. The median line in the center of the box will tell you the midpoint of the dataset, while the quartiles can tell you how the data is distributed. Within the ‘box’ is the middle 50% of data.
To calculate the whiskers, you’ll want to use the Interquartile Range (IQR). Typically, the upper whisker will be located at the smaller of either the maximum data value or Q3 + 1.5(IQR), where Q3 is the upper quartile. The lower whisker is typically located at the larger of either the minimum data value or Q1 – 1.5(IQR), where Q1 is the lower quartile.
As an example, if you have a dataset with a lower quartile of 20 and an upper quartile of 30, the IQR is 10. The upper whisker can then reach no higher than 45 (30 + 1.5(10)) and the lower whisker no lower than 5 (20 – 1.5(10)). The ends of the whiskers are known as the ‘maximum’ and ‘minimum’ points on your graph, but you may still have data points beyond these – they are known as ‘outliers’.
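The whisker rule described above can be sketched in Python. The sample values are made up and chosen so that the quartiles come out at 20 and 30, matching the example. The function name `boxplot_summary` is hypothetical, and the "inclusive" quartile method is an assumption: other methods can give slightly different quartiles.

```python
import statistics

# Hypothetical sample (made-up values), chosen so that Q1 = 20 and Q3 = 30.
values = [2, 12, 20, 22, 25, 27, 30, 38, 50]

def boxplot_summary(data):
    """Return quartiles, whisker positions and outliers using the 1.5 * IQR rule."""
    # "inclusive" is one of several ways to compute quartiles (an assumption here).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Larger of the minimum data value and Q1 - 1.5(IQR).
        "lower_whisker": max(min(data), lower_limit),
        # Smaller of the maximum data value and Q3 + 1.5(IQR).
        "upper_whisker": min(max(data), upper_limit),
        # Anything beyond the limits is treated as an outlier.
        "outliers": [v for v in data if v < lower_limit or v > upper_limit],
    }

summary = boxplot_summary(values)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Many plotting tools stop each whisker at the most extreme data point that lies inside these limits rather than at the limit itself, so a drawn whisker may be a little shorter than the value computed here.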
What is a scatter plot?
A scatter plot shows us the relationship between two continuous variables. It’s often the first step in visualizing correlations in your data. Correlation is the degree to which two variables move together, like how long you spend working out at the gym and how many calories you burn.
But sometimes things can be correlated without being directly related, like ice cream sales and shark attacks. Both happen to increase in summer, but one doesn’t cause the other.
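One common way to put a number on correlation is the Pearson correlation coefficient, which runs from -1 to 1. The sketch below computes it by hand on made-up gym data. `pearson_r` is a hypothetical helper name, and Pearson's r is just one of several correlation measures.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together, and how each varies on its own.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(variation_x * variation_y)

# Hypothetical gym sessions (made-up numbers).
gym_minutes = [20, 30, 45, 60, 90]
calories_burned = [150, 240, 330, 480, 700]

r = pearson_r(gym_minutes, calories_burned)
print(f"r = {r:.3f}")  # a value close to 1 means a strong positive correlation
```

A high value of r only tells you that the two variables rise and fall together. It says nothing about whether one causes the other.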
When to use a scatter plot
When should you use a scatter plot? Let’s say we have a sample of observations. For each observation we have two measurements – both should be continuous variables. As an example, we could have data on a group of mice: the weight of each mouse and how long it can swim before it becomes fatigued. We want to know if lighter mice can swim for longer than heavier mice.
To plot the data, we use a scatter plot. Each dot represents both the weight of the mouse and minutes spent swimming. If weight is on the X-axis, the horizontal line, then the further to the right the dot is, the more the mouse weighs. And the higher the dot is on the Y-axis, the line pointing vertically, the longer the mouse swims.
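To make the dot placement concrete, here is a rough text-only sketch that places each mouse on a character grid. The measurements are made up and `ascii_scatter` is a hypothetical helper. In practice you would normally use a plotting library, but the idea is the same: weight decides how far right a dot goes, and swimming time decides how high.

```python
# Hypothetical measurements for six mice: (weight in grams, minutes swum). Made-up data.
mice = [(18, 55), (20, 50), (22, 42), (25, 38), (28, 30), (30, 22)]

def ascii_scatter(points, width=30, height=10):
    """Render (x, y) points on a character grid: x grows rightwards, y grows upwards.

    Assumes the points do not all share the same x or the same y value.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top line, so flip vertically
    # Add a simple y-axis on the left and an x-axis underneath.
    return ["|" + "".join(line) for line in grid] + ["+" + "-" * width]

plot_lines = ascii_scatter(mice)
print("\n".join(plot_lines))
```

With this made-up data the dots fall from the top left to the bottom right, which is the pattern you would expect if lighter mice swim for longer.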
As an aside, mice can swim for a super long time. Some for over ten hours, because they’re naturally very buoyant. But please don’t go throwing any into the bathtub.