Step into the world of probability distributions – learn how real world events are modeled and visualized.

## What are probability distributions?

Probability distributions show you the spread of results from a sample population – the lowest and highest values observed and everything in between – as well as the likelihood of observing a particular value – for example, a person who is 175cm tall.

The height of the line on the y-axis (the vertical axis) shows how many observations had the corresponding value on the x-axis (the horizontal axis). In other words, it shows how frequently that value was observed.

Imagine you measured the height of everyone in your town. You would have a lot of people who were average height, and a few who were really tall or really short. A probability distribution allows you to easily visualize all this data.

## The Central Limit Theorem

The central limit theorem states that, whatever the shape of the population’s probability distribution (provided its variance is finite), the distribution of the sample means approaches a normal distribution as the sample size grows. This allows us to make inferences about the population’s mean and standard deviation – and even conduct statistical tests that assume normally distributed data.

## How the Central Limit Theorem works

It’s time to picture the Central Limit Theorem in action. Imagine that we have a population of 2000, and we’re going to take a sample from that population. Let’s say we have a sample size of 100 – just 5% of the population – and measure the length of human index fingers.

We’re going to record the mean of that sample. Then we’re going to take another sample of 100 random observations, and record the mean for that too. But first, we have to put the 100 observations we took back into the population: that’s called ‘sampling with replacement’. We do this lots of times, at least 30. And we end up with a lot of sample means.

If we plot these sample means as a distribution, no matter what shape the initial distribution was, we will end up with an approximately normally distributed set of sample means.
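The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 2000 finger lengths is made up (and deliberately skewed), the number of repeated samples (1000) is an arbitrary choice, and the helper name `sample_means` is hypothetical. The population list is never modified, so every sample is effectively put back before the next one is drawn.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Made-up, right-skewed population of 2000 index finger lengths (cm).
population = [5.0 + random.expovariate(1.0) for _ in range(2000)]

def sample_means(population, sample_size=100, n_samples=1000):
    """Draw n_samples random samples and return the mean of each one.

    The population is left untouched, so each sample is 'put back'
    before the next one is drawn.
    """
    return [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(n_samples)
    ]

means = sample_means(population)

pop_mean = statistics.mean(population)
centre = statistics.mean(means)
spread = statistics.stdev(means)

# For a roughly normal shape, about 95% of values lie within 2 standard deviations.
within_2sd = sum(abs(m - centre) <= 2 * spread for m in means) / len(means)

print(f"population mean:      {pop_mean:.3f}")
print(f"mean of sample means: {centre:.3f}")
print(f"share within 2 sd:    {within_2sd:.3f}")
```

Even though the made-up population is far from bell-shaped, the sample means cluster tightly and symmetrically around the population mean.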

## Probability Mass Function

The Probability Mass Function (PMF) tells you the probability of observing a particular discrete value.

An example of a discrete value might be the number of people in a family: 5 or 6. You won’t get a number in between, like 5.54, because you can’t have 5.54 people; a value like that could only belong to a continuous variable.

Unlike the Cumulative Distribution Function, the Probability Mass Function (PMF) doesn’t tell you the probability of seeing a value that is X or less. When it comes to a PMF, if you choose a value on the X-axis – the horizontal line – let’s say you chose 9, and find its corresponding Y-value – on the vertical line – then you have the probability that you will see a family of exactly 9 people out on your walk.

If you did the same on a Cumulative Distribution Function, you would get the probability that you saw a family of 9 people or fewer – which is much more likely than seeing a family of exactly 9 people.
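To make the difference concrete, here is a minimal sketch that builds a PMF and a CDF from a handful of observed family sizes. The data is invented purely for illustration.

```python
from collections import Counter
from itertools import accumulate

# Invented family sizes counted on a walk (illustrative data only).
family_sizes = [2, 3, 3, 4, 4, 4, 4, 5, 5, 9]

counts = Counter(family_sizes)
n = len(family_sizes)

# PMF: share of families with exactly k members.
pmf = {k: counts[k] / n for k in sorted(counts)}

# CDF: running total of the PMF, i.e. share of families with k members or fewer.
cdf = dict(zip(pmf, accumulate(pmf.values())))

print("P(exactly 9)   =", round(pmf[9], 3))
print("P(9 or fewer)  =", round(cdf[9], 3))
```

In this toy data set, a family of exactly 9 is rare, while a family of 9 or fewer covers every observation.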

## Probability Density Function

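The Probability Density Function (PDF) tells you how likely you are to observe a value near a given point – like a basketballer with a height of around 210cm – within a population – like all basketballers in the USA. Because the variable is continuous, the probability of any single exact value is zero; the probability of a range of values is the area under the curve over that range.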

You don’t have to measure every member of the population. A sample from your population of interest is enough, as long as the observations are randomly sampled – although the more observations you have, the better.

The Probability Density Function is used only for continuous variables like a person’s weight – for example, your friend that is 70.54kg. In comparison, an example of a discrete variable is a six-sided die – it can land on 1 or 2, but not 1.5.

Sometimes the PDF and the Probability Mass Function (PMF) get mixed up. They’re similar, just used for different types of variables: the PMF describes discrete probability distributions, and the PDF describes continuous ones.
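A short sketch can show what a density is and how it turns into a probability. It uses `statistics.NormalDist` from the Python standard library and assumes, purely for illustration, that the basketballers' heights are normally distributed with a mean of 200cm and a standard deviation of 9cm (both numbers are made up).

```python
from statistics import NormalDist

# Assumed, made-up height distribution for the basketballers (cm).
heights = NormalDist(mu=200, sigma=9)

# Height of the curve at 210 cm: a density, not a probability.
density_at_210 = heights.pdf(210)

# Probability of a height between 209 cm and 211 cm:
# the area under the curve over that range.
p_209_to_211 = heights.cdf(211) - heights.cdf(209)

print(f"density at 210 cm:       {density_at_210:.4f}")
print(f"P(209 cm to 211 cm):     {p_209_to_211:.4f}")
```

Over a narrow range, the probability is roughly the density multiplied by the width of the range, which is why a higher curve means values in that region are observed more often.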

## Cumulative Distribution Function

Unlike a Probability Mass Function (PMF) or a Probability Density Function (PDF), a Cumulative Distribution Function (CDF) doesn’t describe how likely one particular value on the X-axis is. Rather, it tells you the probability of observing that value or lower.

Let’s take rolling dice for example. If you roll a die many times and record each result, and create a CDF from your data, look at the 3 on the X-axis – the horizontal line – of the CDF and its corresponding value on the Y-axis – the vertical line. The number on the Y-axis is not the probability you will roll a 3, it’s the probability you will roll a 3 or lower.

Cumulative Distribution Functions (CDFs) can be used for both discrete variables – like the numbers on a die, which can be 1 or 2 but never 1.5 – and continuous variables, like your friend’s weight, which can be 60kg, 61kg, or anything in between, like 60.17kg.
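For a fair six-sided die, the dice example can be worked out exactly rather than from recorded rolls. The sketch below assumes every face is equally likely and uses exact fractions so the results are easy to read.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Fair die (an assumption): each face has probability 1/6.
pmf = {face: Fraction(1, 6) for face in faces}

def cdf(x):
    """Probability of rolling x or lower."""
    return sum(p for face, p in pmf.items() if face <= x)

print(pmf[3])  # 1/6: probability of rolling exactly 3
print(cdf(3))  # 1/2: probability of rolling 3 or lower
```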

## The Normal Distribution

The normal distribution is often called the ‘bell-shaped’ curve because it looks like, well, a bell. It is also known as the ‘Gaussian’ distribution, after mathematician Carl Friedrich Gauss.

Let’s think about human height for a moment. It’s pretty rare to see someone only 4ft tall, right? People that short do exist, you just don’t see a lot of them.

The same goes for really tall people, like basketballers.

Most people are average, or pretty close to average.

Human height is something that creates a bell curve when observations are randomly sampled from the population.

You have the left- and right-hand tails. These represent the really short people and the really tall people respectively. The curve is low there because there aren’t many of them. In fact, because the curve is symmetric, you have about as many people who are extremely tall as ones who are extremely short. Then, you have a large mass in the center, which represents average people. The further away from the center you go, the rarer it is to find someone with that height.
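The symmetry of the tails and the mass in the center can be checked numerically. This sketch assumes a made-up height distribution with a mean of 170cm and a standard deviation of 10cm; only the shape matters here, not the specific numbers.

```python
from statistics import NormalDist

# Assumed, made-up adult height distribution (cm).
height = NormalDist(mu=170, sigma=10)

p_very_short = height.cdf(150)                       # more than 2 sd below the mean
p_very_tall = 1 - height.cdf(190)                    # more than 2 sd above the mean
p_near_average = height.cdf(180) - height.cdf(160)   # within 1 sd of the mean

print(f"P(shorter than 150 cm):   {p_very_short:.4f}")
print(f"P(taller than 190 cm):    {p_very_tall:.4f}")
print(f"P(between 160 and 180):   {p_near_average:.4f}")
```

The two tail probabilities match, and most of the probability sits close to the mean.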

## Binomial distribution

A binomial distribution is a type of probability distribution that is used to describe the number of successes in a fixed number of trials, where each trial has only two possible outcomes: success or failure. It is named after the word “binomial”, which means “two names” in Latin, referring to the two possible outcomes of each trial.

For example, let’s say you flip a coin 10 times. The number of times the coin lands on heads can be described by a binomial distribution. Each flip is a trial, and the possible outcomes are heads or tails. The probability of getting heads on one flip is 0.5, and the probability of getting any particular number of heads in 10 flips is given by the binomial distribution.

One important thing to note is that in a binomial distribution, the trials are independent. This means that the outcome of one trial does not affect the outcome of another trial. For example, the fact that a coin landed on heads on the first flip does not affect the probability of it landing on heads on the second flip.

The binomial distribution is also characterized by two parameters, n and p. n is the number of trials, and p is the probability of success in each trial. Knowing these two parameters, we can calculate the probability of getting a certain number of successes in n trials. This can be useful in many real-life situations, such as in business, medicine, and engineering.
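Given n and p, the probability of exactly k successes is `C(n, k) * p^k * (1 - p)^(n - k)`, where `C(n, k)` counts the ways to choose which k trials succeed. A minimal sketch for the ten coin flips (the function name `binomial_pmf` is just a label chosen here):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Ten fair coin flips: probability of each possible number of heads.
n, p = 10, 0.5
for k in range(n + 1):
    print(f"{k:2d} heads: {binomial_pmf(k, n, p):.4f}")
```

Five heads is the single most likely result, yet it happens in only about a quarter of all ten-flip runs.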

## Assumptions for the binomial distribution

The binomial distribution rests on a few assumptions, which are easiest to picture by drawing red and blue balls from a bucket.

Firstly, the population should be fixed – meaning you don’t have new balls sneaking in or out – so the probability of success stays the same on every trial. Secondly, in a binomial distribution, the observations should be independent of each other.

That’s why we sample with replacement: if you removed a ball from the bucket and kept it out, you would change the population you are drawing from, and that affects the probability of drawing a ball of the same color, because there’s one fewer in there now. Finally, we assume there are only two possible outcomes – only red and blue balls, or only heads or tails on a coin.
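The effect of replacement is easy to see with exact fractions. The bucket contents below (3 red, 7 blue) are hypothetical numbers chosen for illustration.

```python
from fractions import Fraction

red, blue = 3, 7  # hypothetical bucket contents

# Probability the first ball drawn is red.
p_first_red = Fraction(red, red + blue)

# With replacement: the red ball goes back, so the bucket is unchanged.
p_second_red_with = Fraction(red, red + blue)

# Without replacement: one red ball is gone, so the probability shifts.
p_second_red_without = Fraction(red - 1, red + blue - 1)

print(p_first_red)           # 3/10
print(p_second_red_with)     # 3/10
print(p_second_red_without)  # 2/9
```

Only the draw with replacement keeps the probability of success constant from trial to trial, which is what the binomial distribution needs.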

## Poisson process and distribution

A Poisson process is a type of random process used to model the number of events that occur within a certain time interval. It’s named after the French mathematician Siméon Denis Poisson. The events can be anything, such as the number of phone calls received by a call center, the number of cars that pass by a certain point on the road, or the number of goals scored by a soccer team.

The Poisson process has two key features: the average rate at which events occur (lambda), and the fact that the time intervals between events are independent. For example, if a call center receives an average of 5 calls per minute, then the Poisson process would model the number of calls received in any given minute. The time between the calls received is independent, meaning the time between the first and second call does not affect the time between the second and third call.

The Poisson distribution is closely related to the Poisson process. It’s a probability distribution that describes the number of events that occur within a certain time interval. It’s determined by the lambda parameter, which is the average rate at which events occur. The Poisson distribution can tell us, for example, the probability of a call center receiving 6 calls in a minute, given that the average rate of calls is 5 per minute.
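The call center question can be answered with the Poisson formula `P(k) = lambda^k * e^(-lambda) / k!`. A minimal sketch, with `poisson_pmf` as a name chosen here for illustration:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events in an interval with average rate lam."""
    return lam**k * exp(-lam) / factorial(k)

# Average of 5 calls per minute: probability of exactly 6 calls in a minute.
p_six_calls = poisson_pmf(6, 5)
print(f"P(6 calls in a minute) = {p_six_calls:.4f}")
```

The result is a little under 15%, so six calls in a minute is quite plausible even though the average is five.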

## Weibull

Weibull analysis is used to model the amount of time it takes for an event, such as a failure, to occur. Unlike the Poisson process, which models the number of times an event occurs within a given time period, the Weibull distribution models the time until an event occurs. This is how manufacturers can estimate when spare parts will be needed for your new Toyota – before the existing ones fail.

We have the Weibull to thank for much of our machinery running as smoothly as it does. Without it, we’d really just be guessing when plane parts needed to be replaced. It’s also how technology and other companies have a pretty good idea what kind of warranty they can offer without losing lots of money!

With the Weibull you can find out how likely it is your machinery will fail at a certain time, the average life of your parts, the rate of failure – how many times you can expect it to fail during a specified timeframe – and how likely it is that your product will still be working at a certain point in time.
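These quantities all follow from the two-parameter Weibull distribution. The sketch below uses its standard formulas; the shape and scale values are made up, and the function names are labels chosen here. Note that "rate of failure" is taken to mean the hazard rate (the instantaneous failure rate for a part that has survived so far), which is an interpretation rather than something the text above spells out.

```python
from math import exp, gamma

def reliability(t, shape, scale):
    """Probability the part is still working at time t."""
    return exp(-((t / scale) ** shape))

def failure_probability(t, shape, scale):
    """Probability the part has failed by time t (the Weibull CDF)."""
    return 1 - reliability(t, shape, scale)

def hazard_rate(t, shape, scale):
    """Instantaneous failure rate at time t for a part that has survived to t."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def mean_life(shape, scale):
    """Average time to failure."""
    return scale * gamma(1 + 1 / shape)

# Made-up parameters: shape > 1 means failures become more likely with age.
shape, scale = 1.5, 1000.0  # scale in hours

print(f"P(still working at 500 h): {reliability(500, shape, scale):.3f}")
print(f"P(failed by 500 h):        {failure_probability(500, shape, scale):.3f}")
print(f"hazard rate at 500 h:      {hazard_rate(500, shape, scale):.5f} per hour")
print(f"mean life:                 {mean_life(shape, scale):.0f} h")
```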

## Bernoulli distribution

The Bernoulli distribution is another method of helping us model the probability of something happening or not happening. The Bernoulli distribution is a discrete probability distribution – meaning it can only take discrete values. Discrete values are things like categorical variables, or even the numbers on a die.

But the Bernoulli distribution can only have one trial, and two possible outcomes. As an example, a single coin flip fits this requirement, as it only has two possible outcomes. But, a Bernoulli distribution can be anything with two outcomes, like success and failure. For example, if you set your success criteria as rolling a six with a die, and failure as anything other than 6, even though there are 6 possible outcomes, you can frame it as a Bernoulli trial.

It is important in Bernoulli trials that the two outcomes are mutually exclusive – they can’t happen at the same time – and that trials are independent – a future outcome can’t be affected by a previous outcome. This is like rolling a die or flipping a coin.

The Bernoulli distribution really just assigns a probability to each of the two outcomes: if success has probability p, then failure has probability 1 - p.
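As a minimal sketch, the die example can be framed as a Bernoulli trial where 1 stands for success (rolling a six) and 0 for failure (anything else). The function name `bernoulli_pmf` is chosen here for illustration, and the die is assumed to be fair.

```python
def bernoulli_pmf(outcome, p):
    """Probability of an outcome, where 1 means success and 0 means failure."""
    if outcome not in (0, 1):
        raise ValueError("a Bernoulli trial has only two outcomes: 0 or 1")
    return p if outcome == 1 else 1 - p

p_six = 1 / 6  # success: rolling a six on a fair die

print(f"P(six)     = {bernoulli_pmf(1, p_six):.4f}")
print(f"P(not six) = {bernoulli_pmf(0, p_six):.4f}")
```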

## Pareto

You might have heard of the Pareto Principle before. The Pareto principle states that 20% of the inputs are responsible for 80% of the results. That means 20% of employees do 80% of the work, or 20% of goldmines hold 80% of all the gold, or 20% of the diamonds account for 80% of the diamond mass in the world.

The distribution comes from Vilfredo Pareto – hence the name – who noticed that wealth distribution in Italy followed this rule. It was 20% of the landowners who owned 80% of the land.

Pareto distributions are heavily skewed to the right and have heavy tails – very large outliers, like Elon Musk and his multi-billion dollar net worth, for example.

The majority of us have a lot less wealth than Elon Musk, and a similar amount to one another, so we are all clustered together on the left-hand side of the distribution. There are lots of people here, so most of the area under the curve is here, too. But Elon Musk sits waaaaay out to the right, with his very large net worth, creating skew.
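The 80/20 rule corresponds to one particular Pareto distribution. For a Pareto distribution with shape parameter alpha greater than 1, the share of the total held by the top fraction q works out to `q^(1 - 1/alpha)`. The sketch below assumes that standard form and shows that a shape of about 1.16 reproduces the 80/20 split; the function name `top_share` is a label chosen here.

```python
from math import log

def top_share(q, alpha):
    """Share of the total held by the top fraction q of a Pareto
    distribution with shape alpha (requires alpha > 1)."""
    return q ** (1 - 1 / alpha)

# Shape parameter that gives exactly an 80/20 split.
alpha_80_20 = log(5) / log(4)

print(f"alpha for 80/20:      {alpha_80_20:.3f}")
print(f"share of the top 20%: {top_share(0.20, alpha_80_20):.2f}")
print(f"share of the top 1%:  {top_share(0.01, alpha_80_20):.2f}")
```

A larger alpha gives a less extreme split, so the 80/20 rule is a special case rather than a property of every Pareto distribution.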