Samples and Populations

How to choose the subjects of your analysis, and avoid common errors.

Populations

A population is the whole group your research is interested in, like everyone in your city or country. It doesn’t have to be a group of people either; any collection of items works.

As an example, it could be all laptops of a specific model produced by a company in 2021. In this example, the time period depends on what you’re interested in finding out.

If you wanted to know the number of defects for just one year, then your population would be the laptops of that model produced in that year. If you wanted to know the number of defects for all time, then your population would be all laptops of that model ever produced.

These are some of the reasons that it’s important to know your research aims before you select your population.

Samples

In statistics, you often want to know more about an entire population, but to survey them all would be too expensive, or take too long, so you take a smaller sample instead.

A sample is the portion of that population that you gather data on for your research. As an example, imagine that there are 4 million people in your city, but it is only realistic for you to collect data on 400 people.

Your population is the 4 million people, and those 400 people are your sample. Even though you didn’t get data on everyone in your city, with good sampling and statistics you can make reliable inferences about the population based on what you observe in your sample.
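As a rough illustration of the idea above, here is a minimal Python sketch that draws a simple random sample of 400 from a simulated population and compares the two means. The heights, the population size of 100,000 (scaled down from 4 million to keep it quick), and the fixed seed are all assumptions made for the example, not real data.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: simulated heights in cm for 100,000 people.
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Simple random sample of 400 units, drawn without replacement.
sample = rng.sample(population, 400)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.1f} cm")
print(f"sample mean:     {sample_mean:.1f} cm")
```

With a random sample of this size, the two means should usually land close together, which is what lets you infer something about the whole population from only 400 people.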

Observations and Units

A population is the entire group you are interested in finding something out about. A sample is a smaller group we take from our population so that we can analyze and test the data we gather from it.

An observation is the term we use for one data point – meaning one element that we are observing within a sample. It is not to be confused with a unit within that sample – meaning one specific member of the group being observed.

Imagine you are conducting a study of people’s heights on a dating app – strictly for research purposes, of course. The population would be the group on that app you are interested in – perhaps ‘men’, or ‘women’.

The sample would be a selection of people within that group who you choose to gather data from. An observation would be the height recorded for one person in the sample. A unit would be one individual from the sample.
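One way to picture the difference between units and observations is as a small data structure. The sketch below is only illustrative: the `Unit` class, the `user_id` field, and the three heights are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """One individual in the sample (hypothetical structure)."""
    user_id: int
    height_cm: float  # the observation recorded for this unit


# The sample: a handful of units chosen from the population.
sample = [
    Unit(user_id=1, height_cm=172.0),
    Unit(user_id=2, height_cm=165.5),
    Unit(user_id=3, height_cm=180.2),
]

# The observations: one height per unit.
observations = [unit.height_cm for unit in sample]
print(observations)
```

Here each `Unit` is a person, and each entry in `observations` is a single data point taken from that person.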

Representative samples

A good sample is representative of the population you’re interested in studying and learning more about.

Look in the mirror, now back at me. Now look back in the mirror and then back at me again. A sample is representative when it mirrors or reflects the characteristics of the population which you would like to learn more about.


Was what you saw in the mirror representative of yourself? Sure it was… but was it representative of everyone in your neighborhood? Well… probably not.

To get a representative sample you would need to gather data from many more people. Ideally your sample is completely representative of the population you are studying, which is known as your ‘target’ population. So if you are conducting a study on how young people spend their time, you need a sample that is representative of everyone that fits your definition of ‘young’ – not just people from your immediate environment.

Generalization for statistics

Statistical generalization means using the results we obtained from a sample and inferring characteristics about a population from those results.

It’s important for a sample to be representative of its population so that we can generalize results from statistical testing to the population at large.

Otherwise, the population could be too different from our sample, and the pattern or effect we saw in our sample might not exist in the population.
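To make this concrete, the following Python sketch compares a random sample with a convenience sample drawn from only one subgroup. Everything in it is an assumption for illustration: the two subgroups (`students` and `workers`), the simulated hours spent online, and the sample size of 200.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical target population of young people: daily hours online.
students = [rng.gauss(5.0, 1.0) for _ in range(5_000)]
workers = [rng.gauss(2.0, 1.0) for _ in range(5_000)]
population = students + workers

# Random sample: every unit in the population has the same chance.
random_sample = rng.sample(population, 200)

# Convenience sample: only people from one subgroup are reachable.
convenience_sample = rng.sample(students, 200)

population_mean = statistics.mean(population)
random_error = abs(statistics.mean(random_sample) - population_mean)
convenience_error = abs(statistics.mean(convenience_sample) - population_mean)

print(f"error of random sample:      {random_error:.2f} hours")
print(f"error of convenience sample: {convenience_error:.2f} hours")
```

Both samples are the same size, but only the random one reflects the mix of the population, so an estimate generalized from the convenience sample would be off by a wide margin.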

Generalization for data science

Generalization for data science is how well a machine learning model – the algorithmic recipe you use to create predictions or classifications – adapts to new data it hasn’t seen yet.

The idea carries over because, just as we use samples in statistics, we can treat the data we use to train our machine learning models as a sample. We can’t possibly collect all the data that exists, and machine learning is often about making predictions or classifications on data that doesn’t even exist yet.

If your model generalizes well, then the results you see in the real world will closely match the results you saw in training and testing the model.

How samples can be used for training data models

In order to create accurate models using data science, it’s important to carefully select and prepare your data. Generally, you’ll want to start by creating a training dataset. This dataset will contain the samples of data that your model will learn from.

By analyzing these samples, your model will learn to identify patterns and make predictions. But to ensure your model is accurate, you’ll also need to test it against a separate dataset, called the test dataset.

The test dataset should contain data that wasn’t used in the training process. Because the test dataset also includes the correct value for the variable you are predicting, you can compare your predictions against the actual observations to measure how accurate they are.

This process of training and testing is critical to building a model that is both accurate and generalizable.
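The training and testing steps described above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the data is simulated, the 80/20 split is just one common choice, and the "model" is a plain least-squares line written by hand (`fit_line` and `mean_abs_error` are helper names invented for this sketch).

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical dataset: y depends on x plus some random noise.
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    y = 2.0 * x + 1.0 + rng.gauss(0, 1.0)
    data.append((x, y))

# Shuffle, then hold out 20% of the rows as the test dataset.
rng.shuffle(data)
split = len(data) * 4 // 5
train, test = data[:split], data[split:]


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(points, slope, intercept):
    """Average absolute gap between predictions and actual values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


# The model only ever learns from the training rows.
slope, intercept = fit_line(train)

train_mae = mean_abs_error(train, slope, intercept)
test_mae = mean_abs_error(test, slope, intercept)

print(f"training error: {train_mae:.2f}")
print(f"test error:     {test_mae:.2f}")
```

If the test error is close to the training error, that suggests the model generalizes to data it has not seen. A test error that is much larger than the training error is a sign that it may not.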