Learn how to manipulate and transform variables for statistics and data science
Feature engineering
Feature engineering is the process of finding, creating, and selecting the best data for your model or analysis. This is helpful for statistics and machine learning because using only your raw data might not be optimal for your model’s performance.
Just because you have ‘big data’ doesn’t mean you have to use it all. It’s a bit like baking: you don’t want to throw every ingredient in your kitchen into the cake just because it’s there. You can also often get better results by engineering your features – that is, changing or transforming them in some way.
The ingredients you use will determine whether you end up with some delicious results, or a culinary disaster. So, put your chef hat on – let’s get started!
Ordinal encoding
Ordinal encoding is not something you will have to do every time you run a statistical analysis or create a machine learning model. But, it is helpful to know, because many machine learning models require all inputs to be numeric. Ordinal encoding turns a categorical variable into a numerical one.
Wait, what? How is that even possible? Well, it’s a lot simpler than it might sound. For every option you have for your categorical variable, let’s say {‘High School’, ‘College’, ‘Bachelor’s Degree’, ‘Master’s Degree’, ‘PhD’} indicating your observation’s level of education, you assign a number that reflects its position in the order – for example ‘High School’ = 1, ‘College’ = 2, and so on up to ‘PhD’ = 5. So where your data previously looked like the table below:
After assigning each category option a number, it looks like this:
Ordinal encoding is simple, and easy to reverse. But, if your data is not ordinal in the first place, it will impose an ordinal relationship where one does not exist – for example, if your variable was instead car color or transport type. In this case, ‘one hot encoding’ might be a more suitable encoding option.
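If you work in Python, a minimal sketch of ordinal encoding might look like the following – the DataFrame and its ‘education’ column are made up purely for illustration:

```python
import pandas as pd

# Hypothetical data with an ordered categorical variable
df = pd.DataFrame({
    "education": ["College", "PhD", "High School", "Master's Degree", "Bachelor's Degree"]
})

# Define the order explicitly, lowest to highest
order = ["High School", "College", "Bachelor's Degree", "Master's Degree", "PhD"]
mapping = {level: rank for rank, level in enumerate(order, start=1)}

# Replace each category with its numeric rank (1 to 5)
df["education_encoded"] = df["education"].map(mapping)
print(df)
```

Because each number simply stands in for a category, reversing the encoding is just a matter of applying the mapping the other way around.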
One hot encoding
A common method used for feature encoding is called ‘one-hot encoding’. It is useful when your data is not ordinal, and you don’t want to use ordinal encoding and introduce ordinality where there is none. Basically, you take your categorical data that looks like this:
You then turn it into something like the data in the table below, where 1 equals “true” and 0 equals “false”. Each person now has a numerical value for true or false depending on whether they selected that option as their preferred mode of transport or not.
This is particularly useful because the result is purely binary, numerical data that computers and models can work with directly, without any artificial order being imposed on the categories.
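A small sketch of one-hot encoding, assuming a hypothetical ‘transport’ column and made-up names, using pandas’ get_dummies:

```python
import pandas as pd

# Hypothetical preferred-transport data
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "transport": ["Bicycle", "Car", "Train"]
})

# One column per category; 1 = "true", 0 = "false"
encoded = pd.get_dummies(df, columns=["transport"], dtype=int)
print(encoded)
```

Each row now contains a 1 in exactly one of the new transport columns and 0 everywhere else.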
Feature Scaling
Feature scaling is the process of altering your data in some way through normalization or standardization so that your variables sit on a comparable scale. For example, it allows you to prescribe the minimum value, the maximum value, or the variance of each variable, so that they all cover a similar range.
Feature scaling is useful because whenever distances are used for calculations and conclusions within a machine learning algorithm, there is the possibility that one variable can dominate another due to its sheer scale, rather than its importance. Some algorithms benefit from feature scaling because they use Euclidean distance as a measurement for comparison. The Euclidean distance is just the length of a straight line drawn between two points.
For example, age and salary are measured on very different scales. Age can reach just over 100, while salary can reach multiple millions. The range of distances possible for one variable is therefore much greater than for the other.
So, we feature scale to give every variable a fair chance at influencing results – to show us what really is most important, and where the relationships are. This stops the biggest bully in the dataset from having all the say.
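To see the problem in numbers, here is a small sketch with made-up age and salary values, where salary swamps the Euclidean distance until the variables are rescaled:

```python
import numpy as np

# Two made-up people: (age, salary)
a = np.array([25, 40_000])
b = np.array([65, 42_000])

# Euclidean distance: the 40-year age gap barely registers
# next to the 2,000-unit salary gap
print(np.linalg.norm(a - b))  # ~2000.4

# After putting each variable on a comparable scale,
# the age difference matters again
a_scaled = np.array([25 / 100, 40_000 / 2_000_000])
b_scaled = np.array([65 / 100, 42_000 / 2_000_000])
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.40
```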
Normalization
Normalizing the values in your distribution rescales them so that they are all between 0 and 1. While previously you might have had income data that ranged from $10,000 to $2,500,000, normalization puts it on a much simpler scale. However, by doing this, you will effectively lose the outliers in your dataset, since they get squashed into the same 0 to 1 range as everything else.
Normalization is otherwise known as min-max scaling, and by looking at the equation below, you will see why. You use the minimum and maximum values of your variable of interest to normalize each datapoint.
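As a rough sketch, assuming a simple array of made-up incomes, the min-max calculation can be written out directly:

```python
import numpy as np

# Hypothetical income data on a wide scale
incomes = np.array([10_000, 55_000, 120_000, 2_500_000], dtype=float)

# Min-max normalization: (x - min) / (max - min)
normalized = (incomes - incomes.min()) / (incomes.max() - incomes.min())

# Every value now lies between 0 and 1; note how the large
# outlier squashes the other values towards 0
print(normalized)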
Pros and Cons of Normalization
You should use normalization when your data is not normally distributed, and your model does not make assumptions about the distribution of your data.
There are some cons to normalization – for example, you will lose your outliers, which may have been important for an understanding of your data. You also lose your original values: they are still there, but on a different scale, so you can’t really interpret the new values in terms of the original measurement units, like feet or liters.
Standardization
Standardizing your data rescales it so that it has a mean of 0 and a standard deviation of 1 – the same scale as the standard normal distribution. It’s useful when the model you intend to use requires that your data be normally distributed and on similar scales. We do this by subtracting the sample mean from each datapoint, and dividing the result by the standard deviation.
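A minimal sketch of standardization, assuming some made-up height measurements:

```python
import numpy as np

# Hypothetical heights in centimetres
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

# Standardization: subtract the mean, divide by the standard deviation
standardized = (heights - heights.mean()) / heights.std()

print(standardized.mean())  # ~0
print(standardized.std())   # 1
```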
Pros and Cons of Standardization
You should use standardization rather than normalization when your data is normally distributed, or when it has outliers. This is because with normalization, you will lose your outliers.
There are some cons of standardization – for example, you lose your original values. They are still there, but on a different scale – so you can’t really interpret the new values in terms of the original measurement variable, like centimetres or dollars.
Standardization should also be used if you plan to run statistical tests like the Analysis of Variance – ANOVA – or fit models like regularized linear and logistic regression. Linear models assume that your residuals – the distances between your line of best fit and your observed values – are normally distributed, and regularized models penalize coefficients as if every feature were on the same scale.
Which models / algorithms need feature scaling
There are many models that work by computing distances between data points – if the scales used vary then the results obtained from these models won’t be accurate. This is why we scale our data via normalization or standardization to create uniformity between variables.
Some examples of models that rely on computing distance include: K Nearest Neighbors – KNN – a supervised machine learning algorithm that classifies new data based on its distance to existing data points for which we already know the category.
Support Vector Machines – SVM – are also supervised algorithms that use distance to separate, group, and classify data points. And finally, K-means clustering is an unsupervised machine learning algorithm, meaning you don’t need to have labelled data – it will find patterns in the data for you based on distances.
Other examples of algorithms sensitive to variables with different ranges include dimension reduction algorithms such as Principal Components Analysis.
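As a sketch of how this looks in practice – assuming scikit-learn is available, and using made-up age and salary features – scaling is often bundled with a distance-based model in a single pipeline:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Made-up features on very different scales: age and salary
X = np.array([[25, 40_000], [30, 42_000], [60, 41_000], [62, 39_000]])
y = np.array([0, 0, 1, 1])

# Scaling happens inside the pipeline, so every feature
# contributes to the Euclidean distance on equal terms
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[58, 40_500]]))
```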
Dummy Encoding
Dummy encoding is used for regression models – models that make predictions about one value based on another – when one of the predictor variables is categorical. For example, you could predict your exam score, which is a continuous variable, based on your favorite Kinnu tile, which is a categorical variable. Without dummy encoding, the model’s coefficients cannot be calculated for a categorical predictor.
Because a regression analysis requires numerical inputs, we need to transform our categorical variable into numbers. Dummy encoding enables us to do that. For example, let’s say our data looks like this:
Once dummy encoding has been performed on the data above, our data will now be represented numerically like so:
But, you might notice that one of our options is missing. Where did ‘Private Jet’ go? No, it didn’t take off to the Maldives! It was dropped because if all three remaining columns equal 0, we know that the value for ‘Private Jet’ must be ‘true’ – 1. This is the case with Elon in our data above. Dropping one category like this avoids perfect multicollinearity – the so-called ‘dummy variable trap’ – which is what allows the regression coefficients to be calculated.
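As a final sketch, assuming a pandas DataFrame with a hypothetical ‘transport’ column and made-up names (apart from Elon), dummy encoding with one dropped category can be done like this:

```python
import pandas as pd

# Hypothetical preferred-transport data
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan", "Elon"],
    "transport": ["Bicycle", "Car", "Train", "Private Jet"],
})

# Make the column categorical so that 'Private Jet' is the first
# level, matching the example above; drop_first then drops it
df["transport"] = pd.Categorical(
    df["transport"],
    categories=["Private Jet", "Bicycle", "Car", "Train"],
)

# One dummy column per remaining category; a row of all zeros
# means the dropped category ('Private Jet') is the true one
dummies = pd.get_dummies(df, columns=["transport"], drop_first=True, dtype=int)
print(dummies)
```

Note that by default pandas drops whichever category comes first, so making the column categorical with an explicit order is just one way to control which option becomes the implied baseline.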