In machine learning projects, models generally work well if the proportions of the classes of the dataset used are relatively similar. However, imbalances sometimes enter the equation. This tutorial will examine the synthetic minority over-sampling technique and other methods.
From a technical point of view, an imbalanced data set (or unbalanced dataset) designates any dataset where the proportions of the different classes are not strictly identical.
However, in data science, the term usually refers to datasets where one or more classes are extremely small compared to the others. For example, a dataset where 98% of the data belongs to class “A” against only 2% to class “B” is a highly unbalanced dataset.
The notion of “class imbalance” is fundamental in machine learning, particularly for supervised models that involve two or more classes.
Why is this important?
As a general rule, most models work well if the proportions of the classes in a dataset are relatively similar, and slight class imbalances are handled well. Past a certain point, however, machine learning models struggle to identify the minority class(es) correctly. Yet this situation is frequently encountered in a variety of real problems: detection of fraud or defects, medical diagnosis, etc.
If we take the typical example of the classification of fraudulent emails (spam, scam, phishing, malware, etc.), then only a very small proportion of them turn out to be fraudulent. This type of email is therefore rare in the datasets, and the models have difficulty classifying it: they learn a bias towards the majority class (“non-fraudulent” email) and tend to always predict it.
Accuracy is often used as a performance indicator. However, it is ill-suited to unbalanced datasets: a model can obtain a very good accuracy score without having learned anything, simply by playing it safe and always predicting the majority class. This metric can therefore be misleading because it does not reflect what the model has actually learned.
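To make this concrete, here is a minimal sketch in plain Python. The labels are hypothetical (98 legitimate emails and 2 fraudulent ones, mirroring the 98%/2% example above), and the "model" is simply a rule that always predicts the majority class.

```python
# Hypothetical labels: 98 legitimate emails (0) and 2 fraudulent ones (1).
y_true = [0] * 98 + [1] * 2

# A "model" that has learned nothing and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of observations whose prediction matches the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of the actual fraud cases that the model managed to flag.
fraud_total = sum(t == 1 for t in y_true)
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_detection_rate = fraud_caught / fraud_total

print(f"accuracy: {accuracy:.2f}")                    # accuracy: 0.98
print(f"fraud detected: {fraud_detection_rate:.2f}")  # fraud detected: 0.00
```

The accuracy looks excellent even though not a single fraudulent email was detected.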
The different steps and/or methods to follow
There are various techniques that can be used to compensate for an imbalance of classes in a dataset. We present them here along with some tips:
First of all, it is necessary to check that the imbalance of the dataset is representative of what we expect to find in reality, and therefore of the data that will be presented to the model once it is in production. In other words, we must make sure that the distribution of the data is not an artifact of, for example, poor filtering upstream (during the data collection or cleaning phases). If it is, then it is worth checking whether more data can be collected. For instance, in the case of temporal data, does going further into the past increase the proportion of the minority classes? Or does one of the filters applied during the data cleaning phase disproportionately remove a particular class?
It may also be possible to group minority classes into a single class if they have similarities. And, if a more precise classification of minority classes is essential, it will always be possible to retrain another model downstream on the predictions of the first model. This solution, which involves training two separate models, can sometimes be the simplest.
It is very important to choose a metric adapted to this type of problem because, as we have seen above, accuracy measurement is not sufficient. Therefore, to clarify ideas about these measures, we introduce some terms here:
True Positive (TP): These are the positive observations that have been predicted correctly by the model (or, to simplify, the observations predicted as being “yes” and being true “yes”).
True Negative (TN): similarly, these are negative observations correctly predicted by the model (observations predicted as “no” and actually being “no”).
False Positive (FP): these are the negative observations predicted (erroneously) by the model as being positive (an actual “no” predicted as a “yes”).
False Negative (FN): these are the positive observations predicted as being negative by the model (an actual “yes” predicted as a “no”).
Accuracy, which is the most intuitive model performance measure, can be defined using these terms: it is simply the ratio of correctly predicted observations to total observations, i.e., accuracy = (TP + TN) / (TP + TN + FP + FN). It is a very effective metric in the case of balanced datasets.
For unbalanced datasets, other standard metrics are better suited to measuring model performance.
The combination of Precision, Recall, and F1-score
Precision is the ratio of correctly predicted positive observations to the total number of positive predicted observations, that is, P = TP / (TP + FP). This metric allows us to see how “correct” our positive predictions are.
Recall (or Sensitivity) is the ratio of correctly predicted positive observations to all truly positive observations, i.e., R = TP / (TP + FN). This metric measures how well we capture all the true positives in our predictions.
The F1-score is the harmonic mean of Precision and Recall. Its expression is: F1 = 2 * P * R / (P + R). This metric is generally more useful than accuracy on unbalanced datasets because it takes into account both false positives and false negatives.
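As an illustration, the three formulas can be computed directly from the counts. This is a small sketch in plain Python: the function name and the example counts are hypothetical, and returning 0.0 when a ratio is undefined is a convention chosen here, not a universal rule.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 frauds caught, 2 false alarms, 12 frauds missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=12)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.80 recall=0.40 f1=0.53
```

Here the positive predictions are mostly right (high Precision), but most frauds are missed (low Recall), and the F1-score reflects this.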
Confusion matrices are tables that visualize the performance of a model by displaying the measurements of TP, TN, FP, and FN. All observations that fall on the diagonal of the matrix were correctly predicted by the model, while observations that fall outside the diagonal correspond to errors in the model. Therefore, a perfect model would have all of its predictions on the diagonal of a confusion matrix.
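For a two-class problem, such a matrix can be built by hand in a few lines. The sketch below is in plain Python; the function name, the row/column layout (rows are actual classes, columns are predicted classes) and the sample labels are choices made for this illustration.

```python
def confusion_matrix_2x2(y_true, y_pred, positive=1):
    """Return [[TN, FP], [FN, TP]]: rows are actual classes, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        row = 1 if actual == positive else 0
        col = 1 if predicted == positive else 0
        matrix[row][col] += 1
    return matrix


# Hypothetical labels for eight observations.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

matrix = confusion_matrix_2x2(y_true, y_pred)
for row in matrix:
    print(row)
# [4, 1]
# [1, 2]

# Observations on the diagonal (TN and TP) are the correct predictions.
correct = matrix[0][0] + matrix[1][1]
```

In this example, 6 of the 8 observations fall on the diagonal; the two remaining cells hold one false positive and one false negative.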
Random oversampling is the most straightforward sampling technique for balancing an unbalanced dataset. It balances the data by replicating samples of the minority classes. This does not cause any loss of information, but a model trained on the resulting dataset is prone to overfitting because the same information is copied several times.
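A minimal sketch of random oversampling, using only the Python standard library, could look as follows. The function name and the toy data are hypothetical, and the sketch assumes that every class is brought up to the size of the largest one.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_samples, new_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # Candidates for duplication: the original samples of this class.
        pool = [s for s, lab in zip(samples, labels) if lab == cls]
        for _ in range(target - count):
            new_samples.append(rng.choice(pool))
            new_labels.append(cls)
    return new_samples, new_labels


# Hypothetical one-feature dataset: six samples of class "A", two of class "B".
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [5.0], [5.5]]
y = ["A"] * 6 + ["B"] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 6 samples
```

Every added sample is an exact copy of an existing one, which is why a model can end up memorizing those few points.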
Random oversampling is prone to overfitting because the samples of the minority classes are merely replicated. This is where SMOTE comes into play. SMOTE stands for Synthetic Minority Oversampling Technique. Instead of copying existing samples, it creates new synthetic samples to balance the dataset.
SMOTE works by using a k-nearest neighbors algorithm to create synthetic data. It proceeds through the following steps:
Identify a minority-class feature vector and one of its nearest neighbors.
Calculate the distance between the two sample points.
Multiply the distance by a random number between 0 and 1.
Identify a new point on the line segment between them, at the calculated distance from the feature vector.
Repeat the process for the other identified feature vectors.
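The steps above can be sketched in plain Python as follows. This is a simplified illustration and not the reference implementation: the function name, the default k=3, the random choice of the starting vector and the toy points are all assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are needed")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Step 1: pick a feature vector and one of its k nearest neighbors.
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Steps 2 and 3: scale the gap between the two points by a random factor.
        gap = rng.random()
        # Step 4: the new point lies on the segment from base towards neighbor.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote_sketch(minority, n_new=5)
print(len(new_points))  # 5
```

Because each new point is an interpolation between two real minority samples, it is similar to the existing data without being an exact copy.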
Popular Libraries and Packages Used to Tackle This Problem
In Python, the imbalanced-learn package is an excellent option.
It features undersampling and oversampling techniques and has utilities for Keras and TensorFlow, as well as functions for computing metrics.
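As a brief sketch of how this looks in practice, the snippet below wraps the package's SMOTE class in a small helper. It assumes that imbalanced-learn (and scikit-learn, for the demo) is installed; the helper names and the generated 98%/2% dataset are hypothetical and only serve the illustration.

```python
def balance_with_smote(X, y, random_state=0):
    """Oversample the minority classes with SMOTE (requires imbalanced-learn)."""
    # Imported lazily so that defining this helper does not require the package.
    from imblearn.over_sampling import SMOTE

    sampler = SMOTE(random_state=random_state)
    return sampler.fit_resample(X, y)


def demo():
    """Build a hypothetical unbalanced dataset and rebalance it (requires scikit-learn)."""
    from collections import Counter
    from sklearn.datasets import make_classification

    # Roughly 98% of the samples in class 0 and 2% in class 1.
    X, y = make_classification(n_samples=1000, weights=[0.98, 0.02], random_state=0)
    X_res, y_res = balance_with_smote(X, y)
    print("before:", Counter(y))
    print("after: ", Counter(y_res))
```

Calling demo() prints the class counts before and after resampling. Resampling should only be applied to the training data, so that the evaluation data keeps the real class distribution.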