Date: 2022-04-04 09:00
Read: 7 min

Cross Validation in Machine Learning: What You Need to Know

What exactly is cross-validation, and why is it important in machine learning? Learn more about its importance and how it can improve a machine learning model's accuracy in production.

Introduction

Let's start at the beginning: What is cross-validation? What does cross-validation mean in machine learning?

Cross-validation is a statistical technique used to estimate a machine learning model's overall accuracy. It is a valuable tool that data scientists regularly use to see how different machine learning (ML) models perform on a given dataset, in order to determine the most suitable model.

Cross-validation (CV) is a simple concept to understand and implement, which makes it a convenient method for comparing the predictive capabilities of different ML models.

This article will explain what CV is, its significance to data science, and how it can greatly benefit an ML model's overall accuracy.

What exactly is Cross-Validation?

CV is a technique used to train and evaluate an ML model on several portions of a dataset. Rather than splitting the dataset into just two parts, one to train on and one to test on, the dataset is divided into more slices, or "folds". The model is then trained and tested across these folds to measure its predictive capability, and hence its accuracy.

Cross-Validation Data Flow Overview

Different portions are combined to build a training set, and the remaining ones form a validation set. This process ensures that the model is evaluated on data it has not seen at each training and testing step.

CV is employed to guard a model against overfitting: a scenario in which the model merely "learns by heart" the samples it has seen in the training set. Such a model maintains a near-perfect score on the training data but is unable to produce useful predictions on unseen data.
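
To make this concrete, here is a rough, illustrative sketch (assuming scikit-learn and one of its bundled datasets; the estimator choice is arbitrary) that compares a model's accuracy on its own training data with its cross-validated accuracy. A large gap between the two scores is a typical sign that the model has memorized the training samples.

# Illustrative sketch: spotting overfitting by comparing training and CV accuracy (assumes scikit-learn)
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Training accuracy: fit on all the data and score on the same data
train_score = model.fit(X, y).score(X, y)

# Cross-validated accuracy: average score over five held-out folds
cv_score = cross_val_score(model, X, y, cv=5).mean()

# A near-perfect training score with a noticeably lower CV score points to overfitting
print(f"training accuracy: {train_score:.3f}, cross-validated accuracy: {cv_score:.3f}")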

The three key steps involved in CV are as follows:

  1. Slice the dataset and reserve one portion of it as the test set,

  2. Train the ML model on the remaining portion,

  3. Evaluate the model on the reserved portion, then repeat the process with a different split, as dictated by the chosen CV technique.

The Advantages of CV are as follows:

  1. CV assists in finding the optimal tuning of hyperparameters (or model settings) that increases the overall efficiency of the ML model (a short sketch follows this list),

  2. Training data is used efficiently, as every observation is employed for both training and testing.
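
To illustrate the first advantage, the following is a hedged sketch (assuming scikit-learn; the estimator and the parameter grid are purely illustrative): GridSearchCV runs k-fold CV for every hyperparameter combination and keeps the setting with the best average fold score.

# Illustrative sketch: CV-driven hyperparameter tuning with GridSearchCV (assumes scikit-learn)
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every candidate (C, kernel) pair is scored with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
search.fit(X, y)

# The combination with the highest mean CV score is retained
print(search.best_params_, search.best_score_)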

The Disadvantages of CV are as follows:

  1. Increases training and testing time: CV significantly increases the time required to train an ML model, owing to the numerous test cycles that must be run, along with preparing the tests and analyzing the results.

  2. Additional computation equates to additional resources: CV is computationally expensive and requires surplus processing power. Combined with the extra time noted in the first disadvantage, this requirement adds further cost to an ML model project's budget.

Two Types of Cross-Validation

So, how do you cross-validate in machine learning? CV testing methods fall into two general categories: non-exhaustive and exhaustive methods.

The five key types of CV in ML are:

  1. Holdout Method,

  2. K-Fold CV,

  3. Stratified K-Fold CV,

  4. Leave-P-Out CV,

  5. Leave-One-Out CV.

Holdout Method

The holdout method is a basic CV approach in which the original dataset is divided into two discrete segments:

  1. Training Data, and 

  2. Testing Data.

The Hold-out method splits the dataset into two portions

As a non-exhaustive method, the holdout approach trains the ML model on the training dataset and evaluates it on the testing dataset.

In most cases, the training dataset is much larger than the test dataset, so a standard holdout split ratio is 70:30 or 80:20. Furthermore, the overall dataset is randomly shuffled before being divided into training and test portions at the chosen ratio.

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# a holds the feature matrix and b the labels; model is any scikit-learn estimator

# Use a holdout split with a 70:30 ratio by setting train_size=0.7
a1, a2, b1, b2 = train_test_split(a, b, random_state=0, train_size=0.7)

# fit the model on the training portion
model.fit(a1, b1)

# evaluate the model on the held-out portion
b2_model = model.predict(a2)
print(accuracy_score(b2, b2_model))

Holdout cross-validation python code example

There are disadvantages to the holdout method:

  1. As the model trains on different combinations of data points, it can display inconsistent results. This casts doubt on the validity of the model and of the overall validation process.

  2. There is no certainty that the selected training dataset represents the complete dataset. If the original data sample is not large enough, the test data may contain patterns the model fails to recognize, simply because they were not included in the training portion.

However, the Holdout CV method is ideal if time is a scarce project resource, and there is an urgency to train and test an ML model using a large dataset.

K fold Cross-Validation

The k-fold cross-validation method is an improvement of the holdout method. It provides additional consistency to the ML model's overall testing score. This is due to how the training and testing datasets are selected and then divided.

The original dataset is divided into k number of partitions, and the holdout method is performed k number of occasions.

Let us look at an example: if the value of k is set to six, the data is split into six subsets, or folds, of equal size. In the first iteration, the model trains on five of the folds and is validated on the remaining one. In the second iteration, a different fold is held out for testing and the model is re-trained on the other five, and so on for six iterations in total.

Diagrammatically this is shown as follows:

The k-fold cross-validation randomly splits the original dataset into k number of folds

The test results of each iteration are then averaged out, which is called the CV accuracy. Finally, CV accuracy is employed as a performance metric to contrast and compare the efficiencies of different ML models.

It is important to note that the value of k is arbitrary; however, it is commonly set to ten within the data science field.

The k-fold cross-validation approach normally produces less biased ML models, because every data point from the original dataset appears in both the training and testing sets. The k-fold method is ideal when a data science project has a limited amount of data.

The k fold method will likely be time-consuming because the algorithm has to be rerun k times from the beginning.

The following is an excerpt of python code for k fold cross-validation testing:

# import the KFold splitter from the sklearn package
from sklearn.model_selection import KFold

# set the value of k to 7; shuffle the data before splitting
data_input = KFold(n_splits=7, shuffle=True, random_state=0)

k fold cross-validation python code example
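
To make the averaging step concrete, here is a hedged sketch (assuming scikit-learn and its bundled iris dataset; the estimator choice is illustrative) that scores a model on each of the k folds with cross_val_score and reports the mean as the CV accuracy.

# Illustrative sketch: k-fold CV accuracy as the mean of the per-fold scores (assumes scikit-learn)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Shuffle once, then split into k = 7 folds; each fold serves as the test set exactly once
kfold = KFold(n_splits=7, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)

print("per-fold scores:", scores)
print("CV accuracy (mean):", scores.mean())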

Stratified K Fold Cross-Validation

By using k-fold cross-validation, there is a possibility that the folds become disproportionate: the random shuffling and splitting of the data can leave some folds with a very different class balance from others. This can bias the training activity and produce an erroneous ML model.

To avoid such circumstances, the folds are built using a procedure called stratification. In stratification, the data is rearranged so that each fold is a satisfactory miniature of the entire dataset.

Stratified k-fold maintains the same class percentages within every fold as in the original dataset, which makes it suitable for training on datasets with minority classes.

The following is a python code excerpt using stratified k fold cross-validation for testing and training:

This Python example runs a stratified 4-fold cross-validation on a dataset with 150 samples from two unbalanced classes. The code prints the number of samples from each class in every fold and then compares the result with a plain KFold split.

 
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np

# 150 samples: 145 of class 0 and 5 of class 1 (unbalanced)
X, y = np.ones((150, 1)), np.hstack(([0] * 145, [1] * 5))

skfl = StratifiedKFold(n_splits=4)
for training, testing in skfl.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[training]), np.bincount(y[testing])))

# compare with a plain (unstratified) 4-fold split
kfld = KFold(n_splits=4)
for training, testing in kfld.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[training]), np.bincount(y[testing])))

Stratified k fold cross-validation python code example


Leave P-Out Cross-Validation

Leave-p-out is an exhaustive CV method in which p data points are removed from the total of n data samples.

The model is then trained on n − p data points and later tested on the p held-out data points. This process is repeated for every possible combination of p points from the original dataset.

Finally, the results of each iteration are averaged out to obtain the cross-validation accuracy.
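
For plain leave-p-out over individual samples, scikit-learn provides the LeavePOut splitter. The short sketch below (using a tiny illustrative dataset) enumerates every combination of p = 2 held-out points; the example after it uses the group-based variant instead.

# Illustrative sketch: LeavePOut holds out every combination of p samples (assumes scikit-learn)
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(4)
lpo = LeavePOut(p=2)

# With n = 4 samples and p = 2, there are C(4, 2) = 6 train/test splits
for train, test in lpo.split(X):
    print("train -", train, "| test -", test)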

In the Python code below, the LeavePGroupsOut function removes the samples belonging to p groups from each training set and places them in the test set.

In this example, we leave 2 groups out:

 
>>> import numpy as np
>>> from sklearn.model_selection import LeavePGroupsOut

>>> X = np.arange(6)
>>> y = [1, 2, 1, 2, 1, 2]

>>> groupings = [1, 1, 2, 2, 3, 3]

>>> lpgop = LeavePGroupsOut(n_groups=2)

>>> for train, test in lpgop.split(X, y, groups=groupings):
...     print("%s %s" % (train, test))

# Output from the LeavePGroupsOut call: the two left-out groups form each test set
[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]

Leave-P-Groups-Out cross-validation python code example

Leave-One-Out Cross-Validation

The leave-one-out cross-validation approach is a special case of the leave-p-out technique in which the value of p is set to one.

This method is less exhaustive than leave-p-out; however, it can still be time-consuming and expensive to execute, because the ML model has to be fitted n times.
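
For reference, the plain leave-one-out version, in which every single sample is held out in turn, can be sketched with scikit-learn's LeaveOneOut splitter (the tiny dataset below is purely illustrative); the example that follows it uses the group-wise variant, LeaveOneGroupOut.

# Illustrative sketch: LeaveOneOut holds out exactly one sample per split (assumes scikit-learn)
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(4)
loo = LeaveOneOut()

# With n = 4 samples, the model would be fitted n = 4 times
for train, test in loo.split(X):
    print("train -", train, "| test -", test)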

>>> from sklearn.model_selection import LeaveOneGroupOut

>>> X = [2, 8, 10, 40, 50, 60, 70]
>>> y = [0, 1, 1, 1, 2, 2, 2]
>>> groupings = [1, 1, 1, 2, 3, 3, 3]

>>> logos = LeaveOneGroupOut()

>>> for train, test in logos.split(X, y, groups=groupings):
...     print("%s %s" % (train, test))

# Output: each group is left out once as the test set
[3 4 5 6] [0 1 2]
[0 1 2 4 5 6] [3]
[0 1 2 3] [4 5 6]

Leave-One-Group-Out cross-validation python code example

What is Rolling Cross Validation?

The CV methods above are not ideal for time-series data, for two reasons:

  1. Randomly shuffling the data disrupts the order of events within the dataset, rendering the time ordering meaningless,

  2. With standard CV, there is a possibility that the ML model trains on future data and tests on past data, which breaks the fundamental rule of time-series modeling: 'peeking into the future is not permissible'.


Rolling cross-validation is a technique used to resolve these data preparation issues:

  1. The folds, or subsets, are created in a forward-chaining manner,

  2. With time-series data covering a period of p years, the data can be divided year by year into n subsets or folds.


The rolling CV data subsets are created as follows:

  1. iteration 1: training [a], test [b]

  2. iteration 2: training [a b], test [c]

  3. iteration 3: training [a b c], test [d]

  4. iteration 4: training [a b c d], test [e]

  5. iteration 5: training [a b c d e], test [f]


With this data representation, in the first iteration the ML model trains on the data from year 1 and is tested on year 2, and so on. Each iteration adds one more year of data to the training set, with the following year used for testing.
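
scikit-learn's TimeSeriesSplit implements this forward-chaining behavior. The sketch below (using a hypothetical six-period series) reproduces the five iterations listed above: each split trains on all earlier periods and tests on the next one.

# Illustrative sketch: forward-chaining (rolling) CV with TimeSeriesSplit (assumes scikit-learn)
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six consecutive periods, e.g. years a..f, kept in chronological order
X = np.arange(6).reshape(-1, 1)

# Five splits: the training window grows by one period each iteration
tscv = TimeSeriesSplit(n_splits=5)
for train, test in tscv.split(X):
    print("train -", train, "| test -", test)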

Conclusion

In summary, CV is a powerful tool for comparing and assessing the performance of ML models. It allows data scientists to make maximum use of their training and testing data, and it provides valuable information about how an ML model's algorithm performs.

Machine learning workflows involve training a model on labeled data in a local environment (commonly called offline mode). Even when an ML model achieves high accuracy on this offline data, data scientists need to know whether similar accuracy is achievable on new, relevant data.

CV requires several partitionings of a dataset into training and testing sets. Different CV techniques perform different partitioning processes on the original dataset, and the testing phase involves multiple iterations of model training on different subsets of the training and testing data.

In this article, we examined what cross-validation is and its significance for ML model predictability and overall precision. We then looked at the high-level steps for performing CV and the various exhaustive and non-exhaustive methods commonly used in the data science industry, along with associated Python code examples.
