Loading
Loading
  • Solutions
  • Company
  • Resources
  • Docs

Cross Validation in Machine Learning: What You Need to Know

What exactly is Cross Validation, and why is it important in Machine learning? Learn more about its importance and how it can improve a Machine Learning models' accuracy in production.

Cross Validation in Machine Learning: What You Need to Know

Introduction

Let's start at the beginning: What is cross-validation? In machine learning, cross-validation is a statistical technique employed to estimate the overall accuracy of a machine learning model. It serves as a valuable tool that data scientists regularly use to understand how different ML models perform on specific datasets, enabling them to determine the most suitable one.

Cross-validation (CV) is not only a simple concept to comprehend and execute, but also a highly convenient method for comparing the predictive capabilities of distinct ML models. By repeatedly dividing the available data into subsets for training and testing, and then evaluating the model's performance across these subsets, CV provides a comprehensive and robust assessment of its effectiveness.

In this article, we will delve into further detail about what exactly cross-validation is, its significance to data science, and how its application can greatly enhance the overall accuracy and reliability of machine learning models. Understanding and harnessing the power of cross-validation can truly make a difference in the field of ML and data-driven decision making.

What exactly is Cross-Validation?

CV is a technique used to train and evaluate an ML model using several portions of a dataset. This implies that rather than splitting the dataset into two parts only, one to train on and another to test on, the dataset is divided into more slices, or “folds”. And these slices use CV techniques to train the ML model so as to test its predictive capability and hence accuracy.

cross-validation-data-flow-overview

Cross-Validation Data Flow Overview

In the process of building a training set, different portions of data are gathered, while the remaining ones are reserved for constructing a validation set. This strategic approach ensures that the model continuously leverages new and diverse data during training and testing stages, promoting its ability to adapt to various scenarios and challenges.

One key objective of employing cross-validation is to safeguard the model against overfitting. Overfitting occurs when a model simply memorizes the samples in the training set, resulting in an artificially high predictive test score. However, such a model may struggle to generalize well on unseen data, leading to a lack of useful results. By validating the model's performance on a separate validation set, CV helps identify if the model has truly learned meaningful patterns and can generalize to new and unseen scenarios effectively.

The three key steps involved in CV are as follows:

  1. Slice and reserve portions of the dataset for the training set,

  2. Using what's left, test the ML model.

  3. Use CV techniques to test the model using the reserve portions of the dataset created in step 1.

The Advantages of CV are as follows:

  1. CV assists in realizing the optimal tuning of hyperparameters (or model settings) that increase the overall efficiency of the ML model's performance.

  2. Training data is efficiently utilized as every observation is employed for both testing and training.

The Disadvantages of CV are as follows:

  1. One of the main considerations with computer vision (CV) is the significant increase in testing and training time it requires for machine learning models. This is because CV involves multiple iterative testing cycles to ensure the accuracy and efficiency of the model.

    It includes various steps such as test preparation, execution, and rigorous analysis of the results to fine-tune and optimize the CV system. Therefore, understanding the time commitment involved in CV development is crucial for effectively leveraging its potential benefits.

  2. Additional computation translates to increased resource demands. Cross Validation is known for its high computational expense, necessitating ample processing power. This results in the first drawback of extended time, which further inflates the budgetary requirements for an ML model project.

Two Types of Cross-Validation

Cross validation in machine learning is a crucial technique for evaluating the performance of predictive models. It involves dividing the available data into multiple subsets, or folds, to train and test the model iteratively.

Non-exhaustive methods, such as k-fold cross-validation, randomly partition the data into k subsets and train the model on k-1 folds while evaluating it on the remaining fold.

On the other hand, exhaustive methods, like leave-one-out cross-validation, systematically leave out one data point at a time for testing while training the model on the remaining data points.

These methods provide a comprehensive assessment of the model's performance and help in addressing overfitting or underfitting issues effectively.

The five key types of CV in ML are:

  1. Holdout Method

  2. K-Fold CV

  3. Stratified K-Fold CV

  4. Leave-P-Out CV

  5. Leave-One-Out CV

Holdout Method

The holdout method is a basic CV approach in which the original dataset is divided into two discrete segments:

  1. Training Data - As a reminder this set is used to fit and train the model.

  2. Test Data - This set is used to evaluate the model.

    the-hold-out-method-splits-the-dataset-into-two-portions

The Hold-out method splits the dataset into two portions

As a non-exhaustive method, the Hold-out model 'trains' the ML model on the training dataset and evaluates the ML model using the testing dataset.

In the majority of cases, the size of the training dataset is typically much larger than the test dataset. Therefore, a standard holdout method split ratio is 70:30 or 80:20. Furthermore, the overall dataset is randomly rearranged before dividing it into the training and test set portions using the predetermined ratio.

Holdout cross-validation python code example

from sklearn.cross_validation import train_test_split
 
# Use Holdout data split with a 70:30 ratio set – using a value of 0.7
 
a1, a2, b1, b2 = train_test_split(a, b, random_state=0, train_size=0.7)
 
# fit the model on one set of data
 
model.fit(a1, b1)
 
# evaluate the model on the second set of data
 
b2_model = model.predict(a2)
accuracy_score(b2, b2_model)

There are several disadvantages to the holdout method that need to be considered. One drawback is that as the model trains on distinct combinations of data points, it can sometimes yield inconsistent results, which can introduce doubt into the validity of the model and the overall validation process.

Another concern is that there is no certainty that the training dataset selected fully represents the complete dataset. If the original data sample is not large enough, there is a possibility that the test data may contain information that the model will fail to recognize because it was not included in the original training data portion.

However, despite these limitations, the Holdout CV method can be considered ideal in situations where time is a scarce project resource and there is an urgency to train and test an ML model using a large dataset.

K fold Cross-Validation

The k-fold cross-validation method is considered an improvement over the holdout method due to its ability to provide additional consistency to the overall testing score of machine learning models. This improvement is achieved by applying a specific procedure for selecting and dividing the training and testing datasets.

To implement k-fold cross-validation, the original dataset is divided into k number of partitions. The holdout method is then performed k number of occasions, each time using a different partition as the testing set, while the remaining partitions are used for training. This repeated process helps to obtain a more reliable and robust evaluation of the model's performance by leveraging a larger amount of data for testing and training purposes.

Let us look at an example: if the value of k is set to six, there will be six subsets of equivalent sizes or folds of data. In the first iteration, the model trains on one subset and validates on the other. In the second iteration, the model re-trains on another subset and then is tested on the remaining subset. And so on for six iterations in total.

Diagrammatically this is shown as follows:

the-k-fold-cross-validation-randomly-splits-the-original-dataset-into-k-number-of-foldsThe k-fold cross-validation randomly splits the original dataset into k number of folds

The test results of each iteration are then averaged out, which is called the CV accuracy. Finally, CV accuracy is employed as a performance metric to contrast and compare the efficiencies of different ML models.

It is important to note that the value of k is incidental or random. However, the k value is commonly set to ten within the data science field. 

The k-fold cross-validation approach is widely recognized for generating ML models with reduced subjectivity. By ensuring that each data point is present in both testing and training datasets, this technique enhances the objectivity of the models.

Moreover, the k-fold method proves to be particularly advantageous for data science projects with a finite amount of data. It maximizes the utilization of available data by repeatedly utilizing different subsets for testing and training, leading to more comprehensive and reliable model evaluation.

However, it is worth noting that the k-fold approach can be time-consuming. The algorithm needs to be rerun k times, starting from scratch for each iteration. While it adds computational overhead, the meticulousness of this method is crucial for obtaining accurate and robust results.

K Cross-Validation Python Code Example

# import cross validation library from the sklearn package
from sclera import cross_validation
  
# set the value of k to 7
Data_input = cross_validation.KFold(len(training_set), n_folds=7, indices=True)

Stratified K Fold Cross-Validation

By using k-fold cross-validation, there is a possibility that the subsets, or folds, might become disproportional. Such an imbalance can lead to biased outcomes during the training process, resulting in an erroneous machine learning model. The issue stems from the random shuffling of data and the subsequent splitting into folds using k-fold cross-validation.

To mitigate this, a procedure called stratification is employed. In stratification, the dataset is shuffled in such a way that each fold represents a satisfactory miniature of the entire dataset. This approach ensures that the class proportions are maintained consistently across all the folds, based on the original number of data classes.

Hence, stratified k-fold cross-validation provides a more robust and reliable framework for training models, especially when dealing with minority class datasets.

Stratified K Cross-Validation Python Code Example

The following is a python code excerpt using stratified k fold cross-validation for testing and training:

This python example uses a stratified 4-fold cross-validation function call on a dataset with 150 samples from two unbalanced classes. The code shows the number of samples in each class and is then compared with KFold.

 
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as npy
 
X, y = npy.ones((150, 1)), npy.hstack(([0] * 50, [1] * 5))
skfl = StratifiedKFold(n_splits=4)
for training, testing in skfl.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        npy.bincount(y[training]), npy.bincount(y[testing])))
 
# set the value of k to 4
kfld = KFold(n_splits=4)
 
for training, testing in kfld.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        npy.bincount(y[training]), npy.bincount(y[testing])))
Get started

Learn more!

Discover how training data can make or break your AI projects, and how to implement the Data Centric AI philosophy in your ML projects.

Leave P-Out Cross-Validation

Leave p-out is an exhaustive cross-validation (CV) method commonly used in machine learning. In this technique, a fixed number of data points, denoted by p, are systematically excluded from the total number of data samples represented by n. By doing so, the model is trained on n-p data points and later tested on p data points. This process is repeated for all possible combinations of p from the original dataset sample, leading to a comprehensive evaluation of the model's performance.

The advantage of leave p-out cross-validation is that it offers an unbiased estimation of model performance by considering various subsets of the data. It helps to assess the robustness and generalizability of the model under different scenarios. Finally, the results obtained from each iteration are averaged out to obtain the cross-validation accuracy, providing a reliable measure of the model's effectiveness.

Overall, leave p-out cross-validation is a powerful technique that ensures a thorough evaluation of a model's performance and enhances the reliability of its results.

Leave P-Out Cross-Validation Python Code Example

In the python code below, the LeavePGroupsOut function call removes samples related to groups for each training and test dataset.

In this example, we Group 2 Out:

 
>>> from sklearn.model_selection import LeavePGroupsOut
 
>>> X = np.arange(6)
>>> y = [1, 2, 1, 2, 1, 2]
 
>>> groupings = [1, 1, 2, 2, 3, 3]
 
>>> lpgop = LeavePGroupsOut(n_groups=2)
 
>>> for train, test in lpgop.split(X, y, groups=groupings):
...     print( "%s %s" % (train, test) )
 
# Output from the LeavePGroupsOut function call
 
[3 4] [0 1 2 3]
[2 3] [0 1 4 4]
[0 1] [2 2 4 5]

Leave-One-Out Cross-Validation

The leave-one-out cross-validation approach is a simple version of the Leave p-out technique. In this CV technique, the value of p is assigned to one.

Leave-One-Out Cross-Validation Python Code Example

This method is slightly less exhaustive; however, the execution of this method can be time-consuming and expensive. This is due to the ML model being fitted n number of times.

>>> from sklearn.model_selection import LeaveOneGroupOut
 
>>> X = [2, 8, 10, 40, 50, 60, 70]
>>> y = [0, 1, 1, 1, 2, 2, 2]
>>> groupings = [1, 1, 1, 2, 3, 3, 3]
 
>>> logos = LeaveOneGroupOut()
 
>>> for train, test in logos.split(X, y, groups=groupings):
 
     print( "%s %s" % (train, test) )
 
[1 2 3 4 5] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]

When it comes to making decisions about which cross-validation technique to use, the choice between Leave-P-Out and Leave-One-Out depends on the specific requirements of the problem at hand.

Leave-P-Out is typically employed when you want to train your model using all but a specific number P of samples from a dataset. This technique can be useful when you have limited computational resources or when the size of the dataset is large.

On the other hand, Leave-One-Out is commonly used in situations where you want to evaluate the performance of your model by leaving out just one sample at a time. This approach is particularly beneficial when dealing with smaller datasets or when you want to closely examine the impact of each individual sample on your model's performance.

What is Rolling Cross Validation?

The N-Out CV methods above are not ideal for time series based data, and there are two reasons for this assertion:

  1. Ambiguously rearranging the data, without a clear and logical structure, disrupts the chronological order of events within the dataset. As a result, the combined data becomes inoperable and completely useless for any meaningful analysis or insights. It is crucial to maintain the integrity of data by preserving its original order and organization.

  2. When utilizing CV (cross-validation), there is a potential risk of training the ML model on future data and evaluating it on past data. This scenario violates the fundamental principle of using time series data for model processing, as it involves "peeking into the future," which is strictly forbidden. Such practice can lead to biased and unrealistic performance estimates. Therefore, it is crucial to exercise caution and adhere to proper data handling techniques to ensure reliable and accurate model evaluation in time series analysis.

Rolling cross-validation is a powerful technique employed to address various data preparation issues:

  1. This method involves the creation of folds or subsets in a forward-chain-enabling manner, allowing for a comprehensive evaluation of the model's performance across different subsets of data. By iteratively updating the training and test sets, rolling cross-validation ensures a more robust and reliable assessment of the model's effectiveness.

  2. In the case of time-series data spanning a period of p years, a clever approach is to divide the data yearly into n number of subsets or folds.

The rolling CV data subsets are created as follows:

  • Iteration 1: training [a], test [b]

  • Iteration 2: training [a b], test [c]

With this data representation, and within the first iteration, the ML model trains on the data of year 1, and then tests it on year 2 and so on. Each iteration represents one year of rolling CV-based time-series data for training and testing.

Rolling Cross-Validation Python Code Example

# Import necessary libraries
from sklearn.model_selection import TimeSeriesSplit
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.base import BaseEstimator
from typing import List, Dict, Union, Callable

def rolling_cross_validation(
        model: BaseEstimator, 
        X: np.ndarray, 
        y: np.ndarray, 
        n_splits: int = 3, 
        metrics: Union[List[Callable], Callable] = mean_squared_error,
        verbose: bool = False
    ) -> Dict[str, List[float]]:
    """
    Perform rolling cross-validation on time-series data.
    
    Parameters:
    - model: The machine learning model to use (should have fit and predict methods)
    - X: The feature matrix
    - y: The target vector
    - n_splits: The number of splits for TimeSeriesSplit
    - metrics: A metric or list of metrics to calculate
    - verbose: Whether to print debugging information
    
    Returns:
    A dictionary containing lists of metric values for each split.
    """
    # Initialize TimeSeriesSplit
    tscv = TimeSeriesSplit(n_splits=n_splits)
    
    # If only a single metric is provided, put it in a list
    if callable(metrics):
        metrics = [metrics]
        
    # Initialize dictionary to hold metric results
    metric_results = {metric.__name__: [] for metric in metrics}
    
    for train_index, test_index in tscv.split(X):
        if verbose:
            print(f"Train indices: {train_index}, Test indices: {test_index}")
        
        # Create train/test splits
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        # Fit the model
        model.fit(X_train, y_train)
        
        # Make predictions
        y_pred = model.predict(X_test)
        
        # Calculate and store metric(s)
        for metric in metrics:
            metric_value = metric(y_test, y_pred)
            metric_results[metric.__name__].append(metric_value)
            
        if verbose:
            print(f"Metrics: {[f'{key}: {value[-1]}' for key, value in metric_results.items()]}\n")
    
    return metric_results

# Example usage
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8], [8, 9], [9, 10]])
y = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10])

model = LinearRegression()
results = rolling_cross_validation(model, X, y, n_splits=3, verbose=True)
results

Blocked Rolling Cross Validation

Blocked cross-validation works by dividing the time series into blocks of data. The model is then trained on a subset of blocks and tested on the remaining blocks. This process is repeated multiple times, using different blocks for training and testing each time.

In blocked rolling cross-validation, the idea is similar to rolling cross validation, but there might be a "gap" or "block" between the training and test sets.

  • Iteration 1: training [a], test [c]

  • Iteration 2: training [a b], test [d]

The key difference is that in standard rolling cross-validation, the test set immediately follows the training set, whereas in blocked rolling cross-validation, there might be a gap between the training and test sets.

However, the main difference between blocked cross-validation and rolling cross-validation is the size of the blocks. Blocked cross-validation uses larger blocks of data, while rolling cross-validation uses smaller blocks.

Blocked cross-validation is less likely to suffer from overfitting than rolling cross-validation. This is because the model is not trained on the most recent data, which is the data that is most likely to be relevant to the future. However, blocked cross-validation can be less efficient than rolling cross-validation, as it requires more data to be used for training.

The best method to use depends on the specific time series forecasting problem. If overfitting is a concern, then blocked cross-validation is a good choice. If efficiency is a concern, then rolling cross-validation is a good choice.

Best practices

Use multiple metrics

While a single metric like accuracy might be easy to interpret, it may not provide a complete picture of your model's performance. Consider using multiple metrics like precision, recall, F1-score, and AUC-ROC to get a more comprehensive understanding.

As a reminder:

  • Precision is the proportion of true positive predictions among the total positive predictions and is normally used for spam detection use cases.

  • Recall is the proportion of true positive predictions among the actual positive instances. This is typically used for medical diagnoses where missing a positive instance can be costly.

  • F1-Score measures the harmonic mean of precision and recall, and is useful when you want a balance between precision and recall and there is an uneven class distribution.

  • AUC-ROC is the area under the curve plotted with true positive rate (sensitivity) versus the false positive rate (1-specificity). These are normally for binary classification problems to understand the trade-off between the true positive rate and false positive rate.

Why Use Multiple Metrics?

  1. Comprehensive Evaluation: Different metrics focus on different aspects of a model's performance, providing a more rounded view.

  2. Imbalanced Data: In cases of imbalanced datasets, accuracy can be misleading. Metrics like precision, recall, or F1-score can provide a better insight into the minority class's performance.

  3. Cost-Sensitivity: Depending on the application, the cost of false positives and false negatives can be different. Using multiple metrics allows you to tailor your model according to these specific needs.

  4. Model Comparison: When comparing multiple models, looking at a range of metrics can provide a more robust basis for selection.

Keep Data Leakage in Check

Data leakage occurs when your model accidentally gets access to the target variable during training, resulting in overly optimistic performance metrics that do not generalize well to new data. This can happen, for example, when features that are derived from the target variable are inadvertently included in the training data, misleading the model into "cheating" and inflating its effectiveness. Detecting and preventing data leakage is crucial for ensuring the reliability and trustworthiness of machine learning models.

To prevent data leakage in cross-validation, it is crucial to follow certain practices.

  • Firstly, when performing feature engineering or data preprocessing, ensure that these steps are learned solely from the training set and not the entire dataset. This ensures that the model does not inadvertently get information from the test or validation sets.

  • Secondly, employing pipelines is an effective approach as it confines the preprocessing steps within each fold of the cross-validation, thus preventing errors that can arise from sharing information across folds.

  • Lastly, when working with temporal data, it is important to take precautions such as not using future data to predict past events, as this can introduce bias and compromise the validity of the results.

Evaluate on an Independent Test Set

Cross-validation provides valuable insights into the performance of your model on unseen data. However, to ensure the robustness of your model's performance, it is a recommended practice to evaluate it on a completely independent test set.

This test set should not have been utilized during the training process or included in the cross-validation. By doing so, you can obtain a more comprehensive understanding of how your model performs on truly unseen data, enhancing the reliability of your results.

How to Use an Independent Test Set?

  • Before starting the cross-validation process, set aside a portion of your data as the test set.

  • After you've finalized your model using cross-validation, evaluate its performance on this independent test set to get a final measure of its generalizability.

Interpret Results Carefully

When interpreting the results obtained from cross-validation, it's crucial to carefully consider the context of the specific problem you're aiming to solve. For instance, while a 95% accuracy may be deemed excellent in certain applications, it could potentially be deemed inadequate in others, such as medical diagnoses.

Metrics are just one piece of the puzzle. To comprehensively evaluate the effectiveness of a model, consider additional factors. For instance, weigh the cost associated with false positives against false negatives as well as assess the interpretability of the model. Furthermore, the deployment and long-term maintenance of the model should also be taken into account for a more comprehensive evaluation. By considering these broader aspects, a more nuanced interpretation of the results can be achieved.

Conclusion

In summary, CV is used to compare and assess the execution of ML models, model performance, and it is a powerful and influential tool. It allows data scientists to maximize the use of their training and testing data, and it provides valuable information regarding ML model algorithm performance. 

Machine learning approaches incorporate model training on labeled training data in a local environment (commonly called offline mode). Despite an ML model achieving high accuracy results on this training or offline data, data scientists need to know whether similar accuracy is achievable with new and relevant data. 

CV requires several partitioning of a dataset into training and testing datasets. inverse CV techniques will have different partitioning processes performed on the original dataset. In addition, the CV testing phase will incorporate multiple iterations of ML model training using different subsets of both training and testing data.

Get started

Get Started

Get started! Build better data, now.