# Faster training with active learning

**Faster Training With Kili Technology : Active Learning**

It is no secret that machine learning models, especially deep learning models, need lots of training data. In the real world, unsupervised data is plenty while supervised data is rare and costly to obtain. Thus, you may be interested in using ** active learning**. It is the task of choosing the data samples to annotate to minimize the number of annotations required to achieve some performance. Since most real world applications are budget-constrained, active learning (AL) can help decrease the annotation cost.

At Kili Technology, we develop an annotation platform, to quickly get a production ready dataset covering almost all use cases. One of the features of our __Python API__ is the ability to change the priority of samples to annotate. Coupled with our query mechanism, which allows to import labels into your scripts, it allows the use of active learning algorithms. We show you how to do that __here__.

In this blog post, we will provide an overview of the current state (05/2020) of active learning :

what is it ?

what are typical algorithms ?

how to apply it to your use cases ?

This blog post is the first of a series on active-learning.

**1. Introduction**

The most common case, where active learning will be most useful, is the situation where you have lots of unsupervised training data quickly available. In this situation, you can iteratively select a subset of labels to annotate. You need three components :

A machine learning model that learns with supervised data $(X_l, y_l)$

An algorithm, which given a trained model and unlabeled data $X_u$, sends a

*query*: a subsample $X_s$ of this unlabeled data which will be the most beneficial to the training of a new model : $X_s \subseteq X_u$An

*oracle*, which is most probably you, the human annotator, that returns the labels $y_s$.

Then, the training process is as follows (see above schema).

You begin by training a model on preliminary labeled data.

You run the active learning algorithm on the remaining unsupervised data. It queries samples to annotate.

You annotate the queried data.

You retrain your model with this newly acquired data.

Repeat steps 2-4 until your model has good enough performance.

In the following, we will talk about the **active learning algorithms** allowing this training

process. Two references for the following work can be found here :

__Settles, Burr (2010). "Active Learning Literature Survey"__: reference review for active learning.To get up to date with the state of the art :

__Papers with code - active learning__

**2. Common algorithms**

**2.1 Uncertainty based methods**

The first class of active learning algorithms is based on **uncertainty methods**. They require that the model **$m$ is a probabilistic learner**, that is it outputs prediction probabilities : $m(x) = (\mathbb{P}(y_i|x))_{1 \leq i \leq n}$ if there are $n$ classes. Those are often the most used types of active learning algorithms, because they are easily usable with neural networks whose last layer has a softmax activation, representing class-probabilities. There are three main algorithms :

The simplest idea is to query assets for which we are the

*least confident*about. Given a sample $x$, the model predicts $\hat{y}$ with probability $\underset{y}{\operatorname{argmax}} \mathbb{P}(y|x)$. We then return the sample(s) for which we are the least sure (you can control the amount of assets queried).

$$

X_s = \underset{x}{\operatorname{argmax}} 1 - \mathbb{P}(\hat{y}|x)

$$Doing this discards all information about other class probabilities. One way to account for this is to use the margin between the predicted class $\hat{y}_1$ and the second top prediction $\hat{y}_2$. This is called

*margin sampling*$$

X_s = \underset{x}{\operatorname{argmin}} \mathbb{P}(\hat{y}_1|x) - \mathbb{P}(\hat{y}_2|x)

$$Finally, to use in the optimal way all labels, it is often best to use

*entropy*: it returns the assets for which the distribution of the predictions has the largest entropy. Its main difference with the two previous algorithms is that it won't return assets for which one label is very unlikely.

$$

X_s = \underset{x}{\operatorname{argmax}} -\sum_y \mathbb{P}(y|x)\log\mathbb{P}(y|x)

$$

To conclude, if you are interested in reducing the loss, use entropy-based uncertainty sampling, if you are interested in reducing the classification error, use margin sampling.

**2.2 Committee-based methods**

A second class of algorithms are **committee-based**. They require:

An ensemble of trained models $(m_1, ..., m_K)$

A way to compute disagreement between models $x \mapsto D(m_1, ..., m_K)(x)$

If your machine learning models are neural network, you can either :

Train $K$ models with different initializations. Thanks to the probabilistic nature of training neural networks, it will create an ensemble of different models.

Train a Bayesian neural network

Use dropout to sample neural networks with different weights.

To compute disagreement, you can use *vote-entropy* :

$$

X_s = \underset{x}{\operatorname{argmax}} -\sum_y \frac{N_y(x)}{K}\log\frac{N_y(x)}{K}

$$

where $N_y(x)$ is the number of models that predicted class $y$ for the sample $x$.

You can also use the Kullback-Leibler divergence :

$$

X_s = \underset{x}{\operatorname{argmax}} \frac{1}{K} \sum_{k=1}^{K} D(m_k||m)(x)

$$

where $m(x)$ is the mean model prediction : $m(x) = \frac{1}{K} \sum_{k=1}^{K} m_k(x)$ and the KL divergence is computed as such : $D(m_k||m)(x) = \sum_y \mathbb{P}_k(y|x) \log \frac{\mathbb{P}_k(y|x)}{\mathbb{P}(y|x)}$. You don't need dozens of models, in most cases having between $3$ and $5$ models is enough.

**2.3 Global methods**

A third class of algorithms requires that the model *is trained with a gradient descent*. Since currently neural networks are trained this way, those methods are applicable for deep learning. Those methods are less prone to fail against outliers and have proven experimentally to be effective, however they can be computationally expensive.

**Expected model change**

The first method computes a quantity called the **expected model change**, which, for a sample $x$, quantifies by how much the model would change if we added this sample to the training set. The question the algorithm answers is : which new labeled sample would minimize the prediction error if we performed one optimization step with it ? A common way to compute this quantity is :

First, get predictions $\mathbb{P}(y|x)$.

Then, for a class $c$ (which simulates a label we don't have access to), compute the gradient loss $\nabla l(\delta_c, \mathbb{P}(y|x))$ of the model, where $\delta_c$ is the dirac distribution with all mass probability on class $c$.

Finally, return the sum over $c$ of those class-gradients, weighted by their probability $\mathbb{P}(c|x)$

This approximates the loss of the gradient of some sample $x$ :

$$

X_s = \underset{x}{\operatorname{argmax}} \sum_c \mathbb{P}(c|x) \| \nabla l(\delta_c, \mathbb{P}(y|x)) \|

$$

**Expected error reduction**

The question the algorithm answers is : how much is the error prediction for all samples reduced if we re-train with that supplementary label ? We want to maximize the increase in certainty accross all labeled samples.

In that case :

$$

X_s = \underset{x}{\operatorname{argmin}} \sum_c \mathbb{P}(c|x) \sum_j E(x_j)

$$

where $E(x_j)$ is the error on sample $x_j$ if we trained with $x$ labeled. Of course we don't have access to the label of $x$ and don't want to train for all samples, so an approximation of $E(x_j)$ can be :

$$

E(x_j) \approx \sum_{\tilde{c}} \mathbb{P}(\tilde{c}|x_j) \nabla l(\delta_{\tilde{c}}, \mathbb{P}(\tilde{c}|x_j)) \cdot \nabla l(\delta_{c}, \mathbb{P}(c|x))

$$

**Density-weighted methods**

This method is a mix of both local and global based methods : it uses the result of a local method (like uncertainty sampling), and combines it with the representativeness of the considered data point. The idea is that it is no use knowing the label of a sample if it is an outlier. In that case :

$$

X_s = \underset{x}{\operatorname{argmax}} (\text{information brought by x}) \times \sum_{x_j} \text{similarity}(x, x_j)

$$

Other techniques exist, let's name a few :

**Variance reduction**: reduce the variance term of the predictions. See Section 3.5 of__Settles, Burr (2010). "Active Learning Literature Survey"__**Batch-mode active learning**: This family of methods tries to directly identify subsets of interesting assets to query instead of ranking each asset individually then returning the top $K$ assets.

Recent papers :

Active Learning for Convolutional Neural Networks: A Core-Set Approach, ICLR 2018. Core-Set :

**select samples that maximally cover the space of samples**, where the distance is defined as the l2 norm of activations. This is a kind of batch density weighted method, in the sense that we want queried samples to be as diverse as the whole dataset, and as informative as possible.**Very effective for deep learning compared to other methods**.Active Learning with Partial Feedback, ICLR 2019.

**Uses structure in the labels**. If you are doing classification of animals, it may be interesting to first classify an animal as a dog then to give its breed. You can already use the answer of the question (known as a**partial label**) to improve the model's accuracy. Another take from their paper : you should question the hypothesis that all assets are equally hard to label. The real rank should be : informativeness of the asset * cost to annotate the asset.Combining MixMatch and Active Learning for Better Accuracy with Fewer Label, ICLR 2020. They add simple active learning algorithms to

**MixMatch**, a recent state-of-the-art**semi-supervised**learning algorithm. Performance increase : 1% absolute, ~15% relative. A take from the article : the higher the desired expected accuracy of the model, the more valuable it is getting labeled data compared to unlabeled data. This ratio decreases the more labeled data we amass.Deep Active Learning with a Neural Architecture Search, NeurIPS 2019. Alternatively do NAS based on the current labeled set and (simple) active learning, to challenge the idea that the initial architecture of the neural network is nearly optimal.

Deep Active Learning with Adaptive Acquisition, IJCAI 2019. There are no algorithm performing better than others, it depends too much on the dataset. They propose that the active learning algorithm be a learner that "adapts" to the samples it queried, thanks to a new Reinforcement Active Learning framework. To do so, they use a bayesian neural network (BNN), to have access to posterior probabilities of predictions as the base learner, and another BNN as the learner which selects samples to query.

Deep Active Learning for Biased Datasets via Fisher Kernel Self-Supervision. The authors consider the case where the unsupervised training data is biased w.r.t. the test data. This bias is often caused by a class imbalance. They use a fisher kernel to compare samples, and construct a validation dataset from the training data that has the same distribution as the test data.

**3. Applying algorithms to specific tasks**

At the moment, we only mentioned cases where your the model is trained on a single-class classification task. However, there are many more cases where active learning can be useful :

For

**regression**, you canchange your model so that it predicts a mean and a variance

use commitee-based methods to compute a mean and a variance of predictions.

For

**object detection**:__Active Learning for Deep Object Detection__. Compute the margin sampling for all detected objects in an asset, take the weighted (by class imbalance) average of all objects per image.__ViewAL: Active Learning With Viewpoint Entropy for Semantic Segmentation__, CVPR 2020. Only works for multi-view datasets (the same object is present in multiple images, and we have 3D information).

For

**semantic segmentation**:__Deeply Supervised Active Learning for Finger Bones Segmentation__. Using a U-Net architecture, you can define predictions at different stages of the decoder, and compare those predictions to compute agreement. Then : less agreement = more informative sample.

For

**Named Entity Recognition**(NER) :For

**OCR**:__Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning__. Query-by-commitee active learning, using 5 models. They compute uncertainty with a normalized Levenshtein distance ratio.

For

**NLP**:__Deep Bayesian Active Learning for Natural Language Processing:____Results of a Large-Scale Empirical Study__Three tasks are studied : sentiment classification, NER and semantic role labeling. Their conclusion : if you do NLP, you should use__BALD__: Bayesian active learning by disagreement. To compute it for deep learning, you can use__Monte-Carlo dropout__or__Bayes-by-backprop__.