Machine learning models have proven extremely powerful for automating tasks. Automatic image recognition, for example, has seen an incredible leap thanks to the development of convolutional neural networks. Yet today's models have not reached their full potential, and one way to get closer is to provide more annotated data. The problem with data annotation is that it requires humans to produce new annotations.
Several approaches try to mitigate this problem: we could generate new annotations automatically, or use the existing data to infer new ones, and so on. Interactive segmentation tackles the problem by making the annotation process faster and by reducing annotator fatigue.
In this article we'll explore why interactive segmentation could play a huge role in the future of data annotation and machine learning in general. We'll cover the following topics:
What is interactive segmentation and image segmentation?
What are the gains we could have by using this method of annotation?
Why use points for interactive segmentation
How to measure the performance?
What are the requirements for a real world data annotation platform?
How does the model work?
How is it trained?
How to prioritise data with interactive segmentation?
Ideas for the future
What is image segmentation and interactive segmentation?
The task of image segmentation consists in assigning a class to each pixel of an image. This is very useful because we often want to know where some kind of object, such as a car, is placed in a given image. We could represent a car with bounding boxes or points, but assigning a class to each pixel is the finest way to define an object, and therefore the one that gives the most information. The image below highlights a car annotated with a single click using Kili's interactive segmentation tool.
The most powerful models for semantic segmentation, such as DeepLabV3+  and HRNet+OCR , are currently extremely performant. They take as input a class from a predefined set, for instance car, and try to detect it in a given image. But what happens if you want to detect an object that isn't present in that predefined set of classes? You'll need to train the model with examples of the new class; you can of course start from a pretrained model, but you'll still need some training data.
The idea of interactive segmentation is to be able to segment any object. Of course, this requires some user input. In the best-case scenario, the user would simply say or write the name of the object and the model would correctly classify the pixels in the image. As this is a very complex task, we're currently limited to simpler interactions. Some possible kinds of interaction are:
Points: the user places points in the image to identify the desired object. Each time a point is placed, the model outputs a mask; if the user sees misclassified regions, they place 'negative' points to indicate the wrong predictions. The final mask is the result of these interactions.
Bounding boxes: the user draws a bounding box around the object they want to classify, and the model uses this information to classify the right pixels.
Scribbles: the user draws a rough 'skeleton' of the object with a freeform drawing tool, and the model uses it as input to predict the mask.
In the image below you can see examples of these kinds of interaction. We should also highlight that these approaches can be combined: for example, we could draw a bounding box and then use points to refine the result.
What are the gains we could have by using interactive segmentation?
The first gain that comes to mind is speed: according to , interactive segmentation yielded a threefold speed gain while achieving the same mask quality as annotation without it.
We could also argue that it's easier to achieve higher-quality masks, for several reasons. Since the contours are computed by the model, the annotator doesn't need to draw them by hand; drawing masks manually can be difficult and imprecise, because annotation is usually done with a mouse, which is a poor drawing tool. Specialised drawing hardware could help, but it would certainly drive costs up, which is undesirable.
Annotators using interactive segmentation also face less fatigue, as they concentrate on high-level tasks. For example, when identifying a car in an image, the annotator's main concern is simply finding the car; they don't have to worry about classifying each pixel or tracing detailed contours.
Why use points for interactive segmentation
We've talked about different ways of doing interactive segmentation, but is one of them preferable? As you can guess from the title of this section, at Kili we use points for interactive segmentation.
There are several reasons to prefer points over other methods, but the main ones are model performance and the annotation interface.
Interactive segmentation models based on points tend to be better because this type of interaction has been studied more and more models have been produced for it. One reason is that training requires an enormous quantity of data, and we usually only have annotations for the segmentation masks; the interactions must therefore be simulated using the segmentation masks as a base. Placing points is the simplest form of interaction, so it comes as no surprise that it's also the easiest to simulate reliably.
On the annotation interface side, a point-based platform is simpler to explain and very intuitive, as opposed to scribbles. More importantly, mice are made for precise clicks, which is exactly what a point-based interface needs, so we take full advantage of the hardware.
From now on, we'll assume a click-based interactive segmentation interface with positive and negative clicks.
How to measure the performance?
The main quality metric in image segmentation is intersection over union (IoU). It compares two masks by taking the area of their intersection over the area of their union. In the image below, the IoU would be the white area divided by the blue area plus the white area.
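For two binary masks, IoU can be computed in a few lines. The sketch below assumes masks represented as NumPy boolean arrays (the function name `iou` is ours, for illustration):

```python
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union between two boolean masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union if union > 0 else 1.0

a = np.zeros((4, 4), dtype=bool); a[:2, :] = True   # top two rows
b = np.zeros((4, 4), dtype=bool); b[1:3, :] = True  # middle two rows
print(iou(a, b))  # intersection = 4 pixels, union = 12 pixels -> 0.333...
```

Identical masks give an IoU of 1, disjoint masks an IoU of 0.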
In the context of interactive segmentation, a new metric should be taken into account: [email protected]%, the number of clicks needed to reach an IoU of X%. It's important to consider this metric because, at least in theory, we could achieve 100% IoU with infinitely many clicks, as the clicks alone would rebuild the original mask.
It's also important to note how the clicks are made. For that, we usually define an automatic clicker, since a human clicker would behave differently at each try. The most common clicker, and the one we consider here, clicks at each step on the centre of the largest misclassified region. A region is a connected group of pixels, the largest region is the one with the largest area, and the centre of a region is the point with the largest minimum distance to the region's edges.
Another possible metric is [email protected], the IoU achieved with a given number of clicks.
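The automatic clicker described above can be sketched with `scipy.ndimage`: label the misclassified pixels into connected regions, keep the largest one, and pick the pixel farthest from its border via a distance transform (the function name `next_click` is ours, for illustration):

```python
import numpy as np
from scipy import ndimage

def next_click(pred: np.ndarray, gt: np.ndarray):
    """Simulated annotator: click the centre of the largest misclassified region.

    Returns ((row, col), is_positive), or None when pred already equals gt.
    """
    error = pred != gt
    if not error.any():
        return None
    labels, n = ndimage.label(error)                       # connected error regions
    sizes = ndimage.sum(error, labels, index=range(1, n + 1))
    biggest = labels == (np.argmax(sizes) + 1)             # largest region by area
    dist = ndimage.distance_transform_edt(biggest)         # distance to region edge
    r, c = np.unravel_index(np.argmax(dist), dist.shape)   # farthest-from-edge pixel
    # positive click if the region is a missed part of the object
    return (int(r), int(c)), bool(gt[r, c])

gt = np.zeros((5, 5), dtype=bool); gt[1:4, 1:4] = True  # ground-truth 3x3 square
pred = np.zeros((5, 5), dtype=bool)                     # empty prediction
print(next_click(pred, gt))  # ((2, 2), True): positive click at the square's centre
```

Iterating this clicker against a model's predictions until a target IoU is reached is precisely how [email protected]% is measured.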
What are the requirements for a real world data annotation platform?
We've already seen that interactive segmentation can be a very powerful tool, but what are the requirements for an interactive segmentation system to be usable in the real world? First we have to define usable: we mean that the annotator should be more efficient with the tool than when annotating by hand, so the final annotations should be produced faster and with equal or superior quality. Other things that are optional but very desirable are comfort while using the tool and the absence of lag while the tool computes its masks.
At Kili, our interactive segmentation tool is based on a state-of-the-art model from 2021 that beats all previous methods on this task. Having a powerful model was not the only challenge in building the tool: we had to integrate it seamlessly with our platform and make it versatile enough to handle the different kinds of images we receive. Our objective was to run predictions in around 300ms, so that each prediction arrives at approximately the human perception time  and minimal lag is perceived.
Concerning precision, as we use a state-of-the-art model, we had no problem obtaining IoUs higher than 90% in fewer than five clicks on all the standard test datasets we used, including GrabCut, DAVIS and Berkeley, as well as custom datasets. We highlight that, according to a large-scale study on image segmentation , the agreement between annotators is on average 90% IoU; given this, we consider IoUs above that value to be our minimal specification.
Y-Axis: IoU, X-Axis: Number of clicks on dataset GrabCut
Finally, we want our users to have the best possible experience with our tool. Our interactive segmentation tool is therefore integrated into our semantic segmentation interface, and the user interface consists of basically two commands: adding a positive click and adding a negative click.
How does the model work?
This part is a little more technical than the others, so feel free to skip it if you want.
First we have to understand the inputs and outputs of our model. We receive an image, often represented as three arrays, one per colour (red, green and blue), containing the intensity of each colour at each pixel. Then we have the user clicks; each click can be represented as a position in the image and a value saying whether the click is positive or negative. These are the essential inputs that our model, a neural network, will receive. For the model to make the best use of them, we might need to provide these inputs in a different format, and it might also be useful to give the model some additional information.
The image is provided to the model as a tensor (essentially a multidimensional matrix) of shape (image height, image width, 3). The clicks are also provided as a tensor, of shape (image height, image width, 2): here the clicks are represented by two channels, analogous to the image's three colour channels. This representation of the clicks is called the encoding, and it's what we actually provide to the model. Each channel represents one kind of click, positive or negative, and contains an image of the same size as the original where fixed-size disks are placed at the click locations.
Representation of encoding for 3 clicks
This gives five input channels, three for the image and two for the clicks. We additionally use a sixth channel containing the segmentation mask from the previous step, which has been shown to greatly improve the model's performance.
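The disk encoding of the clicks can be sketched as follows, assuming NumPy arrays and a fixed disk radius (the function name `encode_clicks` and the radius value are ours, for illustration):

```python
import numpy as np

def encode_clicks(height, width, clicks, radius=5):
    """Encode clicks as two channels of binary disks, shape (H, W, 2).

    `clicks` is a list of (row, col, is_positive); channel 0 holds
    positive clicks and channel 1 negative ones.
    """
    encoding = np.zeros((height, width, 2), dtype=np.float32)
    rows, cols = np.mgrid[:height, :width]
    for r, c, positive in clicks:
        # all pixels within `radius` of the click form a disk
        disk = (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
        encoding[..., 0 if positive else 1][disk] = 1.0
    return encoding

enc = encode_clicks(64, 64, [(20, 20, True), (40, 40, False)], radius=3)
```

Concatenating this (H, W, 2) tensor with the (H, W, 3) image and the (H, W, 1) previous mask yields the six-channel input described above.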
We could now use these encodings and lots of labelled data to train a neural network from scratch that outputs an image the same size as the input, with a single channel representing the segmentation mask. However, to achieve the best results, this network would need lots of data and processing power to train. We can drastically reduce training time by leveraging pretrained models: in our case, we pass the input through some custom layers and then feed it to a powerful backbone that already does image segmentation very well, hoping that the knowledge it acquired in segmenting objects will make training faster.
Finally, the model outputs a single channel of decimal values representing the probability that each pixel belongs to the object. We take every pixel with probability above 50% as an object pixel, declare the rest background, and voilà: the segmentation mask is generated.
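That final thresholding step is a one-liner; a minimal sketch with a made-up probability map:

```python
import numpy as np

# per-pixel probabilities as output by the network (illustrative values)
probabilities = np.array([[0.10, 0.72],
                          [0.95, 0.40]])

# every pixel with probability above 50% belongs to the object
mask = probabilities > 0.5
print(mask)  # [[False  True], [ True False]]
```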
How is it trained?
Training a neural network is complex enough to deserve its own blog post; here we'll just briefly explain the main factors at play when training a network such as this one.
First we need a neural network, which in our case was defined in the last section. A neural network is, in essence, just a function that accepts inputs and gives outputs; this function has parameters that must be adjusted to minimise some error function. We've already covered the inputs (the image and the clicks) and the output (the predicted mask). The parameters are the weights and biases of the network, the values that are changed during training to get better results. The error function defines how good the network's output is compared to some provided ground truth. Finally, we need to use the error function to tune the parameters; that's the job of the optimiser. The most widely used optimiser nowadays is Adam, and it's the one we use.
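One training step ties all these pieces together. The sketch below uses PyTorch and a trivial single-convolution stand-in for the real backbone (everything here is illustrative, not Kili's actual code); note the six input channels matching the encoding described earlier:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real backbone: maps the 6 input channels
# (3 image + 2 clicks + 1 previous mask) to 1 output channel.
model = nn.Conv2d(6, 1, kernel_size=3, padding=1)
loss_fn = nn.BCEWithLogitsLoss()                     # the error function
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

inputs = torch.randn(1, 6, 32, 32)                   # a batch of encoded inputs
target = torch.randint(0, 2, (1, 1, 32, 32)).float() # ground-truth mask

optimiser.zero_grad()
logits = model(inputs)          # forward pass: predicted mask (as logits)
loss = loss_fn(logits, target)  # compare prediction with ground truth
loss.backward()                 # compute gradients of the error
optimiser.step()                # let Adam update the weights and biases
```

Real training repeats this step over many batches, with the clicks simulated from the ground-truth masks as discussed above.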
The last important ingredient is the training data. The model we use benefits much more from well-defined segmentation masks than from lots of ill-defined ones, so the dataset used for training was of paramount importance. Fine-tuning the current model can be a great way to improve performance on very specific tasks, though general performance should not improve much from fine-tuning. We've also found that the model performs significantly better on images with scales similar to those used in training, even though it accepts images of any size. Therefore, rescaling the images or adding more size variety to the training dataset could be a great way to improve performance.
How to prioritise data with interactive segmentation?
The essence of interactive segmentation is that the annotator makes successive interactions with the model to arrive at a segmentation mask. Each annotation thus generates a sequence in which each item is a click. The image below represents an example of such a sequence:
From these interactions we can infer lots of useful information, such as the time between clicks, the change in the mask after each click, the positions of the clicks, and so on. We can use this information to predict the quality of an annotation. The idea is the following: if the annotator takes too long, the clicks are very spread out, and the mask varies a lot between steps, the annotation is likely of low quality. Using a regression model, such as an RNN, we could try to predict annotation quality; this functionality would be especially useful for Kili's review interface.
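As an illustration, the interaction signals just listed could be condensed into per-annotation features for such a quality model. This is a sketch of our own devising (the function name `annotation_features` and the chosen features are hypothetical, not a description of Kili's implementation):

```python
import numpy as np

def annotation_features(timestamps, positions, iou_deltas):
    """Summarise one click sequence into features for a quality model.

    timestamps: click times in seconds; positions: (row, col) per click;
    iou_deltas: change in mask IoU after each click.
    """
    gaps = np.diff(timestamps)                 # time between consecutive clicks
    pts = np.asarray(positions, dtype=float)
    spread = pts.std(axis=0).mean()            # how spread out the clicks are
    return {
        "n_clicks": len(timestamps),
        "mean_gap": float(gaps.mean()) if len(gaps) else 0.0,
        "click_spread": float(spread),
        "mask_instability": float(np.abs(iou_deltas).mean()),
    }

feats = annotation_features(
    timestamps=[0.0, 1.5, 4.0],
    positions=[(10, 12), (11, 13), (40, 5)],
    iou_deltas=[0.6, 0.1, -0.2],
)
```

A high mean gap, large spread, and unstable mask would all push a regression model towards a low predicted quality.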
As a proof of concept,  was able to successfully use this technique to find poorly annotated masks in an interactive segmentation experiment, as shown in the image:
The image above shows that the model successfully ranked the low-quality masks first; if the model were ineffective, the output would be a straight line.
Ideas for the future
There are several possible points of improvement that still need to be explored. The first one is the dataset: we could train our model on a more diverse dataset with greater image variety and also more mask size variety. Regarding the model, the current click encoding is already a solid choice, as disks with fixed sizes have been shown to be more effective than Gaussians or distance maps, but different encodings, such as disks with varying sizes, should still be tried. One could also change the structure of the model to make the first layers more complex, particularly the one handling the previous mask input.
The future of data annotation certainly lies in intelligent models that help annotators in their tasks, and we're confident that interactive segmentation can make your annotation process faster and more accurate. As a leading data annotation platform, Kili provides its users with a fast and efficient interactive segmentation tool, and we're committed to improving its quality over time. We hope this article helped you understand the basics of this technology. Don't hesitate to try it out on our platform; the documentation for our tool can be found at this link.
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018, August 22). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. arXiv.org. https://arxiv.org/abs/1802.02611.
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., & Xiao, B. (2020, March 13). Deep High-Resolution Representation Learning for Visual Recognition. arXiv.org. https://arxiv.org/abs/1908.07919.
Benenson, R., Popov, S., & Ferrari, V. (2019, April 17). Large-scale interactive object segmentation with human annotators. arXiv.org. https://arxiv.org/abs/1903.10830.