How to Label with Interactive Segmentation

Object detection works when the object belongs to a predefined set of classes. But what if you don't have a predefined set of classes? Enter interactive image segmentation.

Computer vision is a branch of artificial intelligence that trains machines to interpret and understand the visual world. The computer does this by processing digital images with deep learning models, which allow it to identify and classify objects accurately and then output what it “sees.”

An essential element of computer vision is image segmentation, the technique of partitioning a digital image into multiple segments. It has become very popular thanks to the rise of applications such as self-driving cars and medical image analysis.

In image segmentation, each pixel in an image is assigned a class, which lets us determine where a particular object is located in the image. Pixels are grouped into classes based on their similarity, which can be measured by color, intensity, texture, or any other characteristic.

But what happens if you want to detect a different object that isn't present in the predefined set of classes? This is where interactive segmentation comes into the picture. 

Interactive Image Segmentation: what is it, again?


Interactive segmentation can segment any object, not only objects from a predefined set of classes. The aim is to find all classes in a data sample and to delineate each one's precise location.

Each annotated pixel in the digital image belongs to a single class, and the output is a mask that outlines the shape of the object in the image.

A segmentation mask is a specific portion of an image that has been isolated from the rest of the image using image segmentation techniques. Segmentation masks are effective when an object's specific shape helps with classification. They are also useful when an image contains many similar objects or when the area between objects causes classification problems.
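To make the idea of a mask concrete, here is a minimal Python sketch. The tiny label map, the class ids, and the helper names `binary_mask` and `bounding_box` are all made up for illustration; a real pipeline would typically store these as image arrays.

```python
# Hypothetical 4x5 label map: every pixel holds exactly one class id.
# The ids and names below are illustrative assumptions.
CLASSES = {0: "background", 1: "car", 2: "pedestrian"}

label_map = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 0],
]


def binary_mask(label_map, class_id):
    """Return a mask with 1 where the pixel belongs to class_id, else 0."""
    return [[1 if pixel == class_id else 0 for pixel in row] for row in label_map]


def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, of the mask's nonzero pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None  # empty mask: no object of this class
    return rows[0], cols[0], rows[-1], cols[-1]


car_mask = binary_mask(label_map, 1)
car_box = bounding_box(car_mask)
```

The mask isolates the "car" pixels from everything else, and the bounding box is simply the tightest rectangle around that mask.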

But how does the machine know what to segment? It needs some user input. For example, the user could say or write the object's name, and the model would classify the matching pixels in the image. This is a complex task, however, and simpler interactions are available, such as:

  • Points: the user places points on the image to identify the desired object. After each point, the tool shows a mask as output; if the user sees misclassified regions, they place 'negative' points to indicate where the model made wrong predictions. The final mask is the result of these interactions.

  • Bounding boxes: the user draws a bounding box around the object they want to classify, and the model uses this information to classify the right pixels.

  • Scribbles: the user draws a rough 'skeleton' of the object with a freeform drawing tool, and the model uses it as input for predicting the mask.
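The point-based loop can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `Click` type and `predict_mask` function are hypothetical names, and the "nearest click wins" rule is a toy stand-in for the trained model a real tool would call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    row: int
    col: int
    positive: bool  # True = "this is the object", False = "this is not"


def predict_mask(height, width, clicks):
    """Toy stand-in for a model: a pixel is foreground when its nearest
    click is positive. A real tool would run a segmentation network here."""
    mask = [[0] * width for _ in range(height)]
    if not clicks:
        return mask
    for r in range(height):
        for c in range(width):
            nearest = min(clicks, key=lambda k: (k.row - r) ** 2 + (k.col - c) ** 2)
            mask[r][c] = 1 if nearest.positive else 0
    return mask


# First interaction: one positive click, the toy model marks everything.
clicks = [Click(2, 2, positive=True)]
first = predict_mask(5, 5, clicks)

# Second interaction: a negative click corrects the right-hand edge.
clicks.append(Click(2, 4, positive=False))
second = predict_mask(5, 5, clicks)
```

Each new click re-runs the prediction with the full click history, which is the essence of the interaction: the final mask is the result of all positive and negative points together.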

The most popular methodologies for segmenting the desired object rely on either contours or label propagation. Interactive image segmentation methods can be grouped as follows:

  1. Contour-based methods

  2. Graph cut (GC)-based methods

  3. Random walk (RW)-based methods

  4. Region growing/region merging (RG/RM)-based methods

  5. Deep learning techniques, and more.

Imitating how a human divides an image into regions using context is still very much a challenge in computer vision. Because of this, interactive image segmentation (IIS) methods have helped researchers build a more effective process for their computer vision tasks.

Interactive image segmentation (IIS) methods are classified based on the criteria used:

User Interaction 

By user interaction, methods divide into seed-based and region-of-interest-based approaches.

In the seed-based approach, the user marks the object with points, line segments, or strokes, which serve as seeds.

In the region-of-interest-based approach, the user delimits the desired object with a bounding box, polygon, or closed contour that defines the region of interest.


A second criterion is the methodology used to segment the desired object: contouring or label propagation. On this basis, methods are divided into contour-based, graph cut (GC), random walk (RW), and region merging (RM) or region growing (RG) methods.

The contour-based method extracts the contour of the object using edge features and any prior knowledge provided by the initial user interaction.

The graph cut method segments an image into foreground and background. The user draws lines on the image, called scribbles, to mark what belongs to the foreground and what belongs to the background.

The random walk method starts by constructing an undirected graph $G = (V, E)$ that represents the input image, where $V$ is the set of nodes (pixels) and $E$ is the set of edges. Each edge connects two neighboring pixels $i$ and $j$. The pairwise weight $w_{i,j}$ represents the likelihood of a random walk stepping between these two nodes.
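As a rough illustration, the graph construction can be sketched in Python. The Gaussian weighting $w_{i,j} = \exp(-\beta (I_i - I_j)^2)$ is a common choice in the random walker literature rather than something specified in this article, and the value of `beta`, the 4-connected neighborhood, and the function names are assumptions.

```python
import math


def build_graph(image, beta=10.0):
    """Build a 4-connected pixel graph G = (V, E) with Gaussian edge weights.

    image: 2D list of intensities in [0, 1].
    beta: assumed sensitivity parameter (not taken from the article).
    Returns {(i, j): w_ij}, where i and j are (row, col) nodes.
    """
    height, width = len(image), len(image[0])
    weights = {}
    for r in range(height):
        for c in range(width):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbors only
                nr, nc = r + dr, c + dc
                if nr < height and nc < width:
                    diff = image[r][c] - image[nr][nc]
                    weights[((r, c), (nr, nc))] = math.exp(-beta * diff * diff)
    return weights


def transition_probabilities(weights, node):
    """Probability of stepping from node to each neighbor: w_ij / sum_k w_ik."""
    incident = {}
    for (a, b), w in weights.items():
        if a == node:
            incident[b] = w
        elif b == node:
            incident[a] = w
    total = sum(incident.values())
    return {neighbor: w / total for neighbor, w in incident.items()}


image = [
    [0.1, 0.1, 0.9],
    [0.1, 0.1, 0.9],
]
weights = build_graph(image)
probs = transition_probabilities(weights, (0, 1))
```

From pixel (0, 1), the walker almost never crosses to the bright pixel at (0, 2), because the large intensity difference makes that edge weight tiny. This is how strong edges in the image keep labels from leaking across object boundaries.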

The region-growing and region-merging methods begin with labels provided by user seeds and then merge similar adjacent regions based on a homogeneity criterion.
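A minimal region-growing sketch in Python follows. The homogeneity criterion used here (intensity within a fixed `tolerance` of the seed pixel) and the 4-connected neighborhood are assumptions chosen for simplicity; real methods use richer criteria.

```python
from collections import deque


def region_grow(image, seed, tolerance=10):
    """Grow a region from a user seed over 4-connected pixels.

    A neighbor joins the region when its intensity is within `tolerance`
    of the seed pixel's intensity (an assumed homogeneity criterion).
    Returns a binary mask the same size as the image.
    """
    height, width = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    mask = [[0] * width for _ in range(height)]
    mask[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < height and 0 <= nc < width
            if inside and not mask[nr][nc] and abs(image[nr][nc] - seed_value) <= tolerance:
                mask[nr][nc] = 1
                queue.append((nr, nc))
    return mask


# Dark object on the left, bright background on the right.
image = [
    [10, 12, 200],
    [11, 13, 205],
    [10, 12, 198],
]
object_mask = region_grow(image, seed=(0, 0))
background_mask = region_grow(image, seed=(0, 2))
```

A single seed click inside the dark area is enough to recover the whole left-hand region, while a seed on the right recovers the bright column.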

Processing level

By processing level, methods work at the pixel level, the superpixel level, or a hybrid of the two (pixel/superpixel). Working at the pixel level gives the most detailed information about the object: its category, its position, and its shape.

Labeling with Interactive Segmentation

Your first step is to choose an effective labeling interface for image segmentation. You then need a smart labeling tool that takes raw data, such as images, and applies meaningful, informative labels to provide context. These labels help the machine learning model learn and improve its performance.

Looking at your raw image, identify and select the object of interest. Once you have identified the object, create a mask. One way to do this is the bounding box method: the box encloses the object of interest, which helps the model learn effectively.

Interactive Image Segmentation: how does it work?


The average time to complete a polygon mask annotation is roughly 34 seconds, whereas auto-annotation reduces that to 2.5 seconds. That makes it possible to create almost 10,000 labels, regardless of angles, lighting, and so on. Auto-annotate tools are faster, their gains are easy to measure, and they save 65-90% of the time required for labeling. Manual labeling also demands sustained attention to stay accurate, which can cause mental fatigue.
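A quick back-of-the-envelope calculation in Python shows what the per-mask figures above imply for a batch of 10,000 masks. The function names are illustrative, and the calculation only covers drawing time per mask, not review or other workflow steps.

```python
MANUAL_SECONDS_PER_MASK = 34.0  # polygon annotation, figure quoted above
AUTO_SECONDS_PER_MASK = 2.5     # auto-annotation, figure quoted above


def labeling_hours(num_masks, seconds_per_mask):
    """Total labeling time in hours for a batch of masks."""
    return num_masks * seconds_per_mask / 3600


def fraction_saved(manual_seconds, auto_seconds):
    """Share of per-mask time saved by switching from manual to auto."""
    return 1 - auto_seconds / manual_seconds


num_masks = 10_000
manual_hours = labeling_hours(num_masks, MANUAL_SECONDS_PER_MASK)
auto_hours = labeling_hours(num_masks, AUTO_SECONDS_PER_MASK)
saved = fraction_saved(MANUAL_SECONDS_PER_MASK, AUTO_SECONDS_PER_MASK)

print(f"manual: {manual_hours:.1f} h, auto: {auto_hours:.1f} h")  # manual: 94.4 h, auto: 6.9 h
print(f"per-mask time saved: {saved:.0%}")                        # per-mask time saved: 93%
```

The per-mask saving comes out slightly above the quoted 65-90% range, which would be consistent with that range describing the whole labeling workflow rather than mask drawing alone, though the article does not say so explicitly.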

With auto-annotate, you draw a rough bounding box around each object, and the box doesn't have to be perfectly aligned. The tool traces the object's border pixels correctly, so you don't need to adjust the mask manually.

The interactive image segmentation process starts with a target object roughly annotated by a user, which is then extracted as a binary mask. As mentioned above, interactive segmentation algorithms can be categorized as point-interfaced, box-interfaced, or scribble-interfaced.

A box-interfaced interaction obtains the mask of a target object from the given bounding box, and the algorithm attempts a one-shot segmentation. A scribble-interfaced interaction, by contrast, accepts foreground and background annotations from the user, who can add scribbles repeatedly until the result is satisfactory.

A pre-trained deep-learning model

A pre-trained model is a model or a saved network previously trained on a large dataset to solve a similar problem to your task. Pre-trained models are typically used for transfer learning to help customize a model for a specific task. 

As the name suggests, transfer learning transfers what a model has learned: a model already developed and tested for one task is reused as the starting point for a model on a second task. With the pre-trained model approach, transfer learning involves:

  1. Selecting your pre-trained source model

  2. Reusing the model - You can use this as a starting point for a new model for a similar task. Depending on the task at hand, you may only need to use elements of the model.

  3. Tuning the model - Your model may need to be refined and tuned based on the task at hand.

If a model has been trained on a large, general dataset, it can benefit other image classification tasks because it acts as a generic model of the visual world. This gives you a good foundation rather than starting from scratch. VGG16 is one of the most popular pre-trained CNN models used for image classification.

Another example is DeepLabv3, Google's semantic image segmentation model. DeepLab is a series of deep learning architectures designed to tackle the problem of semantic segmentation. DeepLabv3 is a fully convolutional neural network (CNN) model and is well known for its high speed, better accuracy, architectural simplicity, and generalizability to custom tasks.

Use Cases of Segmentation Masks

Segmentation masks are very useful for a variety of use cases, such as:

  • Safety monitoring of drones, birds, and planes

  • Cars and pedestrians in autonomous vehicles 

  • Security and surveillance 

  • Power lines for utilities

  • Disease detection for agritech 

  • Cell outline for healthcare

To go into a little bit more detail, you can use segmentation masks to:

  • Detect and classify models of cars in a parking lot 

  • Detect and classify moving crafts in the sky to alert other moving crafts

In all of these use cases, segmentation masks support business processes in areas such as security, safety, manufacturing, and more.

Interactive Image Segmentation: why is it important?

Interactive image segmentation is an essential element of computer vision and makes advanced image editing software possible. It lets users select a specific object through interactive inputs such as strokes, bounding boxes, and clicks.

Over the decades, factors such as lighting and viewing angles have held back progress in image segmentation and limited its performance. However, deep learning techniques have rapidly been adopted for image segmentation, with the most recent work being based on click-point information. Deep interactive object selection (DOS) was the first deep learning solution to the interactive segmentation problem.

Benefits of Interactive Image Segmentation



Manually annotating object segmentation masks is a time-consuming and tedious task. With interactive object segmentation, however, the human annotator and the machine segmentation model can work collaboratively. The authors of the paper "Large-scale interactive object segmentation with human annotators" report a threefold speed gain from using interactive segmentation.


The same paper also reports that the masks were of the same quality as those produced without interactive segmentation. There could be several reasons for this, such as the contours being calculated by the model rather than drawn by hand. Hand-drawn masks can be imprecise because a mouse is an unreliable drawing tool.


Interactive segmentation has reduced the workload of annotators and allowed them to focus on higher-level tasks. The annotator does not need to spend much time classifying each pixel or tracing detailed contours; their main concern is finding the object.

Challenges with Interactive Segmentation

If you use a CNN for interactive segmentation, the biggest challenge is that CNNs do not generalize well to previously unseen object classes that are not in the training set. The requirement for labeled instances of each object class in the training set is a blocker for many industries. In medical image analysis, for example, you need experts with the time to produce accurate annotations. Without such annotations, CNN performance on those objects will be drastically low.

Another challenge is image-specific learning: how the model copes with context that varies significantly across images. With CNNs, the model's parameters are learned from the training dataset and then applied unchanged in the testing phase. By contrast, image-specific adaptation of a pre-trained Gaussian Mixture Model (GMM) has been shown to improve segmentation accuracy.

Interactive segmentation requires fast inference and memory efficiency. This is much easier to achieve with 2D images than with 3D images. For example, some CNNs work on local 3D patches to reduce memory requirements but suffer from slow inference, whereas others have fast inference but require a large amount of GPU memory.

Semantic Segmentation Example with Kili

Image semantic segmentation consists of detecting the specific regions that objects occupy in an image. The goal is to identify objects along with their positions, dimensions, and shapes.

Reviving Iterative Training with Mask (RITM) method

As explained above, interactive image segmentation (IIS) methods are classified according to several criteria. Kili Technology uses the Reviving Iterative Training with Mask (RITM) method.

There has been a lot of research on click-based interactive segmentation. Although these methods have produced state-of-the-art results, they are computationally expensive: they require backward passes through the network, which also makes them hard to deploy on frameworks that only support forward passes.

The solution to this problem is a simple feedforward model for click-based interactive segmentation. The RITM method lets you segment a completely new object or start from an external mask, which you can then correct if needed.
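To illustrate how a feedforward click-based model can take both clicks and an existing mask as input, here is a hedged Python sketch. It is not Kili's or RITM's actual code: the function names, the disk radius, and the three-channel layout (positive clicks, negative clicks, previous mask) are assumptions modeled on how click-based approaches commonly encode user input.

```python
def disk_map(height, width, points, radius=1):
    """Binary map with a filled disk of `radius` around each (row, col) point."""
    grid = [[0] * width for _ in range(height)]
    for pr, pc in points:
        for r in range(max(0, pr - radius), min(height, pr + radius + 1)):
            for c in range(max(0, pc - radius), min(width, pc + radius + 1)):
                if (r - pr) ** 2 + (c - pc) ** 2 <= radius ** 2:
                    grid[r][c] = 1
    return grid


def build_aux_input(height, width, positive, negative, previous_mask=None, radius=1):
    """Stack [positive clicks, negative clicks, previous mask] as three channels.

    Passing previous_mask=None means "segment a new object from scratch";
    passing an external mask means "start from this mask and correct it".
    """
    if previous_mask is None:
        previous_mask = [[0] * width for _ in range(height)]
    return [
        disk_map(height, width, positive, radius),
        disk_map(height, width, negative, radius),
        [row[:] for row in previous_mask],  # copy so the caller's mask is untouched
    ]


# Hypothetical external mask that wrongly includes the top-right corner.
external_mask = [[0] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        external_mask[r][c] = 1
external_mask[0][6] = 1

aux = build_aux_input(7, 7, positive=[(3, 3)], negative=[(0, 6)], previous_mask=external_mask)
```

In a setup like this, the extra channels would be fed to the network alongside the image in a single forward pass, so each correction click only costs one inference step.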

Large datasets with accurate or corrected masks play an important role in the training phase and in a model's performance. You therefore need to give your model a strong baseline on which you can build your improvements.

Kili Technology implemented this interactive image segmentation method because it is robust to changes in image category, requires little input from the user, and runs quickly and effectively (in less than a second).

Kili Technology's labeling interface is made up of three parts: 

  1. The header - to manage assets and questions

  2. The assets viewer - a toolbar to manage your assets

  3. The job viewer - to interact with your ontology and your label

Standard Semantic Segmentation

To perform semantic segmentation using Kili Technology's labeling interface, follow these steps:

  1. Select a category.

  2. Hover over a specific point of an image, and then press and hold your left mouse button.

  3. While holding the left mouse button, draw the object's shape.

  4. To complete the shape, move your pointer to the first point that you clicked on and release the left mouse button.

Interactive Semantic Segmentation

To perform interactive semantic segmentation using Kili Technology's data labeling interface, follow these steps:

  1. Select a category.

  2. From the toolbar, select the interactive segmentation tool.

  3. Click on the object that you want to classify, and the object mask will be created automatically.

  4. Adjust the created mask:

    a. To add a specific region to the mask, click in the center of the region that you want to add. The region will be added automatically.

    b. To remove a specific region from the mask, press and hold Alt/Option and then click in the center of the region that you want to remove. The region will be removed automatically.

    c. To cancel and remove your mask, press the Escape key.

In the example image below, the category is “Car without wheels”: we want to select the car and remove its wheels.

[Image: Example of interactive semantic segmentation in Kili Technology]

Interactive segmentation: Pro tips to optimize usage

Interactive image segmentation is not a simple task, so it is important to make your process robust and effective.

Here are some pro tips to help optimize your usage:

Leverage Keyboard Shortcuts

Being able to find a shortcut always makes our lives a bit easier. Most interactive image segmentation tools have keyboard shortcuts that can help you label more efficiently. Learn and use these shortcuts to help speed up your labeling process.

Zoom and Pan Features

Take advantage of all the available features. Zooming in and panning around an image can better help you identify and label objects more specifically and accurately. Use these features to get a closer look at small or complex objects.

Undo and Redo Features

We’re bound to make a mistake here or there - that’s normal. The undo and redo features can help you correct mistakes and change your labeling. Use these features to ensure accuracy and save time.

Stop Smart Tool at the Correct Moment

Interactive image segmentation speeds up many of the first steps, and its predictions can be very accurate. As time goes on, however, the accuracy may fall and the tool may produce wrong predictions. Knowing when to stop using the smart tool will allow you to maintain high accuracy.

Unusual Objects

Depending on your task, you may encounter unusual objects, such as occluded objects. These types of objects can be very confusing to work with. However, you can adapt the mask using the ‘+’ and ‘-’ controls to add or remove regions and improve your result.