Video Object Detection: AI's New Challenge
Video Object Detection is a Computer Vision task, meant to analyze video images and classify objects located in them.
Artificial Intelligence has taken an important place in our lives, in both personal and professional matters. It allows us to be more efficient and to focus on other tasks while the machine works on its own. Object Detection is a key task of Computer Vision. Thanks to it, the machine is able to locate, identify and classify an object in an image. Today, people are progressively adopting Video Object Detection in their daily lives.
What is Video Object Detection and how does it work?
Inspired by the function of the human visual cortex, IT researchers and developers have created Video Object Detection (or VOD) applications that give machines the capability to analyze images and detect the objects present within them. They first developed protocols and procedures meant to work on still images only. But today, things have moved on to video.
A tool based on Image Object Detection
Object Detection in video works in much the same way as it does on images. The goal of such a tool is to allow the machine to locate, identify and classify the objects that appear in the input video.
First of all, it is necessary to feed the machine with reference data. We need to create a library of video footage and images. As with pictures, video datasets are available for download through various platforms such as Keras and TensorFlow, or through other APIs (Application Programming Interfaces). Users who would like to develop their own VOD application might want to start with the Python language, which is quite easy to learn and practice, and which is also the most popular language in AI today.
Once the whole dataset has been downloaded, the new user will have to categorize the objects seen in the video by assigning class labels to them. To allow the application to locate the objects, bounding boxes will have to be drawn manually around the items we want the machine to identify.
Once the items have been located and assigned matching labels, existing object detection models can be downloaded from different platforms. Unlike Image Processing, VOD uses a combination of Image Detection and Video Tracking to analyze the data it has to process.
Before being validated, the algorithm needs to be trained on the labeled data prepared at the very beginning of the process. This step is very important to ensure that the accuracy is satisfactory. Once the training phase is complete, the program is ready to be evaluated and validated on a new set of data it has never seen during training, before being released.
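As an illustration of the labeling step, here is a minimal sketch of how labeled frames could be stored in Python. The class names, field names and sample values are hypothetical and chosen only for this example; real annotation tools and datasets each use their own format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BoundingBox:
    """Axis-aligned box in pixel coordinates, with its class label."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str  # class label assigned by the annotator


@dataclass
class FrameAnnotation:
    """All labeled objects for one frame of one video."""
    video_id: str
    frame_index: int
    boxes: List[BoundingBox] = field(default_factory=list)


def count_labels(annotations: List[FrameAnnotation]) -> Dict[str, int]:
    """Count how many boxes carry each class label across all frames."""
    counts: Dict[str, int] = {}
    for frame in annotations:
        for box in frame.boxes:
            counts[box.label] = counts.get(box.label, 0) + 1
    return counts


# Hypothetical example: two consecutive frames of the same street video.
annotations = [
    FrameAnnotation("street_01", 0, [BoundingBox(10, 20, 110, 220, "bicycle")]),
    FrameAnnotation("street_01", 1, [
        BoundingBox(12, 20, 112, 220, "bicycle"),
        BoundingBox(200, 50, 420, 210, "car"),
    ]),
]
label_counts = count_labels(annotations)
```

Counting labels like this is a quick way to check whether some classes are under-represented before training.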
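One way to make sure the evaluation data has never been seen during training is to split the dataset by whole videos rather than by individual frames, since neighboring frames of the same video look almost identical. The sketch below assumes each video has an identifier; the function name, the 20% evaluation share and the fixed seed are illustrative choices, not a prescribed setup.

```python
import random
from typing import List, Tuple


def split_by_video(video_ids: List[str], eval_fraction: float = 0.2,
                   seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split whole videos into a training set and an evaluation set.

    Splitting by video (not by frame) keeps near-identical neighboring
    frames from ending up on both sides of the split.
    """
    unique_ids = sorted(set(video_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_ids)
    n_eval = max(1, int(round(len(unique_ids) * eval_fraction)))
    return unique_ids[n_eval:], unique_ids[:n_eval]


# Hypothetical dataset of ten videos.
train_videos, eval_videos = split_by_video([f"video_{i:02d}" for i in range(10)])
```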
Now let’s dive a bit deeper and discover different approaches which are used for VOD.
Different Video Object Detection methods
Machines don’t process images the way we do as human beings, especially videos. They don’t consider images as a whole. To analyze them, they have to split the video into frames and work on pixels and their features. They usually combine image detection and video tracking to produce their results. Here is a list of some of the best-known methods used to develop VOD solutions:
Faster R-CNN is a method based on R-CNN. R-CNN, or Region-Based Convolutional Neural Network, is a model which divides the frame into various regions where the bounding boxes of items might be located. This Object Detection algorithm uses multiple convolutional layers to analyze the regions of the video frames. The difference between R-CNN and Faster R-CNN lies in the use of a first network (the Region Proposal Network) that identifies the regions where objects are likely to be located. This way, the algorithm does not have to go through all the regions of the frame in order to identify items in a video. Filters are then applied to detect pixels and their patterns. The system recognizes some of these patterns, associates them with the references learned from the dataset, and classifies the detected items. Usually, the convolutional layers are not applied at the same time but one after another, so that the system can go through all the information. In between these filters, other layers called pooling layers are applied to condense the information and clean up the feature maps. At the very end of the process, a final layer flattens the outputs of the previous layers, gathers all the data, and produces an output. This method is often used for Object Detection. It is known for its accuracy, although it is generally slower than single-network detectors such as YOLO and SSD.
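The sequence of filter, pooling and flattening steps described above can be sketched with NumPy on a toy frame. This is only an illustration of the building blocks, not Faster R-CNN itself: the filter is written by hand, whereas a trained network learns its filters, and the function names are assumptions made for this example.

```python
import numpy as np


def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide one filter over a single-channel image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool(feature_map: np.ndarray, size: int = 2) -> np.ndarray:
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    trimmed = feature_map[:h, :w]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))


# Toy 6x6 frame: dark on the left, bright on the right (a vertical edge).
frame = np.zeros((6, 6))
frame[:, 3:] = 1.0

# Hand-written vertical-edge filter (assumption: stands in for a learned one).
edge_filter = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

features = np.maximum(conv2d(frame, edge_filter), 0.0)  # filter + ReLU
pooled = max_pool(features)                              # pooling layer
flattened = pooled.reshape(-1)                           # final flatten step
```

The filter responds only where the brightness changes, the pooling step keeps the strongest response in each 2x2 block, and the flattened vector is what a final classification layer would receive.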
YOLOv3 (the third version of the You Only Look Once method) is a model which applies a single neural network to the video frame. This method consists in dividing the frame into regions and analyzing the pixel patterns of all these regions. Based on the bounding boxes annotated in the training data, the algorithm learns to predict where bounding boxes are located in new data. As the method only requires one pass of a single neural network, it can run very quickly and deliver its output in less than a second.
SSDs, or Single Shot Detectors, use a single neural network as well. The algorithm divides the image into sections and calculates probabilities in order to predict where items might be located in each section. In 2021, TensorFlow (an end-to-end open-source platform for machine learning) released a new SSD model called MobileNet V2 that is reported to be more accurate than YOLOv3 and v4. SSD, YOLOv3 and YOLOv4 are all well suited to custom VOD solutions, which require a fast response from the machine.
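To make the idea of dividing the frame into regions more concrete, the sketch below finds which cell of a regular grid contains the center of a labeled box. In YOLO-style detectors, that cell is the one expected to predict the object. The function name is hypothetical, and the 7x7 grid and 640x480 frame are arbitrary values picked for illustration.

```python
from typing import Tuple


def responsible_cell(box: Tuple[float, float, float, float],
                     frame_size: Tuple[int, int],
                     grid: int = 7) -> Tuple[int, int]:
    """Return (row, col) of the grid cell that contains the box center.

    box is (x_min, y_min, x_max, y_max) in pixels;
    frame_size is (width, height) in pixels.
    """
    width, height = frame_size
    center_x = (box[0] + box[2]) / 2.0
    center_y = (box[1] + box[3]) / 2.0
    # Clamp so that a center lying exactly on the far edge stays in the grid.
    col = min(int(center_x / width * grid), grid - 1)
    row = min(int(center_y / height * grid), grid - 1)
    return row, col


# Hypothetical car box in a 640x480 frame.
cell = responsible_cell((200, 50, 420, 210), frame_size=(640, 480), grid=7)
```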
What challenges does Video Object Detection have to face?
Video Object Detection is based on the techniques developed for Image Recognition. However, analyzing a static image and analyzing a moving one are two different things, and the latter is the more challenging.
The main problem VOD faces today is the lack of available video datasets. But it is clearly not the only one.
Avoiding Overfitting, Occlusion, Reacquisition, and Motion blur
When analyzing moving pictures, the user has to be aware that it might not be as easy as analyzing still images.
Overfitting is a recurrent problem in object detection, whether it concerns still or moving images. When your system is trained and evaluated on the same dataset, it tends to fail to generalize, and its results become less satisfying on new, real-world data. To tackle this problem at its core, data scientists and users have to keep in mind that the more diverse the training data, the less overfitting there will be. This is particularly true of VOD, for example in video surveillance, where the same objects come and go all the time. Some techniques are used to prevent overfitting, and Regularization is one of them. It consists in penalizing large weights so that no single feature weighs much more than the others. This technique, which deliberately lowers the performance of your model on the training dataset, decreases the risk that it will fail on another dataset (for example on production data points).
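One common form of regularization, L2 weight decay, adds a penalty to the training loss that grows with the size of the weights. The sketch below uses NumPy; the mean squared error, the penalty strength of 0.01 and the sample weights are assumptions made for the example, not values taken from any particular model.

```python
import numpy as np


def l2_regularized_loss(weights: np.ndarray, predictions: np.ndarray,
                        targets: np.ndarray, strength: float = 0.01) -> float:
    """Mean squared error plus an L2 penalty on the weights."""
    data_loss = float(np.mean((predictions - targets) ** 2))
    penalty = strength * float(np.sum(weights ** 2))
    return data_loss + penalty


# Two hypothetical models that fit the training data equally well.
predictions = np.array([1.0, 0.0])
targets = np.array([1.0, 0.0])

weights_balanced = np.array([0.5, -0.5, 0.5])   # no dominant feature
weights_dominant = np.array([5.0, -0.1, 0.1])   # one feature dominates

loss_balanced = l2_regularized_loss(weights_balanced, predictions, targets)
loss_dominant = l2_regularized_loss(weights_dominant, predictions, targets)
```

Both models have the same error on the training data, but the one relying on a single very large weight gets a higher total loss, so training is nudged toward the more balanced solution.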
When analyzing a video, some objects might pass in front of other ones and complicate the work of the application. Take, for example, a bicycle on the street being passed by a car. At some point, the car is going to hide the bicycle. This occlusion and reacquisition problem can be an issue for the system. Once the object is no longer hidden, the system has to be able to detect it again, focus its attention, and reacquire it as quickly as possible.
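A simple way to handle reacquisition is to keep a hidden object's last known box for a few frames and to match it against new detections when it reappears. The sketch below is a deliberately small, greedy tracker based on box overlap (intersection over union). The class name, the 0.3 overlap threshold and the limit of 5 missed frames are assumptions for illustration; production trackers are considerably more elaborate.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class SimpleTracker:
    """Greedy IoU tracker that keeps unmatched tracks alive for a few frames."""

    def __init__(self, iou_threshold: float = 0.3, max_missed: int = 5) -> None:
        self.iou_threshold = iou_threshold
        self.max_missed = max_missed
        self.tracks: Dict[int, dict] = {}  # track_id -> {"box", "missed"}
        self.next_id = 0

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        """Match this frame's detections to tracks; return {track_id: box}."""
        assigned: Dict[int, Box] = {}
        unmatched = list(detections)
        for track_id, track in self.tracks.items():
            best = max(unmatched, key=lambda d: iou(track["box"], d), default=None)
            if best is not None and iou(track["box"], best) >= self.iou_threshold:
                track["box"], track["missed"] = best, 0
                assigned[track_id] = best
                unmatched.remove(best)
            else:
                track["missed"] += 1  # possibly occluded: keep last known box
        # Drop tracks that stayed hidden for too long.
        self.tracks = {t: v for t, v in self.tracks.items()
                       if v["missed"] <= self.max_missed}
        # Any detection left over starts a new track.
        for box in unmatched:
            self.tracks[self.next_id] = {"box": box, "missed": 0}
            assigned[self.next_id] = box
            self.next_id += 1
        return assigned


tracker = SimpleTracker()
first = tracker.update([(100.0, 100.0, 150.0, 200.0)])  # bicycle is visible
tracker.update([])                                      # a car hides it
tracker.update([])
again = tracker.update([(104.0, 100.0, 154.0, 200.0)])  # bicycle reappears
```

In this run the bicycle keeps the same track identifier before and after the occlusion, because its track was kept alive while no detection matched it.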
Your camera is filming an event your company has been preparing for months. Suddenly a member of the camera crew trips on the tripod, the camera moves, and the images become very blurry for a few seconds. Depending on how your model works, whether it analyzes the video frame by frame or aggregates information across all the available frames, the machine may still be able to detect the items in the frames even when motion blur is at play.
Combining Image Object Detection and Video Tracking
Image Object Detection is essential to perform Computer Vision properly, and thus it is necessary for VOD. Video Object Detection is based on the combination of results from both Image Object Detection and Video Tracking tasks. A fairly recent study by Ye Lyu, Michael Ying Yang, George Vosselman, and Gui-Song Xia, "Video object detection with a convolutional regression tracker", published in June 2021 (https://doi.org/10.1016/j.isprsjprs.2021.04.004), proposes a brand new option for VOD. The team proposes to combine the advantages of Image Object Detection and Video Tracking into a single tracker which can be integrated directly into existing detection methods.
This paper builds on the works of Liu et al., 2018 and 2016, Zhu et al., 2017 and Girshick, 2015, among others. In the early days of VOD research, Liu was one of the first scientists to state that VOD tasks should first go through object detection in individual images and then move on to post-processing the information linked to all the pictures. Zhu also argued that object tracking could be a way to link all this information, which is why the question of Optical flow is prominent in VOD today. Optical flow is a tool that calculates and traces the movement of all the items in an image. It takes the pixels in an image and estimates their trajectory from one video frame to the next. Combining both image detection and object tracking could be the answer for a faster and more accurate VOD.
Other research studies have been conducted and new methods have been released. One of them is particularly interesting for real-time VOD. In May 2021, Lu He, Qianyu Zhou, Xiangtai Li, Li Niu, Guangliang Cheng, Xiao Li, Wenxuan Liu, Yunhai Tong, Lizhuang Ma, and Liqing Zhang released a new method that could revolutionize the field of Computer Vision. It proposes to use both the trajectories of objects and the temporal information of the video. The objects could be identified and tracked at the same time, making VOD much more efficient and accurate. You can find the necessary documentation on the following website: https://arxiv.org/abs/2105.10920.
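As a rough illustration of how optical flow can link detections across frames, the sketch below shifts a detected box by the median motion of the pixels inside it. It assumes that a dense flow field (one displacement per pixel) has already been computed by some optical flow estimator; the function name and the synthetic flow field are hypothetical and only serve the example.

```python
import numpy as np


def propagate_box(box, flow):
    """Shift a box by the median optical flow measured inside it.

    box is (x_min, y_min, x_max, y_max) in integer pixels.
    flow has shape (height, width, 2) and holds the per-pixel (dx, dy)
    displacement from the current frame to the next one.
    """
    x_min, y_min, x_max, y_max = box
    region = flow[y_min:y_max, x_min:x_max].reshape(-1, 2)
    dx, dy = np.median(region, axis=0)
    return (x_min + dx, y_min + dy, x_max + dx, y_max + dy)


# Synthetic flow field (assumption): static background, and one object
# moving 3 pixels to the right and 1 pixel down.
flow = np.zeros((120, 160, 2))
flow[40:80, 50:90] = (3.0, 1.0)

next_box = propagate_box((50, 40, 90, 80), flow)
```

Using the median rather than the mean makes the estimate less sensitive to background pixels that fall inside the box.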
How is VOD used in today’s society?
Video Object Detection is still an extremely young tool to work with, but some companies have understood that this kind of AI technique is going to help with their productivity.
Video surveillance
State-of-the-art object detection programs have proven essential to accurately identify and track objects in videos or images. Video surveillance systems have taken the opportunity to work with these programs. Many retailers have decided to equip their premises with cameras to spot potential thefts or antisocial behavior, and even to monitor the performance and safety of their employees.
Facial detection
A few years ago, unlocking your smartphone was done with a simple slide of the finger, or by putting your thumb on the central button for fingerprint recognition. But today, looking at your camera is enough to unlock your phone, thanks to facial detection and real-time VOD. Some applications that let you try on new glasses online also rely on VOD systems.
VOD techniques also allow you to send your friends and family funny faces of yourself using filters on apps like Snapchat or Instagram.
Self-driving cars
New car models have introduced real-time VOD in their onboard systems. Autonomous vehicle systems are able to identify, locate and track the trajectories of surrounding still or moving objects, for example traffic lights, pedestrians, other cars, or bicycles. Anticipating the actions of every object allows the car and its passengers to avoid being involved in accidents.
Using Video Object Detection is becoming more and more important in today’s world. IT researchers are still working hard on making this task more accurate and closer to the way a human being would analyze their environment.