How Deep Learning Powers Computer Vision
Computer vision applies deep learning to images & videos to detect & distinguish objects. Read on to learn how it works, the history behind it, and what lies ahead.
One of the most interesting applications of deep learning is its combination with computer vision. Deep learning for computer vision gives machines the ability to identify and respond to what they sense around them. An example of this is self-driving cars, which use a range of different sensors, including cameras, to survey the environment and rely on deep learning neural networks to decide what to do based on what they see.
What Is Computer Vision?
Computer vision is an application of artificial intelligence that enables machines to identify objects in digital imagery and video feeds, and its output can serve as input for neural networks that make decisions, such as in self-driving cars. Computer vision has a wide range of use cases across many different industries. It requires developing algorithms and training convolutional neural networks on tasks such as image recognition and object identification. This can then be applied to many different functions, from identifying diseases and tumors in X-rays and medical scans to powering self-driving cars and enabling them to drive safely and respond to their environment.
The History of Computer Vision
Efforts to get machines to identify and classify images stretch back to the late 1950s, though at that early stage image scanners had not yet been invented. It was not until the 1960s that this technology arrived and enabled researchers to digitize images and transform 2D images into 3D forms. This was also the decade in which artificial intelligence developed as a field of study.
The next major milestone occurred in 1974, when optical character recognition (OCR) became available, which allowed computers to recognize letters and characters in digital images. This opened the door to computers being able to digitize and ‘read’ books and other types of written documents. This technology is now commonplace in many different industries, from car license plate recognition to mail sorting and the preservation and archiving of historical documents.
By the early 2000s, real-time facial recognition technology had emerged and standards for machine learning datasets began to be developed. In 2010, the first ImageNet Large Scale Visual Recognition Challenge was held, built on the ImageNet dataset of millions of photographs sorted into thousands of categories. In 2012, a model known as AlexNet entered this challenge and cut image recognition error rates dramatically compared to its peers, a result widely considered a breakthrough moment in the history of computer vision. After this, the range of use cases and applications for computer vision exploded.
The Challenges in Computer Vision
The big challenge with computer vision, and image classification specifically, is coping with the enormous variation in what would be trivial tasks for a human. For example, when looking at a picture of a chair, an algorithm needs to understand what a chair looks like from many different angles. If a model has only been trained on chairs facing a certain direction, it will fail to identify chairs photographed from other angles and directions.
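One common mitigation is data augmentation: generating rotated and flipped variants of each training image so the model sees objects in more orientations than the raw dataset provides. The sketch below is a minimal illustration using simple 90-degree rotations and mirror flips; real training pipelines use much richer transforms (random crops, color jitter, perspective warps).

```python
import numpy as np

def augment_orientations(img):
    """Return the 8 dihedral variants of an image:
    4 rotations (0, 90, 180, 270 degrees), each with and
    without a horizontal flip."""
    variants = []
    for k in range(4):
        rotated = np.rot90(img, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants

# A tiny 2x2 "image" makes the variants easy to inspect.
img = np.array([[1, 2],
                [3, 4]])
augmented = augment_orientations(img)
print(len(augmented))  # 8 variants per input image
```

Each training image thus yields eight samples, teaching the model that object identity is invariant to orientation without collecting any new photographs.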
There is also the problem of gathering sufficient training data. Many computer vision algorithms require training on thousands of data points in order to learn object detection and classification well enough to identify known objects in unseen data. Moreover, this training data must be of sufficiently high quality to avoid the problems of overfitting and underfitting. Both typically arise from poor-quality or insufficient training data: an overfit model memorizes its training set and cannot generalize beyond it, while an underfit model fails to identify objects even in imagery similar to what it was trained on.
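The standard way to detect these failure modes is to hold out a validation set and compare training error against validation error: an overfit model scores well on the data it was trained on but poorly on held-out data. The sketch below illustrates the idea with polynomial curve fitting standing in for a neural network (the degrees and sample sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of an underlying linear relationship.
x = rng.uniform(0, 1, 30)
y = 2 * x + rng.normal(0, 0.1, 30)

# Hold out the last 10 points for validation.
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]

def mse(degree, xs, ys):
    """Fit a polynomial of the given degree on the training
    split, then measure mean squared error on (xs, ys)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

for degree in (1, 9):
    print(degree, mse(degree, x_train, y_train), mse(degree, x_val, y_val))
```

The flexible degree-9 model always achieves lower training error than the simple linear fit, but its validation error reveals whether that improvement reflects real signal or memorized noise.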
Self-supervised learning is a potential future avenue for solving this problem: it aims to let machines learn from unlabeled data in a more passive manner, without requiring extensive labeling and human supervision in the learning process.
Computer vision is also computationally demanding, so most of the processing work relies on cloud computing to handle the large datasets involved. Making these neural networks portable enough to run on regular devices is another goal of machine learning engineers.
Computer Vision Tasks and Applications
The applications that deep learning enables for computer vision span many industries, but in general they involve some or all of the following tasks.
Image Classification & Localization
Image classification involves the recognition of objects within an image or video. A classic example of this, and a task often used as a benchmark for newly developed algorithms, is the MNIST dataset, which contains tens of thousands of images of the handwritten digits 0-9. Image classification & localization involves recognizing these characters and identifying their location within an image.
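At its core, a digit classifier maps each image to one of ten classes. The numpy sketch below shows only that input/output contract, with randomly initialized weights standing in for a model that a real system would learn with a convolutional network trained on the actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of four fake 28x28 grayscale "digit" images.
images = rng.random((4, 28, 28))

# Random weights stand in for trained parameters; a real
# classifier learns these from labeled training images.
W = rng.normal(0, 0.01, (28 * 28, 10))
b = np.zeros(10)

def classify(batch):
    """Map each image to a probability over the 10 digit classes
    and return the most likely class per image."""
    logits = batch.reshape(len(batch), -1) @ W + b
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

labels, probs = classify(images)
print(labels.shape, probs.shape)  # (4,) (4, 10)
```

Training replaces the random `W` and `b` with values that minimize classification error; convolutional layers additionally exploit the 2D structure of the image rather than flattening it.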
Object detection is closely related to image classification & localization, except that an image may contain multiple instances of several different objects. An example of this would be a photograph of a street scene, in which object detection involves recognizing, distinguishing and labeling several different objects such as cars and pedestrians. Object detection algorithms identify the objects they have been trained on and draw a bounding box around each one.
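Predicted bounding boxes are conventionally scored against ground-truth boxes using intersection-over-union (IoU): the area the two boxes share divided by the area they cover together. A short self-contained sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes,
    each given as (x1, y1, x2, y2). Returns a value in [0, 1]:
    0 for disjoint boxes, 1 for identical boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```

Detection benchmarks typically count a prediction as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.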
Object segmentation follows on from object detection but labels objects much more precisely. Rather than enclosing an object in a bounding box, segmentation identifies objects at the pixel level, producing exact outlines instead of rectangles. It can be thought of as object detection combined with pixel-level classification, and it allows computer vision systems to distinguish different instances of similar objects in a scene, such as the individual people in a photograph of a group.
Style transfer is the application of deep learning to capture the techniques and features of a particular art style so that new images can be rendered in that style. An example of this is training neural networks on a particular artist's distinctive style (common examples include Van Gogh and Picasso) and then using those trained networks to create new art in the same style. Given a photograph, style transfer algorithms can render a computer-generated version of it in that artist's style.
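In the classic neural style transfer formulation, an artist's "style" is summarized by Gram matrices of convolutional feature maps: correlations between feature channels, which capture texture and brushwork while discarding spatial layout. The simplified numpy sketch below computes that statistic on a random feature map (in a real system the features come from a pretrained convolutional network):

```python
import numpy as np

def gram_matrix(features):
    """Channel-wise correlations of a (C, H, W) feature map.
    Matching these statistics between a generated image and a
    reference painting is what transfers the painting's style."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_loss(feat_generated, feat_style):
    """Mean squared difference between the two Gram matrices."""
    diff = gram_matrix(feat_generated) - gram_matrix(feat_style)
    return np.mean(diff ** 2)

rng = np.random.default_rng(0)
feats = rng.random((8, 16, 16))
print(style_loss(feats, feats))  # 0.0 for identical styles
```

Style transfer then iteratively adjusts the generated image to reduce this style loss while a separate content loss keeps the photograph's subject recognizable.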
Digital colorization involves using neural networks to apply color to black and white images and videos. This is commonly done with famous historical photos and videos that are only available in black and white.
Image reconstruction involves providing an image with some information missing and using neural networks to recreate the image and fill in the missing information. An example of this would be providing an old, historical photo with some tears or faded edges and having the image reconstructed to fit the best guess of what it would have looked like when whole.
Image super-resolution involves using deep learning neural networks to enhance the resolution of an image beyond that of the original. Thus, a low-resolution photograph may be provided and a much higher resolution version generated from it. This is very commonly used in the film industry to create high-resolution re-releases of movies that were originally filmed at lower resolutions.
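What makes learned super-resolution interesting is the contrast with naive interpolation, which can only spread existing pixels around; a trained network instead predicts plausible new detail from patterns it has seen in training data. The sketch below shows the naive baseline, nearest-neighbor upscaling, for comparison:

```python
import numpy as np

def upscale_nearest(img, factor=2):
    """Naive nearest-neighbor upscaling: each pixel becomes a
    factor x factor block. A learned super-resolution network
    would instead synthesize new high-frequency detail rather
    than merely repeating existing pixels."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

low_res = np.array([[0, 255],
                    [255, 0]], dtype=np.uint8)
high_res = upscale_nearest(low_res, factor=2)
print(high_res.shape)  # (4, 4)
```

The output is larger but carries no new information, which is exactly the gap that deep super-resolution models aim to fill.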
Image synthesis is the task of blending different images together in a realistic manner. This could involve modifying photographs to turn one type of object into another, such as turning a horse into a zebra in a photograph, or it could even involve ‘deepfakes’, which transplant someone’s face onto another person’s body. This can be done in real time, and it can be combined with other technologies such as speech synthesis to create highly convincing videos of famous people making scripted statements that have been generated entirely by machine learning. This kind of use case is controversial, but the technology behind it demonstrates the capabilities of deep learning for computer vision.
Ethical Concerns in Computer Vision
The rise of deepfakes and other types of computer-generated images and videos made possible by image synthesis has given rise to ethical concerns within the industry. Some of the potentially unethical uses for this technology include using deepfakes to create artificial pornography by swapping the faces in a scene with someone else's, or creating convincing artificial videos of politicians making statements that they never actually made.
There is also the issue of systemic bias in the field of computer vision, which can stem, at least in part, from the datasets used to train models. Facial recognition technology has been shown to produce higher error rates for black and Asian faces than for other races, leading to issues of misidentification that become especially problematic when these systems are put into use by law enforcement.
Fraud detection and prevention is another major issue, particularly as it concerns facial recognition. By using this technology as a security feature, hackers and fraudsters have been attempting to find ways to game these systems to improperly gain access. Computer vision systems must be able to distinguish between a real face and various impersonation tactics, such as photographs and masks.
What Lies Ahead for Computer Vision
Despite the long history of computer vision, it’s really only within the past decade that the industry has exploded in popularity as a field of study and as a practical solution to problems across industries. Greater accuracy and contextual understanding of images are powering new possibilities for these convolutional neural networks, particularly as more of this processing becomes achievable in real time.
Tesla is a good example of a forward-looking application of computer vision to self-driving cars. The company hopes to eventually achieve Full Self-Driving, which would enable its electric cars to navigate roads on Autopilot without the need for human intervention. This means a driver could summon their car from anywhere and use it to reach another destination without having to manually intervene. This is not yet possible, due to the complexity of the tasks involved and the relative infancy of the technology, but it is a good demonstration of what industry leaders hope to achieve by applying deep learning to computer vision.
Another practical use case of this technology is helping tourists find their way around countries where they do not speak the local language. Apps like Google Translate use deep learning and computer vision to recognize characters in images and translate them into different languages in real time. Using this technology, a tourist can point their camera at a train timetable, restaurant menu, or street sign written in a foreign language and have it translated into their native language on the spot. This drastically reduces language barriers and allows people to communicate who would otherwise have struggled.