For AI to successfully interact with, and learn from, its environment, it must be able to ‘see’ what’s going on.
Perhaps the greatest challenge for supervised machine learning is obtaining the large amounts of data required for training – and, most importantly, labeling it.
While it took several years to complete, the approach behind the immense ImageNet database was genuinely innovative: crowdsourcing. ImageNet’s data were to have a monumental impact on the then-struggling development of computer vision systems – enabling AI to derive information from digital images.
Back in July 2008, it had zero images. Yet, by December the same year, it had 3 million. And, by April 2010, it had a staggering **11 million images** classified into over 15,000 categories, or synsets, thanks to the work of 25,000 individuals. It was a breakthrough in AI innovation, offering researchers large, challenging datasets and the opportunity to spar with one another and publish their code.
Researchers took up the challenge, their systems competing on the accuracy of their best guesses. ImageNet’s approach was to become a game-changer – bringing neural net vision systems back from the brink of being a lost cause.
What is computer vision?
For humans, vision is an essential perceptual channel, passively accepting stimuli from the world and creating some form of internal mental representation reflecting what we have seen. While visual perception may appear a simple and effortless process, its complexity became increasingly apparent when AI researchers tried to teach computers to visually ‘perceive’ the environment.
The machine learning approach involves using **algorithmic models** that enable a computer to teach itself about the context of visual data. If enough data are fed through the model, the computer will learn to tell one image from another. Such algorithms enable the machine to teach itself to recognize what it sees rather than someone programming it, but it requires a great deal of data.
**Machine learning for visual analysis typically relies on deep learning** – drawing on predictive modeling and statistics – and, in particular, on convolutional neural networks.
Receiving and processing light waves
Human vision is **passive**, detecting naturally occurring energy – in this case, light waves. Unlike bats actively sending ultrasound out into the world and waiting for it to bounce back, we receive light at the back of the eye, and information from the retina then travels up the optic nerve to our highly evolved brains. Computer vision is similar. Rather than actively sending light, sound, or radar out into the environment, it involves extracting 3D data from digital images through a series of complex processes.
Each ray of light produces an electrical effect based on its wavelength when it hits the image sensor. Over a defined exposure time, the sensor output is summed and sent for processing, ultimately using the light to create two-dimensional (2D) images.
Yet to see a ‘focused’ image, the system must find a way to ensure the photons received, and, therefore, the information processed, originate from the same spot on the object in the real world.
Extracting information about the object
Creating a 3D representation from the 2D information that hits the optical sensors is not straightforward. It requires combining a series of complementary bottom-up processes that apply direct computation to the sensor recordings, handling the key attributes of image identification.
The brightness of an image provides clues to the shape and identity of an object, yet it is surprisingly ambiguous – depending on the ambient lighting and on whether a surface faces the light or lies in shadow.
Humans are excellent at ignoring the effects of colored lights and can estimate what a color would be under white light – an ability known as ‘color constancy.’ To have human-like vision, a computer must replicate the ability to predict that the same surface may appear different under various colored lights.
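A minimal sketch of one classic approach is the ‘gray-world’ algorithm – an illustrative choice here, not something the text above prescribes. It assumes the average color of a scene is neutral gray and rescales each channel accordingly:

```python
import numpy as np

def gray_world_balance(image):
    """Approximate color constancy with the classic gray-world assumption:
    scale each channel so its mean matches the overall mean brightness."""
    image = image.astype(np.float64)
    channel_means = image.reshape(-1, 3).mean(axis=0)  # mean R, G, B
    gray = channel_means.mean()                        # target gray level
    balanced = image * (gray / channel_means)          # per-channel gain
    return np.clip(balanced, 0, 255)

# A white wall photographed under a reddish lamp: R inflated, B depressed.
reddish = np.full((4, 4, 3), [200.0, 150.0, 100.0])
corrected = gray_world_balance(reddish)  # every channel pulled back to 150
```

Under a colored lamp, every channel mean is pulled toward the illuminant; dividing by the per-channel means cancels that cast – a crude computational analogue of color constancy.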
The pattern on the surface of an object, or its ‘texture,’ is a property of a patch of the image, rather than an individual pixel, further supporting object identification and 3D information recovery by helping to combine multiple images.
Several properties of images, both still and moving, help classification. It is vital to overcome the challenges of handling large amounts of visual data and the potential for confusion from ‘noisy’ and messy images – those containing random variation in lightness and color, poor picture quality, or out-of-focus objects.
**Object edges are identified from significant changes in pixel intensity** – a measure of energy that impacts ‘brightness,’ or how the human visual system perceives light. The information can be combined with **optical flow** to make sense of the apparent motion of an image over a series of static images. Segmentation then takes groups of similar pixels and associates them with properties such as brightness, texture, and color.
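To illustrate the first step – finding significant changes in pixel intensity – here is a minimal finite-difference sketch, deliberately simpler than production edge detectors such as Sobel or Canny:

```python
import numpy as np

def edge_strength(gray):
    """Edges appear where pixel intensity changes sharply; approximate the
    horizontal and vertical gradients with simple finite differences."""
    gray = gray.astype(float)
    gx = np.abs(np.diff(gray, axis=1))  # change between neighboring columns
    gy = np.abs(np.diff(gray, axis=0))  # change between neighboring rows
    return gx, gy

# A dark square on a bright background: strong responses only at its border.
img = np.full((6, 6), 255.0)
img[2:4, 2:4] = 0.0
gx, gy = edge_strength(img)  # gradient peaks mark the square's edges
```

Uniform regions produce zero gradient; the object’s boundary produces the maximal response, which is exactly the signal segmentation and optical flow then build on.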
**The greater the information acquired and the processing performed, the greater the likelihood of correct image classifications.**
Convolutional neural networks and ImageNet
Modern approaches classify images using their appearance and require large amounts of training data.
Indeed, classifying images was an almost impossible task until the arrival of datasets such as ImageNet, which contains over 14 million training images classified into 30,000 fine-grained categories. Nowadays, **Convolutional Neural Networks (CNNs) are recognized as highly successful image classifiers, surpassing all other methods.**
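To make the core CNN operation concrete, here is a minimal sketch of a single convolution-plus-ReLU step in plain NumPy. The kernel is hand-chosen for illustration; a real CNN learns thousands of such kernels from data:

```python
import numpy as np

def conv2d(image, kernel):
    """One 'valid' 2-D convolution pass -- the operation a CNN layer repeats
    with many learned kernels to build feature maps -- followed by ReLU."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return np.maximum(out, 0)  # ReLU: keep only positive activations

# A kernel that responds to a bright pixel surrounded by darkness.
spot_detector = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
image = np.zeros((5, 5))
image[2, 2] = 1.0
feature_map = conv2d(image, spot_detector)  # fires only over the bright spot
```

Stacking many such layers, each feeding on the previous one’s feature maps, is what lets a CNN progress from edges and spots to whole-object categories.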
The relatively small MNIST dataset – a collection of 70,000 images of handwritten digits, 0-9 – serves as a warm-up for digit identification. Processes such as ‘dataset augmentation’ are crucial for handling real-world images: randomly shifting, rotating, and stretching them increases the effective size of the dataset and facilitates training and learning.
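A minimal augmentation sketch, assuming small random shifts and horizontal flips preserve the label of the images in question (for digits, a flip would not – ‘6’ and ‘9’ are the classic counterexample – so treat the flip here as illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, rng):
    """Dataset augmentation sketch: a random shift and an occasional flip
    produce a 'new' training image from an existing one."""
    dy, dx = rng.integers(-2, 3, size=2)           # shift up to 2 pixels
    shifted = np.roll(image, (dy, dx), axis=(0, 1))
    if rng.random() < 0.5:                         # random horizontal flip
        shifted = shifted[:, ::-1]                 # (label-preserving only
    return shifted                                 #  for symmetric classes!)

digit = rng.random((28, 28))                       # stand-in for one image
batch = [augment(digit, rng) for _ in range(8)]    # eight augmented variants
```

Each variant contains exactly the same pixel values rearranged, so the model sees fresh poses of the same object without any new labeling effort.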
When classifying objects, an awareness of context helps CNNs disregard any information that does not help them recognize or make distinctions with accuracy.
Detecting what is in the scene
Rather than detect all that is in a scene, **object detection** requires the AI to find one or many objects in the image(s). Think about ‘Where’s Waldo?’ We spend time trying to find this tricky little bespectacled guy in his red stripy top, ignoring other characters and objects.
While relatively easy, if not frustrating, for us, an AI has to perform a lot of specific processing. Object detectors, such as CNNs, typically use ‘sliding windows’ that pass over the image, classifying, for example, a car over here and a pedestrian over there.
Image windows are scored and sorted using a ‘greedy’ algorithm called **non-maximum suppression**, discarding lower-scoring ones, then trimmed further using **bounding box regression**.
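The greedy step can be sketched in a few lines; the 0.5 overlap threshold below is an illustrative choice, not a fixed standard:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, discard any remaining box
    overlapping it by more than `threshold`, and repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        mask = [iou(boxes[best], boxes[i]) <= threshold for i in order[1:]]
        order = order[1:][mask]
    return keep

# Two overlapping detections of the same car, plus one distant pedestrian.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = non_max_suppression(boxes, scores)  # duplicate car box suppressed
```

Only one box per object survives: the duplicate car window overlaps the winner too much and is discarded, while the pedestrian window, with zero overlap, is kept.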
Detected objects are evaluated by matching them against a collection of pre-labeled images. Detectors are then scored on how many objects they find and how precisely – penalizing false positives (reporting an object that is not there) and false negatives (missing an object that is).
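This trade-off reduces to two standard ratios, precision and recall:

```python
def detection_scores(true_positives, false_positives, false_negatives):
    """Precision: of the objects we reported, how many were really there?
    Recall: of the objects really there, how many did we find?"""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# A detector finds 8 real objects, raises 2 false alarms, and misses 2.
precision, recall = detection_scores(8, 2, 2)  # -> (0.8, 0.8)
```

A detector that reports everything scores perfect recall but terrible precision, and vice versa, which is why both numbers are reported together.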
2D pictures are often rich in cues about the 3D world, especially when there are several of them. Indeed, if you have 2 images and know each camera’s position and properties, you can construct a 3D model.
And if you can match points in each picture, you don’t even need to know much about the 2 cameras. For example, 2 views of a single point provide 4 coordinates – an x and a y in each image – yet only 3 coordinates are needed to specify a point in 3D space, so matched points over-determine the geometry.
Texture and color are powerful attributes for matching points. For example, a green traffic light may appear in different positions in each image, enabling point matching.
Having 2 eyes offers us disparity, as do 2 cameras or images. Known as **binocular stereopsis**, superimposing the two images gives us a 3D view. If we can work out the disparity, we can calculate depth. When our camera is moving, **the disparity in the optical flow allows us to extract valuable information**, such as the scene’s geometry.
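For a rectified stereo pair – two parallel cameras, an assumption this sketch relies on – the disparity-to-depth calculation is a one-liner:

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """For a rectified stereo pair, depth is inversely proportional to
    disparity: depth = focal_length * baseline / disparity."""
    return focal_length_px * baseline_m / disparity_px

# Cameras 0.1 m apart, focal length 700 px; a point shifts 35 px between views.
depth = depth_from_disparity(700, 0.1, 35)  # -> 2.0 metres
```

The relationship explains why stereo depth degrades with distance: far-away points produce tiny disparities, so a one-pixel matching error translates into a large depth error.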
Real-world applications of computer vision
The potential applications of computer vision are truly vast. AIs that can model and represent images successfully can help us understand what people are doing. Such knowledge can help change how AIs interact with them, with systems becoming more responsive to our behavior. Buildings can better monitor existing behavior and anticipate what people will do next, manage resources, produce supportive environments, and save energy and costs.
The capacity to link pictures and words provides opportunities for improved online searching, captioning for the visually impaired, and the ability to answer questions about an image, such as *“Is it raining?”*. Indeed, Facebook – since rebranded as Meta – has used a billion pictures taken from Instagram to become better at identifying the specific images users search for. Don’t we all love a cute kitten?
When we have a large collection of images, even if not paired, we can combine them to recreate a scene, object, or even reconstruct how a town looks from a series of tourist photos. It is not hard to imagine its potential use to recreate damaged buildings or art digitally, provide the design for their rebuild in the real world, or produce a visual map for a rescue operation.
The potential for misuse
Computer vision has incredible potential – much of which remains untapped and unimagined. And yet, it has the power to be misused by corporations, individuals, and even governments to mislead, subverting the truth by creating false realities.
**Generative adversarial networks** (GANs) have the ability to create deepfakes – novel, sham photos and videos that appear to show a famous person but are not real. Such synthetic media can be the ideal deception, with images shared across social media and used in celebrity pornography, fake news, and financial fraud.
And yet, not all uses are harmful, even if misleading. To make the Star Wars film Rogue One, the face of Carrie Fisher, then 60, was digitally reinvented as that of her 19-year-old self and superimposed on another actor’s body. Such deepfake techniques have allowed her character to live on even after the actress’s untimely death in 2016.