Embedded vision: we’re only at the beginning
Smart image analysis has enormous potential. An image sensor produces copious amounts of data. The embedded vision algorithms and platforms that can interpret and give meaning to this data enable completely new user experiences, on the mobile phone, in the home, and in the car. videantis’ Marco Jacobs sheds light on the applications, the techniques that are used, and the required compute platforms.
Image processing, according to Wikipedia, is “Any method of signal processing for which the input is an image, like a photo or frame of video, and the output is either an image, or a collection of characteristics”. This kind of image processing is everywhere around us. Our mobile phones do it, for example, as do our TVs…and so do we.
For us humans, the eyes perform simple image processing tasks: focusing (sometimes with the help of glasses), controlling exposure, and capturing color and light information. The interesting part starts in our brain, however, where we interpret the images and give meaning to them. Research has shown that about half of our brain is allocated to image processing. Apparently this is a compute-intensive task, as it requires lots of synapses and neurons performing their magic. But it does pay off.
A decade ago, computer vision techniques were used primarily in professional applications: cameras inspecting products during assembly in the factory, or surveillance cameras triggering an alarm when they detected motion. Since then, however, embedded vision has expanded into other markets, consumer electronics among them. Even an inexpensive digital camera these days detects the location of faces in the scene and adjusts its focus accordingly.
One of the best-known successes of embedded vision is Microsoft’s Xbox Kinect. The Kinect, originally sold as an Xbox accessory (later also as a PC peripheral), projects a pattern of infrared light onto the gamers in the room. Based on the distortions of the pattern, it then constructs a depth map. Using this depth information, the console can easily distinguish people or objects from the scene’s background, and use this information in games and other applications. Since the introduction of the Kinect to the gaming market, similar techniques have also made their way to other industries.
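To illustrate why a depth map makes it easy to separate people from the background, here is a minimal sketch in Python. It is not how the Kinect itself works internally; the depth values, the millimetre unit, the convention that 0 means "no reading", and the function name `segment_foreground` are all assumptions made for the example.

```python
def segment_foreground(depth_map, max_depth_mm):
    """Return a binary mask: 1 where a valid reading is closer than max_depth_mm."""
    return [
        [1 if 0 < depth < max_depth_mm else 0 for depth in row]
        for row in depth_map
    ]


# Hypothetical 4x4 depth map in millimetres; 0 marks a missing reading (assumption).
depth_map = [
    [3200, 3200, 3150, 3200],
    [3200, 1400, 1350, 3200],
    [3200, 1380, 1360,    0],
    [3200, 3200, 3200, 3200],
]

# Everything nearer than 2 m is treated as foreground (an arbitrary threshold).
mask = segment_foreground(depth_map, max_depth_mm=2000)
for row in mask:
    print(row)
```

With color images alone, the same separation would require far more elaborate analysis; with depth, a single comparison per pixel already gives a usable first mask.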
Today’s smartphones have at least two cameras, three if you also count the touch sensor, since it captures an image from which the positions of the fingertips on the screen can be deduced. The iPhone includes another sensor that takes an image of a fingerprint. Amazon’s Fire Phone includes another four image sensors that are used to find the gaze direction of the person holding the phone, which is then used to present a real-time 3D user interface.
Still, we’re only at the beginning of embedded vision. Many new applications are being developed, and large innovative companies like Google and Amazon are investing heavily. The application that captures the imagination most is probably the self-driving car. Google recently introduced an autonomous 25 MPH two-seat vehicle that doesn’t have a steering wheel, gas pedal, or brake pedal. The premise of autonomous vehicles is that they drive us to our destinations flawlessly, which would make car accidents largely a thing of the past.
Another interesting initiative at Google is Project Tango, which adds multiple image sensors to mobile phones and tablets. The primary goal of these depth and fisheye cameras is not to take nice pictures, but to analyze the mobile device’s surroundings, in order to accurately deduce its location and orientation. Once the exact camera pose is known, unique augmented reality games can be implemented. Imagine, for example, Mario not only being able to jump on platforms on the display, but also on the couches, tables and bookcases around you. Such non-GPS-based accurate positioning also opens the door to indoor navigation, i.e. user-generated InsideView instead of StreetView.
The algorithms at the foundation of smart embedded vision are still very much in evolution. Scientists publish one paper after another; companies similarly have unique approaches. One popular software package at the moment is OpenCV. This open source library offers over a thousand different computer vision and image processing routines, of which typically only a small portion is used in any one product.
The Khronos Group, perhaps best known for its OpenGL, OpenCL and OpenMAX standards, is working on OpenVX, a library that can be used to construct efficient image processing systems. This library consists of only 40 image analysis routines, but is structured in a framework that allows image data to stay local to processing units. This attribute can greatly reduce the number of data read and write operations to external memory, lowering power consumption and increasing performance significantly.
Most algorithms are variations on the same theme. Feature detection, for example, finds interesting points in an image, mostly corners. A 300 KByte VGA image of a square, for instance, is then converted into just 4 data points of the square’s corners, a significant reduction of the amount of data. There are many different algorithms to find the corners, but most of them are very similar. Another key technique is feature tracking. This algorithm follows points from one frame to another in a video stream. This way, we get information about the speed and direction of the objects in the scene, or the change in position of the camera. Using a technique called structure from motion, this information can even be used to obtain a rough 3D model of the scene that the camera captured.
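The data reduction in the square example can be sketched in a few lines of Python. The detector below is a deliberately naive neighbor-count test that only works for a filled, axis-aligned shape in a binary image; it is not one of the corner detectors used in practice, and the helper names `make_square_image` and `find_corners` are made up for this illustration. The frame-size arithmetic assumes 8-bit grayscale VGA.

```python
WIDTH, HEIGHT = 640, 480           # VGA, assuming one byte per pixel
print(WIDTH * HEIGHT // 1024)      # frame size in KByte


def make_square_image(size, top, left, side):
    """Binary image (list of rows): a filled square of 1s on a background of 0s."""
    return [
        [1 if top <= y < top + side and left <= x < left + side else 0
         for x in range(size)]
        for y in range(size)
    ]


def find_corners(image):
    """Return (x, y) of foreground pixels with exactly 3 foreground 8-neighbors.

    In a filled axis-aligned rectangle, corner pixels have 3 such neighbors,
    edge pixels have 5, and interior pixels have 8.
    """
    height, width = len(image), len(image[0])
    corners = []
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    inside = 0 <= y + dy < height and 0 <= x + dx < width
                    if (dy or dx) and inside:
                        neighbors += image[y + dy][x + dx]
            if neighbors == 3:
                corners.append((x, y))
    return corners


image = make_square_image(size=16, top=4, left=5, side=6)
corners = find_corners(image)
print(corners)   # four (x, y) points instead of a full frame of pixels
```

Feature tracking would then follow these few points from frame to frame rather than comparing entire images, which is where most of the savings come from.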
A third key technique is object detection, which finds and classifies objects in an image, such as the location of a face, a person, or a car. Such algorithms need to be trained and tuned using reference images. By running a large library of images through a training algorithm, the software learns what are (and aren’t) the objects we’d like to detect. The resulting parameters of such an offline, non-real-time training algorithm are then fed into the real-time object detector.
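The split between offline training and real-time detection can be sketched as follows. This is a toy nearest-centroid classifier, chosen only because it is short; real object detectors use far richer features and models. The two-number feature vectors, the labels, and the function names `train` and `classify` are all hypothetical.

```python
def train(samples):
    """Offline step: compute one mean feature vector (centroid) per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [value / counts[label] for value in acc]
        for label, acc in sums.items()
    }


def classify(features, centroids):
    """Run-time step: return the label whose centroid is nearest to `features`."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical reference data: (feature vector, label) pairs.
reference = [
    ((0.9, 0.8), "face"),
    ((0.8, 0.9), "face"),
    ((0.1, 0.2), "background"),
    ((0.2, 0.1), "background"),
]

params = train(reference)               # done once, offline
print(classify((0.7, 0.75), params))    # done per image region, in real time
print(classify((0.3, 0.2), params))
```

Only `params` has to ship with the product; the reference images and the training code stay in the lab.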
The training phase typically requires lots of reference images, tuning, manual guiding, and verification of the algorithm. In the last few years, however, a new class of algorithms has been developed and become popular: convolutional neural networks, a form of deep learning. These algorithms can detect objects with higher accuracy or in a more generalized way. Training is also considered easier with deep learning techniques.
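The basic operation inside a convolutional network is small enough to show directly. The sketch below slides a 2x2 kernel over a tiny image and applies a ReLU, as one layer of such a network would. The kernel here is hand-picked to respond to a dark-to-bright vertical edge; in a real network the kernel values are learned during training. The function names are made up for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def relu(feature_map):
    """Clamp negative responses to zero."""
    return [[max(0, value) for value in row] for row in feature_map]


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 1],
    [-1, 1],
]  # hand-picked for illustration; normally learned

feature_map = relu(conv2d_valid(image, kernel))
for row in feature_map:
    print(row)
```

The response is non-zero only in the column where the image goes from dark to bright, which is the sense in which a kernel "detects" a feature.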
Image analysis requires lots of compute power, and at first glance seems quite brute force. A 5 megapixel black and white camera that captures 30 frames per second generates 150 Megasamples per second. Many algorithms generate a multiscale image of this input data, each time downscaling the image (by 10%, for instance), which increases the amount of data significantly. The object detector then runs the same algorithm on all the different resolutions, looking for a match.
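The arithmetic behind these numbers can be checked with a short Python sketch. The 2592 x 1944 sensor resolution (a common 5-megapixel format), the 24-pixel minimum size at which the pyramid stops, and the function name `pyramid_sizes` are assumptions for the example; the 10% downscale per level is taken from the text.

```python
FRAME_PIXELS = 5_000_000   # 5 megapixels, one sample per pixel (black and white)
FPS = 30
samples_per_second = FRAME_PIXELS * FPS
print(samples_per_second)  # samples per second from the sensor alone


def pyramid_sizes(width, height, scale=0.9, min_side=24):
    """List the (width, height) of each pyramid level.

    Each level shrinks both sides by `scale` (0.9 = 10% smaller) until the
    shorter side would drop below `min_side` pixels.
    """
    sizes = []
    w, h = float(width), float(height)
    while min(w, h) >= min_side:
        sizes.append((int(w), int(h)))
        w *= scale
        h *= scale
    return sizes


levels = pyramid_sizes(2592, 1944)   # assumed sensor resolution
total_pixels = sum(w * h for w, h in levels)
overhead = total_pixels / (2592 * 1944)
print(len(levels), round(overhead, 2))
```

Because each level holds 0.9 x 0.9 = 81% of the pixels of the previous one, the total approaches 1 / (1 - 0.81), a little over five times the original frame. The detector therefore has to process several times more data than the sensor delivers.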
Running such vision algorithms on a standard CPU is feasible only when the algorithms are simple and the resolution is low. When the algorithms get slightly more complex, or the required resolution and accuracy go up, we have to look for alternative, more powerful processing solutions. Recently, GPUs have become GPGPUs and quite powerful in the process. In addition, the tooling and software frameworks to program and optimize for these complicated machines have become more workable. Still, GPUs are typically not efficient enough. They use a lot of energy and are expensive because of the large silicon area they consume. FPGAs are another alternative for lower volumes. Casting algorithms in hardware yields the most efficient implementation, but since these algorithms are still under development and changing, this usually isn’t an effective solution.
A new class of digital signal processors, specifically optimized for energy efficiency and high-performance image processing, has recently emerged. Such processors don’t have the overhead that RISC processors have; they don’t need to run complex operating systems, web browsers and other large software stacks. These video DSPs also don’t carry the baggage that GPUs have because of their history in 3D graphics. An efficient and parallel video DSP that’s optimized for image processing seems to be an ideal solution.