Can programmable processors really be smaller than hardwired logic?


We run into this question quite often in customer meetings. We license our v-MP6000UDX processor and run applications like CNNs, computer vision algorithms, and video codecs on it. Frequently, such algorithms are implemented in hard-wired logic instead of running in software on a processor, as we do. Intuitively, you'd think such a hard-wired approach results in much smaller and lower-power implementations. However, we've seen many designs, and we've found that often the reverse is true: using our processor results in smaller and lower-power solutions than using hard-wired designs. In this article we highlight some of the reasons why that can be the case.

Silicon reuse

The first reason a processor-based design can result in a smaller solution is that a processor reuses silicon far more. In hard-wired designs, each function in an application becomes its own individual circuit. When using a processor, each function just becomes some code that resides in a memory. This code can then be executed on the processor, giving the processor virtually unlimited functionality. The more code there is to run in silicon, the more efficient a processor-based approach becomes compared to implementing it in hard-wired logic. Try implementing all of Android in hardware, for instance. It's simply impossible.
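The idea can be sketched in a few lines of Python (a toy illustration only, not the processor's actual toolchain): one execution engine runs many functions, because each function is just bytes of code selected at run time rather than a dedicated circuit.

```python
# Toy illustration of silicon reuse: one "execution unit" runs many
# functions, because each function is just code in memory.
def blur(x):      return [v // 2 for v in x]
def negate(x):    return [-v for v in x]
def threshold(x): return [1 if v > 10 else 0 for v in x]

# The "program memory": adding functionality costs bytes of code,
# not additional silicon area.
kernels = {"blur": blur, "negate": negate, "threshold": threshold}

def run(kernel_name, data):
    # The same hardware (here: the Python interpreter) executes
    # whichever function the program selects.
    return kernels[kernel_name](data)

print(run("threshold", [5, 20, 9, 30]))  # -> [0, 1, 0, 1]
```

Adding a fourth kernel grows the table by one entry, whereas a hard-wired design would need a fourth circuit.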

Dark silicon at the task level

Silicon reuse takes place at different levels. For instance, many mobile phone application processors contain hard-wired accelerators for video encoding and decoding. These are often separate blocks, where in most use cases either one or the other is in use, leaving some of the silicon dark. Additionally, support is required for many video coding standards. Since only one standard can be used at a time, the circuits that support the non-active coding standards are also dark.

Dark silicon at the kernel level

Besides at the higher application levels, the same phenomenon occurs at the lower levels of implementation. Take for instance an optical flow kernel, which is often used in computer vision to track the movement of the camera and of objects in view. Optical flow is usually just a small portion of a complete computer vision application, but an important one. The algorithm first finds a set of features, points in the image that can be tracked, by running different filters over the image. It then tracks these features from frame to frame, trying to find matches between the frames. This algorithm is data dependent: some scenes will have thousands of features and some will have very few. A scene of a tree with leaves will result in many features, while a camera pointed at a white wall or a blue sky will yield none. In a hard-wired implementation, the feature matching circuits either have to track thousands of features or may not have any work to do at all, since there are no features. This imbalance results in dark silicon: sometimes the circuits run full blast, sometimes they're turned off. When the algorithm is implemented in software and there are no features to track, the processor is immediately freed up to perform the next task thrown at it.
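The data dependence can be made concrete with a toy sketch (hypothetical helper names, not a real optical flow implementation): the matching work is proportional to how many features the scene yields, and a featureless scene produces no work at all.

```python
# Toy sketch of the data-dependent workload in feature tracking.
# `find_features` and `track` are illustrative stand-ins, not a real
# optical flow algorithm.

def find_features(image, thresh=8):
    """Mark pixels whose horizontal gradient exceeds a threshold."""
    return [i for i in range(1, len(image))
            if abs(image[i] - image[i - 1]) > thresh]

def track(features):
    # Matching work is proportional to the number of features found.
    return [(f, f) for f in features]  # placeholder 1:1 match

textured = [0, 10, 0, 10, 0, 10]  # "tree with leaves": many gradients
flat     = [7, 7, 7, 7, 7, 7]     # "white wall": no gradients

matches = track(find_features(textured))  # lots of matching work
idle    = track(find_features(flat))      # empty list: in software, the
                                          # processor moves straight on
                                          # to the next task
```

In a hard-wired design, the matching circuit sized for the `textured` case sits dark during the `flat` case; in software, the same processor simply does something else.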

Reuse of memory

Another area with lots of opportunities for silicon reuse is the memories. Many hard-wired designs consist of several building blocks, each with their own memories. In imaging and deep learning these are often line buffers or tiles that are distributed throughout the design and not reused between the different building blocks, since each building block has its own demands on size, width, and bandwidth. When using a processor, there's a natural memory hierarchy with on-chip and often off-chip memories. The memory allocation, figuring out which data goes where, is done in software, and often even performed by the compiler. Instead of doing this task once, on paper, at design time, as when designing hardware, this task is done in software, partially automated by the compiler, and can be adjusted and optimized over time. As a result, there's much more reuse of memory in a processor-based approach, resulting again in lower-power and smaller designs.
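A minimal sketch of the idea (illustrative only): in software, successive pipeline stages can time-share one scratch buffer, where a hard-wired pipeline would typically give each stage its own dedicated memory.

```python
# Toy sketch: one shared scratch buffer reused by successive pipeline
# stages, instead of one dedicated memory per stage.

TILE = 64
scratch = bytearray(TILE)  # a single on-chip tile, sized for the worst case

def stage_a(buf):
    # e.g. filtering: writes its result into the shared buffer
    for i in range(len(buf)):
        buf[i] = i % 256

def stage_b(buf):
    # e.g. thresholding: reuses the very same bytes once stage A is done
    for i in range(len(buf)):
        buf[i] = 255 if buf[i] > 31 else 0

stage_a(scratch)  # stage A finishes with the buffer...
stage_b(scratch)  # ...so stage B reuses it: one memory instead of two
```

In this toy the allocation decision (both stages share `scratch`) is a line of code that can be changed at any time, whereas in a hard-wired design the equivalent decision is frozen into the silicon.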

Wheel of reincarnation

There's a seminal paper, written over 50 years ago by Myer and Sutherland, "On the Design of Display Processors", which describes an interesting observation about the nature of hardware design: the authors slowly realized they were incorporating CPUs into their designs to speed things up:

Gradually the processor became more complex. We were not disturbed by this because computer graphics, after all, are complex. Finally the display processor came to resemble a full-fledged computer with some special graphics features. And then a strange thing happened. We felt compelled to add to the processor a second, subsidiary processor, which, itself, began to grow in complexity. It was then that we discovered a disturbing truth. Designing a display processor can become a never-ending cyclical process. In fact, we found the process so frustrating that we have come to call it the “wheel of reincarnation.”

It was not until they had traveled around the wheel several times that they realized what was happening. Once they did, they tried to view the whole problem from a broader perspective. It's a good example of how designers were already making tradeoffs between hard-wired logic and processors 50 years ago.

Optimizing at different levels of abstraction

Design optimization happens at different levels. At the higher levels, the application architecture is chosen and the algorithms and data structures are selected. The lower levels consist of designing circuits and selecting instructions and operations. The latter optimizations are platform dependent, while changes at the higher architecture and algorithm levels are platform independent.

The optimizations at higher levels of abstraction typically have a bigger impact on performance than the lower-level implementation choices. For instance, changing an algorithm from O(n²) to O(n) in time or space can have a huge impact on performance. Since hard-wired circuit design happens at a lower level of abstraction, there's usually little effort spent on algorithm-level optimizations. Implementations in software running on a processor are performed at a higher level, giving the flexibility to apply a wider variety of optimizations. Again, this change in focus relatively lowers the power and increases the performance of processor-based implementations.
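A classic textbook example of such an algorithm-level change (sketched here in Python; the task and function names are our own illustration) is answering "does any pair in a list sum to a target?" with nested loops versus a single pass with a set:

```python
# Same task, two complexities: does any pair in `xs` sum to `target`?

def has_pair_quadratic(xs, target):
    # O(n^2): compare every pair of elements
    return any(xs[i] + xs[j] == target
               for i in range(len(xs))
               for j in range(i + 1, len(xs)))

def has_pair_linear(xs, target):
    # O(n): one pass, remembering values seen so far in a set
    seen = set()
    for x in xs:
        if target - x in seen:
            return True
        seen.add(x)
    return False

# For 10,000 elements, the linear version does ~10,000 set lookups
# where the quadratic one may compare ~50 million pairs.
```

A change like this is a few lines of code in software; in a hard-wired design it would mean re-architecting the circuit, which is why such optimizations are rarely revisited there.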

Development time

Another aspect is development time, which is always scarce. It's more labor intensive to design hardware circuits than to develop software. Therefore, hardware designers have started to use tools like high-level synthesis, which generate circuits from C/C++ source files. This lets designers get more work done, though at the expense of less efficient implementations.

With a processor-based approach, there’s optimization done at two levels: the processor itself can be optimized over time, while keeping the ISA stable, allowing the same software to keep running on the continuously refined, faster, smaller, and lower power implementations of the same processor. In addition, there’s also more time available to optimize the software, since it takes less time to implement software compared to developing hardware circuits.

In short, since it’s much faster to develop software than to develop hardware, there’s more time left for optimization. Hard-wired implementations of visual computing algorithms and applications are often bloated and are heavily redesigned over the course of many years, from chip generation to chip generation.

Conclusion

Yes, hard-wired logic circuits can be much smaller than a processor-based approach. If, for instance, you simply need to multiply an incoming 16-bit signal by a constant, a hard-wired approach will be much smaller than a processor with its generic multipliers, instruction memory, and instruction decoders, which can be seen as pure overhead. As soon as the application is a bit more complex though, for instance a system that runs different deep learning networks, full video coding, or computer vision stacks, then a carefully crafted processor optimized for visual computing can result in a design that provides more performance per watt and more performance per silicon dollar. We've been licensing our processors for over 15 years into applications that are very cost and power sensitive, and we regularly beat hard-wired accelerators on these key metrics while remaining software programmable, which is crucial in the field of visual computing, where algorithms and applications are still rapidly changing.

30/04/2019 / Marco Jacobs