Stanford Researchers Build AI Directly Into Camera Optics

Until recently, cameras were exclusively designed to create images for humans — for fun, for art, and for documenting history. With the rapid growth of robots as well as various other kinds of machines and vehicles that need to observe and learn from their environment, many cameras are dedicated to machine vision tasks. Some of the most visible of those, like autonomous vehicles, rely heavily on object recognition, which almost universally means neural networks trained on commonly found objects. One limitation on the deployment of machine vision in many embedded systems, including electric vehicles, is the necessary compute and electrical power. So it makes sense to re-imagine camera design and consider what is the ideal camera architecture for a particular application, rather than simply repurposing existing camera models.

In this spirit, a team at Stanford University led by Assistant Professor Gordon Wetzstein and graduate student Julie Chang has built a prototype of a system that moves the first layer of an object recognition neural network directly into the camera’s optics. This means that the first portion of the needed inferencing takes essentially no time and no power. While their current prototype is limited and bulky, it points the way for some novel approaches to creating lower-power, higher-performance, inferencing solutions in IoT, vehicle, and other embedded applications. The research draws heavily from AI, imaging, and optics, so there isn’t any way we can detail the entire system in one article. But we’ll take you through the highlights and some of the breakthroughs that make the prototype so intriguing.

Basic Object Recognition, Neural Network Style

Most current object recognition systems use a multi-layer neural network. State-of-the-art systems often include dozens of layers, but it is possible to address simple test suites like MNIST, Google’s QuickDraw, and CIFAR-10 with only a layer or two. However deep the network, the first layer or layers are typically convolution layers. Convolution is the process of passing a small matrix (called a kernel) over the image, multiplying it element by element with the pixels beneath it at each location, and summing the result to create an activation matrix. In simple terms, the process highlights areas of the image that are similar to the kernel’s pattern. Typical systems involve multiple kernels, each reflecting a feature found in the objects being studied. As the network is trained, those kernels should start to look like those features, so the resulting activation maps will help later levels of the network recognize particular objects that include various examples of the features.
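To make the sliding-kernel idea concrete, here is a minimal sketch in Python with NumPy. The toy image, the toy kernel, and the function name slide_kernel are invented for illustration and are not taken from the Stanford work. Like most neural network libraries, the sketch does not flip the kernel before sliding it.

```python
import numpy as np

def slide_kernel(image, kernel):
    """Slide `kernel` over `image`; multiply element-wise and sum at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    activation = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]
            activation[r, c] = np.sum(patch * kernel)
    return activation

# Toy 6x6 image (hypothetical) containing a single vertical bar in column 2.
image = np.zeros((6, 6))
image[:, 2] = 1.0

# Toy 3x3 kernel (hypothetical) shaped like a short vertical bar.
kernel = np.array([[0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])

activation = slide_kernel(image, kernel)
print(activation)
```

The activation map is largest where the kernel is centered on the bar and zero elsewhere, which is the "highlighting" behavior described above.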

Later layers of the network are often fully connected, which are simpler to compute than convolution layers. The Stanford hybrid optical-digital camera doesn’t address those but instead models replacing the computationally expensive initial convolution layer with an optical alternative, which the team refers to as an opt-conv layer. There isn’t any convenient way with traditional optics to perform a convolution, let alone multiple convolutions, on an image. However, if the image is first turned into its frequency equivalent using a Fourier transform, fast convolution suddenly becomes possible, because multiplying two signals point by point in the frequency domain is equivalent to convolving them in the traditional spatial domain.
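That equivalence (the convolution theorem) is easy to check numerically. The sketch below, in Python with NumPy, compares a direct circular convolution with one computed by multiplying Fourier transforms. The random image and kernel and the function names are made up for the test, and the wrap-around (circular) boundary is an assumption of this sketch rather than a property of the Stanford system.

```python
import numpy as np

def circular_convolve_direct(image, kernel):
    """Direct circular convolution: add up shifted, weighted copies of the image."""
    out = np.zeros_like(image, dtype=float)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out += kernel[i, j] * np.roll(image, shift=(i, j), axis=(0, 1))
    return out

def circular_convolve_fft(image, kernel):
    """Same result via the frequency domain: transform, multiply, transform back."""
    kernel_padded = np.zeros_like(image, dtype=float)
    kernel_padded[:kernel.shape[0], :kernel.shape[1]] = kernel
    product = np.fft.fft2(image) * np.fft.fft2(kernel_padded)
    return np.real(np.fft.ifft2(product))

rng = np.random.default_rng(0)
image = rng.random((8, 8))    # hypothetical test image
kernel = rng.random((3, 3))   # hypothetical test kernel

direct = circular_convolve_direct(image, kernel)
via_fft = circular_convolve_fft(image, kernel)
max_difference = np.max(np.abs(direct - via_fft))
print(max_difference)  # differs only by floating-point rounding
```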

To take advantage of this property, the team draws from the techniques of Fourier Optics, by building what is called a 4f optical system. A 4f system relies on an initial lens to render the Fourier transform of the image. The system allows for processing the transformed image using an intermediate filter or filters, and then reverses the transform with a final lens and renders the modified result.
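As a rough digital analogy, the three stages of a 4f system can be sketched in a few lines of Python with NumPy. This is an idealized model, not the team's simulation: the function name four_f and both example filters are hypothetical, and effects such as wavelength, aperture size, and lens aberrations are ignored.

```python
import numpy as np

def four_f(image, fourier_plane_filter):
    """Idealized 4f pipeline: transform, filter in the Fourier plane, transform back."""
    spectrum = np.fft.fft2(image)                # first lens: Fourier transform
    filtered = spectrum * fourier_plane_filter   # mask placed in the Fourier plane
    # A physical second lens applies another forward transform, so the real
    # output appears flipped; the inverse FFT used here ignores that flip.
    return np.real(np.fft.ifft2(filtered))

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # hypothetical test image

# A filter that passes every frequency returns the image unchanged.
pass_all = np.ones((8, 8))
unchanged = four_f(image, pass_all)

# A filter that keeps only the zero-frequency term leaves just the mean brightness.
dc_only = np.zeros((8, 8))
dc_only[0, 0] = 1.0
flattened = four_f(image, dc_only)

print(np.allclose(unchanged, image), np.allclose(flattened, image.mean()))
```

Any other choice of filter reshapes the image in between those two extremes, which is what makes the Fourier plane a natural place to put learned weights.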

Fourier Optical system implemented in a 4f telescope including a phase mask to implement image convolution

The Magic of Optically Computing a Convolution Layer

There is a lot of pretty deep science that goes into the optical portion of Stanford’s prototype, but it basically chains together a few powerful techniques that we can describe (if not fully explain) fairly succinctly:

First, it is a well-known feature of a Fourier transform (which takes a signal or image and renders it in terms of frequencies) that you can also reverse it and get the original image back. Importantly, you can do this using a simple optical system with just a couple of lenses, called a 4f optical system (this whole area of optics is called Fourier Optics).

Second, if you filter the Fourier transform of an image by passing it through a partially opaque surface, that is the same as performing a convolution.

Third, you can tile multiple kernels into a single filter and apply them to a padded version of the original image. This mimics the behavior of a multiple kernel system that would normally produce a multi-channel output by creating one that outputs a tiled equivalent as shown here:

The multi-channel output of a traditional convolution layer can be mimicked using tiling in an optical system

So once the desired kernels have been calculated using traditional machine learning techniques, they can be used to create a custom filter, in the form of a phase mask of varying thickness, that can be added to the middle of the 4f system to instantly perform the convolutions as the light passes through the device.
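The tiling trick can be checked numerically. The following Python/NumPy sketch places four made-up 3x3 kernels on a 2x2 grid inside one large kernel, convolves a zero-padded image with it once, and confirms that each tile of the result matches a separate convolution with the corresponding kernel. The sizes, the grid layout, and the helper name fft_convolve_full are assumptions chosen for illustration, not the prototype's actual dimensions.

```python
import numpy as np

def fft_convolve_full(a, b, out_shape):
    """Linear convolution via zero-padded FFTs; out_shape must fit the full result."""
    spectrum = np.fft.fft2(a, s=out_shape) * np.fft.fft2(b, s=out_shape)
    return np.real(np.fft.ifft2(spectrum))

rng = np.random.default_rng(0)
image = rng.random((8, 8))        # hypothetical input image
kernels = rng.random((4, 3, 3))   # four hypothetical 3x3 kernels

ksize = kernels.shape[1]
tile = image.shape[0] + ksize - 1   # size of one full convolution result (10)
grid = 2                            # kernels arranged on a 2x2 grid
canvas = grid * tile                # size of the tiled output (20)

# Build one large kernel with the small kernels spaced one tile apart.
tiled_kernel = np.zeros((canvas, canvas))
for index, kernel in enumerate(kernels):
    row, col = divmod(index, grid)
    tiled_kernel[row * tile:row * tile + ksize,
                 col * tile:col * tile + ksize] = kernel

# A single convolution of the zero-padded image with the tiled kernel...
tiled_output = fft_convolve_full(image, tiled_kernel, (canvas, canvas))

# ...contains one tile per kernel, equal to the separate per-kernel convolutions.
matches = []
for index, kernel in enumerate(kernels):
    row, col = divmod(index, grid)
    block = tiled_output[row * tile:(row + 1) * tile,
                         col * tile:(col + 1) * tile]
    separate = fft_convolve_full(image, kernel, (tile, tile))
    matches.append(np.allclose(block, separate))

print(matches)
```

Spacing the kernels one full output tile apart is what keeps the per-kernel results from overlapping in the combined output.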

Training and Implementing the Optical Convolution Layer

One limitation of the proposed optical system is that the hardware filter has to be fabricated based on the trained weights, so it isn’t practical to use the system to train itself. Instead, training is done using a simulation of the system. Once the final weights are determined, they are used to fabricate a phase mask (a filter with varying thickness that alters the phase of the light passing through it) with 16 possible values, which can be placed in-line with the 4f optical pipeline.
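To illustrate what a 16-value limit means in practice, the Python/NumPy sketch below snaps an arbitrary phase profile to 16 levels and converts each level to a thickness. Several things here are assumptions and not details from the article: the levels are taken to be evenly spaced over one full wave, the random profile stands in for a trained mask, and the wavelength and refractive index are placeholder values. The thickness conversion uses the standard thin-element relation, phase = 2π(n − 1)t/λ.

```python
import numpy as np

LEVELS = 16  # number of distinct mask values mentioned in the article

def quantize_phase(phase, levels=LEVELS):
    """Snap phases (radians) to `levels` evenly spaced values in [0, 2*pi)."""
    step = 2 * np.pi / levels
    wrapped = np.mod(phase, 2 * np.pi)
    indices = np.round(wrapped / step).astype(int) % levels
    return indices * step

rng = np.random.default_rng(0)
ideal_phase = rng.uniform(0, 2 * np.pi, size=(32, 32))  # stand-in for a trained mask

mask_phase = quantize_phase(ideal_phase)

# Phase error introduced by quantization, wrapped into (-pi, pi].
error = np.angle(np.exp(1j * (mask_phase - ideal_phase)))

# Thin-element conversion from phase delay to material thickness.
wavelength = 532e-9      # meters; hypothetical placeholder value
refractive_index = 1.5   # hypothetical placeholder value
thickness = mask_phase / (2 * np.pi) * wavelength / (refractive_index - 1)

print(len(np.unique(mask_phase)), np.max(np.abs(error)))
```

Under these assumptions, the worst-case phase error is half a step, or π/16 radians, which gives a feel for how closely a 16-level mask can follow the trained weights.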

The learned weights are used to create a mask template, which is then fabricated into a mask of varying thickness

Evaluating Performance of the Hybrid Optical-Electronic Camera System

The Stanford team evaluated the performance of their solution both in simulation and with their physical prototype. They tested it both as a standalone optical correlator, using Google’s QuickDraw dataset, and as the first layer of a two-layer neural network, in which it was combined with a fully connected layer to do basic object recognition using the CIFAR-10 dataset. As a correlator, the system achieved accuracy between 70 percent and 80 percent, even with the optical system’s restriction that all weights must be non-negative. That’s similar to the accuracy of a more traditional convolutional layer created using standard machine learning techniques, but without needing powered computing elements to perform the convolutions. Similarly, the two-layer solution using a hybrid optical-electronic first layer achieved an accuracy of about 50 percent on CIFAR-10, about the same as a traditional two-layer network, but with a tiny fraction of the computing power (and therefore electrical power) of the typical solution.

While the current prototype is bulky, requires a monochrome light source, and works only with grayscale images, the team has already started thinking about how to extend it to work under more typical lighting conditions and with full-color images. Similarly, the 4f system itself could potentially be reduced in size by using flat diffractive optical elements to replace the current lenses.

To learn more, you can read the team’s full paper in Nature’s Scientific Reports. The team has also said that they’ll be making the full source code for their system publicly available.