IBM Aims To Reduce Power Needed For Neural Net Training By 100x


With the rush to incorporate AI into nearly everything, the demand for computing and electrical power is insatiable. As a result, the power-hungry GPUs used today are starting to give way to lower-cost, lower-power custom devices for running production-scale trained neural networks. However, the time-consuming training process has been slower to yield to new architectures. IBM Research, which brought TrueNorth — one of the first custom inferencing chips — to life, is aiming to do it again with a hybrid analog-digital chip architecture that can also train fully-connected deep neural networks.

Neural Networks Were Unleashed By the Modern GPU

Digital computer CPUs are almost always built on the von Neumann architecture, and have been since their invention. Data and programs are loaded from some type of memory into a processing unit, and results are written back. Early versions were limited to one operation at a time, but of course we now have multi-core CPUs, multi-threaded cores, and other techniques to achieve some parallelism. In contrast, our brains, which were the original inspiration for neural networks, have billions of neurons that are all capable of doing something at the same time. While they aren’t all working on the same task, there can still be a stunning number of parallel operations going on essentially constantly in our minds.

This total mismatch in architecture is one reason that neural networks floundered for decades after their invention. There wasn’t enough performance, even on the fastest computers, to make them a reality. The invention of the modern GPU changed that. By having hundreds or thousands of very fast, relatively simple cores connected to fast memory, it became practical to train and run the types of neural networks that have many layers (called Deep Neural Networks, or DNNs) and can be used to solve real-world problems effectively.

Custom Silicon For Inferencing Is Now a Proven Technology


IBM’s TrueNorth chip, in contrast to general-purpose GPUs, is built to more directly model the human brain, simulating a million neurons using specialized circuitry. It achieves impressive power savings for inferencing, but isn’t suited to the important task of training networks. Now, IBM researchers think they have found a way to extend the power savings of neuromorphic (brain-like) circuitry similar to that found in TrueNorth, along with some ideas borrowed from resistive computing, to achieve massive power savings in network training.

Resistive Computing May Come Back As an Efficient AI Platform

One of the largest bottlenecks when traditional computers run neural networks is the reading and writing of data. In particular, each node (or neuron) in a neural network needs to store (during training) and retrieve (during both training and inferencing) many weights. Even with fast GPU RAM, fetching them is a bottleneck. So designers have tapped a technology called resistive computing to store the weights directly in the analog circuitry that implements the neuron, taking advantage of the fact that neurons don’t have to be very precise: close is often good enough. When we wrote about IBM’s work in this area in 2016, it was mostly aimed at speeding up inferencing, because of issues inherent in trying to use resistive computing for training. Now one group at IBM thinks it has found a solution to those issues.
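To see why “close is often good enough” can work, here is a minimal numerical sketch: a single neuron’s weighted sum computed once with exact weights and once with weights quantized to a handful of levels and perturbed by noise, standing in for analog conductances. The number of levels and the noise figure below are illustrative assumptions, not measurements from IBM’s devices.

```python
# Sketch: why imprecise analog weights can be "close enough" for a neuron.
# Levels and noise_std are illustrative assumptions, not hardware figures.
import numpy as np

rng = np.random.default_rng(0)

def analog_weights(w, levels=32, noise_std=0.02):
    """Quantize ideal weights to a few conductance levels and add device noise."""
    w_max = np.abs(w).max()
    step = 2 * w_max / (levels - 1)
    quantized = np.round(w / step) * step              # limited precision
    return quantized + rng.normal(0.0, noise_std * w_max, size=w.shape)

w_ideal = rng.normal(size=256)                         # one neuron's incoming weights
x = rng.normal(size=256)                               # input activations

exact = w_ideal @ x
approx = analog_weights(w_ideal) @ x
print(f"exact sum:  {exact:+.3f}")
print(f"analog sum: {approx:+.3f}  (relative error {abs(approx - exact) / abs(exact):.1%})")
```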

The crossbar architecture is modular and also allows for both forward and backward propagation

Hybrid Architecture Aims to Lower AI-Training Power by 100x

The IBM team, writing in the journal Nature, has come up with a hybrid analog plus digital design that aims to address the shortcomings of resistive computing for training. For starters, they have implemented a simulated chip that uses a crossbar architecture, which allows for massively parallel calculation of the output of a neuron based on the sum of all its weighted inputs. Essentially, it’s a hardware implementation of matrix math. Each small crossbar block in the chip can be connected in a variety of ways, so it can model fairly deep or wide networks up to the capacity of the chip — 209,400 synapses in the team’s current simulation version.
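A rough way to picture what the crossbar buys you: the same array of stored weights can perform a layer’s forward pass, its backward pass, and an outer-product weight update as single, massively parallel analog operations. The NumPy sketch below mimics those three operations digitally; the layer sizes and learning rate are arbitrary and only illustrate the data flow, not the paper’s actual network.

```python
# Sketch of what a resistive crossbar computes in parallel: the same weight
# array serves the forward pass, the backward pass, and a rank-one update.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 784, 250                           # one fully-connected layer (illustrative sizes)
W = rng.normal(scale=0.05, size=(n_out, n_in))   # "conductances" at the crosspoints

x = rng.normal(size=n_in)                        # drive rows with input signals
y = W @ x                                        # column outputs = forward pass

delta = rng.normal(size=n_out)                   # error signal from the layer above
grad_x = W.T @ delta                             # drive columns instead = backward pass

lr = 0.01
W += lr * np.outer(delta, x)                     # parallel outer-product weight update
```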

But that doesn’t do any good if all those synapses can’t get the data they need fast enough. Until now, memory used in this type of experimental AI chip has been either very fast but volatile, with limited precision and dynamic range, or slower phase-change memory (PCM) with poor write performance. The team’s proposed design meets both needs with a model similar to the brain’s: it separates short-term and long-term storage for each neuron. Data needed for computation, including all the weights for every synapse of every neuron, is kept in volatile but very fast short-term analog memory. During training, the weights are periodically off-loaded into persistent PCM, which also has a larger capacity, and the short-term weights are then reset so the limited range of the analog memory isn’t overrun.
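The sketch below illustrates that division of labor as I read it: updates accumulate in a fast but range-limited short-term store, and every so often the accumulated value is folded into a coarser, persistent PCM-like store before the short-term store is reset. The ranges, step sizes, and transfer interval here are placeholder assumptions for illustration only.

```python
# Minimal sketch of the two-tier weight idea: fast, range-limited short-term
# storage accumulates updates; every `transfer_every` steps its contents are
# folded into slower, persistent, coarser "long-term" (PCM-like) storage and
# the short-term storage is reset. All constants are illustrative assumptions.
import numpy as np

class TwoTierWeight:
    def __init__(self, shape, short_range=1.0, pcm_step=0.05, transfer_every=100):
        self.short = np.zeros(shape)          # fast, volatile, limited range
        self.long = np.zeros(shape)           # PCM-like: persistent but coarse
        self.short_range = short_range
        self.pcm_step = pcm_step
        self.transfer_every = transfer_every
        self.steps = 0

    def value(self):
        return self.long + self.short

    def update(self, delta_w):
        # Accumulate into short-term storage, clipped to its limited range.
        self.short = np.clip(self.short + delta_w, -self.short_range, self.short_range)
        self.steps += 1
        if self.steps % self.transfer_every == 0:
            # Off-load to PCM in coarse increments, then reset short-term
            # storage so its limited range is never overrun.
            self.long += np.round(self.short / self.pcm_step) * self.pcm_step
            self.short[:] = 0.0

w = TwoTierWeight(shape=(250, 784))
w.update(0.01 * np.ones((250, 784)))          # one training step's weight change
```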

The concept is pretty simple, but the implementation is not. Device physics heavily affect analog circuitry, so the researchers have proposed a series of techniques, including using voltage differentials and swapping polarities periodically, to minimize the errors that could creep into the system during prolonged operation.
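As one way to picture the differential idea (my own simplification for illustration, not the circuit from the paper), a weight can be held as the difference of two non-negative conductances, so that drift common to both devices cancels in the difference and a periodic rebalance keeps each device inside its usable range.

```python
# Illustrative sketch (not the paper's circuit): a weight stored as the
# difference of two non-negative conductances, w = g_plus - g_minus.
class DifferentialWeight:
    def __init__(self, w0=0.0):
        self.g_plus = max(w0, 0.0)
        self.g_minus = max(-w0, 0.0)

    def value(self):
        return self.g_plus - self.g_minus

    def drift(self, amount):
        # Drift shared by both devices cancels in the difference.
        self.g_plus += amount
        self.g_minus += amount

    def rebalance(self):
        # Periodically shift the common component out of both devices
        # so each conductance stays within its usable range.
        common = min(self.g_plus, self.g_minus)
        self.g_plus -= common
        self.g_minus -= common

w = DifferentialWeight(0.3)
w.drift(0.2)                  # shared device drift
print(w.value())              # still 0.3
w.rebalance()                 # keep both conductances small
```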

In Simulation, the Chip Is Competitive With Software At 1/100th the Power


However, since the chip can only run fully-connected layers, like those found at the top of most deep models, there are limits to what it can do. It can handle MNIST (the classic digit-recognition benchmark) essentially unaided, but for image-recognition tasks like CIFAR it needs a pre-trained model to provide the feature-recognition layers. Fortunately, that type of transfer learning (using a pre-trained model for the feature-extraction layers) has become fairly common, so it doesn’t have to be a big stumbling block for the new approach.
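In practice that split looks like ordinary transfer learning. The sketch below, written for conventional hardware with an assumed pre-trained ResNet-18 backbone from torchvision (recent-version API), freezes the feature-extraction layers and trains only the fully-connected head, which is the portion a chip like this would handle. Model choice and layer sizes are illustrative assumptions.

```python
# Sketch of the transfer-learning split: a conventionally pre-trained network
# supplies the feature-extraction layers, and only the fully-connected head
# (the part this kind of chip can train) is updated.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()                 # keep only the feature extractor
for p in backbone.parameters():
    p.requires_grad = False                 # frozen: no training needed here

head = nn.Sequential(                       # fully-connected layers to train
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 10),                     # e.g. 10 CIFAR-10 classes
)
optimizer = torch.optim.SGD(head.parameters(), lr=0.01)

def train_step(images, labels):
    with torch.no_grad():
        features = backbone(images)         # pre-trained feature extraction
    loss = nn.functional.cross_entropy(head(features), labels)
    optimizer.zero_grad()
    loss.backward()                         # gradients flow only through the head
    optimizer.step()
    return loss.item()
```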

Are Hybrid Chips the Future for Neural Networks?

As impressive as these research results are, they come with a lot of very specific device tweaks and compromises. By itself, it’s hard for me to see anything quite this specialized becoming mainstream. What I think is important, and what makes this and the other resistive-computing research worth writing about, is that we have an existence proof of the ultimate neuromorphic computer — the brain — and of how powerful and efficient it is. So it makes sense to keep looking for ways we can learn from it and incorporate those lessons into our computing architectures for AI. Don’t be surprised if someday your GPU features hybrid cores.

[Image credits: Nature Magazine]
