Samsung has announced the availability of a new Aquabolt variation. Unlike the typical clock speed jump or capacity improvement you’d expect, this new HBM-PIM can perform calculations directly on-chip that would otherwise be handled by an attached CPU, GPU, or FPGA.
PIM stands for Processor-in-Memory, and it’s a noteworthy achievement for Samsung to pull this off. Processors currently burn an enormous amount of time and power moving data from one location to another. The less time a CPU spends moving data (or waiting on another chip to deliver data), the more time it can spend performing computationally useful work.
CPU developers have worked around this problem for years by deploying various cache levels and integrating functionality that once lived in its own socket. Both FPUs and memory controllers were once mounted on the motherboard rather than directly integrated into the CPU. Chiplets actually work directly against this aggregation trend, which is why AMD has had to be careful that its Zen 2 and Zen 3 designs could boost overall performance while disaggregating the CPU die.
If bringing the CPU and memory closer together is good, building processing elements directly into memory would be even better. Historically, this has been difficult because logic and DRAM are typically built very differently. Samsung has apparently solved this problem, and it’s leveraged the die-stacking capabilities of HBM to keep available memory density sufficiently high to interest customers. Samsung claims it can deliver a more than 2x performance improvement with a 70 percent power reduction at the same time, with no required hardware or software changes. The company expects validation to be complete by the end of the first half of this year.
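To put those two headline figures together, here is a back-of-the-envelope sketch in Python. It assumes (our assumption, not Samsung's stated methodology) that the 2x figure is throughput and the 70 percent figure is average power draw on the same workload. The function name is ours and purely illustrative.

```python
def energy_per_task_ratio(speedup: float, power_reduction: float) -> float:
    """Return energy per task relative to the baseline (1.0 = no change).

    Energy = power x time. At `speedup` times the throughput, each task
    takes 1/speedup as long; with `power_reduction` (a fraction in [0, 1))
    the part draws (1 - power_reduction) of the baseline power.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not 0 <= power_reduction < 1:
        raise ValueError("power_reduction must be in [0, 1)")
    relative_power = 1.0 - power_reduction
    relative_time = 1.0 / speedup
    return relative_power * relative_time


# Samsung's claimed figures, read as throughput and average power (assumption).
ratio = energy_per_task_ratio(speedup=2.0, power_reduction=0.70)
print(f"Energy per task: {ratio:.2f}x baseline")           # 0.15x
print(f"Performance per watt: {1 / ratio:.1f}x baseline")  # 6.7x
```

Under that reading, each unit of work would cost roughly 15 percent of the baseline energy. If the 70 percent figure instead already refers to energy rather than power, the combined gain would be smaller, so treat this as an upper-end interpretation.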
THG has some details about the new HBM-PIM solution, gleaned from Samsung’s ISSCC presentation this week. The new chip incorporates a Programmable Computing Unit (PCU) clocked at just 300MHz. The host controls the PCU via conventional memory commands and can use it to perform FP16 calculations directly in DRAM. The HBM itself can operate either as normal RAM or in Function-in-Memory (FIM) mode.
Including the PCU reduces the total available memory capacity, which is why the FIMDRAM (that’s another term Samsung is using for this solution) only offers 6GB of capacity per stack instead of the 8GB you’d get with standard HBM2. All of the solutions shown are built on a 20nm DRAM process.
Samsung’s paper describes the design as “Function-In Memory DRAM (FIMDRAM) that integrates a 16-wide single-instruction multiple-data engine within the memory banks and that exploits bank-level parallelism to provide 4× higher processing bandwidth than an off-chip memory solution.”
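To make "16-wide SIMD" and "bank-level parallelism" a little more concrete, the following Python sketch models a 16-lane FP16 multiply-add and applies it independently to data held in separate banks. This is a toy model, not Samsung's design: the function names, the choice of a multiply-add operation, the rounding behavior, and the assumption of one multiply-add per lane per cycle are all ours. Only the lane count, the FP16 data type, and the 300MHz clock come from the reporting above.

```python
import struct

LANES = 16            # SIMD width quoted in Samsung's paper
PCU_CLOCK_HZ = 300e6  # PCU clock reported from the ISSCC presentation


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def simd_multiply_add(a, b, acc):
    """Apply acc[i] + a[i] * b[i] across all 16 lanes, storing FP16 results.

    Illustrative only: the real unit's instruction set and rounding
    behavior are not described in the article.
    """
    if not (len(a) == len(b) == len(acc) == LANES):
        raise ValueError(f"each operand needs exactly {LANES} lanes")
    return [to_fp16(to_fp16(acc[i]) + to_fp16(a[i]) * to_fp16(b[i]))
            for i in range(LANES)]


def peak_flops(num_units: int, ops_per_lane_per_cycle: int = 2) -> float:
    """Peak FP16 rate, ASSUMING one multiply-add (2 ops) per lane per cycle."""
    return num_units * LANES * ops_per_lane_per_cycle * PCU_CLOCK_HZ


# Bank-level parallelism: each (hypothetical) unit works on its own bank's data,
# so no operand has to cross the external memory interface.
banks = [
    {"a": [0.5] * LANES, "b": [2.0] * LANES, "acc": [1.0] * LANES},
    {"a": [1.5] * LANES, "b": [4.0] * LANES, "acc": [0.0] * LANES},
]
results = [simd_multiply_add(bk["a"], bk["b"], bk["acc"]) for bk in banks]
print(results[0][0], results[1][0])  # 2.0 6.0
print(f"{peak_flops(num_units=1) / 1e9:.1f} GFLOPS per unit (assumed)")  # 9.6
```

The number of compute units per stack is left as a parameter because the article does not state it. The point of the sketch is that aggregate throughput scales with the number of banks working at once, which is how a unit clocked at only 300MHz could still offer more processing bandwidth than shipping the same data off-chip.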
One question Samsung hasn’t answered is how it deals with thermal dissipation, a key reason why it’s been historically difficult to build processing logic inside DRAM. This could be doubly difficult with HBM, in which each layer is stacked on top of another. The relatively low clock speed on the PIM may be a way of keeping DRAM cool.
We haven’t seen HBM deployed for CPUs much, Hades Canyon notwithstanding, but multiple high-end GPUs from Nvidia and AMD have tapped HBM/HBM2 as primary memory. It’s not clear if a conventional GPU would benefit from this offload capability, or how such a feature would be integrated with the GPU’s own impressive computational capacity. If Samsung can offer the performance and power improvements it claims to a range of customers, however, we’ll undoubtedly see this new HBM-PIM popping up in products a year or two from now. A 2x performance boost coupled with a 70 percent power consumption decrease is the kind of old-school improvement lithography node transitions used to deliver on a regular basis. It’s not clear if Samsung’s PIM will specifically catch on, but any promise of a classic full-node improvement will draw attention, if nothing else.