Nvidia has long held the pole position in GPGPU computing, particularly in scientific and HPC applications. The company’s long-term investment in CUDA and high-performance computing has won it a number of spots in the supercomputing TOP500 and fueled the growth of its Tesla product line, as well as GPUs like the $3,000 Titan V, a Volta-based graphics card that straddles the line between a consumer and a scientific product. But all may not be well with the Titan V: there are reports that the chip can produce different results from run to run.
That’s the word from The Register, which writes:
One engineer told The Register that when he tried to run identical simulations of an interaction between a protein and enzyme on Nvidia’s Titan V cards, the results varied. After repeated tests on four of the top-of-the-line GPUs, he found two gave numerical errors about 10 per cent of the time. These tests should produce the same output values each time again and again. On previous generations of Nvidia hardware, that generally was the case. On the Titan V, not so, we’re told.
The Reg goes on to note that it also spoke to an “industry veteran,” who speculated that the problem may stem from the GPU’s onboard HBM. That same individual noted that Nvidia had encountered this kind of issue before and been forced to issue patches to address it.
Elsewhere, other communities have noted that the problem could be overblown. Floating point parallel computing is not necessarily deterministic, which is to say it does not automatically yield identical results every single time. If the order of operations is different from run to run, the final result could also be different.
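To see why ordering matters, here is a minimal Python sketch of the effect. It is an illustration only, not the simulation from The Register’s report: the helper `ordered_sum` and the sample values are hypothetical, chosen to make the rounding visible. It adds the same numbers in two different orders, much as parallel threads might when they finish at different times from run to run.

```python
def ordered_sum(values):
    """Add values strictly left to right, one after another."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers, visited in two different orders.
# (Hypothetical values picked so the rounding is easy to see.)
order_a = [1e16, 1.0, -1e16]
order_b = [1e16, -1e16, 1.0]

# In order_a the 1.0 is absorbed when added to 1e16, because doubles
# near 1e16 are spaced 2 apart; in order_b the large terms cancel first.
sum_a = ordered_sum(order_a)
sum_b = ordered_sum(order_b)

# A smaller-scale case: regrouping three ordinary decimals.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)

print(sum_a, sum_b)
print(left_grouped == right_grouped)
```

Neither ordering is “wrong” in the hardware sense. Both follow the rounding rules of floating-point arithmetic, and they simply round at different points. Differences of this kind are usually tiny compared with outright numerical errors, which is part of why the distinction matters here.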
It seems unlikely, however, that scientists and researchers would mistake a known issue (non-deterministic output in parallel FP calculations) for a significant hardware issue. The Reg’s source indicated the Titan V could give incorrect results about 10 percent of the time, but did not provide details on which applications were affected, whether the frequency of the problem varied from application to application, or if it could be impacted by changing various GPU settings.
Right now, we have more questions than answers. The problem, if it exists, might be addressable via a driver update or a code change. It might also reflect a problem with the GPU’s memory subsystem, as The Reg’s source speculates. Some HPC applications have updated their own websites to indicate they are aware of the potential issue and haven’t seen it yet. It’s also possible that the issue is limited to a handful of cards and not indicative of a general problem.
As for Nvidia, the company has told The Reg it is aware of the issue and has invited anyone affected to contact it directly. The Titan V isn’t really positioned as a gaming GPU, but games do not appear to be affected at this time.