Nvidia has long held the pole position in GPGPU computing, particularly in scientific and HPC applications. The company’s long-term investment in CUDA and high-performance computing has won it a number of spots in the supercomputing TOP500 and fueled the growth of its Tesla product line. It has also produced GPUs like the $3,000 Titan V, a Volta-based graphics card that straddles the line between a consumer and a scientific product. But all may not be well with the Titan V: there are reports that the chip can produce different results from run to run.
That’s the word from The Register, which writes:
One engineer told The Register that when he tried to run identical simulations of an interaction between a protein and enzyme on Nvidia’s Titan V cards, the results varied. After repeated tests on four of the top-of-the-line GPUs, he found two gave numerical errors about 10 per cent of the time. These tests should produce the same output values each time again and again. On previous generations of Nvidia hardware, that generally was the case. On the Titan V, not so, we’re told.
The Reg goes on to note that it also spoke to an “industry veteran,” who speculated that the problem may be due to issues with the GPU’s onboard HBM. That same individual noted that Nvidia had encountered this kind of issue before and been forced to issue patches to address it.
Elsewhere, other communities have noted that the problem could be overblown. Floating-point parallel computing is not necessarily deterministic, which is to say it does not automatically yield identical results every single time. Each floating-point operation rounds its result, so addition is not associative: if the order of operations differs from run to run, as it can when many threads combine partial results, the final result can differ too.
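To make the order-of-operations point concrete, here is a minimal sketch in plain Python rather than GPU code. The function name `sequential_sum` and the sample values are our own illustration, not anything taken from The Register's report or from the affected simulation. It adds the same three numbers in two different orders, which is roughly what can happen when parallel threads finish in a different sequence from one run to the next.

```python
def sequential_sum(values):
    """Add the values strictly left to right, rounding after every step."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical data: the same three numbers, arranged in two orders.
order_a = [1e16, 1.0, -1e16]   # the 1.0 is absorbed when added to 1e16
order_b = [1e16, -1e16, 1.0]   # the large values cancel first, so 1.0 survives

result_a = sequential_sum(order_a)
result_b = sequential_sum(order_b)

print(result_a)  # 0.0
print(result_b)  # 1.0

# The effect also shows up with everyday values, just in the last digit.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)
print(left_first == right_first)  # False
```

Differences of this kind are usually tiny relative to the values involved and are expected behavior, not a fault. That is what separates them from the outright numerical errors described in the report.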
It seems unlikely, however, that scientists and researchers would mistake a known issue (non-deterministic output in parallel FP calculations) for a significant hardware issue. The Reg’s source indicated that two of the four Titan V cards tested gave incorrect results about 10 percent of the time, but did not provide details on which applications were affected, whether the frequency of the problem varied from application to application, or whether changing GPU settings made any difference.
Right now, we have more questions than answers. The problem, if it exists, might be addressable via a driver update or a code change. It might also reflect a problem with the GPU’s memory subsystem, as The Reg’s source speculates. The developers of some HPC applications have updated their websites to say they are aware of the potential issue but haven’t seen it themselves. It’s also possible that the issue is limited to a handful of cards and not indicative of a general problem.
As for Nvidia, the company has told The Reg it is aware of the issue and has invited anyone affected to contact it directly. The Titan V isn’t really positioned as a gaming GPU, but games do not appear to be affected at this time.