Nvidia’s Titan V Accused of Returning Wrong Answers in Simulations

Nvidia has long held the pole position in GPGPU computing, particularly in scientific and HPC applications. The company’s long-term investment in CUDA and high-performance computing has won it a number of spots in the supercomputing TOP500 and fueled the growth of its Tesla product line, alongside GPUs like the $3,000 Titan V, a Volta-based graphics card that straddles the line between a consumer and a scientific product. But all may not be well with the Titan V: there are reports that the chip can produce different results from run to run.
That’s the word from The Register, which writes:
One engineer told The Register that when he tried to run identical simulations of an interaction between a protein and enzyme on Nvidia’s Titan V cards, the results varied. After repeated tests on four of the top-of-the-line GPUs, he found two gave numerical errors about 10 per cent of the time. These tests should produce the same output values each time again and again. On previous generations of Nvidia hardware, that generally was the case. On the Titan V, not so, we’re told.
The Reg goes on to note that it also spoke to an “industry veteran,” who speculated that the problem may be due to issues with the GPU’s onboard HBM memory. That same individual noted that Nvidia had encountered this kind of issue before and been forced to issue patches to address it.

Elsewhere, other communities have noted that the problem could be overblown. Floating point parallel computing is not necessarily deterministic, which is to say it does not automatically yield identical results every single time. If the order of operations is different from run to run, the final result could also be different.
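To make the ordering point concrete, here is a minimal Python sketch. It is not taken from The Register's report or from any of the affected simulations; the `ordered_sum` helper and the three sample values are hypothetical, chosen only to show that floating-point addition is not associative. The two index orders stand in for two runs in which parallel threads happened to finish in a different sequence.

```python
def ordered_sum(values, order):
    """Add the values one at a time, visiting them in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Hypothetical inputs: one large value, its negation, and one small value.
values = [1e17, 1.0, -1e17]

# Same three numbers, two different accumulation orders.
run_a = ordered_sum(values, [0, 1, 2])  # (1e17 + 1.0) + -1e17: the 1.0 is rounded away
run_b = ordered_sum(values, [0, 2, 1])  # (1e17 + -1e17) + 1.0: the 1.0 survives

print(run_a, run_b, run_a == run_b)  # prints: 0.0 1.0 False
```

Both answers are legitimate outcomes of ordinary double-precision rounding, so a difference like this does not by itself point to faulty hardware. It also says nothing about whether the Titan V reports fall into this category.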
It seems unlikely, however, that scientists and researchers would mistake a known issue (non-deterministic output in parallel FP calculations) for a significant hardware issue. The Reg’s source indicated the Titan V could give incorrect results about 10 percent of the time, but did not provide details on which applications were affected, whether the frequency of the problem varied from application to application, or if it could be impacted by changing various GPU settings.
Right now, we have more questions than answers. The problem, if it exists, might be addressable via a driver update or a code change. It might also reflect a problem with the GPU’s memory subsystem, as The Reg speculates. The developers of some HPC applications have updated their websites to say they are aware of the potential issue but haven’t seen it yet. It’s also possible that the issue is limited to a handful of cards and not indicative of a general problem.
As for Nvidia, the company has told The Reg it is aware of the issue and has invited anyone affected to contact it directly. The Titan V isn’t really positioned as a gaming GPU, but games do not appear to be affected at this time.