Nvidia has long held the pole position in GPGPU computing, particularly in scientific and HPC applications. The company’s long-term investment in CUDA and high-performance computing has won it a number of spots in the supercomputing TOP500 and fueled the growth of its Tesla product line, as well as GPUs like the $3,000 Titan V, a Volta-based graphics card that straddles the line between a consumer and a scientific product. But all may not be well with the Titan V: there are reports that the chip can produce different results from run to run.
That’s the word from The Register, which writes:
One engineer told The Register that when he tried to run identical simulations of an interaction between a protein and enzyme on Nvidia’s Titan V cards, the results varied. After repeated tests on four of the top-of-the-line GPUs, he found two gave numerical errors about 10 per cent of the time. These tests should produce the same output values each time again and again. On previous generations of Nvidia hardware, that generally was the case. On the Titan V, not so, we’re told.
The Reg goes on to note that it also spoke to an “industry veteran,” who speculated that the problem may lie with the GPU’s onboard HBM memory. That same individual noted that Nvidia had encountered this kind of issue before and been forced to issue patches to address it.
Elsewhere, other communities have noted that the problem could be overblown. Floating point parallel computing is not necessarily deterministic, which is to say it does not automatically yield identical results every single time. If the order of operations is different from run to run, the final result could also be different.
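To make the point about ordering concrete, here is a minimal sketch in plain Python (not GPU code, and not the simulation from The Register’s report). It assumes standard IEEE 754 double-precision arithmetic; the helper name `sum_in_order` and the sample values are hypothetical, chosen only to make the rounding effect easy to see.

```python
# Illustration only: floating-point addition is not associative, so the
# order in which partial sums are combined can change the final result.

def sum_in_order(values, order):
    """Add values[i] for each index i in `order`, one at a time."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# Hypothetical inputs: one large value, one small value, and the negated
# large value. Mathematically the three sum to exactly 1.0.
values = [1e16, 1.0, -1e16]

# Order A: the small value is added to the large one first and is lost
# to rounding before the large values cancel.
result_a = sum_in_order(values, [0, 1, 2])

# Order B: the large values cancel first, so the small value survives.
result_b = sum_in_order(values, [0, 2, 1])

print(result_a, result_b, result_a == result_b)
```

A parallel reduction on a GPU may combine partial sums in an order that depends on thread scheduling, so small run-to-run differences of this kind do not, on their own, indicate faulty hardware.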
It seems unlikely, however, that scientists and researchers would mistake a known issue (non-deterministic output in parallel FP calculations) for a significant hardware issue. The Reg’s source indicated the Titan V could give incorrect results about 10 percent of the time, but did not provide details on which applications were affected, whether the frequency of the problem varied from application to application, or whether it could be affected by changing various GPU settings.
Right now, we have more questions than answers. The problem, if it exists, might be addressable via a driver update or a code change. It might also reflect a problem with the GPU’s memory subsystem, as The Reg’s source speculates. The developers of some HPC applications have updated their websites to say they are aware of the potential issue and haven’t seen it yet. It’s also possible that the issue is limited to a handful of cards and not indicative of a general problem.
As for Nvidia, the company has told The Reg it is aware of the issue and has invited anyone affected to contact it directly. The Titan V isn’t really positioned as a gaming GPU, but games do not appear to be affected at this time.