While I am no fan of NVIDIA, this headline seems somewhat disingenuous in making it sound like NVIDIA’s fault. They aren’t the ones making the memory chips.
The researchers’ proof-of-concept exploit was able to tamper with deep neural network models used in machine learning for things like autonomous driving, healthcare applications, and medical imaging for analyzing MRI scans. GPUHammer flips a single bit in the exponent of a model weight, for example in y, where a floating-point number is represented as x times 2^y. The single bit flip can increase the exponent value by 16. The result is an altering of the model weight by a whopping factor of 2^16, degrading model accuracy from 80 percent to 0.1 percent, said Gururaj Saileshwar, an assistant professor at the University of Toronto and co-author of an academic paper demonstrating the attack.
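To make the arithmetic concrete, here is a minimal sketch of such a bit flip. It assumes the weight is stored as IEEE 754 half precision (FP16), where the top exponent bit is worth 16; the quoted text does not say which format the models used, and the helper name flip_bit_fp16 is made up for illustration.

```python
import struct

def flip_bit_fp16(value: float, bit: int) -> float:
    """Flip one bit in the FP16 encoding of value (hypothetical helper)."""
    # Reinterpret the half-precision float as a 16-bit unsigned integer.
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    bits ^= 1 << bit  # toggle the chosen bit
    # Reinterpret the modified bit pattern as a half-precision float again.
    (flipped,) = struct.unpack("<e", struct.pack("<H", bits))
    return flipped

# FP16 layout (assumed): bit 15 sign, bits 14-10 exponent, bits 9-0 mantissa.
# Bit 14 is the most significant exponent bit, so setting it adds 16 to y.
weight = 0.5
corrupted = flip_bit_fp16(weight, 14)
print(weight, corrupted, corrupted / weight)  # 0.5 32768.0 65536.0
```

A weight of 0.5 turns into 32768, which is 2^16 times larger, from one flipped bit.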
Rowhammer attacks present a threat to memory inside the typical laptop or desktop computer in a home or office, but most Rowhammer research in recent years has focused on the threat inside cloud environments. That’s because these environments often allot the same physical CPU or GPU to multiple users. A malicious attacker can run Rowhammer code on a cloud instance that has the potential to tamper with the data a CPU or GPU is processing on behalf of a different cloud customer. Saileshwar said that Amazon Web Services and smaller providers such as Runpod and Lambda Cloud all provide A6000 instances. (He added that AWS enables a defense that prevents GPUHammer from working.)
Well, if you can afford twice the computation cost, you can run a computation twice to check that the results match, and re-run if they differ. I suspect that corrupting GPU memory in a reproducible way is going to be a lot harder, so defeating that check should be pretty hard. That won’t require hardware changes.
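A rough sketch of that run-twice check, in plain Python rather than real GPU code. The names run_checked and flaky_sum are invented for the example, and it assumes the two runs do not share corrupted state: a flipped weight that stays in memory would give the same wrong answer both times and slip through.

```python
def run_checked(compute, *args, max_attempts=3):
    """Run compute twice per attempt and accept the result only if both agree."""
    for _ in range(max_attempts):
        first = compute(*args)
        second = compute(*args)
        if first == second:
            return first
    raise RuntimeError("results never agreed; giving up")

# Stand-in for a computation hit by a one-off memory fault (simulated).
calls = {"n": 0}

def flaky_sum(values):
    calls["n"] += 1
    total = sum(values)
    # Pretend the very first run was corrupted by a bit flip.
    return total + 65536 if calls["n"] == 1 else total

result = run_checked(flaky_sum, [1, 2, 3])
print(result, calls["n"])  # 6 4
```

On a real GPU the comparison would probably need a tolerance or deterministic kernels, since floating-point results are not always bit-identical between runs.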
I would even go as far as speculating that Nvidia is not going to bother with hardware changes, especially considering that AWS (and other cloud providers?) already has mitigation approaches.