

The problem is that the deprecation/obsolescence/lifetime cycles of GPUs are WAY more rapid than anyone in the “AI” circlejerk bubble is willing to admit. Aside from the generational upgrades you tend to see in GPUs, which make older models far less valuable as an investment, server hardware simply cannot run at peak load indefinitely - and running GPUs at peak load constantly MASSIVELY shortens the MTBF.
TL;DR: the way GPUs are used in ML applications means that they tend to cook themselves WAY quicker than the GPU you have in your gaming machine or console - as in, they often have a couple of years of lifetime, max, and the failures follow a bell curve.

MTBF is absolutely not six years if you’re running your H100 nodes at peak load and heat soaking the shit out of them. ML workloads are especially hard on GPU RAM, and sustained heat load on that component is known to degrade performance and integrity.
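For a sense of what those MTBF numbers would mean across a fleet, here's a rough back-of-envelope sketch. It assumes a constant failure rate (the textbook exponential model), which is a simplification and does not capture wear-out from heat. The 2-year figure is just a hypothetical stand-in for "a couple of years", not a measured value, and `failure_probability` is a helper name I made up for illustration.

```python
import math


def failure_probability(years: float, mtbf_years: float) -> float:
    """Chance that a unit has failed by `years`.

    ASSUMPTION: constant failure rate (exponential model), so
    P(fail by t) = 1 - exp(-t / MTBF). Real hardware under sustained
    heat load may wear out faster than this suggests.
    """
    return 1.0 - math.exp(-years / mtbf_years)


# 6.0 is the claimed MTBF; 2.0 is a hypothetical "couple of years" case.
for mtbf in (6.0, 2.0):
    p = failure_probability(1.0, mtbf)
    print(f"MTBF {mtbf:.0f}y: {p:.1%} of units expected to fail in year one")
```

Even under this generous model, MTBF is not a guaranteed lifetime: by the time the fleet's age equals the MTBF, well over half of the units are expected to have failed.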
As to Meta’s (or MS, or OpenAI, or what have you) doc on MTBF: I don’t really trust them on that, because they’re a big player in the “AI” bubble, so of course they’d want to give the impression that the hardware they’re using in their data centers still has a bunch of useful life left. That has a direct impact on their balance sheet. If they can misrepresent extremely expensive components that they have a shitload of as still being worth a lot, instead of essentially being salvage/parts only, I would absolutely expect them to do that. Especially in the regulatory environment in which we now exist.