I had an epiphany with this GFN (GeForce NOW) push.
Nvidia’s H100s/B200s, the chips they mostly make for AI, can’t game.
But they also repurpose some 4090/5090-class silicon as budget “inference” AI cards, like the L40 and such.
My observation is that these “inference” cards aged like milk. No business wants them at scale; they’re just crap at running big MoEs (too little VRAM per card, no NVLink, and GDDR bandwidth nowhere near HBM), to the extent that Nvidia was doling out contracts like “Okay, if you buy this many H100s, you have to take some L40s too”.
They have piles of these “low end” AI cards no one wants, so what do they do? Use them for game streaming.
Not trying to defend Nvidia’s predatory practices, but this does make a lot of sense. They’re essentially overstocked with cloud-only gaming GPUs already.
This isn’t a bad theory. Just to add some color, one of these “low-end” cards goes for $7500 USD right now. I certainly want one, but not for a penny over $2k.
If you can make a useful MoE setup where each expert’s output layer is small, so you don’t need to move much data between cards, then running each expert on a different card might be viable, regardless of whether the GPU vendor wants to segment the gaming and AI markets.
I think that’s one of the biggest unknowns as to where AI may wind up going. If you can get good results on gaming cards, then suddenly ordinary gaming hardware, run in parallel, may be quite capable of running the important models, and it’s going to be much harder for OpenAI or similar to maintain much of a barrier to entry. That may have a dramatic impact on who has what degree of access to AI.
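To make that concrete, here’s a minimal sketch of the expert-per-card idea (PyTorch; the class, sizes, and top-1 routing are my own toy choices, not any particular model’s):

    # Minimal toy sketch (my own, not any real model): one expert per GPU, so only
    # per-token activations ever cross devices, never the expert weights themselves.
    import torch
    import torch.nn as nn

    class PerDeviceExpertMoE(nn.Module):
        def __init__(self, d_model=4096, d_ff=11008, n_experts=8):
            super().__init__()
            # Router lives on the "home" GPU alongside the incoming activations.
            self.router = nn.Linear(d_model, n_experts).to("cuda:0")
            # Each expert's weights are parked permanently on their own GPU.
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                              nn.Linear(d_ff, d_model)).to(f"cuda:{i}")
                for i in range(n_experts)
            ])

        def forward(self, x):                        # x: [tokens, d_model] on cuda:0
            chosen = self.router(x).argmax(dim=-1)   # top-1 routing to keep it simple
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = chosen == i
                if mask.any():
                    # Only the selected tokens' hidden vectors hop over PCIe and back.
                    out[mask] = expert(x[mask].to(f"cuda:{i}")).to(x.device)
            return out

    # e.g. moe = PerDeviceExpertMoE(); y = moe(torch.randn(16, 4096, device="cuda:0"))

The point is what crosses PCIe: a few kilobytes of activations per token, not the gigabytes of expert weights parked on each card.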
Kinda already done:
https://arxiv.org/abs/2504.07866
Huawei’s model splits the experts into 8 groups, routed so that each group always has the same number of experts active. This means that (on an 8-NPU server) inter-device communication is minimized and the load is balanced.
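A rough sketch of that grouped routing as I understand it (my own simplification in PyTorch, not the paper’s code): partition the experts into 8 groups, one per device, and take the same top-k inside every group, so every device activates the same number of experts for every token.

    # Hypothetical sketch of group-balanced routing: experts are split into
    # n_groups (one per NPU/GPU) and every token activates exactly k_per_group
    # experts in each group, so device load is identical by construction.
    import torch

    def grouped_topk_routing(scores, n_groups=8, k_per_group=1):
        """scores: [tokens, n_experts] router logits, n_experts divisible by n_groups."""
        tokens, n_experts = scores.shape
        group_size = n_experts // n_groups
        grouped = scores.view(tokens, n_groups, group_size)   # [tokens, groups, experts/group]
        weights = grouped.softmax(dim=-1)
        topw, topi = weights.topk(k_per_group, dim=-1)        # top-k *within* each group
        offsets = torch.arange(n_groups, device=scores.device).view(1, n_groups, 1) * group_size
        return topw, topi + offsets                           # weights and global expert ids

    # e.g. 64 experts over 8 devices, 1 active expert per device per token:
    w, ids = grouped_topk_routing(torch.randn(4, 64), n_groups=8, k_per_group=1)

Because each token always touches exactly k_per_group experts on every device, no device becomes a hot spot and the cross-device traffic per token is fixed.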
There’s another big MoE (ERNIE? Don’t quote me) that ships with native 2-bit QAT, too. It’s explicitly made to cram into 8 gaming GPUs.
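A quick back-of-envelope on why 2-bit weights are what make the “8 gaming GPUs” claim plausible (my numbers, assuming a roughly 300B-parameter MoE; exact sizes vary):

    # Back-of-envelope (assumes a ~300B-parameter MoE; exact sizes vary by model):
    params = 300e9
    total_gb = params * 2 / 8 / 1e9   # 2 bits per weight -> ~75 GB of weights
    per_gpu_gb = total_gb / 8         # split across 8 cards -> ~9.4 GB each
    print(f"~{total_gb:.0f} GB total, ~{per_gpu_gb:.1f} GB per GPU, before KV cache and activations")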
I mean, I can run GLM 4.6 (355B) at 7 tokens/sec on a single 3090 + a Ryzen CPU, with decent token-level agreement with the full model. Most people can run GLM Air and replace base-tier ChatGPT.
Some businesses are already serving models split across cheap GPUs. It can be done, but it’s not turnkey the way it is for NVLink-connected HBM cards.
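For a sense of what “not turnkey” means: with an engine like vLLM the serving code itself is only a few lines; the hard parts are finding a quantized checkpoint that actually fits in the cards’ combined VRAM and living with PCIe instead of NVLink between them. (The model id below is just a placeholder.)

    # Hypothetical sketch: tensor-parallel serving across 4 consumer GPUs with vLLM.
    # The checkpoint name is a placeholder; pick a quantized model that fits in aggregate VRAM.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="some-org/some-quantized-moe",  # placeholder
        tensor_parallel_size=4,               # shard weights across 4 GPUs (works over PCIe)
        gpu_memory_utilization=0.90,
    )
    outputs = llm.generate(
        ["Explain mixture-of-experts routing in one paragraph."],
        SamplingParams(max_tokens=128),
    )
    print(outputs[0].outputs[0].text)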
Honestly the only thing keeping OpenAI in place is name recognition, a timing lead, SEO/convenience and… hype. Basically inertia + anticompetitiveness. The tech to displace them is there, it’s just inaccessible and unknown.
Yeah that makes a fuck ton of sense imo.
“We have all this inventory, it isn’t moving, we need to hit delusional profit margins, how do we do that?”
Yep, a massive revamp/expansion of streamed game rendering as a service is a pretty solid answer to that question.
Yeah. They’re like aging potatoes in a barn they need to use for something, and it’s an “answer” to the gaming GPU shortage without reallocating any silicon supply.