| GPU | VRAM | Price (€) | Bandwidth (TB/s) | TFLOP16 | €/GB | €/TB/s | €/TFLOP16 |
|---|---|---|---|---|---|---|---|
| NVIDIA H200 NVL | 141GB | 36284 | 4.89 | 1671 | 257 | 7423 | 21 |
| NVIDIA RTX PRO 6000 Blackwell | 96GB | 8450 | 1.79 | 126.0 | 88 | 4720 | 67 |
| NVIDIA RTX 5090 | 32GB | 2299 | 1.79 | 104.8 | 71 | 1284 | 22 |
| AMD RADEON 9070XT | 16GB | 665 | 0.6446 | 97.32 | 41 | 1031 | 7 |
| AMD RADEON 9070 | 16GB | 619 | 0.6446 | 72.25 | 38 | 960 | 8.5 |
| AMD RADEON 9060XT | 16GB | 382 | 0.3223 | 51.28 | 23 | 1186 | 7.45 |

This post is part “hear me out” and part asking for advice.

Looking at the table above, AI GPUs are a pure scam, and it would make much more sense (at least going by these numbers) to use gaming GPUs instead, either through a Frankenstein of PCIe switches or a high-bandwidth network.

So my question is whether somebody has built a similar setup and what their experience has been. What is the expected overhead and performance hit, and can it be made up for by just having way more raw performance for the same price?
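To put "way more raw performance for the same price" into numbers, here is a rough sketch that spends the H200 NVL's price on each card from the table and sums the nominal figures. The dictionary keys are shortened card names and `cluster_for_budget` is a made-up helper, not from any library. It assumes cards only (no hosts, PCIe switches, NICs or power) and zero interconnect overhead, so the totals are an upper bound rather than a prediction.

```python
# Figures copied from the table above: (VRAM GB, price EUR, bandwidth TB/s, FP16 TFLOPS).
GPUS = {
    "H200 NVL": (141, 36284, 4.89, 1671),
    "RTX PRO 6000 Blackwell": (96, 8450, 1.79, 126.0),
    "RTX 5090": (32, 2299, 1.79, 104.8),
    "RX 9070 XT": (16, 665, 0.6446, 97.32),
    "RX 9070": (16, 619, 0.6446, 72.25),
    "RX 9060 XT": (16, 382, 0.3223, 51.28),
}


def cluster_for_budget(name, budget_eur):
    """Return (card count, total VRAM GB, total TB/s, total TFLOPS) for a budget.

    Assumption: only the cards are paid for, and the totals are plain sums,
    i.e. what you would get if scaling across cards were perfect.
    """
    vram, price, bandwidth, tflops = GPUS[name]
    count = int(budget_eur // price)
    return count, count * vram, count * bandwidth, count * tflops


budget = GPUS["H200 NVL"][1]  # price of one H200 NVL
for name in GPUS:
    count, vram, bandwidth, tflops = cluster_for_budget(name, budget)
    print(f"{name:24s} x{count:3d}  {vram:5d} GB  {bandwidth:6.2f} TB/s  {tflops:8.1f} TFLOPS")
```

Whatever fraction of those summed figures survives the interconnect is the number that actually has to beat a single H200.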

  • starshipwinepineapple@programming.dev
    7 hours ago

    The table you’re referencing leaves out CUDA/tensor cores (count and generation), which are a big part of what these GPUs offer, and it doesn’t factor in the type of memory. From the comments it looks like you want to use a large MoE model. You aren’t going to be able to just stack raw power and expect to run this without major deterioration of performance, if it runs at all.

    Don’t forget your MoE model needs all-to-all communication for expert routing.
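For a feel of what that all-to-all traffic could cost, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function names are made up, experts are taken to be spread evenly across the GPUs with uniform routing, and the model shape and link speed in the example call are placeholders rather than figures from any real model or from this thread.

```python
def all_to_all_bytes_per_token(hidden_size, top_k, num_gpus, bytes_per_value=2):
    """Approximate bytes one token moves off its GPU in a single MoE layer.

    Assumptions: experts are spread evenly over the GPUs, routing is uniform,
    and each token's hidden state is sent to top_k experts (dispatch) and the
    results are sent back (combine), hence the factor of 2.
    """
    remote_fraction = (num_gpus - 1) / num_gpus  # share of experts on other GPUs
    return 2 * top_k * hidden_size * bytes_per_value * remote_fraction


def comm_seconds_per_step(tokens, num_moe_layers, link_bytes_per_s, **shape):
    """Bandwidth-only estimate for one forward pass.

    Ignores per-message latency and any overlap with compute, so treat it as
    a rough order of magnitude, not a measurement.
    """
    total_bytes = tokens * num_moe_layers * all_to_all_bytes_per_token(**shape)
    return total_bytes / link_bytes_per_s


# Placeholder numbers: 8 GPUs, hidden size 4096, top-2 routing, 32 MoE layers,
# FP16 activations, and a 32 GB/s link per GPU.
seconds = comm_seconds_per_step(
    tokens=1, num_moe_layers=32, link_bytes_per_s=32e9,
    hidden_size=4096, top_k=2, num_gpus=8,
)
print(f"{seconds * 1e6:.1f} microseconds per token, bandwidth term only")
```

With small batches the per-message latency that this sketch ignores is likely to matter more than the bandwidth term, so real numbers need measuring on the actual links.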

    • TheMightyCat@ani.socialOP
      7 hours ago

      Why do core counts and memory type matter when the table includes memory bandwidth and TFLOP16?

      The H200 has HBM and a lot of tensor cores, which is reflected in its high stats in the table, and the AMD GPUs don’t have CUDA cores.

      I know a major deterioration is to be expected, but how major? Even in an extreme case where only 10% of the total power is usable, it’s still competitive against the H200, since you can get way more for the price, even if you can only use 10% of that.
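A rough way to check the 10% figure against the table is to ask what share of a card's nominal numbers has to be usable before it matches the H200 per euro. The sketch below uses the €/TB/s and €/TFLOP16 columns as listed; `break_even_efficiency` and `effective_cost` are made-up helper names, and treating "efficiency" as one flat multiplier is a simplifying assumption. VRAM is left out because capacity is not lost to scaling overhead in the same way.

```python
# EUR per unit, copied from the table above (EUR per TB/s, EUR per TFLOP16).
H200 = {"bandwidth": 7423, "tflop16": 21}
CARDS = {
    "RTX 5090": {"bandwidth": 1284, "tflop16": 22},
    "RX 9070 XT": {"bandwidth": 1031, "tflop16": 7},
    "RX 9060 XT": {"bandwidth": 1186, "tflop16": 7.45},
}


def break_even_efficiency(card_cost, h200_cost):
    """Share of the card's nominal figure that must be usable to match the H200 per euro."""
    return card_cost / h200_cost


def effective_cost(card_cost, efficiency):
    """EUR per usable unit when only `efficiency` (0 < e <= 1) of the nominal figure is reached."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return card_cost / efficiency


for name, cost in CARDS.items():
    for metric in ("bandwidth", "tflop16"):
        share = break_even_efficiency(cost[metric], H200[metric])
        print(f"{name:12s} {metric:9s} break-even at {share:.0%} of nominal")
```

Going by these columns, the 9070XT would need roughly 14% of its nominal bandwidth and about a third of its nominal TFLOP16 to draw level, so a flat 10% would only come out ahead on €/GB.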

      • starshipwinepineapple@programming.dev
        6 hours ago

        TFLOPS is a generic measurement, not actual utilization, and it isn’t specific to a given type of workload. Not all workloads saturate the GPU equally, and AI models depend on CUDA/tensor cores. The generation and count of those cores determine how well a card is optimized for AI workloads and how much of those TFLOPS it can actually use for your task. And yes, AMD uses ROCm, which I didn’t feel I needed to specify since it’s a given (and years behind CUDA’s capabilities). The point is that these things are not equal, and there are major differences here alone.

        I mentioned memory type since the cards you listed use different kinds (HBM vs GDDR), so you can’t just compare capacity alone and expect equal performance.

        And again, for your specific use case of this large MoE model, you’d need to solve the GPU-to-GPU communication issue (ensuring both the connections and sufficient speed without getting bottlenecked).

        I think you’re going to need to do an actual analysis of the specific setup you’re proposing. Good luck.
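One common way to see why peak TFLOPS and achieved throughput diverge is a simple roofline estimate: a workload cannot run faster than memory bandwidth times its arithmetic intensity (FLOPs per byte moved). The sketch below uses the bandwidth and TFLOP16 figures from the table; `attainable_tflops` is a made-up helper, and the intensity of about 1 FLOP per byte (roughly what batch-1 decoding with FP16 weights works out to) is an assumption, not a measurement.

```python
# (peak FP16 TFLOPS, memory bandwidth in TB/s), copied from the table above.
SPECS = {
    "H200 NVL": (1671, 4.89),
    "RTX 5090": (104.8, 1.79),
    "RX 9070 XT": (97.32, 0.6446),
}


def attainable_tflops(peak_tflops, bandwidth_tb_s, flops_per_byte):
    """Roofline bound: the lower of the compute peak and bandwidth * intensity.

    TB/s multiplied by FLOP/byte gives TFLOP/s, so the units line up directly.
    """
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)


intensity = 1.0  # assumed FLOPs per byte for a memory-bound decode step
for name, (peak, bandwidth) in SPECS.items():
    bound = attainable_tflops(peak, bandwidth, intensity)
    ridge = peak / bandwidth  # intensity needed before the compute peak is reachable
    print(f"{name:12s} bound {bound:6.2f} TFLOPS at {intensity} FLOP/B, "
          f"ridge point {ridge:6.1f} FLOP/B")
```

Under that assumption every card in the list is limited by bandwidth long before the TFLOP16 column matters, which is one reason the spec sheet number alone says little about a given workload.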