• 4 Posts
  • 109 Comments
Joined 2 years ago
Cake day: September 6th, 2023






  • Ollama does use ROCm; however, so does llama.cpp. Vulkan happens to be another backend supported by llama.cpp.

    GitHub: llama.cpp Supported Backends

    There are old PRs which attempted to bring Vulkan support to Ollama - a logical and helpful move, given that the Ollama engine is based on llama.cpp - but the Ollama maintainers weren't interested.

    As for performance vs ROCm, it does fine. Against CUDA, it also does well unless you're in a multi-GPU setup. Its magic trick is compatibility: pretty much everything runs Vulkan, and Vulkan is compatible across generations of cards, architectures AND vendors. That's how I'm running a single PC with Nvidia and AMD cards together.
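    To illustrate the cross-vendor point, here's a minimal plain-Vulkan sketch of my own (not taken from llama.cpp or the Ollama PRs) that lists every GPU the Vulkan loader can see. NVIDIA, AMD and Intel devices all come back through the same API, which is why a mixed-vendor box works:

    ```c
    /* List every GPU visible to the Vulkan loader, regardless of vendor.
     * Assumes the Vulkan headers and loader are installed; build with
     * something like: cc list_gpus.c -lvulkan */
    #include <stdio.h>
    #include <vulkan/vulkan.h>

    int main(void) {
        VkApplicationInfo app = {0};
        app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
        app.apiVersion = VK_API_VERSION_1_1;

        VkInstanceCreateInfo info = {0};
        info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
        info.pApplicationInfo = &app;

        VkInstance instance;
        if (vkCreateInstance(&info, NULL, &instance) != VK_SUCCESS) {
            fprintf(stderr, "failed to create Vulkan instance\n");
            return 1;
        }

        /* First call gets the device count, second call fills the array. */
        uint32_t count = 0;
        vkEnumeratePhysicalDevices(instance, &count, NULL);
        VkPhysicalDevice devices[16];
        if (count > 16) count = 16;
        vkEnumeratePhysicalDevices(instance, &count, devices);

        for (uint32_t i = 0; i < count; i++) {
            VkPhysicalDeviceProperties props;
            vkGetPhysicalDeviceProperties(devices[i], &props);
            /* NVIDIA, AMD and Intel cards all appear through the same API. */
            printf("GPU %u: %s (vendor 0x%04x)\n",
                   i, props.deviceName, (unsigned)props.vendorID);
        }

        vkDestroyInstance(instance, NULL);
        return 0;
    }
    ```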






  • Oh god this turned into a vent session

    I think back on what I left behind. And I feel bad.

    But then I feel better, because I remember the reason I left: we outgrew our processes and codebase and desperately needed a restructure, but I got no support in doing so.

    I bitched for years that it was a continuity risk and a performance nightmare. But no. “Deliver more features. Add more junk for use cases that brought us no business value.” Never consider governance or security. Never consider best practices. Just more.

    I knew eventually something bad would happen and I would be thrown under the bus. So I split. It was a good decision.

    But yeah. Someone inherited a lot of turd code.





  • This is so on point.

    I get not wanting to compile your code. It's extra work and, if you're already catering to a very tech-savvy crowd, you can let them deal with the variance and extra compile time.

    BUT if you’re releasing your code for others TO USE and you don’t provide reproducible instructions, what’s the point?!?


  • It depends on your goals and your use case.

    • Do you want the most performance per dollar? You will never touch what the big datacenters can achieve.

    • Do you want privacy? Build it yourself.

    • Do you want quality output? Go to the online providers or expect to pay more to build it yourself.

    I am actively trying to work on non-Nvidia hardware because I'm a techno-masochist. It's very uphill, especially at the cutting edge; people are building for CUDA.

    I can do amazing image generation on a 7900 XTX with 24 GB of vRAM. One of those is under $900 in the US, which is great. A 3090 would probably be easier, and is more expensive even though it's less performant hardware.



  • Possibly. Vulkan would be compatible with the system and would be able to take advantage of iGPUs. You'd definitely want to look into whether or not you have any dedicated vRAM that's DDR5 and just use that if possible.

    Explanation: LLMs are extremely bound by memory bandwidth. They are essentially giant, gigabyte-sized stores of numbers which have to be read from memory and multiplied by a numeric representation of your prompt…for every new word you type in and every word you generate. To do this, these models constantly pull data in and out of [v]RAM. So, while you may have plenty of RAM and a decent amount of computing power, your 780M probably won't ever be great for LLMs, even with Vulkan, because you don't have the memory bandwidth to keep it busy. (There's a rough worked example after the list below.)

    Roughly, for a small model:

    • CPU Dual channel DDR4 - 1.7 words per second
    • CPU Dual channel DDR5 - 3.5 words per second
    • vRAM GTX 1060 - 10+ words per second …
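    To put rough numbers behind that, here's a hedged back-of-envelope sketch; the bandwidth and model-size figures are my own assumptions, not benchmarks. It computes the theoretical ceiling in tokens per second, and real-world throughput - like the figures above - lands well below that ceiling:

    ```c
    /* Back-of-envelope decode-speed estimate: tokens/s is roughly capped by
     * memory bandwidth divided by the bytes streamed per generated token
     * (about the size of the quantized model weights). All figures below are
     * illustrative assumptions; real throughput is well under this ceiling. */
    #include <stdio.h>

    int main(void) {
        const double model_gb = 4.7;   /* e.g. a ~7B model at 4-5 bit quant */
        const char  *label[]  = { "Dual-channel DDR4-3200",
                                  "Dual-channel DDR5-5600",
                                  "GTX 1060 GDDR5" };
        const double bw_gbs[] = { 51.2, 89.6, 192.0 };  /* theoretical GB/s */

        for (int i = 0; i < 3; i++)
            printf("%-24s ~%4.1f tokens/s upper bound\n",
                   label[i], bw_gbs[i] / model_gb);
        return 0;
    }
    ```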