• hendrik@palaver.p3x.de · 3 points · 8 hours ago

      Don’t MoE models load into memory just like any other model, and it’s only that they pick a subset of weights to multiply at each step, which makes them faster? That would make me think it needs somewhere around 80GB of memory at 8-bit, 160GB at full (16-bit) precision, or something like 50GB at llama.cpp’s usual Q4_K_M quantization…
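      The arithmetic behind those numbers can be sketched as follows. This is a rough estimate of weight storage only (it ignores KV cache and runtime overhead), and the ~80B total parameter count and ~4.85 effective bits per weight for Q4_K_M are assumptions inferred from the figures in the comment:

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Memory needed to hold model weights, in GB (1e9 params at 1 byte each ~= 1 GB)."""
    return params_billions * bits_per_param / 8

# Assumed ~80B-parameter model, as implied by the comment's figures:
print(weight_memory_gb(80, 8))     # 8-bit -> 80.0 GB
print(weight_memory_gb(80, 16))    # fp16  -> 160.0 GB
print(weight_memory_gb(80, 4.85))  # Q4_K_M at ~4.85 bits/weight -> 48.5 GB
```

Note that for an MoE model this counts *all* experts, since every expert must be resident even though only a few are active per token.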

      • Dran@lemmy.world · 6 points · 7 hours ago

        That is correct, but you might be missing why this is useful. MoE models are great for CPU inference, which is considerably cheaper than GPU inference at scale. The Qwen 30B-A3B MoE and the 8B dense model were widely considered similar in quality. If you have the VRAM, the 8B would be faster; if you don’t, the 30B would be faster (as long as you have the ~19–22 GB of RAM it requires).
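        The speed trade-off comes down to memory bandwidth: during decoding, each generated token streams the active weights through memory once, so tokens/sec is roughly bandwidth divided by active weight bytes. A back-of-envelope sketch, where the 100 GB/s bandwidth figure is an illustrative assumption for a multi-channel DDR server and ~4.85 bits/weight approximates Q4_K_M:

```python
def tokens_per_sec(active_params_billions: float, bits_per_param: float,
                   bandwidth_gb_s: float) -> float:
    """Decode-speed ceiling: each token reads the active weights once from memory."""
    active_gb = active_params_billions * bits_per_param / 8
    return bandwidth_gb_s / active_gb

# Assumed 100 GB/s system memory bandwidth (illustrative, not measured):
dense_8b = tokens_per_sec(8, 4.85, 100)  # all 8B params active -> roughly 20 tok/s
moe_a3b  = tokens_per_sec(3, 4.85, 100)  # only 3B of 30B active -> roughly 55 tok/s
```

This is why the 30B MoE can out-run the 8B dense model on CPU despite needing more total RAM: per token it touches far fewer weights.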

        A very inexpensive used server with lots of memory channels but no GPU can do very cost-efficient inference in this scenario, and loads of people are asking for exactly that.