Unsloth has quants for both
- Unsloth Qwen3 Next guide
- unsloth/Qwen3-Next-80B-A3B-Thinking-GGUF
- unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF
This was a great day to check the news. I also saw that vllm has just added support for strix-halo.
That is correct, but you might be missing why this is useful. MoE models are great for CPU inference, which is considerably cheaper than GPU inference at scale. The Qwen 30B-A3B MoE and the 8B dense model were widely considered similar in quality. If you have the VRAM, the 8B would be faster; if you don't, the 30B would be faster (as long as you had the ~19-22 GB of RAM required).
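The trade-off above falls out of simple arithmetic: decode speed on CPU is roughly memory-bandwidth bound, and a MoE model only has to stream its *active* parameters per token. A back-of-envelope sketch (the bandwidth figure and the 4-bit quant size are illustrative assumptions, not benchmarks):

```python
# Rough upper bound on decode tokens/sec when memory-bandwidth bound:
# each generated token streams the model's active parameters from RAM once.
def tokens_per_sec(active_params_b: float, bytes_per_param: float,
                   bandwidth_gbs: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

# Assumptions: ~3B active params for the 30B-A3B MoE, 8B for the dense
# model, a ~4-bit quant (~0.5 bytes/param), and ~80 GB/s of RAM bandwidth.
moe = tokens_per_sec(3, 0.5, 80)    # MoE: only active experts touched
dense = tokens_per_sec(8, 0.5, 80)  # dense: all 8B params per token

print(f"MoE (3B active): ~{moe:.0f} tok/s")
print(f"Dense (8B):      ~{dense:.0f} tok/s")
```

Same box, same bandwidth, but the MoE decodes well over twice as fast because it moves far fewer bytes per token; the dense model only wins when it fits in VRAM with much higher bandwidth.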
A very inexpensive used server with lots of memory channels but no GPU can do very cost-efficient inference in this scenario, and loads of people are asking for this.
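Why memory channels matter: peak theoretical bandwidth scales linearly with channel count, so an old 8-channel server can out-stream a modern desktop. A quick sketch (DDR speeds chosen as illustrative examples):

```python
# Peak theoretical DDR bandwidth: channels * transfer rate (MT/s) * 8 bytes
# per 64-bit transfer. Real-world sustained bandwidth is lower.
def peak_bandwidth_gbs(channels: int, mts: int) -> float:
    return channels * mts * 8 / 1000

am5 = peak_bandwidth_gbs(2, 6000)     # dual-channel DDR5-6000 desktop: 96 GB/s
server = peak_bandwidth_gbs(8, 3200)  # 8-channel DDR4-3200 server: 204.8 GB/s

print(f"AM5 desktop:  {am5} GB/s")
print(f"Used server:  {server} GB/s")
```

The used server more than doubles the desktop's peak bandwidth despite the much slower (and cheaper) DIMMs.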
Fantastic explanation, thank you
So not an AMD AM5 dual-channel system. 😅