Xylight@lemdro.id · edit-2 2 months ago

My 8 GB VRAM system as I try to load GLM-4.6-Q0.00001_XXXS.gguf:

TheMightyCat@ani.social · 2 months ago

I would suggest trying exllamav3 once. I have no idea what kind of black magic they use, but it’s very memory efficient.

I can’t load Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 with 16K context using vLLM,

but using exllamav3 I can SOMEHOW load ArtusDev/Qwen_Qwen3-Coder-30B-A3B-Instruct-EXL3:8.0bpw_H8 at its full context of 262,144 tokens with 2 GiB still to spare.

I really feel like this is too good to be true and I’m doing something wrong, but it just works, so I don’t know.

ffhein@lemmy.world · 2 months ago

I guess there’s some automatic VRAM paging going on. How many tokens per second do you get while generating?

TheMightyCat@ani.social · edit-2 2 months ago

I found the reason: somehow setting --max_num_seqs 1 makes vLLM way more memory efficient.

Not sure exactly what it does, but I think it’s because vLLM batches requests and the API I was using with exllamav3 doesn’t.
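For anyone wanting to reproduce this, here is a minimal sketch of how such a launch command could be assembled. The thread does not show the actual command line, so everything except the `max_num_seqs` setting is an assumption: the 100,000-token context limit and the tensor-parallel size of 2 are inferred from the log output (`100,000 tokens per request`, `Worker_TP0`/`Worker_TP1`), and `build_vllm_command` is a hypothetical helper. `--max-num-seqs` caps how many sequences vLLM schedules at once; lowering it to 1 likely shrinks what vLLM reserves for batching at startup, which would leave more room for the KV cache.

```python
import shlex


def build_vllm_command(model, max_model_len, max_num_seqs=1, tensor_parallel_size=2):
    """Assemble a `vllm serve` command line as a list of arguments.

    Hypothetical helper: the values passed in below are guesses based on
    the thread's logs, not the poster's real configuration.
    """
    return [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),            # one request at a time
        "--max-model-len", str(max_model_len),          # context length limit
        "--tensor-parallel-size", str(tensor_parallel_size),  # split across GPUs
    ]


cmd = build_vllm_command("Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8", 100_000)
print(shlex.join(cmd))
```

The trade-off is that a server started this way handles requests strictly one after another, which is fine for a single local user but not for serving several clients.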

Now I’m doing 100k with vLLM too:

(Worker_TP0_EP0 pid=99695) INFO 11-03 17:34:00 [gpu_worker.py:298] Available KV cache memory: 4.73 GiB
(Worker_TP1_EP1 pid=99696) INFO 11-03 17:34:00 [gpu_worker.py:298] Available KV cache memory: 4.73 GiB
(EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1087] GPU KV cache size: 103,264 tokens
(EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1091] Maximum concurrency for 100,000 tokens per request: 1.03x
(EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1087] GPU KV cache size: 103,328 tokens
(EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1091] Maximum concurrency for 100,000 tokens per request: 1.03x
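The numbers in these log lines can be sanity-checked with some quick arithmetic. The sketch below assumes the model uses 48 layers, 4 KV heads and a head dimension of 128, that the KV cache is stored in 16-bit precision, and that the KV heads are split across the two tensor-parallel workers. None of that is stated in the thread, so treat those values as assumptions to verify against the model's config. Under them, 4.73 GiB per GPU works out to a cache size close to the logged one, and the "maximum concurrency" figure is simply cache tokens divided by tokens per request.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes, tp_size=1):
    """Bytes of KV cache one token occupies on a single GPU."""
    # KV heads are sharded across tensor-parallel ranks (at least one per rank).
    heads_per_gpu = max(1, num_kv_heads // tp_size)
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * heads_per_gpu * head_dim * dtype_bytes


def kv_cache_tokens(available_bytes, bytes_per_token):
    """How many tokens fit in the available KV cache memory."""
    return int(available_bytes // bytes_per_token)


def max_concurrency(cache_tokens, tokens_per_request):
    """Number of full-length requests the cache can hold at once."""
    return cache_tokens / tokens_per_request


# Assumed model shape (not from the thread): 48 layers, 4 KV heads, head_dim 128, 2-byte cache.
per_token = kv_bytes_per_token(48, 4, 128, 2, tp_size=2)
tokens = kv_cache_tokens(4.73 * GIB, per_token)
print(per_token, tokens)  # roughly 103k tokens, in line with the log

# Values taken directly from the log: 103,264 cached tokens, 100,000 per request.
print(round(max_concurrency(103_264, 100_000), 2))  # 1.03
```

Since 4.73 GiB is itself a rounded figure, the estimate only needs to land in the same neighborhood as the logged 103,264 and 103,328 tokens.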

I would say exllamav3 is still slightly more efficient, but this explains the huge discrepancy. exllamav3 also allows setting GB per GPU, which gets me a few more GB than vLLM: vLLM spreads the allocation evenly, and a bunch of memory on GPU 0 is already used for other stuff.

As for the T/s, it’s about the same, in the 80-100 range. This is what I’m getting with vLLM:

(APIServer pid=99454) INFO 11-03 17:36:31 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO 11-03 17:36:32 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
(APIServer pid=99454) INFO 11-03 17:36:32 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO 11-03 17:36:34 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
(APIServer pid=99454) INFO 11-03 17:36:34 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO 11-03 17:36:34 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
(APIServer pid=99454) INFO 11-03 17:36:34 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO 11-03 17:36:34 [loggers.py:127] Engine 000: Avg prompt throughput: 461.4 tokens/s, Avg generation throughput: 17.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 66.9%
(APIServer pid=99454) INFO 11-03 17:36:35 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
(APIServer pid=99454) INFO 11-03 17:36:35 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO 11-03 17:36:35 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
(APIServer pid=99454) INFO 11-03 17:36:35 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO 11-03 17:36:36 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
(APIServer pid=99454) INFO 11-03 17:36:36 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO 11-03 17:36:36 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
(APIServer pid=99454) INFO 11-03 17:36:36 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO 11-03 17:36:43 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
(APIServer pid=99454) INFO 11-03 17:36:44 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
(APIServer pid=99454) INFO 11-03 17:36:44 [loggers.py:127] Engine 000: Avg prompt throughput: 1684.4 tokens/s, Avg generation throughput: 96.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 83.4%
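The throughput figures are buried among the tool-parser messages, so a small script can pull them out. This is a sketch that assumes the `loggers.py` line format stays exactly as shown above; `parse_throughput` is a hypothetical helper, not part of vLLM.

```python
import re

# Matches the periodic stats line emitted by vLLM's logger, as seen in the log above.
THROUGHPUT_RE = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s"
)


def parse_throughput(lines):
    """Return (prompt_tps, generation_tps) pairs for every stats line found."""
    samples = []
    for line in lines:
        match = THROUGHPUT_RE.search(line)
        if match:
            samples.append((float(match.group("prompt")), float(match.group("gen"))))
    return samples


log = [
    '(APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK',
    "(APIServer pid=99454) INFO 11-03 17:36:34 [loggers.py:127] Engine 000: Avg prompt throughput: 461.4 tokens/s, Avg generation throughput: 17.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 66.9%",
    "(APIServer pid=99454) INFO 11-03 17:36:44 [loggers.py:127] Engine 000: Avg prompt throughput: 1684.4 tokens/s, Avg generation throughput: 96.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 83.4%",
]

for prompt_tps, gen_tps in parse_throughput(log):
    print(f"prompt: {prompt_tps} tok/s, generation: {gen_tps} tok/s")
```

Each stats line is an average over a reporting interval, so a single sample such as the 17.6 tokens/s one does not necessarily reflect the steady generation speed.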

Now that I have found this out, I’ve switched back to vLLM, because the API I’m using with exllamav3 doesn’t support Qwen 3 tools yet :(

ffhein@lemmy.world · 1 month ago

Ah, multiple GPUs? For some reason I thought you meant that with exllamav3 you had managed to load a model which was larger than your VRAM.