• afk_strats@lemmy.world · 9 days ago

    I'm not sure if it's a me issue, but that's a static image. I figure you meant to post the one where they throw a brick into it.

    Also, if this post was serious, how does a heavily quantized model compare to something less quantized but with fewer parameters? I haven't seen benchmarks other than perplexity, which isn't a good measure of capability.

    • Xylight@lemdro.id (OP) · 9 days ago

      It's a webp animation. Maybe your client doesn't display it right; I'll replace it with a gif.

      Regarding your other question, I tend to see better results with more parameters at lower precision than with fewer parameters at higher precision. That's just based on "vibes", though; I haven't done any real testing. From what I've seen, Q4 is the lowest safe quantization, and beyond that the performance really starts to drop off. Unfortunately, even at 1-bit quantization I can't run GLM 4.6 on my system.
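
      If you want to try this yourself, here's roughly what loading a Q4 quant looks like with llama-cpp-python. It's only a sketch: the model file name and settings below are placeholders, not anything specific from this thread.

      # Minimal sketch: run a Q4_K_M GGUF on an 8 GB card with llama-cpp-python.
      # The model path and settings are placeholders -- adjust for your hardware.
      from llama_cpp import Llama

      llm = Llama(
          model_path="models/qwen3-8b-q4_k_m.gguf",  # hypothetical Q4 quant file
          n_gpu_layers=-1,  # offload every layer to the GPU if it fits
          n_ctx=8192,       # context window; bigger contexts need more VRAM for KV cache
      )

      out = llm("Explain quantization in one sentence.", max_tokens=64)
      print(out["choices"][0]["text"])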

      • hendrik@palaver.p3x.de · 9 days ago

        What's higher precision for you? What I remember from the old measurements for ggml is that going lower than Q3 rarely makes sense, and at roughly Q3 you'd think about switching to a smaller model variant. On the other hand, everything above Q6 only shows marginal differences in perplexity, so Q6, Q8, and full precision are basically the same thing.
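
        For reference, perplexity is just the exponential of the average negative log-likelihood per token over some test text. A rough sketch of measuring it with Hugging Face transformers (the model name and evaluation file are placeholders):

        # Rough perplexity sketch: ppl = exp(mean negative log-likelihood per token).
        # Model name and evaluation text are placeholders.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        name = "Qwen/Qwen2.5-0.5B"  # any causal LM works; a small one keeps this runnable
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name).eval()

        text = open("wiki.test.raw").read()[:20000]  # placeholder evaluation text
        ids = tok(text, return_tensors="pt").input_ids

        nlls = []
        stride = 512
        for i in range(0, ids.size(1) - 1, stride):
            chunk = ids[:, i : i + stride + 1]
            with torch.no_grad():
                # labels = inputs makes the model return the mean cross-entropy loss
                nlls.append(model(chunk, labels=chunk).loss)

        print(f"perplexity: {torch.exp(torch.stack(nlls).mean()).item():.2f}")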

        • Xylight@lemdro.id (OP) · 9 days ago

          As a memory-poor user (hence the 8 GB VRAM card), I consider Q8 and above to be higher precision, Q4-Q5 to be mid-to-low precision (what I typically use), and anything below that to be low precision.
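
          Back-of-the-envelope, weight memory is roughly parameters × bits-per-weight / 8, plus some headroom for KV cache and activations, which is why an ~8B model only fits on an 8 GB card at around Q4-Q5. A quick sketch (the 1.2 overhead factor is a guess, not a measured number):

          # Back-of-the-envelope VRAM estimate: weights only, times a rough overhead
          # factor for KV cache / activations. The 1.2 multiplier is a guess.
          def est_vram_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.2) -> float:
              weight_bytes = params_billions * 1e9 * bits_per_weight / 8
              return weight_bytes * overhead / 1024**3

          for bits in (16, 8, 5, 4, 2):
              print(f"8B model @ {bits}-bit ~ {est_vram_gb(8, bits):.1f} GB")
          # -> roughly 17.9, 8.9, 5.6, 4.5, 2.2 GB; only Q4/Q5 leaves room on an 8 GB card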

          • hendrik@palaver.p3x.de · 9 days ago

            Thanks. That sounds reasonable. Btw, you're not the only poor person around; I don't even own a graphics card. I'm not a gamer, so I never saw any reason to buy one before I took an interest in AI. I do inference on my CPU, which is connected to more than 8 GB of memory. It's just slow 😉 But I guess I'm fine with that. I don't rely on AI, it's just tinkering, and I'm patient. And a few times a year I'll rent a cloud GPU by the hour. Maybe one day I'll buy one myself.

      • afk_strats@lemmy.world · 9 days ago

        That fixed it.

        I am a fan of this quant cook. He often posts perplexity charts.

        https://huggingface.co/ubergarm

        All of his quants require ik_llama, which works best with Nvidia CUDA, but they can do a lot with RAM+VRAM or even hard drive + RAM. I don't know if 8 GB is enough for everything.
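
        The RAM+VRAM trick is basically partial layer offload: keep whatever layers fit on the GPU and leave the rest in system memory (the weights are mmapped, so they can even page from disk). As a rough sketch with llama-cpp-python (ik_llama is a llama.cpp fork, so the knobs are similar; the path and layer count are placeholders):

        # Partial offload sketch: put some layers in VRAM, leave the rest in RAM.
        # llama.cpp mmaps the GGUF by default, so weights can also page from disk.
        # Model path and layer count are placeholders for whatever fits your card.
        from llama_cpp import Llama

        llm = Llama(
            model_path="models/big-moe-iq4.gguf",  # hypothetical large quant
            n_gpu_layers=20,  # only 20 layers go to VRAM; the rest run on the CPU
            n_ctx=4096,
        )
        print(llm("Hello", max_tokens=16)["choices"][0]["text"])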

    • hendrik@palaver.p3x.de · 9 days ago

      I think perplexity is still central to evaluating models. It’s notoriously difficult to come up with other ways to measure these things.

  • TheMightyCat@ani.social · 9 days ago

    I would suggest trying exllamav3 once. I have no idea what kind of black magic they use, but it's very memory efficient.

    I can't load Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 with 16K context using vLLM,

    but using exllamav3 I can SOMEHOW load ArtusDev/Qwen_Qwen3-Coder-30B-A3B-Instruct-EXL3:8.0bpw_H8 at its full context of 262,144 with 2 GiB still to spare.

    I really feel like this is too good to be true and that I'm doing something wrong, but it just works, so I don't know.

    • ffhein@lemmy.world · 8 days ago

      I guess there's some automatic VRAM paging going on. How many tokens per second do you get while generating?

      • TheMightyCat@ani.social · 5 hours ago

        I found the reason: somehow setting --max_num_seqs 1 makes vLLM way more efficient.

        I'm not sure exactly what it does, but I think it's because vLLM batches requests and the API I was using with exllamav3 doesn't.
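
        For reference, the same knob exists in vLLM's offline Python API; roughly what I'm running now looks like this (gpu_memory_utilization is just a typical value, the rest mirrors the setup above):

        # Sketch of the same setting via vLLM's Python API. max_num_seqs=1 makes the
        # engine plan for a single concurrent request, which (as far as I can tell)
        # is what frees up the memory it would otherwise reserve for batching.
        from vllm import LLM, SamplingParams

        llm = LLM(
            model="Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
            max_model_len=100_000,        # the context length I actually need
            max_num_seqs=1,               # no batching
            tensor_parallel_size=2,       # two GPUs, matching the TP0/TP1 workers in the log
            gpu_memory_utilization=0.90,  # typical value, adjust to taste
        )

        out = llm.generate(["Write a haiku about KV cache."], SamplingParams(max_tokens=32))
        print(out[0].outputs[0].text)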

        Now I'm doing 100k with vLLM too:

        (Worker_TP0_EP0 pid=99695) INFO 11-03 17:34:00 [gpu_worker.py:298] Available KV cache memory: 4.73 GiB
        (Worker_TP1_EP1 pid=99696) INFO 11-03 17:34:00 [gpu_worker.py:298] Available KV cache memory: 4.73 GiB
        (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1087] GPU KV cache size: 103,264 tokens
        (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1091] Maximum concurrency for 100,000 tokens per request: 1.03x
        (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1087] GPU KV cache size: 103,328 tokens
        (EngineCore_DP0 pid=99577) INFO 11-03 17:34:00 [kv_cache_utils.py:1091] Maximum concurrency for 100,000 tokens per request: 1.03x
        

        I would say exllamav3 is still slightly more efficient, but this explains the huge discrepancy. exllamav3 also allows setting GB per GPU, which lets me squeeze out a few more GB than vLLM, which spreads memory evenly even though a bunch of memory on GPU 0 is used for other stuff.

        As for the T/s, it's about the same, in the 80-100 range. This is what I'm getting with vLLM:

        (APIServer pid=99454) INFO 11-03 17:36:31 [qwen3coder_tool_parser.py:76] vLLM Successfully import tool parser Qwen3CoderToolParser !
        (APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
        (APIServer pid=99454) INFO:     127.0.0.1:32968 - "POST /tokenize HTTP/1.1" 200 OK
        (APIServer pid=99454) INFO 11-03 17:36:34 [loggers.py:127] Engine 000: Avg prompt throughput: 461.4 tokens/s, Avg generation throughput: 17.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 66.9%
        (APIServer pid=99454) INFO 11-03 17:36:44 [loggers.py:127] Engine 000: Avg prompt throughput: 1684.4 tokens/s, Avg generation throughput: 96.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 83.4%
        

        Now that I have found this out, I've switched back to vLLM, because the API I'm using with exllamav3 doesn't support Qwen 3 tools yet :(