• 4 Posts
  • 135 Comments
Joined 2 years ago
Cake day: September 6th, 2023

  • I don’t know if this is still useful for you, but I tried this out, mostly because I wanted to make sure I wasn’t crazy. Here’s my gpt-oss setup running on cheap AMD Instinct VRAM:

    ./llama-server \
      --model {model}.gguf \
      --alias "gpt-oss-120b-mxfp4" \
      --threads 16 \
      -fa on \
      --main-gpu 0 \
      --ctx-size 64000 \
      --n-cpu-moe 0 \
      --n-gpu-layers 999 \
      --temp 1.0 \
      -ub 1536 \
      -b 1536 \
      --min-p 0.0 \
      --top-p 1.0 \
      --top-k 0 \
      --jinja \
      --host 0.0.0.0 \
      --port 11343 \
      --chat-template-kwargs '{"reasoning_effort": "medium"}'
    

    I trimmed the response content because it wasn’t relevant, but kept the rough shape of the replies to give a sense of the verbosity.
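
    For anyone reproducing this: the server exposes an OpenAI-compatible chat endpoint on that port, and a request with a custom system prompt looks roughly like the sketch below (not my exact payload).

    # same host/port the llama-server command above listens on
    curl http://localhost:11343/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful coding assistant. ..."},
          {"role": "user", "content": "how do i calculate softmax in python"}
        ]
      }'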

    Test 1: With default system message

    user prompt: how do i calculate softmax in python

    What is softmax
    1 python + numpy
    ...
    quick demo
    ...
    2 SciPy
    ...
    ...
    ...
    8 full script
    ...
    running the script
    ...
    results
    ...
    TL;DR
    ...
    

    followup prompt: how can i GPU-accelerate the function with torch

    1 why pytorch is fast
    ...
    ...
    **[Headers 2,3,4,5,6,7,8,9]**
    ...
    ...
    TL;DR
    ...
    Recap
    ...
    Table Recap
    ...
    Common pitfalls
    ...
    Going beyond float32
    ...
    10 Summary
    ...
    

    Overall: 6393 tokens, including reasoning

    Test 2: With this system prompt: You are a helpful coding assistant. Provide concise, to-the-point answers. No fluff. Provide straightforward explanations when necessary. Do not add emoji and only provide tl;drs or summaries when asked.

    user prompt: how do i calculate softmax in python

    Softmax calculation in Python
    ...
    Key points
    ...
    

    followup prompt: how can i GPU-accelerate the function with torch

    GPU‑accelerated Softmax with PyTorch
    ...
    What the code does
    ...
    Tips for larger workloads
    ...
    

    Overall: 1103 tokens, including reasoning




  • Qwen3 or Qwen3 Coder? Qwen3 comes in 235B, 30B, and smaller sizes. Qwen3 Coder comes in 30B and 480B sizes.

    OpenRouter has multiple quant options and, for coding, I’d try to only use 8-bit int or higher.

    Claude also has a ton of sizes and deployment options with different capabilities.

    As far as reasoning goes, the newest DeepSeek V3.1 Terminus should be pretty good.

    Honestly, all of these models should be able to help you up to a certain level with Docker. I would double-check how you connect to OpenRouter, making sure your hyperparams are good and that thinking/reasoning is enabled (see the sketch below). Maybe try duck.ai and see if the models there match up with whatever you’re doing in OpenRouter.
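
    As a rough example of what I mean (the model slug and the reasoning field here are from memory, so verify the exact names against OpenRouter’s docs):

    # minimal OpenRouter chat request with explicit sampling params and reasoning enabled
    curl https://openrouter.ai/api/v1/chat/completions \
      -H "Authorization: Bearer $OPENROUTER_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "deepseek/deepseek-v3.1-terminus",
        "messages": [{"role": "user", "content": "why does my docker compose service fail to start?"}],
        "temperature": 0.7,
        "top_p": 0.95,
        "reasoning": {"enabled": true}
      }'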

    Finally, not being a hater, but LLMs are not intelligent. They cannot actually reason or think; they probabilistically align with answers you want to see. Sometimes your issue might be too weird or new for them to give you a good answer. Even today, models will give you Docker Compose files with a version key at the top, a field that has been deprecated for over a year.

    Edit: gpt-oss-120b should be cheap and capable enough; it’s available on duck.ai.