I did nothing and I’m all out of ideas!

  • 0 Posts
  • 1 Comment
Joined 2 years ago
Cake day: June 11th, 2023

  • I’ve never used oobabooga, but if you use llama.cpp directly you can specify the number of layers to run on the GPU with the -ngl flag, followed by the number.

    So, as an example, a command (on Linux), run from the directory containing the binary, to start its server would look something like: ./llama-server -m "/path/to/model.gguf" -ngl 10

    This will put 10 layers of the model on the GPU; the rest stay in RAM for the CPU.

    Another important flag that may interest you is -c, which sets the context size.
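    A rough way to pick the number for -ngl is to estimate how many layers fit in your VRAM. The sketch below is back-of-the-envelope only; the function name, the even-split assumption, and all the numbers (model size, layer count, overhead) are illustrative, not measured values.

```python
# Rough heuristic for choosing -ngl: how many layers fit in a VRAM budget?
# Assumption: weights are spread roughly evenly across layers, and some VRAM
# is reserved for the KV cache and runtime buffers (overhead_gb).

def layers_that_fit(model_size_gb, n_layers, vram_gb, overhead_gb=1.0):
    """Estimate how many transformer layers fit in `vram_gb` of VRAM."""
    per_layer_gb = model_size_gb / n_layers
    budget = max(vram_gb - overhead_gb, 0)
    return min(n_layers, int(budget / per_layer_gb))

# Example: a ~4 GB quantized model with 32 layers on an 8 GB card.
print(layers_that_fit(4.0, 32, 8.0))   # → 32 (all layers fit)

# Example: a ~13 GB model with 40 layers on a 6 GB card.
print(layers_that_fit(13.0, 40, 6.0))  # → 15, so try -ngl 15
```

    In practice you'd start around the estimate and nudge -ngl up or down while watching VRAM usage.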

    I would be surprised if you couldn’t just connect to the llama.cpp server, or set text-generation-webui to do the same with some setting.
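    For reference, llama-server exposes an OpenAI-compatible HTTP API (default port 8080). A minimal sketch of the request a client would send, assuming a server already running locally (only the payload is built here; sending it needs the server up):

```python
import json

# llama-server's OpenAI-compatible chat endpoint (default port assumed).
url = "http://localhost:8080/v1/chat/completions"

# The server serves whatever model it was started with, so the "model"
# field's value is largely a placeholder here.
payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

body = json.dumps(payload)
print(body)
```

    Any OpenAI-style client (or a UI that speaks that API) can be pointed at that URL instead of text-generation-webui.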

    At worst, you can consider using ollama, which is a wrapper around llama.cpp.

    But you’d probably want to invest the time to learn how to use llama.cpp directly and put a UI in front of it. SillyTavern is a good one for many use cases; Open WebUI can be another, but - in my experience - it tends to have more half-baked features and its development jumps around a lot.

    As a more general answer: no, as far as I know the safetensors format doesn’t directly support quantization.
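    That’s why quantized formats like GGUF exist in the first place. Some rough weight-storage arithmetic for a hypothetical 7B-parameter model (illustrative only; real files add metadata, and bits-per-weight varies by quantization scheme):

```python
# Rough weight-storage math for a 7B model: fp16 vs. a typical 4-bit quant.
# ~4.5 bits/weight is an illustrative figure for 4-bit schemes, not exact.
params = 7_000_000_000

fp16_gb = params * 16 / 8 / 1024**3  # 2 bytes per weight
q4_gb = params * 4.5 / 8 / 1024**3   # ~4.5 bits per weight

print(f"fp16: {fp16_gb:.1f} GiB, ~4-bit: {q4_gb:.1f} GiB")
```

    So quantization is roughly what turns a model that needs a 16 GB card into one that fits alongside other things on an 8 GB one.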