I did nothing and I’m all out of ideas!

  • 2 Posts
  • 10 Comments
Joined 3 years ago
Cake day: June 11th, 2023

  • I’m not sure what you’re really after. If you’re looking for frontier-model capabilities with a good privacy policy… the answer is no.

    If you are interested in privacy and can take a hit to performance, there’s Lumo by Proton, which I’ve never tried personally, but it should use open models, and there should be a list of them somewhere on their site.

    Otherwise you can go European with Mistral’s Le Chat, which is not as good as the multi-billion-dollar companies’ offerings, but it is quite good. I tend to use this one. Check the settings to disable data training.

    Last but not least, you can use a wrapper around the frontier models, like the one offered by DuckDuckGo. There are many.

    If you don’t mind paying, there are no-logs services that give you access to Kimi K2-level models. Or you could spin up something yourself on RunPod- or Vast.ai-style GPU rentals.
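
    If you go the rental route, the rough idea is: rent a box with enough VRAM, start an inference server on it, and point your client at that. Purely as a sketch (the model URL, file name and numbers here are placeholders, not a recommendation), something like:

        # on the rented GPU box: grab a quantized model and serve it with llama.cpp
        wget https://example.com/some-model.Q4_K_M.gguf
        ./llama-server -m some-model.Q4_K_M.gguf -ngl 99 --host 0.0.0.0 --port 8080
        # then reach it from your machine over the box's IP or an SSH tunnel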

    So. It depends.

  • I don’t have direct experience with RooCode and Cline, but I would be mighty surprised if they worked with models weaker than even the old Qwen2.5-Coder 32B, and even that was mostly misses. I never tried the Qwen3 coder, but I assume it is not drastically different.

    Those small models are at most useful as a kind of smarter autocomplete, not for driving a full tool-calling framework.

    BTW, you could also check out Aider for a different approach; they publish a lot of benchmarks that can help you get an idea of what’s needed.
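
    If you want to try Aider against a locally served model, it talks to OpenAI-compatible APIs, so, if I remember their docs right, it’s roughly this (the base URL and model name are just assumptions for a local llama.cpp server):

        export OPENAI_API_BASE=http://127.0.0.1:8080/v1
        export OPENAI_API_KEY=dummy
        aider --model openai/my-local-model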

  • I’ve never used oobabooga, but if you use llama.cpp directly you can specify the number of layers you want to run on the GPU with the -ngl flag, followed by the number.

    So, as an example, a command (on Linux) to run its server, from the directory where you have the binary, would look something like: ./llama-server -m "/path/to/model.gguf" -ngl 10

    This will put 10 layers of the model on the GPU; the rest will stay in RAM for the CPU.

    Another important flag that might interest you is -c, for the context size.
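
    Putting the two together, offloading 10 layers and bumping the context to, say, 8192 tokens would look something like this (path and numbers are just examples to adapt):

        ./llama-server -m "/path/to/model.gguf" -ngl 10 -c 8192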

    I would be surprised if you can’t just connect to the llama.cpp server directly, or set text-generation-webui to do the same with some setting.
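
    For what it’s worth, llama-server exposes an OpenAI-compatible HTTP API, so a quick way to check that a frontend (or anything else) can reach it is something like this, assuming the default 127.0.0.1:8080:

        curl http://127.0.0.1:8080/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{"messages": [{"role": "user", "content": "Hello"}]}'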

    At worst you can consider using ollama, which is a llama.cpp wrapper.
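
    If you do go with ollama, day-to-day use is basically just the following (the model tag is only an example); it decides the GPU/CPU split for you, which is convenient but gives you less control than llama.cpp’s -ngl:

        ollama pull llama3.1:8b
        ollama run llama3.1:8b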

    But you would probably want to invest the time to learn how to use llama.cpp directly and put a UI in front of it. SillyTavern is a good one for many use cases; OpenWebUI can be another, but in my experience it tends to have more half-baked features and the development jumps around a lot.

    As a more general answer: no, the safetensors format doesn’t directly support quantization, as far as I know.
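
    If the end goal is a quantized version of a safetensors model, the usual route is converting it to GGUF and then quantizing that with llama.cpp’s tools; roughly like this (script and binary names are from recent llama.cpp checkouts, so double-check them against your version):

        # from the llama.cpp repo: convert the HF/safetensors model to GGUF
        python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf
        # then quantize, e.g. to Q4_K_M
        ./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M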