I have an unused Dell OptiPlex 7010 that I wanted to use as the base for an inference rig.
My idea was to get a 3060, a PCIe riser, and a 500 W power supply just for the GPU. Mechanically, I was thinking of building a backpack of sorts onto the side panel to hold both the GPU and the extra power supply, since unfortunately it's an SFF machine.
What’s making me wary of going through with it is the 7010 itself: it’s a DDR3 system with a 3rd-gen i7-3770. I have a feeling that as soon as part of the model gets offloaded into system RAM, it’s going to slow to a crawl. (Using koboldcpp, if that matters.)
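For reference, here's the rough math behind my worry. It's only a sketch, and the bandwidth figures and model size are ballpark assumptions, not measurements:

```python
# Back-of-the-envelope token-generation estimate for a partially offloaded model.
# Generation is roughly memory-bandwidth bound: each token has to read all the
# weights once, so the slowest memory pool dominates.
# All numbers below are ballpark assumptions.

GB = 1024**3

model_bytes = 4.5 * GB     # e.g. a ~7-8B model at Q4 (assumed size)
vram_share = 0.7           # fraction of the weights that fit on the 3060
ddr3_bw = 25.6 * GB        # dual-channel DDR3-1600, theoretical peak
gddr6_bw = 360 * GB        # 3060 12GB, theoretical peak

gpu_time = (model_bytes * vram_share) / gddr6_bw
cpu_time = (model_bytes * (1 - vram_share)) / ddr3_bw

tok_per_s = 1 / (gpu_time + cpu_time)
print(f"~{tok_per_s:.1f} tok/s with {vram_share:.0%} of weights in VRAM")
# Even with only 30% of the weights in DDR3, that slice dominates the
# per-token time because the DDR3 read is so much slower than the GPU's.
```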
Do you think it’s even worth going through with?
Edit: I may have found a ThinkCentre that uses DDR4, which I could buy if I manage to sell the 7010. Though I still don’t know if it would be good enough.
No, it’s super efficient! I can run a 27B at the full 128K context on my 3090, easy.
But you have to use the base llama.cpp server. kobold.cpp doesn’t seem to support sliding window attention (last I checked, about two weeks ago), so even a small context takes up a ton of memory there.
And the image input part is optional: delete the mmproj file and it won’t be loaded.
There are all sorts of engine quirks like this, heh, it really is impossible to keep up with.
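To put rough numbers on why SWA matters for memory, here's a sketch. The Gemma 3 27B config values are from memory and may be slightly off, so treat them as assumptions:

```python
# Rough KV-cache size comparison: full attention on every layer vs. sliding
# window attention (SWA) on most layers, the way Gemma 3 does it.
# Config values below are approximate assumptions for Gemma 3 27B.

n_layers = 62
n_kv_heads = 16
head_dim = 128
bytes_per_el = 2        # fp16 K and V entries
ctx = 128 * 1024        # 128K context
window = 1024           # SWA window on the local layers
global_every = 6        # roughly 1 global-attention layer per 5 local ones

def kv_bytes(tokens_per_layer, layers):
    # K + V for each cached token on each layer
    return layers * tokens_per_layer * n_kv_heads * head_dim * 2 * bytes_per_el

full = kv_bytes(ctx, n_layers)
global_layers = n_layers // global_every
local_layers = n_layers - global_layers
swa = kv_bytes(ctx, global_layers) + kv_bytes(window, local_layers)

print(f"full attention: {full / 2**30:.1f} GiB")   # tens of GiB
print(f"with SWA:       {swa / 2**30:.1f} GiB")    # an order of magnitude less
```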
Oh, OK. That changes a lot of things then :-). I think I’ll finally have to graduate to something a little less guided than kobold.cpp. Time to read llama.cpp’s and exllama’s docs, I guess.
Thanks for the tips.
The LLM “engine” is mostly detached from the UI.
kobold.cpp is actually pretty great, and you can still use its UI with TabbyAPI (what you run for exllama) and with the llama.cpp server.
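That works because all of these servers expose a (roughly) OpenAI-compatible HTTP API, so any frontend or script can talk to whichever engine you have running. A minimal sketch with the openai Python package; the ports are just the common defaults, adjust for your setup:

```python
# The same client code works against llama.cpp's server, TabbyAPI, or
# koboldcpp's OpenAI-compatible endpoint -- only the base URL changes.
# Ports are the usual defaults; TabbyAPI may also want its configured API key.
from openai import OpenAI

backends = {
    "llama.cpp": "http://localhost:8080/v1",
    "tabbyapi":  "http://localhost:5000/v1",
    "koboldcpp": "http://localhost:5001/v1",
}

client = OpenAI(base_url=backends["llama.cpp"], api_key="none")

resp = client.chat.completions.create(
    model="whatever-is-loaded",  # most local servers ignore or loosely match this
    messages=[{"role": "user", "content": "Say hi in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```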
I personally love this for writing and testing though:
https://github.com/lmg-anon/mikupad
And Open Web UI for more general usage.
There’s a big backlog of poorly documented knowledge too, heh, so just ask if you’re wondering how to cram a specific model in (there’s a rough sizing sketch after the list below). But the gist of the optimal engine rules is:
For MoE models (like Qwen3 30B), try ik_llama.cpp, which is a fork specifically optimized for big MoEs partially offloaded to CPU.
For Gemma 3 specifically, use the regular llama.cpp server since it seems to be the only thing supporting the sliding window attention (which makes long context easy).
For pretty much anything else, if it’s supported by exllamav3 (and since you have a 3060), it’s optimal to use that via its server, which is called TabbyAPI. You can also use its quantized cache (try Q6 or Q5) to get long context easily.
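And here's the promised sizing sketch for a 12 GB card. All figures are ballpark assumptions, and the example model shape is made up for illustration; real usage runs a bit higher because of activations, buffers, and the unquantized parts of the model:

```python
# Rough "will it fit in 12 GB" estimate for an exl3 quant plus a quantized
# KV cache. Plug in the real layer/head numbers from the model's config.json;
# the ones below are hypothetical.

GB = 1024**3

def weights_gb(n_params_b, bits_per_weight):
    return n_params_b * 1e9 * bits_per_weight / 8 / GB

def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx, cache_bits):
    # K + V per token per layer, at the chosen cache quantization
    return n_layers * ctx * n_kv_heads * head_dim * 2 * cache_bits / 8 / GB

# Hypothetical example: a 12B-class model at 4.0 bpw with a Q6 cache at 32K ctx.
total = (weights_gb(12, 4.0)
         + kv_cache_gb(n_layers=40, n_kv_heads=8, head_dim=128,
                       ctx=32 * 1024, cache_bits=6))
print(f"~{total:.1f} GB of the 3060's 12 GB")
```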