I’m kind of more-sympathetic to Microsoft than to some of the other companies involved.
Microsoft is trying to leverage the Windows platform that they control to enable local LLM use. I’m not at all sure that there’s actually enough memory out there to do that, or that it’s cost-effective to put a ton of memory and compute capacity in everyone’s home rather than time-sharing hardware in datacenters. Nor am I sold that laptops (which many “Copilot PCs” are) are a fantastic place to be doing a lot of heavyweight parallel compute.
But…from a privacy standpoint, I kind of would like local LLMs to at least be available, even if they aren’t as affordable as cloud-based stuff. And Microsoft is at least supporting that route. A lot of companies are going to be oriented towards just doing AI stuff in the cloud.
Is that true? I haven’t heard MS say anything about enabling local LLMs. Genuinely curious and would like to know more.
Isn’t that the whole shtick of the AI PCs no one wanted? Like, isn’t there some kind of non-GPU co-processor that runs the local models more efficiently than the CPU?
I don’t really want local LLMs but I won’t begrudge those who do. Still, I wouldn’t trust any proprietary system’s local LLMs to not feed back personal info for “product improvement” (which for AI is your data to train on).
NPU: neural processing unit. That’s why they have the “Copilot PC” hardware requirement, because they’re using an NPU on the local machine. Microsoft’s pitch: “Copilot+ PCs are a new class of Windows 11 hardware powered by a high-performance Neural Processing Unit (NPU), a specialized computer chip for AI-intensive processes like real-time translations and image generation, that can perform more than 40 trillion operations per second (TOPS).”

*searches*

https://learn.microsoft.com/en-us/windows/ai/npu-devices/
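For the curious, the usual way to actually target one of those NPUs from your own code seems to be ONNX Runtime with a vendor-specific execution provider. A rough Python sketch, where the model file and the available providers are assumptions (on non-Snapdragon machines you’d likely fall back to DirectML or the CPU):

```python
# Rough sketch: running an ONNX model on a Copilot+ PC's NPU via ONNX Runtime.
# Assumes onnxruntime (onnxruntime-qnn on Snapdragon machines) is installed and
# that "model.onnx" is a model you've already exported; both are illustrative.
import numpy as np
import onnxruntime as ort

available = ort.get_available_providers()
# QNNExecutionProvider targets Qualcomm NPUs; DmlExecutionProvider is the DirectML GPU path.
preferred = [p for p in ("QNNExecutionProvider", "DmlExecutionProvider") if p in available]
session = ort.InferenceSession("model.onnx", providers=preferred + ["CPUExecutionProvider"])

inp = session.get_inputs()[0]
# Feed zeros of the right shape just to exercise the session (dynamic dims become 1).
dummy = np.zeros([d if isinstance(d, int) else 1 for d in inp.shape], dtype=np.float32)

print("Running on:", session.get_providers()[0])
print(session.run(None, {inp.name: dummy})[0].shape)
```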
It’s not…terribly beefy. Like, I have a Framework Desktop with an APU and 128GB of memory that schlorps down 120W or something, and it substantially outdoes what you’re going to do on a laptop. And that in turn is weaker computationally than something like the big Nvidia hardware going into datacenters.
But it is doing local computation.
If Microsoft cared about privacy then they wouldn’t have made Windows practically spyware. Even if they install AI locally in the OS, it’s still proprietary software that constantly sends data back to the mothership, consuming your electricity and RAM to do so. Linux has so many options, there’s really no reason not to switch.
Small LLMs already exist for local self-hosting, and there are open-source options which won’t steal your data and turn you into a product.
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/
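If anyone wants to kick the tires, a minimal local-inference sketch with Hugging Face transformers looks roughly like this; the model name is just an example of a small open-weight model, so swap in whatever the leaderboard and your RAM suggest:

```python
# Minimal local text-generation sketch with Hugging Face transformers.
# The model name is an example of a small open-weight model; substitute your own pick.
# Everything runs locally after the one-time weight download.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-1.5B-Instruct",  # example; a few GB of weights, fits modest hardware
    device_map="auto",                   # needs `accelerate`; uses a GPU if present, else CPU
)

result = generator("Why might someone self-host an LLM?", max_new_tokens=128)
print(result[0]["generated_text"])
```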
Bear in mind that the number of parameters your system can handle is limited by how much memory is available, and using a quantized version lets you fit more parameters into the same amount of memory.
Unless you have some really serious hardware, 24 billion parameters is probably the maximum that would be practical for self-hosting on a reasonable hobbyist set-up. But I’m no expert, so do some research and calculate for yourself what your system can handle.
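As a rough back-of-the-envelope check (my own arithmetic, not anything from the comment above): the weights alone take about parameters × bits-per-parameter ÷ 8 bytes, before you count the KV cache and runtime overhead.

```python
# Back-of-the-envelope memory estimate for LLM weights at different precisions.
# Ignores the KV cache, activations, and runtime overhead, which add several GB more.
def weight_gib(params_billions: float, bits_per_param: float) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 2**30

for params in (7, 24, 106):
    for label, bits in (("fp16", 16), ("8-bit", 8), ("~4-bit (Q4_K_M)", 4.8)):
        print(f"{params:>4}B @ {label:<15} : {weight_gib(params, bits):6.1f} GiB")
```

By that estimate a 24B model at fp16 is already around 45 GiB of weights, while a 4-bit quant of the same model is closer to 13 GiB, which is why quantization is what makes hobbyist self-hosting practical at all.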
Eh…I don’t know if you’d call it “really serious hardware”, but when I picked up my 128GB Framework Desktop, it was $2k (without storage), and that box is often described as being aimed at the hobbyist AI market. That’s pricier than most video cards, but an AMD Radeon RX 7900 XTX GPU was north of $1k, an Nvidia RTX 4090 was about $2k, and it looks like the Nvidia RTX 5090 is presently something over $3k (and rising) on eBay, well over MSRP. None of those GPUs are dedicated hardware aimed at doing AI compute, just high-end cards aimed at playing games that people have used to do AI stuff on.
I think that the largest LLM I’ve run on the Framework Desktop was a 106 billion parameter GLM model at Q4_K_M quantization. It was certainly usable, and I wasn’t trying to squeeze as large a model as possible on the thing. I’m sure that one could run substantially-larger models.
EDIT: Also, some of the newer LLMs are MoE-based, and for those, it’s not necessarily unreasonable to offload expert layers to main memory. If a particular expert isn’t being used, it doesn’t need to live in VRAM. That relaxes some of the hardware requirements, from needing a ton of VRAM to just needing a fair bit of VRAM plus a ton of main memory.
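To make that concrete, here’s roughly what partial offload looks like with llama-cpp-python. The GGUF filename and layer count are made up; this sketch only shows coarse per-layer offload (newer llama.cpp builds can additionally pin just the expert tensors to system RAM), but the idea is the same: whatever doesn’t fit in VRAM lives in main memory.

```python
# Illustration of partial GPU offload with llama-cpp-python; path and numbers are hypothetical.
# Layers that don't get offloaded to VRAM stay in system RAM and run on the CPU,
# which is what lets a big-RAM box run a large quantized (or MoE) model at all.
from llama_cpp import Llama

llm = Llama(
    model_path="glm-4.5-air-Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,   # put only as many layers in VRAM as actually fit; the rest stay in RAM
    n_ctx=8192,        # context window; the KV cache costs memory too
)

out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```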
See, you have more experience in the matter than I do, hence the caveat that I’m not an expert. Thanks for sharing your experience.
Then again, I’d consider 128GB of memory to be fairly serious hardware, but if that’s common among hobbyists then I stand corrected. I was operating on the assumption that 64GB of RAM is already a lot.
All in all, 106 billion parameters on 128GB of memory with quantization doesn’t surprise me all that much. But again, I’m just going off of the vague notions I’ve gathered from reading about it.
The focus of my original comment was more on the fact that self-hosting is an option; I wasn’t trying to be too precise with the specs. My bad if it came off that way.
Microsoft wants developers to have local access to models but end users are 100% corralled into OneDrive and Copilot. I’m not sympathetic to them at all.
They’re trying to leverage their Windows platform to seek rent (sell premium cloud services like LLM access) for shit people don’t even want, because they aren’t satisfied making very respectable money on licenses.
I wouldn’t trust a local LLM solution from a large American company. Not saying that they would try to “pull a fast one”, but they are unreliable and corrupt.