
I have a server at home sitting IDLE for the last 2 years with 2 TB of RAM and 4 CPUs.

I am gonna push it this week and run some LLMs to see how they perform!

How efficient are they to run locally, in terms of the electric bill?




Depends on the server. Probably not going to be cost effective. I get barely ~0.5 tokens/sec.

I have Dual E5-2699A v4 w/1.5 TB DDR4-2933 spread across 2 sockets.

The full Deepseek-R1 671B (~1.4 TB) with llama.cpp seems to hit a bottleneck in that local engines that run the LLMs don't do NUMA-aware allocation, so cores often have to pull the weights in from the other socket's memory controllers through the inter-socket links (QPI/UPI/HyperTransport) and saturate them.

For my platform that's 2 QPI links at ~39.2 GB/s per link, and they get saturated.
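
A rough back-of-the-envelope sketch of why that link bandwidth caps throughput. The ~37B active params per token for Deepseek-R1, the ~2 bytes/param implied by the 1.4 TB figure, and the 50/50 remote split are assumptions for illustration, not measurements:

    # Bandwidth-bound upper limit on token rate, assuming weights sitting on the
    # remote socket must cross the QPI links for every generated token.
    qpi_links = 2
    qpi_gb_per_s_per_link = 39.2                                # figure quoted above
    cross_socket_gb_per_s = qpi_links * qpi_gb_per_s_per_link   # ~78 GB/s

    # Deepseek-R1 is MoE: ~37B params active per token (assumption), and at
    # ~2 bytes/param (1.4 TB / 671B params) that's roughly 77 GB read per token.
    active_params = 37e9
    bytes_per_param = 1.4e12 / 671e9
    gb_read_per_token = active_params * bytes_per_param / 1e9

    remote_fraction = 0.5                        # naive even split across 2 sockets
    s_per_token = gb_read_per_token * remote_fraction / cross_socket_gb_per_s
    print(f"QPI-only ceiling: ~{1 / s_per_token:.1f} tokens/s")

Even that optimistic ceiling is only a couple of tokens/s; add the local DRAM bandwidth limit and non-ideal access patterns and the ~0.5 tokens/s above isn't surprising.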

I give it a prompt, go to work and check back on it at lunch and sometimes it's still going.

If you want interactive use I'd aim for 7-10 tokens/s, so realistically that means running one of the 8B models on a GPU (~30 tokens/s) or maybe a 70B model on an M4 Max (~8 tokens/s).
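
To put those rates in perspective, a quick sketch (the 500-token answer length is just an illustrative assumption):

    # Wall-clock time for a hypothetical 500-token answer at each rate.
    answer_tokens = 500
    for label, tok_per_s in [("dual-socket Xeon above", 0.5),
                             ("70B on M4 Max", 8),
                             ("8B on a GPU", 30)]:
        minutes = answer_tokens / tok_per_s / 60
        print(f"{label}: ~{minutes:.1f} min")

That works out to roughly 17 minutes per answer at 0.5 tokens/s versus about a minute or less at the interactive rates.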


Unless it's actively processing something it's also sitting idle, so pretty efficient, aside from vacuuming up all your system memory.



