Depends on the server. It's probably not going to be cost-effective. I get barely ~0.5 tokens/sec.

I have dual E5-2699A v4s with 1.5 TB of DDR4-2933 spread across the 2 sockets.

The full DeepSeek-R1 671B (~1.4 TB) with llama.cpp seems to hit a problem in that local engines that run LLMs don't do NUMA-aware allocation, so cores often have to pull the weights in from the other socket's memory controllers through the inter-socket links (QPI/UPI/HyperTransport) and bottleneck there.

For my platform that's 2x QPI links @ ~39.2 GB/s per link, and they get saturated.
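
Rough back-of-envelope, as a sketch only: if decode is memory-bandwidth bound and most weight reads cross the interconnect, tokens/s is roughly interconnect bandwidth divided by bytes touched per token. The active-parameter count and bytes/weight below are assumptions, not measurements.

    # Back-of-envelope: decode throughput if weight reads are limited by
    # the inter-socket links rather than local DRAM bandwidth.
    # Assumptions (not measured): ~37B active params/token for the R1 MoE,
    # ~2 bytes/weight, and most reads crossing QPI.
    qpi_bw_gb_s = 2 * 39.2           # two QPI links, ~39.2 GB/s each
    active_params = 37e9             # assumed active params per token (MoE)
    bytes_per_weight = 2.0           # assumed
    bytes_per_token = active_params * bytes_per_weight
    tokens_per_s = (qpi_bw_gb_s * 1e9) / bytes_per_token
    print(f"~{tokens_per_s:.1f} tokens/s upper bound")   # ~1.1 tokens/s

That's an upper bound of about 1 token/s before any compute or scheduling overhead, which lines up with the ~0.5 tokens/s I actually see.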

I give it a prompt, go to work and check back on it at lunch and sometimes it's still going.

If you want interactive use I'd aim for 7-10 tokens/s, so realistically that means running one of the 8B models on a GPU (~30 tokens/s) or maybe a 70B model on an M4 Max (~8 tokens/s).
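
If you want to check whether a given setup clears that bar, a minimal sketch (assuming llama-cpp-python is installed; the model path is hypothetical) is just to time a short generation:

    # Minimal throughput check with llama-cpp-python (hypothetical model path).
    import time
    from llama_cpp import Llama

    llm = Llama(model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",
                n_gpu_layers=-1)      # offload all layers to the GPU if one is present

    t0 = time.time()
    out = llm("Explain NUMA in one paragraph.", max_tokens=128)
    dt = time.time() - t0

    n = out["usage"]["completion_tokens"]
    print(f"{n} tokens in {dt:.1f}s -> {n / dt:.1f} tokens/s")

If that number is comfortably above ~7 tokens/s, the model/hardware combo is usable interactively.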
