
I would use Ollama. The answer depends on your RAM size: a 13B model at 4-bit just fits on a 16GB machine, and a 34B at 4-bit will require 32GB.
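
As a rough back-of-the-envelope check, here's the usual params × bits-per-weight rule of thumb (the 1.2x overhead factor for KV cache and runtime is just an assumption, not an exact figure):

    # Rough rule of thumb: weights take params * bits / 8 bytes,
    # plus extra for KV cache, activations, and runtime overhead.
    def est_gb(params_billion, bits, overhead=1.2):
        weight_gb = params_billion * bits / 8  # 1B params at 8-bit ~= 1 GB
        return weight_gb * overhead

    for p in (7, 13, 34, 70):
        print(f"{p}B @ 4-bit ~= {est_gb(p, 4):.1f} GB")
    # 13B lands around 7-8 GB, 34B around 20 GB, 70B around 40+ GB,
    # which is why 16 / 32 / 64 GB machines are the usual cutoffs.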



What is the best model that would run on a 64GB M2 Max? I'm interested in trying a 70B model, but not gimped down to 2-bit if I can avoid it. Would I be able to run a 4-bit 70B?


The biggest issue with local LLMs is that "best" is extremely relative. As in, best for what application? If you want a generic LLM that does OK at everything, you can try Vicuna. If you want coding, Code Llama 34B is really good. Surprisingly, the 13B Code Llama isn't quite as good, but it's pretty darn close to the 34B.


I have 8GB of RAM, so I am able to run a 7B 4-bit model at most.

I used GGUF + llama.cpp.
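
For anyone following along, this is roughly what that looks like through the llama-cpp-python bindings (the model path and quantization suffix below are placeholders for whatever GGUF you downloaded):

    from llama_cpp import Llama

    # Placeholder path: any 7B Q4 GGUF fits on an 8GB machine.
    llm = Llama(
        model_path="./models/llama-2-7b.Q4_K_M.gguf",
        n_ctx=2048,        # context window
        n_gpu_layers=-1,   # offload all layers if built with Metal/CUDA
    )

    out = llm("Q: Name the planets in the solar system. A:",
              max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])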

Will try out Ollama.

Thanks for the info!


Does each parameter need double the allocation size?


Not really, a 13B 4-bit model fits in about 8GB of RAM. But you have to factor in the unified memory aspect of M-series chips: your "VRAM" and RAM come from the same pool, and the OS and applications share that pool too, which is why a 16GB machine can only comfortably load a 13B 4-bit model.
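
One way to sanity-check that before loading is to compare against memory that's actually free rather than total RAM (small sketch using psutil; the 1.2x overhead factor is again just an assumption):

    import psutil

    def fits_in_memory(params_billion, bits=4, overhead=1.2):
        # Weights take params * bits / 8 bytes, plus headroom for the
        # KV cache and runtime; the OS and apps already claim their share.
        needed = params_billion * bits / 8 * overhead  # GB
        available = psutil.virtual_memory().available / 1e9
        return needed, available, needed < available

    needed, available, ok = fits_in_memory(13)
    print(f"need ~{needed:.1f} GB, {available:.1f} GB free -> {'ok' if ok else 'too big'}")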



