
I would use Ollama. The answer depends on your RAM size: a 13B model at 4-bit just fits on a 16GB machine, and a 34B at 4-bit will require 32GB.
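
As a rough back-of-the-envelope check, here's the usual params × bits-per-weight rule of thumb (the 1.2x overhead factor for KV cache and runtime is just an assumption, not an exact figure):

    # Rough rule of thumb: weights take params * bits / 8 bytes,
    # plus extra for KV cache, activations, and runtime overhead.
    def est_gb(params_billion, bits, overhead=1.2):
        weight_gb = params_billion * bits / 8  # 1B params at 8-bit ~= 1 GB
        return weight_gb * overhead

    for p in (7, 13, 34, 70):
        print(f"{p}B @ 4-bit ~= {est_gb(p, 4):.1f} GB")
    # 13B lands around 7-8 GB, 34B around 20 GB, 70B around 40+ GB,
    # which is why 16 / 32 / 64 GB machines are the usual cutoffs.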



What is the best model that would run on a 64GB M2 Max? I'm interested in trying a 70B model, but not gimped down to 2-bit if I can avoid it. Would I be able to run a 4-bit 70B?


The biggest issue with local LLMs is that "best" is extremely relative. As in, best for what application? If you want a generic LLM that does OK at everything, you can try Vicuna. If you want coding, Code Llama 34B is really good. Surprisingly, the 13B Code Llama isn't quite as good, but it's pretty darn close to the 34B.


I have 8GB of RAM, so I am able to run a 7B 4-bit model at most.

I used GGUF + llama.cpp.
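
For anyone following along, this is roughly what that looks like through the llama-cpp-python bindings (the model path and quantization suffix below are placeholders for whatever GGUF you downloaded):

    from llama_cpp import Llama

    # Placeholder path: any 7B Q4 GGUF fits on an 8GB machine.
    llm = Llama(
        model_path="./models/llama-2-7b.Q4_K_M.gguf",
        n_ctx=2048,        # context window
        n_gpu_layers=-1,   # offload all layers if built with Metal/CUDA
    )

    out = llm("Q: Name the planets in the solar system. A:",
              max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])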

Will try out Ollama.

Thanks for the info!


Does each parameter need double the allocation size?


Not really, a 13B 4-bit model fits in about 8GB of RAM. But you have to factor in the unified memory aspect of M-series chips: your "VRAM" and RAM come from the same pool, and the OS and applications share that pool too, which is why a 16GB machine can only comfortably load a 13B 4-bit model.
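
One way to sanity-check that before loading is to compare against memory that's actually free rather than total RAM (small sketch using psutil; the 1.2x overhead factor is again just an assumption):

    import psutil

    def fits_in_memory(params_billion, bits=4, overhead=1.2):
        # Weights take params * bits / 8 bytes, plus headroom for the
        # KV cache and runtime; the OS and apps already claim their share.
        needed = params_billion * bits / 8 * overhead  # GB
        available = psutil.virtual_memory().available / 1e9
        return needed, available, needed < available

    needed, available, ok = fits_in_memory(13)
    print(f"need ~{needed:.1f} GB, {available:.1f} GB free -> {'ok' if ok else 'too big'}")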



