You can, wait for a 4-bit quantized version



I only have an RTX 3070 with 8GB of VRAM. It can run quantized 7B models well, but this is 8x7B. Maybe an RTX 3090 with 24GB of VRAM can do it.


Once it's supported in llama.cpp, it will likely run on CPU with enough RAM, especially given that the GGUF mmap code only seems to use RAM for the parts of the weights that actually get used.
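
Roughly how that would look once support lands, as a minimal sketch with the llama-cpp-python bindings (the model filename here is a placeholder):

    # Sketch: use_mmap=True maps the GGUF file, so only the weight pages
    # actually touched by the active experts need to be resident in RAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # assumed filename
        use_mmap=True,   # map weights instead of reading them all into RAM
        n_ctx=2048,      # context length; the KV cache needs extra RAM on top
    )
    out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])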


Napkin math: 7 x (4/8) x 8 is 28GB, and q4 uses a little more than just 4 bits per param, there's extra overhead for context, and the routing network that selects experts probably adds more on top of that.
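
Spelled out (the 4.5 bits/param figure is just a rough allowance for q4 scales/metadata, not a measured number):

    # Back-of-envelope: 8 experts x ~7B params at ~4.5 bits/param
    # (4-bit weights plus quantization metadata), ignoring context and router.
    params = 8 * 7e9
    bits_per_param = 4.5
    weights_gb = params * bits_per_param / 8 / 1e9
    print(f"~{weights_gb:.0f} GB for weights alone")  # ~32 GB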

It would probably fit in 32GB at 4-bit, but likely won't run with sensible quantization/perf on a 3090/4090 without other tricks like offloading. Depending on how likely the same experts are to be chosen for multiple sequential tokens, offloading experts may be viable; a toy sketch of that idea is below.
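
A toy illustration of that (not how any real runtime does it): keep the expert weights in CPU RAM and copy only the router-selected experts to the GPU, evicting the least recently used one when VRAM runs out. The `experts` list and cache size are hypothetical.

    # Toy sketch of expert offloading with an LRU cache of experts on the GPU.
    # `experts` is assumed to be a list of torch.nn.Module expert FFNs on CPU.
    from collections import OrderedDict

    class ExpertCache:
        def __init__(self, experts, max_on_gpu=2, device="cuda"):
            self.cpu_experts = experts     # expert FFNs kept in CPU RAM
            self.on_gpu = OrderedDict()    # expert_id -> module currently on GPU
            self.max_on_gpu = max_on_gpu
            self.device = device

        def get(self, expert_id):
            if expert_id in self.on_gpu:
                self.on_gpu.move_to_end(expert_id)   # mark as recently used
                return self.on_gpu[expert_id]
            if len(self.on_gpu) >= self.max_on_gpu:
                _, evicted = self.on_gpu.popitem(last=False)
                evicted.to("cpu")                    # free VRAM for the newcomer
            mod = self.cpu_experts[expert_id].to(self.device)
            self.on_gpu[expert_id] = mod
            return mod

    # If consecutive tokens keep picking the same experts, get() is mostly a
    # cache hit and the CPU<->GPU copies stay rare.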


It would be very tight. 8x7B in 24GB (currently) has more overhead than 70B.

It's theoretically doable with quantization from the recent 2-bit quant paper and a custom implementation (in exllamav2?).

EDIT: Actually the download is much smaller than 8x7B would suggest. Not sure how, but it's sized more like a 30B, which is perfect for a 3090. Very interesting.



