You can, wait for a 4-bit quantized version



I only have an RTX 3070 with 8GB of VRAM. It can run quantized 7B models well, but this is 8x7B. Maybe an RTX 3090 with 24GB of VRAM can do it.


Once it's supported in llama.cpp, it will likely run on CPU with enough RAM, especially given that the GGUF mmap code only seems to use RAM for the parts of the weights that actually get used.
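
Roughly how that would look once support lands, as a minimal sketch with the llama-cpp-python bindings (the model filename here is a placeholder):

    # Sketch: use_mmap=True maps the GGUF file, so only the weight pages
    # actually touched by the active experts need to be resident in RAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # assumed filename
        use_mmap=True,   # map weights instead of reading them all into RAM
        n_ctx=2048,      # context length; the KV cache needs extra RAM on top
    )
    out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])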


Napkin math: 7 x (4/8) x 8 is 28GB, and q4 uses a little more than just 4 bits per param, there's extra overhead for context, and the routing network that selects experts probably adds more on top of that.
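
Spelled out (the 4.5 bits/param figure is just a rough allowance for q4 scales/metadata, not a measured number):

    # Back-of-envelope: 8 experts x ~7B params at ~4.5 bits/param
    # (4-bit weights plus quantization metadata), ignoring context and router.
    params = 8 * 7e9
    bits_per_param = 4.5
    weights_gb = params * bits_per_param / 8 / 1e9
    print(f"~{weights_gb:.0f} GB for weights alone")  # ~32 GB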

It would probably fit in 32GB at 4-bit, but likely won't run with sensible quantization/perf on a 3090/4090 without other tricks like offloading. Depending on how likely the same experts are to be chosen for multiple sequential tokens, offloading experts may be viable; a toy sketch of that idea is below.
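
A toy illustration of that (not how any real runtime does it): keep the expert weights in CPU RAM and copy only the router-selected experts to the GPU, evicting the least recently used one when VRAM runs out. The `experts` list and cache size are hypothetical.

    # Toy sketch of expert offloading with an LRU cache of experts on the GPU.
    # `experts` is assumed to be a list of torch.nn.Module expert FFNs on CPU.
    from collections import OrderedDict

    class ExpertCache:
        def __init__(self, experts, max_on_gpu=2, device="cuda"):
            self.cpu_experts = experts     # expert FFNs kept in CPU RAM
            self.on_gpu = OrderedDict()    # expert_id -> module currently on GPU
            self.max_on_gpu = max_on_gpu
            self.device = device

        def get(self, expert_id):
            if expert_id in self.on_gpu:
                self.on_gpu.move_to_end(expert_id)   # mark as recently used
                return self.on_gpu[expert_id]
            if len(self.on_gpu) >= self.max_on_gpu:
                _, evicted = self.on_gpu.popitem(last=False)
                evicted.to("cpu")                    # free VRAM for the newcomer
            mod = self.cpu_experts[expert_id].to(self.device)
            self.on_gpu[expert_id] = mod
            return mod

    # If consecutive tokens keep picking the same experts, get() is mostly a
    # cache hit and the CPU<->GPU copies stay rare.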


It would be very tight. 8x7B in 24GB (currently) has more overhead than 70B.

It's theoretically doable with quantization from the recent 2-bit quant paper and a custom implementation (in exllamav2?).

EDIT: Actually the download is much smaller than 8x7B would suggest. Not sure how, but it's sized more like a 30B, which is perfect for a 3090. Very interesting.



