Any 7B can run well (~50 tok/s) on an 8 GB GPU if you tune the context size. A 13B can sometimes run well, but typically you'll end up with either a tiny context window or slow inference. For CPU, I wouldn't recommend going above 1.3B unless you don't mind waiting around.
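For a rough sense of why those are the limits, here's a back-of-envelope VRAM estimate (weights + KV cache only; the layer/hidden numbers are standard Llama-style dims, and real usage runs higher once activations and overhead are included):

```python
def vram_estimate_gb(params_b, bits_per_weight, n_layers, hidden, ctx, kv_bytes):
    # Weights: parameter count * bits per weight.
    weights = params_b * 1e9 * bits_per_weight / 8
    # KV cache: one K and one V vector per layer per token
    # (ignores GQA models like Mistral, which need less than this).
    kv_cache = 2 * n_layers * hidden * ctx * kv_bytes
    return (weights + kv_cache) / 1e9

# 7B at 4 bpw, 4k context, 8-bit cache (kv_bytes=1): ~4.6 GB -> fits in 8 GB with headroom.
print(vram_estimate_gb(7, 4, 32, 4096, 4096, 1))

# 13B at 4 bpw already wants ~7.3 GB even at 2k context -- hence the tiny window.
print(vram_estimate_gb(13, 4, 40, 5120, 2048, 1))
```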
The lazy way is to use text-generation-webui, grab an exllamav2 conversion of your model, and turn down the context length until it fits (and tick the 8-bit cache option). If you go over your VRAM it will cut your speed substantially, like 60 tok/s down to 15 tok/s for an extra 500 tokens of context over what fits. The same idea applies to any other backend, but you need to shove all the layers into VRAM if you want decent tok/s. To give you a starting point: for 7B models on an 8 GB GPU I typically use a 4k-6k context length and a 4-6 bit quantization. So start at 4-bit with 4k context and adjust up as you can.
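If you're curious what the webui is doing under the hood, here's a minimal sketch using the exllamav2 Python API directly. It assumes a recent exllamav2 release (class names have moved around a bit between versions) and a made-up model path:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/SomeModel-7B-4.0bpw"  # placeholder path to an exl2 conversion
config.prepare()
config.max_seq_len = 4096                        # turn this down until everything fits in VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)    # the "8 bit cache" tickbox in the webui
model.load_autosplit(cache)                      # put all layers on the GPU(s)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("The quick brown fox", settings, 128))
```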
You can find most popular models already converted on huggingface.co if you add "exl2" to your search; start with the 4-bit quantized version. Don't bother going above 6 bits even if you have spare VRAM; in practice it doesn't offer much benefit.
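If you'd rather script the download than click around the site, huggingface_hub can pull a specific quantization. The repo id and branch below are placeholders, but exl2 uploaders commonly publish each bitrate on its own branch (check the model card):

```python
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="someuser/SomeModel-7B-exl2",   # placeholder, not a real repo
    revision="4.0bpw",                      # hypothetical branch for the 4-bit version
    local_dir="models/SomeModel-7B-4.0bpw",
)
print("downloaded to", model_dir)
```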
For reference, I max out around 60 tok/s at 4-bit, 50 tok/s at 5-bit, and 40 tok/s at 6-bit for a random 7B model on an RTX 2070.
Just tried it, and it doesn't seem to be working. In fact, I'm getting 1.4 t/s with a Quadro P4000 (8 GB) running a 7B at 3 bits per weight. Are you changing anything other than the 8-bit cache and context?
For reference, I'm getting 10 t/s with a Q5_K_M Mistral GGUF model.