
> You won't be able to run this on your home GPU.

As far as I understand, in an MoE model only one or a few experts are actually used at a time, so shouldn't the inference speed for this new MoE model be roughly the same as for a normal Mistral 7B?

7B models have reasonable throughput when run on a beefy CPU, especially when quantized down to 4-bit precision, so couldn't Mixtral be comfortably run on a CPU too, just with 8 times the memory footprint?
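
For reference, 4-bit CPU inference of a 7B already looks roughly like this with llama-cpp-python (just a sketch: the model path is a placeholder and assumes a GGUF quantization of the model exists):

    # Sketch of 4-bit quantized CPU inference via llama-cpp-python.
    # The model filename below is a placeholder, not a real release.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
        n_ctx=2048,      # context window
        n_threads=16,    # lean on the beefy CPU
    )

    out = llm("Q: What is a mixture-of-experts model?\nA:", max_tokens=128)
    print(out["choices"][0]["text"])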




So this specific model ships with a default config of 2 experts per token.

So you need roughly two experts loaded in memory per token, which works out to roughly the speed and memory of a 13B model per token.

The only issue is that it's per token: 2 experts are chosen per token, which means if they aren't the same ones as for the last token, you need to load them into memory.
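
A minimal sketch of that top-2 routing, with toy dimensions and random matrices standing in for the real experts (not Mixtral's actual code):

    import numpy as np

    rng = np.random.default_rng(0)

    NUM_EXPERTS, HIDDEN = 8, 16   # toy sizes, nothing like the real model dims
    gate_w = rng.standard_normal((HIDDEN, NUM_EXPERTS))
    experts = [rng.standard_normal((HIDDEN, HIDDEN)) for _ in range(NUM_EXPERTS)]

    def moe_layer(hidden):
        """Route one token's hidden state through its top-2 experts."""
        logits = hidden @ gate_w                  # one router score per expert
        top2 = np.argsort(logits)[-2:]            # the 2 experts picked for THIS token
        w = np.exp(logits[top2] - logits[top2].max())
        w /= w.sum()                              # softmax over just the chosen pair
        out = sum(wi * (hidden @ experts[i]) for wi, i in zip(w, top2))
        return out, top2

    # Different tokens can pick different pairs, which is why all 8 experts
    # need to stay resident in memory to avoid reloading from disk.
    for t in range(3):
        _, picked = moe_layer(rng.standard_normal(HIDDEN))
        print(f"token {t}: experts {sorted(picked.tolist())}")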

So yeah, to not be disk-limited you'd need roughly 8 times the memory, and it would run at the speed of a 13B model.
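
Back-of-envelope version of that estimate, treating each expert as a full 7B model (a simplification: non-expert weights are shared, so the real total is somewhat smaller):

    # Rough numbers behind "13B-class speed, ~8x7B memory".
    PARAMS_PER_EXPERT = 7e9
    NUM_EXPERTS, EXPERTS_PER_TOKEN = 8, 2
    BYTES_PER_PARAM_4BIT = 0.5

    active_per_token = EXPERTS_PER_TOKEN * PARAMS_PER_EXPERT  # params touched per token
    resident_total   = NUM_EXPERTS * PARAMS_PER_EXPERT        # everything kept in RAM

    print(f"active per token : ~{active_per_token / 1e9:.0f}B params (13B-class compute)")
    print(f"4-bit resident   : ~{resident_total * BYTES_PER_PARAM_4BIT / 1e9:.0f} GB")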

~~~Note on quantization: iirc smaller models lose more performance when quantized than larger models do, so this would be the speed of a 4-bit 13B model but with the quantization penalty of a 4-bit 7B model.~~~ Actually, I have zero idea how quantization scales for MoE; I imagine it has the penalty I mentioned, but that's pure speculation.



