wongarsu on Dec 8, 2023 | on: Mistral "Mixtral" 8x7B 32k model [magnet]

One viable strategy might be to offload as many experts as will fit to the GPU and evaluate the remaining ones on the CPU. If you collect some statistics on which experts are used most in your use cases and select those for GPU acceleration, you might get some cheap but notable speedups over other approaches.
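A rough sketch of the selection step, in Python. Everything here is an assumption for illustration: the routing log is a made-up list of (layer, expert) activations that you would have to record yourself, `gpu_slots` is a hypothetical count of how many experts fit in VRAM, and all experts are assumed to be the same size. It does not use any real inference library's API.

```python
from collections import Counter


def pick_gpu_experts(routing_log, gpu_slots):
    """Return the `gpu_slots` most frequently routed (layer, expert) pairs."""
    counts = Counter(routing_log)
    return {pair for pair, _ in counts.most_common(gpu_slots)}


def gpu_hit_rate(routing_log, gpu_experts):
    """Fraction of routed expert calls that would land on the GPU."""
    log = list(routing_log)
    if not log:
        return 0.0
    return sum(pair in gpu_experts for pair in log) / len(log)


# Made-up routing log: each entry is one (layer, expert) activation
# observed while running a representative workload.
log = [(0, 3), (0, 3), (0, 3), (0, 5), (1, 2), (1, 2), (1, 7)]

# Hypothetical budget: assume only two experts fit in GPU memory.
on_gpu = pick_gpu_experts(log, gpu_slots=2)
hit_rate = gpu_hit_rate(log, on_gpu)
```

The hit rate is a quick way to check whether the statistics are worth acting on: if expert usage turns out to be close to uniform for your workload, picking by frequency will not beat picking experts arbitrarily.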