danielhanchen | 6 days ago | on: Deepseek R1-0528
Oh, you can still run them unquantized! See https://docs.unsloth.ai/basics/llama-4-how-to-run-and-fine-t... where we show how you can offload all MoE layers to system RAM and leave the non-MoE layers on the GPU - the speed is still pretty good!
You can do it via `-ot ".ffn_.*_exps.=CPU"`
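For reference, a minimal sketch of how that flag fits into a full llama.cpp invocation (the GGUF filename and the -ngl value here are illustrative placeholders, not from the original comment):

    # Hypothetical llama.cpp server launch; model path and -ngl are placeholders.
    # -ot / --override-tensor maps tensors whose names match the regex to a
    # given backend buffer: here every MoE expert FFN tensor (".ffn_.*_exps.")
    # stays in system RAM (CPU), while -ngl offloads the remaining layers to GPU.
    ./llama-server \
      -m DeepSeek-R1-0528-Q4_K_M.gguf \
      -ngl 99 \
      -ot ".ffn_.*_exps.=CPU"

The idea is that the expert weights dominate the memory footprint but only a few experts are active per token, so keeping them in system RAM costs less speed than you'd expect, while the dense (non-MoE) layers that run every token stay on the GPU.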
behnamoh | 6 days ago
Thanks, I'll try it! I guess "mixing" GPU+CPU would hurt the perf, though.