
Correct. You still want all of the weights loaded, but on each forward pass only a few experts are activated, so the compute per token is less than in a dense model of the same size.
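
To make that concrete, here is a minimal sketch of top-k expert routing in PyTorch (the layer size, expert count, and top_k value are made up for illustration, not taken from any specific model). Every expert's weights sit in memory, but each token only runs through the experts the router picks for it:

    import torch
    import torch.nn as nn

    class TinyMoELayer(nn.Module):
        def __init__(self, d_model=64, num_experts=8, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
            self.router = nn.Linear(d_model, num_experts)
            self.top_k = top_k

        def forward(self, x):  # x: (tokens, d_model)
            scores = self.router(x)                         # (tokens, num_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)  # pick top_k experts per token
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e                # tokens routed to expert e in this slot
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

So memory scales with the total number of experts, while compute per token scales with top_k.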

That being said, there are libraries that can load a model layer by layer (say, from an SSD) and technically perform inference with ~8 GB of RAM, but it would be really, really slow.
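
A rough sketch of the layer-by-layer idea in PyTorch (the file layout, e.g. one "layer_{i}.pt" state dict per block, and the make_block() helper are assumptions for illustration, not any particular library's API):

    import torch

    def streamed_forward(hidden, layer_dir, num_layers, make_block):
        # Keep only one transformer block's weights in RAM at a time;
        # everything else stays on disk (e.g. an SSD).
        for i in range(num_layers):
            block = make_block()                             # build an empty block
            state = torch.load(f"{layer_dir}/layer_{i}.pt",
                               map_location="cpu")           # read this block's weights from disk
            block.load_state_dict(state)
            with torch.no_grad():
                hidden = block(hidden)                       # run just this block
            del block, state                                 # free before loading the next layer
        return hidden

The slowness comes from re-reading every layer's weights from storage on each forward pass, which is far slower than keeping them resident in RAM or VRAM.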



Can you give me a name, please? Is that distributed-llama or something else?


I have not used it, but this is probably it: https://github.com/lyogavin/airllm
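
Usage looks roughly like the example in that repo's README (not verified, since I haven't used it; treat the class name and arguments as assumptions and check the README for the current API):

    from airllm import AutoModel  # exact entry point may differ by version

    model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

    input_tokens = model.tokenizer(
        ["What is the capital of the United States?"],
        return_tensors="pt",
        truncation=True,
        max_length=128,
    )

    generation_output = model.generate(
        input_tokens["input_ids"].cuda(),
        max_new_tokens=20,
        use_cache=True,
        return_dict_in_generate=True,
    )

    print(model.tokenizer.decode(generation_output.sequences[0]))

Under the hood it loads and runs one layer at a time, which is what keeps the RAM footprint low.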



