
> For most current models, you need 40+ GB of RAM to train them. Gradient accumulation doesn't work with batch norm, so you really need that memory.
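The batch-norm caveat in the quote can be illustrated with a minimal sketch (pure Python, no framework assumed): batch norm normalizes each sample using the current batch's own mean and variance, so splitting a large batch into micro-batches changes the normalized activations, and therefore the gradients. Plain loss averaging has no such coupling across samples, which is why gradient accumulation is only equivalent to large-batch training when no batch statistics are involved.

```python
def batch_norm(xs, eps=1e-5):
    """Normalize a list of scalars using the batch's own mean/variance,
    as batch norm does at training time (affine scale/shift omitted)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / (var + eps) ** 0.5 for x in xs]

batch = [1.0, 2.0, 3.0, 10.0]

# One large batch: statistics computed over all 4 samples.
full_out = batch_norm(batch)

# Two accumulated micro-batches: each uses only its own statistics.
micro_out = batch_norm(batch[:2]) + batch_norm(batch[2:])

# The normalized values differ, so the gradients accumulated from the
# micro-batches do not reproduce the large-batch computation.
print(full_out[0], micro_out[0])
```

(Workarounds such as freezing batch-norm statistics, using group norm or layer norm, or synchronized batch norm exist precisely because of this mismatch.)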

There's a decision-tree chart in the article that addresses this; as it points out, plenty of models are much smaller than that.

Not everything is a large language model.



