I'm not on my desktop, but I've used nanoGPT extensively on my RTX 4090. I trained a GPT-2 small, but with a small context window (125M -> 90M params), at an effective batch size of 1536 via gradient accumulation (3*512); a minimal sketch of that setup is below the notes. This runs at just a smidge over 1 it/s. Some notes:
- Gradient checkpointing, with the full 2048 batch done in a single pass instead of accumulation, gives roughly a 10% per-sample throughput improvement (see the checkpointing sketch below).
- torch.compile doesn't work for me yet: the lowest CUDA version I could get my 4090 running on is 11.8, but the highest CUDA version on which I've gotten the model to compile is 11.7 (a guarded-fallback sketch is below).
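
For concreteness, here's a minimal sketch of the accumulation setup above (3 micro-batches of 512 for an effective batch of 1536). The model and data are toy stand-ins, not the real nanoGPT training loop:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

micro_batch = 512
accum_steps = 3  # 3 * 512 = effective batch of 1536

model = nn.Linear(768, 768).to(device)  # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

opt.zero_grad(set_to_none=True)
for _ in range(accum_steps):
    x = torch.randn(micro_batch, 768, device=device)  # fake micro-batch
    loss = model(x).square().mean()                   # fake loss
    (loss / accum_steps).backward()  # scale so grads match one 1536 batch
opt.step()
```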
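And roughly what the checkpointing variant looks like, here using torch.utils.checkpoint on a toy block (nanoGPT wires this differently; this is just the idea of fitting the whole 2048 batch in one forward/backward):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # Activations inside self.ff are recomputed during backward instead
        # of stored, trading extra compute for lower peak memory.
        return x + checkpoint(self.ff, x, use_reentrant=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(*[Block() for _ in range(4)]).to(device)
x = torch.randn(2048, 768, device=device)
model(x).mean().backward()  # whole 2048 batch in a single pass
```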
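Until the CUDA situation sorts itself out, one way to handle this is a guarded fallback to eager mode. `maybe_compile` here is a made-up helper (assumes PyTorch >= 2.0); since torch.compile defers the real work to the first call, you have to probe a forward pass to catch failures:

```python
import torch

def maybe_compile(model, sample_input):
    # Hypothetical helper: try one compiled forward pass and fall back
    # to the uncompiled model if compilation blows up.
    compiled = torch.compile(model)
    try:
        compiled(sample_input)  # triggers the actual compilation
        return compiled
    except Exception as exc:
        print(f"torch.compile failed, staying in eager mode: {exc}")
        return model
```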