
Can you please run bench.py from https://github.com/karpathy/nanoGPT?
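
For reference, the benchmark is the bench.py script at the repo root. A minimal run looks roughly like this (settings such as batch size live at the top of the script, and the exact knobs may differ between repo versions):

    git clone https://github.com/karpathy/nanoGPT
    cd nanoGPT
    python bench.py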



I'm not at my desktop right now, but I've used nanoGPT extensively on my RTX 4090. I trained a GPT-2-small variant with a smaller context window (125M → 90M params) at an effective batch size of 1536 via gradient accumulation (3 × 512; sketched after these notes). This runs at just a smidge over 1 it/s. Some notes:

- Gradient checkpointing with a 2048 batch size in a single pass (instead of accumulating) gives a ~10% per-sample performance improvement (also sketched below).

- torch.compile doesn't work for me yet: the lowest CUDA version I got my 4090 to run on is 11.8, but the highest CUDA version on which I got the model to compile is 11.7.

- I applied the optimizations from "Cramming" (https://arxiv.org/abs/2212.14034).
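
For anyone unfamiliar with the accumulation setup above, here's a minimal sketch of the pattern (the toy model, optimizer, and random data are illustrative stand-ins, not nanoGPT's actual training loop):

    import torch
    import torch.nn as nn

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # toy stand-ins; in nanoGPT this is the GPT model and its AdamW optimizer
    model = nn.Linear(64, 64).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    micro_bs, accum_steps = 512, 3   # 3 x 512 -> effective batch of 1536

    opt.zero_grad(set_to_none=True)
    for _ in range(accum_steps):
        x = torch.randn(micro_bs, 64, device=device)
        y = torch.randn(micro_bs, 64, device=device)
        loss = nn.functional.mse_loss(model(x), y)
        # divide so the summed gradients match one big 1536-sample step
        (loss / accum_steps).backward()
    opt.step()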
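
And the checkpointing trick from the first note, again only as a sketch using torch.utils.checkpoint (the real code would wrap nanoGPT's transformer blocks; this toy residual block just shows the mechanics):

    import torch
    from torch.utils.checkpoint import checkpoint

    class Block(torch.nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.ff = torch.nn.Sequential(
                torch.nn.Linear(dim, 4 * dim),
                torch.nn.GELU(),
                torch.nn.Linear(4 * dim, dim),
            )

        def forward(self, x):
            return x + self.ff(x)

    blocks = torch.nn.ModuleList(Block(64) for _ in range(4))
    x = torch.randn(2048, 64, requires_grad=True)  # the whole batch at once
    for blk in blocks:
        # activations inside blk are dropped after the forward pass and
        # recomputed during backward: compute traded for activation memory
        x = checkpoint(blk, x, use_reentrant=False)
    x.sum().backward()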


Can you share the code and numbers so I can compare directly with my 3090?

Do you train in fp16/bf16?

Have you tried fp8?
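
(For context on the precision question: the standard PyTorch pattern for bf16 is autocast, roughly as below. This is a generic sketch, not nanoGPT's exact loop; bf16 keeps fp32's exponent range, so unlike fp16 it needs no loss scaling.)

    import torch
    import torch.nn as nn

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = nn.Linear(64, 64).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    x = torch.randn(32, 64, device=device)
    y = torch.randn(32, 64, device=device)

    opt.zero_grad(set_to_none=True)
    # forward runs in bfloat16; with float16 you'd also use
    # torch.cuda.amp.GradScaler to avoid gradient underflow
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()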



