
I find this article odd with its fixation on computing speed and 8bit.

For most current models, you need 40+ GB of RAM to train them. Gradient accumulation doesn't work with batch norms so you really need that memory.

That means either dual 3090/4090 or one of the extra expensive A100/H100 options. Their table suggests the 3080 would be a good deal, but it's not. It doesn't have enough RAM for most problems.

If you can do 8bit inference, don't use a GPU. CPU will be much cheaper and potentially also lower latency.
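Roughly, that path looks like this with PyTorch's dynamic quantization (a sketch; the toy model is a stand-in for a real network, and production setups more often go through ONNX Runtime or similar):

    import torch
    from torch import nn
    from torch.ao.quantization import quantize_dynamic

    # Toy float model standing in for a real network.
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128)).eval()

    # Replace Linear layers with int8-weight versions; activations are quantized on the fly (CPU path).
    qmodel = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    with torch.inference_mode():
        out = qmodel(torch.randn(1, 512))
    print(out.shape)  # torch.Size([1, 128])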

Also: Almost everyone using GPUs for work will join NVIDIA's Inception program and get rebates... So why look at retail prices?




> Gradient accumulation doesn't work with batch norms so you really need that memory.

Last I looked, very few SOTA models are trained with batch normalization. Most of the LLMs use layer norm, which works fine with gradient accumulation (precisely because of the need to avoid the memory blowup).

Note also that batch normalization can be done in a memory-efficient way: it just requires aggregating the batch statistics separately from the gradient accumulation.
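For reference, a minimal PyTorch sketch of plain gradient accumulation (the layer sizes, data, and accumulation factor are all made up):

    import torch
    from torch import nn

    # Toy setup; sizes and data are placeholders.
    model = nn.Sequential(nn.Linear(128, 256), nn.LayerNorm(256), nn.ReLU(), nn.Linear(256, 10))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loader = [(torch.randn(16, 128), torch.randint(0, 10, (16,))) for _ in range(32)]

    accum_steps = 8  # effective batch = 8 micro-batches x 16 samples = 128
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = nn.functional.cross_entropy(model(x), y)
        (loss / accum_steps).backward()  # gradients sum across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

    # LayerNorm normalizes each sample independently, so the accumulated gradient
    # matches a single 128-sample batch; BatchNorm would compute its statistics
    # over only 16 samples per step, which is what breaks naive accumulation.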


wav2vec2, whisper, HifiGAN, Stable Diffusion, and Imagen all use BatchNorm.


> It doesn't have enough RAM for most problems.

It might not be as glamorous or make as many headlines, but there is plenty of research that goes on below 40 GB.

While I most commonly use A100s for my research, all my models fit on my personal RTX 2080.


I wonder, are you trying to work around the limit, or did it just happen like this?


We're trying to work around the limits:

I) My research involves biological data (protein-protein interactions) and my datasets are tiny (about 30K high-confidence samples). We have to regularize aggressively and use a pretty tiny network (see the sketch below the list) to get something that doesn't overfit horrendously.

II) We want to run many inferences (10^3 to 10^12) on a personal desktop or a cheap OVH server in little time, so we can serve the model on an online portal.
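For illustration, the kind of tiny, heavily regularized model I mean (the layer sizes, dropout rate, weight decay, and the 1024-dim input are all made-up stand-ins for the real pair features):

    import torch
    from torch import nn

    # Tiny MLP for a ~30K-sample dataset; every number here is illustrative.
    model = nn.Sequential(
        nn.Linear(1024, 64),  # 1024-dim input stands in for the real protein-pair features
        nn.ReLU(),
        nn.Dropout(0.5),      # aggressive dropout
        nn.Linear(64, 1),     # single interaction score
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
    print(sum(p.numel() for p in model.parameters()))  # ~66K parameters, trivially fits on a 2080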


I'm not sure any of this is accurate. 8-bit inference on a 4090 can do 660 Tflops and on an H100 can do 2 Pflops. Not to mention, there is no native support for FP8 (which is significantly better for deep learning) on existing CPUs.

The memory on a 4090 can serve extremely large models. Currently, int4 is starting to be proven out. With 24GB of memory, you can serve 40 billion parameter models. That, coupled with the fact that GPU memory bandwidth is significantly higher than CPU memory bandwidth, means that CPUs should rarely ever be cheaper / lower latency than GPUs.
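Back-of-the-envelope for the weights alone (activations and any KV cache come on top of this):

    params = 40e9                    # 40B-parameter model
    bytes_per_param = 0.5            # int4 = 4 bits per weight
    weights_gib = params * bytes_per_param / 2**30
    print(f"{weights_gib:.1f} GiB")  # ~18.6 GiB, which leaves some headroom within 24 GB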


> Almost everyone using GPUs for work will join NVIDIA's Inception program and get rebates... So why look at retail prices?

They need to advertise it better; this is the first time I've heard of it.

What are the prices like there? GPUs/workstations?


Depends on who you know, but I've seen as low as €799 per new 3090 Ti. But you need to waive the right to resell them, and there are quotas, for obvious reasons.


Consumer parts are dirt cheap compared to enterprise ones. Most companies are not able to use them at scale due to CUDA license terms. I don't think there is much of a need for rebates here. For hobbyists, it is somewhat of a steep price for the latest cards, but it's already way down from the height of ETH mining a year back.


> For most current models, you need 40+ GB of RAM to train them. Gradient accumulation doesn't work with batch norms so you really need that memory.

There's a decision-tree chart in the article that addresses this; as it points out, there are plenty of models that are much smaller than that.

Not everything is a large language model.


> Almost everyone using GPUs for work will join NVIDIA's Inception program and get rebates... So why look at retail prices?

So maybe they were including information for the hobbyists/students who do not need or cannot afford the latest and greatest professional cards?


> If you can do 8bit inference, don't use a GPU. CPU will be much cheaper and potentially also lower latency.

Good advice. Does that mean I can install, say, 64 GB of RAM in a PC and run those models in comparable time?


That's how cloud speech recognition is usually deployed. OpenAI Whisper is faster than realtime on regular desktop CPUs, which I guess is good enough.

And for a datacenter, a few $100 AMD CPUs will beat a single $20k NVIDIA A100 at throughput per dollar.
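For a sense of what CPU-only Whisper looks like in practice, a minimal sketch with the openai-whisper package (the audio file and the "base" model size are placeholders; bigger models are correspondingly slower):

    import whisper  # pip install openai-whisper

    model = whisper.load_model("base", device="cpu")
    result = model.transcribe("audio.wav", fp16=False)  # FP16 isn't supported on CPU
    print(result["text"])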


> Also: Almost everyone using GPUs for work will join NVIDIA's Inception program and get rebates... So why look at retail prices?

Out of curiosity, does that also apply for consumer grade GPUs?


You can get RTX A6000s, but not 3090s or 4090s, via Inception.


Prices? Prebuilt workstations? Or should we just apply and see? Is it that easy?


Companies with 4+ employees only.


FAQ says 1 developer.


The client I work with can order 3090 Ti and 4090 cards through their Inception link (in Germany). Apparently, it varies by partner.


I would be surprised if it did. But you probably shouldn't do professional work on GPUs that lack ECC memory.


The lack of ECC memory is almost certainly not a factor. If you can train at FP8, your model will recover from a single flipped bit somewhere.


I mean you could even view bit flips as a regularization technique like dropout...


Yeah I hear it’s common practice now to avoid synchronizing GPU training kernels in order to speed things up, and it has positive regularization benefits and little downside.


Anyone know if the GPUs are relatively affordable through Inception?


The 4090 Ti is rumored to have 48GB of VRAM, so one can only hope.


They nerfed the heck out of board memory in the 3000 series (the 3080 20GB was even made in limited quantity... going to miners in China :( ), so color me a bit skeptical.



