What makes volunteer compute resources impractical for training large-scale LLMs? Something similar to the SETI@home project or the Mersenne prime number search, which let users pool their spare compute to work together on one large problem.
It seems like compute is quickly becoming a bottleneck, and a moat, preventing ML researchers from training and using LLM-type language models.
Would be great to see a more publicly available solution to this, to break down the dam so to speak and give everyone access to SOTA LLMs.
As you mentioned, ML training can be parallelized, but this requires either data parallelism or model parallelism (or both).
Data parallelism means spreading the data over many different compute units and then synchronizing gradients somehow. The heterogeneous nature of @home computing makes this particularly challenging, since every synchronization step is gated by the slowest compute unit. I've personally only ever seen data (and model) parallelism done on a homogeneous compute cluster (e.g. 8x GPUs on one machine). A rough sketch of where the gradient sync happens is below.
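Just to make the sync step concrete, here's a toy numpy sketch (not a real distributed setup, just simulated "workers" in one process) of data-parallel SGD on a linear regression. The averaging line is the part that would have to cross the network every step in an @home cluster:

```python
# Toy data-parallel training: each "worker" computes gradients on its own
# shard, then gradients are averaged -- the step that needs synchronization.
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression problem: y = X @ w_true + noise
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(1024, 3))
y = X @ w_true + 0.01 * rng.normal(size=1024)

n_workers = 4
X_shards = np.array_split(X, n_workers)
y_shards = np.array_split(y, n_workers)

w = np.zeros(3)
lr = 0.1

for step in range(100):
    # Each worker computes the MSE gradient on its local shard only.
    local_grads = []
    for Xs, ys in zip(X_shards, y_shards):
        err = Xs @ w - ys
        local_grads.append(2 * Xs.T @ err / len(ys))

    # "All-reduce": average gradients across workers. Over the internet this
    # is the expensive, latency-bound step, and the slowest worker
    # (straggler) gates every iteration for everyone else.
    grad = np.mean(local_grads, axis=0)
    w -= lr * grad

print(w)  # converges close to w_true
```

In a real cluster the averaging is an all-reduce over NVLink/InfiniBand; over volunteer machines on home connections that same step dominates the iteration time.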
For model parallelism, we split the model itself across different compute units. However, that means the different parts of the model have to exchange activations and gradients on every pass, which gets very expensive when it happens across the internet. With 8x GPUs in one machine your latency is bounded by PCIe (or NVLink); in a distributed @home cluster it's bounded by TCP/IP over consumer connections. A toy illustration of where those transfers sit is below.
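Here's a minimal sketch, again just simulated in one numpy process, of a hypothetical two-layer model split across two "devices" pipeline-style. The points where activations move between stages are the transfers that would pay PCIe latency in one box or internet latency in an @home setup:

```python
# Toy model-parallel split: stage 0 holds layer 0, stage 1 holds layer 1.
# The activation handoff between stages is the synchronization point.
import numpy as np

rng = np.random.default_rng(0)

W0 = rng.normal(size=(8, 16)) * 0.1   # weights living on "device 0"
W1 = rng.normal(size=(16, 4)) * 0.1   # weights living on "device 1"

def stage0_forward(x):
    # Device 0: first layer + ReLU. The returned activation would have to
    # be shipped to device 1 (PCIe in one box, TCP/IP across the internet).
    return np.maximum(x @ W0, 0.0)

def stage1_forward(h):
    # Device 1: second layer produces the output.
    return h @ W1

x = rng.normal(size=(2, 8))
h = stage0_forward(x)     # transfer: activations, device 0 -> device 1
out = stage1_forward(h)   # backward pass would transfer grads 1 -> 0
print(out.shape)          # (2, 4)
```

Every forward and backward pass pays that handoff, so even modest network latency multiplied by thousands of steps per hour adds up fast.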
But I would say it's not impossible; someone clever could definitely figure it out.