Ask HN: Why isn’t it possible to train LLMs on idle resource like SETI home?
8 points by uptownfunk on April 15, 2023 | 12 comments
What makes volunteer compute resources impractical for training large-scale LLMs? Something similar to the SETI@home project or the Great Internet Mersenne Prime Search, which let users effectively pool available compute resources to solve one large problem.

It seems like compute resources are quickly becoming a bottleneck and a moat, preventing ML researchers from training and using LLM-type language models.

It would be great to see a more publicly available solution to this, to break down the dam, so to speak, and give everyone access to SOTA LLMs.



ML training is not as easily parallelizable as the problems those projects tackled. I'm not familiar with SETI@home, but I know this to be true for Folding@home.

As you mentioned, ML training can be parallelized, but this requires either data parallelism or model parallelism (or both).

Data parallelism means spreading the data over many different compute units and then synchronizing gradients somehow. The heterogeneous nature of @home computing makes this particularly challenging, because every step is limited by the slowest compute unit. I've personally only ever seen data (and model) parallelism done on a homogeneous compute cluster (e.g. 8x GPUs).
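Roughly what that gradient synchronization looks like, as a minimal sketch assuming PyTorch's torch.distributed with a process group already initialized (the function and names are illustrative, not from any real @home project):

    import torch
    import torch.distributed as dist

    def data_parallel_step(model, batch, loss_fn, optimizer):
        optimizer.zero_grad()
        loss = loss_fn(model(batch["x"]), batch["y"])
        loss.backward()
        # Every worker must average its gradients with every other worker
        # before anyone can take an optimizer step, so each step waits for
        # the slowest participant and the slowest link.
        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size
        optimizer.step()

In a datacenter that all_reduce rides on NVLink/InfiniBand; over residential connections it becomes the entire step time.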

For model parallelism, we split the model across different compute units. However, this means you need to synchronize the different parts of the model with each other, which gets very expensive when you do it across the internet. With 8x GPUs in one machine, that synchronization is bounded by PCIe latency; in a distributed @home cluster, it's bounded by TCP/IP over the public internet.
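A toy sketch of that split, assuming plain PyTorch and two local GPUs (layer sizes and names are made up for illustration):

    import torch
    import torch.nn as nn

    class SplitMLP(nn.Module):
        # Half the layers live on one device, the other half on another.
        def __init__(self):
            super().__init__()
            self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
            self.part2 = nn.Linear(4096, 1024).to("cuda:1")

        def forward(self, x):
            h = self.part1(x.to("cuda:0"))
            # On one machine this hop is a PCIe/NVLink copy; in an @home setup
            # it would be a round trip over the public internet on every
            # forward and backward pass.
            return self.part2(h.to("cuda:1"))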

But I would say it's not impossible; someone clever could definitely figure it out.


Why wouldn't it work for CPU models?


Probably because your entire country's worth of personal computers delivers the same capacity as a rack of dedicated hardware.


ML training is iterative and non-parallelizable, so breaking it up into distributable units of work would not provide any benefit and would actually slow down learning.


Why does AWS seem to suggest otherwise?

>> With only a few lines of additional code, you can add either data parallelism or model parallelism to your PyTorch and TensorFlow training scripts and Amazon SageMaker will apply your selected method for you. SageMaker will determine the best approach to split your model by using graph partitioning algorithms to balance the computation of each GPU while minimizing the communication between GPU instances. SageMaker also optimizes your distributed training jobs through algorithms that are designed to fully utilize AWS compute and network infrastructure in order to achieve near-linear scaling efficiency, which allows you to complete training faster than manual implementations.

https://aws.amazon.com/sagemaker/distributed-training/
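For reference, the "few lines of additional code" on the launching side look roughly like this with the SageMaker Python SDK (instance types, versions, role ARN, and S3 paths below are placeholders):

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",          # your existing PyTorch training script
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        framework_version="1.13",
        py_version="py39",
        # Run the script under SageMaker's distributed data parallel library.
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit("s3://my-bucket/training-data")  # placeholder S3 path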


There's still a lot of data that needs to be passed between GPUs. SageMaker is "minimizing the communication," but it's still far more than nothing, and all the gradients need to be communicated roughly every iteration. That's fine between machines with high-speed datacenter links, but far more than you would ever want to send across the internet repeatedly.
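Back-of-envelope, with made-up but plausible numbers (fp16 gradients, naive all-reduce, no compression):

    params = 7e9                 # e.g. a 7B-parameter model (illustrative)
    grad_bytes = params * 2      # fp16 gradients: ~14 GB per worker per step

    datacenter_link = 400e9 / 8  # ~400 Gb/s InfiniBand/EFA-class link, bytes/s
    home_uplink = 100e6 / 8      # ~100 Mb/s residential uplink, bytes/s

    print(grad_bytes / datacenter_link)   # ~0.3 s just to move gradients
    print(grad_bytes / home_uplink / 60)  # ~19 minutes, every single step

Gradient compression and less frequent synchronization can shave that down, but the gap is orders of magnitude, not a constant factor.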


Some computations work well with coarse-grained parallelism, but not with fine-grained parallelism.

At some point, greater distribution of computing, especially across a large distributed network (the internet!) and onto smaller individual computing systems, means the time cost of combining or synchronizing intermediate calculations far exceeds any benefit of adding another computing node.
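A toy model of that trade-off (all numbers illustrative):

    def speedup(n_nodes, compute_s, sync_s_per_node):
        serial = n_nodes * compute_s                      # same total work on one node
        parallel = compute_s + n_nodes * sync_s_per_node  # naive per-step sync cost
        return serial / parallel

    print(speedup(64, compute_s=1.0, sync_s_per_node=0.001))  # cluster-ish: ~60x
    print(speedup(64, compute_s=1.0, sync_s_per_node=0.5))    # internet-ish: ~1.9x

Past the point where the sync term dominates, adding nodes makes each step slower.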


TIL Skynet needs deep work and focus time too


Calling ML training non-parallelizable is very strong, no? Matrix multiplication is highly parallelizable.


Aside from technical reasons why would I volunteer my computing time for it to be wasted when someone lobotomizes the end result with their "bigotry protections"?


I think the idea is the model becomes fully open source.


Leaving a comment to come back in a few years/months when someone releases an approach that makes this possible...





