mayukhdeb 4 months ago | on: TopoNets: High performing vision and language mode...

Yep, that is exactly the idea here. Our compression method is super naive: we literally keep every n-th weight column and discard the rest. It turns out that even after dropping 80% of the weight columns this way, we retained the same performance in a 125M-parameter GPT.
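
For anyone who wants to see what "keep every n-th column" looks like in practice, here is a rough NumPy sketch. It is not the code from the paper: the function name `keep_every_nth_column`, the 768 x 3072 layer shape, and the random weights are all made up for illustration. Dropping 80% of the columns corresponds to n = 5. The sketch only shows the subsampling step and says nothing about how the rest of the model deals with the smaller matrix.

```python
import numpy as np


def keep_every_nth_column(weight: np.ndarray, n: int) -> np.ndarray:
    """Keep columns 0, n, 2n, ... of a 2-D weight matrix and drop the rest.

    Hypothetical helper for illustration only, not taken from the paper's code.
    """
    if weight.ndim != 2:
        raise ValueError("expected a 2-D weight matrix")
    if n < 1:
        raise ValueError("n must be >= 1")
    # A strided slice returns a view containing every n-th column.
    return weight[:, ::n]


# Assumed layer shape and random weights, chosen only for the demo.
rng = np.random.default_rng(0)
W = rng.standard_normal((768, 3072))

# n = 5 keeps roughly 20% of the columns, i.e. discards roughly 80%.
W_small = keep_every_nth_column(W, n=5)
kept_fraction = W_small.shape[1] / W.shape[1]
print(W.shape, "->", W_small.shape, f"(kept {kept_fraction:.1%} of columns)")
```

Because the slice is a view, call `.copy()` on the result if you want to free the original matrix afterwards.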