
The thing to understand about any model architecture is that there isn't really anything special about one or the other - as long as the process is differentiable, ML can learn it.

You can build an image generator that basically renders each word on one line in an image, and then uses a transformer architecture to morph the image of the words into what the words are describing.
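A toy sketch of that idea (everything here is my own illustration - the renderer, the sizes, and the untrained encoder are arbitrary choices, nothing is the poster's actual design) just to show the plumbing of "render the words, then let a transformer morph the pixels":

    import torch
    import torch.nn as nn
    from PIL import Image, ImageDraw, ImageFont

    # Render the prompt text into a grayscale image (hypothetical sizes).
    def render_prompt(text, size=(64, 256)):
        img = Image.new("L", (size[1], size[0]), color=255)
        ImageDraw.Draw(img).text((4, 4), text, fill=0, font=ImageFont.load_default())
        return torch.tensor(list(img.getdata()), dtype=torch.float32).view(1, *size) / 255.0

    patch = 16
    prompt_img = render_prompt("a red fox in the snow")                    # (1, 64, 256)
    patches = prompt_img.unfold(1, patch, patch).unfold(2, patch, patch)   # (1, 4, 16, 16, 16)
    tokens = patches.reshape(1, -1, patch * patch)                         # 64 patch tokens of dim 256

    # Untrained stand-in for the "morphing" stage; a real model would be
    # trained to decode these tokens back into the described image.
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=patch * patch, nhead=4, batch_first=True),
        num_layers=2,
    )
    morphed = encoder(tokens)
    print(morphed.shape)   # torch.Size([1, 64, 256])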

The only big difference is really efficiency, but we are just taking stabs in the dark at this point - there is work Google is doing that will eventually result in the optimal model for a certain type of task.



Without going into too much detail: the complexity space of tensor operations is for all practical purposes infinite. The general tensor which captures all interactions between all elements of an input of length N is N x N x ... x N (N times), i.e. on the order of N^N entries.

This is worse than exponential and means we have nothing but tricks to try and solve any problem that we see in reality.

As an example, solving MNIST and its 28x28-pixel variants this way will be impossible until the 2100s, because we don't have enough memory to store the general tensor which captures the interactions between every group of pixels and every other group of pixels.
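Rough back-of-the-envelope arithmetic for that (my own numbers, just counting entries in the brute-force interaction tensor, not anything from the post):

    # An MNIST image has 28*28 = 784 pixels, so there are 2**784 possible
    # "groups" (subsets) of pixels, and one tensor entry per pair of groups.
    pixels = 28 * 28
    groups = 2 ** pixels                       # ~1e236 pixel groups
    entries = groups ** 2                      # one entry per (group, group) pair
    bytes_needed = entries * 4                 # assume float32 per entry

    print(f"pixels: {pixels}")
    print(f"pixel groups: ~1e{len(str(groups)) - 1}")
    print(f"tensor entries: ~1e{len(str(entries)) - 1}")
    print(f"memory needed: ~1e{len(str(bytes_needed)) - 1} bytes")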


While true in a theoretical sense (an MLP of sufficient size can in principle represent any differentiable function), in practice it's often impossible for a certain architecture to learn a specific task no matter how much compute you throw at it. E.g. an LSTM will never capture long-range dependencies that a transformer could trivially learn, due to gradients vanishing after a certain sequence length.
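A quick way to see the vanishing-gradient point (my own toy snippet - untrained LSTM, sizes picked arbitrarily): measure how much gradient from the last output actually reaches the first timestep.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    seq_len, hidden = 512, 64
    lstm = nn.LSTM(input_size=8, hidden_size=hidden, batch_first=True)
    x = torch.randn(1, seq_len, 8, requires_grad=True)

    out, _ = lstm(x)
    out[:, -1].sum().backward()                  # gradient of the final output w.r.t. all inputs
    per_step = x.grad.norm(dim=-1).squeeze(0)    # gradient magnitude at each timestep

    print("grad norm at last step :", per_step[-1].item())
    print("grad norm at first step:", per_step[0].item())   # typically orders of magnitude smaller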


You are right with respect to the ordering of operations; recurrent networks carry a whole bunch of extra computational complexity there.

However, for example, a Transformer over a fixed-length input can be represented with just fully connected layers, albeit with a lot of zeros for the weights.
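A minimal sketch of one piece of that (my own illustration, and only the easy part - the per-token linear projection): applying the same projection to every token of a fixed-length sequence is exactly one big fully connected layer whose weight matrix is block-diagonal, i.e. mostly zeros.

    import numpy as np

    np.random.seed(0)
    seq_len, d_model = 4, 3
    W = np.random.randn(d_model, d_model)       # shared per-token projection weights
    X = np.random.randn(seq_len, d_model)       # one token embedding per row

    per_token = X @ W.T                          # the usual per-token projection

    big_W = np.kron(np.eye(seq_len), W)          # block-diagonal dense layer: lots of zeros
    flat = big_W @ X.reshape(-1)                 # same computation over the flattened sequence

    print(np.allclose(per_token.reshape(-1), flat))   # True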



