
Yes to both, the "neuron" would basically be a weighted parameter. A parameter is an expression, a mathematical representation of a token and its probabilistic weighting (they're translated from input or to output token lists entering and exiting the model). Usually tokens are pre-set small groups of character combinations, like "if " or "cha", that make up a word/sentence. The recorded path your value takes down the chain of probabilities would be the "neural pathway" within the wider "neural network".
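
To make that concrete, here's a toy sketch in Python. Everything in it (the subword vocab, the embedding numbers, the weights) is invented for illustration, not taken from any real model:

  # Toy illustration: subword tokens map to ids, each id has an
  # embedding vector, and one "neuron" is a weighted sum over it.
  vocab = {"cha": 0, "t ": 1, "if ": 2}      # pre-set subword pieces
  embedding = [                              # one vector per token id
      [0.2, -0.1, 0.4],
      [0.0,  0.3, -0.2],
      [0.5,  0.1,  0.0],
  ]
  neuron_weights = [0.7, -1.2, 0.3]          # the learned parameters

  def neuron(token):
      """Weighted sum of the token's embedding: one unit firing."""
      vec = embedding[vocab[token]]
      return sum(w * x for w, x in zip(neuron_weights, vec))

  print(neuron("cha"))  # one activation along the "pathway"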

Someone please correct me if I'm wrong or my terminology is off.




This is all true in a neural net, but Transformers aren't Neural Nets in the traditional sense. I was under that impression originally, but there's no backpropagation or Hebbian learning here, which were the key bits of biomimicry that earned classic NNs their name.

Transformers do have coefficients that are fit, but that's broader: fitted coefficients could be used for any sort of regression or optimization, and aren't necessarily indicative of biological analogs.

So I think the terms "learned model" or "weights" are malapropisms for Transformers, carried over from deep nets because of structural similarities, like many layers, and the development workflow.

The functional units in a Transformer's layers have lost their original biological inspiration and functional analog. The core function in Transformers is more like autoencoding/decoding (concepts from info theory) and model/grammar-free translation, with a unique attention-based optimization. Transformers were developed for translation. The magic is something like "attending" to important parts of the translation inputs & outputs as tokens are generated, maybe as a kind of deviation from pure autoencoding, due to the bias from the .. learned model :) See, I can't even escape it.
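
The attention core is actually small enough to sketch. A minimal scaled dot-product attention in Python/numpy, following the paper's softmax(QK^T / sqrt(d_k)) V, with arbitrary shapes and random values:

  import numpy as np

  def attention(Q, K, V):
      d_k = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)  # how much each query "attends" to each key
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
      return weights @ V               # weighted mix of the values

  rng = np.random.default_rng(0)
  Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
  print(attention(Q, K, V).shape)      # (4, 8)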

Attention as a powerful systemic optimization is the actual closer bit of neuro/bio-inspiration here.. but more from Cog Psych than micro/neuro anatomy.

Btw, not only is attention a key insight for Transformers, but it's an interesting biographical note that its lead inventor, Jakob Uszkoreit, went on to work on a bio-AI startup after Google.


> This is all true in a neural net, but Transformers aren't Neural Nets in the traditional sense. I was under that impression originally, but there's no backpropagation or Hebbian learning here, which were the key bits of biomimicry that earned classic NNs their name.

Hebbian learning has never been used with much success in training neural nets. Backpropagation is not bio-inspired, but backpropagation is certainly used to train transformers.
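
For contrast, the two update rules in question, sketched in Python with made-up data. Hebb's rule has no error signal, so weights just grow with correlated activity; the delta rule is error-driven, and that's the piece backprop extends through many layers:

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=3)          # input activations (made up)
  w = 0.1 * rng.normal(size=3)    # small random synaptic weights
  target, lr = 1.0, 0.1

  # Hebbian: "cells that fire together wire together" -- no error
  # term, so it needs extra machinery (decay, normalization) to
  # stay stable, one reason it never trained nets well.
  y = w @ x
  w_hebb = w + lr * y * x

  # Delta rule: one step of gradient descent on squared error.
  w_delta = w + lr * (target - y) * x

  print(w_hebb, w_delta)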


Agreed, Hebbian learning isn't used.. I just meant it as an example of what would signal a NN.

For Backprop, I'm basing this off the development of the Perception. Wiki supports this and its bio-inspired origin[1].

As for its use in Transformers, if you mean simple regressing of errors or use of gradient descent, I'd agree, but that's not usually called Backprop and the term isn't used in the original paper. The term typically means back-propagating the errors through the entire network at a certain stage of learning, and that's not present in Transformers as far as I can tell.

Happy to see any support for your claims tho.

[1] https://en.m.wikipedia.org/wiki/Backpropagation


What do you mean, the development of the "Perception"? Do you mean the Perceptron? In that case, Backprop was invented way later than the Perceptron (see https://people.idsia.ch/~juergen/who-invented-backpropagatio...).

I don't see any information in your linked Wikipedia article that supports a bio-inspired origin. In fact, researchers have been wondering whether an equivalent to Backprop might be found in biological brains, but Backprop is widely believed to be biologically implausible (see e.g. https://arxiv.org/pdf/1502.04156.pdf, https://www.sciencedirect.com/science/article/pii/S089360801...).

It's not surprising that the term Backprop is not mentioned in the original paper; it isn't mentioned in most neural network research, because it's simply the default method to optimize weights. Additionally it's hidden away by modern autodiff frameworks, so no one actually has to give it any thought. But backprop is definitely used in transformers (see e.g. https://aclanthology.org/2020.emnlp-main.463.pdf, https://arxiv.org/pdf/2004.08249, https://proceedings.mlr.press/v202/phang23a/phang23a.pdf, https://dinkofranceschi.com/docs/bft.pdf)
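
A minimal PyTorch sketch of the point (arbitrary shapes and a dummy loss): one backward() call from the loss puts a gradient on every parameter of a stacked transformer, first layer included. The autodiff framework is doing backprop whether or not anyone names it:

  import torch
  import torch.nn as nn

  model = nn.TransformerEncoder(
      nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
      num_layers=2,
  )
  x = torch.randn(1, 10, 32)     # (batch, sequence, features)
  loss = model(x).pow(2).mean()  # arbitrary scalar loss
  loss.backward()                # backprop through the whole stack

  # Every parameter, down to the first layer, now has a gradient.
  print(all(p.grad is not None for p in model.parameters()))  # True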


Ah yes, Perceptron. Had a couple typos.. sorry, was on phone.

The bio-inspiration came via Frank Rosenblatt, who is referred to in that article, though yeah, the history is covered in his own article:

https://en.wikipedia.org/wiki/Frank_Rosenblatt#Perceptron

"Rosenblatt was best known for the Perceptron, an electronic device which was constructed in accordance with biological principles and showed an ability to learn.

He developed and extended this approach in numerous papers and a book called Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, published by Spartan Books in 1962.[6] He received international recognition for the Perceptron.

The Mark I Perceptron, which is generally recognized as a forerunner to artificial intelligence, currently resides in the Smithsonian Institution in Washington D.C."

Your Juergen page is interesting, tho there's no direct comment on Rosenblatt there. He does cite the relevant work on this page:

https://people.idsia.ch/~juergen/deep-learning-overview.html (refs R58, R61)

My reading is that a long-known idea about multivariate regression was reinterpreted by Rosenblatt by 1958 via the bio-inspired Perceptron, then criticized by Minsky and others, with viable methods achieved by 1965. When I was taught NNs by Mitchell at CMU in the 1990s (lectures similar to his book Machine Learning), this was the same basic story. Also reminds me of a moment in class one day when a Stats prof who was surveying the course broke out with "but wait, isn't this all just multivariate regression??" :) Mitchell agreed to the functional similarity, but I think that helps highlight how the biomimicry was crucial to developing the idea. It had lain hidden in plain sight for a century.
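
For reference, Rosenblatt's perceptron update fits in a few lines of Python (toy AND-gate data; the "biological" part is just a threshold unit plus an error-driven weight change, which is indeed a hair away from regression):

  import numpy as np

  # Toy linearly separable data: an AND gate.
  X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
  y = np.array([0, 0, 0, 1])

  w, b, lr = np.zeros(2), 0.0, 0.1
  for _ in range(20):                        # a few passes suffice
      for xi, ti in zip(X, y):
          pred = 1 if w @ xi + b > 0 else 0  # threshold "neuron"
          w += lr * (ti - pred) * xi         # Rosenblatt's update
          b += lr * (ti - pred)

  print([1 if w @ xi + b > 0 else 0 for xi in X])  # [0, 0, 0, 1]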

Agreed, and I was aware that there has since been criticism of the biological plausibility of backprop.

Your further links with refs to backprop in transformers are interesting; I hadn't seen these. It's clear the term is being used the way you say, though I still see some ambiguity in its use here. Autodifferentiation, gradient descent, multivariate regression etc. are of course in common use, and scanning these papers it's not clear to me that the terms aren't simply being conflated. What had stood out as unique for me with backprop was a coherent whole-network regression. This looks to me like a piecewise approach.

But anyways, I see your point. Thanks!


Got me reading the original. It's rad.

Link to PDF and some screens from the intro here..

https://twitter.com/PMayrgundter/status/1743096776456867921


Thanks for your reply, you raise a very good point; transformer models are a lot more complex. I'd argue conceptually they're the same, just the data and process are more abstracted. Autoencoded data implies using efficient representations, basically semantically abstracted data, and opting for measures like backpropagation through time.


So like in my sister reply, I don't see the Backprop, but maybe I'm missing it. This article does use the word, but in a generic way:

"For example, when doing the backpropagation (the technique through which the models learn), the gradients can become too large"

But I think this is more of a borrowing; it's not used again in the description and may just be a misconception. There's no use of the Backprop term in the original paper, nor any stage of learning where output errors are run through the whole network in a deep regression.

What I do see in Transformers is localized uses of gradient descent, and Backprop in NNs also uses GD... but that seems to be the extent of it.

Is there a deep regression? Maybe I'm missing it.


Yes, perhaps the below helps. It's over my head, but...

https://courses.grainger.illinois.edu/ece448/sp2023/slides/l...

From another source:

"Backpropagation Through Time (BPTT) is an adaptation of backpropagation used for training recurrent neural networks (RNNs), which are designed to process sequences of data and have internal memory. Because the output at a given time step might depend on inputs from previous time steps, the forward pass involves unfolding the RNN through time, which essentially converts it into a deep feedforward neural network with shared weights across the time steps. The error for each time step is computed, and then BPTT is used to calculate the gradients across the entire unfolded sequence, propagating the error not just backward through the layers but also backward through the time steps. Updates are then made to the network weights in a way that should minimize errors for all time steps. This is computationally more involved than standard backpropagation and has its own challenges such as exploding or vanishing gradients."
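
A minimal sketch of that in PyTorch (sizes arbitrary): the forward pass unrolls the RNN over the sequence, and one backward() call propagates the error back through layers and time steps alike:

  import torch
  import torch.nn as nn

  rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
  head = nn.Linear(8, 1)

  x = torch.randn(1, 6, 4)        # batch of 1, 6 time steps, 4 features
  target = torch.randn(1, 6, 1)

  out, _ = rnn(x)                 # forward pass unrolls over the 6 steps
  loss = ((head(out) - target) ** 2).mean()  # error over all time steps
  loss.backward()                 # BPTT: gradients flow back through time

  # The input-to-hidden weights accumulate gradient from every step.
  print(rnn.weight_ih_l0.grad.shape)  # torch.Size([8, 4])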



