
Ah yes, Perceptron. Had a couple typos.. sorry, was on phone.

The bio-inspiration was via Frank Rosenblatt, who is referred to in that article, though the fuller history is over in his own article:

https://en.wikipedia.org/wiki/Frank_Rosenblatt#Perceptron

"Rosenblatt was best known for the Perceptron, an electronic device which was constructed in accordance with biological principles and showed an ability to learn.

He developed and extended this approach in numerous papers and a book called Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, published by Spartan Books in 1962.[6] He received international recognition for the Perceptron.

The Mark I Perceptron, which is generally recognized as a forerunner to artificial intelligence, currently resides in the Smithsonian Institution in Washington D.C."

Your Juergen page is interesting, though there's no direct comment on Rosenblatt there. He does cite the work on this page:

https://people.idsia.ch/~juergen/deep-learning-overview.html (refs R58, R61)

My reading is that a long-known idea about multivariate regression was reinterpreted by Rosenblatt by 1958 via the bio-inspired Perceptron, was then criticized by Minsky and others, and viable methods were achieved by 1965. When I was taught NNs by Mitchell at CMU in the 1990s (lectures similar to his book Machine Learning), this was the same basic story. It also reminds me of a moment in class one day when a Stats prof who was surveying the course broke out with "but wait, isn't this all just multivariate regression??" :) Mitchell agreed to the functional similarity, but I think that helps highlight how the biomimicry was crucial to developing the idea. It had lain hidden in plain sight for a century.
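
For what it's worth, here's a minimal sketch of that functional similarity (data and parameters are made up for illustration): Rosenblatt's learning rule is an error-driven update on exactly the linear form w·x + b that a multivariate regression would also fit.

```python
# Minimal sketch of the classic perceptron learning rule.
# Data, learning rate, and epoch count are illustrative only.
import numpy as np

def perceptron_train(X, y, epochs=20, lr=0.1):
    """X: (n_samples, n_features), y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0  # linear form + step threshold
            err = yi - pred                    # 0 if correct, +/-1 if not
            w += lr * err * xi                 # error-driven weight update
            b += lr * err
    return w, b

# Drop the step threshold and the same linear form w @ x + b is what a
# multivariate (linear) regression fits -- the stats professor's point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])  # a linearly separable toy problem (AND)
w, b = perceptron_train(X, y)
print(w, b)
```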

Agreed, and I was aware that there has since been criticism of the biological plausibility of backprop.

Your further links with refs to backprop in transformers are interesting; I hadn't seen these. It's clear the term is being used like you say, though I still see ambiguity in its utility here. Autodifferentiation, gradient descent, multivariate regression, etc. are of course in common use, and scanning these papers it's not clear to me that the terms aren't simply being conflated. What had stood out as unique for me with backprop was a coherent whole-network regression. This, to me, looks like a piecewise approach.
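
To illustrate what I mean by a coherent whole-network regression, here's a rough sketch (shapes, sizes, and data are made up): one chain-rule sweep from the scalar loss yields gradients for every layer's weights at once, all against the same objective, rather than fitting pieces separately.

```python
# Rough sketch of a whole-network backward pass on a toy two-layer net.
# All shapes and data are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # toy inputs
y = rng.normal(size=(8, 1))          # toy targets
W1 = rng.normal(size=(3, 4)) * 0.1
W2 = rng.normal(size=(4, 1)) * 0.1

# Forward pass through the whole network to a single scalar loss.
h_pre = X @ W1
h = np.tanh(h_pre)
y_hat = h @ W2
loss = np.mean((y_hat - y) ** 2)

# Backward pass: one chain-rule sweep gives gradients for *every* layer
# with respect to that one loss.
d_y_hat = 2 * (y_hat - y) / len(X)
dW2 = h.T @ d_y_hat
d_h = d_y_hat @ W2.T
d_h_pre = d_h * (1 - np.tanh(h_pre) ** 2)
dW1 = X.T @ d_h_pre

# Gradient-descent step on all weights at once (lr is arbitrary).
lr = 0.1
W1 -= lr * dW1
W2 -= lr * dW2
print(loss)
```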

But anyways, I see your point. Thanks!




Got me reading the original. It's rad.

Link to the PDF and some screenshots from the intro here:

https://twitter.com/PMayrgundter/status/1743096776456867921



