Everyone must first get over the terminology confusion. Convolution in DL is actually cross-correlation, not convolution. In practice it does not matter (the kernel is just flipped), but it can be very confusing when you try to learn and work through examples.
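If you want to see this concretely, here's a minimal sketch (assuming PyTorch and SciPy are available) checking that a framework's conv2d is cross-correlation, and that flipping the kernel recovers textbook convolution:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.signal import convolve2d, correlate2d

x = np.arange(16, dtype=np.float32).reshape(4, 4)      # toy "image"
k = np.array([[1., 2.], [3., 4.]], dtype=np.float32)   # asymmetric kernel, so the flip matters

# DL-style conv expects (batch, channels, H, W) input and (out_ch, in_ch, kH, kW) weights
dl = F.conv2d(torch.from_numpy(x)[None, None], torch.from_numpy(k)[None, None])[0, 0].numpy()

print(np.allclose(dl, correlate2d(x, k, mode="valid")))             # True: it's cross-correlation
print(np.allclose(dl, convolve2d(x, k[::-1, ::-1], mode="valid")))  # True: convolution with the flipped kernel
print(np.allclose(dl, convolve2d(x, k, mode="valid")))              # False: not textbook convolution
```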
The terminology comes from signal processing, where a convolution in the time domain is equivalent to a multiplication in the frequency domain. I don't think anyone is thinking about the frequency domain in deep learning, but they still call the operators convolution kernels.
"Convolution with a kernel K" describes a system whose impulse response is K. In discrete time, suppose you have K=[1,2] and convolve [0,1,2,0] with it- you wind up with [0,1,3,2,0], if I'm awake enough for arithmetic.
Correlation with a kernel K is convolution with K time-reversed (i.e. [2,1])- you'd get [0,2,5,2,0] (again if I'm awake). Note that 5- right there, the input signal "lines up just right" with the kernel- 2x2 + 1x1. That's why it's called correlation- its output is big when the input looks like the kernel.
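A quick NumPy check of the arithmetic above, in case anyone else isn't awake either:

```python
import numpy as np

x = np.array([0, 1, 2, 0])
K = np.array([1, 2])

print(np.convolve(x, K))                # [0 1 4 4 0]  true convolution
print(np.convolve(x, K[::-1]))          # [0 2 5 2 0]  convolution with the reversed kernel
print(np.correlate(x, K, mode="full"))  # [0 2 5 2 0]  i.e. correlation
```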
I mean ultimately it comes from functional analysis and differential equations (not signal processing).
It's a binary operator on functions that yields a third function. It has a lot of useful properties and equivalences, like the fact that the Fourier transform of a convolution is the product of the two Fourier transforms, so you can compute a convolution by multiplying transforms (although that's very roundabout).
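As a sketch of that roundabout route (plain NumPy; the zero-padding to length len(a)+len(b)-1 is what makes the circular DFT convolution match the ordinary linear one):

```python
import numpy as np

a = np.random.rand(8)
b = np.random.rand(8)

direct = np.convolve(a, b)                       # length 15, computed directly

# The roundabout route: zero-pad, multiply the DFTs, transform back
n = len(a) + len(b) - 1
via_fft = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

print(np.allclose(direct, via_fft))              # True
```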
You're actually introduced to convolution in middle school when you're taught to multiply two polynomials by multiplying out their terms (at my middle school they called it "FOIL"): the coefficients of the product are the convolution of the two coefficient lists.
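Concretely, np.convolve will happily do the FOIL for you (coefficients in increasing-power order here):

```python
import numpy as np

# (1 + 2x) * (3 + 4x) = 3 + 10x + 8x^2 -- the FOIL steps are exactly a convolution
p = [1, 2]   # 1 + 2x
q = [3, 4]   # 3 + 4x
print(np.convolve(p, q))   # [3 10 8], i.e. 3 + 10x + 8x^2
```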
It appears to be a discrete Fourier transform, no? Does it apply to all convolutions or just a specific instance or subset? As in, is there a proof showing that as the sample size N goes to infinity it approaches the continuous limit? I still natively think in continuous convolutions from physics. The whole discretization of these operators is oddly harder for me despite it technically being simpler to compute.
No, it's true of both continuous and discrete time/domain Fourier. Convolution in time is multiplication in frequency, and vice versa. You don't need to prove this with limits directly; just use the definitions of the convolution integral and the Fourier transform integral.
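For the continuous case the proof really is just the two definitions plus Fubini and a change of variables; a sketch (using the non-unitary transform convention):

```latex
% Convolution theorem, continuous case, with \hat{f}(\omega) = \int f(t)\, e^{-i\omega t}\, dt
\begin{align*}
\widehat{f * g}(\omega)
  &= \int e^{-i\omega t} \int f(\tau)\, g(t-\tau)\, d\tau \, dt \\
  &= \int f(\tau)\, e^{-i\omega\tau} \left( \int g(t-\tau)\, e^{-i\omega(t-\tau)}\, dt \right) d\tau
   && \text{(Fubini; factor } e^{-i\omega t} = e^{-i\omega\tau}\, e^{-i\omega(t-\tau)}\text{)} \\
  &= \hat{f}(\omega)\, \hat{g}(\omega)
   && \text{(substitute } u = t-\tau \text{ in the inner integral).}
\end{align*}
```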
> technically being simpler to compute.
They're equivalent, since the only meaningful way to "compute" a continuous convolution is symbolically, and discrete convolutions obey most of the same identities.
If the signals in a simulation are bandlimited (so some finite, nonzero time step is good enough), then continuous convolutions can be evaluated using discrete convolutions, which represent the continuous case exactly via the Nyquist-Shannon sampling theorem.
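A small numerical sketch of that correspondence (this is just the Riemann-sum view with NumPy/SciPy, not the exact sinc-interpolation argument the sampling theorem gives): the continuous convolution of two Gaussian densities is another Gaussian, and the sampled convolution scaled by the time step lands right on top of it.

```python
import numpy as np
from scipy.stats import norm

dt = 0.01
t = np.arange(-10, 10, dt)
f = norm.pdf(t, scale=1.0)          # N(0, 1^2) density, sampled
g = norm.pdf(t, scale=2.0)          # N(0, 2^2) density, sampled

# Discrete convolution scaled by dt ~ the continuous convolution integral
num = np.convolve(f, g) * dt
t_out = 2 * t[0] + dt * np.arange(len(num))     # time axis of the full convolution

exact = norm.pdf(t_out, scale=np.sqrt(1.0**2 + 2.0**2))   # continuous answer: N(0, 5)
print(np.max(np.abs(num - exact)))              # tiny, limited by dt and the truncated tails
```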
Interestingly enough, to prove the Sampling Theorem you need to rely on the identity that multiplication in frequency is convolution in time, and to prove that ideal reconstruction can't be realized in a physical system (it breaks causality: you multiply in frequency by a brick-wall filter, a difference of two Heavisides, whose time-domain counterpart is a sinc stretching infinitely far in both directions).
And more interestingly, signals and systems is mostly applied dynamics and statistics, so it shouldn't be surprising that there's overlap.
This is still very low level; the whole article (although very comprehensive) missed the simple definition it should have mentioned first. Going into jargon only adds to the complication. Look it up in a dictionary first for the plain-English definition, and then try to understand how it has been applied in different domains.
1 : a form or shape that is folded in curved or tortuous windings
e.g. the convolutions of the intestines
2 : one of the irregular ridges on the surface of the brain and especially of the cerebrum of higher mammals
3 : a complication or intricacy of form, design, or structure
… societies in which the convolutions of power and the caprices of the powerful are ever-present dangers to survival.
After this is clear, read the mathematical idea on Wikipedia. After reading that, do a Google Scholar search for the AI papers that first mentioned it. That is the way to go.
Since the kernels are learned weights anyway (their random initialization and training work the same whether or not you flip them), you can picture it whichever way you find easiest.
It is a convolution. Grant Sanderson (3Blue1Brown) explains the relationship between a filter and the Fourier transform towards the end of this hot-off-the-presses video: https://mitmath.github.io/18S191/Fall20/lecture2/
That doesn't help one understand what it is at all. Convolution in DL is simply a set of dot products between patches of the input and a bunch of filters. Each resulting dot product is simply a measure of similarity between a patch and a filter. That's all there is to it.
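To make that concrete, here's a from-scratch sketch of the sliding-dot-product picture for one channel and one filter (the helper name dl_conv2d is made up; stride 1, no padding), checked against SciPy's cross-correlation:

```python
import numpy as np
from scipy.signal import correlate2d

def dl_conv2d(x, k):
    """DL-style 'convolution': slide the filter over the input and take dot products (no flip)."""
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)   # similarity of this patch to the filter
    return out

x = np.random.rand(5, 5)
k = np.random.rand(3, 3)
print(np.allclose(dl_conv2d(x, k), correlate2d(x, k, mode="valid")))   # True
```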
IMO calling it "convolution" in deep learning is extra confusing, because the word "convolution" means many fairly different things in other contexts.
The idea behind convolution in deep learning is that, if a particular pattern of pixels is meaningful, then it is probably also meaningful when you shift the whole thing in some direction. So you can force some layers of the network to apply the same weights at every position (i.e. to behave the same under translation), and it'll be faster to pick up some sorts of patterns.
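A small sketch of that shift property: because every position shares the same filter weights, shifting the input pattern just shifts the (DL-style) convolution output by the same amount.

```python
import numpy as np
from scipy.signal import correlate2d   # cross-correlation, i.e. the DL-style convolution

k = np.random.rand(3, 3)               # one shared filter

x = np.zeros((9, 9))
x[2, 2] = 1.0                          # a "pattern" at one location
x_shifted = np.roll(x, shift=(2, 3), axis=(0, 1))   # the same pattern, moved

y = correlate2d(x, k, mode="valid")
y_shifted = correlate2d(x_shifted, k, mode="valid")

# Shifting the input just shifts the response (as long as nothing falls off the edge)
print(np.allclose(np.roll(y, shift=(2, 3), axis=(0, 1)), y_shifted))   # True
```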
It's faster because it reduces the dimensionality of the inputs down to something manageable (hundreds or low thousands). You can replace convolutions with most other types of dimensionality reduction (including other types of layers), and outside of image tasks you'll get very similar or even better performance.
Convolution was an established term in image processing long before convnets were invented in the 90s (or 80s, or whenever). It's the same operation. It's useful to learn a bit of basic image processing (edge detection and the like) before jumping straight into the flashiest, shiniest DL model with no basic foundations to conceptualize what is happening.
(Even before that it has been used in signal processing.)
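For instance, the classic Sobel edge detector is just a fixed, hand-designed convolution kernel; a convnet's first layer learns kernels that play the same role. A toy sketch:

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel kernel for horizontal gradients (responds to vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

img = np.zeros((8, 8))
img[:, 4:] = 1.0                          # a toy image with one vertical edge

edges = convolve2d(img, sobel_x, mode="same", boundary="symm")
print(np.abs(edges).max(axis=0))          # big response only at the edge columns
```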