Hacker News
Using neural nets to recognize handwritten digits (neuralnetworksanddeeplearning.com)
121 points by ivoflipse on Nov 25, 2013 | 31 comments


Very well written, and I applaud the effort. But personally I don't care for the "magical" aura that writers tend to give ANNs - to me, they are simply (non-linear) function approximators that have a nice fitting algorithm. They work well for some problems and poorly for others. Also, beware of over-fitting - ANNs tend to be parameter-heavy, although there are approaches to prune the connections.


The mysticism around ANNs seems to come and go at least once per developer generation, and has done so since the '50s.


I think this problem would be significantly lessened if there were no "neural" in "neural network." That said, that one word has probably brought a significant number of eyes to the subject that would otherwise have looked elsewhere.


>they are simply (non-linear) function approximators that have a nice fitting algorithm.

Couldn't "function approximator" describe most machine learning approaches? And a nice fitting algorithm is of course the goal.


Yes.

I'd go further - the "nice fitting algorithm" is to minimise the error on the training set as a function of the weight parameters, and one obvious way to (locally) minimise that is gradient descent + the chain rule.

The math / applied math machinery is all incredibly general, and a useful way to think about many machine learning algorithms.

http://en.wikipedia.org/wiki/Gradient_descent http://en.wikipedia.org/wiki/Chain_rule#Higher_dimensions

Not to say that applied math is the only valuable perspective; there are clearly statistical and computational views as well. E.g. we're trying to approximate a function that we can never directly evaluate (the error on the samples we haven't seen yet).
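To make that concrete, here's a minimal sketch (toy data and a one-parameter model, all made up for illustration) of exactly that recipe: minimise the training-set error as a function of the weight, with the chain rule supplying the gradient:

```python
# Toy sketch (made-up data): fit y ≈ w * x by gradient descent on the
# training-set mean squared error, using the chain rule for the gradient.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]                  # roughly y = 2x

def grad(w):
    # chain rule: d/dw (w*x - y)^2 = 2 * (w*x - y) * x
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0
for _ in range(200):
    w -= 0.05 * grad(w)               # step against the gradient

print(round(w, 2))                    # converges to the least-squares slope
```

Everything generalises from here: more parameters just means the gradient becomes a vector, and deeper models just mean more applications of the chain rule.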


I believe svantana was merely trying to differentiate ANNs from magic, not ANNs from other ML algorithms. Indeed you are correct that machine learning as a whole could be seen as function fitting.


If you liked his first chapter, consider supporting his IndieGogo campaign for the whole book (http://www.indiegogo.com/projects/neural-networks-and-deep-l...).


In case the author is reading this thread - it might be worth adding a couple more reward tiers, for example an equivalent of the "major sponsor" at an individual level, maybe $30-60 to be named somewhere as a supporter. $15-$200 seems like a very big gap (and of course you can choose to donate in between, but I presume that reward tiers are effective in pushing people to donate more).

Looking forward to reading chapter one when I have time, though I suspect it will confuse me quite a lot...

edit: I see he is; he already commented before I wrote this


"The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation."

I think this comment from the article needs caveats. Of course, a neural network would not qualify as Turing complete, if only because it's finite. Keep in mind also that a neural network, lacking anything like counters, tape, or recursion, couldn't approximate a Turing machine in the way that a finite von Neumann architecture machine does. (A NN can represent any given function over a domain if it gets large enough, which is more like the universality of a finite automaton.)

I know this is a reference to this generation of NNs having overcome an earlier problem of not being able to represent a NAND gate, but still, it's worth keeping in mind that an ordinary computer can simulate a NN with just a program, while the reverse doesn't hold, so NNs in that sense are far from universal.


Networks of perceptrons are universal in the standard sense used when talking about circuits --- they can compute any finite Boolean function.

I agree that the relationship between circuit complexity and Turing machines is somewhat subtle, for the reasons you mention. The relationship is greatly clarified by the notion of uniform circuit complexity, which makes it possible to prove an equivalence between a (carefully defined notion of) circuit complexity and Turing machine complexity. Unfortunately, I don't know of a good online treatment of uniform circuit complexity. I learnt it through a 1993 paper by Andy Yao, but that's definitely not a good introductory reference!

In any case, in my book I'm using the term universal in the same way as people usually use it for circuits, i.e., it means the same thing as when people say that the NAND gate is universal for computation. Hope that clarifies things.
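To make that circuit-style universality concrete, here's a quick sketch (in Python, not taken from the book): a perceptron with weights -2, -2 and bias 3 computes NAND, and NANDs compose into any finite Boolean function - XOR, for instance, via the standard four-NAND circuit.

```python
def perceptron(weights, bias, inputs):
    # fire 1 if the weighted sum plus bias is positive, else 0
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def nand(a, b):
    # weights -2, -2 and bias 3 implement NAND
    return perceptron([-2, -2], 3, [a, b])

def xor(a, b):
    c = nand(a, b)                    # the standard four-NAND XOR circuit
    return nand(nand(a, c), nand(b, c))

print([(a, b, nand(a, b), xor(a, b)) for a in (0, 1) for b in (0, 1)])
```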


This is a cool exercise! After completing it, I wanted to find out exactly what each NN hidden node represented. I trained a tiny (10 hidden node) NN on an OCR dataset and created a visualization here: https://rawgithub.com/tashmore/nn-visualizer/master/nn_visua... .

Can anyone figure out what each hidden node represents?

You can also select a node and press "A" (Gradient Ascent). This will change the input in a way that increases the selected node's value. By selecting an output node and mashing "A", you can run the NN in reverse, causing it to "hallucinate" a digit.
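Roughly, the "A" key does something like the following toy illustration (made-up weights and a 3-"pixel" input, not the visualizer's actual code): compute the gradient of the selected node's activation with respect to the input, and step the input uphill along it.

```python
import math

# Toy illustration (made-up weights): gradient ascent on the *input*,
# nudging each pixel so that one sigmoid node's activation increases.
W = [0.5, -1.0, 2.0]                  # weights into the selected node

def activation(x):
    z = sum(w * xi for w, xi in zip(W, x))
    return 1 / (1 + math.exp(-z))     # sigmoid

x = [0.0, 0.0, 0.0]                   # the "input image", 3 pixels here
for _ in range(100):
    a = activation(x)
    g = a * (1 - a)                   # sigmoid'(z)
    # chain rule: d(activation)/dx_i = sigmoid'(z) * W[i]; step uphill
    x = [xi + 0.5 * g * w for xi, w in zip(x, W)]

print(activation(x) > 0.9)            # the node's value has been driven up
```

Mashing "A" on an output node is the same idea, just backpropagated through the whole network, which is why the input starts to look like the digit that node detects.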


What about convolutional neural nets? They weren't mentioned, but that's really what most of the deep learning approaches use...


They're discussed later in the book. The first chapter is an introduction, and I didn't want to introduce convolutional nets before (for example) fundamental techniques such as stochastic gradient descent and backpropagation.
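For the impatient, here's a minimal sketch of those two techniques working together (a toy 2-2-1 sigmoid net fit to the OR function; this is illustrative only, not the book's code):

```python
import math
import random

# Toy sketch (not the book's code): stochastic gradient descent plus
# backpropagation on a tiny 2-2-1 sigmoid network, fit to the OR function.
random.seed(0)

def sig(z):
    return 1 / (1 + math.exp(-z))

# small random initial weights for both layers
W1 = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [random.uniform(-1, 1), random.uniform(-1, 1)]
b2 = 0.0

def forward(x):
    h = [sig(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
    o = sig(W2[0] * h[0] + W2[1] * h[1] + b2)
    return h, o

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

def loss():
    return sum((forward(x)[1] - y) ** 2 for x, y in data)

loss_before = loss()
for _ in range(10000):
    x, y = random.choice(data)        # "stochastic": one example per step
    h, o = forward(x)
    d_o = (o - y) * o * (1 - o)       # error signal at the output
    for j in range(2):                # chain rule back through each weight
        d_h = d_o * W2[j] * h[j] * (1 - h[j])
        W2[j] -= 0.5 * d_o * h[j]
        W1[j][0] -= 0.5 * d_h * x[0]
        W1[j][1] -= 0.5 * d_h * x[1]
        b1[j] -= 0.5 * d_h
    b2 -= 0.5 * d_o
loss_after = loss()

print(loss_after < loss_before)       # training error should have dropped
```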


Great! How far does the book go in terms of advanced approaches? Up to the current state of research?


My current plan is to describe some pretty recent results -- most likely, the big breakthrough on ImageNet by Krizhevsky, Sutskever and Hinton (http://www.cs.utoronto.ca/~ilya/pubs/2012/imgnet.pdf), which uses convolutional nets. I may also describe the famous Google-Stanford "cat neuron" paper (http://ai.stanford.edu/~ang/papers/icml12-HighLevelFeaturesU... ). But at this point things are moving so quickly that I'll keep my options open, and if more exciting things come up, I may change my plans.

Of course, there's a tremendous amount going on, so my broader philosophy is to focus on fundamentals. Readers who thoroughly master the core ideas shouldn't have much trouble later getting up to speed with the result-of-the-month.


>some pretty recent results -- most likely, the big breakthrough on ImageNet by Krizhevsky, Sutskever and Hinton (http://www.cs.utoronto.ca/~ilya/pubs/2012/imgnet.pdf), which uses convolutional nets.

Kernels learned by the first convolutional layer (Figure 3 on page 6) bear an uncanny resemblance to the Gabor-function-modeled orientation-selective cells ("bar and grating cells") in the primary visual cortex. Looks like computers are on the right track :)

http://www.cs.rug.nl/~petkov/publications/bc1997.pdf

"The discovery of orientation-selective cells in the primary visual cortex of monkeys almost 40 years ago and the fact that most of the neurons in this part of the brain are of this type ..."

The difference here is a "numbers game" - the visual cortex contains cells whose receptive fields' positions, eccentricities, sizes, orientations, and numbers of excitatory and inhibitory zones (e.g. Fig. 1 in the link) make a reasonable coverage of the space of possible values. I.e. the number of these cells is in the millions vs. 96. Of course it is only a matter of computing power to run all reasonable combinations of kernels emulating the real visual cortex, yet it would put an immense computational burden on the second and later layers until we understand what should happen there.


FWIW, many vision researchers believe that the resemblance of the first convolutional layer to Gabor filters is perhaps more a case of selection bias than anything else. The argument goes that were they not the output of the first layer, that paper wouldn't get accepted =)

I'm not sure if I fully believe this, but certainly there doesn't seem to be a very principled way to choose your network architecture. Different people propose different ones, and the fundamental justification for each one seems to be: "look, we recreate gabor filters in layer 1 and we get good numbers at the end!"

Of course, NN people argue that that's almost exactly what vision people do as well, except in "feature-land" rather than "architecture-land".


>FWIW, many vision researchers believe that the resemblance of the first convolutional layer to Gabor filters is perhaps more a case of selection bias than anything else. The argument goes that were they not the output of the first layer, that paper wouldn't get accepted =)

Well, I can see the temptation - orientation and spatial frequency selectivity are the major characteristics of cells in V1, and the receptive field of the first layer there does look like a Gabor function

http://www.scholarpedia.org/article/Area_V1#Receptive_fields

I agree that such a close resemblance of the learned kernels to Gabor functions is almost too good; that's why I used "uncanny" :) If it is real, then I think it manifests very interesting and, no pun intended, deep emergent properties of the neural-net learning process (something along the lines of "maximum-entropy kernels that still do the job" as the asymptotic state)

Btw, is it really selection or confirmation bias?

And to expand on the previous point about convolving the input with many, many kernels - it happens to be on the order of 40 per "pixel":

"V1 contains a vast number of neurons. In humans, it contains about 140 million neurons per hemisphere (Wandell, 1995), i.e. about 40 V1 neurons per LGN neuron. Such divergence gives scope for extensive processing of the images received from LGN."


Ahh, the MNIST database of handwritten digits. I never took an ML course, and I was only able to achieve an 87% recognition rate 6 years ago for a university software engineering project. I read about others achieving 99.9% recognition rates with their ANNs, so I wasn't happy with my result. I tried to self-study ANNs but found most material to be either too simple or too complicated. I finally found some articles about ANNs (http://visualstudiomagazine.com/Articles/List/Neural-Network...) with code samples in C#, so I'll finally be looking into rewriting my old code to get a better result.


That reminds me of a video I saw about something called restricted Boltzmann machines:

http://www.youtube.com/watch?v=AyzOUbkUf3M&t=24m0s


Caveat emptor. Geoff's videos are a great way to launch into the field of deep learning, but do bear in mind that they are beginning to age. A lot of the stuff about why things work, what is state-of-the-art, and where the work is headed is now dated (even according to Hinton himself).


I'm not sure if I'm making a mistake, but I couldn't use the command listed to clone the repository. I'm using Windows with git installed, and I received the error: "Permission Denied: publickey"

I was able to get everything by looking you up on GitHub and using the URL of the repository.

Edit: Also, you might mention the repository earlier, because it's rather large and I've had to break from the book while it downloads.


Thanks for the tip, I'll look into it!


I did exactly this for a school project about a year ago:

https://github.com/bcuccioli/neural-ocr

There's a paper in there that explains the design of the system and my results, which weren't great, probably due to the small size of the training data.


Isn't that the same material covered on Andrew Ng's Coursera course "Machine Learning", down to the training data?


It may well be similar material to Ng's Coursera course, but the table of contents on the right shows that this is obviously going to transition into the topic of deep networks -- part of Ng's research but not his Coursera course.

The training data is the MNIST dataset released by NIST some time ago; anyone is allowed to use it. It truly is no surprise to see it here, as it is a very commonly used dataset in ML tutorials/books. It receives some discussion in Artificial Intelligence: A Modern Approach by Russell and Norvig, and even in the Theano getting-started tutorials.


That's right. In fact, for a subclass of image recognition tasks, MNIST has become the standard benchmark. Pretty much all of Geoff Hinton's early work on deep learning used MNIST to track the progress of his (their) methods.

In fact, if the goal of the book is to educate people on the field then I would say it's definitely best to use the standard benchmarks. It lets readers relate what's in the book to the literature, should they desire, and just like in academia, it lends credence to the author's statements. I've seen people take a lot of flak for publishing writings that don't use the standard datasets, but make claims of progress.


That's interesting info about the dataset, thank you.


The handwritten digits and Iris databases are the most frequently used ones in most books and tutorials I've read. I think it's a good idea, because you can compare between different teachers/techniques that base their content on the same core data.


Yeah it sure seems like it. He went a little more in depth than the course does, but I recognized that training data right away.


Yeah, it is. In fact, my school had the same assignment during my ML course too.



