Hacker News new | past | comments | ask | show | jobs | submit login

Nice article, but I suspect the implementation will not work. I did essentially the same for an AI-class exercise, and was thrilled to see that you could write a working Bayes classifier in 60 short lines of Python code. But later I landed a free-lance job that required writing a classifier that could be applied to real world data, and I soon realized that repeated multiplication of numbers between 0 and 1 sends you to zero too fast for the implementation to actually work. I might have missed it in the code, but I think he's doing the same mistake: you need to normalize or move to logarithms for the estimation of probabilities to work for medium or large datasets.



I'm the author of the article ...

Yes you are right, it's better to convert the whole thing to a sum of logs, otherwise you end up with floating-point underflow.

The article was getting already too long however, but I'll add a note about it because it is an important optimization that affects both speed and even correctness (because of the underlying limitations of floating point).

UPDATE: note added.


Is the sum of logs method mathematically equivalent to the multiplication of probabilities (i.e. will it always produce a monotonic ordering of class predictions)?


log(AB) = log(A) + log(B) for all A, B real numbers


Only for A, B > 0, but that's good enough for probabilities (except 0, I guess)


That was there to test you...


Even better, convert everything to log(p/(1-p)) log-odds, and halve the number of things to sum.


I think this should also have the added effect of being overall faster, as doing lots of addition would be quicker than doing lots of multiplication (log notwithstanding).


That could be true in a hardcore numerical computation, assuming you know enough about your architecture and compiler to make use of it.

In the context of this article and implementation, the speed difference between + and * is utterly irrelevant and unmeasurable.


On Ruby you can use the Decimal class, although it's really slow




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: