I would challenge you to find a processor on which rsqrt plus two Newton-Raphson iterations is not slower than a plain sqrt. (We don't know which -mtune setting the author used.)
The author probably didn't use any -mtune setting, which is likely the problem. If you compare older and newer cores in Agner's instruction tables, SQRT has been getting steadily faster over time. The rsqrt-based implementation is slightly faster on old Intel machines, for example.
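
For anyone who hasn't seen the trick being argued about, here is a minimal sketch in C of "rsqrt plus two Newton-Raphson iterations". It is not the author's code: the function names `rsqrt_estimate` and `sqrt_via_rsqrt` are made up for illustration, it assumes a positive, finite, normal float input, and on non-SSE targets it swaps in the well-known bit-trick estimate instead of the hardware rsqrt instruction.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: low-precision estimate of 1/sqrt(x). */
static float rsqrt_estimate(float x) {
#if defined(__SSE__)
    /* Hardware approximation (roughly 12 bits of precision). */
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    /* Assumption: without SSE, fall back to the classic integer bit trick,
       which is a cruder starting point than the hardware instruction. */
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);
    return y;
#endif
}

/* Hypothetical name: sqrt(x) computed as x * (1/sqrt(x)). */
float sqrt_via_rsqrt(float x) {
    if (x == 0.0f) {
        return 0.0f; /* avoid 0 * inf = NaN */
    }
    float y = rsqrt_estimate(x);
    /* Two Newton-Raphson steps for f(y) = 1/y^2 - x:
       y_next = y * (1.5 - 0.5 * x * y^2). */
    y = y * (1.5f - 0.5f * x * y * y);
    y = y * (1.5f - 0.5f * x * y * y);
    return x * y;
}
```

Whether this beats a plain `sqrtf(x)` is exactly the question above, so it needs to be benchmarked on the core you actually target, with the tuning flags you actually ship.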