I wouldn't call this a performance bug in clang. It's an optimization working as...

dooglius · 2024-06-22T13:52:04 1719064324

I would challenge you to find a processor on which rsqrt plus two Newton-Raphson iterations is not slower than a plain sqrt. (We don't know what -mtune the author used.)
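For readers unfamiliar with the sequence being discussed, here is a minimal portable sketch of "reciprocal square root estimate plus two Newton-Raphson iterations". It is not the code clang emits. The function names are hypothetical, and `rough_rsqrt` is only a software stand-in (the well-known bit-level trick) for a hardware estimate instruction such as x86 `rsqrtss`, whose initial accuracy differs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware rsqrt estimate: a rough
   approximation of 1/sqrt(x) for positive, normal x, obtained by
   manipulating the float's bit pattern. Real hardware estimates
   have a different (usually better) starting accuracy. */
static float rough_rsqrt(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);      /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);    /* crude estimate of 1/sqrt(x) */
    float y;
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton-Raphson step for y ~ 1/sqrt(x):
   y_next = y * (1.5 - 0.5 * x * y * y). */
static float nr_step(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}

/* sqrt(x) computed as x * (1/sqrt(x)), with the estimate refined
   by two Newton-Raphson steps. */
static float sqrt_via_rsqrt(float x) {
    float y = rough_rsqrt(x);
    y = nr_step(x, y);
    y = nr_step(x, y);
    return x * y;
}
```

Each refinement step costs several dependent multiplies and a subtract, so the whole chain only wins when the hardware square root is slow by comparison, which is the point in dispute here.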

GrantMoyer · 2024-06-22T16:04:15 1719072255

According to Intel, any processor before Skylake (section 15.12 of [1]).

[1]: https://cdrdv2.intel.com/v1/dl/getContent/814198?fileName=24...

pclmulqdq · 2024-06-22T14:32:40 1719066760

The author probably didn't use any -mtune setting, which is likely the problem. If you compare older cores in Agner Fog's instruction tables, SQRT has been getting steadily faster over time. This implementation is slightly faster on old Intel machines, for example.