I recall that when the Pentium was introduced we were told to avoid rep and write a carefully tuned loop ourselves. To go really fast, one could use the FPU to do the loads and stores.
Not too long ago I read in Intel's optimization guidelines that rep was now faster again and should be used.
Seems most of these things need to be benchmarked on the actual CPU, as they change "all the time". I've sped up plenty of code just by replacing hand-crafted assembly with functionally equivalent high-level code.
Of course so-slow-it's-bad is a different matter, but choosing the implementation at runtime would avoid that as well.
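To make the runtime-choice idea concrete, here is a rough sketch in C: time each candidate once at startup and keep a function pointer to whichever wins on this machine. Everything in it is made up for illustration (copy_fn, copy_loop, copy_libc, time_copy, pick_copy, the 64 KiB buffer, the 200 repetitions), and the two portable candidates just stand in for a rep-based copy and a hand-tuned loop. An optimizing compiler may also rewrite or drop the timed copies, so treat it as the shape of the approach, not a trustworthy benchmark.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical common signature for all copy candidates. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Candidate 1: plain byte loop (stand-in for a hand-written routine). */
static void *copy_loop(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* Candidate 2: whatever the C library's memcpy does on this platform. */
static void *copy_libc(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* CPU time in seconds for `reps` copies of `n` bytes using `fn`. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t n, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Time both candidates once and return the faster one.
   Buffer size and repetition count are arbitrary assumptions. */
static copy_fn pick_copy(void)
{
    static unsigned char src[1 << 16], dst[1 << 16];
    memset(src, 0xA5, sizeof src);
    double t_loop = time_copy(copy_loop, dst, src, sizeof src, 200);
    double t_libc = time_copy(copy_libc, dst, src, sizeof src, 200);
    return t_loop < t_libc ? copy_loop : copy_libc;
}
```

You would call pick_copy() once at startup, stash the result in a global copy_fn, and route all copies through it. A real version would probably benchmark several sizes, since the winner tends to differ between small and large copies.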