I understand your point, but it's not as simple as it seems. Of course, for trivial code the transition between different SIMD flavors could be seamless. But the world is cruel. :)

Think about shuffle instructions (pshufb): the lookup vectors for the instruction differ between AVX2 and SSE, because the AVX2 version shuffles each 128-bit lane separately. Even if an AVX2 vector could be created by cloning an SSE vector twice, this must be the programmer's decision.
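To make the pshufb point concrete, here is a rough scalar model in Python, not real intrinsics. The function names and the two sample tables are my own illustration; the only thing assumed about the hardware is the documented behaviour that each index byte uses its low 4 bits, bit 7 zeroes the output byte, and the AVX2 form handles the two 128-bit lanes independently.

```python
def pshufb(src, idx):
    """Scalar model of SSE pshufb on 16-byte vectors (lists of ints)."""
    assert len(src) == 16 and len(idx) == 16
    # Bit 7 of the index zeroes the byte; otherwise the low 4 bits select.
    return [0 if i & 0x80 else src[i & 0x0F] for i in idx]


def vpshufb(src, idx):
    """Scalar model of AVX2 vpshufb: two independent 16-byte lanes."""
    assert len(src) == 32 and len(idx) == 32
    return pshufb(src[:16], idx[:16]) + pshufb(src[16:], idx[16:])


# Case 1: a 16-entry lookup table (popcount of a nibble).
# Cloning the table into both lanes gives the intended result.
nibble_popcount = [bin(n).count("1") for n in range(16)]
nibbles = [i % 16 for i in range(32)]
counts = vpshufb(nibble_popcount * 2, nibbles)

# Case 2: a "reverse all 32 bytes" index vector.
# Indices cannot reach across lanes, so each half is reversed in place.
data = list(range(32))
reverse_idx = [31 - i for i in range(32)]
reversed_per_lane = vpshufb(data, reverse_idx)

print(counts)
print(reversed_per_lane)
```

In the first case cloning is exactly right; in the second it silently computes something else. A compiler or wrapper library can't tell which case it is looking at, which is why the choice has to stay with the programmer.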

Another example is an algorithm that uses the video-encoding instruction mpsadbw to locate substrings (http://0x80.pl/articles/sse4_substring_locate.html#introduct...). The AVX2 instruction vmpsadbw operates on 128-bit lanes, and parts of the algorithm have to be rewritten to accommodate this limitation.
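The lane limitation can be sketched the same way. This is again a scalar Python model with names and test data I made up, and it is not the code from the linked article; it only assumes the documented semantics that mpsadbw compares one 4-byte block against 8 sliding windows, and that vmpsadbw does this per 128-bit lane with its own 3 immediate bits for each lane.

```python
def mpsadbw(a, b, imm):
    """Scalar model of SSE4.1 mpsadbw: 8 sums of absolute differences."""
    a_off = ((imm >> 2) & 1) * 4      # start of the sliding windows in a
    b_off = (imm & 3) * 4             # which 4-byte block of b to compare
    block = b[b_off:b_off + 4]
    return [sum(abs(a[a_off + i + j] - block[j]) for j in range(4))
            for i in range(8)]


def vmpsadbw(a, b, imm):
    """Scalar model of AVX2 vmpsadbw: each 128-bit lane on its own."""
    lo = mpsadbw(a[:16], b[:16], imm & 7)
    hi = mpsadbw(a[16:], b[16:], (imm >> 3) & 7)
    return lo + hi


# A sum of 0 means the 4-byte prefix matches at that window position.
text = list(b"xxxxxxxxxxabcdxxxxxxabcdxxxxxxxx")   # "abcd" at 10 and 20
prefix = list(b"abcd") * 8                          # prefix in every block

sads = vmpsadbw(text, prefix, 0)
# Word k of the low lane is position k; word k of the high lane is 16 + k.
positions = list(range(8)) + list(range(16, 24))
hits = [pos for pos, sad in zip(positions, sads) if sad == 0]

print(hits)
```

One 256-bit instruction covers positions 0..7 and 16..23, not 0..15, so the occurrence at position 10 is not seen here. Widening the loop from 128 to 256 bits therefore means changing how the input is walked, not just swapping the register type.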