Even simpler: just sum all elements of the array. Then at the end subtract 'p'\*...

xoranth · on July 7, 2023

That's likely the fastest way to do that without vectorization. But you'd need to upcast 's' to a uint64 (or at least a uint32), which means that vectorization would operate on 32- or 64-bit lanes.
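For illustration, here is a minimal sketch of the sum-everything idea with a 64-bit accumulator. It assumes the input contains only 's' and 'p' bytes and that the goal is the count of 's' minus the count of 'p'. The name `run_switches_sum` and the final arithmetic are my own guess at how the trick completes, not code from the parent comment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns (#'s' - #'p').
 * Assumption: every byte of `input` is either 's' or 'p'. */
int64_t run_switches_sum(const char *input, size_t n)
{
    uint64_t sum = 0;  /* wide accumulator, as suggested above */
    for (size_t i = 0; i < n; i++)
        sum += (unsigned char)input[i];

    /* sum = ns*'s' + np*'p' and n = ns + np,
     * so sum - 'p'*n = ns * ('s' - 'p'). */
    uint64_t ns = (sum - (uint64_t)'p' * n) / (uint64_t)('s' - 'p');
    uint64_t np = n - ns;
    return (int64_t)ns - (int64_t)np;
}
```

If the input can contain other bytes, this arithmetic no longer holds and a per-byte comparison is needed instead.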

With vectorization, I think the way to go is to have two nested loops: an outer one that advances by 32 * 255 elements at a time, and an inner one that loads 32 bytes, compares each character to 's', and accumulates on 8-bit lanes.

Then in the outer loop you do a horizontal sum of the 8-bit accumulators.
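A rough sketch of that blocking scheme in portable C, leaving the actual vectorization to the compiler rather than using intrinsics. The name `count_s_blocked` and the constants are illustrative assumptions. The key point is that each 8-bit lane is incremented at most 255 times before it is flushed, so it cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 32, MAX_INNER = 255 };

/* Hypothetical helper: counts the bytes equal to 's'. */
size_t count_s_blocked(const char *input, size_t n)
{
    const unsigned char *p = (const unsigned char *)input;
    size_t total = 0;
    size_t i = 0;

    /* Outer loop: consumes up to LANES * MAX_INNER bytes per pass. */
    while (n - i >= LANES) {
        uint8_t acc[LANES] = {0};  /* one 8-bit counter per lane */
        size_t chunks = (n - i) / LANES;
        if (chunks > MAX_INNER)
            chunks = MAX_INNER;

        /* Inner loop: 32 bytes at a time, compare and accumulate. */
        for (size_t c = 0; c < chunks; c++) {
            for (size_t l = 0; l < LANES; l++)
                acc[l] = (uint8_t)(acc[l] + (p[i + l] == 's'));
            i += LANES;
        }

        /* Horizontal sum of the 8-bit accumulators. */
        for (size_t l = 0; l < LANES; l++)
            total += acc[l];
    }

    /* Scalar tail for the last (n % LANES) bytes. */
    for (; i < n; i++)
        total += (p[i] == 's');
    return total;
}
```

Whether the compiler actually turns the inner loop into 32-byte vector compares depends on the target and flags, so treat this as a shape to aim for rather than a guaranteed result.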

fuber2018 · on July 7, 2023

My SWAR version almost does what your vectorization algorithm description does - just that the SWAR-code looks rather gnarly because the compiler isn't auto-generating the vector code for you, it's hand-coded in C by me and I'm limited to 64 bits at a time.

gpderetta · on July 7, 2023

Indeed, the blocked vectorization with 8-bit accumulators shown elsethread is going to be faster, and there reducing the sum to 1 bit per iteration is worth it.

fuber2018 · on July 7, 2023

I took the 64-bit SWAR ('S'IMD-'W'ithin-'A'-'R'egister) road and passed in the string length - the calling code has the length "right there"!!!

Using the original run_switches function, app took 3.554s (average of 10 runs).

With the SWAR-version with the string length passed in, app took 0.117s (average of 10 runs).

That's an overall 27.6x speedup.
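For readers who haven't seen SWAR before, here is a minimal sketch of the general technique, not the code that produced the timings above. It assumes the result is the count of 's' minus the count of 'p', with any other byte ignored. The names `match_bytes` and `run_switches_swar` are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES UINT64_C(0x0101010101010101)
#define LOW7 UINT64_C(0x7f7f7f7f7f7f7f7f)

/* Returns 0x01 in every byte position where `w` holds the byte `c`. */
static uint64_t match_bytes(uint64_t w, unsigned char c)
{
    uint64_t x = w ^ (ONES * c);     /* matching bytes become 0x00 */
    uint64_t t = (x & LOW7) + LOW7;  /* bit 7 set if any low bit is set */
    t = ~(t | x | LOW7);             /* 0x80 exactly where the byte was 0 */
    return t >> 7;                   /* 0x01 per matching byte */
}

/* Hypothetical SWAR variant: 8 bytes per iteration, length passed in. */
int64_t run_switches_swar(const char *input, size_t n)
{
    int64_t res = 0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, input + i, 8);    /* portable unaligned load */
        /* Multiplying by ONES adds all eight bytes into the top byte. */
        res += (int64_t)((match_bytes(w, 's') * ONES) >> 56);
        res -= (int64_t)((match_bytes(w, 'p') * ONES) >> 56);
    }

    /* Scalar tail for the last (n % 8) bytes. */
    for (; i < n; i++) {
        if (input[i] == 's') res++;
        else if (input[i] == 'p') res--;
    }
    return res;
}
```

Passing the length in removes the per-byte check for the terminating NUL, which is what lets the loop consume whole 64-bit words.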

fuber2018 · on July 7, 2023

If I unroll the main while loop to handle 4x as much each time through the loop in the SWAR-version, the runtime drops to 0.0562s (average 10 runs).

That's an overall 57.5x speedup.
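One plausible shape for a 4x unroll is sketched below. It is an assumption about the approach, not the actual code measured here. The per-byte match masks from four words are added together first (each byte lane holds at most 4), so only one horizontal sum per character is needed for every 32 input bytes. `match_bytes` and `run_switches_swar_x4` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES UINT64_C(0x0101010101010101)
#define LOW7 UINT64_C(0x7f7f7f7f7f7f7f7f)

/* Returns 0x01 in every byte position where `w` holds the byte `c`. */
static uint64_t match_bytes(uint64_t w, unsigned char c)
{
    uint64_t x = w ^ (ONES * c);
    uint64_t t = (x & LOW7) + LOW7;
    return ~(t | x | LOW7) >> 7;
}

/* Hypothetical 4x-unrolled variant: 32 bytes per iteration. */
int64_t run_switches_swar_x4(const char *input, size_t n)
{
    int64_t res = 0;
    size_t i = 0;

    for (; i + 32 <= n; i += 32) {
        uint64_t w[4];
        memcpy(w, input + i, 32);

        /* Each byte lane of s and p ends up in the range 0..4. */
        uint64_t s = match_bytes(w[0], 's') + match_bytes(w[1], 's')
                   + match_bytes(w[2], 's') + match_bytes(w[3], 's');
        uint64_t p = match_bytes(w[0], 'p') + match_bytes(w[1], 'p')
                   + match_bytes(w[2], 'p') + match_bytes(w[3], 'p');

        /* One horizontal sum each; the total is at most 32, so it fits. */
        res += (int64_t)((s * ONES) >> 56);
        res -= (int64_t)((p * ONES) >> 56);
    }

    /* Scalar tail for the last (n % 32) bytes. */
    for (; i < n; i++) {
        if (input[i] == 's') res++;
        else if (input[i] == 'p') res--;
    }
    return res;
}
```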

fuber2018 · on July 7, 2023

If I convert the unrolled-64-bit SWAR function to use 32-bit chunks instead, average runtime almost doubles, approx. 0.1s now.

Need sleep now.

fuber2018 · on July 7, 2023

If I unroll the 64-bit SWAR version by 8x instead of 4x, the runtime is reduced by another 10% over the 4x-unrolled SWAR version. Diminishing returns...