Changing branch alignment causes swings in performance (gmane.org)
45 points by luu on June 17, 2015 | hide | past | favorite | 8 comments



A fuller description of the post-decode uop cache (with pictures!) is here: http://www.realworldtech.com/haswell-cpu/2.

Note that there are two paths for instructions: one from the L1 icache through the traditional decoders into the instruction queue, and another from the post-decode cache directly into the instruction queue. There are numerous advantages to the cache, such as power saved by idling the decode logic, as well as bypassing the 16-byte fetch restriction (which has been a feature of the architecture since the Pentium Pro days).

The gist of the surprising behavior is that the processor cannot execute out of the uop cache if a given 32-byte (naturally aligned) section of code decodes to more than 3 lines of 6 uops each (with the catch being that a branch ends any given line). In that case it falls back to the traditional instruction fetch/decode. Depending on the alignment of branches, you may or may not run into this limitation on an otherwise identical sequence of instructions.
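The rule is easy to make concrete with a toy model. This is only an illustration of the constraint as stated above (32-byte window, at most 3 lines of 6 uops, a branch ending its line), not a hardware-accurate simulator, and all the names are made up:

```python
from collections import defaultdict

def fits_in_uop_cache(insns, window=32, max_lines=3, line_uops=6):
    """insns: list of (address, uop_count, is_branch) tuples.
    Returns True if every naturally aligned 32-byte window of code needs
    at most max_lines uop-cache lines, where each line holds up to
    line_uops uops and a branch terminates its line."""
    windows = defaultdict(list)
    for addr, uops, is_branch in insns:
        windows[addr // window].append((uops, is_branch))
    for group in windows.values():
        lines, used = 0, line_uops          # force a new line on the first uop
        for uops, is_branch in group:
            if used + uops > line_uops:     # current line full: open another
                lines += 1
                used = 0
            used += uops
            if is_branch:                   # a branch ends the current line
                used = line_uops
        if lines > max_lines:
            return False
    return True

# Eight 2-uop, 4-byte instructions in one window: 16 uops, 3 lines -- fits.
straight = [(i * 4, 2, False) for i in range(8)]

# Four 2-uop branches packed into one window: each needs its own line -- 4
# lines, so this falls back to the legacy decoders.
branchy = [(i * 4, 2, True) for i in range(4)]

# The same four branches shifted so they straddle a 32-byte boundary:
# two lines per window, and the sequence fits again.
shifted = [(24 + i * 4, 2, True) for i in range(4)]
```

Running `fits_in_uop_cache` on `branchy` versus `shifted` shows exactly the alignment dependence described: an identical instruction sequence passes or fails the limit purely based on where it starts.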


This caused me a lot of grief back when I was working on Protobufs and doing lots of microbenchmarking. I'd often make a change and find it affected the performance of test cases that didn't even execute the changed code, sometimes by double-digit percentages.

Another problem that can cause a lot of noise between two executions of the same executable is the positioning of data. For instance, two objects on the heap can alias in the TLB. If you run your microbenchmark in a loop reusing the same data structure over and over (as my benchmarks tended to do), then there can be a huge difference in performance depending on where those structures landed in the heap. I ended up fixing this one by allocating 100 different copies of the structure and cycling through them.
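The fix can be sketched like this. This is a hypothetical harness, not the actual protobuf benchmark code; `benchmark` and its parameters are invented for illustration:

```python
import copy
import itertools
import time

def benchmark(op, template, copies=100, iters=10_000):
    """Time `op` over `copies` independently allocated clones of `template`,
    cycling through them so no single heap placement dominates the result.
    (A sketch of the workaround described above; names are illustrative.)"""
    pool = [copy.deepcopy(template) for _ in range(copies)]
    cycle = itertools.cycle(pool)
    start = time.perf_counter()
    for _ in range(iters):
        op(next(cycle))
    return time.perf_counter() - start
```

Averaging over many independently allocated copies smooths out the lucky and unlucky placements that a single reused object would pin the whole run to.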

Ultimately, though, I came to the conclusion that microbenchmarks have almost nothing to do with real-world performance, and I was just wasting my time all along. :/


I would be wary of microbenchmarks like this, especially when the faster sequence is bigger - keeping as much in cache as possible is more important for newer processors, and fetching NOPs wastes bandwidth without doing any useful work. A faster sequence of code won't stay faster if, upon exiting it, something else has to stall due to a cache miss. Pushing the function to the next alignment boundary might move the one after it as well, causing a cascade effect. If you can rearrange the code to spread out the jumps without making it bigger, that would be the best way to go.
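The cascade effect is easy to model with a toy layout calculator (an illustration only, assuming each function start is padded up to the next multiple of the alignment; names are made up):

```python
def layout(sizes, align=None):
    """Compute function start addresses from their sizes. If `align` is set,
    pad each start up to the next multiple of `align` -- a toy model of how
    aligning one function shifts every function placed after it."""
    starts, addr = [], 0
    for size in sizes:
        if align:
            addr = (addr + align - 1) // align * align  # round up to boundary
        starts.append(addr)
        addr += size
    return starts

# Three functions of 40, 20, and 30 bytes: packed, they occupy 90 bytes;
# aligned to 32-byte boundaries, the same code spreads over 126 bytes,
# and the padding bytes still consume fetch bandwidth and cache space.
packed = layout([40, 20, 30])            # [0, 40, 60]
aligned = layout([40, 20, 30], align=32)  # [0, 64, 96]
```

Note how aligning the second function moves the third one too; with more functions the padding accumulates down the whole binary.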


If anybody else is having trouble accessing the presentation linked as an attachment: the download from the original LLVM bug at https://llvm.org/bugs/show_bug.cgi?id=5615 appears to be okay.


Instruction alignment is very important for performance. I remember a similar slowdown when working on a VM for Itanium. The architecture manuals for processors usually describe this in detail.


The Itanium was a somewhat special case. It was very difficult to optimise for, which is why it performed so poorly in practice. In general x86 is far less sensitive to alignment than other architectures, and has been becoming more so with each new generation.


by "becoming more so" so do you mean "becoming less so"?


I see how that doubled modifier could be a bit confusing ("more less sensitive"); I meant that newer processors are becoming less sensitive to alignment.



