> short loops of less than 64 instructions that use AH, BH, CH or DH registers as well as their corresponding wider register (e.g. RAX, EAX or AX for AH)
This is yet another of the many places where the complexity of the x86 ISA shows up and makes its hardware implementations more complicated: the x86 ISA has instructions which can modify the second-lowest byte of a register while keeping the rest of the register unmodified (but AFAIK no instructions which do the same for the third-lowest byte, showing its lack of orthogonality).
For in-order implementations, like the ones which originated the x86 ISA, it's not much of a problem. But for out-of-order implementations, which do register renaming, partial register updates are harder to implement, since the output value depends on both the output of the instruction and the previous value of the register. The simplest implementation would be to make an instruction depending on the new value wait until it's committed to the physical register file or equivalent, and that's probably how it was done for these partial registers before Skylake.
For Skylake, they probably optimized partial register writes to these four partial "high" registers (AH, BH, CH, DH), but the optimization was buggy in some hard-to-hit corner case. That corner case probably can only be reached when some part of the out-of-order pipeline is completely full, which is why it needs a short loop (so the decoder is not the bottleneck, AFAIK there's a small u-op cache after the decoder) and two threads in the same core (one thread is probably not enough to use up all the resources of a single core). The microcode fix is probably "just" flipping a few bits to disable that optimization.
And this shows how an ISA is more than just the decoding stage; design decisions can affect every part of the core. In this case, if your ISA does not have partial register updates (usually by always zero-extending or sign-extending when writing to only part of a register, instead of preserving the non-overwritten parts of the previous value), you won't have the extra complexity which led to this bug. AMD partially avoided this when doing the 64-bit extension (a partial write to the lower 32 bits of a register clears the upper 32 bits), but they kept the legacy behavior for writes to the lower 16 bits, or to either of the 8-bit halves of the lower 16 bits.
The loop needs to be short because the loop buffer (LSD) is only active for loops of 64 or fewer entries (usually fewer real instructions, something like 40 or so). Moreover, Skylake introduced one loop buffer per thread, instead of the previous single buffer shared between both threads.
My guess is that is where the bug is; the behavior for partial register access stalls---insert one extraneous uop to combine, e.g., ah with rax---is unchanged since Sandy Bridge.
Just for background: the Loop Stream Detector was introduced with the Intel Core microarchitecture. With Sandy Bridge, it was 28 μops. It grows to 56 μops (with HT off) with Haswell and Broadwell. It grows again (considerably) with Skylake, to 64 μops per thread (HT on or off).
The LSD resides in the BPU (branch prediction unit) and is basically telling the BPU to stop predicting branches and just stream from the LSD. This saves energy. However, predicting is different from resolving. Branch resolution still happens, and when resolution (speculation) fails, the LSD bails out.
In any case, 64 μops is a lot. That's a good sized inner loop.
It's also a problem with SMT[1]. The design cost is pretty small, it's a fairly straightforward extension of what an out of order CPU is already doing. But due to the concurrency issues debugging/verifying it is incredibly difficult.
[1] Simultaneous multithreading, which is marketed by Intel under the name Hyper-Threading when using two threads.
> For Skylake, they probably optimized partial register writes to these four partial "high" registers (AH, BH, CH, DH), but the optimization was buggy in some hard-to-hit corner case.
They did not do this.
The high registers (AH/BH/DH/CH) are nearly written out of existence with the REX prefix in 64-bit mode. Within the manual(s) it is called out effectively not to use them, as they're now emulated and not supported directly in hardware.
The 16-bit registers (AX/BX/DX/CX) are in a worse situation: it ends up costing additional cycles to even decode these instructions, as the main decoder can't handle them and you have to swap to the legacy decoder, and you'll end up losing alignment. This costs ~4-6 cycles; also, the perf registers to track this were only added in Haswell (and require Ring 0 to use [2]).
The high registers and 16-bit registers are a huge wart that it seems Intel is trying desperately hard to get us to stop using.
> That corner case probably can only be reached when some part of the out-of-order pipeline is completely full, which is why it needs a short loop (so the decoder is not the bottleneck, AFAIK there's a small u-op cache after the decoder)
There is a 64-uOP cache between the decoder and the L1i cache that is called the loop stream detector. Normally this exists to do batched writes to the L1i cache.
But in _some_ scenarios, when a loop can fit completely within this cache, it'll be given extreme priority. This is a way to max out the 5 uOPs per cycle Intel gives you [1]. It'll flush its register file to L1 cache piecemeal as it continues to predict further and further ahead, speculatively executing EVERY PART OF IT in parallel. [3]
In short, this scenario is extremely rare. uOPs have stupidly weird alignment rules, which you can boil down to:
Intel x64 processors are effectively 16-byte VLIW RISC processors that can pretend to be 1-15 byte AMD64 CISC processors at a minor performance cost.
---
The real issue here is whether, when Loop Stream mode ends, it is properly reloading the register file and OoO state.
This is likely just a small microcode fix. The 8-low/8-high/16-bit/32-bit/64-bit weirdness is likely that somebody wasn't doing alignment checks when flushing the register file.
---
[1] On Skylake/Kaby Lake. Ivy Bridge, Sandy Bridge, Haswell, and Broadwell limited this to 4.
[2] Volume 3 performance counting registers; I think we're up to 12 now on Broadwell.
> The high registers (AH/BH/DH/CH) are nearly written out of existence with the RAX flag in 64bit mode. Within the manual(s) it is called out effectively not to use them as they're now emulated and not support directly in hardware.
I think you meant REX prefix but even that doesn't make any sense.
High registers are a first class element of the Intel 64 and IA-32 architectures. They aren't going anywhere. Microarchitectural implementations are an entirely different thing.
That aside, where in the manuals does Intel say not to use the high registers? They're pretty clear about such warnings and usually state them in Assembly/Compiler Coding Rules.
From the parent:
> For Skylake, they probably optimized partial register writes to these four partial "high" registers (AH, BH, CH, DH), but the optimization was buggy in some hard-to-hit corner case.
That is about right. I don't agree with the preceding slap at x86 but this is a good summary.
BTW writing to the low registers is in principle also a partial register hazard but then Intel sees fit to optimize that as a more common case.
In particular, mov AH,BH is not emulated via the MS-ROM (which is just hella slow); it uses two μops on Sandy Bridge and above. This is covered in 3.5.2.4 Partial Register Stalls.
Lastly, there is no section 3.4.1.7 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual which is 3 volumes. You must be talking about the Intel® 64 and IA-32 Architectures Optimization Reference Manual which is a single volume. And it isn't clear how that section furthers your argument.
> High Register and 16bit registers are huge wart that it seems Intel is trying desperately hard to get us to stop using.
Someone really ought to tell clang and gcc this; they both happily use 16-bit registers for 16-bit arithmetic.
Anyway, Intel obviously already has special optimizations for many partial register accesses, dating back to Sandy Bridge. While it's quite possible that they left out the high registers initially (no clue, don't care), if they did they could have decided to include them in Skylake. Who knows though...
What are you even talking about with the LSD? The LSD is entirely before any register renaming and the entire out-of-order bits of the CPU. It's likely the LSD is involved only because that (plus hyperthreading) might be the only way to get enough in-flight µops to trigger whatever is going wrong, whether or not it's due to optimizations for partial register accesses.
Indeed, I've worked with CPUs that didn't have the register split of x86 and they are far less friendly to implementing certain algorithms, which would otherwise require many additional registers and lots of shift/add instructions to achieve the same effect. ("MOV AH, AL" being one simple example -- try doing that on a MIPS, for instance.)
How often have you really needed to do "a = (a & 0xffff00ff) | ((a & 0xff) << 8)"? I don't think I've ever needed to do it, and I wouldn't be surprised if compilers don't even generate "mov ah,al" for that, since AH/BH/CH/DH only exist for four registers.
Anyway, since you asked: In AArch64 that would be written "bfi w0,w0,#8,#8". "bfi" is an alias of "bfm", an instruction far more flexible and useful than any of the baroque x86 register names. BFM can move an arbitrary number of bits from any position in any register to any other register, and it has optional zero-extension and sign-extension modes.
If you're talking to any memory-mapped registers, you'll be doing it all the time. Granted, you're much more likely to be using something like ARM to do that. x86 is a bit large/expensive/power hungry for embedded programming.