I'd be curious to learn why CPUs don't have conditional move speculation.


Because modern CPUs, as a rule, don't speculate on the values fed to arithmetic, only on control flow, and CMOV acts like arithmetic.

That is, if there is an add instruction on rax and rbx, no matter what, the add instruction will not execute until both rax and rbx are available. If the result went into rax, and there is another instruction that uses that as a source, no matter what, that instruction will not execute until the add has completed.

CMOV is implemented as an ALU instruction that always writes to its output, and writes either the value that is already in there (which is why it depends on the old value of its output) or the value provided, depending on flags.
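
In pseudo-C, the dataflow the scheduler sees is roughly this (a sketch of the semantics, not how any particular core implements it):

    #include <stdint.h>

    /* CMOVcc dest, src as a plain ALU op: three inputs
       (condition flags, src, the old value of dest), one
       unconditional output. */
    uint64_t cmov(int cond, uint64_t src, uint64_t dest_old) {
        return cond ? src : dest_old;   /* always writes a result */
    }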


I'm not saying you're wrong — I'm completely ignorant at the microcode level — but it seems to me like between

    cmp x, y
    je z
and

    cmp x, y
    sete z
the actual speculative part is the same: speculating as to the result of cmp x, y

If that's true, why would it not simply pipeline sete and the following instructions, execute (or not execute) sete according to its prediction, and then double-check itself and reverse (or apply) the operation if the prediction was wrong?

I probably just have a bad mental model of what's going on under the hood, so whatever patience you have to deal with my stupid questions would be greatly appreciated.


The two sequences look very similar, and could be implemented the same way, but the actual implementation could not be more different.

> If that's true, why would it not simply pipeline sete and the following instructions and simply execute (or not execute) sete according to its prediction, and then double check itself and reverse (or apply) the operation if the prediction was wrong?

You cannot just reverse or apply one operation. The way speculation works, when the frontend encounters a conditional jump, the entire architectural state of the current thread is stored, and all future memory writes are held in the store buffer and not written out. Then, a long time (potentially dozens of cycles) later, after the je is executed in the backend, either the old state is restored and the pending writes are discarded, or the saved state is discarded and the pending writes are released.
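
A minimal sketch of that checkpoint/rollback dance (the structures and names are invented for illustration):

    #include <stdint.h>
    #include <stdbool.h>

    struct arch_state { uint64_t regs[16]; uint64_t rip; uint64_t rflags; };

    struct checkpoint {
        struct arch_state saved;  /* register state at the branch     */
        int store_mark;           /* stores past this are speculative */
    };

    /* Frontend: snapshot when a conditional jump is encountered. */
    void take_checkpoint(struct checkpoint *cp,
                         const struct arch_state *now, int store_len) {
        cp->saved = *now;
        cp->store_mark = store_len;
    }

    /* Backend, potentially dozens of cycles later: the jump resolves. */
    void resolve(const struct checkpoint *cp, struct arch_state *now,
                 int *store_len, bool predicted_right) {
        if (predicted_right)
            return;                    /* drop checkpoint, release stores */
        *now = cp->saved;              /* restore register state          */
        *store_len = cp->store_mark;   /* discard speculative stores      */
    }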

In contrast, in ALUs, the inputs for instructions are always available before the instructions are scheduled to execute. It would be possible to implement sete like je, but this would imply significant changes to how and where it is executed. ALU ops cannot trigger speculation because there is no machinery for storing state at that part of the pipeline.

And no one is ever going to implement cmov or sete like a jump. Moving the op from being an ALU op to being one that is speculatively executed in the frontend like jmp would have both positive and negative effects, and on balance it would be a significant pessimization of existing software, because for decades cmovs have been used precisely for unpredictable values, where sequencing and waiting for the real value is a better idea than speculating and failing half the time. Using a cmov serializes execution when any following operations use the value, but if there is independent work after it, that work can always execute successfully. Speculating at an unpredictable CMOV would cause that work to be thrown away uselessly half the time.


Taking the example:

      cmpb $115, %cl
      sete %dl
      addl %edx, %eax
vs

      cmpb $115, %cl
      jne _run_switches_jmptgt1
      mov $1,   %dl
     _run_switches_jmptgt1:  
      addl %edx, %eax
The argument about why `jne` might be faster is that in the former case, the CPU always executes a dependency chain of length 3: `cmpb` -> `sete` -> `addl`. These instructions have to be computed one after the other, as `sete` depends on the result of `cmpb`, and `addl` depends on the result of `sete`.

With `jne`, the CPU might predict the branch is not taken, in which case, the dependency chain is `mov` -> `addl` (the `mov` of an immediate might be handled by register renaming?).

Or that it is taken, in which case the dependency chain is just `addl`.
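
(For reference, both sequences plausibly come from source along these lines; a hedged reconstruction with invented names, noting that 115 is ASCII 's':)

    /* Hypothetical source both sequences could implement:
       counting occurrences of 's'. */
    int count_s(const char *p, int n) {
        int total = 0;                /* lives in %eax */
        for (int i = 0; i < n; i++)
            total += (p[i] == 's');   /* cmpb + sete/addl, or cmpb + jne/mov */
        return total;
    }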

I guess you're arguing that the CPU should handle `sete` the same way? That is, instead of treating `addl` as dependent on the result, predict what `sete` does and start executing `addl` before `sete` finishes, rewinding if that went wrong?


Yeah, or at least I don't understand why that wouldn't be possible.

Microcode can set the EIP register based on its prediction of what the result of cmpb $115, %cl will be.

Why can't it set the EDX register based on its prediction of what the result of cmpb $115, %cl will be?


In principle it is perfectly possible to speculatively execute cmov (and, vice versa, to change a jump-over-one-instruction into conditional execution).

But Intel historically didn't do it, as programs tend to use cmov when the condition is unpredictable, so there was little reason to optimize it.

After Spectre, I believe Intel has given an architectural guarantee that cmov is never speculated, so it can be used as part of speculation attack prevention.


The purpose of control flow speculation is to avoid stalling the pipeline.

If each instruction was executed in one single clock cycle, the cost of executing a branch would be one cycle and that's it.

However, since there is a maximum speed at which operations can happen in hardware, a clock cycle long enough to execute a whole instruction would have a very long period, and so the number of "instructions per second" the CPU could execute would be low.

Now, if you can break up each instruction into smaller steps and execute the smaller steps in an overlapping manner, such that while you're executing the second step of the first instruction you're executing the first step of the next instruction and so on (like on an assembly line in a factory), you can have a much shorter clock period for each of these steps, and at the end of each clock tick an instruction will complete execution. The CPU will still be running one instruction per clock cycle, but since each clock period is shorter, the overall instructions-per-second rate will be higher.

But for this to work the next instruction you want to execute must be known in advance so that at each clock cycle the CPU can start step 1 of a new instruction.

That's easy when the program is executing sequentially but when there are branches involved it's more tricky.

And it's tricky even if the branch is not conditional! If instruction execution is broken into many small steps, it may take one or more steps before figuring out that you have a branch in the first place, let alone decoding where you need to branch to. In the meantime the CPU will have happily started to execute the first "steps" of the next instruction.

This is called a "branch hazard".

Early CPU implementations handled branch hazards by just throwing away the intermediate state of the few instructions that were halfway through the pipeline and calling it a day (stalling the pipeline).

Early RISC CPUs attempted to be clever and use a trick called "delay slots": the instruction(s) already in the pipeline continue to execute as if they were logically before the branch. This puts the onus on the programmer (or the compiler) to make sure that only instructions that are safe to execute, regardless of whether the branch is taken or not, are actually put after the branch instruction (otherwise you can just write nops).

But branch delay slots are not a panacea. As pipelines got deeper it became impractical to have a large number of delay slots, and even a small number of delay slots was often just filled with nops anyway.

Improving on unconditional branches was done by "looking ahead" in the instruction stream for branch instructions. When the instructions are all the same size, it's easy to quickly look a few instructions ahead and tell when you've found a branch. You also need an instruction encoding scheme that is relatively fast to decode; at the very least it should be fast to decode branches (the more complicated the logic to decode a branch is, the farther ahead you'd have to look in the instruction stream, which in turn would limit the size of the sequence of instructions you can fill your pipeline with between subsequent branches).

To further complicate the matter, even if you found the branch instruction and you decoded it, it doesn't mean you yet know where it will branch to!

Indirect jumps (where the address is in a register) are similar to conditional jumps in that you don't know the address you're jumping to by merely looking ahead in the instruction stream and noticing the branch instruction. You need to either wait until you execute the branch and stall the pipeline in the meantime, or keep them in the pipeline and flush the pipeline once you know the target of the branch.

The next trick, which CPU designers came up with way before speculative execution, is "branch target prediction".

The CPU keeps a little associative memory that maps addresses of branch instructions to branch targets. When the lookahead logic spots a branch instruction, it looks in this map, gets a guess of the branch target, and uses that immediately as the next instruction so that the pipeline is kept fed with something.
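
A hedged sketch of that structure (a tiny direct-mapped BTB in C; the size and the fixed 4-byte instruction width are invented for illustration):

    #include <stdint.h>
    #include <stdbool.h>

    #define BTB_ENTRIES 256   /* size invented for illustration */

    struct btb_entry { uint64_t branch_pc, target; bool valid; };
    static struct btb_entry btb[BTB_ENTRIES];

    /* Fetch: guess where a just-spotted branch goes. */
    uint64_t btb_predict(uint64_t pc) {
        struct btb_entry *e = &btb[pc % BTB_ENTRIES];
        if (e->valid && e->branch_pc == pc)
            return e->target;   /* hit: keep the pipeline fed */
        return pc + 4;          /* miss: assume fall-through  */
    }

    /* Execute: record the real target once the branch resolves. */
    void btb_update(uint64_t pc, uint64_t real_target) {
        struct btb_entry *e = &btb[pc % BTB_ENTRIES];
        *e = (struct btb_entry){ .branch_pc = pc, .target = real_target,
                                 .valid = true };
    }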

If, by the time the branch instruction is executed, the guess turns out to be wrong, the pipeline is flushed the same way it would have to be flushed anyway if we had no clever branch lookahead in the first place. But if the guess was right, we paid only one cycle to execute the branch.

This works for indirect unconditional branches and also for conditional branches! The prediction logic can be more subtle and complicated (many, many things have been attempted) but this is the general idea.


I hope you work on compiler backends.


With all due respect, this is quite literally the level of stuff covered in an undergrad EE architecture course, in an elementary text like Patterson and Hennessy.


> With all due respect

> quite literally

You could have conveyed close to the same thing by saying, "things like this are covered in Patterson and Hennessy"

> elementary text

Jesus, do you even lift? The rest of the discussion is amazing.


For those not aware Patterson and Hennessy is elementary (“relating to the basic elements of a subject.”) because it is often used in an introductory course of computer architecture. This isn’t a slight.


Speculative execution is all about control flow. It's about what value is in the instruction pointer at some nebulous point in the future.

A conditional jump can put one of two values into the instruction pointer: it will either increment the instruction pointer (jump not taken) or put the immediate value into the instruction pointer (jump taken).

cmov/sete are utterly deterministic; they always increment the instruction pointer. There's nothing to speculate on, there's nothing to predict. They just go to the next instruction.
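
In sketch form, the difference is just this (illustrative C, not anyone's actual hardware):

    #include <stdint.h>
    #include <stdbool.h>

    /* A conditional jump has two possible next PCs: something to predict. */
    uint64_t next_pc_jcc(uint64_t pc, int len, bool taken, uint64_t target) {
        return taken ? target : pc + len;
    }

    /* cmov/sete have exactly one: nothing to predict. */
    uint64_t next_pc_cmov(uint64_t pc, int len) {
        return pc + len;
    }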


> Speculative execution is all about control flow

It's murkier than that. Speculation also deals with the order in which instructions can be executed. Take for example memory ordering (discussed in a mini essay elsewhere here): we typically speculate that all loads are unrelated to any older in-flight stores with unresolved addresses, so that we can optimistically launch them. This is not a control flow issue, but it is something we both speculate on and predict (memory dependence predictors!) despite the next PC being essentially deterministic.


> Speculative execution is all about control flow. It's about what value is in the instruction pointer at some nebulous point in the future.

... and all about what we can wheedle out of all the background speculation that will help us get root on this box.


One other perspective is that by speculating the outcomes of conditional instructions, you naturally open yourself up to mispeculating them. This sounds obvious, but the consequences for the uarch are quite severe. This is because anytime you mispeculate an instruction, most (all?) contemporary CPUs throw out all younger speculative progress (even if it is unrelated!) and restart at the instruction that was originally mispeculated. Throwing out all this work is both i) a waste of power/cycles (you did all this speculative work for nothing!) and ii) quite an expensive operation, because you either have to iteratively roll back the state (slow!) or take a snapshot of the state on every conditional instruction (expensive from a power/area perspective).

A similar idea to what you're proposing (and a possible solution to the above issue) does come up in another part of the processor however! Specifically, high performance processors launch loads very aggressively and often times return data as soon as the address is known. This is because memory is often the bottleneck for performance. This, unfortunately, has some challenges. Namely, memory ordering violations. Take for example the following snippet (ARMv8):

    mov x1, #1    
    udiv x3, x2, x1
    str x2, [x3]
    ldr x4, [x2]
    add x5, x4, x4
This is a silly and somewhat contrived code sequence, but note here that both str x2 and ldr x4 access the same address and thus the value in x4 should be x2. Note, however, that since str x2's address (x3) is produced by a slow division operation but ldr x4's address (x2) is available much more quickly, ldr x4 will likely launch before the CPU even knows that str x2 conflicts with it. Thus, the data returned by the load will be whatever random old stale data is in the cache rather than the correct value that is currently sitting in x2. This means that the subsequent add which consumes this data will produce an incorrect value, leading the whole program to derail. Once the CPU detects this issue, it has to throw away all the state and restart execution of the program at ldr x4 in order to fix its mistake and fix up the memory ordering violation.

In essence, the CPU is speculating that str x2 and ldr x4 are unrelated because doing so is very important for performance. Unfortunately, however, memory ordering violations are actually somewhat common, and constantly having to restart execution has negative performance implications.

Now, this is actually a very similar problem to the one we'd see with conditional instruction speculation! So how do we solve this issue for memory ordering violations? Well, we predict which pairs of stores and loads are dependent and block the load from launching until the address of the store it supposedly depends on resolves. If this predictor is functioning well, we are able to launch loads aggressively while also avoiding many costly fixups!
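
A toy sketch of such a predictor, loosely in the spirit of store-set style memory dependence predictors (the table size and hash are invented):

    #include <stdint.h>
    #include <stdbool.h>

    #define MDP_ENTRIES 1024   /* size and hash invented */

    /* One sticky "this load has conflicted before" bit per slot. */
    static bool conflicted[MDP_ENTRIES];

    static unsigned mdp_index(uint64_t load_pc) {
        return (load_pc >> 2) % MDP_ENTRIES;
    }

    /* Issue: may this load launch past unresolved older stores? */
    bool mdp_may_speculate(uint64_t load_pc) {
        return !conflicted[mdp_index(load_pc)];
    }

    /* Called when this load caused a memory ordering violation. */
    void mdp_record_violation(uint64_t load_pc) {
        conflicted[mdp_index(load_pc)] = true;
    }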

So, how would we translate this to conditional instruction speculation? Well, one idea is that we could predict both whether a given instruction is predictable and, if so, which way to predict it. If a conditional instruction is predicted as unpredictable, its result will not be speculated (thereby avoiding frequent costly restarts), but if it is predicted to be predictable, we can try to predict which way it will go.
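
As a toy sketch of that gating idea (all sizes and thresholds are invented; this is a napkin design, not anything shipping):

    #include <stdint.h>
    #include <stdbool.h>

    #define ENTRIES 512   /* sizes and thresholds invented */

    static uint8_t confidence[ENTRIES];   /* 2-bit saturating counters */

    static unsigned cidx(uint64_t pc) { return (pc >> 2) % ENTRIES; }

    /* Only speculate a cmov/sete when we've been right repeatedly. */
    bool should_speculate(uint64_t pc) { return confidence[cidx(pc)] >= 3; }

    void train(uint64_t pc, bool was_right) {
        uint8_t *c = &confidence[cidx(pc)];
        if (was_right) { if (*c < 3) (*c)++; }
        else           { *c = 0; }   /* reset hard on a mispredict */
    }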

Would this work? Maybe. Will anyone actually do this? Likely not. As others have suggested, conditional instructions are almost exclusively used for hard to predict conditions specifically because CPUs don't speculate them. Thus, in most existing code the predictor would just say "yep can't predict it" and we'd just have ended up wasting a bunch of area and power on a predictor that never gets used.

If you're really dedicated to this cause though, feel free to write a paper on it. Spitballing performance numbers is easy but often wrong in quite surprising ways, so maybe this might just work for some weird reason I've missed :)



