Hacker News

So all this speculative business is more about using idle bits of the chip to warm it up for either branch that might be taken, thereby reducing some of the time for that code to execute, but any and all ops that would actually modify memory or access stuff on the bus are exempt from speculation.

About right?




> About right?

Close.

There are two kinds of speculative execution.

1) The CPU guesses (branch predicts) one path by default. If it guesses wrong, it has to throw away the speculative results and execute the alternative path. No speculative state is leaked to other CPU cores or to memory (writes or I/O). Typically CPUs guess right well over 99% of the time -- there are of course cases where prediction fails, sometimes pathologically. Those times it guesses wrong, 15-20 cycles are lost. To put that in perspective, that's enough cycles for roughly 500 floating point computations.

2) Programmer- or compiler-produced speculative execution. Typical for SIMD, on both CPUs and GPUs. For example, with AVX2 you can compute results for 16x 16-bit integer lanes (256 bits wide) per instruction (so maybe about 32x 16-bit operations per clock cycle). Computations for both branches are done in parallel and the right results are selected with a mask before the data is written anywhere. The benefit is avoiding branching altogether, which gains performance.
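The mask trick in (2) can be sketched in plain Python, one "lane" at a time. The operations and numbers here are made up for illustration; a real AVX2 version would compute 16 lanes per instruction and blend them with something like _mm256_blendv_epi8.

```python
def masked_select(xs):
    """Branchless select: compute BOTH branch results for every element,
    then pick the right one with a bitmask instead of an if/else jump."""
    out = []
    for x in xs:
        then_val = x * 3       # result if the condition is true (illustrative op)
        else_val = x + 100     # result if the condition is false (illustrative op)
        mask = -(x > 0)        # -1 (all ones) when true, 0 when false
        # keep then_val where mask is all ones, else_val where it is zero
        out.append((then_val & mask) | (else_val & ~mask))
    return out

print(masked_select([5, -2, 0, 7]))  # [15, 98, 100, 21]
```

Note that both branch computations are always executed; the win comes from never taking a conditional jump, so there is nothing for the branch predictor to get wrong.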

Sometimes the CPU branch predictor does really badly. For example, if you have somewhat random data-dependent branching, the CPU is going to guess wrong 50% of the time. So computing both sides in parallel and throwing 50% of the results away can mean an order of magnitude speedup!
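A toy model makes that concrete. Here's a 2-bit saturating-counter predictor -- a textbook scheme, not any specific CPU's real predictor -- run on a loop-like branch pattern (very predictable) and on random outcomes (hopeless):

```python
import random

def predict_accuracy(outcomes):
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    state = 0
    correct = 0
    for taken in outcomes:
        predicted = state >= 2
        if predicted == taken:
            correct += 1
        # nudge the counter toward the actual outcome, saturating at 0 and 3
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / len(outcomes)

# A loop branch: taken 7 times, then one not-taken loop exit, repeated.
loop_pattern = ([True] * 7 + [False]) * 100
print(predict_accuracy(loop_pattern))   # 0.8725 -- only the exits mispredict

# Random data-dependent branching: accuracy collapses toward 50%.
random.seed(0)
random_pattern = [random.random() < 0.5 for _ in range(800)]
print(predict_accuracy(random_pattern))
```

Real predictors use history tables and are far more accurate on real code, but the random case defeats them all the same: there is no pattern to learn.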


How does it determine it was wrong without checking the condition? Or does it?


It catches up with the branch condition after some delay, because CPUs have a lot of pipeline stages. The CPU is not executing one or two instructions at a time, but a window of 10-50 (guesstimate, maybe more) instructions over roughly 15 clock cycles -- the pipeline depth.

So once it knows the condition result that determines the taken branch, say 15 cycles later, it compares the guess to the actual path that needs to be taken. If they agree, the speculative results are marked valid. If not, they're thrown away, the CPU pipeline is flushed, and execution starts again from the other path.

CPUs also have a reorder buffer (ROB) holding hundreds of uops (read: instructions) in flight -- the current crop is around 200; Intel Haswell's is 192 -- where the CPU tries to sort cross-instruction dependencies into an order that's faster to execute. A deep pipeline means that if it can't reorder the instructions, it has to stall waiting for the earlier results -- it simply doesn't know the value of a certain register (or cached memory location) until the earlier computation's dependency chain is finished.
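A classic way to see dependency chains is summation. One accumulator forms a single long chain (every add waits on the previous add); several independent accumulators give the out-of-order machinery work it can overlap. Python itself won't show the speedup -- this just illustrates the shape of code that compiled languages benefit from:

```python
def sum_one_chain(xs):
    acc = 0
    for x in xs:
        acc += x  # each add depends on the previous add's result
    return acc

def sum_four_chains(xs):
    # Four independent chains: each accumulator only depends on its own
    # previous value, so a CPU can keep several adds in flight at once.
    a = b = c = d = 0
    for i in range(0, len(xs) - 3, 4):
        a += xs[i]; b += xs[i + 1]; c += xs[i + 2]; d += xs[i + 3]
    for x in xs[len(xs) - len(xs) % 4:]:  # leftover elements
        a += x
    return a + b + c + d

data = list(range(1, 101))
print(sum_one_chain(data), sum_four_chains(data))  # 5050 5050
```

Compilers often do this transformation themselves (loop unrolling with multiple accumulators), but only when allowed -- e.g. for floats it changes the order of additions, so it needs flags like relaxed FP math.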

They're complicated and weird machines. They don't really execute the code sequentially at all; they just make it look as if they did.

Everything I said above is oversimplified. I left out register renaming, cross-core / cross-socket cache coherency -- and so much more. I can't really say I completely understand the beast myself.



