SSE: mind the gap (fgiesen.wordpress.com)
117 points by Audiophilip on April 3, 2016 | 34 comments



It's much better to use any of the numerous SIMD wrappers, such as libsimdpp or Vc, and get various benefits for free. It's possible to target everything from SSE and NEON to AVX512 with what is essentially a single code path.


Realistically the vast majority of C and C++ codebases today will never touch anything more than x86 and ARM, and I wouldn't be surprised if most never even get past x86, so I don't buy the portability argument. Portability between SSE and AVX is a better argument.

But in any case, if you're using SIMD in anger, chances are you have hard performance requirements that you really care about, and a one-size-fits-all approach is going to leave valuable performance on the table. Whether you just have to target your own servers, or any x86 CPU made in the past 6 years, or that plus NEON-equipped ARMs, it will probably be worth the effort to duplicate the code paths, especially in comparison to the initial effort of figuring out how to vectorize your problem in the first place.

And while it's nowhere near "leftpad", if you really want a SIMD wrapper and know what you're doing, it should be well within your capabilities to write your own. Maybe not quite as spiffy as the one on GitHub, but when I get anywhere close to assembly I find that I get more value out of doing everything from scratch and truly understanding what I'm dealing with, rather than leaving anything in someone else's hands.


> Realistically the vast majority of C and C++ codebases today will never touch anything more than x86 and ARM, and I wouldn't be surprised if most never even get past x86, so I don't buy the portability argument.

Just recently a Gentoo developer ported GHC to m68k and found some portability issues, which he fixed in the process and which benefit all architectures. This is also why OpenBSD devs are still on gcc3.

RISC-V and POWER are just two very modern ISAs worth mentioning, and not something you can easily ignore. We need more ISAs, like in the past, not just two. It's very dangerous to limit ourselves to just ARM/x86; diversity is a plus for writing more correct code and for having more options. lowRISC is a nice fit for many things, as is POWER, while of course ARM and x86 are here to stay. I'd count Nvidia's and AMD's GPUs as the other major architectures, but we don't usually deal directly with GPUs at that level. You choose the right chip for the job, just as phones select different SoCs for different use cases.


The idea that compiling your code for 68000 or MIPS can reveal bugs does not change the fact that x86 and ARM are pretty much the only relevant CPU architectures that all but the most entrenched of government contractors could ship a product on, today or in the foreseeable future, that would have any use for SIMD. If you actually need to do extensive SIMD optimizations (say, it could shave 5ms off your frame time in a game, or save you $XXXXXX/year in your data center), PowerPC does not enter your mind at any moment.

You see it as weeding out bugs and future-proofing your code in case x86 or ARM disappears tomorrow; I see it as a load of completely wasted work and optimization opportunities.

Also, lowRISC learned nearly nothing from the past 20 years of CPU architecture advancement. It is not modern; it is a naive copy of a very outdated design.


By saying single code path, I don't mean a single instruction stream. libsimdpp, for example, supports building the same code for different instruction sets, linking it into the same executable, and then dispatching dynamically (a hand-rolled version of that dispatch is sketched after the list below). Doing this by hand would mean that either:

- lots of time is wasted creating slightly different versions of code. I'm talking about e.g. AVX vs. AVX2 for floating-point code, not SSE2 vs. AVX.

- micro-optimization opportunities are wasted by coding only for major revisions of the instruction set
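
For illustration, a minimal hand-rolled dispatch sketch in C, assuming GCC or Clang; the kernel names here are hypothetical:

    #include <stddef.h>

    /* Hypothetical per-ISA kernels, each compiled in its own translation
       unit with the matching flags (e.g. -mavx2 for the AVX2 one). */
    void kernel_sse2(float *dst, const float *src, size_t n);
    void kernel_avx2(float *dst, const float *src, size_t n);

    typedef void (*kernel_fn)(float *, const float *, size_t);

    /* Pick an implementation once at startup, using a GCC/Clang builtin. */
    kernel_fn select_kernel(void)
    {
        if (__builtin_cpu_supports("avx2"))
            return kernel_avx2;
        return kernel_sse2;
    }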

Even when optimal performance can only be achieved via completely different approaches, SIMD wrappers are easier to use because they present a consistent interface. Any specialized instructions can be used by simply falling back to native intrinsics.

Thus I don't see much benefit in writing SIMD code without a wrapper. The only advantage is that it's harder to shoot oneself in the foot, as can happen with naive use of these wrappers, e.g. if one doesn't actually look at the generated assembly code.


Yeah, I understood what you meant, I've used wrappers like that before. My contention was with your original comment,

>It's possible to target everything from SSE and NEON to AVX512 with what is essentially a single code path.

the practice of which does not generally make the best use of any particular instruction set, emulating, with multiple instructions, operations that aren't available on a platform, etc. It might be good enough for many light optimization jobs, in which case I'd say go for it; you're doing so much better than the vast majority of programmers writing Python or whatever. But what I was trying to argue was that if you really need to crunch the hell out of some numbers, then you probably have a small set of target platforms that you can justify writing intrinsics (or even assembly) for directly.

This claim, however:

>I'm talking about e.g. AVX vs. AVX2 for floating-point code, not SSE2 vs. AVX.

is a lot more reasonable, but you could do the same with some strategically placed #ifdefs with native intrinsics or assembly.
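
For example, a minimal sketch of such #ifdef selection, where the vec and vec_add names are made up for illustration; the rest of the code would use them without caring which instruction set is active:

    #if defined(__AVX__)
      #include <immintrin.h>
      typedef __m256 vec;                  /* 8 floats per register */
      #define vec_add(a, b) _mm256_add_ps((a), (b))
    #elif defined(__SSE2__)
      #include <emmintrin.h>
      typedef __m128 vec;                  /* 4 floats per register */
      #define vec_add(a, b) _mm_add_ps((a), (b))
    #endif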


Not sure about "single code path". Differences among SIMD flavors are significant; there are cases where a one-to-one translation is either impossible or impractical. A prime example is the AVX2 instructions that operate on 128-bit lanes rather than whole 256-bit registers.

And wrappers exist in the C++ ecosystem; C programmers are stuck with intrinsics.


> And wrappers exist in the C++ ecosystem; C programmers are stuck with intrinsics.

If you can accept working with GNU extensions that are available in recent-ish GCC and Clang (but not MSVC, not sure about Intel ICC), there are pretty nice vector extensions [0].

With them you can get standard binary operators working for arithmetic (+, -, *, / etc.) and shuffling with __builtin_shuffle. These are CPU-independent; the same code compiles neatly to ARM NEON as well as x86 SSE+AVX+FMA. All you need is a typedef with an __attribute__.
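
A minimal sketch, assuming GCC (Clang spells the shuffle builtin __builtin_shufflevector):

    #include <stdio.h>

    /* 4-lane vectors, 16 bytes wide (GNU vector extension). */
    typedef float v4sf __attribute__((vector_size(16)));
    typedef int   v4si __attribute__((vector_size(16)));

    int main(void)
    {
        v4sf a = {1.0f, 2.0f, 3.0f, 4.0f};
        v4sf b = {5.0f, 6.0f, 7.0f, 8.0f};
        v4sf sum = a + b;                 /* addps on SSE, fadd on NEON */

        v4si rev = {3, 2, 1, 0};          /* lane indices for the shuffle */
        v4sf r = __builtin_shuffle(sum, rev);

        printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
        return 0;
    }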

The vector extension functions don't cover the whole instruction sets, but the vector types are compatible with __m128 and native NEON formats, so you can resort to intrinsics when necessary.

However, for a lot of SIMD tasks I encounter, just basic arithmetic + shuffles is more than 80% of what I need.

If you want to see some examples, take a look at my collection of 3d graphics and physics related SIMD routines [1]. (Note: this project could use some help; let me know if you're interested in doing something with it or porting some of the hand-optimized routines to more widely used math libs like glm.)

[0] https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html#Ve...

[1] https://github.com/rikusalminen/threedee-simd


> If you can accept working with GNU extensions that are available in recent-ish GCC and Clang

I do my private projects in C++, so that's not an issue there, but at my current company we also use MSVC. I wish we could abandon that compiler and work with GCC or Clang only.

> However, for a lot of SIMD tasks I encounter, just basic arithmetic + shuffles is more than 80% of what I need.

Your remaining 20% is my 80%. :)


> ... but at my current company we also use MSVC. I wish we could abandon that compiler and work with GCC or Clang only.

Good news! These days you can produce MSVC-compatible binaries with Clang, or even use Clang as the compiler from within the C++ IDE.

Whether or not you can do this in practice is another matter, but it can be done.

> Your remaining 20% is my 80%. :)

Yeah, if you look at my examples, they're rather straightforward arithmetic with 4-dimensional vectors. There's very little need for any integer arithmetic or more exotic combinations of operations. A little fused multiply-add here and there.

But I haven't seen a better method for this; most of the code is CPU-agnostic and will compile to x86 or ARM code using all the available instruction sets (depending on compiler arguments, e.g. -mavx2 or -march=native). I really haven't seen a SIMD math lib elsewhere with so little duplication across CPUs.
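
As a sketch of that, the same vector-extension code can pick up FMA purely from compiler flags. Whether the compiler contracts a*b+c into one FMA depends on the target and on -ffp-contract, so this is a "may", not a guarantee:

    typedef float v4sf __attribute__((vector_size(16)));

    /* With -mfma (or -march=native on an FMA-capable CPU) GCC may emit a
       single vfmadd instruction here; on AArch64 it may become fmla.
       The source stays identical either way. */
    v4sf muladd(v4sf a, v4sf b, v4sf c)
    {
        return a * b + c;
    }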


The property of AVX and AVX2 you mentioned actually helps keep a single code path. If the SIMD wrapper allows parameterization on vector width (most do), you can simply increase the vector width when compiling for AVX, and that's it.


I understand your point; however, it's not as simple as it seems. Of course, for trivial code the transition between different SIMD flavors can be seamless. But the world is cruel. :)

Think about shuffle instructions (pshufb): the lookup vectors for the instruction are different in AVX2 and SSE. Even if an AVX2 lookup vector can be created by cloning an SSE vector into both lanes, this must be a programmer decision.
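
A minimal intrinsics sketch of the lane issue (the function names are made up):

    #include <immintrin.h>

    /* SSE (SSSE3): one 16-byte table lookup. */
    static __m128i lookup16(__m128i table, __m128i idx)
    {
        return _mm_shuffle_epi8(table, idx);
    }

    /* AVX2 vpshufb works per 128-bit lane: each index byte selects only
       from its own 16-byte half. One fix is to clone the SSE table into
       both lanes first; that is a decision the programmer has to make. */
    static __m256i lookup16x2(__m128i table, __m256i idx)
    {
        __m256i t = _mm256_broadcastsi128_si256(table);
        return _mm256_shuffle_epi8(t, idx);
    }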

Another example is an algorithm that uses the video-encoding instruction mpsadbw to locate substrings (http://0x80.pl/articles/sse4_substring_locate.html#introduct...). The AVX2 instruction vmpsadbw operates on 128-bit lanes, and parts of the algorithm have to be rewritten to align with this limitation.


Would you be able to point me towards a shipping product/library that does this? It's easy to find examples of people hardcoding x64 assembly (x264, zlib, libyuv) but I haven't stumbled across anybody making good use of a high level wrapper.


There is an entire high-level scientific computing framework built on a SIMD wrapper: https://github.com/jfalcou/nt2.

Though I must note that in this case the SIMD wrapper has significant problems. Due to certain design decisions, the wrapper performs suboptimally on mixed float-integer code on AVX, for example.


Mentioned just in the parent; here is the link: https://github.com/p12tic/libsimdpp

It's reaching 2.0 very soon (in the RC phase right now), with support for VS, which was lacking before.


Although it's way more than an SSE wrapper, the Eigen library is excellent in my experience and targets multiple platforms.

http://eigen.tuxfamily.org/index.php?title=Main_Page


I had a look at the matrix*vector multiplication code for Eigen once and it was rubbish.



Why use wrapper libraries when you can use OpenCL with the CPU as a compute device?


I'm trying to understand speculative execution.

Given this

    int result = foo != bar ? do_side_effect_and_return() : safe_return();
C code, am I right to assume both functions will be executed speculatively?

What other potential bugs/gotchas are lurking with speculative execution?


In the context of a CPU, especially one with a deeper pipeline, the comparison result will only be known at some stage deep within the pipeline. Therefore, to avoid stalling the pipeline until the result is known, the CPU will start to execute one of the branches. Once the value is known, if it guessed the branch correctly it continues executing as normal, having already partially executed it. But if it guessed incorrectly, it has to flush the work it has done and start executing the other branch.

Edit: One particular thing to note is that side effects within the pipeline do not actually occur until the latter stages of the pipeline, where the writes to memory and registers are realised. By that stage the condition result is already known and the correct branch is executing within the processor.


Yeah, but what if do_side_effect_and_return() deletes a file? Surely that cannot be prevented.

What I'm mostly wondering is how it is that half of our code doesn't break all the time due to speculative execution.

I'm looking for an explanation of how it's prevented, or whether it just works because compiler and runtime writers took careful precautions.


Ok, from what I see you are thinking at too high a level for this.

The CPU has a pipeline where it executes instructions. This pipeline has stages for: fetching the instruction from L1 cache (a), decoding the instruction (b), fetching data from registers or memory (c), computing the instruction (d), and storing the data back into registers or memory (e). This is a rough grouping, and each of these stages (a-e) comprises smaller stages, one of which executes in each clock cycle.

So between the first stage where a CPU starts executing a comparison instruction (a), and when it knows what the value is (d) that can be many clock cycles. So instead of waiting and stalling it instead guesses which branch will be taken and starts feeding it in.

It is only in stage (e) where it stores values to registers or memory where actions can actually take place, and by the time it gets there it is executing the correct branch. Either because it guessed the correct branch from the beginning or it has mispredicted, flushed the pipeline, and is now executing the correct branch.

Edit: Note that there is no such instruction as "delete file on SSD". The CPU has various ways of working with external devices (such as sound/video chips, SSDs, etc.), with memory mapping being one of the more popular ones, but there are also IO pins and a variety of hardware protocols it can use to issue commands. If you want to read up on this, get a small device like a Raspberry Pi and play around.


I see. Are CLWB and PCOMMIT for NVMe safe?


Yes. They're no exception. They actually make DMA transfers safe without doing unnecessary work.


So all this speculative business is more about using idle bits of the chip to warm it up for whichever branch ends up being taken, thereby reducing some of the time for that code to execute, but any and all ops that would actually modify memory or access stuff on the bus are exempt from speculation.

About right?


> About right?

Close.

There are two kinds of speculative execution.

1) The CPU guesses (branch predicts) one path by default. If it guesses wrong, it'll need to throw away the speculative results and execute the alternative path. No speculative state is leaked to other CPU cores or to memory (writes or I/O). Typically CPUs guess right well over 99% of the time -- there are of course cases when prediction fails, sometimes pathologically. The times it guesses wrong, 15-20 cycles are lost. To put that in perspective, that's enough cycles for 500 floating-point computations.

2) Programmer- or compiler-produced speculative execution. Typical for SIMD, on both CPUs and GPUs. For example, with AVX2 you can compute results for 16x 16-bit integer lanes (256 bits wide) per instruction (so maybe about 32x 16-bit operations per clock cycle). Computations for both branches are done in parallel and the right results are selected by masking before the data is written anywhere (a sketch follows below). The benefit is the ability to avoid branching, gaining performance.

Sometimes the CPU branch predictor does really badly. For example, if you have somewhat random, data-dependent branching, the CPU is going to guess wrong 50% of the time. So computing both sides in parallel and throwing 50% of the results away might mean an order-of-magnitude speedup!
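
A minimal sketch of technique 2 using AVX2 intrinsics (the function name is made up); the scalar equivalent is out[i] = (a[i] > b[i]) ? a[i] + b[i] : a[i] - b[i]:

    #include <immintrin.h>

    /* Compute both branches for all 16x 16-bit lanes, then select per lane. */
    __m256i select_both_branches(__m256i a, __m256i b)
    {
        __m256i sum  = _mm256_add_epi16(a, b);    /* "then" branch, every lane */
        __m256i diff = _mm256_sub_epi16(a, b);    /* "else" branch, every lane */
        __m256i mask = _mm256_cmpgt_epi16(a, b);  /* 0xFFFF where a > b */
        /* blendv picks its second operand where the mask byte's high bit is
           set; cmpgt lanes are all-ones or all-zeros, so this selects per lane. */
        return _mm256_blendv_epi8(diff, sum, mask);
    }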


How does it determine it was wrong without checking the condition? Or does it?


It catches up with the branch condition after some delay, because CPUs have a lot of pipeline stages. The CPU is not executing one or two instructions at a time, but a window of 10-50 (a guesstimate, maybe more) instructions over roughly 15 clock cycles -- the pipeline depth.

So after it knows the condition result that determines the taken branch, say 15 cycles later, it compares the guess to the actual path that needs to be taken. If they agree, the speculative results are marked valid. If not, they're thrown away, the CPU pipeline is flushed, and execution starts again from the other path.

CPUs also have a reorder buffer (ROB) holding hundreds of uops (read: instructions) in flight -- the current crop has about 200 entries; Intel Haswell has 192 -- where the CPU tries to sort cross-instruction dependencies into an order that's faster to execute. A deep pipeline means that if it can't reorder the instructions, it'll have to stall waiting for earlier results -- it just doesn't know the value of a certain register (or cached memory location) until the earlier computation's dependency chain is finished.

They're complicated and weird machines. They don't really execute the code sequentially at all; they just make it look as if they did.

Everything I said above is oversimplified. I left out register renaming, cross-core and cross-socket cache coherency, and so much more. I can't really say I completely understand the beast myself.


The CPU won't speculatively execute past certain instructions. AFAIK it's up to (e.g.) the SSD driver to prevent the CPU from speculatively sending commands to the drive.


It's not up to the SSD driver. The SSD driver will not see any effects of speculatively executed code and there's no way the CPU could speculatively send commands to the drive.


What if you're communicating with the SSD via DMA? Don't you have to make sure that the writes aren't speculatively executed?


Because DMA occurs when the CPU writes to memory, and as my comment above states, this happens after the condition is evaluated and the jump instruction is (or is not) taken.

Speculative execution is purely within the CPU and doesn't leak out - except in the case of worse than ideal performance.


Firstly, the article doesn't mention speculative execution - it refers to a programming technique which is effectively (assume f, g, h are arithmetic functions, not 'program' functions):

  compute f
  while f is in the pipeline, compute g
  while g is in the pipeline, compute h
  let bits = 111111..111 if h is true or 00000...000 if h is false
  let result = (f & bits) | (g & ~bits)
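
In SSE intrinsics, that masking step looks roughly like this sketch (names are made up):

    #include <xmmintrin.h>

    /* result = h ? f : g, branchlessly, for 4 float lanes; here h is a
       per-lane comparison producing all-ones or all-zeros bit masks. */
    __m128 masked_select(__m128 f, __m128 g, __m128 a, __m128 b)
    {
        __m128 bits = _mm_cmplt_ps(a, b);          /* h: all-ones where a < b */
        return _mm_or_ps(_mm_and_ps(bits, f),      /* f where h is true  */
                         _mm_andnot_ps(bits, g));  /* g where h is false */
    }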
In your example, if they're function calls and not inlined, then they won't execute speculatively at all: speculative execution usually only applies to straight-line instructions, and in any case only applies to letting the instruction go off into an "execution" unit. Only one branch should ever make it into the "commit" phase of the pipeline (think of it like a database commit), and only one should ever have effects that are visible off the processor die.

Also things like system calls, interrupts, context switches, and so on tend to flush the pipeline and insert a "barrier", at which point all the instructions before the barrier have committed and none of the ones after the barrier have committed.



