Even better, if you use the vector library, prefetch instructions are inserted automatically (and nearly optimally) for loops written with higher-order functions like fold.
Nope. That's not in the release version. SIMD is still second-class in GHC. It should be viable for GHC 7.10, though. Note well: the prefetches in that paper aren't optimal. In many cases prefetch is over-issued.
You should only use prefetch when benchmarks show that the hardware prefetcher isn't up to snuff, or when your data access pattern has poor locality and doesn't resemble a linear arithmetic sequence. (Just rules of thumb, mind you. For more precise heuristics, please read your CPU vendor's optimization manual. The Intel manual has quite a few tricks that should apply to most modern CPUs.)
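To make the second case concrete, here is a rough sketch in C rather than Haskell, since the GCC/Clang builtin `__builtin_prefetch` is easy to show in isolation. It assumes a gather-style loop whose indices jump around a table, which is the kind of non-linear pattern a hardware prefetcher tends to miss. The function names and the `PREFETCH_DISTANCE` value are made up for illustration, not taken from any library, and the distance in particular is a guess you would have to tune by benchmarking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch.
   The right value depends on the CPU and the loop body; measure it. */
#define PREFETCH_DISTANCE 8

/* Baseline: sum table[idx[i]] and leave everything to the hardware. */
double sum_gather(const double *table, const size_t *idx, size_t n)
{
    assert(table != NULL && idx != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += table[idx[i]];
    return total;
}

/* Same loop, plus a software prefetch hint for a later element.
   idx itself is read sequentially, so only the table access is hinted. */
double sum_gather_prefetch(const double *table, const size_t *idx, size_t n)
{
    assert(table != NULL && idx != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, rw = 0 (read), locality = 1 (low reuse). */
            __builtin_prefetch(&table[idx[i + PREFETCH_DISTANCE]], 0, 1);
        }
        total += table[idx[i]];
    }
    return total;
}
```

Both functions return the same sum; the prefetch is only a hint and never changes the result. Benchmark the two against each other on your real data before keeping the second one. If the indices turn out to be mostly sequential, the extra instructions are likely pure overhead, which is the over-issue problem mentioned above.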