It doesn’t claim linked lists are fast. And it doesn’t have any benchmarking data.
Actually the only reference to linked lists I see is in the prefetcher section (3.7.3), where it explicitly recommends storing consecutive items in a single 4-KByte page or in consecutive cache lines. They give a positive code example using an array, like other commenters are suggesting.
That depends more on what the nodes link to. Not prefetching a worthless list of pointers when you could be prefetching actual data is typically faster. Prefetchers have limited capabilities and limited capacity for outstanding loads.
The prefetcher doesn't even have to get involved for linear sequences; it only kicks in after a cache-miss threshold is met. The fastest and simplest prefetcher, at L1, is stream-based and works great for sequential data.
Prefetching just one element ahead does fuck-all when RAM is ~100x slower than cache, especially if you need to precharge and activate a new DRAM row, which is almost always the case when you're spraying tiny objects all over the place.
So don't spray tiny objects around. :) Keep them nicely tucked away in an arena.
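For illustration, a minimal bump-allocator arena might look like this (a hypothetical sketch: no per-object free, no destructor calls, no chunk growth):

    #include <cstddef>
    #include <new>
    #include <utility>
    #include <vector>

    // Minimal bump arena: objects are packed contiguously and all
    // storage is released at once when the arena is destroyed.
    // Note: destructors are never run, so as written this only suits
    // trivially destructible types.
    class Arena {
        std::vector<std::byte> buf;
        std::size_t used = 0;
    public:
        explicit Arena(std::size_t capacity) : buf(capacity) {}

        template <typename T, typename... Args>
        T* create(Args&&... args) {
            // Round the offset up to T's alignment, then claim sizeof(T).
            std::size_t off = (used + alignof(T) - 1) & ~(alignof(T) - 1);
            if (off + sizeof(T) > buf.size()) throw std::bad_alloc{};
            used = off + sizeof(T);
            return new (buf.data() + off) T(std::forward<Args>(args)...);
        }
    };

Nodes created back-to-back this way land next to each other in memory, so even a linked structure built in an arena traverses mostly sequential addresses.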
Vectors actually tend to create a spray of tiny objects, as opposed to a true dynamic array, which has limitations on resize. If you're lucky, the implementation will keep a single contiguous buffer per vector. You're usually not that lucky with bigger data.
If you never free your objects, you may as well use an array anyway; it'll be much simpler. And if you use an arena with an object pool, you're back in cache-miss territory.
Vectors only "create a spray of tiny objects" if you have a vector-of-references or something like that. And even then, reading data requires 2 reads from memory (1 to read the pointer in the vector, and another to read the object). Finding item N in a linked list will require N reads (since you need to read each item before the target). Yes, resizing a vector is O(n). But every time I measure memcpy, it always goes faster than I expect.
If you disagree, try to prove it with code. At least one of us will learn something that way.
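For what it's worth, the shape I have in mind is something like this (a rough sketch, not a rigorous benchmark: one run, no warm-up, and the compiler may vectorize the vector loop, which is arguably part of the point):

    #include <chrono>
    #include <cstdio>
    #include <list>
    #include <numeric>
    #include <vector>

    int main() {
        constexpr int N = 10'000'000;
        std::vector<int> v(N, 1);
        // Caveat: nodes allocated back-to-back like this are friendlier
        // to the cache than a list that grew under real allocator churn.
        std::list<int> l(v.begin(), v.end());

        auto time = [](const char* name, auto&& f) {
            auto t0 = std::chrono::steady_clock::now();
            long long r = f();
            auto t1 = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<
                std::chrono::milliseconds>(t1 - t0).count();
            std::printf("%s: sum=%lld in %lld ms\n", name, r, (long long)ms);
        };

        time("vector", [&] { return std::accumulate(v.begin(), v.end(), 0LL); });
        time("list  ", [&] { return std::accumulate(l.begin(), l.end(), 0LL); });
    }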
"I made an extra test that wraps matrix4x4 in std::unique_ptr."
And that is your test? One piece of code? Not even presenting disassembly? Who the hell knows what your compiler wrought in response to your "std::unique_ptr" and "matrix4x4"?
Sorry for my ignorance – a linked list matching the performance of a vector is something new to me. I would like to learn more. The best way to prove your point is to show us a benchmark. However, I couldn't find one with a quick Google search.
Well I’ve provided 1 benchmark to your 0. I’d say I’m ahead.
My code benchmark is actually far more prefetch-friendly than a linked list because it can prefetch more than one “node” at a time. In a linked list you can’t prefetch N+1 and N+2 at the same time, because the address of N+2 depends on the result of loading N+1.
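To make that dependency concrete (a sketch assuming GCC/Clang's __builtin_prefetch; the lookahead distance of 8 is arbitrary):

    #include <cstddef>

    struct Node { long value; Node* next; };

    // Array: the address of a[i + 8] is pure pointer arithmetic, so we
    // can ask for data many elements ahead without loading anything.
    long sum_array(const long* a, std::size_t n) {
        long sum = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + 8 < n) __builtin_prefetch(&a[i + 8]);
            sum += a[i];
        }
        return sum;
    }

    // List: the next address is itself the result of a load, so a
    // prefetch can only reach one hop ahead; p->next->next doesn't
    // exist as an address until the load of p->next has completed.
    long sum_list(const Node* head) {
        long sum = 0;
        for (const Node* p = head; p; p = p->next) {
            if (p->next) __builtin_prefetch(p->next);
            sum += p->value;
        }
        return sum;
    }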
I’m always open to learning new things. Modern compiler magic is deep and dark. If a compiler sometimes magically makes it fast and sometimes slow, that’d be quite interesting! If you have any concrete examples I’d love to read them.
It's a link to a document covering architectures well over a decade old; the most recent mentioned is the original Core architecture (from 2008). Honestly, based on your first comment I thought you were implying something about content-dependent prefetching or other techniques that I'm familiar with from the academic literature but unaware of ever being used in mainstream hardware.
> Method 1:
> • Organize the data so consecutive accesses can usually be found in the same 4-KByte page.
> • Access the data in constant strides forward or backward (IP prefetcher).
> Method 2:
> • Organize the data in consecutive lines.
> • Access the data in increasing addresses, in sequential cache lines.
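In code terms (my sketch, not the manual's), the two quoted methods boil down to loops like these:

    #include <cstddef>

    // Method 1: a constant stride; the IP prefetcher learns the stride
    // of this particular load and runs ahead of it.
    long strided_sum(const long* data, std::size_t n, std::size_t stride) {
        long sum = 0;
        for (std::size_t i = 0; i < n; i += stride)
            sum += data[i];
        return sum;
    }

    // Method 2: increasing addresses over sequential cache lines; the
    // streamer follows the ascending sequence of lines.
    long sequential_sum(const long* data, std::size_t n) {
        long sum = 0;
        for (std::size_t i = 0; i < n; ++i)
            sum += data[i];
        return sum;
    }

A linked-list traversal fits neither pattern, because the next address comes from the data itself rather than from a predictable stride.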
Nothing new here that contradicts the GP's skepticism. Certainly not enough evidence for you to be a dick about it.
Ok, but the hardware data prefetch functionality seems basically unchanged:
"Characteristics of the hardware prefetcher are:
• It requires some regularity in the data access patterns.
— If a data access pattern has constant stride, hardware prefetching is effective if the access stride
is less than half of the trigger distance of hardware prefetcher.
— If the access stride is not constant, the automatic hardware prefetcher can mask memory latency if the strides of two successive cache misses are less than the trigger threshold distance (small- stride memory traffic).
— The automatic hardware prefetcher is most effective if the strides of two successive cache misses remain less than the trigger threshold distance and close to 64 bytes."
It details the prefetchers: how they work and what patterns they recognize (section 3.7).
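To picture the "close to 64 bytes" sweet spot in code (my example, not the manual's): iterating an array of 64-byte elements produces successive accesses exactly one cache line apart, whereas heap-allocated list nodes generally land at irregular strides.

    #include <cstddef>

    // 64 bytes per element: each access lands exactly one cache line
    // after the previous one, the stride the quoted text calls ideal.
    struct alignas(64) Record {
        float m[16]; // e.g. a 4x4 float matrix exactly fills the line
    };

    double sum_first_column(const Record* r, std::size_t n) {
        double s = 0;
        for (std::size_t i = 0; i < n; ++i)
            s += r[i].m[0]; // constant 64-byte stride the prefetcher tracks
        return s;
    }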