It doesn’t claim linked lists are fast. And it doesn’t have any benchmarking data.
Actually the only reference to linked lists I see is in the prefetcher section (3.7.3), where it explicitly recommends storing consecutive items in a single 4-KByte page or in consecutive cache lines. They give a positive code example using an array, like other commenters are suggesting.
That depends more on what the nodes link to. Not prefetching a worthless list of pointers when you could be prefetching actual data is typically faster. Prefetchers have limited capabilities and limited capacity for outstanding loads.
The prefetcher doesn't even have to get involved for linear sequences; it only kicks in after a cache-miss threshold is met. The fastest and simplest prefetcher, at L1, is stream-based and works great for sequential data.
Prefetching just one element ahead does fuck-all when RAM is ~100x slower than cache, especially if you need to precharge and activate a new DRAM row, which is almost always the case when you're spraying tiny objects all over the place.
So don't spray tiny objects around. :) Keep them nicely tucked away in an arena.
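For illustration, a minimal bump-allocator arena might look like this (a hypothetical sketch: no per-object free, no destructor calls, no chunk growth):

    #include <cstddef>
    #include <new>
    #include <utility>
    #include <vector>

    // Minimal bump arena: objects are packed contiguously and all
    // storage is released at once when the arena is destroyed.
    // Note: destructors are never run, so as written this only suits
    // trivially destructible types.
    class Arena {
        std::vector<std::byte> buf;
        std::size_t used = 0;
    public:
        explicit Arena(std::size_t capacity) : buf(capacity) {}

        template <typename T, typename... Args>
        T* create(Args&&... args) {
            // Round the offset up to T's alignment, then claim sizeof(T).
            std::size_t off = (used + alignof(T) - 1) & ~(alignof(T) - 1);
            if (off + sizeof(T) > buf.size()) throw std::bad_alloc{};
            used = off + sizeof(T);
            return new (buf.data() + off) T(std::forward<Args>(args)...);
        }
    };

Nodes created back-to-back this way land next to each other in memory, so even a linked structure built in an arena traverses mostly sequential addresses.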
Vectors actually tend to create a spray of tiny objects, as opposed to a true dynamic array, which has limitations on resize. If you're lucky, the implementation will keep a single contiguous buffer per vector. You're usually not that lucky with bigger data.
If you never free your objects, you may as well use an array anyway; it'll be much simpler. And if you use an arena with an object pool, you're back in cache-miss territory.
Vectors only "create a spray of tiny objects" if you have a vector-of-references or something like that. And even then, reading data requires 2 reads from memory (1 to read the pointer in the vector, and another to read the object). Finding item N in a linked list will require N reads (since you need to read each item before the target). Yes, resizing a vector is O(n). But every time I measure memcpy, it always goes faster than I expect.
If you disagree, try to prove it with code. At least one of us will learn something that way.
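For what it's worth, the shape I have in mind is something like this (a rough sketch, not a rigorous benchmark: one run, no warm-up, and the compiler may vectorize the vector loop, which is arguably part of the point):

    #include <chrono>
    #include <cstdio>
    #include <list>
    #include <numeric>
    #include <vector>

    int main() {
        constexpr int N = 10'000'000;
        std::vector<int> v(N, 1);
        // Caveat: nodes allocated back-to-back like this are friendlier
        // to the cache than a list that grew under real allocator churn.
        std::list<int> l(v.begin(), v.end());

        auto time = [](const char* name, auto&& f) {
            auto t0 = std::chrono::steady_clock::now();
            long long r = f();
            auto t1 = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<
                std::chrono::milliseconds>(t1 - t0).count();
            std::printf("%s: sum=%lld in %lld ms\n", name, r, (long long)ms);
        };

        time("vector", [&] { return std::accumulate(v.begin(), v.end(), 0LL); });
        time("list  ", [&] { return std::accumulate(l.begin(), l.end(), 0LL); });
    }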
"I made an extra test that wraps matrix4x4 in std::unique_ptr."
And that is your test? One piece of code? Not even presenting disassembly? Who the hell knows what your compiler wrought in response to your "std::unique_ptr" and "matrix4x4"?
Sorry for my ignorance – a linked list matching the performance of a vector is something new to me. I would like to learn more. The best way to prove your point is to show us a benchmark. However, I couldn't find one with a quick Google search.
Well I’ve provided 1 benchmark to your 0. I’d say I’m ahead.
My code benchmark is actually far more prefetch-friendly than a linked list because it can prefetch more than one “node” at a time. In a linked list you can’t prefetch N+1 and N+2 at the same time, because the address of N+2 depends on the result of loading N+1.
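To make that dependency concrete (a sketch assuming GCC/Clang's __builtin_prefetch; the lookahead distance of 8 is arbitrary):

    #include <cstddef>

    struct Node { long value; Node* next; };

    // Array: the address of a[i + 8] is pure pointer arithmetic, so we
    // can ask for data many elements ahead without loading anything.
    long sum_array(const long* a, std::size_t n) {
        long sum = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + 8 < n) __builtin_prefetch(&a[i + 8]);
            sum += a[i];
        }
        return sum;
    }

    // List: the next address is itself the result of a load, so a
    // prefetch can only reach one hop ahead; p->next->next doesn't
    // exist as an address until the load of p->next has completed.
    long sum_list(const Node* head) {
        long sum = 0;
        for (const Node* p = head; p; p = p->next) {
            if (p->next) __builtin_prefetch(p->next);
            sum += p->value;
        }
        return sum;
    }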
I’m always open to learning new things. Modern compiler magic is deep and dark. If a compiler sometimes magically makes it fast and sometimes slow, that’d be quite interesting! If you have any concrete examples I’d love to read them.
It's a link to a document covering architectures well over a decade old; the most recent mentioned is the original Core architecture (from 2008). Honestly, based on your first comment I thought you were implying something about content-dependent prefetching or other techniques that I'm familiar with from the academic literature but unaware of ever being used in mainstream hardware.
> Method 1:
> • Organize the data so consecutive accesses can usually be found in the same 4-KByte page.
> • Access the data in constant strides forward or backward (IP prefetcher).
> Method 2:
> • Organize the data in consecutive lines.
> • Access the data in increasing addresses, in sequential cache lines.
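In code terms (my sketch, not the manual's), the two quoted methods boil down to loops like these:

    #include <cstddef>

    // Method 1: a constant stride; the IP prefetcher learns the stride
    // of this particular load and runs ahead of it.
    long strided_sum(const long* data, std::size_t n, std::size_t stride) {
        long sum = 0;
        for (std::size_t i = 0; i < n; i += stride)
            sum += data[i];
        return sum;
    }

    // Method 2: increasing addresses over sequential cache lines; the
    // streamer follows the ascending sequence of lines.
    long sequential_sum(const long* data, std::size_t n) {
        long sum = 0;
        for (std::size_t i = 0; i < n; ++i)
            sum += data[i];
        return sum;
    }

A linked-list traversal fits neither pattern, because the next address comes from the data itself rather than from a predictable stride.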
Nothing new here that contradicts the GP's skepticism. Certainly not enough evidence for you to be a dick about it.
Ok, but the hardware data prefetch functionality seems basically unchanged:
"Characteristics of the hardware prefetcher are:
• It requires some regularity in the data access patterns.
— If a data access pattern has constant stride, hardware prefetching is effective if the access stride
is less than half of the trigger distance of hardware prefetcher.
— If the access stride is not constant, the automatic hardware prefetcher can mask memory latency if the strides of two successive cache misses are less than the trigger threshold distance (small- stride memory traffic).
— The automatic hardware prefetcher is most effective if the strides of two successive cache misses remain less than the trigger threshold distance and close to 64 bytes."
It details the prefetchers: how they work and what patterns they recognize (section 3.7).
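To picture the "close to 64 bytes" sweet spot in code (my example, not the manual's): iterating an array of 64-byte elements produces successive accesses exactly one cache line apart, whereas heap-allocated list nodes generally land at irregular strides.

    #include <cstddef>

    // 64 bytes per element: each access lands exactly one cache line
    // after the previous one, the stride the quoted text calls ideal.
    struct alignas(64) Record {
        float m[16]; // e.g. a 4x4 float matrix exactly fills the line
    };

    double sum_first_column(const Record* r, std::size_t n) {
        double s = 0;
        for (std::size_t i = 0; i < n; ++i)
            s += r[i].m[0]; // constant 64-byte stride the prefetcher tracks
        return s;
    }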