Linked lists are great. But they have the problem that, almost always, whatever problem you're trying to solve would be better done with a regular resizable vector.
This includes problems that they should be great for, like inserting into the middle or the front.
The reason is that, in practice, the way computers actually work means there is an enormous time penalty for jumping around randomly in memory. It's large enough that it's often worth paying an O(lg n) cost to switch to something that will be contiguous in memory and allow the CPU to prefetch data.
There are exceptions when you really should use an actual linked list, but your default should be a vector and not a linked list.
In my first week at my prior job, I had to fix an in-production bug involving quadratic vector growth that was tipping over servers. The vector implementation guaranteed too much, namely a consecutive memory layout (which wasn't needed, and if we had needed it, it would have been a mistake).
Decomposing that vector into recursive subvectors solved the problem. Going from a few tens of millions of elements in a single contiguous vector (with consequent -- and bad -- heap fragmentation!) to nested vectors with a few thousand elements each brought us back online again.
Which is to say: Vectors are nice. Lists are nice. Use appropriate data structures for your scale, watch your semantic dependencies, and don't get hung up on dogma.
The VList paper - which I archived at http://trout.me.uk/lisp/vlist.pdf to avoid having to google the bloody thing every time - is an interesting hybrid.
Regarding sections 4.1 and 8 of the paper: it uses several software implementations of MSB and gives a performance comparison between the VList and the RAOTS structure it was derived from. It doesn't use the hardware instructions that can speed up the MSB calculation and eliminate branches (e.g., clz/lzcnt or bsr). It would be worth taking another look at RAOTS with an MSB routine that leverages the instructions available in modern hardware, and seeing how its performance compares to the VList.
trout.me.uk/lisp/ and trout.me.uk/gc/ are both basically archives of "stuff I knew I'd suffer that problem with" and it's a pleasure every time the contents come in handy for people who aren't me.
Also back when I was still pretending to be a mathematician I used both GWBASIC and UBASIC for assorted purposes so thank -you- for the nostalgia kick.
Still sad that John Shutt (creator of 'kernel lisp') passed away while I was still procrastinating dropping him an email - I think the only other 'thank you' I've most regretted not sending in time was to Terry Pratchett.
Ironically, if you weren't using huge pages, the operating system was already doing this for you anyway, and your contiguous segments weren't contiguous.
This does suggest functionality that the OS should provide: let the program remap memory somewhere else in its address space to obtain a larger contiguous chunk of addresses without copying data. This wouldn't help much if you have direct pointers, but I think e.g. Java has a layer of indirection that would let it benefit from remapping to grow ArrayLists. I don't know if things already work this way. It should make speculative fetches easier for the processor since now it understands what you're doing via the memory map.
Edit: taking this idea a little further is actually fascinating: use the type system to separate objects into pools based on type, and instead of pointers, use indexes into your pool. You can move the pool for free to grow/shrink it. I think you could even do this with pointers and full hardware acceleration today if you used the high bits to tag pointers with their types and used remapping at the hypervisor level to move the pools (so the address in the application level doesn't change). You could do some crazy stuff if the application could coordinate multiple levels of virtual memory.
You can certainly allocate memory at controlled addresses using mmap or VirtualAlloc.
But you should not play virtual memory games at runtime in an application that you expect to be scalable. Unmapping memory or changing its permissions requires propagating the change to all threads in the process. On x86, the performance impact is dramatic. On ARM64, it’s not as bad but it’s still there.
Also, even 64 bits can go fast when you start using high bits, separate pools for different purposes, etc.
Right, the ability to relocate a chunk of memory to a new virtual address (with more capacity at the end) without doing a physical copy would be an interesting way to solve the problem.
I don't know if anyone has done research on integrating data structures and OS functionality in this way. Interesting idea.
We needed to get things running again rather quickly, though :-)
Some ECS implementations use this to reduce the overhead of dynamic arrays, as well as ensuring pointer stability of the items stored. For example entt:
End of page allocation has been a real life-saver. I have it rigged up to the core allocators of the game engine I work with, so you can switch it with a cmdline switch; if you suspect an overwrite, you switch on the EOP allocator ( using more memory temporarily ), and bang, you get a segfault where you went off the end / used-after-free.
I somehow misread your comment the first time through and thought you were suggesting the ability to ask the OS: "Hey, I'm a pointer, is there anything in front of my block of memory, if so can you move it elsewhere so that I can expand?" This would of course be a pretty silly idea but it would be funny to give the OS folks a whole new class of use-after bugs to worry about.
glibc realloc(3) does this on Linux, by way of mremap(2) with MREMAP_MAYMOVE [1]. (Just confirmed on my local machine using strace(1).)
Unfortunately, libstdc++ does not (to my knowledge) use realloc(3) or mremap(2) for vector resize, nor could it in many situations due to the semantics of C++.
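For illustration, here's a rough sketch of doing the same thing by hand with mremap(2); Linux-specific, and the sizes are arbitrary (compile as C++ with g++, which defines _GNU_SOURCE, or add -D_GNU_SOURCE yourself):

    #include <sys/mman.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        size_t old_size = 64 * 1024 * 1024;   // 64 MiB, arbitrary for the example
        size_t new_size = 2 * old_size;

        // A large anonymous mapping, as an allocator might use for a big buffer.
        void *p = mmap(nullptr, old_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        memset(p, 0x42, old_size);

        // Grow the mapping; the kernel may move the virtual address, but the
        // existing pages are remapped rather than copied.
        void *q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
        if (q == MAP_FAILED) { perror("mremap"); return 1; }
        printf("old %p new %p (moved: %s)\n", p, q, p == q ? "no" : "yes");

        munmap(q, new_size);
        return 0;
    }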
That sounds like a lot of work unless you absolutely have to have a contiguous allocation of every element.
If you don't need that, you can get "O(1.1)" access and nice heap behavior with an indirection or two, and the implementation is likely to be lots more portable and standards-friendly than if you also dragged in a bunch of OS mmap() semantics (which tend to have exciting, platform-dependent misfeatures).
Also, I would have to reach for the mmap() man page. I have malloc() right in front of me. :-)
Maybe I'm just not seeing it, but I don't see how mmap lets you remap an existing address to a new one? Looks like there is an mremap on Linux in any case.
If you want to stick to mmap, you can do it by creating a file which doesn’t have any actual disk backing (one way to do this is with POSIX shm_open), and handling all memory allocations by mmapping parts of that file rather than mapping true anonymous memory. Then, ‘remapping’ an address to a new one is a matter of calling mmap again with the same file offset. Unlike mremap, this allows mapping the same memory at different addresses simultaneously, which can be useful for some purposes.
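A minimal sketch of that trick (Linux/POSIX; the shm name and size are arbitrary, and error handling is kept to a minimum):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        const size_t size = 1 << 20;  // 1 MiB

        // A shared-memory object with no disk backing; keep the fd, drop the name.
        int fd = shm_open("/growable_demo", O_CREAT | O_RDWR | O_EXCL, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        shm_unlink("/growable_demo");
        if (ftruncate(fd, size) != 0) { perror("ftruncate"); return 1; }

        // First view of the memory.
        char *a = (char *)mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        strcpy(a, "hello");

        // Second view of the *same* pages at a different address -- no copy.
        char *b = (char *)mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        printf("a=%p b=%p b says: %s\n", (void *)a, (void *)b, b);  // prints "hello"

        munmap(a, size); munmap(b, size); close(fd);
        return 0;
    }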
If your elements number in the millions, then a tree might be better, depending on what you do. If you traverse the whole thing often then a vector is best, but at a million elements a tree will find an element faster despite the cache misses.
I don't know where the line is, but in general you are probably not close to it, so default to a vector. But, as you point out, sometimes you are over that line.
And if you really need it you can even do both tree and vector simultaneously - just make sure both consist of pointers only so you avoid update/sync issues.
You can, but probably shouldn't. The pointers mean a cache miss every time you follow one. Putting everything directly in the vector is the point; otherwise the map is potentially faster.
Right, though we didn't need full tree semantics (just fast random access and an append that didn't spool up the fans on the servers).
I would hate to have to do a full balanced b-tree on short notice. (We also had rather terrible extra requirements that kept an off-the-shelf thing from, say, the STL from working in our environment.)
simple vector implementations will do a full realloc and recopy when you run out of space. so O(1) amortized but O(N) every so often.
another implementation strategy, at the cost of using some extra memory, is to allocate the new space, but only copy over a bit at a time, each time a new element is added -- so every add is still O(1). no long pauses.
(well, assuming the memory alloc is still fast. which i think it should be in practice? tho i'm not super familiar w/ that stuff.)
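a rough sketch of what i mean (ints only, migrating 2 old elements per push, single-threaded -- definitely not production code, just illustrating the idea):

    #include <cstddef>
    #include <cstdlib>

    struct IncVec {
        int *cur = nullptr;     // current (larger) buffer; new pushes go here
        int *old = nullptr;     // previous buffer, still holding not-yet-migrated elements
        size_t size = 0, cap = 0;
        size_t old_count = 0;   // how many elements still conceptually live in `old`
        size_t migrated = 0;    // prefix [0, migrated) has been copied into `cur`

        int at(size_t i) const {
            return (old && i >= migrated && i < old_count) ? old[i] : cur[i];
        }

        void push(int v) {
            if (size == cap) {                      // out of room: allocate, but don't copy yet
                size_t newcap = cap ? cap * 2 : 8;
                // by the time we grow again, the previous migration has finished:
                // we migrate 2 elements per push and had a full `cap` pushes of headroom
                old = cur;
                old_count = size;
                migrated = 0;
                cur = (int *)malloc(newcap * sizeof(int));
                cap = newcap;
            }
            cur[size++] = v;
            // a constant amount of copying per push instead of one big pause
            for (int k = 0; k < 2 && old && migrated < old_count; ++k) {
                cur[migrated] = old[migrated];
                ++migrated;
            }
            if (old && migrated == old_count) { free(old); old = nullptr; }
        }
    };

so every push does O(1) work, at the cost of keeping two buffers alive during migration and an extra check on reads.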
just curious, do you think that solution would have worked for your situation too?
That's exactly the strategy that was tipping over. Once you start resizing vectors of tens of megabytes, it's kind of over.
You can start playing games by accelerating capacity allocation when a vector rapidly starts growing extremely large. This doesn't solve the problem that vectors sometimes guarantee too much for what you need.
True... except in MSVC. std::deque in MSVC is cursed.
If you're targeting MSVC, use boost::deque or something instead. Boost containers are generally fairly good, although they are more or less unmaintained nowadays. (E.g., boost::unordered_map::reserve doesn't work.) Or, like many of us, ignore performance on MS targets, because anybody who cares about performance is running something else.
Many of the STL containers in MSVC are incredibly cursed. Even std::vector is implemented in such a weird way, that you can't debug it easily with any debugger other than Visual Studio. And it's incredibly slow on Debug mode, up to the point that it's unusable... (Because of these reasons I just switched to using my own Vector<T> class recently)
I think there's a very good reason why old gamedevs who were stuck with developing on Windows largely ignored the STL and made their own containers / utility functions: MSVC's STL was just awful.
> Many of the STL containers in MSVC are incredibly cursed. Even std::vector is implemented in such a weird way, that you can't debug it easily with any debugger other than Visual Studio. And it's incredibly slow on Debug mode, up to the point that it's unusable... (Because of these reasons I just switched to using my own Vector<T> class recently)
Here's a link to an old HN discussion that offers some insight on the performance of MSVC's STL in debug.
The "cursed" keyword hand-waves over the problem and suggests it originates in the acritical propagation of ignorant beliefs. Sometimes clarifying these is just a Google search away.
Longer answer is that the size of the "blocks" is limited to 512 bytes or one element, whichever is larger. So unless your elements are really tiny, it is strictly a pessimization.
My foreman always would tell me KISS (keep it simple, stupid); maybe there is another acronym for "just solve the problem at hand, silly" - j,s,p,a,h,s - doesn't really roll off the tongue the same way…
Nobody has mentioned immutability yet - linked lists are easily used as one of the simplest immutable persistent data structures. Your definition of "better" appears to be solely about keeping the CPU burning, but if CPU isn't the bottleneck, I prefer "immutable linked list" over "clone a vector many times" merely on aesthetic and reason-ability grounds.
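For example, the simplest version of that idea, sketched in C++ terms: a cons cell where prepending shares the entire existing list (std::shared_ptr standing in for a GC; names are just for illustration):

    #include <memory>
    #include <cstdio>

    struct Node {
        int head;
        std::shared_ptr<const Node> tail;
    };
    using List = std::shared_ptr<const Node>;

    // Prepend never touches or copies existing nodes; it only shares them.
    List cons(int x, List rest) { return List(new Node{x, std::move(rest)}); }

    int main() {
        List a = cons(3, cons(2, cons(1, nullptr)));
        List b = cons(4, a);   // shares all of `a`; nothing is cloned
        for (List p = b; p; p = p->tail) printf("%d ", p->head);  // 4 3 2 1
        printf("\n");
    }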
Or you could go the Clojure route and do the tree-style immutable vectors where the cost is log32(N) for most operations, small enough in all the relevant operations, and you get the amazing usability of immutable data structures that are performant enough in most cases.
Well, of course if a solution is "cloning a vector many times" then arguably you're using the vectors wrong - in which case it's totally appropriate to find a different data structure (or a better way to use vectors, if possible).
In C where using the simplest data structures is valuable, you’d likely do much better by upgrading your programming language to something that offers a better data structure. If you’re using a higher level language where you can abstract details about your sequence data structure, I think there’s very little benefit to using the simplest data structure and you should instead have widely used libraries that offer a good immutable sequence type, e.g. RRB trees.
One exception is ML style languages where singly linked lists get a big syntax advantage (Haskell only half counts because most of its lists are really just iterators). I think this was, in hindsight, a mistake for a widely used language. It’s also a source of pointless pain for learners (eg I don’t think there’s much benefit to people having to write lots of silly tail-recursive functions or learning that you have to build your list backwards and then reverse it). Another exception would be some lisps where lists are everywhere and the cons-cell nature of them makes it hard to change.
It's surprising people are arguing about this. That fact alone shows that lots of developers don't understand the relationship between L1/L2 cache, data access patterns, and how prefetching works.
That makes sense, given how abstract things have gotten. Decades ago there was an article that showed the penalty for non-linear data access was massive. That was before speculative access, branch prediction and cache prefetching were standard features. The performance hit today would be even more (except presumably for machines that have implemented the speculative execution security mitigations).
As with any such subject, it's worth discussing the details but probably not the generalities really (except perhaps in the whimsical style of antirez).
You're right to point out that, given the performance characteristics of most modern CPUs, accessing memory linearly is optimal. But it doesn't necessarily follow that all collection data structures should be stored as a vector. Consider a collection traversed very slowly relative to the overall computation: there may not be any cache win if the line's always evicted by the time you come back to it.
I’ve always wondered about whether you could have your cake and eat it too with linked lists where nodes are allocated in contiguous blocks of N nodes. So, the normal case of traversing the next pointer is not a discontinuous memory access.
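That's essentially an unrolled linked list. A minimal sketch, with an arbitrary chunk size (pick it so a chunk fills a cache line or two):

    #include <cstddef>

    constexpr size_t CHUNK = 14;   // arbitrary for the example

    struct Chunk {
        size_t count = 0;
        int    items[CHUNK];
        Chunk *next = nullptr;
    };

    struct UnrolledList {
        Chunk *head = nullptr, *tail = nullptr;

        void push_back(int v) {
            if (!tail || tail->count == CHUNK) {       // a new allocation only every CHUNK inserts
                Chunk *c = new Chunk();
                if (tail) tail->next = c; else head = c;
                tail = c;
            }
            tail->items[tail->count++] = v;
        }

        long sum() const {                             // one pointer chase per CHUNK elements
            long s = 0;
            for (const Chunk *c = head; c; c = c->next)
                for (size_t i = 0; i < c->count; ++i) s += c->items[i];
            return s;
        }
    };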
That's quite likely to happen to a linked list when using a compacting GC (such as Java's or C#'s generational garbage collectors) (assuming the elements of the list are only reachable from the preceding element, not from many other places as well).
That depends on the GC’s traversal strategy, I think. If you assume that linked lists are implemented as in lisps (a pair of pointers), then if you traverse the head pointer before the tail pointer, the compacting GC will tend to spread the linked list out over time.
> It's surprising people are arguing about this. That fact alone shows that lots of developers don't understand the relationship between L1/L2 cache, data access patterns, and how prefetching works.
Your misconception lies in the assumption that the only conceivable use case is iterating over a preallocated data structure to do read-only operations.
Once you add real world usages, with CRUD operations involving memory allocations/reallocations then your misconception about cache crumbles, and you start to understand why "people are arguing over this".
You're expanding the scope of the problem. In real life nobody is pulling shit off of a database and stuffing it into a linked list. And if they are, the performance will still be worse, because each list node will be allocated separately. Even if you do your own malloc you will blow cache with a linked list, because as time goes on your locality will be blown out. Iterating over a linked list will be worse as well.
In fact, on modern architectures it may be faster to iterate over arrays just to take advantage of L2/L3 behavior.
It’s hard for me to blame developers for their misaligned mental models since, as you yourself point out, the relationship between your mental model of the hardware’s behavior and what it actually does is complex and not really obvious. Mitigations etc means that the hardware you bought might totally change its behavior in the future. If all you can rely on is time measurements, it doesn’t explain that the software you wrote last year is now eg 40% slower due to bug mitigations. I think it’s easier to expect developers to put in the effort once they’re empowered with powerful tools to explain how the software gets mapped onto the hardware.
I think tools that result in greater transparency at lower levels will probably emerge, something like a compiler-generated dynamic analysis + pre-flight checklist hybrid. But since it’s so low-level, it might actually affect results, which may be a problem. I think intel vtune seems interesting in that regard.
I think the argument of locality and linearity and cache hits is valid, but specific for one access pattern: reading items in order and adding to the tail. If that's what your code ends up doing most of the time, then an array wins over a linked list hands down.
As soon as you start adding or deleting items in the middle, the game changes because the data you need to move is not contained in the cache anymore (except for small arrays).
I can even imagine a workflow that uses both data structures: an app could build up a linked list with the data it reads from an external file, for example, where they can be scattered and not in order; when the file is finally closed, the data are considered read-only and rearranged into an array for the remaining part of the execution.
So don't dismiss linked lists as inefficient "a priori", which is exactly what antirez claims in the article.
That analysis is lacking. Linked lists turn out to be inferior to arrays for essentially any access pattern, at least when the linked list can be assumed to be randomly stored and not accidentally sorted.
Say you want to add an element in the middle of an N-element ll/array.
For the linked list, this means you need to do 2 * N/2 memory reads (read value, read next pointer), which will mean on average N/2 cache misses, to reach the element at index N/2. Then you'll do O(1) operations to add your new element. So, overall you've done O(N) operations and incurred O(N) cache misses.
For an array, there are two cases:
1. In the happy case, you'll be able to re-alloc the array in-place, and then you'll have to move N/2 elements one to the left to make room for your new element. You'll incur one cache miss to read the start of the array, and one cache miss to read the middle of the array, and one cache miss for realloc() to check if there is enough space. The move itself will not incur any more cache misses, since your memory access pattern is very predictable. So, overall you've done O(N) operations, but incurred only O(1) cache misses - a clear win over the linked list case.
2. Unfortunately, sometimes there is no more room to expand the array in place, and you actually have to move the entire array to a new place. Here you'll have to move N elements instead of N/2, but you'll still incur only 1 cache miss for the same reason as above. Given the massive difference between a cache miss and a memory access, doing N operations with 1 cache miss should easily be faster than doing N/2 operations with N/2 cache misses.
The only access patterns where the linked list should be expected to have an advantage are cases where you are adding elements very close to an element which you already have "in hand" (e.g. at the head, or at the tail if you also keep a pointer to the tail, or at the current element if you're traversing the list while modifying it). In those cases, the linked list needs O(1) operations and cache misses, while the array needs O(N) operations and O(1) cache misses, so it starts to lose out.
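If you want to convince yourself, a rough benchmark sketch along these lines (N is arbitrary; the numbers will vary a lot by machine and element size, so measure before believing anyone):

    #include <chrono>
    #include <cstdio>
    #include <iterator>
    #include <list>
    #include <vector>

    template <class F> static double time_ms(F f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        const int N = 100000;

        std::vector<int> v;
        double tv = time_ms([&] {
            for (int i = 0; i < N; ++i)
                v.insert(v.begin() + v.size() / 2, i);   // O(n) moves, ~O(1) misses
        });

        std::list<int> l;
        double tl = time_ms([&] {
            for (int i = 0; i < N; ++i) {
                auto it = l.begin();
                std::advance(it, l.size() / 2);          // O(n) pointer chases to find the middle
                l.insert(it, i);
            }
        });

        printf("vector: %.1f ms, list: %.1f ms\n", tv, tl);
    }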
Well, ideally your programming language (and its standard library) should provide you with basic data structures that are cache oblivious (or optimized for your architecture's cache). So programmers wouldn't need to worry.
I often write code destined to run on a microcontroller, and linked lists are great for avoiding dynamic memory. It's nice to be able to statically or transiently allocate storage without having to bound the size of anything.
Of course, the same unintuitive tradeoffs tend to apply even if there isn't much caching to worry about. A trivial linear iteration is often faster than a more elaborate algorithm when the input is small.
Carmack is a fan of this approach. Allocate some reasonably large number as your design constraint, and if you start approaching it, that is the time to ask why and whether this data structure is still reasonable here.
That's why the Amiga OS (Exec) used linked lists as its primary data structure. It was one of the first microcomputer operating systems using dynamic memory mapping (i.e. no fixed addresses for OS calls and data structures) and supporting multitasking. Any object you received from Exec was basically a pointer to a list node. And, yes, the Amiga's biggest problem was memory fragmentation and leakage, because unreachable linked list nodes would start to accumulate whenever you worked with a buggy program.
It's also an area where the costs are a lot less significant: a Cortex-M4 and below often doesn't have a cache (at least for RAM: flash often has a cache and predictive lookup, but the performance difference is small enough that the cache can basically make executing code from flash identical to RAM), and the RAM is about as fast as the processor, so random access and sequential access perform pretty similarly. (And with the speed of modern microcontroller cores, the size of the data is rarely even an issue: most STM32s can iterate through their entire RAM in under a millisecond.)
That said, in my embedded work I still rarely find a linked list to be a useful data structure (but I know FreeRTOS, which I use a lot, uses them extremely heavily internally).
Well, yes, and a lot of people treat kernel development as if they were developing for a constrained microcontroller. I don't think this is even a bad thing, we want the kernel to be as lean as possible because every page and cycle the kernel burns is one that users don't get to use.
I suppose a more important reason is the "not have to bound anything" part. OS kernels almost by definition don't know how many objects will be created. Implementing queues as vectors of pointers would require dynamic allocation and/or introduce many more failure points when running out of resources. With linked lists, once you have an object you can always append it to a linked-list queue.
At the cost of gross complexity / fragmentation / load on your memory allocator.
Linked Lists are definitely easier to use if you're ever in a position where you're writing your own malloc(). The infrastructure needed to cleanly resize arrays (and also: the pointers all going bad as you do so) has a lot of faults IMO.
EDIT: In particular, linked list nodes have a fixed size, so a hand-written malloc() implementation for them is extremely simple.
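Something like this minimal fixed-size pool is all it takes (no thread safety, no growth; the arena size is whatever you pick):

    #include <cstddef>
    #include <cstdlib>

    struct Node { int value; Node *next; };   // the thing we're allocating

    struct NodePool {
        union Slot { Node node; Slot *next_free; };
        Slot  *arena;
        Slot  *free_list = nullptr;
        size_t used = 0, capacity;

        explicit NodePool(size_t cap) : capacity(cap) {
            arena = (Slot *)malloc(cap * sizeof(Slot));
        }
        Node *alloc() {
            if (free_list) {                    // reuse a freed slot first
                Slot *s = free_list;
                free_list = s->next_free;
                return &s->node;
            }
            if (used < capacity) return &arena[used++].node;   // otherwise bump-allocate
            return nullptr;                     // arena exhausted; real code would grow or fail loudly
        }
        void release(Node *n) {                 // O(1): push the slot back on the free list
            Slot *s = (Slot *)n;                // Node is the first union member, so this is safe
            s->next_free = free_list;
            free_list = s;
        }
    };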
Doubling the vector size each time it needs to be enlarged takes care of fragmentation/memory overhead, in practice. Or preallocating if you know the rough size.
It does take some discipline to avoid the storage of pointers, but once you get used to that it’s quite fine.
Source: I work on VPP, which uses vectors quite extensively.
The optimal growth factor is the golden ratio (≈1.618); in practice many vectors use a growth factor of 1.5.
The reason for not going above the golden ratio is that doing so prevents any previously freed memory from ever being reused. If you always double the size of your vector, it is never possible to reclaim/reuse any previously freed memory (for that vector), which means every time your vector grows you cause more and more memory fragmentation, as opposed to using a growth factor of 1.5, which eventually lets the allocation fall back into the freed space.
Case 1: Growth factor of 2 and an initial size of 10 bytes.
Start with an initial allocation of 10 bytes of memory.
On growth allocate 20 bytes and release the 10 bytes, leaving a hole 10 bytes.
On growth allocate 40 bytes and release the 20 bytes, the hole in memory is now 30 bytes large (the initial 10 byte hole + the new 20 byte hole).
On growth allocate 80 bytes and release the 40 bytes, the hole is now 60 bytes.
On growth allocate 160 bytes and release the 80 bytes, the hole is now 140 bytes.
So on so forth... using this strategy it is never possible for the dynamic array to reclaim the hole it left behind in memory.
Case 2: Growth factor of 1.5 and an initial size of 10 bytes.
Start with an initial allocation of 10 bytes of memory.
On growth allocate 15 bytes and release the 10 bytes, leaving a hole 10 bytes.
On growth allocate 22 bytes and release the 15 bytes, the hole in memory is now 25 bytes large (the initial 10 byte hole + the new 15 byte hole).
On growth allocate 33 bytes and release the 22 bytes, the hole is now 47 bytes.
On growth allocate 50 bytes and release the 33 bytes, the hole is now 80 bytes.
On growth reuse 75 bytes from the hole in memory left over from previous growths, the hole is now 5 bytes.
With a growth factor of 1.5 (or anything less than the golden ratio), the hole grows up to a point and then shrinks, grows and shrinks, allowing the dynamic array to reuse memory from past allocations.
With a growth factor of 2, the hole in memory continues to grow and grow and grow.
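A tiny simulation of that argument (purely illustrative: it ignores every other allocation in the program and assumes the freed space stays in one hole in front of the live block):

    #include <cstdio>
    #include <cmath>

    static void simulate(double factor, int steps) {
        double hole = 0, cur = 10;                 // start at 10 bytes, as in the example
        printf("growth factor %.2f\n", factor);
        for (int i = 0; i < steps; ++i) {
            double next = std::floor(cur * factor);
            if (next <= hole) {                    // the freed prefix is big enough to be reused
                printf("  step %d: %g bytes fit inside the %g-byte hole\n", i, next, hole);
                hole = hole - next + cur;          // new block sits in the hole; old block is freed
            } else {
                printf("  step %d: allocate %g past the end; hole grows to %g\n", i, next, hole + cur);
                hole += cur;                       // old block joins the hole
            }
            cur = next;
        }
    }

    int main() {
        simulate(2.0, 8);   // hole keeps growing, never reused
        simulate(1.5, 8);   // hole eventually gets reused
    }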
This appears to assume that there will be no intervening allocations that are allowed to use the same region of memory, since otherwise your "hole" will be very discontiguous.
But if that is the case, then why are you releasing memory? That's just costing you time moving the data. With a doubling mechanism:
Start with an initial allocation of 10 bytes of memory.
On growth allocate another 10, and don't copy anything, leaving no hole at all.
On growth allocate another 20, and don't copy anything, leaving no hole at all.
etc.
In what scenario would the growth factor of 1.5 actually help? If you're restricting the allocation API to only malloc then you can't grow the size of your allocation and what you said might make sense as something the allocator could take advantage of internally. But realloc exists, and will hopefully try to extend an existing allocation if possible. (If your earlier allocation was fairly small, then the allocator might have put it in an arena for fixed-size allocations so it won't be possible to merge two of them, but once things get big they generally get their own VMA area. Obviously totally dependent on the allocator implementation, and there are many.)
99.9% of the time, each allocation leaves a separate hole. Your goal is to enable other allocations to use those holes without further splitting them, not to get your ever-growing vector to reuse memory.
I usually use doubling, but I just recently landed a patch to expand by a factor of 8 instead, which sped up a microbenchmark (of constructing a JSON string, fwiw) by 50%.
Someone saw in a profile that we were spending a lot of time in realloc in a task that involved building a string in contiguous memory. But the final string was realloc'd down to its actual size, so it was safe to get a lot more aggressive with the growth to avoid copying. The overhead was only temporary. It turns out that powers of 8 get big fast, so there are now many fewer copies and the few that happen are copying a small portion of the overall data, before a final octupling that provided mostly unused space that didn't cost much.
Very interesting. And indeed, especially being adjacent with the very interesting “golden ratio” comment, this shows that there is more than one definition of “best” dependent on the task, and that one always needs to verify their hunch by actual profiling.
Contributions-wise to VPP: varies from time to time. On my very recent memory there were sizable IPSec acceleration patches from folks at Intel. There is a fair few one off bugfixes coming from smaller users who scratch their own itch. Pim van Pelt [0] has contributed/improved a few very cool and sizable features (Linux control plane integration, VPP yaml configurator [1])
As for difference with DPDK - does it do L3 routing? NAT ? MAP ? MPLS ? DHCP ? VPP can do all of this, and of course it’s just off top of my memory…
or if you can't guarantee allocations will succeed and need to handle that case. in that case you can just have the object that's part of the list have its list element data contained within itself. (aka an intrusive list)
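e.g. a simplified kernel-style doubly linked intrusive list (just an illustration of the idea, not any particular library's API):

    #include <cstdio>

    struct ListHead {
        ListHead *prev = this, *next = this;   // an empty list points at itself
    };

    inline void list_add(ListHead *node, ListHead *head) {   // insert after head; O(1), no allocation
        node->next = head->next;
        node->prev = head;
        head->next->prev = node;
        head->next = node;
    }

    inline void list_del(ListHead *node) {                   // unlink from whatever list it is on
        node->prev->next = node->next;
        node->next->prev = node->prev;
        node->prev = node->next = node;
    }

    struct Task {
        int id;
        ListHead run_queue;    // the same Task can sit on several lists at once,
        ListHead wait_queue;   // one embedded link per list
    };

    int main() {
        ListHead runnable;                     // list head
        Task t;
        t.id = 42;
        list_add(&t.run_queue, &runnable);     // cannot fail: the storage already exists
        list_del(&t.run_queue);
        printf("task %d linked and unlinked without allocating\n", t.id);
    }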
A good balance is a tree with about a cache line's worth of values at each node. It degrades to a vector for small N, gets the pre-fetching / branch prediction benefit for medium N, but doesn't require fully contiguous memory for large N. It also tends to do fairly well on common mutation patterns (insertions and deletions) by avoiding large copies.
If you want lock-free atomic data structures there are lots of variations (like AVL trees) that can be used as the underlying storage and just like mentioned above it tends to work well for most use cases and access patterns.
If N is fairly small and/or you allocate most of your linked list around the same time it may not matter which is why measuring your actual program with real-world data and access patterns is so important. eg if in production logging or other malloc traffic ends up making your linked list nodes spread out in memory perf is not going to match what you tested at your desk.
Often an algorithm can operate on an entire cache line's worth of data before main memory can chase down a single pointer. Modern CPUs hide that with prediction but even if the address prediction is 100% accurate if the result is a pointer loading chain you run out of pre-fetch slots pretty fast.
To give concrete numbers: if DRAM is not busy refreshing and there is no bus contention or inter-core contention, then we might see 60-100ns access times all told. On a 3GHz CPU that's roughly 300 cycles waiting around doing nothing... for each pointer that needs to be dereferenced. For a vector, after the first 300-cycle hit the rest would be free - the following elements are in the same cache line, and address prediction will pre-fetch subsequent cache lines as well, allowing pre-fetching to stream data ahead of when it is needed.
> How is a vector good for inserting elements in the middle? That is O(N). Where is the O(lg n) cost coming from?
Both an array and a linked list are O(n) for adding in the middle. In an array, finding the middle element is O(1), and then moving the rest of the array is O(n) [you have to move n/2 elements]. In a linked list, finding the middle element is O(n) [you have to traverse n/2 elements], but adding it is O(1).
So, asymptotically there is no difference. In practice though, the linked list traversal will cause O(n) cache misses to reach the middle element (since we are traversing random memory), while the array move will only incur O(1) cache misses (since we are accessing memory in order) - so, the array will actually win in practice.
Edit: Note that I also don't know where the GP got the O(lg n).
It seems that the case where it is necessary to find the exact middle element and insert something wouldn't be very common but OK, I'll humor this use case. Let's assume that the list is huge and inserting at the middle is the only thing we do. Just maintain a pointer to the middle element and insert there: O(1) for find and insert. If it is necessary to add/remove at other places in the list, it is possible to maintain a doubly linked list and move the middle pointer back and forth accordingly (still O(1)).
I suppose if the LL like structure is mostly read only with rare inserts, not that big and holds simple data types or structs and often needs to be read sequentially then an array/vector/list would be better than a regular LL but then it is obvious that an array is better anyway.
It's poor guidance to tell people that the vector is the "go to" when you need a LL. Sad actually that this poor guidance has been upvoted by so many people such that it is the top comment, all in the name of some FUD about cache misses.
Let's take the case where we need to add to any element of the structure, with uniform probability.
For an array, this means we'll have to move N elements if we need to add at the beginning, N/2 if exactly in the middle, or 0 if at the end. Either way, we'll need to re-size the array, which has some chance of requiring N copies anyway - say, half the time. So, on average, we'll perform 3N/4 moves per element added, requiring 1 cache miss every time.
For a (doubly-)linked list, this means we'll have to traverse 0 elements if we want to add right before Head, 0 if we want to add right after Tail, and in the worst case we'll need to traverse N/2 nodes if right in the middle - so, we'll have to do on average N/4 traversals per element added. Each of the nodes traversed will generally be a cache miss, so N/4 cache misses. So, the linked list will indeed require a third of the number of operations on average, but many times more cache misses, so it should be expected to lose. If you store a middle pointer as well, you'll again halve the number of operations for the list, but that still won't cover the amount of time wasted on cache misses.
Of course, if the distribution is not uniform and is in fact skewed near a predictable element of the linked list, at some level of skewed-ness vs cache speed advantage, the Linked List will overtake the array.
Note that linked lists also have other advantages - for example, they can be used as in the Linux kernel to store the same object in multiple different lists without any copying and without extra pointer indirections (at a small cost of one extra pointer in the object per list it could be a member of). They are also easier to pack into a small address space that has to be dynamically allocated (since you don't require a large contiguous block).
Understanding, and even sympathizing, with the machine's toil re: vectors vs manual linked list can separate a good engineer from a great one. We learn linked lists, assembly, and even machine code in Computer Science majors so that we know what's happening under the hood and can more easily surmise the runtime effects.
> Understanding, and even sympathizing, with the machine's toil re: vectors vs manual linked list can separate a good engineer from a great one.
I think your parent's point is that if you're a C++ programmer, recognizing that vectors outperform other data structures for most use cases is what separates a good engineer from a great one. Often even for things that require lookup (i.e. vector outperforming a set).
During code reviews, a senior C++ programmer in my company (formerly sat in the C++ standards committee) would always demand benchmarks from the developer if they used anything else. In most cases, using representative real world data, they discovered that vectors outperformed even when the theoretical complexity of the other data structure was better.
A great engineer knows the limitation of theory.
Of course, my comment is only regarding C++. No guarantees that vectors are great in other languages.
> Often even for things that require lookup (i.e. vector outperforming a set)
I am always uncomfortable using higher algorithmic complexity. Sure, for small data sets, the vector will outperform the set, but it may not stay small forever. I may waste a millisecond now, but at least, I won't waste an hour if the user decides to use 100x the amount of data I initially considered. And using much more data than expected is a very common thing. It is a robustness thing more than optimization.
If you want both, there are staged algorithms, a typical example is a sort algorithm that goes insertion -> quicksort -> mergesort. Insertion sort is O(N^2), but it is great for small arrays, quicksort is O(N log N) on average but may become O(N^2) in some cases, and mergesort is guaranteed O(N log N) and parallelizable, but it is also the slowest for small arrays. For lists, that would be a list of vectors.
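For what it's worth, a toy sketch of that staging idea (only the first two stages, no fallback for quicksort's worst case; the cutoff of 32 is a guess, not a tuned constant, and real std::sort implementations are considerably more sophisticated):

    #include <algorithm>
    #include <cstdio>
    #include <utility>
    #include <vector>

    constexpr int CUTOFF = 32;   // below this, insertion sort's tiny constants win

    static void insertion_sort(int *lo, int *hi) {
        for (int *i = lo; i < hi; ++i)
            for (int *j = i; j > lo && j[-1] > j[0]; --j) std::swap(j[-1], j[0]);
    }

    static void staged_sort(int *lo, int *hi) {
        if (hi - lo <= CUTOFF) { insertion_sort(lo, hi); return; }
        int pivot = lo[(hi - lo) / 2];
        // three-way partition: [lo,m1) < pivot, [m1,m2) == pivot, [m2,hi) > pivot
        int *m1 = std::partition(lo, hi, [pivot](int x) { return x < pivot; });
        int *m2 = std::partition(m1, hi, [pivot](int x) { return x == pivot; });
        staged_sort(lo, m1);
        staged_sort(m2, hi);   // the == pivot block is already in place
    }

    int main() {
        std::vector<int> v{9, 3, 7, 1, 1, 8, 2, 6, 5, 4};
        staged_sort(v.data(), v.data() + v.size());
        for (int x : v) printf("%d ", x);
        printf("\n");
    }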
I like the staging idea, but curiously I haven't seen any papers investigating this.
Obviously, there is an overhead cost of the staging itself, which hits the use case for small data, and there is the question "From what data size onwards to use what algorithm?".
For example, I just had a look at sort in Clang's STL implementation, which seems to set a boundary at N=1000. Since the number is a multiple of 10, it looks like it might have been arbitrarily picked; in any case there are no paper references in the comments to back it up.
Are people aware of formal studies that explore whether it is worth the trouble to inspect data structure and then to dispatch to different algorithms based on the findings?
You're saying you think there's still optimization to be done re. the N=1000 constant?
Other than that, I'd suggest that big O complexity is already telling you that the inspection is worth it. Assuming inspection of an array is constant.
> Sure, for small data sets, the vector will outperform the set, but it may not stay small forever.
In over 90% of SW applications written in C++, there is little to no uncertainty on the size of the data set. Most applications are written for a very specific purpose, with fairly well known needs.
If my data set is size N, and I can benchmark and show the vector outperforms the theoretical option up to, say, 20N, it's a safe choice to go with vector. Doing these analyses is engineering.
In any case, it's almost trivial to swap out the vector with other options when needed, if you plan for it.
I hope that senior engineer is also the one person who will refactor things if there are ever enough elements in any of those vectors to make linear search slower than an actual set or other data structure. Oh, and of course that senior engineer will also hopefully be the one monitoring the lengths of all those vectors in production.
I believe he was referring to the number of elements in a vector, to know when to swap with a different data structure - not the size in bytes.
But to your point - I once needed to do a key/value lookup for a large data set, and so of course I used a map. Then I ran out of RAM (had 80GB of it too!). So I switched to a vector and pre-sorted it and used binary_search for lookups (data set was static - populate once and the rest of the codebase didn't modify it). And of course, the RAM consumption was less than half that of the map.
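The pattern looks roughly like this (key/value types here are placeholders):

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct Entry { uint64_t key; std::string value; };

    struct StaticMap {
        std::vector<Entry> data;

        void finalize() {        // call once after all inserts, before any lookups
            std::sort(data.begin(), data.end(),
                      [](const Entry &a, const Entry &b) { return a.key < b.key; });
        }
        const std::string *find(uint64_t key) const {   // O(log n) binary search, no per-node overhead
            auto it = std::lower_bound(data.begin(), data.end(), key,
                                       [](const Entry &e, uint64_t k) { return e.key < k; });
            return (it != data.end() && it->key == key) ? &it->value : nullptr;
        }
    };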
Just describe your sorted list as "a balanced binary tree with implicit child links" and then you get credit for being fast, space-efficient, and using fancy data structure.
> [V]ectors outperform other data structures ... even for things that require lookup (i.e. vector outperforming a set)
This is something I've heard before (not often benchmarked) and it makes intuitive sense, and sometimes I've also told it to myself in order to not feel bad for writing O(n) where n is always small.
But somehow, reading this snippet, I'm reminded that the vector does not maintain the semantic reminder that it's used in a key/value use case. So it has me thinking there should be a map/hashtable/dictionary interface that's really just a vector with linear lookup.
Then I think, hmm, maybe the standard containers should just do that. Maybe the authors of those should have some heuristic of "when to get fancy".
std::find_if works great if you want to treat a vector like a map. Plus you can have different keys if that is useful for your problem (look up an employee by name, id, or address?).
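For instance, something like this (Employee and its fields are invented for the example):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Employee { int id; std::string name; std::string address; };

    // The same flat vector answers lookups by whichever "key" you need.
    const Employee *by_name(const std::vector<Employee> &v, const std::string &name) {
        auto it = std::find_if(v.begin(), v.end(),
                               [&](const Employee &e) { return e.name == name; });
        return it == v.end() ? nullptr : &*it;
    }

    const Employee *by_id(const std::vector<Employee> &v, int id) {
        auto it = std::find_if(v.begin(), v.end(),
                               [&](const Employee &e) { return e.id == id; });
        return it == v.end() ? nullptr : &*it;
    }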
I think the cases where lookup is faster for a linearly scanned vector is where the hash computation overhead is greater than a linear scan. Assuming you have a good hash implementation, that should be true only for a small number of entries, where you need to determine "small" by measuring.
Even if hashing the key is cheap, if the table isn't in cache the vector will be faster just because we avoid the cache miss of loading that value into memory.
All of these are indeed important, and will eventually (one hopes) result in the student realizing that linked lists are to be avoided because of their poor mechanical sympathy resulting from their awful cache locality.
Valid uses for a linked list are extremely niche. Yes, you can name some, and I can name several valid uses for bloom filters. Just because a simple linked list is easy to implement does not mean it should be used, any more than bubble sort should be used for its simplicity.
Wait why do linked lists have bad cache locality? It depends on how your heap is set up. For example, you could have an allocator that gives relatively good locality by having high granularity and only bailing out if say your LL gets super long (so if your lists are usually short, they could have awesome locality)
That only works if you allocate lists all at once and never modify them.
If there are any other allocations taking place at the same time -- say, allocations from other threads, or for other data you had to create while building the list, that locality is shot. Same goes if you insert more elements to the list later, or if you perform an operation which changes its order (like sorting it).
I mean, if you modify the contents of the list nodes in place, sure. But at that point, you're just doing an array with extra steps.
Changing any of the pointers, however, will make memory prefetch much less efficient. CPUs like linear memory access patterns; they're easy to predict. Chasing pointers is much less predictable, even if the pointers are all to the same region of memory.
That doesn't seem to make much sense to me. LLs are useful in cases where you deal with changing lists, where you'd otherwise be forced to allocate and deallocate memory often. If you know beforehand the exact or approximate size of the list and its contents, you can store that contiguously in memory with greater cache locality, because you're not storing pointers in addition to data. Seems to me the use case of LLs is perhaps as ill suited for cache locality as the structure itself.
True. You could implement that using an array of structs, where each struct contains indexes into the array to link to other entries instead of memory pointers (which are 8 bytes in 64-bit applications). This way you get the nice properties of both linked lists and contiguously allocated arrays.
If your list is short then you won't have a problem. But if your list is long then traversing it can be a major problem.
For example, if your list item is page aligned like the task_struct in the Linux kernel, rescheduling the next process could traverse the process list. The process list links will be at the same page offset and associate with a restricted set of the L1/L2/L3 caches lines and thrash from memory. Worse, the TLB will thrash as well. This was actually the case for Linux until about 2000.
If you are in that place, you're probably using a linked list over small statically allocated memory, but you still need random ordering and fast remove or reorder.
It is not too large (a few thousand entries usually), but it needs to be accessed and modified extremely fast. There can be insertions and deletions anywhere, but it is strongly biased towards the front. It is ordered by price, either ascending or descending.
Various "ordered queue" implementations are usually a poor fit, as there are many insertions and deletions in the middle, and they need to avoid any additional latency. The same goes for vectors. Red-black trees etc. are too complex (and slow) for such a small size.
So we start with a humble linked list, and optimize it (unrolled, skip search etc). But we respect linked lists, they work.
The performance of vectors comes from iterating through them and letting the cpu prefetch items before you need them. Random access in a vector doesn’t really get you that if the vector is larger than your L1/L2 caches, which linked lists would be in anyway if you used them recently enough.
Interesting idea but how does C++ guarantee contiguous memory for a vector? I just don’t see how a data structure with an arbitrary, dynamic size can also reside in a contiguous range of address space.
That works if the data that you want to put in the sequence is copyable (doesn't have "identity"), or if you can arrange so that it gets created in the right spot from the start and you can always choose the iteration order to be sequential in memory. For many more complex structures, that is not the case.
Example: Any object shared among multiple lists or other data structures at the same time, or directly pointed to by other objects, such that any of those views must reach the same cheaply mutable state. This includes objects in the object-oriented programing (OOP) model, actors in the actor model, and file objects in a kernel which are in multiple lists.
For those you have to store object pointers in the vector, instead of copies of the objects. That works and is often done, but it defeats or reduces the cache locality performance improvement which motivates using a vector in the first place.
On an in-order CPU an intrusive list (pointers inside the objects) can be traversed faster than a vector or B-tree of object pointers, though a non-intrusive list (a list of nodes containing object pointers) won't be faster. On an out-of-order CPU with heavy speculative execution it is less clear cut, because the vector allows successive objects to be dereferenced speculatively in parallel to a limited extent, while this is not possible when traversing an intrusive list.
If the list is going to be traversed looking for objects based on some filter criterion such as flags or a number comparison, based on non-mutable state (or where it's ok for mutation to be expensive), traversal can be sped up by storing criterion data in the vector alongside the pointers, or in a second vector, to avoid dereferencing every touched object pointer during traversal.
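e.g. something like the following sketch, where the flag lives in the vector and the big object is only touched on a match (the Order type and its fields are invented for the example):

    #include <cstdint>
    #include <vector>

    struct Order { uint64_t id; bool urgent; /* ...big, rarely-touched payload... */ };

    struct Index {
        struct Slot { uint32_t flags; Order *obj; };   // small, cache-friendly entries
        std::vector<Slot> slots;

        template <class F>
        void for_each_flagged(uint32_t mask, F f) {
            for (const Slot &s : slots)         // linear scan over contiguous slots...
                if (s.flags & mask) f(*s.obj);  // ...only matching entries pay the pointer dereference
        }
    };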
Linked list are indeed great for secondary traversal orders (or for subsets).
But I don't understand your comments re in-order CPUs: vectors will be faster there too as they avoid the load-load dependency. Most modern inorder CPUs are still superscalar and fully pipelined.
If the data is relatively big buffers, you really do not want to copy them. Your memory bandwidth will thank you, and performance will vastly increase.
Cache locality will be good anyway since you're referring to long blocks.
This kind of data structure often has a buffer + associated metadata. If the metadata + pointer fits into 1-2 cache lines, storing it in a vector can be a win vs. storing the next/prev pointers next to the data + metadata.
In any case, there is always only one authoritative answer: “perf top”. :)
> almost always, whatever problem you're trying to solve would be better done with a regular resizable vector
Resizing a vector brings the whole of it into cache, potentially dragging all of it all the way from actual RAM. Prepending an entry brings nothing into cache.
> Resizing a vector brings the whole of it into cache
I don't do low level code on modern systems; is there not a way to just blit chunks of memory around without a CPU looping its way through word by word, carrying out MOVQ instructions which do a full RAM read/write?
If you’re streaming data out of memory to a PCIe bus or something, that wouldn’t go via cpu cache, right? There’s some kind of DMA model, so there’s something in the system architecture that allows memory access to be done off-CPU (I vaguely think of this as being what the northbridge was for in the old school x86 architecture but I’m fuzzy on the details?).
Mechanical copying of byte arrays feels like something a CPU should be able to outsource (precisely because you don’t want to pollute your cache with that data)
In the ideal case (and assuming a vector large enough that it spans many pages), you would actually be able to work with the OS's virtual memory manager to re-allocate and move only the page that contains the address you want to write to, while all other pages can stay untouched. But there is quite a bit of careful work to do to implement something like this and make sure you are not invalidating pointers or even internal malloc() data structures.
Not in architectures I've seen or maybe I'm missing the instructions to do it? Even if there were and you can bypass the CPU, hauling data out of RAM is still going to take a long time.
Yes, they’re called “non-temporal” load and store instructions. AFAIK most memcpy implementations will automatically use non-temporal access when copying blocks larger than a certain threshold, precisely to avoid destroying the cache during bulk copies.
Additionally, a well-optimized memcpy implementation (e.g. making good use of prefetch hints) should not suffer too much from RAM latency and be able to pretty much max out the bandwidth available to the CPU core.
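For illustration, a bare-bones non-temporal copy on x86/SSE2 might look like this (it assumes a 16-byte-aligned destination and a size that's a multiple of 16, and makes no claim of matching a tuned memcpy):

    #include <immintrin.h>
    #include <cstddef>

    void copy_nontemporal(void *dst, const void *src, size_t n) {
        auto *d = static_cast<char *>(dst);
        auto *s = static_cast<const char *>(src);
        for (size_t i = 0; i < n; i += 16) {
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(s + i));
            // stream the store past the cache hierarchy instead of dirtying cache lines
            _mm_stream_si128(reinterpret_cast<__m128i *>(d + i), v);
        }
        _mm_sfence();   // make the streaming stores visible before anyone reads dst
    }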
You're thinking of like MOVNTDQA? Those still have to reach into RAM - that's still double-digits nanoseconds away and still creates a stall in the processor at n complexity!! No thanks? I can allocate a new linked list node with memory guaranteed already entirely in cache (except for TLA failure.)
> be able to pretty much max out the bandwidth available to the CPU core
Why max out a slow link at n complexity when you can just avoid doing that entirely and use a linked list?
Surely you know the answer, if you're analyzing at this level of detail?
It's because spending 1 millisecond per X operations doing vector moves is preferable to spending 20 milliseconds per X operations stalling on list links.
If you set a good vector size from the start, the moves will be rare to nonexistent. Or you could reserve some address space and never have to move the data.
Sure, prepending or adding at the end(assuming you're keeping a tail pointer, which is easy and common) are basically the only cases where a linked list can beat an array.
Yeah, and we deliberately prepend because that doesn't involve writing to a field in a node which may not be in cache. It's the consumer's job to reverse the list if they ever actually need it.
It's far from a disaster because the best way to deal with a list is still to use a contiguous array.
Allocation isn't free. You have to do it sometime and doing it all in bulk is going to have the same consequences as just one small allocation anyway. If you are mapping new pages into main memory, making system calls or looping through a heap structure, you're not only affecting the data cache but also the TLB while also pointer chasing.
Making a list out of an allocation for every element is performance suicide on modern computers. Even the amortized resizing of an array is done at the speed of memory bandwidth because of the prefetcher. Thinking that is some sort of performance hindrance is absurd because it outweighs the alternative by orders of magnitude. If there really are hard limits on the time every insertion can take, then the solution needs to be links of large chunks of memory to still minimize allocations on every element.
Obviously the compromise is to use a linked list, but preallocate a contiguous block of memory to hold all your linked list nodes. Like an array of linked list nodes, each with a pointer to the array index of the next node.
Then you just need to double the size of that block of memory and copy all the nodes to the new block whenever it fills up.
Hmm, but you’d probably end up needing some kind of a list of free blocks of space within that array to deal with fragmentation… you’d need another linked list to store those blocks in.
If only there was some sort of system that could take care of all that allocation and freeing of memory for you?
It looks like you are conflating a linked list with a heap. Once you have an array/vector for your node list the unused memory can be a list itself, each index pointing to the next free index. You just need the next free index to get memory and you point to the current free index when you free a node. You don't need to manage the memory further than that.
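A minimal sketch of that layout (indices as links, free slots threaded through the same array; the types and the NIL sentinel are just choices made for the example):

    #include <cstdint>
    #include <vector>

    constexpr uint32_t NIL = UINT32_MAX;

    struct Pool {
        struct Node { int value; uint32_t next; };
        std::vector<Node> nodes;
        uint32_t free_head = NIL;

        uint32_t alloc(int v) {
            uint32_t idx;
            if (free_head != NIL) {              // pop a slot off the free list
                idx = free_head;
                free_head = nodes[idx].next;
            } else {                             // or append (the vector handles the doubling)
                idx = (uint32_t)nodes.size();
                nodes.push_back({});
            }
            nodes[idx] = {v, NIL};
            return idx;
        }
        void release(uint32_t idx) {             // push the slot onto the free list
            nodes[idx].next = free_head;
            free_head = idx;
        }
    };

Note that the indices stay valid when the vector reallocates, which is exactly what raw pointers into it would not do.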
There is a reason why it's hard to create either a bare linked list or a bare array in modern languages. It's because both are bad extremes, and the optimal algorithm is almost always some combination of them.
Even if memory was packed by a GC, linked lists still have to pay the cost of allocating and garbage collecting every cell of memory individually. I doubt a GC would solve linked lists’ woes.
Do you have some benchmarks demonstrating your claim?
> ...each individual memory request that’s not in cache still has to wait the full 300 cycles. But if we can get 10 of them going at the same time, then we can be hitting a new cache line every 30 cycles. I mean, that’s not as good as getting one every 16 cycles, but it’s close. You’re actually able to get to a reasonable fraction of the raw memory bandwidth of the machine while just traversing randomly over this huge one gigabyte heap
I don't have a benchmark, but it's important to remember that GCs only scan live objects, and they always have to scan all live objects; while allocation is normally an extremely simple instruction (bump the memory pointer). It's true though that scanning a live linked list would be expected to be slower than scanning a live array (though some optimized scanning strategies may be possible).
Overall I couldn't tell what the impact on overall performance would be, but I can tell you for sure that de-allocating an N-element linked list or array are both O(1) operations for a copying garbage collector (which are the most common kinds of compacting GCs), since they simply will never reach nor copy any of the elements.
Will they? I don't think they reorder live objects relative to each other, so any object allocated between two nodes would still end up between those nodes after the GC ran.
Disagreeing with other answers here, but usually they will, and not because the authors tried to make it happen. It's just sort of the default behavior.
A copying collector has to trace through the whole graph to find everything reachable. The tracing order is usually going to be depth first. (It's the easiest to code if you're using recursion and therefore maintaining state with the runtime's stack, but even with an explicit stack you'll want to push and pop the end for locality.) Which means that objects arranged into a linked list will be visited in list order, and you generally copy as you visit an object for the first time. So... reordering live objects relative to each other "just happens". Even if it's a doubly-linked list, since the back pointers have already been visited and so don't matter.
This is less true if your objects have a lot of other pointers in them. But that's just saying that if your object graph isn't a list, then it won't end up laid out in list order. It'll still end up in DFS order according to the true graph.
If you can replace a linked list with a vector then you don't need a linked list, and the two shouldn't be compared, as it is almost always obvious when you need one over the other. Linked lists are used in areas where there is a relationship between nodes, like a list of running processes. One of the frequent operations is to remove a node from whatever position it's in and add it on either side.
Right. There are so many algorithms that I shudder to think about how you'd implement them without linked lists. Like... how about a buddy allocator? Sure, you could use a vector for each, but you'd be copying and resizing huge swathes of memory constantly for very little gain. Use the right tool for the job!
The thing is, the cost of copying memory is essentially the same as the cost of reading it, and linked lists force you to keep re-reading the same memory over and over (assuming random access).
Adding an element in the middle of an array vs the middle of a linked list actually has the same asymptotic complexity (O(n)), but far better cache impact for the array (since moving a contiguous block of n/2 elements should only incur 1 cache miss, while reading n/2 randomly distributed nodes of the list will incur on average something like n/4 cache misses). Adding closer to the beginning of the linked list is faster, but adding close to the end of the vector is faster still, so overall with random access, the vector should win.
I think you missed the point here. You don't use link lists when you are going to walk the list and insert in the middle or some random place. The only time you end up walking a link list in typical scenarios is when you want to print them out or some unimportant operation.
Linked lists are used to hold relationships between nodes, and when you are operating on a node the common patterns are to remove it and add it to another list, or re-add it on either side. Take a look at the BSD or Linux source for inspiration. A process struct has more than a dozen member variables of node* type because the object ends up being on several lists: sched, signal queue, child threads, etc.
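For readers who haven't seen the pattern, a hypothetical sketch of what that looks like (illustrative names, not actual BSD/Linux definitions):

    // One object, several embedded link fields: the same process can sit on
    // several lists at once and be unlinked from any of them in O(1),
    // with no allocation and no traversal.
    struct list_node {
        list_node* prev = nullptr;
        list_node* next = nullptr;
    };

    struct process {
        int pid = 0;
        list_node run_queue_link;   // scheduler's run queue
        list_node sibling_link;     // parent's list of children
        list_node wait_link;        // whatever resource it is currently blocked on
    };

    // Unlinking needs only the node itself.
    inline void unlink(list_node* n) {
        if (n->prev) n->prev->next = n->next;
        if (n->next) n->next->prev = n->prev;
        n->prev = n->next = nullptr;
    }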
What makes you think that list operations make the CPU jump randomly in memory, and that implementors of, say, Lisp systems haven't optimized memory management for linked lists? A modern GC provides features such as generations, areas, copying, and compaction.
On a Lisp Machine from the 80s/90s, the system saves a type-sorted and locality-optimized memory image, from which it later boots the Lisp system. In addition to the typical GC optimizations (incl. integration with the paging system) it also provided CDR coding, which allocates lists as contiguous memory. The GC also creates those CDR-coded lists. CDR coding fell out of fashion in modern Lisp implementations because the space and time savings aren't great enough to justify the more complex implementation.
If you're iterating, sure, use a vector. If you're pulling from a queue and doing a lot of processing, maybe an RPC -- does the ~70ns memory latency hit really matter that much? Probably not.
Costs you a dynamic memory allocation, code overhead and higher memory use. There are good reasons why std::array and plain arrays exist. Vectors are for when your data is unbounded, which is actually a risk most of the time.
For truly big data, you want a special structure anyway.
Without extra work you can't use a vector like a queue, and a circular buffer doesn't necessarily support all types and/or preserve iterators on insert. You could use a deque or a blocked linked list. But IMO a list is fine if it's not clearly a bottleneck.
It's used by log4j2 and several other java libraries and is quite solid. You can use it in blocking or polling modes, so it's quite generic depending on your latency requirements.
It would rebalance on insert with O(log N) performance but chunky constant factor, keeping the ordering so access by index is also O(log N) with low constant factor...
Which is why you would usually use a red-black tree rather than a B-tree, as it has a much lower constant for insertion and access by index. However, it is higher for in-order traversal.
That does sound useful, although I'm having some trouble thinking of what for exactly. :-)
And indeed, I don't recall seeing something like that. At a first glance O(log n) seems easy to do in the average case, but perhaps not in the worst case.
I suppose it's quite similar to the idea of Bagwell's persistent data structures[1] — those are particularly useful for lock free coding and with a high enough branching factor could be not as much overhead as a linked list or just plain copying.
Well, doesn’t the “index” need to be stored in the tree itself for this to work? If so, the tree just represents an ordered map. (Changing indexes after deletion or insertion would be a problem, though.)
An underlying problem here is that we have come to favor computers that scale out the von Neumann architecture with more cache-coherent cores and more memory. If we instead took the path of hardware built from smaller von Neumann units, partitioning small sets of cores and memory into actors that communicate with messages, we would get lower memory latency and less penalty from random access patterns.
> Linked lists are great. But they have the problem that, almost always, whatever problem you're trying to solve would be better done with a regular resizable vector.
One important exception is when you want your data structure to be persistent.
So, two things. One: 1980s machines didn't have the memory access performance gap you see today. If N list-chase operations take the same time as N sequential reads, then the linked list has the nice properties you were probably taught in school, and an actual in-memory representation of a linked list is a good choice. On today's machines that list chase might incur a sizeable cache stall, making it thousands or even tens of thousands of times slower.
Two: Lisp doesn't actually care whether there are literally linked lists behind everything. Today you would use a different data structure reflecting the hardware you have.
> 1980s machines didn't have the memory access performance gap you see today
They had similar problems. CPU with tiny caches <-> caches <-> expensive RAM in the range from 500kbytes to a few MB <-> virtual memory paging to slow disks.
For example, a typical Lisp Machine might have had 20 MB RAM. But the Lisp image it ran was probably already much larger, so paging spaces upwards of 60 megabytes were not uncommon. I had a Lisp Machine with 40 MB RAM and used more than 200 MB of paging space. Disks were very slow: we are talking about ESDI interfaces (2.5 MB/sec or less) or, later, 5-10 MB/sec SCSI 1 and 2.
I had a 600MB ESDI disk inside a 1 MIPS Lisp Machine with 8 Megawords RAM of 36bit memory + 8 bit ECC.
Thus locality played an extremely large role in usable performance. In the early days machines had to be rebooted when they ran out of memory, since a garbage collection could take a long time. Rebooting a machine took just a few minutes; doing a full GC over 200 MB of virtual memory could take half an hour.
When I was making a new Lisp image (called a world), the size was upwards of 50 MB; 100 MB was common. A special command ran for roughly 30 minutes and reordered the objects in main memory to improve locality. Another command then saved the image, which also took tens of minutes.
A big breakthrough in usability came with the introduction of the Ephemeral Garbage Collector, which only touched RAM and took care of the short lived objects, with some hardware support to identify and track RAM pages with changed content.
Features back then were:
* cdr coded lists which were allocated like vectors
* lots of other data structures like vectors, n-dimensional arrays, hashtables, records, objects, ...
* an ephemeral garbage collector with hardware support tracking changes in RAM pages
* incremental garbage collection
* a copying/compacting generational garbage collector with type sorted memory regions
* cooperation between the garbage collector and the virtual memory pager
* various manual or semi-manual memory management facilities
* incremental memory image saves
The main reason to develop Lisp Machines at the end of the 70s was to get Lisp development off of time-shared computers (with limited shared RAM and virtual memory) onto single-user workstations, where RAM and virtual memory are not shared between different users.
The same problem appeared later on UNIX machines, where Lisp systems were often among the most memory-hungry programs. They needed lots of RAM, which was expensive, so a lot of virtual memory was used; but access to Lisp objects in virtual memory was much slower than to Lisp objects in RAM. It took many years to have competitive GCs on those machines.
When RAM got more affordable and larger, things improved.
No, that's only true if you are speaking about the first implementations of LISP, up until LISP 1.5 and a similar timeframe.
By the time Xerox, MIT, TI started delving into Lisp machines, the language had already gotten support for stack allocation, value types, arrays, structures (records).
This is only a problem with the implementation of linked lists. The article talks about how linked lists are conceptual. They make a nice data structure to work with in code, and several ways could easily be conceived to optimize them so that elements are adjacent to each other.
Has there been any work on nudging the allocator towards putting all the nodes of a linked list on the same page? Or close together in memory more generally (to leverage the L1 & L2 caches)?
If you're allocating the entire linked list at once, that may well happen in practice. Otherwise, there is no good way it can happen automatically; but, compacting GCs will usually do this for you (assuming that the list elements are only reachable from the previous element).
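One manual way to get that locality without a compacting GC is to carve all of a list's nodes out of one contiguous arena. A minimal sketch, assuming a fixed capacity known up front and no per-node freeing:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Bump arena: every node for one list comes from the same contiguous
    // block, so walking the list touches consecutive cache lines. Nodes are
    // never freed individually; the whole arena is dropped at once.
    template <typename T>
    class Arena {
    public:
        explicit Arena(std::size_t capacity) { storage_.reserve(capacity); }

        template <typename... Args>
        T* make(Args&&... args) {
            // reserve() above guarantees no reallocation (assuming at most
            // `capacity` nodes), so returned pointers stay valid.
            storage_.emplace_back(std::forward<Args>(args)...);
            return &storage_.back();
        }

    private:
        std::vector<T> storage_;
    };

    struct Node {
        int value;
        Node* next;
    };

    // Usage: build the whole list inside the arena.
    //   Arena<Node> arena(1'000);
    //   Node* head = arena.make(Node{0, nullptr});
    //   head->next = arena.make(Node{1, nullptr});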
Like any generalization, this one too is, of course, incorrect.
Outside of the realm of academia, the need to keep data pieces in several containers at the same time, with no heap operations on insertion or removal and O(1) removal, is very common, especially in kernel space and embedded contexts. The only option that fits the bill - you guessed it - is the linked list.
Note that this (and the article) describes an intrusive linked list. Extrusive linked lists (like you might see in a class library or CS 101 project), where the node structure is heap-allocated separately from the data that it points to, have very few practical advantages over vectors, which is why standard libraries are increasingly defaulting to vectors even when the class is named "list".
Yes. An intrusive list puts the pointers to next & prev in the struct itself. IntrusiveList<Member>::iterator is a Member*, and member->next points to another Member. An extrusive list puts the pointers in a separate node structure, which then has a pointer to the actual member type. ExtrusiveList<Member>::iterator is an ExtrusiveListNode<Member>*, node->next points to another ExtrusiveListNode, and you access the actual Member* with node->data. Basically it's a double indirection.
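Sketched out in code (the names are illustrative, not from any particular library):

    // Intrusive: the links live inside Member itself; an iterator is just a Member*.
    struct Member {
        int payload = 0;
        Member* prev = nullptr;
        Member* next = nullptr;
    };

    // Extrusive: a separately allocated node owns the links and points at the
    // data, so reaching the payload costs a second indirection (node->data).
    template <typename T>
    struct ExtrusiveListNode {
        ExtrusiveListNode* prev = nullptr;
        ExtrusiveListNode* next = nullptr;
        T* data = nullptr;
    };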
The advantage of extrusive linked lists is that you don't need to modify the data layout of Member at all. All the linked list manipulation happens on ListNodes, which have a pointer to the actual member, which can be accessed with just a cast. That's why they're so well-suited to class libraries: you can easily write a generic ExtrusiveLinkedList that works with any data that can be represented with a pointer, hide all the traversal (and the existence of individual ExtrusiveListNodes) in accessor functions, and present a simple API with no codegen or modifications to Member.
The advantages of intrusive linked lists are 1) performance 2) no heap allocations 3) easy traversal when given a Member*. It's only a single indirection rather than a double, and usually that indirection is necessary anyway to avoid copies. Insertion and deletion are just pointer manipulation; you don't need to allocate new space for the nodes. Oftentimes your functions will take Member* anyway, and it's nice not to take or traverse the container if you just need to peek at the next element coming up.
The point of this subthread is that intrusive linked lists have very clear use cases in the kernel & embedded spaces, where these advantages are often critical. Extrusive linked lists, however, have very few advantages over vectors (which also require heap allocation but are more cache & locality friendly). In a common business app responding to user input, the difference between O(1) amortized and O(1) is negligible; you can eat the occasional pause for a vector resizing, because the user isn't going to care about a few milliseconds. They both require heap allocation. The vector will give you much faster traversal, because subsequent reads all hit cache. The vector takes less memory (1 pointer per element, with a max of 50% overhead, while an extrusive doubly-linked list is 3 pointers per element). There's just little reason to use an extrusive linked list because the use-cases where intrusive linked lists are not better are precisely the same ones where vectors are better.
Like any generalization, this one too is, of course, incorrect.
In the real world of application development, when choosing between Vector and LinkedList the answer is almost always Vector. There are, of course, certain problems where LinkedList is always the correct answer.
However, of all the problems that have chosen either Vector or LinkedList as their container of choice, a supermajority should choose Vector.
There are essentially no vectors in most of the embedded world, including RTOS. Why would you need that many extra copies of data?
Copies are not free, neither in terms of stack nor in terms of memory and its bandwidth, both of which are in short supply there.
For an example, look into how most of Zephyr RTOS is implemented. It's mostly intrusive lists (called queues there for some reason, but they support iteration) or ring buffers (msgq or mpsc_pbuf or pipe).
For bigger systems you can use red-black trees.
There is no vector data structure; you can allocate a block from the heap and manage its size manually, but it's rarely done, as most sizes are static and many data structures are preallocated by the bootloader or loaded directly from flash memory. The cache locality of these can of course be manipulated manually through memory sections.
A list does not copy by default, which is an extremely important performance characteristic. And it does not require overallocation like a vector of pointers, plus has a fast remove, especially at head or tail.
Feel free to replace the term “Vector” with “Contiguous Memory”. The relevant distinction in this conversation is “contiguous vs node pointers”. Embedded loves contiguous blocks.
A dynamic array copies memory if it regrows, inserts, or removes. The cost of that operation depends on numerous factors. But it doesn't result in "extra copies" lying around.
Oh wait maybe you mean array operations perform extra copies, not that they persist. In which case it comes down to which is more expensive: memcpy some bytes or cache inefficient pointer traversal.
It depends! But most problems for most programs are MUCH faster if they avoid LinkedList. Many people, especially new programmers, radically underestimate the penalty of pointer chasing.
Quite the opposite. When you have no dynamic allocations, vectors are out of the question, while a pool of nodes is fine and sound.
And quite often you can't copy elements from one place to another in embedded or operating systems, since many consumers may be storing a pointer to a given element. This could of course be solved by using a vector of pointers, but that completely ruins the "cache locality" argument.
Broadly yes, C++‘s vector is the equivalent of JS arrays.
But the javascript array type is much more complicated. It intelligently changes its internal structure based on how you use it - which is pretty wild.
Would you mind elaborating on js arrays intelligently changing their internal structure based on use? My background is only JS/Python and have never touched C++ so I don’t have the context of how these differ.
A C++ vector internally uses an array - a chunk of memory where each item sits in the next spot, in sequential order. E.g., an array of 4-byte ints [1,2,3] might have 1 at memory address 100, 2 at address 104, and 3 at address 108. Given an array index, you can figure out where a value is stored in memory pretty trivially.
When the array fills up, the vector will allocate a fresh array (that's bigger - often 2x bigger) and copy everything over. This is slow - because it has to copy everything - but much faster than you'd think.
So that's how C++'s vector (and Rust's Vec) work.
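You can watch the reallocation happen with a tiny sketch like this (the exact growth factor is implementation-defined, commonly 1.5x or 2x, so the printed numbers will differ between standard libraries):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> v;
        std::size_t last_capacity = 0;
        for (int i = 0; i < 1000; ++i) {
            v.push_back(i);
            if (v.capacity() != last_capacity) {  // a fresh, bigger array was allocated
                last_capacity = v.capacity();
                std::printf("size=%zu capacity=%zu data=%p\n",
                            v.size(), v.capacity(), static_cast<void*>(v.data()));
            }
        }
    }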
Javascript implementations (like v8) can swap out the implementation behind the scenes based on how you're using your array. The API is always the same - but the way the array works in memory will be different based on whatever is fastest given the way you're using your array. (V8 doesn't know ahead of time what sort of data will be in your array or what you're using it for, so it guesses then changes things up if it guesses wrong).
Here's a blog post talking about some of the internal details in V8:
The actual implementation is much more complex than this blog post suggests. (I think they also optimize based on push/pop vs shift/unshift vs other stuff).
Thanks for taking the time to write that up–I appreciate the explanation.
My only adjacent knowledge here was when I checked the V8 implementation of the `sort` method on the array prototype and saw it changes the sorting algorithm used based on array length (IIRC insertion sort for length < 10 and merge sort otherwise).
The linked list has a wonderful capability that's frequently used in C: you can do O(1) removal from a list. E.g. you can have an "object" detach itself. This is not possible in Java's standard linked list implementation, among other languages.
In general, inserting and removing elements from a vector requires memory copying.
There are programming languages that have linked lists as their fundamental data structure, like Lisp and Erlang.
For situations where I don't need random access, linked lists are hard to beat.
Linked lists also make wonderful immutable data structures.
Linked lists have wonderful filter and flatmap performance. How much memory copying would a vector require?
The problem with your analysis is that you're using big-O notation. Big-O notation is based on an abstract machine that is pretty unrepresentative of real computers. In practice, the memcpy is always going to be faster (until you get to really big sizes). Put another way, linked list is O(1) with a massive constant, and memcpy is O(n) with a tiny constant.
And actually this logic extends to other algorithms too. My rule of thumb is that a linear search (despite being O(n)) will be faster up to around n=100 than binary search (which is O(log n)) just due to the way the CPU operates.
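If you want to sanity-check that rule of thumb yourself, something like the following sketch works (n=100 and the iteration count are arbitrary assumptions; the crossover point depends on hardware, element type, and compiler):

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        constexpr int n = 100;
        constexpr int iterations = 1'000'000;
        std::vector<int> data(n);
        std::iota(data.begin(), data.end(), 0);  // sorted 0..n-1

        auto time_ms = [&](const char* label, auto&& find) {
            auto t0 = std::chrono::steady_clock::now();
            long long hits = 0;
            for (int i = 0; i < iterations; ++i)
                hits += find(data, i % n);
            auto t1 = std::chrono::steady_clock::now();
            std::chrono::duration<double, std::milli> ms = t1 - t0;
            std::printf("%s: %.1f ms (hits=%lld)\n", label, ms.count(), hits);
        };

        time_ms("linear", [](const std::vector<int>& v, int key) {
            // O(n), but sequential reads and easily predicted branches.
            return std::find(v.begin(), v.end(), key) != v.end();
        });
        time_ms("binary", [](const std::vector<int>& v, int key) {
            // O(log n), but each step's branch is essentially a coin flip.
            return std::binary_search(v.begin(), v.end(), key);
        });
    }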
This is assuming you already have a pointer of the element you're removing, or a neighbor thereof. Otherwise, you'd have to traverse a portion of the linked list, which has a significant amount of per-element overhead.
Modern CPUs detect patterns of pointer chasing common in linked list use and prefetch to cache just fine. Your comment would have been valid a decade or more ago.
And plenty of use cases are much better with linked lists than resizable vectors. Eg: queues.
I'm going to call you out on this one, because it's a bold claim, and I'd love to see an explanation and some perf numbers.
For example, I'm wondering how the CPU knows what the next item in the list to prefetch it. Unlike the next item in an array, list items could be pointing anywhere in process memory.
Also, what counts as a modern CPU in this context? Are we talking latest generation desktop CPUs from Intel, or anything after Core 2 Duo? How about mobile devices with ARM CPUs? Would those just be ignored?
Apple's M1 has a data dependent prefetcher (superset form of pointer chase prefetcher), as discovered by the researchers behind the Augury paper [1]. They discuss how it works, how and when it activates, as well as a couple other cool details which should more than answer your curiosity. They don't have benchmarks for performance, but these sorts of prefetchers do exist (and possible in hardware you're using right now!).
The prefetcher described in the paper and implemented in the M1 is nowhere near a prefetcher that would be able to prefetch linked-list nodes.
> We tested for the existence of four DMPs: both single- and two-level versions of pointer-chasing and indirection-based DMPs (Section IV). Our findings show the existence of a single-level pointer-chasing DMP....
When activated, the prefetcher described will also issue loads for pointers in a contiguous array of pointers. In particular, the pointer/addresses are already known by the core (because they themselves were presumably prefetched far ahead of time by the stream/stride prefetcher), so the core knows which addresses to prefetch. But in a linked-list, the address of the next node is not known by the core until after the node is retrieved from memory. I.e. there is an additional data dependency, and it takes a round trip to memory to resolve that data dependency.
It's the difference between prefetching B's in
A0 A1 A2 A3 A4
v v v v v
B0 B1 B2 B3 B4
and prefetching B, C, D, E in
A -> B -> C -> D -> E
The former is much easier to do than the latter as 1) the A's are easy to prefetch, and 2) the B's can be fetched in parallel. In the second, the core has to wait until B resolves before it can fetch C, and it has to wait until C resolves before it can fetch D, etc.
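Written as code, the two shapes look like this (hypothetical Node type; the point is only where the next address comes from):

    struct Node { int value; Node* next; };

    // Array of pointers: every address is known up front, so the loads are
    // independent and can be issued (and prefetched) in parallel.
    int sum_array_of_pointers(Node* const* nodes, int n) {
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += nodes[i]->value;   // nodes[i] does not depend on nodes[i-1]
        return sum;
    }

    // Linked list: the address of the next node is stored inside the current
    // one, so each load must wait for the previous one to return from memory.
    int sum_list(const Node* head) {
        int sum = 0;
        for (const Node* p = head; p != nullptr; p = p->next)
            sum += p->value;          // p->next is only known after *p is loaded
        return sum;
    }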
It doesn’t claim linked lists are fast. And it doesn’t have any benchmarking data.
Actually, the only reference to linked lists I see is in the prefetcher section (3.7.3), where it explicitly recommends keeping consecutive items in the same 4-KByte page or in consecutive cache lines. They give a positive code example using an array, like other commenters are suggesting.
That depends more on what they link to. Not prefetching a worthless list of pointers when you could prefetch data instead is typically faster. Prefetchers have limited capabilities and capacity for loads.
The prefetcher doesn't even have to get involved for linear sequences, it would only get involved after a cache miss threshold is met. The fastest and simplest prefetcher, L1, is stream based and will work great for sequential data.
Prefetching just one element ahead does fuck-all when RAM is 100x slower than cache, especially if you need to precharge and activate a new row, which is always the case when you're spraying tiny objects all over the place.
So don't spray tiny objects around. :) Keep them nicely tucked away in an arena.
Vectors actually tend to create a spray of tiny objects... As opposed to a true dynamic array which has limitations in resize. If you're lucky, they will keep a single dynamic array per vector. You're usually not this lucky with bigger data.
If you never free your objects, you may as well use an array anyway - it'll be much more simple. And if you use an arena with an object pool, you're back in cache miss territory.
Vectors only "create a spray of tiny objects" if you have a vector-of-references or something like that. And even then, reading data requires 2 reads from memory (1 to read the pointer in the vector, and another to read the object). Finding item N in a linked list will require N reads (since you need to read each item before the target). Yes, resizing a vector is O(n). But every time I measure memcpy, it always goes faster than I expect.
If you disagree, try to prove it with code. At least one of us will learn something that way.
"I made an extra test that wraps matrix4x4 in std::unique_ptr."
And that is your test? One piece of code? Not even presenting disassembly? Who the hell knows what your compiler wrought in response to your "std::unique_ptr" and "matrix4x4"?
Sorry for my ignorance – linked list matching the performance of vector is something new to me. I would like to learn more. The best way to prove your point is to show us a benchmark. However, I couldn't find one with a quick google search.
Well I’ve provided 1 benchmark to your 0. I’d say I’m ahead.
My code benchmark is actually far more pre-fetch friendly than a LinkedList because it can prefetch more than one “node” at a time. In a LinkedList you can’t prefetch N+1 and N+2 at the same time because N+2 is dependent on the result of N+1.
I’m always open to learning new things. Modern compiler magic is deep and dark. If sometimes a compiler magically makes it fast and sometimes slow that’d be quite interesting! If you have any concrete examples I’d love to read them.
It's a link to a document on architectures well over a decade old - the most recent mentioned is the original Core architecture (from 2008). Honestly, based on your first comment I thought you were implying something about content-dependent prefetch or other techniques that I'm familiar with from the academic literature but unaware of ever being used in mainstream hardware.
> Organize the data so consecutive accesses can usually be found in the same 4-KByte page.
> • Access the data in constant strides forward or backward (IP Prefetcher).
> Method 2:
>• Organize the data in consecutive lines.
> • Access the data in increasing addresses, in sequential cache lines.
Nothing new here that contradicts the GP's skepticism. Certainly not enough evidence for you to be a dick.
Ok, but the hardware data prefetch functionality seems basically unchanged:
"Characteristics of the hardware prefetcher are:
• It requires some regularity in the data access patterns.
— If a data access pattern has constant stride, hardware prefetching is effective if the access stride
is less than half of the trigger distance of hardware prefetcher.
— If the access stride is not constant, the automatic hardware prefetcher can mask memory latency if the strides of two successive cache misses are less than the trigger threshold distance (small- stride memory traffic).
— The automatic hardware prefetcher is most effective if the strides of two successive cache misses remain less than the trigger threshold distance and close to 64 bytes."
Google for Intel hardware prefetcher and you'll turn up some info.
Re: benchmarks. Not likely exactly what you're looking for, but the GigaUpdates Per Second benchmark (https://en.wikipedia.org/wiki/Giga-updates_per_second) is sort of the most-pathological-case of pointer jumping where the jumps are random. Prefetchers don't do well with that (for obvious reasons) - they tend to work best when there is some pattern to the accesses. Linked data structures often live somewhere in between - they may start with good, regular stride patterns, but then over time as elements are added and moved around, it turns into a tangle of spaghetti.
Edit: I should add more. While prefetchers do exist in hardware, they're a bit of a murky subject. Sometimes they speed things up, sometimes they do so badly they slow everything down. They can be very workload dependent. And to top it off, hardware vendors can sometimes be a bit opaque when explaining what their prefetching hardware actually does well on. It's probably safest not to assume your hardware will help you when it comes to tangled pointer-jumping code, and if you can avoid it via a different data structure it's probably good to do so. That said, other data structures may entail other tradeoffs - better cache usage for lookups, but potentially more expensive insertions. As usual with data structures, it's a game of balancing tradeoffs.
Cache prefetching works best for linear accesses like with a vector, not for random pointer lookups. Prefetchers are also going to have an even harder time with double indirection, since they're unable to see which value is going to be fed to the lea.
Prefetching is also expensive in and of itself; the CPU doesn't do it until you're already incurring cache misses, which makes those first misses even worse. There are also multiple prefetchers in the CPU, and the L1 prefetcher is very basic and is only going to help with contiguous accesses. Your allocator might be doing some nice things for you that let one of the L2 prefetchers help, so I would expect you'll see better L2 performance due to prefetching.
If you have an example of a linked list being as fast as a vector for iteration, please do show it.
... I didn't say it was for random pointer lookups. That's the pathological case that was the primary reason the Cray X-MT (formerly Tera MTA) was invented. As I said, when you have pointers with some regularity (hence my reference to regular or patterned striding), then you may get some benefit of prefetching when you're not just dealing with linear contiguous access. The "may" is important there - whether you do or don't get it is workload / data dependent. As I also said, more likely than not over time the data structure will lose regularity as pointers get re-wired, and the benefits of prefetching go out the window.
I'm not sure what you "don't buy" - I didn't say definitively that prefetching does or does not always help. Sometimes it does, and when it does, it's highly dependent on the regularity of the way the pointers cause you to stride through memory. I even basically agreed with you -- to quote me, "It's probably safest to not assume your hardware will help you when it comes to tangled pointer jumping code". You appear to want to argue for some reason.
> As I said, when you have pointers with some regularity (hence my reference to regular or patterned striding), then you may get some benefit of prefetching when you're not just dealing with linear contiguous access.
I would need a source to believe this. The only way I can imagine this working is that the L2 prefetcher might incidentally work well with your allocator - your allocator might try to hand out more or less colocated or evenly strided data.
> I'm not sure what you "don't buy"
This statement, primarily:
> Modern CPUs detect patterns of pointer chasing common in linked list use and prefetch to cache just fine
To me, this statement reads as:
1. Prefetchers in CPUs can prefetch data by examining loads that have yet to be executed
2. That they do this well, or at least that's how I read "just fine"
To my knowledge neither of those is true.
> You appear to want to argue for some reason.
I'm not trying to argue at all, I'm trying to learn. What you're saying does not sound right to be but it would be extremely interesting to me if I learned that CPUs are doing something that I'm not aware of.
If you're actually saying that an allocator may end up striding allocations such that the prefetcher can reliably predict where to load next, sure I'll buy that, at least for L2. That's a fine thing to say, it just isn't what I took away from your initial statement, which is what I wanted to drill into because, again, it would be very interesting to me.
Sorry if I'm coming off as argumentative, I was rushing that previous comment, which isn't really fair to put on you. I genuinely just am curious and if this is a case of miscommunication, no worries at all. I'd just be remiss if I didn't try to suss out if I'm wrong.
how do linked lists fit the regularity requirement? the only regularity in this pattern of access is the offset at which the next address to fetch is stored. this seems like a nontrivial amount of logic and state for the prefetcher to keep track of, unlike something like base * scale + offset when e.g. traversing a 2d array.
It is more complex for hardware to keep track of for sure. It's been a feature of high performance cores for more than a decade though. Kernels tend to be based off of linked lists, so it's a massive win typically.
Look, people who are excited about this and just want to learn are asking for proof. You can't just make a claim like "it's been a feature of high performance cores for more than a decade" and provide 0 evidence for the claim, because it's not common knowledge and explicitly contradicts what is commonly taught.
I can time how long it takes to iterate through an array and show that it is faster than iterating through a linked list with the same number of elements.
Now, what optimization would I have to implement to end up with a program that iterates through this linked list faster than the equivalent array?
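For what it's worth, the experiment is easy to run. A sketch (numbers are illustrative; note that a freshly built std::list often gets unrealistically good locality because the allocator hands out consecutive chunks, so real-world lists usually look worse):

    #include <chrono>
    #include <cstdio>
    #include <list>
    #include <numeric>
    #include <vector>

    int main() {
        constexpr int N = 1'000'000;
        std::vector<int> vec(N, 1);
        std::list<int> lst(N, 1);

        auto time_sum = [](const char* label, const auto& container) {
            auto t0 = std::chrono::steady_clock::now();
            long long sum = std::accumulate(container.begin(), container.end(), 0LL);
            auto t1 = std::chrono::steady_clock::now();
            std::chrono::duration<double, std::milli> ms = t1 - t0;
            std::printf("%s: sum=%lld in %.2f ms\n", label, sum, ms.count());
        };

        time_sum("vector", vec);
        time_sum("list  ", lst);
    }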
I'm sure CPUs will try their best to optimize these usage patterns. But I suspect vector / array based approaches will still always outcompete linked lists in actual benchmarks.
That's always been my experience of them.
Linked lists still show up in the kernel, in redis, in game engines and so on because you can't build fast intrusive collections using arrays.
Making linked lists "as fast as we can make them" doesn't mean they're competitive with a vector.
A bunch of people keep saying this with no substantiation - the best we saw was a link to the latest Intel optimization manual that quite specifically refutes this claim. I even quoted it upthread.
It is extremely difficult, maybe impossible, to design a prefetcher that can predict the next cacheline(s) to prefetch while traversing a linked list. I am not aware of a single CPU that can do this consistently. For instance, if you run multichase (a linked-list chaser) on GCP servers, you generally get the expected memory latency (~70-100ns, depending on the platform).
For most use cases a ring buffer either with or without a maximum length will do much better as a queue than a linked list will.
There are some niche use cases where linked lists are good. Lock-free queues are one, and another big set of use cases is where you need hard O(1) insert even with a high constant factor, not amortized O(1).
The place I see linked lists pop up the most is when implementing LRU caches. You need a linked hashmap of sorts, where the hashmap gives O(1) lookup to the linked list node in question, and then you can extract that node and move it to the head of the queue in O(1) as well.
This is a special case of where they also appear: linked hashmap implementations, where you want a hashmap that has a consistent iteration order.
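A minimal sketch of that shape using the standard containers (std::unordered_map pointing at std::list iterators; splice does the O(1) move-to-front). Not tuned, assumes capacity >= 1, and is just here to show the pattern:

    #include <cstddef>
    #include <list>
    #include <optional>
    #include <unordered_map>
    #include <utility>

    template <typename K, typename V>
    class LruCache {
    public:
        explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

        std::optional<V> get(const K& key) {
            auto it = index_.find(key);
            if (it == index_.end()) return std::nullopt;
            // Move the accessed node to the front (most recently used) in O(1);
            // other iterators stay valid.
            order_.splice(order_.begin(), order_, it->second);
            return it->second->second;
        }

        void put(const K& key, V value) {
            if (auto it = index_.find(key); it != index_.end()) {
                it->second->second = std::move(value);
                order_.splice(order_.begin(), order_, it->second);
                return;
            }
            if (order_.size() == capacity_) {          // evict least recently used
                index_.erase(order_.back().first);
                order_.pop_back();
            }
            order_.emplace_front(key, std::move(value));
            index_[key] = order_.begin();
        }

    private:
        std::size_t capacity_;
        std::list<std::pair<K, V>> order_;   // most recently used at the front
        std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> index_;
    };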
I’ve seen them show up in lots of situations through intrusive lists, which I think is the general form of the example you’re giving.
Intrusive lists are used in the Linux kernel, in physics engines (eg Chipmunk2d) and game engines. The way Redis uses them for TTL expiry follows the same pattern.
High-performance SPSC queues are always implemented with an array.
If you need hard O(1) insert you can usually just allocate an array much bigger than you will need. The combination of hard performance requirements and willingness to wait for an allocator all the time is unusual.
>where you need hard O(1) insert even with a high constant factor
You will need a truly concurrent memory allocator for that, too... or if you are just the kernel (one of the reasons the linux kernel can benefit from linked lists)
Overall, allocating new objects/memory is not guaranteed O(1).
I think there's more to cache locality than prefetching. In an array, consecutive elements will usually be adjacent in physical memory (excepting page boundaries), so each cache line load where a cache line is larger than the array element will "overfetch" into the next element. This should mean fewer total cache loads and thus more efficient use of memory bandwidth in situations where it is constrained (such as walking a list).
Then again, if you have control over allocations you can prefetch whole chunks of data to which the list refers making this point moot.
Often lists refer to arrays or sections of memory. The performance loss if any appears in bigger structures where you do want to have explicit control anyway.
There are a lot of factors involved, but given a limited number of cache lines that can be stored in low level cache, I would think there'd be at least some performance penalty for prefetching from non-linear blocks of memory vs the inherent spatial locality in an array/vector, if only because it'd cause more old cache to be evicted.
Even if your linked list is backed by an array and all the elements are right next to each other, computing some cheap function of the elements of a linked list is incredibly slow compared to the alternative because of the unnecessary and very long dependency chain.
A ring-buffer based queue is mostly better than a linked list but it is not necessarily the best. The typical STL deque implementation [1], with all its complexity, is usually faster than a ring buffer.
Even with a perfect prefetcher, a linked list would still be slower to iterate than a vector, as it is inherently serial (unless your CPU does address prediction, and I don't think any CPU does).
> I thought of writing this blog post full of all the things I love about linked lists.
The blog post is short, and I think the author covered about everything.
Yeah, linked lists are sometimes appropriate, especially intrusive linked lists (the author called them "embedded" linked lists). The hate they get is a backlash against CS programs that introduce people to them early and don't emphasize enough the reasons you should usually pick something else. If we have to err on one side or the other, I'd rather it be telling people not to use them.
btw:
> Redis can be wrong, but both Redis and the Linux kernel can’t.
Yes, they can, and I'd rather avoid argument from authority altogether.
Both Redis and Linux are C projects, and in C it's customary to roll your own containers. Without generics, and with a culture of avoiding small dependencies, it's preferred to roll your own simple list than to reach for more advanced reusable data structures.
So I don't think it's a particularly enlightened decision by these projects to use so many lists on their merit, but rather a programming pattern especially common in the C language they use.
Meh, that's true of some C projects, but AFAIK Linux/Redis have good reasons here, and if I were to criticize Linux for something, it wouldn't be unwillingness to write more code.
I still prefer to evaluate each decision on its merits. Linux has made some great decisions. Also some terrible ones—e.g. fsync semantics [1], attitude to fuzzing. Redis probably has too. They can agree on something and be wrong.
If you have a pointer to an element of a doubly linked list, you can always remove it in a few processor cycles. You can also always add or remove from the front or back of the list in a few cycles.
This means that you can do those things under a spin lock or with interrupts masked. You cannot really resize a dynamic array under a tight spinlock, or modify a tree that should be kept balanced.
Another use case is free lists, a list of preallocated structures where you can grab one or put one back quickly under spinlock.
These are some examples of how lists are used in kernel land.
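A sketch of the free-list case in user-space C++ terms (Buffer, SpinLock, and the sizes are all made up for illustration, not any kernel's actual API):

    #include <atomic>
    #include <mutex>

    struct Buffer {
        Buffer* next = nullptr;
        unsigned char data[256];
    };

    class SpinLock {
    public:
        void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) {} }
        void unlock() { flag_.clear(std::memory_order_release); }
    private:
        std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
    };

    // Free list of preallocated buffers: both operations are a couple of
    // pointer writes, so the time spent holding the spinlock is tiny.
    class FreeList {
    public:
        void put(Buffer* b) {
            std::lock_guard<SpinLock> g(lock_);
            b->next = head_;
            head_ = b;
        }
        Buffer* get() {                       // returns nullptr if exhausted
            std::lock_guard<SpinLock> g(lock_);
            Buffer* b = head_;
            if (b) head_ = b->next;
            return b;
        }
    private:
        SpinLock lock_;
        Buffer* head_ = nullptr;
    };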
I'm not saying there are no uses. Concurrent lock-free algorithms are an especially useful case, and a e.g. linked list of on-stack mutex waiters is a very clever approach that I don't think has a better replacement. But I doubt that lists are as common as they are because every place really needs them.
VecDeque can cheaply push/pop at both front and back too. Arrays have swap-remove. You may not need to grow if you preallocated enough capacity. Freelists can be a stack, or you can choose a data structure that doesn't allocate objects individually so you don't need the freelist in the first place.
Another thing that is C-specific is that most objects are treated as unmovable and their address is their permanent identity. Without language features for prevention of dangling pointers, moves are asking for trouble. This favors linked lists and pointery data structures over contiguous containers and inlined data.
The problem with linked lists is that they are rarely the ideal solution to any problem. They are convenient in C, because of C’s inadequacies, but in C++ and Rust they are marginalized because the tools exist to use superior data structures.
You almost always want to unroll your linked lists, so that items are fetched in blocks. But this rarely happens in C because it's premature optimization in C. In C++ or Rust you can write the code once in a generic data structure and enjoy the benefits to the end of time.
That all said, I think we should still teach them in CS because understanding linked lists is the first step to understanding trees.
OS kernels tend to use them simply because they don't have malloc at their disposal and so it's most practical to stick to fixed size memory chunks, right?
Linux kernel's malloc is probably more resilient to the long-term (as in, long running) problems you would get from continually resizing an array with realloc in userspace (ie. massive heap fragmentation and eventually dying in flames)
>Yes, they can, and I'd rather avoid argument from authority altogether.
Proof by authority is the worst. That said, the Linux kernel can have valid reasons here, but what it uses are intrusive lists, with the data itself holding the pointer to the next element.
In my experience, the linked lists that are useful are intrusive linked lists, where the data you care about has an embedded next/previous pointer, as opposed to just storing an arbitrary object inside a linked list.
One example would be if your thread structure has these embedded pointers, you can easily add/remove a thread to a linked list of threads waiting for some resource without any allocation just by pointer manipulation.
This is a very useful pattern when you have small collections of large objects, where the vector/array implementation starts to fall flat on its face. Kernels and high-performance servers use these a lot for state tracking.
If you combine intrusive linking with a pool allocator, you can have decent locality too.
That's absolutely right, but it's not just large objects. Any situation where you follow links rarely compared to your other memory accesses then the cache performance of links becomes negligible and the benefit of simplicity and O(1) operations can shine. Vectors for scrubbing through things without changing their structure, lists and trees the opposite.
Scrubbing through things quickly is empirically more common, hence the standard advice to use vectors by default.
Or how about an object that can't easily be relocated like, you know, a stack for a kernel thread which also contains the task struct and needs to be moved around between different scheduler queues.
But yeah, I understand they took the task struct out of the kernel stack now.
Edit: difficult being an understatement, you'd need to solve the halting problem to be able to rewrite any self-referential pointers to stack objects wherever they may be.
The other benefit of this kind of data structure is the ability for one object to participate in multiple collections. Otherwise you would have to do some kind of hash map structure to point to them.
Isn't that the opposite, though? If you store the next/prev pointers in the data structure itself, you can't use the same object in multiple collections.
If you have a standalone linked list data structure, then you can, as long as you have a good means of tracking the lifetime of the data itself.
FWIW, I recently implemented a safe Rust crate around the Windows variant of intrusive linked lists [1] and spoke about it at EuroRust [2].
I did it only for compatibility and would still prefer vectors whenever I can use them. Nevertheless, I put some effort into coming up with a useful and high-performing implementation. To give an example: my `append` function is O(1) and not a poor O(N) version like the one you rightfully criticized in your tweet.
The most interesting thing about linked lists to me is how trivially it can be made lock-free multithreaded with atomics.
An array/stack can be made multithreaded but it is non-trivial to handle the edge-cases. In particular, the ABA problem is rather difficult. I've seen many solutions to it (ex: a synchronization variable that puts the stack into "push-only" mode and then "pop-only" mode. There's no ABA-problem if all threads are pushing!)
However, pushing/popping from a Linked List stack requires no such synchronization at all. Simply compare-and-swap the head (and on failure, try again). Its about as simple as you can get when it comes to atomic / lock free patterns.
Still, that's easily solved: 64-bit version number + 64-bit pointer (or 32-bit version number + 32-bit pointer) for a 128-bit (or 64-bit) compare-and-swap.
All modern CPUs support 128-bit CAS.
EDIT: The version number is incremented by +1 each time. It is unlikely that you'd overflow 64-bit version number and risk an ABA problem, though 32-bits can be overflowed surprisingly quickly in today's computers.
--------
EDIT2: Note: I'm pretty sure (but not 100% sure) that the Head of a linked-list stack can be "simply" compared-and-swapped to remain lock-free and 100% valid (ie: 64-bit compare and swap over the 64-bit pointer). I did not mean to imply that you can "CAS any arbitrary node of a linked-list". A fully generic linked list is possible and I've seen it with the 128-bit CAS operator, but that wasn't what I was going for originally.
> EDIT2: Note: I'm pretty sure (but not 100% sure) that the Head of a linked-list stack can be "simply" compared-and-swapped to remain lock-free and 100% valid (ie: 64-bit compare and swap over the 64-bit pointer).
No, you cannot. The problem is what you're comparing and swapping into the head during a pop. You want to do the moral equivalent of `current = list; list.compare_exchange(current, current->next)`, but current->next might have changed if someone else popped the original head, pushed something else, and then pushed the original head again.
You need double CAS or LL/SC or a more complicated scheme to make this work.
It might not be immediately obvious that this can happen, but in practice, this scenario happens very frequently, because if you free a node and then malloc a node, you're very likely to get the same address back with most memory allocators.
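For reference, the push half really is just a single-word CAS loop; it's pop that runs into the ABA hazard described above. A sketch in C++ with std::atomic (pop deliberately omitted):

    #include <atomic>

    struct Node {
        int value;
        Node* next;
    };

    class LockFreeStack {
    public:
        void push(Node* n) {
            n->next = head_.load(std::memory_order_relaxed);
            // Retry until the head we read is still the head when we swing it to n.
            // On failure, compare_exchange_weak reloads the current head into n->next.
            while (!head_.compare_exchange_weak(n->next, n,
                                                std::memory_order_release,
                                                std::memory_order_relaxed)) {
            }
        }

        // pop() is intentionally not shown: a naive CAS pop is exposed to the
        // ABA problem and needs a versioned/tagged head, hazard pointers, or a
        // similar scheme, as discussed above.

    private:
        std::atomic<Node*> head_{nullptr};
    };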
In scripting languages it is a truism that anything you would want to do with a linked list is probably better done with an array. And given the overhead of working in the language versus things implemented in C, this is generally true.
However I have a fun use for linked lists where this is NOT true.
Anyone who has played around with algorithms has encountered dynamic programming. So you can, for example, find the count of subsets that sum to a particular number without actually enumerating them all. But what if instead of counting them, you wanted to actually find some?
The answer turns out to be that instead of using dynamic programming to get a count, you use dynamic programming to build up a data structure from which you can get the answer. And the right data structure to use turns out to be...a type of linked list! There is no faster array equivalent for what you get.
In other words, with a few extra fields for clarity, here is the basic structure for subset sum of positive integers:
{
current_sum: ...,
count_of_solutions: ...,
current_value: ...,
solutions_using_current_value: (link in one dim of linked list),
solutions_using_previous_values: (link on other dim of linked list),
}
I leave figuring out the dynamic programming code to generate this as a fun exercise. Likewise how to extract, say, the 500'th subset with this sum. Both are easy if you understand linked lists. If not...well...consider it education!
Suspicion: this is because a positive integer is a linked list. Every integer is defined by a unary root element and then a series of successors until its cardinality matches the corresponding symbol (counting is correspondence).
> find the count of subsets that sum to a particular number without actually enumerating them all.
Would the generating formula for partition numbers[0] work, or am I misunderstanding this problem?
(I think actually generating the subsets is necessarily a dynamic programming problem...)
Given a particular set, the question is enumerating the subsets that add to a particular thing. The fact that numbers themselves could be represented as a linked list has nothing to do with it. And there is no particularly useful generating function to use either.
For example there are 303 primes less than 2000, and 47,839,398,752,301 subsets of them add up to 2000. If you arrange them in lexicographic order, what is the 5 trillionth one?
>When people asking my opinion for Rust, I loved to share them the Linkedin List implementation link:
This LinkedList obsession is a bit bizarre to me, and tends to come from older programmers who come from a time when coding interviews involved writing linked lists and balancing b-trees. To me, though, it also represents the stubbornness of C programmers who refuse to consider things like growable vectors a solved problem. My reaction to the LinkedList coders is not "well Rust needs to maintain ownership"; it's "why does your benchmark for how easy a language is involve how easy it is to fuck around with raw pointers?"
LinkedLists are a tool, but to C programmers they are an invaluable fundamental building block that shows up early in any C programmer's education, due to how simple they are to implement and the wide range of use cases they can serve. But they are technically an unsafe data structure, and if you're willing to let some of that stubbornness go and finally accept some guard rails, you have to be able to see that a data structure like a linked list will be harder to implement.
It has nothing to do with the language; implementing LinkedLists with any sort of guardrails adds a ton of complexity, either up front (e.g. the borrow checker) or behind the scenes (e.g. a garbage collector). When you accept this fact, it becomes ludicrous to imply that a LinkedList implementation is a good benchmark for the ergonomics of a language like Rust.
In the Rust ecosystem philosophy, linked lists are the kind of thing you want to write once, very carefully, then re-use a lot. That's exactly the kind of code where unsafe can be reasonable. It's not like it would actually be safer in any other language that didn't have an explicit unsafe keyword. More aesthetically pleasing, perhaps, but not safer.
You're acting like it's some epic burn on Rust that there's some ugly code in its stdlib. It's not. Stdlibs are like that. Furthermore, as others have pointed out, linked lists in particular are a pathological case for Rust, meaning you're pointing and snickering at a special case of a special case. Most people never have to write or look at that kind of code.
Functional programming languages like Haskell, ML, and Lisps use linked lists very extensively and rarely have safety problems as a result. On the contrary, because linked lists allow you to build a list up efficiently without mutation, they give you a good way to avoid all the safety problems that come from mutation. In general just about any kind of data structure in these languages is some sort of weird linked list.
This isn't the best solution for all cases! Sometimes performance is more important than safety, and there are things you can't do as efficiently without mutable data. And sometimes (especially for systems that are reactive rather than transformational, in Harel and Pnueli's terms, or for libraries) a garbage collector is an unacceptable cost.
And Rust is actually reasonably good at singly-linked lists; it's just doubly-linked lists (which inherently feature mutation) that are troublesome. Singly-linked lists with sharing through Arc are a little more expensive in Rust than in something like GHC, Ikarus Scheme, or OCaml, but that's hardly a damning criticism of Rust.
But you seem to be claiming that linked lists are unsafe in any language, and to anyone familiar with functional programming, that claim is obvious nonsense. Did you intend it?
I should have been more specific about doubly linked lists. And I think the implicit context for comparisons with Rust is usually pointer-heavy non-GC languages, especially given OP, but maybe I should have specified that, too. It's certainly true that DLL's are tricky to write, and tricky to mechanically prove correct, which is mainly what 'unsafe' actually means.
I would think that the implicit context for safety comparisons with Rust would be other languages designed with safety as a goal. Those languages are almost all GCed! ML, Haskell, Scheme, Oberon, Coq, Erlang — all GCed. That's because GC helps enormously with type- and memory-safety. In the example in question, it ensures that your doubly-linked list manipulation cannot cause undetected type errors, use-after-free errors, double-free errors, or memory leaks; this last guarantee goes beyond what Rust can offer in this case. In ML, Haskell, and Oberon, it even ensures that all type errors are detected at compile time instead of run time.
There are a few languages designed with safety as a goal that do not normally use GC: Pascal, Ada, COBOL, and MISRA C. But they suffer enormously in expressivity, are not pointer-heavy, and are generally only viable alternatives in restricted domains.
The usual pointer-heavy non-GC languages are C and C++, and these have no concern whatsoever for safety, so it seems like a weak-man argument to target linked-list manipulation in them as unsafe relative to Rust.
The correctness criteria for doubly-linked lists are indeed not within the capability of Rust's (or Haskell's, or ML's, or Oberon's) static checking. That doesn't imply that it's impossible to statically verify doubly-linked-list manipulation code, and in fact I think Ada SPARK is capable of doing it. Certainly it's within the scope of what you can prove with Coq or Isabelle/HOL.
I did say tricky, not impossible. ;) IIRC proving pointer-pushing code is an active research area, but I could easily be out of date. Not cheap enough to be mainstream yet, anyway.
We're deep into semantics territory with "what's the right context for comparing Rust", but I would strongly argue that yet another language safe via GC is not interesting for its safety. As you point out, it's been done a lot. Rust is interesting because it pushes safety into a new level of runtime performance and (to some extent) lower hardware abstraction. AIUI that's the motivation for Rust, where most of the marketing and enthusiasm is, etc. And if you don't need that speed/control, Rust isn't much of an improvement: just use a GC and have more fun (unless you just think Rust is neat, I guess, but whatever).
So it's the languages already in that "low level" domain (broadly defined) that are its real competition IMO, C/C++, upstarts like Zig/Nim/Crystal, etc, languages where linked lists are on the cards. Especially for the problems where safety has previously been seen as too costly in either runtime or development time (no one is coding to fast moving web standards in Ada/SPARK, methinks).
And I don't think I implied that those languages' list manipulation is less safe than Rust's, rather that Rust's isn't any worse, besides being uglier. I made almost the same point from the other direction in another sub-thread: there's no use denying that Rust code for DLL's is uglier for no safety benefit on that particular task.
As others have pointed out, the whole question of linked lists in Rust is a huge distraction from its actual benefits and weaknesses.
Well, Rust can offer some kinds of guarantees that GCed languages can't, in the cases where its type system applies. It can allow you to safely share mutable data between threads, for example, or to guarantee that opened files get closed in a timely way. So I don't think it's interesting only because it pushes safety into a new level of runtime performance, though often that is also true.
I think it's noteworthy that, not only in the doubly-linked-list case but even in cases that use Arc (which doesn't require `unsafe`), Rust is less safe than mainstream GCed languages like Java (and arguably Haskell and OCaml). Rust can leak memory; those other languages can't. Its safety guarantees in those cases are a proper subset of theirs.
Unless I'm misunderstanding the situation? I've only written very small amounts of Rust.
Fair point about other kinds of guarantees, I suppose. Now that you mention it, there has been some speculation about what a GC'd Rust would offer (worth remembering that IIRC early versions of Rust had a GC, until they figured out they could do without it). However I don't think that significantly changes my main point about where Rust primarily competes today, and its primary value proposition.
Re safety: Rust has, um, a very specific idea of what "safety" means. It's basically things that could cause RCE, or at least incorrect writes I guess. Memory leaks are specifically not considered "unsafe". :) And really, the worst you can do with a memory leak is DoS, until the target reboots. Handling mutable data across threads is included, though, which Java at least is somewhat famously not great at; it's really easy to write Baby's First Race Condition Demo in Java. So it feels really weird to say Rust's safety guarantees are a strict subset of Java's.
OTOH I agree it's hard to imagine Rust beating out Haskell for raw safety guarantees by any measure, and I have no idea for OCaml (despite having written a lot more Ocaml, heh).
Whatever. They're definitely pointing and snickering. I don't think there's any way to read their post but as a criticism of Rust the language based on this one piece of stdlib code, and that's nonsense. That irritates me. So sue me.
There are reasonable arguments to be had about Rust. I wish people would do those instead of all the nonsense you usually see.
on the contrary, I think more and more people are unwilling to hear any criticism, even constructive criticism, of "their thing".
Rust is not perfect. I wont go into its faults, but you know what they are. I think overall, Rust is a great language, but you have to be realistic and understand that some stuff you ignore in favor of the holistic view, others might not be able to. So sometimes maybe just take the criticism and deal with it, or if you're in the position, do something to fix the problem.
> I wish people would do those instead of all the nonsense you usually see
you seem to want to place the blame squarely on the critics, when the reality is that many people in your position are seemingly intolerant of any criticism of "their thing", even constructive criticism.
Ah, I see, since I'm defending Rust in this sub-thread, it's impossible for me to ask for constructive criticism instead of bullshit without it being "seemingly intolerant", while you're the reasonable one for doing the exact same thing. Basically just assuming I'm an idiot and/or arguing in bad faith. Lovely.
They linked to the source code. What bar do they have to reach before you honor it with being constructive? The point is, there is no bar. Unless it's praise of Rust, then it's trolling in your eyes.
The source code of the standard library is usually an awful way to show off any language. Pointing and gawking at such a thing as criticism is very bad criticism.
Yeah, I already made that point in my root comment. It was pretty much my whole thesis, in fact. Doesn't matter, apparently! svnpenn is determined to confuse me for someone dumber. :)
Considering that unsafe Rust basically just allows raw pointer access (see link below), which is already a thing you can do normally in C, I do not see how that's a very good argument against it, honestly.
As for the Option type, it is exactly what it says. It's an optional, non-nullable node generic over type T. I suppose generics, optionals and the idea of non-nullable types might be exotic at first, but this is not a problem with Rust as much as it's a problem with C not being able to express these concepts in the first place and instead expecting you to spray null checks all over the place!
Not to be overly contrary here, but you "need to spray null checks" probably just as much in Rust as in C (in this specific example, not in general). Those prev/next pointers are being modified in unsafe code where either you're using NonNull::new(), which contains the null check, or you're using NonNull::new_unchecked(), which means you need to convince yourself all the invariants hold true. The situation is roughly equal in C (i.e., once you prove in a module that fields cannot be NULL, there's no need to add lots of redundant extra NULL checks in all users of the module).
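To make that concrete, here's a tiny sketch (the `demo` function is made up, not from the code being discussed) of the two ways an unsafe block typically turns a raw pointer into a `NonNull`: `NonNull::new` keeps the null check, `NonNull::new_unchecked` shifts the burden of proof onto you.

```rust
use std::ptr::NonNull;

// Illustrative only: `demo` is a made-up function, not from any library.
fn demo(raw: *mut u32) -> Option<NonNull<u32>> {
    // Checked path: the null test lives inside NonNull::new.
    let checked = NonNull::new(raw);

    // Unchecked path: undefined behavior if `raw` is null, so it's only
    // valid once you've convinced yourself the invariant already holds.
    if !raw.is_null() {
        let _unchecked: NonNull<u32> = unsafe { NonNull::new_unchecked(raw) };
    }

    checked
}

fn main() {
    let mut x = 5u32;
    assert!(demo(&mut x).is_some());
    assert!(demo(std::ptr::null_mut()).is_none());
}
```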
When explaining why the difficulty of modeling linked lists in Rust doesn't matter, I like to share the following book, "Learn Rust With Entirely Too Many Linked Lists", written by the woman who wrote the implementation you've linked above: https://rust-unofficial.github.io/too-many-lists/
So I glanced at it, expecting something much much worse, but I was surprised to see it wasn't that bad. Even if you look at the final code for the final implementation, it doesn't seem much more complex than what you'd have to write in C. And a good chunk of the code is implementing traits from the stdlib, stuff that isn't strictly necessary, but makes it easier to make the API of the deque idiomatic and work well with other stdlib (and non-stdlib) APIs.
And regardless, this book is teaching from the perspective of intentional super-hard-mode. Most Rust developers will never have to write that much unsafe code.
Oof. Yeah, standard library code tends to be harder to read on average than application code. But at least I can mentally parse the Rust code. Heck, it even looks like a linked list impl. The C++ code looks barely better than line noise. I'm sure I could wrap my head around it, but ho boy, it's far worse than Rust by a long shot.
And this is with several years of C++ experience, and maybe a few months of Rust on and off.
Rust's list only supports a default allocation mechanism. For an actual comparison, you'll want to show us a Rust implementation which is parametrized over allocators.
Having said that - yes, it's not pretty. About the same length though.
I don't know Rust but I would guess that an Option<NonNull<Node<T>>> would be a Node of type T, that cannot be set to null, but can be optional. This is a type I would want to return from a function that could either return nothing, or a pointer to a real thing.
As for the unsafety, I would assume the authors of collections know what they are doing and do that for performance.
`NonNull` is actually a wrapper around the pointer type, with the added invariant that the pointer cannot be null. The compiler is made aware of this, which allows `Option<NonNull<T>>` to be just a plain pointer, where the `None` variant of the option corresponds to a null pointer and the `Some(ptr)` case corresponds to a non-null pointer.
It's an interesting suggestion to impose Option semantics on raw pointers by default (presumably with some alternative way to receive a nullable pointer, for completeness). But in the meantime, in your own codebase, it's easy enough to do `type Ptr<T> = Option<NonNull<T>>`.
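For instance, a minimal sketch (`Ptr` and `Node` are made-up names): the alias keeps Option semantics in the source, while the null-pointer optimization keeps the representation down to a single machine pointer.

```rust
use std::mem::size_of;
use std::ptr::NonNull;

// Made-up names for illustration: a project-local alias for "nullable pointer".
type Ptr<T> = Option<NonNull<T>>;

struct Node<T> {
    value: T,
    next: Ptr<Node<T>>, // None plays the role of the null `next` pointer
}

fn main() {
    // Thanks to the null-pointer optimization, the Option costs nothing:
    // Option<NonNull<T>> has the same size as a plain raw pointer.
    assert_eq!(size_of::<Ptr<Node<u32>>>(), size_of::<*mut Node<u32>>());
}
```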
The problem there is that NonNull is exclusively a wrapper around *mut T. Using it discards the distinction with *const _ pointers. I think that has implications for correctness with Miri.
I think if you tried to do that, you'd basically be mixing type-driven optimizations with the C FFI, which sounds sketchy, to me at least. The null-optimization for Option is just that, an optimization, and I don't like taking it for granted where safety/security is at stake.
> Rust guarantees to optimize the following types T such that Option<T> has the same size as T:
> - Box<U>
> - &U
> - &mut U
> - fn, extern "C" fn
> - num::NonZero*
> - ptr::NonNull<U>
> - #[repr(transparent)] struct around one of the types in this list.
> This is called the “null pointer optimization” or NPO.
> It is further guaranteed that, for the cases above, one can mem::transmute from all valid values of T to Option<T> and from Some::<T>(_) to T (but transmuting None::<T> to T is undefined behaviour).
Although I'm not sure if the ABI is guaranteed to match (which could differ even if the layout matches, AFAIK). The ABI for Box has been guaranteed to match since https://blog.rust-lang.org/2020/01/30/Rust-1.41.0.html , and I would imagine NonNull is the same. Maybe you could open an issue in the UCG if you need confirmation.
I suppose one problem with this approach for C FFI is that there's a lot of different values which could all be "null pointers". Converting them all for Option would be awkward and slow, and you wouldn't want to ever risk getting this stuff wrong.
But pointers are also useful even if you aren't doing FFI, e.g. for implementing custom data structures. In that case, Option<NonNull<T>> (or even NonNull<T>) is usually better than *mut T. But it's harder to type, and it doesn't clearly tell you whether the pointer should be *mut T or *const T. NonNull<T> should be preferred because security/safety is at stake for this sort of code.
Also the implementation is all wrong. The proper way to add an element to a linkedin list is to send the element an email titled "I just requested to connect", along with a "view profile" and "accept" link.
It's definitely not a data structure that makes the borrow checker happy. Luckily it's also not a data structure that's required or desirable in 99% of cases, but the standard library still offers an implementation for those 1% so people don't try to write their own (turns out getting a doubly linked list correct isn't quite as trivial as it may seem).
I’m actually a bit surprised it’s in the standard library if it’s so uncommonly needed. Isn’t Rust’s stdlib tiny with a philosophy of letting the community fill in most things?
It's at odds with it, but so are many other data structures found in the standard library. The solution is usually to use unsafe { } inside the data structure to allow for those criss-crossing pointers (while exposing safe, checker-friendly interfaces to the outside world)
From what I remember Rust has some kind of RefCell with weak references. Those could be used to make DLLs, if anyone can find a reason to use those. That said, you could also use zippers...
AIUI, GhostCell and related patterns can be used to implement doubly linked lists in safe Rust. These patterns are quite obscure and hard to use in practice, though.
That's because in order to interact with pointers you need unsafe, and not using unsafe in Rust requires a single ownership tree or reference counting. If you're not keen on either of those solutions, you shouldn't be writing a Doubly Linked List in other languages either (because they will have either pointers or reference counting/GC).
"Those solutions" start off being hierarchical ownership or reference counting, but turn into "pointers or reference counting". Requiring pointers is implicit to the definition of linked lists, and calling it out here is silly. And you don't really need either ref counting (again, pretty obviously, witness every real world implementation) or hierarchical ownership outside Rust.
Conceptually, the list as a vague concept owns all the nodes. The ownership just stops being expressible directly through references the way Rust wants it to be. But you don't have to re-invent hierarchical ownership to implement DLLs in other languages, because it's not really there in the problem. You just have to accept that they're always going to suck under Rust's ownership model.
Technically the pointers don’t have to be actual pointers. And if they’re not actually pointers, then Rust’s ownership model and borrow checker will be perfectly content; they’ll barely trouble you at all. And you won’t even need any unsafe blocks.
What you do instead is use integers instead of pointers. The integers are indexes into a Vec of list nodes, owned by the linked list. Since the nodes are now owned only by the Vec, which is in turn owned only by the linked list, the borrow checker will not complain.
Some people object that this is cheating somehow, but what is memory but a giant untyped global vector, shared between all parts of your application? Pointers into that giant shared array are just indexes with extra risk, since you have to trust that they point to real nodes that have been initialized. Plus, you often see users of a linked list put the nodes into an arena allocator anyway, especially in the kernel. The Vec in your Rust implementation serves the same purpose as the arena allocator.
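For what it's worth, here's a minimal sketch of that index-based approach (all names are made up, not from any library): the nodes live in a Vec owned by the list, the links are plain indices, and no `unsafe` is needed.

```rust
// Illustrative sketch: an index-based singly linked list, no unsafe needed.
struct Node<T> {
    value: T,
    next: Option<usize>, // index into `nodes`; None marks the end of the list
}

struct IndexList<T> {
    nodes: Vec<Node<T>>,
    head: Option<usize>,
}

impl<T> IndexList<T> {
    fn new() -> Self {
        IndexList { nodes: Vec::new(), head: None }
    }

    // O(1) push at the front. The Vec may reallocate, but indices stay valid
    // because they are positions in the Vec, not memory addresses.
    fn push_front(&mut self, value: T) {
        let idx = self.nodes.len();
        self.nodes.push(Node { value, next: self.head });
        self.head = Some(idx);
    }

    // Walk the list by following indices instead of pointers.
    fn iter<'a>(&'a self) -> impl Iterator<Item = &'a T> + 'a {
        let mut cur = self.head;
        std::iter::from_fn(move || {
            let i = cur?;
            cur = self.nodes[i].next;
            Some(&self.nodes[i].value)
        })
    }
}

fn main() {
    let mut list = IndexList::new();
    for v in [3, 2, 1] {
        list.push_front(v);
    }
    assert_eq!(list.iter().copied().collect::<Vec<_>>(), vec![1, 2, 3]);
}
```

One trade-off worth noting: a stale index is still a logic bug, it just can't be memory-unsafe.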
> Requiring pointers is implicit to the definition of linked lists, and calling it out here is silly.
The usual complaints are that "Safe Rust doesn't let you do this", when unsafe Rust is right there to let you operate on any soup-of-pointers data structure that you'd want to implement.
> you don't really need either ref counting (again, pretty obviously, witness every real world implementation)
If you implement a DLL in a memory-managed language, you effectively have the same behavior as you would in Rust with Arc or Rc. The ownership of the nodes then belongs to the GC or the RC value. You don't have to think about it because you're abstracted from the lifetime of each node by the language and runtime.
> or hierarchical ownership outside Rust
The ownership/borrow checker requires hierarchical ownership to operate, but it doesn't stop you from writing unsafe Rust.
> Conceptually, the list as a vague concept owns all the nodes. The ownership just stops being expressible directly through references the way Rust wants it to be. But you don't have to re-invent hierarchical ownership to implement DLLs in other languages, because it's not really there in the problem. You just have to accept that they're always going to suck under Rust's ownership model.
Yeah, and that's what unsafe is for. Most code doesn't need vague ownership, though. I'd go as far as saying that the nudge towards clear ownership helps both maintainability and performance.
My concern is with the prevalence of this idea that unsafe Rust is not "real Rust" or that the existence and use of unsafe Rust somehow precludes any benefits of Rust.
Another cool property of linked lists is that you can add an element to the head of a list without mutating the previous list. There are not many data structures that you can add an element to while leaving the previous version unchanged. This form of immutability is a very nice feature that Lisp / functional programmers enjoy. Because of this, you can concurrently add an element to the head of a list to create new lists without having a critical section in your code.
Lisp of course was originally invented for manipulating linked lists and it came about in an era where cache locality wasn't an issue because computers in 1959 were speed limited by their CPUs rather than their memories.
Cdr coding [0] solves several problems with singly-linked (cons-style) lists including cache locality and the 2x storage overhead of these lists. But it's not typically used except in dedicated hardware e.g. Lisp machines. It could in principle be revived fairly easily on modern 64-bit Lisp systems on stock hardware -- especially if such systems provided immutable cons cells.
But it's probably not worthwhile because modern Lisp programmers (in Common Lisp at least) don't use lists all that much. CL has very nice adjustable arrays and hash tables that are easier and faster than lists for most purposes.
There was a time when memory was conventionally expensive, memory allocation of large blocks was slow, and allocation of small blocks was much faster (malloc would find a small block faster than a large one on a highly fragmented heap).
Before having gigantic caches that would engulf nearly any sized contiguous list, linked lists were sort of vogue, frugal, and thought of as being fairly performant.
That time is still with us in embedded hardware. Even potent ARM chips (except the cellphone ones, maybe) have some KB of cache at best, and similar amounts of SRAM. Code size is also still at a premium, since XIP (execute-in-place from built-in flash) is slower.
FWIW I usefully use a "fake" linked list by adding a .next field to the struct in a compiler-allocated array of structs. On initialization I set the list head to &item[0], and in a trivial loop set the .next field of each entry (except the last) to entry+1.
Why bother? Because I can then easily add a few extra structs to the beginning of the (contiguously-allocated) linked-list without having to reallocate the whole thing.
Sure, pointer chasing with separately allocated structs is "slow", but I haven't yet measured to see if it's any different when (almost all) items are contiguous.
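For what it's worth, a rough Rust sketch of that pattern (the original was C; the struct and variable names here are mine, so treat it purely as an illustration): nodes sit in one contiguous allocation and each `.next` simply points at its neighbor, so the list can later be extended at the head without reallocating or copying the existing block.

```rust
struct Item {
    value: u32,
    next: *mut Item, // null marks the end of the list
}

fn main() {
    const N: usize = 8;
    let mut items: Vec<Item> = (0..N as u32)
        .map(|v| Item { value: v, next: std::ptr::null_mut() })
        .collect();

    // Thread the .next pointers through the contiguous block: entry i -> entry i + 1.
    let base: *mut Item = items.as_mut_ptr();
    for i in 0..N - 1 {
        unsafe { (*base.add(i)).next = base.add(i + 1) };
    }
    let head: *mut Item = base;

    // Walking the list produces a constant-stride access pattern, so the
    // hardware prefetcher sees essentially the same thing as an array scan.
    let mut sum = 0u32;
    let mut cur = head;
    while !cur.is_null() {
        unsafe {
            sum += (*cur).value;
            cur = (*cur).next;
        }
    }
    assert_eq!(sum, (0..N as u32).sum::<u32>());
}
```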
If you would...
- what sort of cache behavior should one expect of this on a modern laptop CPU?
- I haven't seen this approach before, have you?
It depends on what prefetchers your CPU has and the actual underlying access pattern your accesses cause. If chasing next leads to a constant-stride run through the array, you'll get identical performance to an iterative walk, since essentially every high-performance CPU since around 2010 supports stride prediction based on raw addresses. If .next sends you bouncing all over the area, you'll get complicated behavior, since whether or not the data can be prefetched depends on the performance/existence of a pointer prefetcher, which is less common and more error-prone. We know Apple's M1 has it due to some researchers using it as a side channel [1], but you'll have to do some digging on whether or not your laptop has one. Would make a nice post here if you do make the benchmarks :)
Linked lists are one of those things, like textual protocols, that people reach for because they're easy and fun but, from an engineering standpoint, you should never, ever use unless you can provide a STRONG justification for doing so. The locality of reference characteristics of vectors mean that, except for extreme cases, they beat linked lists almost every time on modern cached CPUs.
How do you detect a cycle with linked lists? :) I actually have a side project where I've run into this - it's a distributed directed graph structure where the nodes can send messages to each other. It's intended to be acyclic, but "walking the graph" is really only used to send truth values, which are combined with others to run Boolean logic. So in that case, the occasional cycle doesn't matter, because if the calculated boolean value matches that of the node's internal state, it just can cease propagating. So a structural loop won't turn into an infinite processing loop.
The problem then becomes that I can't introduce NOT gates anywhere in a cycle, because then the bit will continue flipping and I'll get an infinite processing loop.
So it seems my only hope is external processes that continually walk the graph and keep track of where it visited to try and detect loops, and I don't like how that scales...
Just in case performance matters, there is a more efficient way: have the tortoise stay in place and advance the hare only one node at a time, and assign the hare to the tortoise every time a power of 2 number of steps have been made. This is known as Brent's algorithm, and it requires fewer advancements than the original tortoise and hare algorithm by Floyd.
Another notable advantage of Brent's algorithm is that it automatically finds the cycle length, rather than (in Floyd's case) any multiple of the cycle length.
Cycle detection in (currently) 44 languages. Most (all?) use Brent's algorithm. They're operating on an iterated function but converting most of these to detect cycles in a linked list would be straightforward.
It's not difficult. Keep a pointer to a recently-visited node. After visiting N nodes, update the pointer to the current node, but also double N. If the pointer ever matches the next node, the list has a cycle. The doubling of N ensures that even large cycles will eventually be detected.
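For concreteness, here's a hedged Rust sketch of that doubling scheme (the function name `has_cycle` and the closure-based `next` step are my own framing, not from the comments above); it works for any "follow the next link" step that may terminate.

```rust
// Illustrative sketch of Brent-style cycle detection: the hare advances one
// step at a time; whenever the step count reaches the current window length,
// the tortoise is teleported to the hare's position and the window doubles.
fn has_cycle<T, F>(start: T, next: F) -> bool
where
    T: PartialEq + Clone,
    F: Fn(&T) -> Option<T>, // None = the list ends, so there is no cycle
{
    let mut power = 1usize; // current window length (doubles each epoch)
    let mut steps = 0usize; // hare steps taken inside the current window
    let mut tortoise = start.clone();
    let mut hare = match next(&start) {
        Some(n) => n,
        None => return false,
    };
    while tortoise != hare {
        if steps == power {
            tortoise = hare.clone(); // teleport the tortoise forward
            power *= 2;
            steps = 0;
        }
        hare = match next(&hare) {
            Some(n) => n,
            None => return false,
        };
        steps += 1;
    }
    true
}

fn main() {
    // A list represented as indices: node 3 links back to node 1, forming a cycle.
    let cyclic: Vec<Option<usize>> = vec![Some(1), Some(2), Some(3), Some(1)];
    // A straight list with no cycle.
    let straight: Vec<Option<usize>> = vec![Some(1), Some(2), None];

    assert!(has_cycle(0usize, |&i| cyclic[i]));
    assert!(!has_cycle(0usize, |&i| straight[i]));
}
```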
I would say the easiest way is to walk the graph and tag the nodes as visited. If you don't visit every node each iteration, use an increasing key you store in each node.
Donald Knuth's Dancing Links paper is a beautiful exploration of the power of linked lists. He uses a table with doubly linked lists (for rows and columns) to solve polyomino tiling. The linked lists allow for an elegant unpicking of rows (and re-attaching when backtracking).
The paper's a really enjoyable read, Knuth's tone throughout is "look at this, this is fun"
I think I first learned about linked lists in the context of programming for AmigaOS, which has a standardized form of (intrusive) linked lists that is used throughout the system. I remember being fascinated by the header node which serves as both the head and the tail of a list due to a “clever” field layout: https://wiki.amigaos.net/wiki/Exec_Lists_and_Queues
One of the only times linked lists are "undoubtedly the right option" is if you're writing some sort of intrusive data structure to avoid an allocation by reusing caller stack from another location. Not sure how many run into that often.
When I need to avoid an allocation it's because I'm writing an allocator :) Overlaying the list nodes on top of a block of free memory avoids the need for a secondary data structure to keep track of free areas.
The other time allocations are undesirable, and therefore linked lists are widely-used, is bare metal work when other allocators just aren't available. You can do tricks like representing pools of free and in-use objects with two lists where the nodes occupy the same memory block. Allocations are a fast O(1) unlink from the free list and O(1) re-link to the used list, and frees are the opposite.
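A rough sketch of that idea (the `Pool`/`Slot` names and the use of indices instead of raw pointers are my own simplification, and only the free-list half is shown): allocation pops the free-list head and freeing pushes the slot back, both O(1), with no general-purpose allocator involved once the pool exists.

```rust
// Illustrative sketch: a fixed-capacity pool whose free slots are chained
// into an intrusive singly linked free list. A bare-metal version would
// typically also thread in-use objects onto a second list in the same memory.
struct Slot<T> {
    value: Option<T>,
    next_free: Option<usize>,
}

struct Pool<T> {
    slots: Vec<Slot<T>>,
    free_head: Option<usize>,
}

impl<T> Pool<T> {
    fn with_capacity(n: usize) -> Self {
        // Chain every slot onto the free list up front.
        let slots = (0..n)
            .map(|i| Slot { value: None, next_free: if i + 1 < n { Some(i + 1) } else { None } })
            .collect();
        Pool { slots, free_head: if n > 0 { Some(0) } else { None } }
    }

    // O(1): pop the head of the free list.
    fn alloc(&mut self, value: T) -> Option<usize> {
        let idx = self.free_head?;
        self.free_head = self.slots[idx].next_free;
        self.slots[idx] = Slot { value: Some(value), next_free: None };
        Some(idx)
    }

    // O(1): push the slot back onto the free list.
    fn free(&mut self, idx: usize) -> Option<T> {
        let value = self.slots[idx].value.take();
        self.slots[idx].next_free = self.free_head;
        self.free_head = Some(idx);
        value
    }
}

fn main() {
    let mut pool: Pool<&str> = Pool::with_capacity(2);
    let a = pool.alloc("first").unwrap();
    let _b = pool.alloc("second").unwrap();
    assert!(pool.alloc("third").is_none()); // pool exhausted
    assert_eq!(pool.free(a), Some("first"));
    assert!(pool.alloc("fourth").is_some()); // the freed slot is reused
}
```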
Allocating a huge chunk of memory all at once when the array grows can also cause a bunch of problems. Linked lists are also much simpler in a multi threaded context.
I’m not an expert in memory allocators but I suspect large allocation is better than many small allocs due to less fragmentation no? I’ll grant you the multithreaded case but if you’re doing that your cache performance is probably crap anyway =)
I just used linked lists to make an LRU hard limit on the number of items in an in-memory store. On each use, the pointers are updated to move the item to the head. Each time the list is at the limit, the tail is removed and freed. I may take advantage of the machinery to make it go onto a free list. This was Go, and the DLL thing is my first use of generics.
Linked lists don't need defense. It is unfortunate that the job of the vast majority of developers is rather benign. They don't work on problems that require complex algorithms and data structures beyond arrays and dictionaries. But interviewers keep asking those questions and give these subjects a bad rep. However, if you are working on building any real infrastructure such as search, deep learning, cloud, crypto, game engines, OSes, compilers, etc., then you need linked lists, graphs, trees, and the myriad of algorithms surrounding them all the time. There is clearly a different tier of developers who work on such infrastructure versus those who simply utilize the infrastructure others have built.
Honestly, a lot of programming data structure concepts were difficult to grasp /until/ I truly understood how a linked list works. It's a very simple concept in hindsight, but as a newbie programmer this was an alien concept. Linked list? Just give me my array or built-in List! Then I understood how to build a linked list and how it can be extended in various ways. They can be used to solve interesting problems. Many high-level languages provide pre-baked implementations of linked lists, but most programmers don't understand how they work under the surface. When I understood linked lists, I was able to apply that understanding to trees.
The only reason to do this is for education/visualization of data structures, although I'm wondering if this can be engineered to cause a Python memory leak, via circularization of the linked list, or if Python's GC would catch it. Also educational perhaps.
Where linked lists really seem to come into play is with custom manual memory allocators in embedded programming, something of a niche subject.
"Since the collector supplements the reference counting already used in Python, you can disable the collector if you are sure your program does not create reference cycles"
I suspect the problem is that many people think of stuff like Java's LinkedList when they think of linked lists. As a standard list implementation, I think it is easy to see that a doubly linked list just isn't that strong. Especially with how cumbersome the List interface makes it to take advantage of the links.
That said, conceptually, linked items are everywhere, such that you can easily list things by following the links. You probably just don't bother keeping them in a "List" structure in your code.
I run into people that use LinkedList everywhere in their code "for efficiency". It always confuses me. Are CS professors teaching something that leads to this?
In practice, it seems unusual for it to have any advantages over just using an ArrayList in typical JVM applications.
I blame an over reliance on Big O as an arbiter of what is a fast algorithm. Linked lists are commonly taught as having much better insertion speed than arrays.
This ignores constant factors and cache coherency, of course. More, it also ignores that most lists that folks will make have a VERY predictable access pattern. Typically just a one way scan, if any traversal at all. With very few inserts, and mostly just appending.
Funny, as I constantly have to tell folks to not bother using it at the office. Everyone always assumes it has to be faster than arraylist for whatever they happen to be doing this time. I think the vast majority of the time it flat doesn't matter, but I also fully expect LinkedList will lose most speed tests.
Right. The main problem is that Java's List interface affords access by int index, making it seem array-like. Unfortunately get(i) on a LinkedList is O(n). This turns apparently innocuous things quadratic, like looping over a list by index.
Nothing to do with the interface - only java.util.RandomAccess guarantees O(1). Iterating using get(int) is a mistake regardless; stick to java.util.Iterator or ListIterator if you need the index.
The main issue w/ LinkedList is its memory footprint (aside from being a linked list, w/ an indirection on each access), along with the increased GC costs (the GC has to iterate the nodes too: cache misses, GC pauses). Even a half-empty ArrayList is more memory-friendly than any LinkedList. ArrayDeque offers adds at both ends, if you wish to use LinkedList as a queue.
Agreed that LinkedList memory footprint and locality are serious issues.
Fundamentally though the problem is that the LinkedList implementation is at odds with the abstraction provided by List -- access by index.
Certainly, straight iteration of every element is better done by a for-each loop (which uses an Iterator under the covers). But the availability of indexed access leads one to use it for a variety of additional circumstances. Consider for example processing every even-numbered element, or finding an element that meets some criterion and then operating on an adjacent element. Iterating over indexes for cases like these is quite natural given the List API. (ListIterator can be used for this sort of stuff, but it's quite cumbersome, and sometimes it doesn't actually help.)
I think the Rust people are de-emphasizing the importance of linked lists due to the glaring hole in the Rust language that makes it difficult to implement linked lists, and horribly awkward to do doubly linked lists in safe Rust.
I've used linked lists where I wanted allocation of an almost-write-only log data structure to be very fast - so not resizing a buffer, and nobody cares about cache locality except for allocation. In a system with a TLA buffer and bump-allocation, this is absolutely ideal because allocation is local and very fast, and does not cause any of the existing log to be brought into cache.
"Nobody uses this data structure stuff in real programming at work! You can forget it after college!"
The first problem with Linked Lists is you should almost never use them. Do they have uses? Of course! However on modern computers memory access is, roughly, more expensive than CPU cycles. The Linked List has been demoted to a “special case” data structure.
The second problem is Linked Lists are taught too early. They’re historically taught first in Data Structures 101. This results in new programmers using LinkedLists first when they should be used last.
I think the whole "dynamic arrays" vs "linked lists" debacle is just a waste of time, since the two actually complement each other.
- You can certainly implement linked lists on top of dynamic arrays, so you make the minimal number of `malloc()` calls and also keep your items stored contiguously! You need to use indices instead of pointers, since you will lose pointer stability unless you use virtual memory (though the upside is that a uint32_t is typically enough for most cases, which takes up only half as much memory as a pointer).
- A lot of data structures use linked lists under the hood, even ones that seem to use dynamic arrays on the surface. Have you ever written a simple arena allocator? For the allocator to find a free space in the arena in O(1), you need to maintain a free-list data structure under the hood. And guess what: you need to write a linked list. Even your operating system's `malloc()` itself probably uses linked lists under the hood.
- Linked lists just come up inside so many other data structures that you could essentially call them the 'backbone'. For example: in computer graphics there is something called a half-edge data structure [0], which stores the geometry data of a mesh in a way that's compact, easy to iterate, and easy to manipulate (for instance, you can insert/delete an edge/face in O(1) time, and more complex mesh operations can be implemented as well). And guess what? The vertices and half-edges are essentially stored in a complex web of linked lists.
> Linked lists are conceptual. A node pointing to itself is the most self centered thing I can imagine in computing: an ideal representation of the more vulgar infinite loop. A node pointing to NULL is a metaphor of loneliness. A linked list with tail and head connected, a powerful symbol of a closed cycle.
Tongue-in-cheek, but this really made me smile. :-)
> You get what data structures made of links truly are: the triviality of a single node that becomes a lot more powerful and complex once it references another one.
I don't think beginners actually make this connection for a while. Linked Lists are introduced analogously to arrays, sets, etc. Beginners think about Linked Lists in terms of what they already know.
As a beginner, I thought of Linked Lists purely as non-contiguous arrays, even though there are deeper concepts behind them.
Unless beginners already have the perspective of "a single part having the same power as the whole", I don't think this connection gets made for a while. Linked Lists don't expose so much possibility on their own.
People must stop giving attention to old garbage solutions that should not be used. Never use a linked list if you don't specifically need a linked list. You probably don't know why you would need one, so don't go there.
If you stop "giving attention" to old solutions, people will simply reinvent them. Linked lists in particular are very easy to reinvent because they're so simple, and because their advantages are all evident even in a high-level language, while disadvantages require a more low-level understanding of modern compute performance.
> (oh, dear Twitter: whatever happens I’ll be there as long as possible – if you care about people that put a lot of energy in creating it, think twice before leaving the platform)
I guess linked lists as they are are very useful for implementing queues (particularly those that feed thread pools), where the cost of a growable array is not needed and cache locality does not matter (continuing with the thread pool example: it's almost a guarantee that having the next element in the cache of a CPU which is not going to execute the next Runnable is a waste).
In Java particularly, both the array-based and the linked implementations of blocking queues should perform equally well. FWIW most queue implementations are linked lists.
The best implementations typically have a queue per thread and work stealing. The first threads to finish their assigned work will grab items from other queues, but until then you get perfect cache locality.
Java's queues and global threadpool queues in general are pretty old hat.
That was my high school programming fun: implement linked lists and other data structures in TurboPascal. Pascal is a language that does not come with dynamic lists, only fixed-size arrays and manual memory allocation. I had to build linked lists just to have a decent list primitive.
A few years later I discovered Perl that comes with dynamic size lists, dictionaries and strings. Whoa, that was a treat, so very nice to use and efficient. To this day I prefer constructs made of lists and dicts 10x more than defining classes.
Linked lists are useful when you are hurting for memory and need precise memory allocation, like in the kernel. You trade memory compactness for more operation time in dealing with certain aspects of them.
And code size, let's not forget about that one.
Plus you do not have a super smart memory allocator to prevent the hundreds of vectors from spraying the heap.
> Linked lists are simple. It is one of those rare data structures, together with binary trees and hash tables and a few more, that you can implement just from memory without likely stepping into big errors.
One of the best engineers in Microsoft’s devdiv told me that he often gave a linked list implementation in C as a whiteboard assignment for interviewees and that no one ever managed a bug-free version. (I failed mine even after creating a full general purpose implementation on my own just a couple years before.)
It's one thing when you have 30 minutes, can compile and test, and end with a working implementation. It's another thing to write the code on a whiteboard, a process nobody ever follows in practice: it's very easy to make mistakes this way. Anyway, doubly linked lists are easy to get wrong without compiling / testing.
Seems to me that all these arguments are rooted in the way computer memory is fundamentally implemented: could it not be re-architected so that it's more agreeable to arrays and lists, since that's how most data is used?
I often wonder about the separation of CPU and RAM, of ways it could be done better. How about making it like neurons, where each "memory cell" is also a teeny tiny computer itself? (How DO neurons do internal computation anyway?)
A nice explanation of the rationale for using linked lists in an OS kernel, from a Fuchsia developer. In short, most typical operations are almost guaranteed not to fail as long as the invariants are not broken, which makes linked lists suitable for cases where there's no fallback option left, like kernel code.
> Linked lists are simple. It is one of those rare data structures, together with binary trees and hash tables and a few more, that you can implement just from memory without likely stepping into big errors.
Well, if one likes linked lists enough, they can replace binary search trees / hash tables with a skip list. Super powerful and easier to implement than splay trees, red black trees, avl trees, augmented trees, what-have-you.
I much prefer deques to linked lists when I need O(1) push/pop but still want to keep some semblance of cache locality. Although you can't splice two deques together as easily as two linked lists.
Sadly, you can't really control the block size in C++'s std::deque so you can't communicate useful information to the library about your use case. I think MSVC allocates such a small block that it is effectively a linked list.
Note that a deque in general is an ADT that isn’t necessarily implemented in the way std::deque commonly is. A deque can be implemented as a linked list.
Correct, deques are abstract data types and you can implement them multiple ways, including using linked lists or growable arrays/vectors. So they aren't directly comparable to linked lists since under the hood they can be implemented with linked lists (or doubly linked lists at least). You could compare the performance of different deque implementations, though.
It depends on the implementation. Python's deque, for example, is implemented as a two-level data structure; a circularly linked list of blocks. If you've got page-sized blocks, then your deque can be sufficiently local for most cases. It introduces a little extra logic, and you'll want to keep around a spare block to prevent thrashing memory if you repeatedly push/pop at the block boundary, but it's often worth the pain if the data structure is anywhere near a hot loop.
It is usually implemented as linked arrays. For example, arrays of 16 or 64 elements, if it grows over that size, new arrays or vectors are used underneath.
Two consecutive elements are probably in the same array, and that helps with cache locality.
Linked lists are cool, but he missed one reason why: how easily they become n-ary trees. A tree expressed like this is quite elegant and useful in a variety of domains, my favorite being a "trie": an n-ary tree that computes the label of each node as the string you get from the previous label plus the string in this node. It's a wild and hairy data structure, but very cool once tamed.
In defense of what? A brigade of null pointer exceptions?
> Linked lists are conceptual. A node pointing to itself is the most self centered thing I can imagine in computing: an ideal representation of the more vulgar infinite loop. A node pointing to NULL is a metaphor of loneliness. A linked list with tail and head connected, a powerful symbol of a closed cycle.
Oh, this article is complete satire. Bravo, you had me
Anyone else getting a serious cert expired warning when visiting this site?
The one I'm getting is also for the wrong name - registered for redis.io and expired on `Fri, 07 Aug 2020 11:30:03 GMT`. Issued by Let's Encrypt. Fingerprint: `07:BF:EA:59:DB:83:33:77:50:27:A8:C5:2F:80:F6:CA:E6:EC:E2:7D:01:DB:72:7A:AF:EE:69:DD:EC:2D:DA:F3`
Linked lists are great to know and understand. Linked lists are not good for performance. In most cases, an array or other data structure is much more efficient than a linked list. But there are cases where linked lists' immutability/segmentation or simplicity to implement make them the better choice.
There are, interestingly, also times when fun facets of the linked lists mutability can lead to benefits. https://taeric.github.io/DancingLinks.html is my attempt at exploring Knuth's DLX algorithm. I can't recommend enough that you look into how he did it. Very fun and exciting use of mutable doubly linked lists.
Linked lists also work great in a sharded distributed data structure. You can make changes to the list structure by simply changing reference at nodes and don’t need to move anything.
Linked lists are great. But they have the problem that, almost always, whatever technical interview you have, someone asks you to whiteboard how to reverse one.
While there's a certain use for linked lists - in operating system kernels, for example - and as a programmer you absolutely have to know how linked lists work, there are other linear data structures which are more suited to general programming.
To give a C# example: arrays, lists, and dictionaries (hash tables) implement iterators, so you can always know what the next element in the collection is. Elements can be accessed by key or index in O(1), and elements can be added in (amortized) O(1). The case in which you absolutely have to insert an element at a particular position is rare, so linked lists are seldom used.
The case is much the same for C++: the STL has both a vector class and a linked list class (std::list), but the vector is almost always the one you reach for. Same for Java.
I think this article could be more intelligently titled: "Don't post, read, or respond to things on TWTR". The linked lists aspect could be replaced with anything.
It's odd that the author states they wrote the article in response to arguments on Twitter, yet they didn't show us the arguments this was a response to.
I'd note that the number of people creating content now is probably two to six orders of magnitude larger than before the internet, when it was just print/radio/TV/etc.
The most that Joe Random could expect to get in print would be a Letter to the Editor, or a random passerby interview. The in-crowd was even more "unusual".