Linked lists are great. But they have the problem that, almost always, whatever problem you're trying to solve would be better done with a regular resizable vector.
This includes problems that they should be great for, like inserting into the middle or the front.
The reason is that, in practice, the way computers actually work means there is an enormous time penalty for jumping around randomly in memory. It's large enough that it's often worth paying an O(lg n) cost to switch to something that will be contiguous in memory and allow the CPU to prefetch data.
There are exceptions when you really should use an actual linked list, but your default should be a vector and not a linked list.
In my first week at my prior job, I had to fix an in-production bug involving quadratic vector growth that was tipping over servers. The vector implementation guaranteed too much, namely a consecutive memory layout (which wasn't needed, and if we had needed it, it would have been a mistake).
Decomposing that vector into recursive subvectors solved the problem. Going from a few tens of millions of elements in a single contiguous vector (with consequent -- and bad -- heap fragmentation!) to nested vectors with a few thousand elements each brought us back online again.
Which is to say: Vectors are nice. Lists are nice. Use appropriate data structures for your scale, watch your semantic dependencies, and don't get hung up on dogma.
The VList paper - which I archived at http://trout.me.uk/lisp/vlist.pdf to avoid having to google the bloody thing every time - is an interesting hybrid.
Regarding sections 4.1 and 8 of the paper: it uses several software implementations of MSB and gives a performance comparison between the VList and the RAOTS structure it was derived from. It doesn't use the hardware instructions that can speed up the MSB calculation and eliminate branches (e.g., clz/lzcnt or bsr). It would be worth taking another look at RAOTS with an MSB routine that leverages the instructions available in modern hardware, and seeing how its performance compares to the VList.
trout.me.uk/lisp/ and trout.me.uk/gc/ are both basically archives of "stuff I knew I'd suffer that problem with" and it's a pleasure every time the contents come in handy for people who aren't me.
Also back when I was still pretending to be a mathematician I used both GWBASIC and UBASIC for assorted purposes so thank -you- for the nostalgia kick.
Still sad that John Shutt (creator of 'kernel lisp') passed away while I was still procrastinating dropping him an email - I think the only other 'thank you' I've most regretted not sending in time was to Terry Pratchett.
Ironically, if you weren't using huge pages, the operating system was already doing this for you anyway, and your contiguous segments weren't contiguous.
This does suggest functionality that the OS should provide: let the program remap memory somewhere else in its address space to obtain a larger contiguous chunk of addresses without copying data. This wouldn't help much if you have direct pointers, but I think e.g. Java has a layer of indirection that would let it benefit from remapping to grow ArrayLists. I don't know if things already work this way. It should make speculative fetches easier for the processor since now it understands what you're doing via the memory map.
Edit: taking this idea a little further is actually fascinating: use the type system to separate objects into pools based on type, and instead of pointers, use indexes into your pool. You can move the pool for free to grow/shrink it. I think you could even do this with pointers and full hardware acceleration today if you used the high bits to tag pointers with their types and used remapping at the hypervisor level to move the pools (so the address in the application level doesn't change). You could do some crazy stuff if the application could coordinate multiple levels of virtual memory.
You can certainly allocate memory at controlled addresses using mmap or VirtualAlloc.
But you should not play virtual memory games at runtime in an application that you expect to be scalable. Unmapping memory or changing its permissions requires propagating the change to all threads in the process. On x86, the performance impact is dramatic. On ARM64, it’s not as bad but it’s still there.
Also, even 64 bits can go fast when you start using high bits, separate pools for different purposes, etc.
Right, the ability to relocate a chunk of memory to a new virtual address (with more capacity at the end) without doing a physical copy would be an interesting way to solve the problem.
I don't know if anyone has done research on integrating data structures and OS functionality in this way. Interesting idea.
We needed to get things running again rather quickly, though :-)
Some ECS implementations use this to reduce the overhead of dynamic arrays, as well as ensuring pointer stability of the items stored. For example entt:
End of page allocation has been a real life-saver. I have it rigged up to the core allocators of the game engine I work with, so you can switch it with a cmdline switch; if you suspect an overwrite, you switch on the EOP allocator ( using more memory temporarily ), and bang, you get a segfault where you went off the end / used-after-free.
I somehow misread your comment the first time through and thought you were suggesting the ability to ask the OS: "Hey, I'm a pointer, is there anything in front of my block of memory, if so can you move it elsewhere so that I can expand?" This would of course be a pretty silly idea but it would be funny to give the OS folks a whole new class of use-after bugs to worry about.
glibc realloc(3) does this on Linux, by way of mremap(2) with MREMAP_MAYMOVE [1]. (Just confirmed on my local machine using strace(1).)
Unfortunately, libstdc++ does not (to my knowledge) use realloc(3) or mremap(2) for vector resize, nor could it in many situations due to the semantics of C++.
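For illustration, here's a rough sketch of doing the same thing by hand with mremap(2); Linux-specific, and the sizes are arbitrary (compile as C++ with g++, which defines _GNU_SOURCE, or add -D_GNU_SOURCE yourself):

    #include <sys/mman.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        size_t old_size = 64 * 1024 * 1024;   // 64 MiB, arbitrary for the example
        size_t new_size = 2 * old_size;

        // A large anonymous mapping, as an allocator might use for a big buffer.
        void *p = mmap(nullptr, old_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        memset(p, 0x42, old_size);

        // Grow the mapping; the kernel may move the virtual address, but the
        // existing pages are remapped rather than copied.
        void *q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
        if (q == MAP_FAILED) { perror("mremap"); return 1; }
        printf("old %p new %p (moved: %s)\n", p, q, p == q ? "no" : "yes");

        munmap(q, new_size);
        return 0;
    }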
That sounds like a lot of work unless you absolutely have to have a contiguous allocation of every element.
If you don't need that, you can get "O(1.1)" access and nice heap behavior with an indirection or two, and the implementation is likely to be lots more portable and standards-friendly than if you also dragged in a bunch of OS mmap() semantics (which tend to have exciting, platform-dependent misfeatures).
Also, I would have to reach for the mmap() man page. I have malloc() right in front of me. :-)
Maybe I'm just not seeing it, but I don't see how mmap lets you remap an existing address to a new one? Looks like there is an mremap on Linux in any case.
If you want to stick to mmap, you can do it by creating a file which doesn’t have any actual disk backing (one way to do this is with POSIX shm_open), and handling all memory allocations by mmapping parts of that file rather than mapping true anonymous memory. Then, ‘remapping’ an address to a new one is a matter of calling mmap again with the same file offset. Unlike mremap, this allows mapping the same memory at different addresses simultaneously, which can be useful for some purposes.
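A minimal sketch of that trick (Linux/POSIX; the shm name and size are arbitrary, and error handling is kept to a minimum):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        const size_t size = 1 << 20;  // 1 MiB

        // A shared-memory object with no disk backing; keep the fd, drop the name.
        int fd = shm_open("/growable_demo", O_CREAT | O_RDWR | O_EXCL, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        shm_unlink("/growable_demo");
        if (ftruncate(fd, size) != 0) { perror("ftruncate"); return 1; }

        // First view of the memory.
        char *a = (char *)mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        strcpy(a, "hello");

        // Second view of the *same* pages at a different address -- no copy.
        char *b = (char *)mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        printf("a=%p b=%p b says: %s\n", (void *)a, (void *)b, b);  // prints "hello"

        munmap(a, size); munmap(b, size); close(fd);
        return 0;
    }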
If your elements number in the millions, then a tree might be better, depending on what you do. If you traverse the whole thing often then a vector is best, but at a million elements a tree will find an element faster despite the cache misses.
I don't know where the line is, but in general you are probably not close to it, so default to a vector. But, as you point out, sometimes you are over that line.
And if you really need it you can even do both tree and vector simultaneously - just make sure both consist of pointers only so you avoid update/sync issues.
You can, but probably shouldn't. The pointers mean a cache miss every time you follow one. Putting everything directly in the vector is the point; otherwise the map is potentially faster.
Right, though we didn't need full tree semantics (just fast random access and an append that didn't spool up the fans on the servers).
I would hate to have to do a full balanced b-tree on short notice. (We also had rather terrible extra requirements that kept an off-the-shelf thing from, say, the STL from working in our environment.)
simple vector implementations will do a full realloc and recopy when you run out of space. so O(1) amortized but O(N) every so often.
another implementation strategy, at the cost of using some extra memory, is to allocate the new space, but only copy over a bit at a time, each time a new element is added -- so every add is still O(1). no long pauses.
(well, assuming the memory alloc is still fast. which i think it should be in practice? tho i'm not super familiar w/ that stuff.)
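a rough sketch of what i mean (ints only, migrating 2 old elements per push, single-threaded -- definitely not production code, just illustrating the idea):

    #include <cstddef>
    #include <cstdlib>

    struct IncVec {
        int *cur = nullptr;     // current (larger) buffer; new pushes go here
        int *old = nullptr;     // previous buffer, still holding not-yet-migrated elements
        size_t size = 0, cap = 0;
        size_t old_count = 0;   // how many elements still conceptually live in `old`
        size_t migrated = 0;    // prefix [0, migrated) has been copied into `cur`

        int at(size_t i) const {
            return (old && i >= migrated && i < old_count) ? old[i] : cur[i];
        }

        void push(int v) {
            if (size == cap) {                      // out of room: allocate, but don't copy yet
                size_t newcap = cap ? cap * 2 : 8;
                // by the time we grow again, the previous migration has finished:
                // we migrate 2 elements per push and had a full `cap` pushes of headroom
                old = cur;
                old_count = size;
                migrated = 0;
                cur = (int *)malloc(newcap * sizeof(int));
                cap = newcap;
            }
            cur[size++] = v;
            // a constant amount of copying per push instead of one big pause
            for (int k = 0; k < 2 && old && migrated < old_count; ++k) {
                cur[migrated] = old[migrated];
                ++migrated;
            }
            if (old && migrated == old_count) { free(old); old = nullptr; }
        }
    };

so every push does O(1) work, at the cost of keeping two buffers alive during migration and an extra check on reads.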
just curious, do you think that solution would have worked for your situation too?
That's exactly the strategy that was tipping over. Once you start resizing vectors of tens of megabytes, it's kind of over.
You can start playing games by accelerating capacity allocation when a vector rapidly starts growing extremely large. This doesn't solve the problem that vectors sometimes guarantee too much for what you need.
True... except in MSVC. std::deque in MSVC is cursed.
If you're targeting MSVC, use boost::deque or something instead. Boost containers are generally fairly good, although they are more or less unmaintained nowadays. (E.g., boost::unordered_map::reserve doesn't work.) Or, like many of us, ignore performance on MS targets, because anybody who cares about performance is running something else.
Many of the STL containers in MSVC are incredibly cursed. Even std::vector is implemented in such a weird way, that you can't debug it easily with any debugger other than Visual Studio. And it's incredibly slow on Debug mode, up to the point that it's unusable... (Because of these reasons I just switched to using my own Vector<T> class recently)
I think there's a very good reason why old gamedevs who were stuck with developing on Windows largely ignored the STL and made their own containers / utility functions: MSVC's STL was just awful.
> Many of the STL containers in MSVC are incredibly cursed. Even std::vector is implemented in such a weird way, that you can't debug it easily with any debugger other than Visual Studio. And it's incredibly slow on Debug mode, up to the point that it's unusable... (Because of these reasons I just switched to using my own Vector<T> class recently)
Here's a link to an old HN discussion that offers some insight on the performance of MSVC's STL in debug.
The "cursed" keyword hand-waves over the problem and suggests it originates in the acritical propagation of ignorant beliefs. Sometimes clarifying these is just a Google search away.
Longer answer is that the size of the "blocks" is limited to 512 bytes or one element, whichever is larger. So unless your elements are really tiny, it is strictly a pessimization.
My foreman always would tell me KISS (keep it simple, stupid); maybe there is another acronym for "just solve the problem at hand, silly" - j,s,p,a,h,s - doesn't really roll off the tongue the same way…
Nobody has mentioned immutability yet - linked lists are easily used as one of the simplest immutable persistent data structures. Your definition of "better" appears to be solely about keeping the CPU burning, but if CPU isn't the bottleneck, I prefer "immutable linked list" over "clone a vector many times" merely on aesthetic and reason-ability grounds.
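For example, the simplest version of that idea, sketched in C++ terms: a cons cell where prepending shares the entire existing list (std::shared_ptr standing in for a GC; names are just for illustration):

    #include <memory>
    #include <cstdio>

    struct Node {
        int head;
        std::shared_ptr<const Node> tail;
    };
    using List = std::shared_ptr<const Node>;

    // Prepend never touches or copies existing nodes; it only shares them.
    List cons(int x, List rest) { return List(new Node{x, std::move(rest)}); }

    int main() {
        List a = cons(3, cons(2, cons(1, nullptr)));
        List b = cons(4, a);   // shares all of `a`; nothing is cloned
        for (List p = b; p; p = p->tail) printf("%d ", p->head);  // 4 3 2 1
        printf("\n");
    }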
Or you could go the Clojure route and do the tree-style immutable vectors where the cost is log32(N) for most operations, small enough in all the relevant operations, and you get the amazing usability of immutable data structures that are performant enough in most cases.
Well, of course if a solution is "cloning a vector many times" then arguably you're using the vectors wrong - in which case it's totally appropriate to find a different data structure (or a better way to use vectors, if possible).
In C where using the simplest data structures is valuable, you’d likely do much better by upgrading your programming language to something that offers a better data structure. If you’re using a higher level language where you can abstract details about your sequence data structure, I think there’s very little benefit to using the simplest data structure and you should instead have widely used libraries that offer a good immutable sequence type, e.g. RRB trees.
One exception is ML style languages where singly linked lists get a big syntax advantage (Haskell only half counts because most of its lists are really just iterators). I think this was, in hindsight, a mistake for a widely used language. It’s also a source of pointless pain for learners (eg I don’t think there’s much benefit to people having to write lots of silly tail-recursive functions or learning that you have to build your list backwards and then reverse it). Another exception would be some lisps where lists are everywhere and the cons-cell nature of them makes it hard to change.
It's surprising people are arguing about this. That fact alone shows that lots of developers don't understand the relationship between L1/L2 cache, data access patterns, and how prefetching works.
That makes sense, given how abstract things have gotten. Decades ago there was an article that showed the penalty for non-linear data access was massive. That was before speculative access, branch prediction and cache prefetching were standard features. The performance hit today would be even more (except presumably for machines that have implemented the speculative execution security mitigations).
As with any such subject, it's worth discussing the details but probably not the generalities really (except perhaps in the whimsical style of antirez).
You're right to point out that, given the performance characteristics of most modern CPUs, accessing memory linearly is optimal. But it doesn't necessarily follow that all collection data structures should be stored as a vector. Consider a collection traversed very slowly relative to the overall computation: there may not be any cache win if the line's always evicted by the time you come back to it.
I’ve always wondered about whether you could have your cake and eat it too with linked lists where nodes are allocated in contiguous blocks of N nodes. So, the normal case of traversing the next pointer is not a discontinuous memory access.
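That's essentially an unrolled linked list. A minimal sketch, with an arbitrary chunk size (pick it so a chunk fills a cache line or two):

    #include <cstddef>

    constexpr size_t CHUNK = 14;   // arbitrary for the example

    struct Chunk {
        size_t count = 0;
        int    items[CHUNK];
        Chunk *next = nullptr;
    };

    struct UnrolledList {
        Chunk *head = nullptr, *tail = nullptr;

        void push_back(int v) {
            if (!tail || tail->count == CHUNK) {       // a new allocation only every CHUNK inserts
                Chunk *c = new Chunk();
                if (tail) tail->next = c; else head = c;
                tail = c;
            }
            tail->items[tail->count++] = v;
        }

        long sum() const {                             // one pointer chase per CHUNK elements
            long s = 0;
            for (const Chunk *c = head; c; c = c->next)
                for (size_t i = 0; i < c->count; ++i) s += c->items[i];
            return s;
        }
    };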
That's quite likely to happen to a linked list when using a compacting GC (such as Java's or C#'s generational garbage collectors) (assuming the elements of the list are only reachable from the preceding element, not from many other places as well).
That depends on the GC’s traversal strategy, I think. If you assume that linked lists are implemented as in lisps (a pair of pointers), then if you traverse the head pointer before the tail pointer, the compacting GC will tend to spread the linked list out over time.
> It's surprising people are arguing about this. That fact alone shows that lots of developers don't understand the relationship between L1/L2 cache, data access patterns, and how prefetching works.
Your misconception lies in the assumption that the only conceivable use case is iterating over a preallocated data structure to do read-only operations.
Once you add real world usages, with CRUD operations involving memory allocations/reallocations then your misconception about cache crumbles, and you start to understand why "people are arguing over this".
You're expanding the scope of the problem. In real life nobody is pulling shit off of a database and stuffing it into a linked list. And if they are, the performance will still be worse, because each list node will be allocated separately. Even if you do your own malloc you will blow cache with a linked list, because as time goes on your locality will be blown out. Iterating over a linked list will be worse as well.
In fact, on modern architectures it may be faster to iterate over arrays just to take advantage of L2/L3 behavior.
It’s hard for me to blame developers for their misaligned mental models since, as you yourself point out, the relationship between your mental model of the hardware’s behavior and what it actually does is complex and not really obvious. Mitigations etc means that the hardware you bought might totally change its behavior in the future. If all you can rely on is time measurements, it doesn’t explain that the software you wrote last year is now eg 40% slower due to bug mitigations. I think it’s easier to expect developers to put in the effort once they’re empowered with powerful tools to explain how the software gets mapped onto the hardware.
I think tools that result in greater transparency at lower levels will probably emerge, something like a compiler-generated dynamic analysis + pre-flight checklist hybrid. But since it’s so low-level, it might actually affect results, which may be a problem. I think intel vtune seems interesting in that regard.
I think the argument of locality and linearity and cache hits is valid, but specific for one access pattern: reading items in order and adding to the tail. If that's what your code ends up doing most of the time, then an array wins over a linked list hands down.
As soon as you start adding or deleting items in the middle, the game changes because the data you need to move is not contained in the cache anymore (except for small arrays).
I can even imagine a workflow that uses both data structures: an app could build up a linked list with the data it reads from an external file, for example, where they can be scattered and not in order; when the file is finally closed, the data are considered read-only and rearranged into an array for the remaining part of the execution.
So don't dismiss linked lists as inefficient "a priori", which is exactly what antirez claims in the article.
That analysis is lacking. Linked lists turn out to be inferior to arrays for essentially any access pattern, at least when the linked list can be assumed to be randomly stored and not accidentally sorted.
Say you want to add an element in the middle of an N-element ll/array.
For the linked list, this means you need to do 2 * N/2 memory reads (read value, read next pointer), which will mean on average N/2 cache misses, to reach the element at index N/2. Then you'll do O(1) operations to add your new element. So, overall you've done O(N) operations and incurred O(N) cache misses.
For an array, there are two cases:
1. In the happy case, you'll be able to re-alloc the array in-place, and then you'll have to move N/2 elements one to the left to make room for your new element. You'll incur one cache miss to read the start of the array, and one cache miss to read the middle of the array, and one cache miss for realloc() to check if there is enough space. The move itself will not incur any more cache misses, since your memory access pattern is very predictable. So, overall you've done O(N) operations, but incurred only O(1) cache misses - a clear win over the linked list case.
2. Unfortunately, sometimes there is no more room to expand the array in place, and you actually have to move the entire array to a new place. Here you'll have to move N elements instead of N/2, but you'll still incur only 1 cache miss for the same reason as above. Given the massive difference between a cache miss and a memory access, doing N operations with 1 cache miss should easily be faster than doing N/2 operations with N/2 cache misses.
The only access patterns where the linked list should be expected to have an advantage are cases where you are adding elements very close to an element which you already have "in hand" (e.g. at the head, or at the tail if you also keep a pointer to the tail, or at the current element if you're traversing the list while modifying it). In those cases, the linked list needs O(1) operations and cache misses, while the array needs O(N) operations and O(1) cache misses, so it starts to lose out.
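If you want to convince yourself, a rough benchmark sketch along these lines (N is arbitrary; the numbers will vary a lot by machine and element size, so measure before believing anyone):

    #include <chrono>
    #include <cstdio>
    #include <iterator>
    #include <list>
    #include <vector>

    template <class F> static double time_ms(F f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        const int N = 100000;

        std::vector<int> v;
        double tv = time_ms([&] {
            for (int i = 0; i < N; ++i)
                v.insert(v.begin() + v.size() / 2, i);   // O(n) moves, ~O(1) misses
        });

        std::list<int> l;
        double tl = time_ms([&] {
            for (int i = 0; i < N; ++i) {
                auto it = l.begin();
                std::advance(it, l.size() / 2);          // O(n) pointer chases to find the middle
                l.insert(it, i);
            }
        });

        printf("vector: %.1f ms, list: %.1f ms\n", tv, tl);
    }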
Well, ideally your programming language (and its standard library) should provide you with basic data structures that are cache oblivious (or optimized for your architecture's cache). So programmers wouldn't need to worry.
I often write code destined to run on a microcontroller, and linked lists are great for avoiding dynamic memory. It's nice to be able to statically or transiently allocate storage without having to bound the size of anything.
Of course, the same unintuitive tradeoffs tend to apply even if there isn't much caching to worry about. A trivial linear iteration is often faster than a more elaborate algorithm when the input is small.
Carmack is a fan of this approach. Allocate some reasonably large number as your design constraint, and if you start approaching it, that is the time to ask why and whether this data structure is still reasonable here.
That's why the Amiga OS (Exec) used linked lists as its primary data structure. It was one of the first microcomputer operating systems using dynamic memory mapping (i.e. no fixed addresses for OS calls and data structures) and supporting multitasking. Any object you received from Exec was basically a pointer to a list node. And, yes, the Amiga's biggest problem was memory fragmentation and leakage, because unreachable linked list nodes would start to accumulate whenever you worked with a buggy program.
It's also an area where the costs are a lot less significant: a Cortex-M4 and below often doesn't have a cache (at least for RAM: flash often has a cache and predictive lookup, but the performance difference is small enough that the cache can basically make executing code from flash identical to RAM), and the RAM is about as fast as the processor, so random access and sequential access perform pretty similarly. (And with the speed of modern microcontroller cores, the size of the data is rarely even an issue: most STM32s can iterate through their entire RAM in under a millisecond.)
That said, in my embedded work I still rarely find a linked list to be a useful data structure (but I know FreeRTOS, which I use a lot, uses them extremely heavily internally).
Well, yes, and a lot of people treat kernel development as if they were developing for a constrained microcontroller. I don't think this is even a bad thing, we want the kernel to be as lean as possible because every page and cycle the kernel burns is one that users don't get to use.
I suppose a more important reason is the "not have to bound anything" part. OS kernels almost by definition don't know how many objects will be created. Implementing queues as vectors of pointers would require dynamic allocation and/or introduce many more failure points when running out of resources. With linked lists, once you have an object you can always append it to a linked-list queue.
At the cost of gross complexity / fragmentation / load on your memory allocator.
Linked Lists are definitely easier to use if you're ever in a position where you're writing your own malloc(). The infrastructure needed to cleanly resize arrays (and also: the pointers all going bad as you do so) has a lot of faults IMO.
EDIT: In particular, linked list nodes have a fixed size, so a hand-written malloc() implementation for them is extremely simple.
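Something like this minimal fixed-size pool is all it takes (no thread safety, no growth; the arena size is whatever you pick):

    #include <cstddef>
    #include <cstdlib>

    struct Node { int value; Node *next; };   // the thing we're allocating

    struct NodePool {
        union Slot { Node node; Slot *next_free; };
        Slot  *arena;
        Slot  *free_list = nullptr;
        size_t used = 0, capacity;

        explicit NodePool(size_t cap) : capacity(cap) {
            arena = (Slot *)malloc(cap * sizeof(Slot));
        }
        Node *alloc() {
            if (free_list) {                    // reuse a freed slot first
                Slot *s = free_list;
                free_list = s->next_free;
                return &s->node;
            }
            if (used < capacity) return &arena[used++].node;   // otherwise bump-allocate
            return nullptr;                     // arena exhausted; real code would grow or fail loudly
        }
        void release(Node *n) {                 // O(1): push the slot back on the free list
            Slot *s = (Slot *)n;                // Node is the first union member, so this is safe
            s->next_free = free_list;
            free_list = s;
        }
    };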
Doubling the vector size each time it needs to be enlarged takes care of fragmentation/memory overhead, in practice. Or preallocating if you know the rough size.
It does take some discipline to avoid the storage of pointers, but once you get used to that it’s quite fine.
Source: I work on VPP, which uses vectors quite extensively.
The optimal growth factor is the golden ratio (≈1.618); in practice many vectors use a growth factor of 1.5.
The reason for not going above the golden ratio is that doing so prevents any previously freed memory from ever being reused. If you always double the size of your vector, it is never possible to reclaim/reuse any previously freed memory (for that vector), which means every time your vector grows you cause more and more memory fragmentation, as opposed to using a growth factor of 1.5, which eventually lets the allocation fall back into the freed space.
Case 1: Growth factor of 2 and an initial size of 10 bytes.
Start with an initial allocation of 10 bytes of memory.
On growth allocate 20 bytes and release the 10 bytes, leaving a hole 10 bytes.
On growth allocate 40 bytes and release the 20 bytes, the hole in memory is now 30 bytes large (the initial 10 byte hole + the new 20 byte hole).
On growth allocate 80 bytes and release the 40 bytes, the hole is now 60 bytes.
On growth allocate 160 bytes and release the 80 bytes, the hole is now 140 bytes.
So on so forth... using this strategy it is never possible for the dynamic array to reclaim the hole it left behind in memory.
Case 2: Growth factor of 1.5 and an initial size of 10 bytes.
Start with an initial allocation of 10 bytes of memory.
On growth allocate 15 bytes and release the 10 bytes, leaving a hole 10 bytes.
On growth allocate 22 bytes and release the 15 bytes, the hole in memory is now 25 bytes large (the initial 10 byte hole + the new 15 byte hole).
On growth allocate 33 bytes and release the 22 bytes, the hole is now 47 bytes.
On growth allocate 50 bytes and release the 33 bytes, the hole is now 80 bytes.
On growth reuse 75 bytes from the hole in memory left over from previous growths, the hole is now 5 bytes.
With a growth factor of 1.5 (or anything less than the golden ratio), the hole grows up to a point and then shrinks, grows and shrinks, allowing the dynamic array to reuse memory from past allocations.
With a growth factor of 2, the hole in memory continues to grow and grow and grow.
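A tiny simulation of that argument (purely illustrative: it ignores every other allocation in the program and assumes the freed space stays in one hole in front of the live block):

    #include <cstdio>
    #include <cmath>

    static void simulate(double factor, int steps) {
        double hole = 0, cur = 10;                 // start at 10 bytes, as in the example
        printf("growth factor %.2f\n", factor);
        for (int i = 0; i < steps; ++i) {
            double next = std::floor(cur * factor);
            if (next <= hole) {                    // the freed prefix is big enough to be reused
                printf("  step %d: %g bytes fit inside the %g-byte hole\n", i, next, hole);
                hole = hole - next + cur;          // new block sits in the hole; old block is freed
            } else {
                printf("  step %d: allocate %g past the end; hole grows to %g\n", i, next, hole + cur);
                hole += cur;                       // old block joins the hole
            }
            cur = next;
        }
    }

    int main() {
        simulate(2.0, 8);   // hole keeps growing, never reused
        simulate(1.5, 8);   // hole eventually gets reused
    }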
This appears to assume that there will be no intervening allocations that are allowed to use the same region of memory, since otherwise your "hole" will be very discontiguous.
But if that is the case, then why are you releasing memory? That's just costing you time moving the data. With a doubling mechanism:
Start with an initial allocation of 10 bytes of memory.
On growth allocate another 10, and don't copy anything, leaving no hole at all.
On growth allocate another 20, and don't copy anything, leaving no hole at all.
etc.
In what scenario would the growth factor of 1.5 actually help? If you're restricting the allocation API to only malloc then you can't grow the size of your allocation and what you said might make sense as something the allocator could take advantage of internally. But realloc exists, and will hopefully try to extend an existing allocation if possible. (If your earlier allocation was fairly small, then the allocator might have put it in an arena for fixed-size allocations so it won't be possible to merge two of them, but once things get big they generally get their own VMA area. Obviously totally dependent on the allocator implementation, and there are many.)
99.9% of the time, each allocation leaves a separate hole. Your goal is to enable other allocations to use those holes without further splitting them, not to get your ever-growing vector to reuse memory.
I usually use doubling, but I just recently landed a patch to expand by a factor of 8 instead, which sped up a microbenchmark (of constructing a JSON string, fwiw) by 50%.
Someone saw in a profile that we were spending a lot of time in realloc in a task that involved building a string in contiguous memory. But the final string was realloc'd down to its actual size, so it was safe to get a lot more aggressive with the growth to avoid copying. The overhead was only temporary. It turns out that powers of 8 get big fast, so there are now many fewer copies and the few that happen are copying a small portion of the overall data, before a final octupling that provided mostly unused space that didn't cost much.
Very interesting. And indeed, especially being adjacent with the very interesting “golden ratio” comment, this shows that there is more than one definition of “best” dependent on the task, and that one always needs to verify their hunch by actual profiling.
Contributions-wise to VPP: varies from time to time. On my very recent memory there were sizable IPSec acceleration patches from folks at Intel. There is a fair few one off bugfixes coming from smaller users who scratch their own itch. Pim van Pelt [0] has contributed/improved a few very cool and sizable features (Linux control plane integration, VPP yaml configurator [1])
As for difference with DPDK - does it do L3 routing? NAT ? MAP ? MPLS ? DHCP ? VPP can do all of this, and of course it’s just off top of my memory…
or if you can't guarantee allocations will succeed and need to handle that case. in that case you can just have the object that's part of the list have its list element data contained within itself. (aka an intrusive list)
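e.g. a simplified kernel-style doubly linked intrusive list (just an illustration of the idea, not any particular library's API):

    #include <cstdio>

    struct ListHead {
        ListHead *prev = this, *next = this;   // an empty list points at itself
    };

    inline void list_add(ListHead *node, ListHead *head) {   // insert after head; O(1), no allocation
        node->next = head->next;
        node->prev = head;
        head->next->prev = node;
        head->next = node;
    }

    inline void list_del(ListHead *node) {                   // unlink from whatever list it is on
        node->prev->next = node->next;
        node->next->prev = node->prev;
        node->prev = node->next = node;
    }

    struct Task {
        int id;
        ListHead run_queue;    // the same Task can sit on several lists at once,
        ListHead wait_queue;   // one embedded link per list
    };

    int main() {
        ListHead runnable;                     // list head
        Task t;
        t.id = 42;
        list_add(&t.run_queue, &runnable);     // cannot fail: the storage already exists
        list_del(&t.run_queue);
        printf("task %d linked and unlinked without allocating\n", t.id);
    }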
A good balance is a tree with about a cache line's worth of values at each node. It degrades to a vector for small N, gets the pre-fetching / branch prediction benefit for medium N, but doesn't require fully contiguous memory for large N. It also tends to do fairly well on common mutation patterns (insertions and deletions) by avoiding large copies.
If you want lock-free atomic data structures there are lots of variations (like AVL trees) that can be used as the underlying storage and just like mentioned above it tends to work well for most use cases and access patterns.
If N is fairly small and/or you allocate most of your linked list around the same time it may not matter which is why measuring your actual program with real-world data and access patterns is so important. eg if in production logging or other malloc traffic ends up making your linked list nodes spread out in memory perf is not going to match what you tested at your desk.
Often an algorithm can operate on an entire cache line's worth of data before main memory can chase down a single pointer. Modern CPUs hide that with prediction but even if the address prediction is 100% accurate if the result is a pointer loading chain you run out of pre-fetch slots pretty fast.
To give concrete numbers: if DRAM is not busy refreshing and there is no bus contention or inter-core contention, then we might see 60-100ns access times all told. On a 3GHz CPU that's roughly 300 cycles waiting around doing nothing... for each pointer that needs to be dereferenced. For a vector, after the first 300-cycle hit the rest would be free - the following elements are in the same cache line, and address prediction will pre-fetch subsequent cache lines as well, allowing pre-fetching to stream data ahead of when it is needed.
> How is a vector good for inserting elements in the middle? That is O(N). Where is the O(lg n) cost coming from?
Both an array and a linked list are O(n) for adding in the middle. In an array, finding the middle element is O(1), and then moving the rest of the array is O(n) [you have to move n/2 elements]. In a linked list, finding the middle element is O(n) [you have to traverse n/2 elements], but adding it is O(1).
So, asymptotically there is no difference. In practice though, the linked list traversal will cause O(n) cache misses to reach the middle element (since we are traversing random memory), while the array move will only incur O(1) cache misses (since we are accessing memory in order) - so, the array will actually win in practice.
Edit: Note that I also don't know where the GP got the O(lg n).
It seems that the case where it is necessary to find the exact middle element and insert something wouldn't be very common but OK, I'll humor this use case. Let's assume that the list is huge and inserting at the middle is the only thing we do. Just maintain a pointer to the middle element and insert there: O(1) for find and insert. If it is necessary to add/remove at other places in the list, it is possible to maintain a doubly linked list and move the middle pointer back and forth accordingly (still O(1)).
I suppose if the LL like structure is mostly read only with rare inserts, not that big and holds simple data types or structs and often needs to be read sequentially then an array/vector/list would be better than a regular LL but then it is obvious that an array is better anyway.
It's poor guidance to tell people that the vector is the "go to" when you need a LL. Sad actually that this poor guidance has been upvoted by so many people such that it is the top comment, all in the name of some FUD about cache misses.
Let's take the case where we need to add to any element of the structure, with uniform probability.
For an array, this means we'll have to move N elements if we need to add at the beginning, N/2 if exactly in the middle, or 0 if at the end. Either way, we'll need to re-size the array, which has some chance of requiring N copies anyway - say, half the time. So, on average, we'll perform 3N/4 moves per element added, requiring 1 cache miss every time.
For a (doubly-)linked list, this means we'll have to traverse 0 elements if we want to add right before Head, 0 if we want to add right after Tail, and in the worst case we'll need to traverse N/2 nodes if right in the middle - so, we'll have to do on average N/4 traversals per element added. Each of the nodes traversed will generally be a cache miss, so N/4 cache misses. So, the linked list will indeed require a third of the number of operations on average, but many times more cache misses, so it should be expected to lose. If you store a middle pointer as well, you'll again halve the number of operations for the list, but that still won't cover the amount of time wasted on cache misses.
Of course, if the distribution is not uniform and is in fact skewed near a predictable element of the linked list, at some level of skewed-ness vs cache speed advantage, the Linked List will overtake the array.
Note that linked lists also have other advantages - for example, they can be used as in the Linux kernel to store the same object in multiple different lists without any copying and without extra pointer indirections (at a small cost of one extra pointer in the object per list it could be a member of). They are also easier to pack into a small address space that has to be dynamically allocated (since you don't require a large contiguous block).
Understanding, and even sympathizing, with the machine's toil re: vectors vs manual linked list can separate a good engineer from a great one. We learn linked lists, assembly, and even machine code in Computer Science majors so that we know what's happening under the hood and can more easily surmise the runtime effects.
> Understanding, and even sympathizing, with the machine's toil re: vectors vs manual linked list can separate a good engineer from a great one.
I think your parent's point is that if you're a C++ programmer, recognizing that vectors outperform other data structures for most use cases is what separates a good engineer from a great one. Often even for things that require lookup (i.e. vector outperforming a set).
During code reviews, a senior C++ programmer in my company (formerly sat in the C++ standards committee) would always demand benchmarks from the developer if they used anything else. In most cases, using representative real world data, they discovered that vectors outperformed even when the theoretical complexity of the other data structure was better.
A great engineer knows the limitation of theory.
Of course, my comment is only regarding C++. No guarantees that vectors are great in other languages.
> Often even for things that require lookup (i.e. vector outperforming a set)
I am always uncomfortable using higher algorithmic complexity. Sure, for small data sets, the vector will outperform the set, but it may not stay small forever. I may waste a millisecond now, but at least, I won't waste an hour if the user decides to use 100x the amount of data I initially considered. And using much more data than expected is a very common thing. It is a robustness thing more than optimization.
If you want both, there are staged algorithms, a typical example is a sort algorithm that goes insertion -> quicksort -> mergesort. Insertion sort is O(N^2), but it is great for small arrays, quicksort is O(N log N) on average but may become O(N^2) in some cases, and mergesort is guaranteed O(N log N) and parallelizable, but it is also the slowest for small arrays. For lists, that would be a list of vectors.
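For what it's worth, a toy sketch of that staging idea (only the first two stages, no fallback for quicksort's worst case; the cutoff of 32 is a guess, not a tuned constant, and real std::sort implementations are considerably more sophisticated):

    #include <algorithm>
    #include <cstdio>
    #include <utility>
    #include <vector>

    constexpr int CUTOFF = 32;   // below this, insertion sort's tiny constants win

    static void insertion_sort(int *lo, int *hi) {
        for (int *i = lo; i < hi; ++i)
            for (int *j = i; j > lo && j[-1] > j[0]; --j) std::swap(j[-1], j[0]);
    }

    static void staged_sort(int *lo, int *hi) {
        if (hi - lo <= CUTOFF) { insertion_sort(lo, hi); return; }
        int pivot = lo[(hi - lo) / 2];
        // three-way partition: [lo,m1) < pivot, [m1,m2) == pivot, [m2,hi) > pivot
        int *m1 = std::partition(lo, hi, [pivot](int x) { return x < pivot; });
        int *m2 = std::partition(m1, hi, [pivot](int x) { return x == pivot; });
        staged_sort(lo, m1);
        staged_sort(m2, hi);   // the == pivot block is already in place
    }

    int main() {
        std::vector<int> v{9, 3, 7, 1, 1, 8, 2, 6, 5, 4};
        staged_sort(v.data(), v.data() + v.size());
        for (int x : v) printf("%d ", x);
        printf("\n");
    }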
I like the staging idea, but curiously I haven't seen any papers investigating this.
Obviously, there is an overhead cost of the staging itself, which hits the use case for small data, and there is the question "From what data size onwards to use what algorithm?".
For example, I just had a look at sort in Clang's STL implementation, which seems to set a boundary at N=1000. Since the number is a multiple of 10, it looks like it might have been arbitrarily picked; in any case there are no paper references in the comments to back it up.
Are people aware of formal studies that explore whether it is worth the trouble to inspect data structure and then to dispatch to different algorithms based on the findings?
You're saying you think there's still optimization to be done re. the N=1000 constant?
Other than that, I'd suggest that big O complexity is already telling you that the inspection is worth it. Assuming inspection of an array is constant.
> Sure, for small data sets, the vector will outperform the set, but it may not stay small forever.
In over 90% of SW applications written in C++, there is little to no uncertainty on the size of the data set. Most applications are written for a very specific purpose, with fairly well known needs.
If my data set is size N, and I can benchmark and show the vector outperforms the theoretical option up to, say, 20N, it's a safe choice to go with vector. Doing these analyses is engineering.
In any case, it's almost trivial to swap out the vector with other options when needed, if you plan for it.
I hope that senior engineer is also the one person who will refactor things if there are ever enough elements in any of those vectors to make linear search slower than an actual set or other data structure. Oh, and of course that senior engineer will also hopefully be the one monitoring the lengths of all those vectors in production.
I believe he was referring to the number of elements in a vector, to know when to swap with a different data structure - not the size in bytes.
But to your point - I once needed to do a key/value lookup for a large data set, and so of course I used a map. Then I ran out of RAM (had 80GB of it too!). So I switched to a vector and pre-sorted it and used binary_search for lookups (data set was static - populate once and the rest of the codebase didn't modify it). And of course, the RAM consumption was less than half that of the map.
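The pattern looks roughly like this (key/value types here are placeholders):

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct Entry { uint64_t key; std::string value; };

    struct StaticMap {
        std::vector<Entry> data;

        void finalize() {        // call once after all inserts, before any lookups
            std::sort(data.begin(), data.end(),
                      [](const Entry &a, const Entry &b) { return a.key < b.key; });
        }
        const std::string *find(uint64_t key) const {   // O(log n) binary search, no per-node overhead
            auto it = std::lower_bound(data.begin(), data.end(), key,
                                       [](const Entry &e, uint64_t k) { return e.key < k; });
            return (it != data.end() && it->key == key) ? &it->value : nullptr;
        }
    };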
Just describe your sorted list as "a balanced binary tree with implicit child links" and then you get credit for being fast, space-efficient, and using fancy data structure.
> [V]ectors outperform other data structures ... even for things that require lookup (i.e. vector outperforming a set)
This is something I've heard before (not often benchmarked) and it makes intuitive sense, and sometimes I've also told it to myself in order to not feel bad for writing O(n) where n is always small.
But somehow, reading this snippet, I'm reminded that the vector does not maintain the semantic reminder that it's used in a key/value use case. So it has me thinking there should be a map/hashtable/dictionary interface that's really just a vector with linear lookup.
Then I think, hmm, maybe the standard containers should just do that. Maybe the authors of those should have some heuristic of "when to get fancy".
std::find_if works great if you want to treat a vector like a map. Plus you can have different keys if that is useful for your problem (look up an employee by name, id, or address?).
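For instance, something like this (Employee and its fields are invented for the example):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Employee { int id; std::string name; std::string address; };

    // The same flat vector answers lookups by whichever "key" you need.
    const Employee *by_name(const std::vector<Employee> &v, const std::string &name) {
        auto it = std::find_if(v.begin(), v.end(),
                               [&](const Employee &e) { return e.name == name; });
        return it == v.end() ? nullptr : &*it;
    }

    const Employee *by_id(const std::vector<Employee> &v, int id) {
        auto it = std::find_if(v.begin(), v.end(),
                               [&](const Employee &e) { return e.id == id; });
        return it == v.end() ? nullptr : &*it;
    }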
I think the cases where lookup is faster for a linearly scanned vector is where the hash computation overhead is greater than a linear scan. Assuming you have a good hash implementation, that should be true only for a small number of entries, where you need to determine "small" by measuring.
Even if hashing the key is cheap, if the table isn't in cache the vector will be faster just because we avoid the cache miss of loading that value into memory.
All of these are indeed important, and will eventually (one hopes) result in the student realizing that linked lists are to be avoided because of their poor mechanical sympathy resulting from their awful cache locality.
Valid uses for a linked list are extremely niche. Yes, you can name some, and I can name several valid uses for bloom filters. Just because a simple linked list is easy to implement does not mean it should be used, any more than bubble sort should be used for its simplicity.
Wait why do linked lists have bad cache locality? It depends on how your heap is set up. For example, you could have an allocator that gives relatively good locality by having high granularity and only bailing out if say your LL gets super long (so if your lists are usually short, they could have awesome locality)
That only works if you allocate lists all at once and never modify them.
If there are any other allocations taking place at the same time -- say, allocations from other threads, or for other data you had to create while building the list, that locality is shot. Same goes if you insert more elements to the list later, or if you perform an operation which changes its order (like sorting it).
I mean, if you modify the contents of the list nodes in place, sure. But at that point, you're just doing an array with extra steps.
Changing any of the pointers, however, will make memory prefetch much less efficient. CPUs like linear memory access patterns; they're easy to predict. Chasing pointers is much less predictable, even if the pointers are all to the same region of memory.
That doesn't seem to make much sense to me. LLs are useful in cases where you deal with changing lists, where you'd otherwise be forced to allocate and deallocate memory often. If you know beforehand the exact or approximate size of the list and its contents, you can store that contiguously in memory with greater cache locality, because you're not storing pointers in addition to data. Seems to me the use case of LLs is perhaps as ill suited for cache locality as the structure itself.
True. You could implement that using an array of structs, where each struct contains indexes into the array to link to other entries instead of memory pointers (which are 8 bytes in 64-bit applications). This way you get the nice properties of both linked lists and contiguously allocated arrays.
If your list is short then you won't have a problem. But if your list is long then traversing it can be a major problem.
For example, if your list item is page aligned like the task_struct in the Linux kernel, rescheduling the next process could traverse the process list. The process list links will be at the same page offset and associate with a restricted set of the L1/L2/L3 caches lines and thrash from memory. Worse, the TLB will thrash as well. This was actually the case for Linux until about 2000.
If you are in that place, you're probably using a linked list over small statically allocated memory, but you still need random ordering and fast remove or reorder.
It is not too large (a few thousand entries usually), but it needs to be accessed and modified extremely fast. There can be insertions and deletions anywhere, but it is strongly biased towards the front. It is ordered by price, either ascending or descending.
Various "ordered queue" implementations are usually a poor fit, as there are many insertions and deletions in the middle, and they need to avoid any additional latency. The same goes for vectors. Red-black trees etc. are too complex (and slow) for such a small size.
So we start with a humble linked list, and optimize it (unrolled, skip search etc). But we respect linked lists, they work.
The performance of vectors comes from iterating through them and letting the cpu prefetch items before you need them. Random access in a vector doesn’t really get you that if the vector is larger than your L1/L2 caches, which linked lists would be in anyway if you used them recently enough.
Interesting idea but how does C++ guarantee contiguous memory for a vector? I just don’t see how a data structure with an arbitrary, dynamic size can also reside in a contiguous range of address space.
That works if the data that you want to put in the sequence is copyable (doesn't have "identity"), or if you can arrange so that it gets created in the right spot from the start and you can always choose the iteration order to be sequential in memory. For many more complex structures, that is not the case.
Example: Any object shared among multiple lists or other data structures at the same time, or directly pointed to by other objects, such that any of those views must reach the same cheaply mutable state. This includes objects in the object-oriented programing (OOP) model, actors in the actor model, and file objects in a kernel which are in multiple lists.
For those you have to store object pointers in the vector, instead of copies of the objects. That works and is often done, but it defeats or reduces the cache locality performance improvement which motivates using a vector in the first place.
On an in-order CPU an intrusive list (pointers inside the objects) can be traversed faster than a vector or B-tree of object pointers, though a non-intrusive list (a list of nodes containing object pointers) won't be faster. On an out-of-order CPU with heavy speculative execution it is less clear cut, because the vector allows successive objects to be dereferenced speculatively in parallel to a limited extent, while this is not possible when traversing an intrusive list.
If the list is going to be traversed looking for objects based on some filter criterion such as flags or a number comparison, based on non-mutable state (or where it's ok for mutation to be expensive), traversal can be sped up by storing criterion data in the vector alongside the pointers, or in a second vector, to avoid dereferencing every touched object pointer during traversal.
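e.g. something like the following sketch, where the flag lives in the vector and the big object is only touched on a match (the Order type and its fields are invented for the example):

    #include <cstdint>
    #include <vector>

    struct Order { uint64_t id; bool urgent; /* ...big, rarely-touched payload... */ };

    struct Index {
        struct Slot { uint32_t flags; Order *obj; };   // small, cache-friendly entries
        std::vector<Slot> slots;

        template <class F>
        void for_each_flagged(uint32_t mask, F f) {
            for (const Slot &s : slots)         // linear scan over contiguous slots...
                if (s.flags & mask) f(*s.obj);  // ...only matching entries pay the pointer dereference
        }
    };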
Linked list are indeed great for secondary traversal orders (or for subsets).
But I don't understand your comments re in-order CPUs: vectors will be faster there too as they avoid the load-load dependency. Most modern inorder CPUs are still superscalar and fully pipelined.
If the data is relatively big buffers, you really do not want to copy them. Your memory bandwidth will thank you, and performance will vastly increase.
Cache locality will be good anyway since you're referring to long blocks.
This kind of data structure often has a buffer + associated metadata. If the metadata + pointer fits into 1-2 cache lines, storing it in a vector can be a win vs. storing the next/prev pointers next to the data + metadata.
In any case, there is always only one authoritative answer: “perf top”. :)
> almost always, whatever problem you're trying to solve would be better done with a regular resizable vector
Resizing a vector brings the whole of it into cache, potentially dragging all of it all the way from actual RAM. Prepending an entry brings nothing into cache.
> Resizing a vector brings the whole of it into cache
I don't do low level code on modern systems; is there not a way to just blit chunks of memory around without a CPU looping its way through word by word, carrying out MOVQ instructions which do a full RAM read/write?
If you’re streaming data out of memory to a PCIe bus or something, that wouldn’t go via cpu cache, right? There’s some kind of DMA model, so there’s something in the system architecture that allows memory access to be done off-CPU (I vaguely think of this as being what the northbridge was for in the old school x86 architecture but I’m fuzzy on the details?).
Mechanical copying of byte arrays feels like something a CPU should be able to outsource (precisely because you don’t want to pollute your cache with that data)
In the ideal case (and assuming a vector large enough that it spans many pages), you would actually be able to work with the OS's virtual memory manager to re-allocate and move only the page that contains the address you want to write to, while all other pages can stay untouched. But there is quite a bit of careful work to do to implement something like this and make sure you are not invalidating pointers or even internal malloc() data structures.
Not in architectures I've seen or maybe I'm missing the instructions to do it? Even if there were and you can bypass the CPU, hauling data out of RAM is still going to take a long time.
Yes, they’re called “non-temporal” load and store instructions. AFAIK most memcpy implementations will automatically use non-temporal access when copying blocks larger than a certain threshold, precisely to avoid destroying the cache during bulk copies.
Additionally, a well-optimized memcpy implementation (e.g. making good use of prefetch hints) should not suffer too much from RAM latency and be able to pretty much max out the bandwidth available to the CPU core.
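For illustration, a bare-bones non-temporal copy on x86/SSE2 might look like this (it assumes a 16-byte-aligned destination and a size that's a multiple of 16, and makes no claim of matching a tuned memcpy):

    #include <immintrin.h>
    #include <cstddef>

    void copy_nontemporal(void *dst, const void *src, size_t n) {
        auto *d = static_cast<char *>(dst);
        auto *s = static_cast<const char *>(src);
        for (size_t i = 0; i < n; i += 16) {
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(s + i));
            // stream the store past the cache hierarchy instead of dirtying cache lines
            _mm_stream_si128(reinterpret_cast<__m128i *>(d + i), v);
        }
        _mm_sfence();   // make the streaming stores visible before anyone reads dst
    }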
You're thinking of like MOVNTDQA? Those still have to reach into RAM - that's still double-digits nanoseconds away and still creates a stall in the processor at n complexity!! No thanks? I can allocate a new linked list node with memory guaranteed already entirely in cache (except for TLA failure.)
> be able to pretty much max out the bandwidth available to the CPU core
Why max out a slow link at n complexity when you can just avoid doing that entirely and use a linked list?
Surely you know the answer, if you're analyzing at this level of detail?
It's because spending 1 millisecond per X operations doing vector moves is preferable to spending 20 milliseconds per X operations stalling on list links.
If you set a good vector size from the start, the moves will be rare to nonexistent. Or you could reserve some address space and never have to move the data.
Sure, prepending or adding at the end(assuming you're keeping a tail pointer, which is easy and common) are basically the only cases where a linked list can beat an array.
Yeah, and we deliberately prepend because that doesn't involve writing to a field in a node which may not be in cache. It's the consumer's job to reverse the list if they ever actually need it.
It's far from a disaster because the best way to deal with a list is still to use a contiguous array.
Allocation isn't free. You have to do it sometime and doing it all in bulk is going to have the same consequences as just one small allocation anyway. If you are mapping new pages into main memory, making system calls or looping through a heap structure, you're not only affecting the data cache but also the TLB while also pointer chasing.
Making a list out of an allocation for every element is performance suicide on modern computers. Even the amortized resizing of an array is done at the speed of memory bandwidth because of the prefetcher. Thinking that is some sort of performance hindrance is absurd because it outweighs the alternative by orders of magnitude. If there really are hard limits on the time every insertion can take, then the solution needs to be links of large chunks of memory to still minimize allocations on every element.
Obviously the compromise is to use a linked list, but preallocate a contiguous block of memory to hold all your linked list nodes. Like an array of linked list nodes, each with a pointer to the array index of the next node.
Then you just need to double the size of that block of memory and copy all the nodes to the new block whenever it fills up.
Hmm, but you’d probably end up needing some kind of a list of free blocks of space within that array to deal with fragmentation… you’d need another linked list to store those blocks in.
If only there was some sort of system that could take care of all that allocation and freeing of memory for you?
It looks like you are conflating a linked list with a heap. Once you have an array/vector for your node list the unused memory can be a list itself, each index pointing to the next free index. You just need the next free index to get memory and you point to the current free index when you free a node. You don't need to manage the memory further than that.
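A minimal sketch of that layout (indices as links, free slots threaded through the same array; the types and the NIL sentinel are just choices made for the example):

    #include <cstdint>
    #include <vector>

    constexpr uint32_t NIL = UINT32_MAX;

    struct Pool {
        struct Node { int value; uint32_t next; };
        std::vector<Node> nodes;
        uint32_t free_head = NIL;

        uint32_t alloc(int v) {
            uint32_t idx;
            if (free_head != NIL) {              // pop a slot off the free list
                idx = free_head;
                free_head = nodes[idx].next;
            } else {                             // or append (the vector handles the doubling)
                idx = (uint32_t)nodes.size();
                nodes.push_back({});
            }
            nodes[idx] = {v, NIL};
            return idx;
        }
        void release(uint32_t idx) {             // push the slot onto the free list
            nodes[idx].next = free_head;
            free_head = idx;
        }
    };

Note that the indices stay valid when the vector reallocates, which is exactly what raw pointers into it would not do.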
There is a reason why it's hard to create either a bare linked list or a bare array in modern languages. It's because both are bad extremes, and the optimal algorithm is almost always some combination of them.
Even if memory was packed by a GC, linked lists still have to pay the cost of allocating and garbage collecting every cell of memory individually. I doubt a GC would solve linked lists’ woes.
Do you have some benchmarks demonstrating your claim?
> ...each individual memory request that’s not in cache still has to wait the full 300 cycles. But if we can get 10 of them going at the same time, then we can be hitting a new cache line every 30 cycles. I mean, that’s not as good as getting one every 16 cycles, but it’s close. You’re actually able to get to a reasonable fraction of the raw memory bandwidth of the machine while just traversing randomly over this huge one gigabyte heap
I don't have a benchmark, but it's important to remember that GCs only scan live objects, and they always have to scan all live objects; while allocation is normally an extremely simple instruction (bump the memory pointer). It's true though that scanning a live linked list would be expected to be slower than scanning a live array (though some optimized scanning strategies may be possible).
Overall I couldn't tell what the impact on overall performance would be, but I can tell you for sure that de-allocating an N-element linked list or array are both O(1) operations for a copying garbage collector (which are the most common kinds of compacting GCs), since they simply will never reach nor copy any of the elements.
Will they? I don't think they reorder live objects relative to each other, so any object allocated between two nodes would still end up between those nodes after the GC ran.
Disagreeing with other answers here, but usually they will, and not because the authors tried to make it happen. It's just sort of the default behavior.
A copying collector has to trace through the whole graph to find everything reachable. The tracing order is usually going to be depth first. (It's the easiest to code if you're using recursion and therefore maintaining state with the runtime's stack, but even with an explicit stack you'll want to push and pop the end for locality.) Which means that objects arranged into a linked list will be visited in list order, and you generally copy as you visit an object for the first time. So... reordering live objects relative to each other "just happens". Even if it's a doubly-linked list, since the back pointers have already been visited and so don't matter.
This is less true if your objects have a lot of other pointers in them. But that's just saying that if your object graph isn't a list, then it won't end up laid out in list order. It'll still end up in DFS order according to the true graph.
If you can replace a linked list with a vector then you don't need a linked list, and the two shouldn't be compared, as it is almost always obvious when you need one over the other. Linked lists are used in areas where there is a relationship between nodes, like a list of running processes. One of the frequent operations is to remove a node from whatever position it's in and add it on either side.
Right. There are so many algorithms that I shudder to think about how you'd implement them without linked lists. Like... how about a buddy allocator? Sure, you could use a vector for each, but you'd be copying and resizing huge swathes of memory constantly for very little gain. Use the right tool for the job!
The thing is, the cost of copying memory is essentially the same as the cost of reading it, and linked lists force you to keep re-reading the same memory over and over (assuming random access).
Adding an element in the middle of an array vs the middle of a linked list actually has the same asymptotic complexity (O(n)), but far better cache impact for the array (since moving a contiguous block of n/2 elements should only incur 1 cache miss, while reading n/2 randomly distributed nodes of the list will incur on average something like n/4 cache misses). Adding closer to the beginning of the linked list is faster, but adding close to the end of the vector is faster still, so overall with random access, the vector should win.
I think you missed the point here. You don't use link lists when you are going to walk the list and insert in the middle or some random place. The only time you end up walking a link list in typical scenarios is when you want to print them out or some unimportant operation.
Linked lists are used to hold relationships between nodes, and when you are operating on a node the common patterns are to remove it and add it to another list, or re-add it on either side. Take a look at the BSD or Linux source for inspiration. A process struct has more than a dozen member variables of node* type because the object ends up being on several lists: sched, signal queue, child threads, etc.
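For readers who haven't seen the pattern, a hypothetical sketch of what that looks like (illustrative names, not actual BSD/Linux definitions):

    // One object, several embedded link fields: the same process can sit on
    // several lists at once and be unlinked from any of them in O(1),
    // with no allocation and no traversal.
    struct list_node {
        list_node* prev = nullptr;
        list_node* next = nullptr;
    };

    struct process {
        int pid = 0;
        list_node run_queue_link;   // scheduler's run queue
        list_node sibling_link;     // parent's list of children
        list_node wait_link;        // whatever resource it is currently blocked on
    };

    // Unlinking needs only the node itself.
    inline void unlink(list_node* n) {
        if (n->prev) n->prev->next = n->next;
        if (n->next) n->next->prev = n->prev;
        n->prev = n->next = nullptr;
    }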
What makes you think that list operations make the CPU jump randomly in memory, and that implementors of, say, Lisp systems haven't optimized memory management for linked lists? A modern GC provides features such as generations, areas, copying, and compaction.
On a Lisp Machine from the 80s/90s, the system saves a type-sorted and locality-optimized memory image, from which it later boots the Lisp system. In addition to the typical GC optimizations (incl. integration with the paging system) it also provided CDR coding, which allocates lists as contiguous memory. The GC also creates those CDR-coded lists. CDR coding fell out of fashion in modern Lisp implementations because the space and time savings aren't great enough to justify the more complex implementation.
If you're iterating, sure, use a vector. If you're pulling from a queue and doing a lot of processing, maybe an RPC -- does the ~70ns memory latency hit really matter that much? Probably not.
Costs you a dynamic memory allocation, code overhead and higher memory use. There are good reasons why std::array and plain arrays exist. Vectors are for when your data is unbounded, which is actually a risk most of the time.
For truly big data, you want a special structure anyway.
Without extra work you can't use a vector like a queue, and a circular buffer doesn't necessarily support all types and/or preserve iterators on insert. You could use a deque or a blocked linked list. But IMO a list is fine if it's not clearly a bottleneck.
It's used by log4j2 and several other java libraries and is quite solid. You can use it in blocking or polling modes, so it's quite generic depending on your latency requirements.
It would rebalance on insert with O(log N) performance but chunky constant factor, keeping the ordering so access by index is also O(log N) with low constant factor...
Which is why you would usually use a red-black tree rather than a B-tree, as it has a much lower constant for insertion and access by index. However, it is higher for in-order traversal.
That does sound useful, although I'm having some trouble thinking of what for exactly. :-)
And indeed, I don't recall seeing something like that. At a first glance O(log n) seems easy to do in the average case, but perhaps not in the worst case.
I suppose it's quite similar to the idea of Bagwell's persistent data structures[1] — those are particularly useful for lock free coding and with a high enough branching factor could be not as much overhead as a linked list or just plain copying.
Well, doesn’t the “index” need to be stored in the tree itself for this to work? If so, the tree just represents an ordered map. (Changing indexes after deletion or insertion would be a problem, though.)
An underlying problem here is that we have come to favor computers that scale out the von Neumann architecture with more cache-coherent cores and more memory. If we instead took the path of hardware built from smaller von Neumann units, partitioning small sets of cores and memory into actors that communicate with messages, we would get lower memory latency and less penalty from random access patterns.
> Linked lists are great. But they have the problem that, almost always, whatever problem you're trying to solve would be better done with a regular resizable vector.
One important exception is when you want your data structure to be persistent.
So, two things. One: 1980s machines didn't have the memory access performance gap you see today. If N list-chase operations take the same time as N sequential reads, then the linked list has the nice properties you were probably taught in school, and an actual in-memory representation of a linked list is a good choice. On today's machines that list chase might incur a sizeable cache stall, making it thousands or even tens of thousands of times slower.
Two: Lisp doesn't actually care whether there are literally linked lists behind everything. Today you would use a different data structure reflecting the hardware you have.
> 1980s machines didn't have the memory access performance gap you see today
They had similar problems. CPU with tiny caches <-> caches <-> expensive RAM in the range from 500kbytes to a few MB <-> virtual memory paging to slow disks.
For example, a typical Lisp Machine might have had 20 MB RAM. But the Lisp image it ran was probably already much larger, so paging spaces upwards of 60 megabytes were not uncommon. I had a Lisp Machine with 40 MB RAM and used more than 200 MB of paging space. Disks were very slow: we are talking about ESDI interfaces (2.5 MB/sec or less) or, later, 5-10 MB/sec SCSI 1 and 2.
I had a 600MB ESDI disk inside a 1 MIPS Lisp Machine with 8 Megawords RAM of 36bit memory + 8 bit ECC.
Thus locality played an extremely large role in usable performance. In the early days machines had to be rebooted when they ran out of memory, since a garbage collection could take a long time. Rebooting a machine took just a few minutes; doing a full GC over 200 MB of virtual memory could take half an hour.
When I was making a new Lisp image (called a world), the size was upwards of 50 MB; 100 MB was common. A special command ran for roughly 30 minutes and reordered the objects in main memory to improve locality. Another command then saved the image, which also took tens of minutes.
A big breakthrough in usability came with the introduction of the Ephemeral Garbage Collector, which only touched RAM and took care of the short lived objects, with some hardware support to identify and track RAM pages with changed content.
Features back then were:
* cdr coded lists which were allocated like vectors
* lots of other data structures like vectors, n-dimensional arrays, hashtables, records, objects, ...
* an ephemeral garbage collector with hardware support tracking changes in RAM pages
* incremental garbage collection
* a copying/compacting generational garbage collector with type sorted memory regions
* cooperation between the garbage collector and the virtual memory pager
* various manual or semi-manual memory management facilities
* incremental memory image saves
The main reason to develop Lisp Machines at the end of the 70s was to get Lisp development off of time-shared computers (with limited shared RAM and virtual memory) onto single-user workstations, where RAM and virtual memory are not shared between different users.
The same problem appeared later on UNIX machines, where Lisp systems were often among the most memory-hungry programs. They needed lots of RAM, which was expensive, so a lot of virtual memory was used; but access to Lisp objects in virtual memory was much slower than to Lisp objects in RAM. It took many years to have competitive GCs on those machines.
When RAM got more affordable and larger, things improved.
No, that's only true if you are speaking about the first implementations of LISP, up until LISP 1.5 and a similar timeframe.
By the time Xerox, MIT, TI started delving into Lisp machines, the language had already gotten support for stack allocation, value types, arrays, structures (records).
This is only a problem with the implementation of linked lists. The article talks about how linked lists are conceptual. They make a nice data structure to work with in code, and several ways could easily be conceived to optimize them so that elements are adjacent to each other.
Has there been any work on nudging the allocator towards putting all the nodes of a linked list on the same page? Or close together in memory more generally (to leverage the L1 & L2 caches)?
If you're allocating the entire linked list at once, that may well happen in practice. Otherwise, there is no good way it can happen automatically; but, compacting GCs will usually do this for you (assuming that the list elements are only reachable from the previous element).
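One manual way to get that locality without a compacting GC is to carve all of a list's nodes out of one contiguous arena. A minimal sketch, assuming a fixed capacity known up front and no per-node freeing:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Bump arena: every node for one list comes from the same contiguous
    // block, so walking the list touches consecutive cache lines. Nodes are
    // never freed individually; the whole arena is dropped at once.
    template <typename T>
    class Arena {
    public:
        explicit Arena(std::size_t capacity) { storage_.reserve(capacity); }

        template <typename... Args>
        T* make(Args&&... args) {
            // reserve() above guarantees no reallocation (assuming at most
            // `capacity` nodes), so returned pointers stay valid.
            storage_.emplace_back(std::forward<Args>(args)...);
            return &storage_.back();
        }

    private:
        std::vector<T> storage_;
    };

    struct Node {
        int value;
        Node* next;
    };

    // Usage: build the whole list inside the arena.
    //   Arena<Node> arena(1'000);
    //   Node* head = arena.make(Node{0, nullptr});
    //   head->next = arena.make(Node{1, nullptr});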
Like any generalization, this one too is, of course, incorrect.
Outside of the realm of academia, the need to keep data pieces in several containers at the same time, with no heap operations on insertion or removal and O(1) removal, is very common, especially in kernel space and embedded contexts. The only option that fits the bill - you guessed it - is the linked list.
Note that this (and the article) describes an intrusive linked list. Extrusive linked lists (like you might see in a class library or CS 101 project), where the node structure is heap-allocated separately from the data that it points to, have very few practical advantages over vectors, which is why standard libraries are increasingly defaulting to vectors even when the class is named "list".
Yes. An intrusive list puts the pointers to next & prev in the struct itself. IntrusiveList<Member>::iterator is a Member*, and member->next points to another Member. An extrusive list puts the pointers in a separate node structure, which then has a pointer to the actual member type. ExtrusiveList<Member>::iterator is an ExtrusiveListNode<Member>*, node->next points to another ExtrusiveListNode, and you access the actual Member* with node->data. Basically it's a double indirection.
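Sketched out in code (the names are illustrative, not from any particular library):

    // Intrusive: the links live inside Member itself; an iterator is just a Member*.
    struct Member {
        int payload = 0;
        Member* prev = nullptr;
        Member* next = nullptr;
    };

    // Extrusive: a separately allocated node owns the links and points at the
    // data, so reaching the payload costs a second indirection (node->data).
    template <typename T>
    struct ExtrusiveListNode {
        ExtrusiveListNode* prev = nullptr;
        ExtrusiveListNode* next = nullptr;
        T* data = nullptr;
    };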
The advantage of extrusive linked lists is that you don't need to modify the data layout of Member at all. All the linked list manipulation happens on ListNodes, which have a pointer to the actual member, which can be accessed with just a cast. That's why they're so well-suited to class libraries: you can easily write a generic ExtrusiveLinkedList that works with any data that can be represented with a pointer, hide all the traversal (and the existence of individual ExtrusiveListNodes) in accessor functions, and present a simple API with no codegen or modifications to Member.
The advantages of intrusive linked lists are 1) performance 2) no heap allocations 3) easy traversal when given a Member*. It's only a single indirection rather than a double, and usually that indirection is necessary anyway to avoid copies. Insertion and deletion are just pointer manipulation; you don't need to allocate new space for the nodes. Oftentimes your functions will take Member* anyway, and it's nice not to take or traverse the container if you just need to peek at the next element coming up.
The point of this subthread is that intrusive linked lists have very clear use cases in the kernel & embedded spaces, where these advantages are often critical. Extrusive linked lists, however, have very few advantages over vectors (which also require heap allocation but are more cache & locality friendly). In a common business app responding to user input, the difference between O(1) amortized and O(1) is negligible; you can eat the occasional pause for a vector resizing, because the user isn't going to care about a few milliseconds. They both require heap allocation. The vector will give you much faster traversal, because subsequent reads all hit cache. The vector takes less memory (1 pointer per element, with a max of 50% overhead, while an extrusive doubly-linked list is 3 pointers per element). There's just little reason to use an extrusive linked list because the use-cases where intrusive linked lists are not better are precisely the same ones where vectors are better.
Like any generalization, this one too is, of course, incorrect.
In the real world of application development, when choosing between Vector and LinkedList the answer is almost always Vector. There are, of course, certain problems where LinkedList is always the correct answer.
However, of all the problems that have chosen either Vector or LinkedList as their container of choice, a supermajority should choose Vector.
There are essentially no vectors in most of the embedded world, including RTOS. Why would you need that many extra copies of data?
Copies are not free, neither in terms of stack nor in terms of memory and its bandwidth, both of which are in short supply there.
For an example, look into how most of Zephyr RTOS is implemented. It's mostly intrusive lists (called queues there for some reason, but they support iteration) or ring buffers (msgq or mpsc_pbuf or pipe).
For bigger systems you can use red-black trees.
There is no vector data structure; you can allocate a block from the heap and manage its size manually, but it's rarely done, as most sizes are static and many data structures are preallocated by the bootloader or loaded directly from flash memory. The cache locality of these can of course be manipulated manually through memory sections.
A list does not copy by default, which is an extremely important performance characteristic. And it does not require overallocation like a vector of pointers, plus has a fast remove, especially at head or tail.
Feel free to replace the term “Vector” with “Contiguous Memory”. The relevant distinction in this conversation is “contiguous vs node pointers”. Embedded loves contiguous blocks.
A dynamic array copies memory if it regrows, inserts, or removes. The cost of that operation depends on numerous factors. But it doesn't result in "extra copies" lying around.
Oh wait maybe you mean array operations perform extra copies, not that they persist. In which case it comes down to which is more expensive: memcpy some bytes or cache inefficient pointer traversal.
It depends! But most problems for most programs are MUCH faster if they avoid LinkedList. Many people, especially new programmers, radically underestimate the penalty of pointer chasing.
Quite the opposite. When you have no dynamic allocations, vectors are out of the question, while a pool of nodes is fine and sound.
And quite often you can't copy elements from one place to another in embedded or operating systems, since many consumers may be storing a pointer to a given element. This could of course be solved by using a vector of pointers, but that completely ruins the "cache locality" argument.
Broadly yes, C++‘s vector is the equivalent of JS arrays.
But the javascript array type is much more complicated. It intelligently changes its internal structure based on how you use it - which is pretty wild.
Would you mind elaborating on js arrays intelligently changing their internal structure based on use? My background is only JS/Python and have never touched C++ so I don’t have the context of how these differ.
A C++ vector internally uses an array - a chunk of memory where each item sits in the next spot, in sequential order. E.g., an array of 4-byte ints [1,2,3] might have 1 at memory address 100, 2 at address 104, and 3 at address 108. Given an array index, you can figure out where a value is stored in memory pretty trivially.
When the array fills up, the vector will allocate a fresh array (that's bigger - often 2x bigger) and copy everything over. This is slow - because it has to copy everything - but much faster than you'd think.
So that's how C++'s vector (and Rust's Vec) work.
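You can watch the reallocation happen with a tiny sketch like this (the exact growth factor is implementation-defined, commonly 1.5x or 2x, so the printed numbers will differ between standard libraries):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> v;
        std::size_t last_capacity = 0;
        for (int i = 0; i < 1000; ++i) {
            v.push_back(i);
            if (v.capacity() != last_capacity) {  // a fresh, bigger array was allocated
                last_capacity = v.capacity();
                std::printf("size=%zu capacity=%zu data=%p\n",
                            v.size(), v.capacity(), static_cast<void*>(v.data()));
            }
        }
    }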
Javascript implementations (like v8) can swap out the implementation behind the scenes based on how you're using your array. The API is always the same - but the way the array works in memory will be different based on whatever is fastest given the way you're using your array. (V8 doesn't know ahead of time what sort of data will be in your array or what you're using it for, so it guesses then changes things up if it guesses wrong).
Here's a blog post talking about some of the internal details in V8:
The actual implementation is much more complex than this blog post suggests. (I think they also optimize based on push/pop vs shift/unshift vs other stuff).
Thanks for taking the time to write that up–I appreciate the explanation.
My only adjacent knowledge here was when I checked the V8 implementation of the `sort` method on the array prototype and saw it changes the sorting algorithm used based on array length (IIRC insertion sort for length < 10 and merge sort otherwise).
The linked list has a wonderful capability that's frequently used in C: you can do O(1) removal from a list. E.g. you can have an "object" detach itself. This is not possible in Java's standard linked list implementation, among other languages.
In general, inserting and removing elements from a vector requires memory copying.
There are programming languages that have linked lists as their fundamental data structure, like Lisp and Erlang.
For situations where I don't need random access, linked lists are hard to beat.
Linked lists also make wonderful immutable data structures.
Linked lists have wonderful filter and flatmap performance. How much memory copying would a vector require?
The problem with your analysis is that you're using big-O notation. Big-O notation is based on an abstract machine that is pretty unrepresentative of real computers. In practice, the memcpy is always going to be faster (until you get to really big sizes). Put another way, linked list is O(1) with a massive constant, and memcpy is O(n) with a tiny constant.
And actually this logic extends to other algorithms too. My rule of thumb is that a linear search (despite being O(n)) will be faster up to around n=100 than binary search (which is O(log n)) just due to the way the CPU operates.
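If you want to sanity-check that rule of thumb yourself, something like the following sketch works (n=100 and the iteration count are arbitrary assumptions; the crossover point depends on hardware, element type, and compiler):

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        constexpr int n = 100;
        constexpr int iterations = 1'000'000;
        std::vector<int> data(n);
        std::iota(data.begin(), data.end(), 0);  // sorted 0..n-1

        auto time_ms = [&](const char* label, auto&& find) {
            auto t0 = std::chrono::steady_clock::now();
            long long hits = 0;
            for (int i = 0; i < iterations; ++i)
                hits += find(data, i % n);
            auto t1 = std::chrono::steady_clock::now();
            std::chrono::duration<double, std::milli> ms = t1 - t0;
            std::printf("%s: %.1f ms (hits=%lld)\n", label, ms.count(), hits);
        };

        time_ms("linear", [](const std::vector<int>& v, int key) {
            // O(n), but sequential reads and easily predicted branches.
            return std::find(v.begin(), v.end(), key) != v.end();
        });
        time_ms("binary", [](const std::vector<int>& v, int key) {
            // O(log n), but each step's branch is essentially a coin flip.
            return std::binary_search(v.begin(), v.end(), key);
        });
    }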
This is assuming you already have a pointer of the element you're removing, or a neighbor thereof. Otherwise, you'd have to traverse a portion of the linked list, which has a significant amount of per-element overhead.
Modern CPUs detect patterns of pointer chasing common in linked list use and prefetch to cache just fine. Your comment would have been valid a decade or more ago.
And plenty of use cases are much better with linked lists than resizable vectors. Eg: queues.
I'm going to call you out on this one, because it's a bold claim, and I'd love to see an explanation and some perf numbers.
For example, I'm wondering how the CPU knows what the next item in the list to prefetch it. Unlike the next item in an array, list items could be pointing anywhere in process memory.
Also, what counts as a modern CPU in this context? Are we talking latest generation desktop CPUs from Intel, or anything after Core 2 Duo? How about mobile devices with ARM CPUs? Would those just be ignored?
Apple's M1 has a data dependent prefetcher (superset form of pointer chase prefetcher), as discovered by the researchers behind the Augury paper [1]. They discuss how it works, how and when it activates, as well as a couple other cool details which should more than answer your curiosity. They don't have benchmarks for performance, but these sorts of prefetchers do exist (and possible in hardware you're using right now!).
The prefetcher described in the paper and implemented in the M1 is nowhere near a prefetcher that would be able to prefetch linked-list nodes.
> We tested for the existence of four DMPs: both single- and two-level versions of pointer-chasing and indirection-based DMPs (Section IV). Our findings show the existence of a single-level pointer-chasing DMP....
When activated, the prefetcher described will also issue loads for pointers in a contiguous array of pointers. In particular, the pointer/addresses are already known by the core (because they themselves were presumably prefetched far ahead of time by the stream/stride prefetcher), so the core knows which addresses to prefetch. But in a linked-list, the address of the next node is not known by the core until after the node is retrieved from memory. I.e. there is an additional data dependency, and it takes a round trip to memory to resolve that data dependency.
It's the difference between prefetching B's in
A0 A1 A2 A3 A4
v v v v v
B0 B1 B2 B3 B4
and prefetching B, C, D, E in
A -> B -> C -> D -> E
The former is much easier to do than the latter as 1) the A's are easy to prefetch, and 2) the B's can be fetched in parallel. In the second, the core has to wait until B resolves before it can fetch C, and it has to wait until C resolves before it can fetch D, etc.
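Written as code, the two shapes look like this (hypothetical Node type; the point is only where the next address comes from):

    struct Node { int value; Node* next; };

    // Array of pointers: every address is known up front, so the loads are
    // independent and can be issued (and prefetched) in parallel.
    int sum_array_of_pointers(Node* const* nodes, int n) {
        int sum = 0;
        for (int i = 0; i < n; ++i)
            sum += nodes[i]->value;   // nodes[i] does not depend on nodes[i-1]
        return sum;
    }

    // Linked list: the address of the next node is stored inside the current
    // one, so each load must wait for the previous one to return from memory.
    int sum_list(const Node* head) {
        int sum = 0;
        for (const Node* p = head; p != nullptr; p = p->next)
            sum += p->value;          // p->next is only known after *p is loaded
        return sum;
    }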
It doesn’t claim linked lists are fast. And it doesn’t have any benchmarking data.
Actually, the only reference to linked lists I see is in the prefetcher section (3.7.3), where it explicitly recommends keeping consecutive items in the same 4-KByte page or in consecutive cache lines. They give a positive code example using an array, like other commenters are suggesting.
That depends more on what they link to. Not prefetching a worthless list of pointers when you could prefetch data instead is typically faster. Prefetchers have limited capabilities and capacity for loads.
The prefetcher doesn't even have to get involved for linear sequences, it would only get involved after a cache miss threshold is met. The fastest and simplest prefetcher, L1, is stream based and will work great for sequential data.
Prefetching just one element ahead does fuck-all when RAM is 100x slower than cache, especially if you need to precharge and activate a new row, which is always the case when you're spraying tiny objects all over the place.
So don't spray tiny objects around. :) Keep them nicely tucked away in an arena.
Vectors actually tend to create a spray of tiny objects... As opposed to a true dynamic array which has limitations in resize. If you're lucky, they will keep a single dynamic array per vector. You're usually not this lucky with bigger data.
If you never free your objects, you may as well use an array anyway - it'll be much more simple. And if you use an arena with an object pool, you're back in cache miss territory.
Vectors only "create a spray of tiny objects" if you have a vector-of-references or something like that. And even then, reading data requires 2 reads from memory (1 to read the pointer in the vector, and another to read the object). Finding item N in a linked list will require N reads (since you need to read each item before the target). Yes, resizing a vector is O(n). But every time I measure memcpy, it always goes faster than I expect.
If you disagree, try to prove it with code. At least one of us will learn something that way.
"I made an extra test that wraps matrix4x4 in std::unique_ptr."
And that is your test? One piece of code? Not even presenting disassembly? Who the hell knows what your compiler wrought in response to your "std::unique_ptr" and "matrix4x4"?
Sorry for my ignorance – linked list matching the performance of vector is something new to me. I would like to learn more. The best way to prove your point is to show us a benchmark. However, I couldn't find one with a quick google search.
Well I’ve provided 1 benchmark to your 0. I’d say I’m ahead.
My code benchmark is actually far more pre-fetch friendly than a LinkedList because it can prefetch more than one “node” at a time. In a LinkedList you can’t prefetch N+1 and N+2 at the same time because N+2 is dependent on the result of N+1.
I’m always open to learning new things. Modern compiler magic is deep and dark. If sometimes a compiler magically makes it fast and sometimes slow that’d be quite interesting! If you have any concrete examples I’d love to read them.
It's a link to a document on architectures well over a decade old - the most recent mentioned is the original Core architecture (from 2008). Honestly, based on your first comment I thought you were implying something about content-dependent prefetch or other techniques that I'm familiar with from the academic literature but unaware of ever being used in mainstream hardware.
> Organize the data so consecutive accesses can usually be found in the same 4-KByte page.
> • Access the data in constant strides forward or backward (IP Prefetcher).
> Method 2:
>• Organize the data in consecutive lines.
> • Access the data in increasing addresses, in sequential cache lines.
Nothing new here that contradicts the GP's skepticism. Certainly not enough evidence for you to be a dick.
Ok, but the hardware data prefetch functionality seems basically unchanged:
"Characteristics of the hardware prefetcher are:
• It requires some regularity in the data access patterns.
— If a data access pattern has constant stride, hardware prefetching is effective if the access stride
is less than half of the trigger distance of hardware prefetcher.
— If the access stride is not constant, the automatic hardware prefetcher can mask memory latency if the strides of two successive cache misses are less than the trigger threshold distance (small- stride memory traffic).
— The automatic hardware prefetcher is most effective if the strides of two successive cache misses remain less than the trigger threshold distance and close to 64 bytes."
Google for Intel hardware prefetcher and you'll turn up some info.
Re: benchmarks. Not likely exactly what you're looking for, but the GigaUpdates Per Second benchmark (https://en.wikipedia.org/wiki/Giga-updates_per_second) is sort of the most-pathological-case of pointer jumping where the jumps are random. Prefetchers don't do well with that (for obvious reasons) - they tend to work best when there is some pattern to the accesses. Linked data structures often live somewhere in between - they may start with good, regular stride patterns, but then over time as elements are added and moved around, it turns into a tangle of spaghetti.
Edit: I should add more. While prefetchers do exist in hardware, they're a bit of a murky subject. Sometimes they speed things up, sometimes they do so badly they slow everything down. They can be very workload dependent. And to top it off, hardware vendors can sometimes be a bit opaque when explaining what their prefetching hardware actually does well on. It's probably safest not to assume your hardware will help you when it comes to tangled pointer-jumping code, and if you can avoid it via a different data structure it's probably good to do so. That said, other data structures may entail other tradeoffs - better cache usage for lookups, but potentially more expensive insertions. As usual with data structures, it's a game of balancing tradeoffs.
Cache prefetching works best for linear accesses like with a vector, not for random pointer lookups. Prefetchers are also going to have an even harder time with double indirection, since they're unable to see which value is going to be fed to the lea.
Prefetching is also expensive in and of itself; the CPU doesn't do it until you're already incurring cache misses, which makes those first misses even worse. There are also multiple prefetchers in the CPU, and the L1 prefetcher is very basic and is only going to help with contiguous accesses. Your allocator might be doing some nice things for you that let one of the L2 prefetchers help, so I would expect you'll see better L2 performance due to prefetching.
If you have an example of a linked list being as fast as a vector for iteration, please do show it.
... I didn't say it was for random pointer lookups. That's the pathological case that was the primary reason the Cray X-MT (formerly Tera MTA) was invented. As I said, when you have pointers with some regularity (hence my reference to regular or patterned striding), then you may get some benefit of prefetching when you're not just dealing with linear contiguous access. The "may" is important there - whether you do or don't get it is workload / data dependent. As I also said, more likely than not over time the data structure will lose regularity as pointers get re-wired, and the benefits of prefetching go out the window.
I'm not sure what you "don't buy" - I didn't say definitively that prefetching does or does not always help. Sometimes it does, and when it does, it's highly dependent on the regularity of the way the pointers cause you to stride through memory. I even basically agreed with you -- to quote me, "It's probably safest to not assume your hardware will help you when it comes to tangled pointer jumping code". You appear to want to argue for some reason.
> As I said, when you have pointers with some regularity (hence my reference to regular or patterned striding), then you may get some benefit of prefetching when you're not just dealing with linear contiguous access.
I would need a source to believe this. The only way I can imagine this working is that the L2 prefetcher might incidentally work well with your allocator - your allocator might try to hand out more or less colocated or evenly strided data.
> I'm not sure what you "don't buy"
This statement, primarily:
> Modern CPUs detect patterns of pointer chasing common in linked list use and prefetch to cache just fine
To me, this statement reads as:
1. Prefetchers in CPUs can prefetch data by examining loads that have yet to be executed
2. That they do this well, or at least that's how I read "just fine"
To my knowledge neither of those is true.
> You appear to want to argue for some reason.
I'm not trying to argue at all, I'm trying to learn. What you're saying does not sound right to be but it would be extremely interesting to me if I learned that CPUs are doing something that I'm not aware of.
If you're actually saying that an allocator may end up striding allocations such that the prefetcher can reliably predict where to load next, sure I'll buy that, at least for L2. That's a fine thing to say, it just isn't what I took away from your initial statement, which is what I wanted to drill into because, again, it would be very interesting to me.
Sorry if I'm coming off as argumentative, I was rushing that previous comment, which isn't really fair to put on you. I genuinely just am curious and if this is a case of miscommunication, no worries at all. I'd just be remiss if I didn't try to suss out if I'm wrong.
how do linked lists fit the regularity requirement? the only regularity in this pattern of access is the offset at which the next address to fetch is stored. this seems like a nontrivial amount of logic and state for the prefetcher to keep track of, unlike something like base * scale + offset when e.g. traversing a 2d array.
It is more complex for hardware to keep track of for sure. It's been a feature of high performance cores for more than a decade though. Kernels tend to be based off of linked lists, so it's a massive win typically.
Look, people who are excited about this and just want to learn are asking for proof. You can't just make a claim like "it's been a feature of high performance cores for more than a decade" and provide 0 evidence for the claim, because it's not common knowledge and explicitly contradicts what is commonly taught.
I can time how long it takes to iterate through an array and show that it is faster than iterating through a linked list with the same number of elements.
Now, what optimization would I have to implement to end up with a program that iterates through this linked list faster than the equivalent array?
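For what it's worth, the experiment is easy to run. A sketch (numbers are illustrative; note that a freshly built std::list often gets unrealistically good locality because the allocator hands out consecutive chunks, so real-world lists usually look worse):

    #include <chrono>
    #include <cstdio>
    #include <list>
    #include <numeric>
    #include <vector>

    int main() {
        constexpr int N = 1'000'000;
        std::vector<int> vec(N, 1);
        std::list<int> lst(N, 1);

        auto time_sum = [](const char* label, const auto& container) {
            auto t0 = std::chrono::steady_clock::now();
            long long sum = std::accumulate(container.begin(), container.end(), 0LL);
            auto t1 = std::chrono::steady_clock::now();
            std::chrono::duration<double, std::milli> ms = t1 - t0;
            std::printf("%s: sum=%lld in %.2f ms\n", label, sum, ms.count());
        };

        time_sum("vector", vec);
        time_sum("list  ", lst);
    }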
I'm sure CPUs will try their best to optimize these usage patterns. But I suspect vector / array based approaches will still always outcompete linked lists in actual benchmarks.
That's always been my experience of them.
Linked lists still show up in the kernel, in redis, in game engines and so on because you can't build fast intrusive collections using arrays.
Making linked lists "as fast as we can make them" doesn't mean they're competitive with a vector.
A bunch of people keep saying this with no substantiation - the best we saw was a link to the latest Intel optimization manual that quite specifically refutes this claim. I even quoted it upthread.
It is extremely difficult, maybe impossible, to design a prefetcher that can predict the next cacheline(s) to prefetch while traversing a linked list. I am not aware of a single CPU that can do this consistently. For instance, if you run multichase (a linked-list chaser) on GCP servers, you generally get the expected memory latency (~70-100ns, depending on the platform).
For most use cases a ring buffer either with or without a maximum length will do much better as a queue than a linked list will.
There are some niche use cases where linked lists are good. Lock-free queues are one, and another big set of use cases is where you need hard O(1) insert even with a high constant factor, not amortized O(1).
The place I see linked lists pop up the most is when implementing LRU caches. You need a linked hashmap of sorts, where the hashmap gives O(1) lookup to the linked list node in question, and then you can extract that node and move it to the head of the queue in O(1) as well.
This is a special case of where they also appear: linked hashmap implementations, where you want a hashmap that has a consistent iteration order.
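A minimal sketch of that shape using the standard containers (std::unordered_map pointing at std::list iterators; splice does the O(1) move-to-front). Not tuned, assumes capacity >= 1, and is just here to show the pattern:

    #include <cstddef>
    #include <list>
    #include <optional>
    #include <unordered_map>
    #include <utility>

    template <typename K, typename V>
    class LruCache {
    public:
        explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

        std::optional<V> get(const K& key) {
            auto it = index_.find(key);
            if (it == index_.end()) return std::nullopt;
            // Move the accessed node to the front (most recently used) in O(1);
            // other iterators stay valid.
            order_.splice(order_.begin(), order_, it->second);
            return it->second->second;
        }

        void put(const K& key, V value) {
            if (auto it = index_.find(key); it != index_.end()) {
                it->second->second = std::move(value);
                order_.splice(order_.begin(), order_, it->second);
                return;
            }
            if (order_.size() == capacity_) {          // evict least recently used
                index_.erase(order_.back().first);
                order_.pop_back();
            }
            order_.emplace_front(key, std::move(value));
            index_[key] = order_.begin();
        }

    private:
        std::size_t capacity_;
        std::list<std::pair<K, V>> order_;   // most recently used at the front
        std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> index_;
    };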
I’ve seen them show up in lots of situations through intrusive lists, which I think is the general form of the example you’re giving.
Intrusive lists are used in the Linux kernel, in physics engines (eg Chipmunk2d) and game engines. The way Redis uses them for TTL expiry follows the same pattern.
High-performance SPSC queues are always implemented with an array.
If you need hard O(1) insert you can usually just allocate an array much bigger than you will need. The combination of hard performance requirements and willingness to wait for an allocator all the time is unusual.
>where you need hard O(1) insert even with a high constant factor
You will need a truly concurrent memory allocator for that, too... or if you are just the kernel (one of the reasons the linux kernel can benefit from linked lists)
Overall, allocating new objects/memory is not guaranteed O(1).
I think there's more to cache locality than prefetching. In an array, consecutive elements will usually be adjacent in physical memory (excepting page boundaries), so each cache line load where a cache line is larger than the array element will "overfetch" into the next element. This should mean fewer total cache loads and thus more efficient use of memory bandwidth in situations where it is constrained (such as walking a list).
Then again, if you have control over allocations you can prefetch whole chunks of data to which the list refers making this point moot.
Often lists refer to arrays or sections of memory. The performance loss if any appears in bigger structures where you do want to have explicit control anyway.
There are a lot of factors involved, but given a limited number of cache lines that can be stored in low level cache, I would think there'd be at least some performance penalty for prefetching from non-linear blocks of memory vs the inherent spatial locality in an array/vector, if only because it'd cause more old cache to be evicted.
Even if your linked list is backed by an array and all the elements are right next to each other, computing some cheap function of the elements of a linked list is incredibly slow compared to the alternative because of the unnecessary and very long dependency chain.
A ring-buffer based queue is mostly better than a linked list but it is not necessarily the best. The typical STL deque implementation [1], with all its complexity, is usually faster than a ring buffer.
Even with a perfect prefetcher, a linked list would still be slower to iterate than a vector, as it is inherently serial (unless your CPU does address prediction, and I don't think any CPU does).
> I thought of writing this blog post full of all the things I love about linked lists.
The blog post is short, and I think the author covered about everything.
Yeah, linked lists are sometimes appropriate, especially intrusive linked lists (the author called them "embedded" linked lists). The hate they get is a backlash against CS programs that introduce people to them early and don't emphasize enough the reasons you should usually pick something else. If we have to err on one side or the other, I'd rather it be telling people not to use them.
btw:
> Redis can be wrong, but both Redis and the Linux kernel can’t.
Yes, they can, and I'd rather avoid argument from authority altogether.
Both Redis and Linux are C projects, and in C it's customary to roll your own containers. Without generics, and with a culture of avoiding small dependencies, it's preferred to roll your own simple list than to reach for more advanced reusable data structures.
So I don't think it's a particularly enlightened decision by these projects to use so many lists on their merit, but rather a programming pattern especially common in the C language they use.
Meh, that's true of some C projects, but AFAIK Linux/Redis have good reasons here, and if I were to criticize Linux for something, it wouldn't be unwillingness to write more code.
I still prefer to evaluate each decision on its merits. Linux has made some great decisions. Also some terrible ones—e.g. fsync semantics [1], attitude to fuzzing. Redis probably has too. They can agree on something and be wrong.
If you have a pointer to an element of a doubly linked list, you can always remove it in a few processor cycles. You can also always add or remove from the front or back of the list in a few cycles.
This means that you can do those things under a spin lock or with interrupts masked. You cannot really resize a dynamic array under a tight spinlock, or modify a tree that should be kept balanced.
Another use case is free lists, a list of preallocated structures where you can grab one or put one back quickly under spinlock.
These are some examples of how lists are used in kernel land.
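A sketch of the free-list case in user-space C++ terms (Buffer, SpinLock, and the sizes are all made up for illustration, not any kernel's actual API):

    #include <atomic>
    #include <mutex>

    struct Buffer {
        Buffer* next = nullptr;
        unsigned char data[256];
    };

    class SpinLock {
    public:
        void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) {} }
        void unlock() { flag_.clear(std::memory_order_release); }
    private:
        std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
    };

    // Free list of preallocated buffers: both operations are a couple of
    // pointer writes, so the time spent holding the spinlock is tiny.
    class FreeList {
    public:
        void put(Buffer* b) {
            std::lock_guard<SpinLock> g(lock_);
            b->next = head_;
            head_ = b;
        }
        Buffer* get() {                       // returns nullptr if exhausted
            std::lock_guard<SpinLock> g(lock_);
            Buffer* b = head_;
            if (b) head_ = b->next;
            return b;
        }
    private:
        SpinLock lock_;
        Buffer* head_ = nullptr;
    };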
I'm not saying there are no uses. Concurrent lock-free algorithms are an especially useful case, and a e.g. linked list of on-stack mutex waiters is a very clever approach that I don't think has a better replacement. But I doubt that lists are as common as they are because every place really needs them.
VecDeque can cheaply push/pop at both front and back too. Arrays have swap-remove. You may not need to grow if you preallocated enough capacity. Freelists can be a stack, or you can choose a data structure that doesn't allocate objects individually so you don't need the freelist in the first place.
Another thing that is C-specific is that most objects are treated as unmovable and their address is their permanent identity. Without language features for prevention of dangling pointers, moves are asking for trouble. This favors linked lists and pointery data structures over contiguous containers and inlined data.
The problem with linked lists is that they are rarely the ideal solution to any problem. They are convenient in C, because of C’s inadequacies, but in C++ and Rust they are marginalized because the tools exist to use superior data structures.
You almost always want to unroll your linked lists, so that items are fetched in blocks. But this rarely happens in C because it's premature optimization in C. In C++ or Rust you can write the code once in a generic data structure and enjoy the benefits to the end of time.
That all said, I think we should still teach them in CS because understanding linked lists is the first step to understanding trees.
OS kernels tend to use them simply because they don't have malloc at their disposal and so it's most practical to stick to fixed size memory chunks, right?
Linux kernel's malloc is probably more resilient to the long-term (as in, long running) problems you would get from continually resizing an array with realloc in userspace (ie. massive heap fragmentation and eventually dying in flames)
>Yes, they can, and I'd rather avoid argument from authority altogether.
Proof by authority is the worst. That said, the Linux kernel can have valid reasons here, but what it uses are intrusive lists, with the data itself holding the pointer to the next element.
In my experience, the linked lists that are useful are intrusive linked lists, where the data you care about has an embedded next/previous pointer, as opposed to just storing an arbitrary object inside a linked list.
One example would be if your thread structure has these embedded pointers, you can easily add/remove a thread to a linked list of threads waiting for some resource without any allocation just by pointer manipulation.
This is a very useful pattern when you have small collections of large objects, where the vector/array implementation starts to fall flat on its face. Kernels and high-performance servers use these a lot for state tracking.
If you combine intrusive linking with a pool allocator, you can have decent locality too.
That's absolutely right, but it's not just large objects. Any situation where you follow links rarely compared to your other memory accesses then the cache performance of links becomes negligible and the benefit of simplicity and O(1) operations can shine. Vectors for scrubbing through things without changing their structure, lists and trees the opposite.
Scrubbing through things quickly is empirically more common, hence the standard advice to use vectors by default.
Or how about an object that can't easily be relocated like, you know, a stack for a kernel thread which also contains the task struct and needs to be moved around between different scheduler queues.
But yeah, I understand they took the task struct out of the kernel stack now.
Edit: difficult being an understatement, you'd need to solve the halting problem to be able to rewrite any self-referential pointers to stack objects wherever they may be.
The other benefit of this kind of data structure is the ability for one object to participate in multiple collections. Otherwise you would have to do some kind of hash map structure to point to them.
Isn't that the opposite, though? If you store the next/prev pointers in the data structure itself, you can't use the same object in multiple collections.
If you have a standalone linked list data structure, then you can, as long as you have a good means of tracking the lifetime of the data itself.
FWIW, I recently implemented a safe Rust crate around the Windows variant of intrusive linked lists [1] and spoke about it at EuroRust [2].
I did it only for compatibility and would still prefer vectors whenever I can use them. Nevertheless, I put some effort into coming up with a useful and high-performing implementation. To give an example: my `append` function is O(1) and not a poor O(N) version like the one you rightfully criticized in your tweet.
The most interesting thing about linked lists to me is how trivially it can be made lock-free multithreaded with atomics.
An array/stack can be made multithreaded but it is non-trivial to handle the edge-cases. In particular, the ABA problem is rather difficult. I've seen many solutions to it (ex: a synchronization variable that puts the stack into "push-only" mode and then "pop-only" mode. There's no ABA-problem if all threads are pushing!)
However, pushing/popping from a Linked List stack requires no such synchronization at all. Simply compare-and-swap the head (and on failure, try again). Its about as simple as you can get when it comes to atomic / lock free patterns.
Still, that's easily solved: 64-bit version number + 64-bit pointer (or 32-bit version number + 32-bit pointer) for a 128-bit (or 64-bit) compare-and-swap.
All modern CPUs support 128-bit CAS.
EDIT: The version number is incremented by +1 each time. It is unlikely that you'd overflow 64-bit version number and risk an ABA problem, though 32-bits can be overflowed surprisingly quickly in today's computers.
--------
EDIT2: Note: I'm pretty sure (but not 100% sure) that the Head of a linked-list stack can be "simply" compared-and-swapped to remain lock-free and 100% valid (ie: 64-bit compare and swap over the 64-bit pointer). I did not mean to imply that you can "CAS any arbitrary node of a linked-list". A fully generic linked list is possible and I've seen it with the 128-bit CAS operator, but that wasn't what I was going for originally.
> EDIT2: Note: I'm pretty sure (but not 100% sure) that the Head of a linked-list stack can be "simply" compared-and-swapped to remain lock-free and 100% valid (ie: 64-bit compare and swap over the 64-bit pointer).
No, you cannot. The problem is what you're comparing and swapping into the head during a pop. You want to do the moral equivalent of `current = list; list.compare_exchange(current, current->next)`, but current->next might have changed if someone else popped the original head, pushed something else, and then pushed the original head again.
You need double CAS or LL/SC or a more complicated scheme to make this work.
It might not be immediately obvious that this can happen, but in practice, this scenario happens very frequently, because if you free a node and then malloc a node, you're very likely to get the same address back with most memory allocators.
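For reference, the push half really is just a single-word CAS loop; it's pop that runs into the ABA hazard described above. A sketch in C++ with std::atomic (pop deliberately omitted):

    #include <atomic>

    struct Node {
        int value;
        Node* next;
    };

    class LockFreeStack {
    public:
        void push(Node* n) {
            n->next = head_.load(std::memory_order_relaxed);
            // Retry until the head we read is still the head when we swing it to n.
            // On failure, compare_exchange_weak reloads the current head into n->next.
            while (!head_.compare_exchange_weak(n->next, n,
                                                std::memory_order_release,
                                                std::memory_order_relaxed)) {
            }
        }

        // pop() is intentionally not shown: a naive CAS pop is exposed to the
        // ABA problem and needs a versioned/tagged head, hazard pointers, or a
        // similar scheme, as discussed above.

    private:
        std::atomic<Node*> head_{nullptr};
    };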
In scripting languages it is a truism that anything you would want to do with a linked list is probably better done with an array. And given the overhead of working in the language versus things implemented in C, this is generally true.
However I have a fun use for linked lists where this is NOT true.
Anyone who has played around with algorithms has encountered dynamic programming. So you can, for example, find the count of subsets that sum to a particular number without actually enumerating them all. But what if instead of counting them, you wanted to actually find some?
The answer turns out to be that instead of using dynamic programming to get a count, you use dynamic programming to build up a data structure from which you can get the answer. And the right data structure to use turns out to be...a type of linked list! There is no faster array equivalent for what you get.
In other words, with a few extra fields for clarity, here is the basic structure for subset sum of positive integers:
{
current_sum: ...,
count_of_solutions: ...,
current_value: ...,
solutions_using_current_value: (link in one dim of linked list),
solutions_using_previous_values: (link on other dim of linked list),
}
I leave figuring out the dynamic programming code to generate this as a fun exercise. Likewise how to extract, say, the 500'th subset with this sum. Both are easy if you understand linked lists. If not...well...consider it education!
Suspicion: this is because a positive integer is a linked list. Every integer is defined by a unary root element and then a series of successors until its cardinality matches the corresponding symbol (counting is correspondence).
> find the count of subsets that sum to a particular number without actually enumerating them all.
Would the generating formula for partition numbers[0] work, or am I misunderstanding this problem?
(I think actually generating the subsets is necessarily a dynamic programming problem...)
Given a particular set, the question is enumerating the subsets that add to a particular thing. The fact that numbers themselves could be represented as a linked list has nothing to do with it. And there is no particularly useful generating function to use either.
For example there are 303 primes less than 2000, and 47,839,398,752,301 subsets of them add up to 2000. If you arrange them in lexicographic order, what is the 5 trillionth one?
>When people asking my opinion for Rust, I loved to share them the Linkedin List implementation link:
This LinkedList obsession is a bit bizarre to me, and tends to come from older programmers who come from a time when coding interviews involved writing linked lists and balancing b-trees. To me, though, it also represents the stubbornness of C programmers who refuse to consider things like growable vectors a solved problem. My reaction to the LinkedList coders is not "well Rust needs to maintain ownership"; it's "why does your benchmark for how easy a language is involve how easy it is to fuck around with raw pointers?"
LinkedLists are a tool, but to C programmers they are an invaluable fundamental building block that shows up early in any C programmer's education, due to how simple they are to implement and the wide range of use cases they can serve. But they are technically an unsafe data structure, and if you're willing to let some of that stubbornness go and finally accept some guard rails, you have to be able to see that a data structure like a linked list will be harder to implement.
It has nothing to do with the language; implementing LinkedLists with any sort of guardrails adds a ton of complexity, either up front (e.g. the borrow checker) or behind the scenes (e.g. a garbage collector). When you accept this fact, it becomes ludicrous to imply that a LinkedList implementation is a good benchmark for the ergonomics of a language like Rust.
In the Rust ecosystem philosophy, linked lists are the kind of thing you want to write once, very carefully, then re-use a lot. That's exactly the kind of code where unsafe can be reasonable. It's not like it would actually be safer in any other language that didn't have an explicit unsafe keyword. More aesthetically pleasing, perhaps, but not safer.
You're acting like it's some epic burn on Rust that there's some ugly code in its stdlib. It's not. Stdlibs are like that. Furthermore, as others have pointed out, linked lists in particular are a pathological case for Rust, meaning you're pointing and snickering at a special case of a special case. Most people never have to write or look at that kind of code.
Functional programming languages like Haskell, ML, and Lisps use linked lists very extensively and rarely have safety problems as a result. On the contrary, because linked lists allow you to build a list up efficiently without mutation, they give you a good way to avoid all the safety problems that come from mutation. In general just about any kind of data structure in these languages is some sort of weird linked list.
This isn't the best solution for all cases! Sometimes performance is more important than safety, and there are things you can't do as efficiently without mutable data. And sometimes (especially for systems that are reactive rather than transformational, in Harel and Pnueli's terms, or for libraries) a garbage collector is an unacceptable cost.
And Rust is actually reasonably good at singly-linked lists; it's just doubly-linked lists (which inherently feature mutation) that are troublesome. Singly-linked lists with sharing through Arc are a little more expensive in Rust than in something like GHC, Ikarus Scheme, or OCaml, but that's hardly a damning criticism of Rust.
But you seem to be claiming that linked lists are unsafe in any language, and to anyone familiar with functional programming, that claim is obvious nonsense. Did you intend it?
I should have been more specific about doubly linked lists. And I think the implicit context for comparisons with Rust is usually pointer-heavy non-GC languages, especially given OP, but maybe I should have specified that, too. It's certainly true that DLL's are tricky to write, and tricky to mechanically prove correct, which is mainly what 'unsafe' actually means.
I would think that the implicit context for safety comparisons with Rust would be other languages designed with safety as a goal. Those languages are almost all GCed! ML, Haskell, Scheme, Oberon, Coq, Erlang — all GCed. That's because GC helps enormously with type- and memory-safety. In the example in question, it ensures that your doubly-linked list manipulation cannot cause undetected type errors, use-after-free errors, double-free errors, or memory leaks; this last guarantee goes beyond what Rust can offer in this case. In ML, Haskell, and Oberon, it even ensures that all type errors are detected at compile time instead of run time.
There are a few languages designed with safety as a goal that do not normally use GC: Pascal, Ada, COBOL, and MISRA C. But they suffer enormously in expressivity, are not pointer-heavy, and are generally only viable alternatives in restricted domains.
The usual pointer-heavy non-GC languages are C and C++, and these have no concern whatsoever for safety, so it seems like a weak-man argument to target linked-list manipulation in them as unsafe relative to Rust.
The correctness criteria for doubly-linked lists are indeed not within the capability of Rust's (or Haskell's, or ML's, or Oberon's) static checking. That doesn't imply that it's impossible to statically verify doubly-linked-list manipulation code, and in fact I think Ada SPARK is capable of doing it. Certainly it's within the scope of what you can prove with Coq or Isabelle/HOL.
I did say tricky, not impossible. ;) IIRC proving pointer-pushing code is an active research area, but I could easily be out of date. Not cheap enough to be mainstream yet, anyway.
We're deep into semantics territory with "what's the right context for comparing Rust", but I would strongly argue that yet another language safe via GC is not interesting for its safety. As you point out, it's been done a lot. Rust is interesting because it pushes safety into a new level of runtime performance and (to some extent) lower hardware abstraction. AIUI that's the motivation for Rust, where most of the marketing and enthusiasm is, etc. And if you don't need that speed/control, Rust isn't much of an improvement: just use a GC and have more fun (unless you just think Rust is neat, I guess, but whatever).
So it's the languages already in that "low level" domain (broadly defined) that are its real competition IMO, C/C++, upstarts like Zig/Nim/Crystal, etc, languages where linked lists are on the cards. Especially for the problems where safety has previously been seen as too costly in either runtime or development time (no one is coding to fast moving web standards in Ada/SPARK, methinks).
And I don't think I implied that those languages' list manipulation is less safe than Rust's, rather that Rust's isn't any worse, besides being uglier. I made almost the same point from the other direction in another sub-thread: there's no use denying that Rust code for DLL's is uglier for no safety benefit on that particular task.
As others have pointed out, the whole question of linked lists in Rust is a huge distraction from its actual benefits and weaknesses.
Well, Rust can offer some kinds of guarantees that GCed languages can't, in the cases where its type system applies. It can allow you to safely share mutable data between threads, for example, or to guarantee that opened files get closed in a timely way. So I don't think it's interesting only because it pushes safety into a new level of runtime performance, though often that is also true.
I think it's noteworthy that, not only in the doubly-linked-list case but even in cases that use Arc (which doesn't require `unsafe`), Rust is less safe than mainstream GCed languages like Java (and arguably Haskell and OCaml). Rust can leak memory; those other languages can't. Its safety guarantees in those cases are a proper subset of theirs.
Unless I'm misunderstanding the situation? I've only written very small amounts of Rust.
Fair point about other kinds of guarantees, I suppose. Now that you mention it, there has been some speculation about what a GC'd Rust would offer (worth remembering that IIRC early versions of Rust had a GC, until they figured out they could do without it). However I don't think that significantly changes my main point about where Rust primarily competes today, and its primary value proposition.
Re safety: Rust has, um, a very specific idea of what "safety" means. It's basically things that could cause RCE, or at least incorrect writes I guess. Memory leaks are specifically not considered "unsafe". :) And really, the worst you can do with a memory leak is DoS, until the target reboots. Handling mutable data across threads is included, though, which Java at least is somewhat famously not great at; it's really easy to write Baby's First Race Condition Demo in Java. So it feels really weird to say Rust's safety guarantees are a strict subset of Java's.
OTOH I agree it's hard to imagine Rust beating out Haskell for raw safety guarantees by any measure, and I have no idea for OCaml (despite having written a lot more Ocaml, heh).
Whatever. They're definitely pointing and snickering. I don't think there's any way to read their post but as a criticism of Rust the language based on this one piece of stdlib code, and that's nonsense. That irritates me. So sue me.
There are reasonable arguments to be had about Rust. I wish people would do those instead of all the nonsense you usually see.
on the contrary, I think more and more people are unwilling to hear any criticism, even constructive criticism, of "their thing".
Rust is not perfect. I wont go into its faults, but you know what they are. I think overall, Rust is a great language, but you have to be realistic and understand that some stuff you ignore in favor of the holistic view, others might not be able to. So sometimes maybe just take the criticism and deal with it, or if you're in the position, do something to fix the problem.
> I wish people would do those instead of all the nonsense you usually see
you seem to want to place the blame squarely on the critics, when the reality is that many people in your position are seemingly intolerant of any criticism of "their thing", even constructive criticism.
Ah, I see, since I'm defending Rust in this sub-thread, it's impossible for me to ask for constructive criticism instead of bullshit without it being "seemingly intolerant", while you're the reasonable one for doing the exact same thing. Basically just assuming I'm an idiot and/or arguing in bad faith. Lovely.
They linked to the source code. What bar do they have to reach before you honor it with being constructive? The point is, there is no bar. Unless it's praise of Rust, then it's trolling in your eyes.
The source code of the standard library is usually an awful way to show off any language. Pointing and gawking at such a thing as criticism is very bad criticism.
Yeah, I already made that point in my root comment. It was pretty much my whole thesis, in fact. Doesn't matter, apparently! svnpenn is determined to confuse me for someone dumber. :)
Considering that unsafe Rust basically just allows raw pointer access (see link below), which is already a thing you can do normally in C, I do not see how that's a very good argument against it, honestly.
As for the Option type, it is exactly what it says. It's an optional, non-nullable node generic over type T. I suppose generics, optionals and the idea of non-nullable types might be exotic at first, but this is not a problem with Rust as much as it's a problem with C not being able to express these concepts in the first place and instead expecting you to spray null checks all over the place!
Not to be overly contrary here, but you "need to spray null checks" probably just as much in Rust as in C (in this specific example, not in general). Those prev/next pointers are being modified in unsafe code where either you're using NonNull::new(), which contains the null check, or you're using NonNull::new_unchecked(), which means you need to convince yourself all the invariants hold true. The situation is roughly equal in C (i.e., once you prove in a module that fields cannot be NULL, there's no need to add lots of redundant extra NULL checks in all users of the module).
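To make that concrete, here's a tiny sketch (the `demo` function is made up, not from the code being discussed) of the two ways an unsafe block typically turns a raw pointer into a `NonNull`: `NonNull::new` keeps the null check, `NonNull::new_unchecked` shifts the burden of proof onto you.

```rust
use std::ptr::NonNull;

// Illustrative only: `demo` is a made-up function, not from any library.
fn demo(raw: *mut u32) -> Option<NonNull<u32>> {
    // Checked path: the null test lives inside NonNull::new.
    let checked = NonNull::new(raw);

    // Unchecked path: undefined behavior if `raw` is null, so it's only
    // valid once you've convinced yourself the invariant already holds.
    if !raw.is_null() {
        let _unchecked: NonNull<u32> = unsafe { NonNull::new_unchecked(raw) };
    }

    checked
}

fn main() {
    let mut x = 5u32;
    assert!(demo(&mut x).is_some());
    assert!(demo(std::ptr::null_mut()).is_none());
}
```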
When explaining why the difficulty of modeling linked lists in Rust doesn't matter, I like to share the following book, "Learn Rust With Entirely Too Many Linked Lists", written by the woman who wrote the implementation you've linked above: https://rust-unofficial.github.io/too-many-lists/
So I glanced at it, expecting something much much worse, but I was surprised to see it wasn't that bad. Even if you look at the final code for the final implementation, it doesn't seem much more complex than what you'd have to write in C. And a good chunk of the code is implementing traits from the stdlib, stuff that isn't strictly necessary, but makes it easier to make the API of the deque idiomatic and work well with other stdlib (and non-stdlib) APIs.
And regardless, this book is teaching from the perspective of intentional super-hard-mode. Most Rust developers will never have to write that much unsafe code.
Oof. Yeah, standard library code tends to be harder to read on average than application code. But at least I can mentally parse the Rust code. Heck, it even looks like a linked list impl. The C++ code looks barely better than line noise. I'm sure I could wrap my head around it, but ho boy, it's far worse than Rust by a long shot.
And this is with several years of C++ experience, and maybe a few months of Rust on and off.
Rust's list only supports a default allocation mechanism. For an actual comparison, you'll want to show us a Rust implementation which is parametrized over allocators.
Having said that - yes, it's not pretty. About the same length though.
I don't know Rust but I would guess that an Option<NonNull<Node<T>>> would be a Node of type T, that cannot be set to null, but can be optional. This is a type I would want to return from a function that could either return nothing, or a pointer to a real thing.
As for the unsafety, I would assume the authors of collections know what they are doing and do that for performance.
`NonNull` is actually a wrapper around the pointer type, with the added invariant that the pointer cannot be null. The compiler is made aware of this, which allows `Option<NonNull<T>>` to be just a plain pointer, where the `None` variant of the option corresponds to a null pointer and the `Some(ptr)` case corresponds to a non-null pointer.
It's an interesting suggestion to impose Option semantics on raw pointers by default (presumably with some alternative way to receive a nullable pointer, for completeness). But in the meantime, in your own codebase, it's easy enough to do `type Ptr<T> = Option<NonNull<T>>`.
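For instance, a minimal sketch (`Ptr` and `Node` are made-up names): the alias keeps Option semantics in the source, while the null-pointer optimization keeps the representation down to a single machine pointer.

```rust
use std::mem::size_of;
use std::ptr::NonNull;

// Made-up names for illustration: a project-local alias for "nullable pointer".
type Ptr<T> = Option<NonNull<T>>;

struct Node<T> {
    value: T,
    next: Ptr<Node<T>>, // None plays the role of the null `next` pointer
}

fn main() {
    // Thanks to the null-pointer optimization, the Option costs nothing:
    // Option<NonNull<T>> has the same size as a plain raw pointer.
    assert_eq!(size_of::<Ptr<Node<u32>>>(), size_of::<*mut Node<u32>>());
}
```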
The problem there is that NonNull is exclusively a wrapper around *mut T. Using it discards the distinction with *const _ pointers. I think that has implications for correctness with Miri.
I think if you tried to do that, you'd basically be mixing type-driven optimizations with the C FFI, which sounds sketchy, to me at least. The null-optimization for Option is just that, an optimization, and I don't like taking it for granted where safety/security is at stake.
> Rust guarantees to optimize the following types T such that Option<T> has the same size as T:
> - Box<U>
> - &U
> - &mut U
> - fn, extern "C" fn
> - num::NonZero*
> - ptr::NonNull<U>
> - #[repr(transparent)] struct around one of the types in this list.
> This is called the “null pointer optimization” or NPO.
> It is further guaranteed that, for the cases above, one can mem::transmute from all valid values of T to Option<T> and from Some::<T>(_) to T (but transmuting None::<T> to T is undefined behaviour).
Although I'm not sure if the ABI is guaranteed to match (which could differ even if the layout matches, AFAIK). The ABI for Box has been guaranteed to match since https://blog.rust-lang.org/2020/01/30/Rust-1.41.0.html , and I would imagine NonNull is the same. Maybe you could open an issue in the UCG if you need confirmation.
I suppose one problem with this approach for C FFI is that there's a lot of different values which could all be "null pointers". Converting them all for Option would be awkward and slow, and you wouldn't want to ever risk getting this stuff wrong.
But pointers are also useful even if you aren't doing FFI, e.g. for implementing custom data structures. In that case, Option<NonNull<T>> (or even NonNull<T>) is usually better than *mut T. But it's harder to type, and it doesn't clearly tell you whether the pointer should be *mut T or *const T. NonNull<T> should be preferred because security/safety is at stake for this sort of code.
Also the implementation is all wrong. The proper way to add an element to a linkedin list is to send the element an email titled "I just requested to connect", along with a "view profile" and "accept" link.
It's definitely not a data structure that makes the borrow checker happy. Luckily it's also not a data structure that's required or desirable in 99% of cases, but the standard library still offers an implementation for those 1% so people don't try to write their own (turns out getting a doubly linked list correct isn't quite as trivial as it may seem).
I’m actually a bit surprised it’s in the standard library if it’s so uncommonly needed. Isn’t Rust’s stdlib tiny with a philosophy of letting the community fill in most things?
It's at odds with it, but so are many other data structures found in the standard library. The solution is usually to use unsafe { } inside the data structure to allow for those criss-crossing pointers (while exposing safe, checker-friendly interfaces to the outside world)
From what I remember Rust has some kind of RefCell with weak references. Those could be used to make DLLs, if anyone can find a reason to use those. That said, you could also use zippers...
AIUI, GhostCell and related patterns can be used to implement doubly linked lists in safe Rust. These patterns are quite obscure and hard to use in practice, though.
That's because in order to interact with pointers you need unsafe, and not using unsafe in Rust requires a single ownership tree or reference counting. If you're not keen on either of those solutions, you shouldn't be writing a Doubly Linked List in other languages either (because they will have either pointers or reference counting/GC).
"Those solutions" start off being hierarchical ownership or reference counting, but turn into "pointers or reference counting". Requiring pointers is implicit to the definition of linked lists, and calling it out here is silly. And you don't really need either ref counting (again, pretty obviously, witness every real world implementation) or hierarchical ownership outside Rust.
Conceptually, the list as a vague concept owns all the nodes. The ownership just stops being expressible directly through references the way Rust wants it to be. But you don't have to re-invent hierarchical ownership to implement DLLs in other languages, because it's not really there in the problem. You just have to accept that they're always going to suck under Rust's ownership model.
Technically the pointers don’t have to be actual pointers. And if they’re not actually pointers, then Rust’s ownership model and borrow checker will be perfectly content; they’ll barely trouble you at all. And you won’t even need any unsafe blocks.
What you do instead is use integers instead of pointers. The integers are indexes into a Vec of list nodes, owned by the linked list. Since the nodes are now owned only by the Vec, which is in turn owned only by the linked list, the borrow checker will not complain.
Some people object that this is cheating somehow, but what is memory but a giant untyped global vector, shared between all parts of your application? Pointers into that giant shared array are just indexes with extra risk, since you have to trust that they point to real nodes that have been initialized. Plus, you often see users of a linked list put the nodes into an arena allocator anyway, especially in the kernel. The Vec in your Rust implementation serves the same purpose as the arena allocator.
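For what it's worth, here's a minimal sketch of that index-based approach (all names are made up, not from any library): the nodes live in a Vec owned by the list, the links are plain indices, and no `unsafe` is needed.

```rust
// Illustrative sketch: an index-based singly linked list, no unsafe needed.
struct Node<T> {
    value: T,
    next: Option<usize>, // index into `nodes`; None marks the end of the list
}

struct IndexList<T> {
    nodes: Vec<Node<T>>,
    head: Option<usize>,
}

impl<T> IndexList<T> {
    fn new() -> Self {
        IndexList { nodes: Vec::new(), head: None }
    }

    // O(1) push at the front. The Vec may reallocate, but indices stay valid
    // because they are positions in the Vec, not memory addresses.
    fn push_front(&mut self, value: T) {
        let idx = self.nodes.len();
        self.nodes.push(Node { value, next: self.head });
        self.head = Some(idx);
    }

    // Walk the list by following indices instead of pointers.
    fn iter<'a>(&'a self) -> impl Iterator<Item = &'a T> + 'a {
        let mut cur = self.head;
        std::iter::from_fn(move || {
            let i = cur?;
            cur = self.nodes[i].next;
            Some(&self.nodes[i].value)
        })
    }
}

fn main() {
    let mut list = IndexList::new();
    for v in [3, 2, 1] {
        list.push_front(v);
    }
    assert_eq!(list.iter().copied().collect::<Vec<_>>(), vec![1, 2, 3]);
}
```

One trade-off worth noting: a stale index is still a logic bug, it just can't be memory-unsafe.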
> Requiring pointers is implicit to the definition of linked lists, and calling it out here is silly.
The usual complaints are that "Safe Rust doesn't let you do this", when unsafe Rust is right there to let you operate on any soup-of-pointers data structure that you'd want to implement.
> you don't really need either ref counting (again, pretty obviously, witness every real world implementation)
If you implement a DLL in a memory-managed language, you effectively have the same behavior as you would in Rust with Arc or Rc. The ownership of the nodes then belongs to the GC or the RC value. You don't have to think about it because you're abstracted from the lifetime of each node by the language and runtime.
> or hierarchical ownership outside Rust
The ownership/borrow checker requires hierarchical ownership to operate, but it doesn't stop you from writing unsafe Rust.
> Conceptually, the list as a vague concept owns all the nodes. The ownership just stops being expressible directly through references the way Rust wants it to be. But you don't have to re-invent hierarchical ownership to implement DLLs in other languages, because it's not really there in the problem. You just have to accept that they're always going to suck under Rust's ownership model.
Yeah, and that's what unsafe is for. Most code doesn't need vague ownership, though. I'd go as far as saying that the nudge towards clear ownership helps both maintainability and performance.
My concern is with the prevalence of this idea that unsafe Rust is not "real Rust" or that the existence and use of unsafe Rust somehow precludes any benefits of Rust.
Another cool property of linked lists is that you can add an element to the head of a list without mutating the previous list. There are not many data structures that you can add an element to while leaving the previous version unchanged. This form of immutability is a very nice feature that Lisp / functional programmers enjoy. Because of this, you can concurrently add an element to the head of a list to create new lists without having a critical section in your code.
Lisp of course was originally invented for manipulating linked lists and it came about in an era where cache locality wasn't an issue because computers in 1959 were speed limited by their CPUs rather than their memories.
Cdr coding [0] solves several problems with singly-linked (cons-style) lists including cache locality and the 2x storage overhead of these lists. But it's not typically used except in dedicated hardware e.g. Lisp machines. It could in principle be revived fairly easily on modern 64-bit Lisp systems on stock hardware -- especially if such systems provided immutable cons cells.
But it's probably not worthwhile because modern Lisp programmers (in Common Lisp at least) don't use lists all that much. CL has very nice adjustable arrays and hash tables that are easier and faster than lists for most purposes.
There was a time when memory was conventionally expensive, memory allocation of large blocks was slow, and allocation of small blocks was much faster (malloc would find a small block faster than a large one on a highly fragmented heap).
Before having gigantic caches that would engulf nearly any sized contiguous list, linked lists were sort of vogue, frugal, and thought of as being fairly performant.
That time is still with us in embedded hardware. Even potent ARM chips (except the cellphone ones, maybe) have some KB of cache at best, and similar amounts of SRAM. Code size is also still at a premium, since XIP (execute-in-place from built-in flash) is slower.
FWIW I usefully use a "fake" linked list by adding a .next field to the struct in a compiler-allocated array of structs. On initialization I set the list head to &item[0], and in a trivial loop set the .next field of each entry (except the last) to entry+1.
Why bother? Because I can then easily add a few extra structs to the beginning of the (contiguously-allocated) linked-list without having to reallocate the whole thing.
Sure, pointer chasing with separately allocated structs is "slow", but I haven't yet measured to see if it's any different when (almost all) items are contiguous.
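For what it's worth, a rough Rust sketch of that pattern (the original was C; the struct and variable names here are mine, so treat it purely as an illustration): nodes sit in one contiguous allocation and each `.next` simply points at its neighbor, so the list can later be extended at the head without reallocating or copying the existing block.

```rust
struct Item {
    value: u32,
    next: *mut Item, // null marks the end of the list
}

fn main() {
    const N: usize = 8;
    let mut items: Vec<Item> = (0..N as u32)
        .map(|v| Item { value: v, next: std::ptr::null_mut() })
        .collect();

    // Thread the .next pointers through the contiguous block: entry i -> entry i + 1.
    let base: *mut Item = items.as_mut_ptr();
    for i in 0..N - 1 {
        unsafe { (*base.add(i)).next = base.add(i + 1) };
    }
    let head: *mut Item = base;

    // Walking the list produces a constant-stride access pattern, so the
    // hardware prefetcher sees essentially the same thing as an array scan.
    let mut sum = 0u32;
    let mut cur = head;
    while !cur.is_null() {
        unsafe {
            sum += (*cur).value;
            cur = (*cur).next;
        }
    }
    assert_eq!(sum, (0..N as u32).sum::<u32>());
}
```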
If you would...
- what sort of cache behavior should one expect of this on a modern laptop CPU?
- I haven't seen this approach before, have you?
It depends on what prefetchers your CPU has and the actual underlying access pattern your accesses cause. If chasing next leads to a constant-stride run through the array, you'll get identical performance to an iterative walk, since essentially every high-performance CPU since around 2010 supports stride prediction based on raw addresses. If .next sends you bouncing all over the area, you'll get complicated behavior, since whether or not the data can be prefetched depends on the performance/existence of a pointer prefetcher, which is less common and more error-prone. We know Apple's M1 has it due to some researchers using it as a side channel [1], but you'll have to do some digging on whether or not your laptop has one. Would make a nice post here if you do make the benchmarks :)
Linked lists are one of those things, like textual protocols, that people reach for because they're easy and fun but, from an engineering standpoint, you should never, ever use unless you can provide a STRONG justification for doing so. The locality of reference characteristics of vectors mean that, except for extreme cases, they beat linked lists almost every time on modern cached CPUs.
How do you detect a cycle with linked lists? :) I actually have a side project where I've run into this - it's a distributed directed graph structure where the nodes can send messages to each other. It's intended to be acyclic, but "walking the graph" is really only used to send truth values, which are combined with others to run Boolean logic. So in that case, the occasional cycle doesn't matter, because if the calculated boolean value matches that of the node's internal state, it just can cease propagating. So a structural loop won't turn into an infinite processing loop.
The problem then becomes that I can't introduce NOT gates anywhere in a cycle, because then the bit will continue flipping and I'll get an infinite processing loop.
So it seems my only hope is external processes that continually walk the graph and keep track of where it visited to try and detect loops, and I don't like how that scales...
Just in case performance matters, there is a more efficient way: have the tortoise stay in place and advance the hare only one node at a time, and assign the hare to the tortoise every time a power of 2 number of steps have been made. This is known as Brent's algorithm, and it requires fewer advancements than the original tortoise and hare algorithm by Floyd.
Another notable advantage of Brent's algorithm is that it automatically finds the cycle length, rather than (in Floyd's case) any multiple of the cycle length.
Cycle detection in (currently) 44 languages. Most (all?) use Brent's algorithm. They're operating on an iterated function but converting most of these to detect cycles in a linked list would be straightforward.
It's not difficult. Keep a pointer to a recently-visited node. After visiting N nodes, update the pointer to the current node, but also double N. If the pointer ever matches the next node, the list has a cycle. The doubling of N ensures that even large cycles will eventually be detected.
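For concreteness, here's a hedged Rust sketch of that doubling scheme (the function name `has_cycle` and the closure-based `next` step are my own framing, not from the comments above); it works for any "follow the next link" step that may terminate.

```rust
// Illustrative sketch of Brent-style cycle detection: the hare advances one
// step at a time; whenever the step count reaches the current window length,
// the tortoise is teleported to the hare's position and the window doubles.
fn has_cycle<T, F>(start: T, next: F) -> bool
where
    T: PartialEq + Clone,
    F: Fn(&T) -> Option<T>, // None = the list ends, so there is no cycle
{
    let mut power = 1usize; // current window length (doubles each epoch)
    let mut steps = 0usize; // hare steps taken inside the current window
    let mut tortoise = start.clone();
    let mut hare = match next(&start) {
        Some(n) => n,
        None => return false,
    };
    while tortoise != hare {
        if steps == power {
            tortoise = hare.clone(); // teleport the tortoise forward
            power *= 2;
            steps = 0;
        }
        hare = match next(&hare) {
            Some(n) => n,
            None => return false,
        };
        steps += 1;
    }
    true
}

fn main() {
    // A list represented as indices: node 3 links back to node 1, forming a cycle.
    let cyclic: Vec<Option<usize>> = vec![Some(1), Some(2), Some(3), Some(1)];
    // A straight list with no cycle.
    let straight: Vec<Option<usize>> = vec![Some(1), Some(2), None];

    assert!(has_cycle(0usize, |&i| cyclic[i]));
    assert!(!has_cycle(0usize, |&i| straight[i]));
}
```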
I would say the easiest way is to walk the graph and tag the nodes as visited. If you don't visit every node each iteration, use an increasing key you store in each node.
Donald Knuth's Dancing Links paper is a beautiful exploration of the power of linked lists. He uses a table with doubly linked lists (for rows and columns) to solve polyomino tiling. The linked lists allow for an elegant unpicking of rows (and re-attaching when backtracking).
The paper's a really enjoyable read, Knuth's tone throughout is "look at this, this is fun"
I think I first learned about linked lists in the context of programming for AmigaOS, which has a standardized form of (intrusive) linked lists that is used throughout the system. I remember being fascinated by the header node which serves as both the head and the tail of a list due to a “clever” field layout: https://wiki.amigaos.net/wiki/Exec_Lists_and_Queues
One of the only times linked lists are "undoubtedly the right option" is if you're writing some sort of intrusive data structure to avoid an allocation by reusing caller stack from another location. Not sure how many run into that often.
When I need to avoid an allocation it's because I'm writing an allocator :) Overlaying the list nodes on top of a block of free memory avoids the need for a secondary data structure to keep track of free areas.
The other time allocations are undesirable, and therefore linked lists are widely-used, is bare metal work when other allocators just aren't available. You can do tricks like representing pools of free and in-use objects with two lists where the nodes occupy the same memory block. Allocations are a fast O(1) unlink from the free list and O(1) re-link to the used list, and frees are the opposite.
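A rough sketch of that idea (the `Pool`/`Slot` names and the use of indices instead of raw pointers are my own simplification, and only the free-list half is shown): allocation pops the free-list head and freeing pushes the slot back, both O(1), with no general-purpose allocator involved once the pool exists.

```rust
// Illustrative sketch: a fixed-capacity pool whose free slots are chained
// into an intrusive singly linked free list. A bare-metal version would
// typically also thread in-use objects onto a second list in the same memory.
struct Slot<T> {
    value: Option<T>,
    next_free: Option<usize>,
}

struct Pool<T> {
    slots: Vec<Slot<T>>,
    free_head: Option<usize>,
}

impl<T> Pool<T> {
    fn with_capacity(n: usize) -> Self {
        // Chain every slot onto the free list up front.
        let slots = (0..n)
            .map(|i| Slot { value: None, next_free: if i + 1 < n { Some(i + 1) } else { None } })
            .collect();
        Pool { slots, free_head: if n > 0 { Some(0) } else { None } }
    }

    // O(1): pop the head of the free list.
    fn alloc(&mut self, value: T) -> Option<usize> {
        let idx = self.free_head?;
        self.free_head = self.slots[idx].next_free;
        self.slots[idx] = Slot { value: Some(value), next_free: None };
        Some(idx)
    }

    // O(1): push the slot back onto the free list.
    fn free(&mut self, idx: usize) -> Option<T> {
        let value = self.slots[idx].value.take();
        self.slots[idx].next_free = self.free_head;
        self.free_head = Some(idx);
        value
    }
}

fn main() {
    let mut pool: Pool<&str> = Pool::with_capacity(2);
    let a = pool.alloc("first").unwrap();
    let _b = pool.alloc("second").unwrap();
    assert!(pool.alloc("third").is_none()); // pool exhausted
    assert_eq!(pool.free(a), Some("first"));
    assert!(pool.alloc("fourth").is_some()); // the freed slot is reused
}
```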
Allocating a huge chunk of memory all at once when the array grows can also cause a bunch of problems. Linked lists are also much simpler in a multi threaded context.
I’m not an expert in memory allocators but I suspect large allocation is better than many small allocs due to less fragmentation no? I’ll grant you the multithreaded case but if you’re doing that your cache performance is probably crap anyway =)
I just used linked lists to make an LRU hard limit on the number of items in an in-memory store. On each use, the pointers are updated to move the item to the head. Each time the list is at the limit, the tail is removed and freed. I may take advantage of the machinery to make it go onto a free list. This was Go, and the DLL thing is my first use of generics.
Linked lists don't need defense. It is unfortunate that the job of the vast majority of developers is rather benign. They don't work on problems that require complex algorithms and data structures beyond arrays and dictionaries. But interviewers keep asking those questions and give these subjects a bad rep. However, if you are working on building any real infrastructure such as search, deep learning, cloud, crypto, game engines, OSes, compilers, etc., then you need linked lists, graphs, trees, and the myriad of algorithms surrounding them all the time. There is clearly a different tier of developers who work on such infrastructure versus those who simply utilize the infrastructure others have built.
Honestly, a lot of programming data structure concepts were difficult to grasp /until/ I truly understood how a linked list works. It's a very simple concept in hindsight, but as a newbie programmer this was an alien concept. Linked list? Just give me my array or built-in List! Then I understood how to build a linked list and how it can be extended in various ways. They can be used to solve interesting problems. Many high-level languages provide pre-baked implementations of linked lists, but most programmers don't understand how they work under the surface. When I understood linked lists, I was able to apply that understanding to trees.
The only reason to do this is for education/visualization of data structures, although I'm wondering if this can be engineered to cause a Python memory leak, via circularization of the linked list, or if Python's GC would catch it. Also educational perhaps.
Where linked lists really seem to come into play is with custom manual memory allocators in embedded programming, something of a niche subject.
"Since the collector supplements the reference counting already used in Python, you can disable the collector if you are sure your program does not create reference cycles"
I suspect the problem is that many people think of stuff like Java's LinkedList when they think of linked lists. As a standard list implementation, I think it is easy to see that a doubly linked list just isn't that strong. Especially with how cumbersome the List interface makes it to take advantage of the links.
That said, conceptually, linked items are everywhere, such that you can easily list things by following the links. You probably just don't bother keeping them in a "List" structure in your code.
I run into people that use LinkedList everywhere in their code "for efficiency". It always confuses me. Are CS professors teaching something that leads to this?
In practice, it seems unusual for it to have any advantages over just using an ArrayList in typical JVM applications.
I blame an over reliance on Big O as an arbiter of what is a fast algorithm. Linked lists are commonly taught as having much better insertion speed than arrays.
This ignores constant factors and cache coherency, of course. More, it also ignores that most lists that folks will make have a VERY predictable access pattern. Typically just a one way scan, if any traversal at all. With very few inserts, and mostly just appending.
Funny, as I constantly have to tell folks to not bother using it at the office. Everyone always assumes it has to be faster than arraylist for whatever they happen to be doing this time. I think the vast majority of the time it flat doesn't matter, but I also fully expect LinkedList will lose most speed tests.
Right. The main problem is that Java's List interface affords access by int index, making it seem array-like. Unfortunately get(i) on a LinkedList is O(n). This turns apparently innocuous things quadratic, like looping over a list by index.
Nothing to do with the interface - only java.util.RandomAccess guarantees O(1). Iterating using get(int) is a mistake regardless; stick to java.util.Iterator or ListIterator if you need the index.
The main issue w/ LinkedList is its memory footprint (aside from being a linked list, w/ an indirection on each access), along with the increased GC costs (the GC has to iterate the nodes too: cache misses, GC pauses). Even a half-empty ArrayList is more memory-friendly than any LinkedList. ArrayDeque offers adds at both ends, if you wish to use LinkedList as a queue.
Agreed that LinkedList memory footprint and locality are serious issues.
Fundamentally though the problem is that the LinkedList implementation is at odds with the abstraction provided by List -- access by index.
Certainly, straight iteration of every element is better done by a for-each loop (which uses an Iterator under the covers). But the availability of indexed access leads one to use it for a variety of additional circumstances. Consider for example processing every even-numbered element, or finding an element that meets some criterion and then operating on an adjacent element. Iterating over indexes for cases like these is quite natural given the List API. (ListIterator can be used for this sort of stuff, but it's quite cumbersome, and sometimes it doesn't actually help.)
I think the Rust people are de-emphasizing the importance of linked lists due to the glaring hole in the Rust language that makes it difficult to implement linked lists, and horribly awkward to do doubly linked lists in safe Rust.
I've used linked lists where I wanted allocation of an almost-write-only log data structure to be very fast - so not resizing a buffer, and nobody cares about cache locality except for allocation. In a system with a TLA buffer and bump-allocation, this is absolutely ideal because allocation is local and very fast, and does not cause any of the existing log to be brought into cache.
"Nobody uses this data structure stuff in real programming at work! You can forget it after college!"
The first problem with Linked Lists is you should almost never use them. Do they have uses? Of course! However on modern computers memory access is, roughly, more expensive than CPU cycles. The Linked List has been demoted to a “special case” data structure.
The second problem is Linked Lists are taught too early. They’re historically taught first in Data Structures 101. This results in new programmers using LinkedLists first when they should be used last.
I think the whole "dynamic arrays" vs "linked lists" debacle is just a waste of time, since the two actually complement each other.
- You can certainly implement linked lists on top of dynamic arrays, so you make the minimal number of `malloc()` calls and also keep your items stored contiguously! You need to use indices instead of pointers, since you will lose pointer stability unless you use virtual memory (though the upside is that a uint32_t is typically enough for most cases, which takes up only half as much memory as a pointer).
- A lot of data structures use linked lists under the hood, even ones that seem to use dynamic arrays on the surface. Have you ever written a simple arena allocator? For the allocator to find a free space in the arena in O(1), you need to maintain a free-list data structure under the hood. And guess what: you need to write a linked list. Even your operating system's `malloc()` itself probably uses linked lists under the hood.
- Linked lists just come up inside so many other data structures that you could essentially call them the 'backbone'. For example: in computer graphics there is something called a half-edge data structure [0], which stores the geometry data of a mesh in a way that's compact, easy to iterate, and easy to manipulate (for instance, you can insert/delete an edge/face in O(1) time, and more complex mesh operations can be implemented as well). And guess what? The vertices and half-edges are essentially stored in a complex web of linked lists.
> Linked lists are conceptual. A node pointing to itself is the most self centered thing I can imagine in computing: an ideal representation of the more vulgar infinite loop. A node pointing to NULL is a metaphor of loneliness. A linked list with tail and head connected, a powerful symbol of a closed cycle.
Tongue-in-cheek, but this really made me smile. :-)
> You get what data structures made of links truly are: the triviality of a single node that becomes a lot more powerful and complex once it references another one.
I don't think beginners actually make this connection for a while. Linked Lists are introduced analogously to arrays, sets, etc. Beginners think about Linked Lists in terms of what they already know.
As a beginner, I thought of Linked Lists purely as non-contiguous arrays, even though there are deeper concepts behind them.
Unless beginners already have the perspective of "a single part having the same power as the whole", I don't think this connection gets made for a while. Linked Lists don't expose so much possibility on their own.
People must stop giving attention to old garbage solutions that should not be used. Never use a linked list if you don't specifically need a linked list. You probably don't know why you would need one, so don't go there.
If you stop "giving attention" to old solutions, people will simply reinvent them. Linked lists in particular are very easy to reinvent because they're so simple, and because their advantages are all evident even in a high-level language, while disadvantages require a more low-level understanding of modern compute performance.
> (oh, dear Twitter: whatever happens I’ll be there as long as possible – if you care about people that put a lot of energy in creating it, think twice before leaving the platform)
I guess linked lists as they are are very useful for implementing queues (particularly those that feed thread pools), where the cost of a growable array is not needed and cache locality does not matter (continuing with the thread pool example: it's almost a guarantee that having the next element in the cache of a CPU which is not going to execute the next Runnable is a waste).
In Java particularly, both the array-based and the linked implementations of blocking queues should perform equally well. FWIW most queue implementations are linked lists.
The best implementations typically have a queue per thread and work stealing. The first threads to finish their assigned work will grab items from other queues, but until then you get perfect cache locality.
Java's queues and global threadpool queues in general are pretty old hat.
That was my high school programming fun: implement linked lists and other data structures in TurboPascal. Pascal is a language that does not come with dynamic lists, only fixed-size arrays and manual memory allocation. I had to build linked lists just to have a decent list primitive.
A few years later I discovered Perl that comes with dynamic size lists, dictionaries and strings. Whoa, that was a treat, so very nice to use and efficient. To this day I prefer constructs made of lists and dicts 10x more than defining classes.
Linked lists are useful when you are hurting for memory and need precise memory allocation, like in the kernel. You trade memory compactness for more operation time in dealing with certain aspects of them.
And code size, let's not forget about that one.
Plus you do not have a super smart memory allocator to prevent the hundreds of vectors from spraying the heap.
> Linked lists are simple. It is one of those rare data structures, together with binary trees and hash tables and a few more, that you can implement just from memory without likely stepping into big errors.
One of the best engineers in Microsoft’s devdiv told me that he often gave a linked list implementation in C as a whiteboard assignment for interviewees and that no one ever managed a bug-free version. (I failed mine even after creating a full general purpose implementation on my own just a couple years before.)
It's one thing when you have 30 minutes, can compile and test, and end with a working implementation. It's another thing to write the code on a whiteboard, a process nobody ever follows in practice: it's very easy to make mistakes this way. Anyway, doubly linked lists are easy to get wrong without compiling / testing.
Seems to me that all these arguments are rooted in the way computer memory is fundamentally implemented: could it not be re-architected so that it's more agreeable to arrays and lists, since that's how most data is used?
I often wonder about the separation of CPU and RAM, of ways it could be done better. How about making it like neurons, where each "memory cell" is also a teeny tiny computer itself? (How DO neurons do internal computation anyway?)
A nice explanation of the rationale for using linked lists in an OS kernel, from a Fuchsia developer. In short, most typical operations are almost guaranteed not to fail as long as the invariants are not broken, which makes linked lists suitable for cases where there's no fallback option left, like kernel code.
> Linked lists are simple. It is one of those rare data structures, together with binary trees and hash tables and a few more, that you can implement just from memory without likely stepping into big errors.
Well, if one likes linked lists enough, they can replace binary search trees / hash tables with a skip list. Super powerful and easier to implement than splay trees, red black trees, avl trees, augmented trees, what-have-you.
I much prefer deques to linked lists when I need O(1) push/pop but still want to keep some semblance of cache locality. Although you can't splice two deques together as easily as two linked lists.
Sadly, you can't really control the block size in C++'s std::deque so you can't communicate useful information to the library about your use case. I think MSVC allocates such a small block that it is effectively a linked list.
Note that a deque in general is an ADT that isn’t necessarily implemented in the way std::deque commonly is. A deque can be implemented as a linked list.
Correct, deques are abstract data types and you can implement them multiple ways, including using linked lists or growable arrays/vectors. So they aren't directly comparable to linked lists since under the hood they can be implemented with linked lists (or doubly linked lists at least). You could compare the performance of different deque implementations, though.
It depends on the implementation. Python's deque, for example, is implemented as a two-level data structure; a circularly linked list of blocks. If you've got page-sized blocks, then your deque can be sufficiently local for most cases. It introduces a little extra logic, and you'll want to keep around a spare block to prevent thrashing memory if you repeatedly push/pop at the block boundary, but it's often worth the pain if the data structure is anywhere near a hot loop.
It is usually implemented as linked arrays. For example, arrays of 16 or 64 elements, if it grows over that size, new arrays or vectors are used underneath.
Two consecutive elements are probably in the same array, and that helps with cache locality.
Linked lists are cool, but he missed one reason why: how easily they become n-ary trees. A tree expressed like this is quite elegant and useful in a variety of domains, my favorite being a "trie": an n-ary tree that computes the label of each node as the string you get from the previous label plus the string in this node. It's a wild and hairy data structure, but very cool once tamed.
In defense of what? A brigade of null pointer exceptions?
> Linked lists are conceptual. A node pointing to itself is the most self centered thing I can imagine in computing: an ideal representation of the more vulgar infinite loop. A node pointing to NULL is a metaphor of loneliness. A linked list with tail and head connected, a powerful symbol of a closed cycle.
Oh, this article is complete satire. Bravo, you had me
Anyone else getting a serious cert expired warning when visiting this site?
The one I'm getting is also for the wrong name - registered for redis.io and expired on `Fri, 07 Aug 2020 11:30:03 GMT`. Issued by Let's Encrypt. Fingerprint: `07:BF:EA:59:DB:83:33:77:50:27:A8:C5:2F:80:F6:CA:E6:EC:E2:7D:01:DB:72:7A:AF:EE:69:DD:EC:2D:DA:F3`
Linked lists are great to know and understand. Linked lists are not good for performance. In most cases, an array or other data structure is much more efficient than a linked list. But there are cases where linked lists' immutability/segmentation or simplicity to implement make them the better choice.
There are, interestingly, also times when fun facets of the linked lists mutability can lead to benefits. https://taeric.github.io/DancingLinks.html is my attempt at exploring Knuth's DLX algorithm. I can't recommend enough that you look into how he did it. Very fun and exciting use of mutable doubly linked lists.
Linked lists also work great in a sharded distributed data structure. You can make changes to the list structure by simply changing reference at nodes and don’t need to move anything.
Linked lists are great. But they have the problem that, almost always, whatever technical interview you have, someone asks you to whiteboard how to reverse one.
While there's a certain use for linked lists - in operating system kernels, for example - and as a programmer you absolutely have to know how linked lists work, there are other linear data structures which are more suited to general programming.
To give a C# example: arrays, lists, and dictionaries (hash tables) implement iterators, so you can always know what the next element in the collection is. Elements can be accessed by key or index in O(1), and elements can be added in (amortized) O(1). The case in which you absolutely have to insert an element at a particular position is rare, so linked lists are seldom used.
The case is much the same for C++: the STL has both a vector class and a linked list class (std::list), but the vector is almost always the one you reach for. Same for Java.
I think this article could be more intelligently titled: "Don't post, read, or respond to things on TWTR". The linked lists aspect could be replaced with anything.
It's odd that the author states they wrote the article in response to arguments on Twitter, yet they didn't show us the arguments this was a response to.
I'd note that the number of people creating content now is probably two to six orders of magnitude larger than before the internet, when it was just print/radio/TV/etc.
The most that Joe Random could expect to get in print would be a Letter to the Editor, or a random passerby interview. The in-crowd was even more "unusual".