Distilling the Real Cost of Production Garbage Collectors [pdf]

Sindisil · on April 28, 2022

Twitter post from Steve Blackburn (one of the authors) announcing the paper and providing a bit of context: https://twitter.com/stevemblackburn/status/15196411576216576...

Haven't quite finished reading the paper yet, but it's interesting so far.

While they present a method that could be applied more widely, this paper is only analyzing current tracing GCs in OpenJDK (plus Epsilon GC). Would love to see future papers taking other GCs into account (e.g., various flavors of refcount GC).

Even better would be a way to compare vs methods of "manual" memory management (manual alloc/free, RAII/scope based alloc/free, arena allocation, etc.). Unfortunately, non-GC memory management is very application specific, so I'm not sure if a generalized analysis would be super useful.

gwbas1c · on April 28, 2022

> Even better would be a way to compare vs methods of "manual" memory management (manual alloc/free, RAII/scope based alloc/free, arena allocation, etc.). Unfortunately, non-GC memory management is very application specific, so I'm not sure if a generalized analysis would be super useful.

I've personally found Rust to be a very difficult language to learn and become productive in.

In contrast, I was able to jump into Swift and become productive very quickly.

Swift isn't GC, but is reference counted. Reference cycles will lead to leaked memory.

So what's interesting is to know GC's overhead relative to what: Reference counted languages like Swift? Or languages like Rust?

More importantly: Is the overhead worth it? Developer time is valuable; thus leading to the critical question: What's cheaper, extra developer time or a more powerful computer?

hedora · on April 28, 2022

GC shifts human work from development time to deployment / ops time.

So the question is whether one-off developer time is cheaper than SRE time for the useful lifetime of the service.

However, it gets worse for GC: depending on the application, developing for a GC can be much more difficult / expensive than manual memory management (or reference counting). The canonical examples are low latency (with SLAs on the order of a disk or network I/O, so 10-100 microseconds) and applications that need to use bounded memory.

On top of that, there are confounding factors that hurt the case for GC. I don't know of an imperarive GCed language with a reasonably expressive type system (all the JVM languages I know of use type erasure).

For experienced developers, rich type systems with actual runtime guarantees are a huge productivity boost.

LAC-Tech · on April 29, 2022

I don't know of an imperarive GCed language with a reasonably expressive type system

Well, we have no idea what you consider to be imperative and what you consider a reasonably expressive type system to be.

titzer · on April 28, 2022

C#? F#? Go (now with generics)? Virgil...

native_samples · on April 29, 2022

GC shifts human work from development time to deployment / ops time.

Not unless you're load testing in production. Source: have been both a developer and an SRE in the past (an actual SRE, at Google, running services that used both manual memory management and GC).

Now, often people do in fact load test in production but this isn't some intrinsic property of GC. Also as an ops guy at Google I had to track down a number of memory management errors/segfaults that only occurred in prod, which wouldn't have occurred in a GCd language. Often double-frees or use-after-frees that only occurred after very specific combinations of RPC responses or timeouts that corrupted the state machine trying to handle them. Not much fun.

Also, you should be aware that "SRE" as a job role is deeply weird and most companies don't do what Google and fast-follower firms do. In most companies, ops people are cheaper than developers. Google was historically pretty unusual in demanding ops people who also happened to have histories hacking UNIX kernels but that was:

a. Not consistent, even in the early days most SRE people had sysadmin backgrounds.

b. To some extent because you would sometimes have to do things like ... debug segfaults in production.

Frankly they struggled to keep devs in SRE. They were only able to do it for a long time because they were making job offers to SWEs conditional on them working in SRE for a few years. Don't want to do ops work? Tough, you don't get to work at Google at all then. The desirability factor of working at Google was higher than the dislike of doing ops, so, people would accept this offer (that's how I ended up there). But obviously this is only a viable strategy in rare cases. Most companies can't do that.

Final point - your ideas about JVM GC are quite out of date. Modern JVMs are much better at auto-tuning. You can still squeeze extra performance out by tweaking obscure knobs and you should definitely tweak the one master knob that tells it whether your job is latency or throughput sensitive (i.e. batch or interactive). But that takes a few seconds and then you're done.

I don't know of an imperarive GCed language with a reasonably expressive type system (all the JVM languages I know of use type erasure).

Type erasure and GC are orthogonal. C# has GC without type erasure.

spullara · on April 28, 2022

You don't consider Scala to have a reasonably expressive type system?

alpaca128 · on April 28, 2022

They probably don't consider Scala an imperative language.

spullara · on April 29, 2022

It is definitely both functional and imperative depending on author :)

kqr · on April 28, 2022

> I don't know of an imperarive GCed language with a reasonably expressive type system

Haskell with all code in the IO monad?

im3w1l · on April 28, 2022

Now I'm mildly curious how viable this would be.

kqr · on April 29, 2022

Extremely, if my last experience trying it (back in 2016 or so) counts.

Like all good jokes, "Haskell is the world's best imperative language" contains a kernel of truth. The standard library contains rather interesting control structures that can exist only because IO actions are lazily evaluated values -- but you can also choose to ignore those for a more Java-like experience.

gwbas1c · on April 28, 2022

> The canonical examples are low latency (with SLAs on the order of a disk or network I/O, so 10-100 microseconds) and applications that need to use bounded memory.

Well, in those cases extra developer time is cheaper than a more powerful computer!

Will you get ROI building a stereotypical web application with manual memory management? Doubtful.

Edit:

> GC shifts human work from development time to deployment / ops time.

I've never had to tune a garbage collector in deployment. I've heard of other people having to do it; but I've never heard of anyone saying, "yes, I'm so happy I built my website with C++."

Sindisil · on April 28, 2022

All the world is not a web application.

Spivak · on April 29, 2022

True but it’s not really about web and more about “problem domains where GC pauses are significant, matter, and you can’t throw more compute and memory at the problem” which is not nothing but on the dart board of all SWE jobs good money is on hitting one where it isn’t even if you’re aiming for one.

gwbas1c · on April 29, 2022

However, web development is by far the most common domain that people develop for; and most web sites do not run at the scale where they would benefit from the kind of performance tuning that involves manual memory management.

(Good luck calling malloc from Javascript, or debugging use-after-free issues when you're iterating quickly.)

chrisseaton · on April 28, 2022

> Swift isn't GC, but is reference counted

GC is often an umbrella term that includes reference counting.

Almost all serious tracing GCs include some aspect of reference counting, and almost all serious reference counting GCs include some aspect of tracing.

dreamcompiler · on April 29, 2022

> Almost all serious tracing GCs include some aspect of reference counting

I don't know about the Java world but in Common Lisp GC there's no reference counting whatsoever.

chrisseaton · on April 29, 2022

Common Lisp is a specification, not an implementation. It doesn't make sense to say that the GC is implemented in any particular way, because that's not part of the specification - it's up to the implementation.

I would bet my house that some implementation of Common Lisp somewhere has a GC that uses card marking, which is a form of reference counting.

dreamcompiler · on April 30, 2022

I should have been more precise: The four CL implementations I'm familiar with (ACL, LW, CCL, SBCL) don't do reference counting and they probably comprise the vast majority of the market.

_19qg · on April 30, 2022

Interlisp used reference counting. Don't know if that is still the case -> interlisp.org . It includes a Common Lisp implementation.

gwbas1c · on April 28, 2022

No, it's not. Garbage collection and reference counting are "automatic memory management."

With reference counting there are no pauses, and memory is reclaimed immediately. It allows for strongly coupling deallocation with resource cleanup. (The consequence is higher overhead maintaining counts, and leaked reference cycles.)

In a garbage collected language you need to explicitly release your resources. (IE, the Dispose pattern that's common in C#)

munificent · on April 28, 2022

These days, "garbage collection" is the umbrella term that includes ref counting and other forms. "Tracing garbage collection" the term for GC that doesn't include ref counting.

> With reference counting there are no pauses

You can still have pauses of arbitrary duration with ref counting. When the last reference to an object that is the sole owner of a large graph of other objects gets decremented, the entire object graph must be immediately deallocated. It's possible for a single dereference to trigger an unbounded amount of deallocation.

> and memory is reclaimed immediately.

Most production-quality ref counting implementations defer tracking references on the stack to minimize ref count churn. That significantly improves performance but it means you can no longer claim that the system always reclaims memory "immediately".

> In a garbage collected language you need to explicitly release your resources. (IE, the Dispose pattern that's common in C#)

Managing non-memory resources is orthogonal to tracing versus ref counting. If you choose to tie a resource's lifetime to a piece of memory's lifetime then, yes, the memory manager's deallocation policy becomes user-visible.

I think you would get a lot out of reading "A Unified Theory of Garbage Collection".

SomeCallMeTim · on April 28, 2022

> When the last reference to an object that is the sole owner of a large graph of other objects gets decremented, the entire object graph must be immediately deallocated. It's possible for a single dereference to trigger an unbounded amount of deallocation.

Two things:

1. Unbounded may be technically true, but it's really not a hard problem to ensure that time-critical runtime deallocations either rarely happen or are simply never huge chains of objects that release each other. Source: Used C++ & boost::shared_ptr/boost::weak_ptr extensively in game development and never had unexpected jank due to deallocations.

2. There's no actual reason that the full deallocation needs to happen immediately in all reference counting implementations. If you have hard time or latency constraints you could queue up the actual dereferencing and ensure that you only dereference a certain number of objects per time period.

Yes, #2 changes some assumptions about when objects will be dereferenced; in C++ one might rely on guarantees that objects are immediately released along with associated resources. But if you go into a project knowing that it needed to support deferred reference counting, manually releasing critical resources is hardly the worst thing you'd need to do.

munificent · on April 28, 2022

> it's really not a hard problem to ensure that time-critical runtime deallocations either rarely happen or are simply never huge chains of objects that release each other.

Sure, but now you're essentially in the same business as someone using a GC finds themselves where they are tuning the GC, using object pools, etc.

> There's no actual reason that the full deallocation needs to happen immediately in all reference counting implementations. If you have hard time or latency constraints you could queue up the actual dereferencing and ensure that you only dereference a certain number of objects per time period.

Yup, that's a common optimization. But now you have given up the property that destruction happens immediately and deterministically.

My overall point is that the benefits of ref counting are not without cost. There are trade-offs and as you make different ones to improve performance or latency, you quickly discover that your ref counting system starts to take on shades of tracing GC with both the benefits and downsides that that entails.

SomeCallMeTim · on April 29, 2022

Object pools are not hard, and I certainly used them a lot in game development. That's just a cost of doing business in low-latency programming.

There's no "tuning of the GC" beyond that, though, unless you count linking in a faster multithreaded malloc library, which is another thing that I've tried.

What's problematic about Java in particular in relation to latency is that Java's libraries can be pointlessly wasteful with regard to allocating garbage. I found that a function that was supposed to call through to a system function that got the system time in milliseconds was somehow allocating an object and discarding it as part of the call. The function returned an integer.

I found this when my Android JNI (C++) app was getting janky because of GC pauses, and I traced the allocation to that function. I wrote a quick JNI function in C++ to call the system and return the integer directly, and the jank went away.

The best GC in the world won't make up for wasteful allocations, and some languages (or their ecosystems?) seem to encourage rampant allocations.

astrange · on April 28, 2022

> Most production-quality ref counting implementations defer tracking references on the stack to minimize ref count churn. That significantly improves performance but it means you can no longer claim that the system always reclaims memory "immediately".

Swift doesn't do this, if I understood what you meant properly. I suppose you might want to defer freeing some things to idle time but it's usually not a problem in my experience. I'm honestly surprised it always comes up.

There's "autorelease pools" in ObjC but they aren't used to move releasing to an idle time, rather it's to handle cases where nobody knows what the precise lifetime of the allocation is. Swift has a better ABI that doesn't have that issue.

munificent · on April 28, 2022

> Swift doesn't do this, if I understood what you meant properly.

I think Swift can take advantage of the fact that many values are purely stack allocated and don't touch the heap or ref counting system at all. It leans pretty heavily on structs and value types.

Other managed languages are more Java-esque in that basically every value is in principle heap allocated. In those, I think it's more valuable to defer ref counting stack references since there are many more of them and they tend to be frequently mutated.

pjmlp · on April 29, 2022

Swift surely does it, I advise reading a bit of what the compiler actually generates.

That is why, as Apple decided to fix of their ARC optmizations, they had a talk at WWDC 2021 about ARC gotchas for code that will behave in a different way due to the new optimizations.

"ARC in Swift: Basics and beyond"

https://developer.apple.com/videos/play/wwdc2021/10216/

astrange · on April 29, 2022

I'm familiar with it, I wouldn't say Swift previously intentionally lengthened lifetimes for the purposes of having fewer refcount operations.

It's more like the old version of the optimizations were just heuristics (so they didn't work that well) and the new versions are formally proven, meaning they're more aggressive and found some memory safety holes in C/ObjC APIs.

gwbas1c · on April 29, 2022

> There's "autorelease pools" in ObjC but they aren't used to move releasing to an idle time, rather it's to handle cases where nobody knows what the precise lifetime of the allocation is. Swift has a better ABI that doesn't have that issue.

Autorelease is a no-op when using automatic reference counting. (But it was a pretty cool concept back then.)

astrange · on April 29, 2022

It isn't. There's an optimization that removes things from the pool sometimes, but it doesn't always work, and it's used to pass back out-params like NSError**.

What you can't do is message autorelease yourself - although of course you can escape that pretty easily.

cesarb · on April 28, 2022

> You can still have pauses of arbitrary duration with ref counting. When the last reference to an object that is the sole owner of a large graph of other objects gets decremented, the entire object graph must be immediately deallocated. It's possible for a single dereference to trigger an unbounded amount of deallocation.

When people say "with reference counting there are no pauses", they are not talking about the bookkeeping cost of releasing the allocations when their last reference is dropped; they are talking about unrelated code (which might for instance be running a tight numerical loop with no allocation or deallocation at all) being arbitrarily stopped.

Also, it's misleading to call the amount of deallocation "unbounded". The amount of deallocation is precisely the size of the graph of objects being released at that point; no more, no less.

munificent · on April 28, 2022

> When people say "with reference counting there are no pauses", they are not talking about the bookkeeping cost of releasing the allocations when their last reference is dropped;

I'm not sure that you can claim that confidently. I have certainly seen many people laboring under the false assumption that ref counting systems always have bounded, small latency when that isn't the case. The latency may be much lower than various other tracing GC alternatives, but I think many people have a simplistic mental model that a ref count decrement only ever leads to a single object being freed.

There is an ah-ha moment when you point out that freeing an object decrements its outgoing references, which can in turn cause a cascade of frees.

> they are talking about unrelated code (which might for instance be running a tight numerical loop with no allocation or deallocation at all) being arbitrarily stopped.

Many GCs will never arbitrarily stop the program in a tight loop that isn't doing allocations, so there's nothing particularly unique about ref counting here.

> Also, it's misleading to call the amount of deallocation "unbounded". The amount of deallocation is precisely the size of the graph of objects being released at that point; no more, no less.

Sure, and a program can build object graphs of arbitrary size. That's what "unbounded" means: there is no pre-defined bound on it.

native_samples · on April 29, 2022

Yes people have incomplete mental models.

Something else people often don't realize is that malloc/free aren't instant. Why is de-referring a large object graph slow, well, mostly because free is slow. Why is free slow, because the internals of a malloc can be rather sophisticated. There is probably locking, there may be internal book-keeping like coalescing free regions inside the heap and so on.

If you look at modern GC benchmark workloads, they're often generating gigabytes of allocations per second. Good luck shoving that much work through a typical malloc! Big companies like Google have had to write their own mallocs for a good reason.

munificent · on April 29, 2022

> Something else people often don't realize is that malloc/free aren't instant.

True, and the costs vary wildly and are deeply tied into the rest of the memory management strategy.

In a copying collector, free() is essentially free: It takes the same amount of time to free 1 object in the semispace as it does 1,000 objects in the same semispace.

astrange · on April 28, 2022

It's not unbounded, but it can be surprisingly high and variable, especially if some of those objects have things like handles to kernel objects and freeing them requires IPC etc.

Still, it usually works out, especially when lowering peak memory increases performance on its own.

wgd · on April 28, 2022

> With reference counting there are no pauses, and memory is reclaimed immediately

You can't guarantee both of these properties at once, in the general case. Consider a very large tree of objects which are referenced through a single root node, and then that root node becomes eligible for reclamation. For a sufficiently large object graph, either reclamation will require a noticeable pause or some portion of the work will have to be deferred until later.

kragen · on April 29, 2022

You may be able to guarantee it in this particular case with region-based allocation (also called arena allocation); discarding a region, or resetting its allocation pointer back to the beginning, is a constant-time operation. If your very large tree is in a single region, and it's the only thing in that region, then you can reclaim the entire thing in constant time as soon as you no longer need it.

Region-based memory management is commonly used, for example, with per-frame heaps in games, or per-request heaps in servers; but in those cases memory is not reclaimed immediately. The MLKit implements Standard ML with regions instead of garbage collection, though recent versions also have GC.

native_samples · on April 29, 2022

That only works if you are able to prove none of your objects do anything in their destructors.

In practice, manually memory managed languages always conflate memory and other resources. A File object is not just a block of memory, it's also a file descriptor, etc. So you have to run the destructors at the 'right' times and can't just toss the memory. In special cases where you control every allocation, sure, but often that isn't practical.

kragen · on April 29, 2022

It's true that this solution doesn't apply when you have finalizers, but I think of that as the unusual case, not the usual one. I don't understand what you mean by "special cases where you control every allocation"; how are things getting allocated in your arena without you controlling them? Who is "you" if not the author of the program, who controls everything in it?

native_samples · on April 30, 2022

Libraries. If your libraries can't do that because you wrote your own malloc to get this arena allocation, it's because your strategy isn't general and will break the moment you want to use libraries that allocate.

kragen · on April 30, 2022

The malloc interface doesn't support finalizers anyway! If a library is using malloc, but needs finalizers, it's going to have to invoke its own finalizers manually, usually just before it manually invokes free().

Also, though, it's very unusual to install an arena allocator as a replacement for malloc and (as a no-op) free — malloc has no way to specify which of possibly several arenas you want to allocate in, arena allocators suck at realloc, and writing an arena allocator that invokes malloc to get its arena is significantly easier than writing one that uses a variety of platform-specific calls depending on compilation options. Instead, you write your own arena allocation functions (h_alloc or apr_palloc or whatever), and then you invoke them explicitly, in your own code. Non-arena-aware library code will just keep invoking malloc as usual.

The case where you might end up with library code invoking your arena allocator is where you supply your allocator as an argument to something — like C++ STL containers or, again, APR pools, for example with apr_pool_create_ex. But that still doesn't mean you have to depend on the allocator to invoke finalizers — after all, the standard library operator new and operator delete don't invoke finalizers, instead relying on other finalizers (destructors) to invoke finalizers and then invoke operator free. That still works — your files still get closed — even if the "operator free" is a nop because you're allocating out of an arena. In that case you don't get the major benefit of arena allocation (fast constant-time reclamation of the whole arena), just the minor ones, no fragmentation and fast constant-time allocation.

A more useful strategy is often to attach your finalizers to the arena instead of allocations within it; in the case of apr_palloc you can do this by creating subpools with apr_pool_create and attaching cleanup functions to them with apr_pool_userdata_set.

Regardless, my claim was more restricted than the one you're arguing against. I claimed that arena allocation can guarantee both that there are no pauses and that memory is reclaimed immediately in this case, where you have an arbirarily large tree of objects that dies at a single point in time, and in many similar cases. This is what wgd was claiming was impossible in their second claim: "For a sufficiently large object graph, either reclamation will require a noticeable pause or some portion of the work will have to be deferred until later." That claim is not true.

As far as I know, however, wgd's first and less general claim is true: there's no way to extend this region-based strategy even to the general case of immediate memory reclamation without losing the constant-time guarantee, much less the even more general case of immediate execution of arbitrary finalizers. (By contrast, reference counting can guarantee immediate memory reclamation and immediate execution of arbitrary finalizers, although at the cost of unbounded pauses, and not in the general case because it can't handle cycles.) Obviously you can't execute an unbounded number of finalizers in constant or even bounded time, or call close() on an unbounded number of file descriptors in constant or even bounded time.

But that doesn't mean you can't reclaim an unbounded number of memory allocations in constant time. You can, it's a useful thing to do, and people do it all the time. If you've fetched an URL from Apache today you've done it yourself, even if you didn't know it.

pkolaczk · on April 28, 2022

There is a difference between pausing a single thread in a known, predictable point and pausing all threads at random point in time. Tracing GCs do the latter. It is way easier to avoid long pauses with recounting than with tracing GC.

Skinney · on April 28, 2022

Java’s ZGC has a constant (O(1)) sub-milliaecond pause-time. GC runs concurrently with the app

pkolaczk · on April 28, 2022

Yeah, according to the paper you pay for it by a very high overhead, up to 2x. So generally, pick your poison ;)

Skinney · on April 29, 2022

Exactly.

But the thing is, I can switch GC implementation according to my needs.

If throughput is my number 1 concern I can use ParalellGC, if not I can use ZGC. Or G1 if I want something in between (not sure if the paper picks up the huge drop in overhead of G1 in Java 18).

pkolaczk · on May 1, 2022

You can switch the GC implementation only globally. In languages like C++ or Rust you can have different parts of program using different memory management strategy, from carefully hand-crafted region allocation through refcounting to even tracing GC (rarely needed though).

ncmncm · on April 28, 2022

In such uses, if it matters you don't rely on RC to reclaim the storage. You allocate from an arena, and drop the whole graph as a unit.

Using pointers to make up a graph is a choice. It is a thing taught in CS classes, so it may feel comfortably familiar; that does not make it good. I never do.

ahtihn · on April 29, 2022

How is this different than working around a GC?

There are lots of things you can do in GCed language to avoid allocations and avoid running the GC when it matters.

ncmncm · on April 29, 2022

Often it is your SRE who has to do that work.

gwbas1c · on April 29, 2022

> Consider a very large tree of objects which are referenced through a single root node

That's predictable and can be planned for: You can hold onto the reference until you get to a point where the pause is acceptable.

You can't do that in a garbage collected language, because you have no control over when the garbage collector will pause. (In theory your application can explicitly trigger a garbage collection, but in practice such behavior is discouraged.)

Sindisil · on April 28, 2022

[citation needed]

Reference counting is widely considered one strategy of garbage collection (https://en.wikipedia.org/wiki/Garbage_collection_(computer_s...). For that matter, some GCs that are primarily tracing also use ref counts for some portion of the heap, as an optimization (just as some primarily refcounting GCs also run a trace to clean up cycles).

Also, reference counting can certainly result in pauses, when large graphs of interconnected objects are reclaimed. On the upside, you know when a pause might happen. You still don't know for certain without explicitly keeping track of what is about to be freed, which would add additional overhead, and indeed may not be practical or even feasible.

spullara · on April 28, 2022

Reference counting absolutely has pauses. Try allocating millions of objects in a block and then exiting that block. Hell, even malloc pauses when fragmentation gets too high.

mirekrusin · on April 28, 2022

I'll show in profiler straight away. With tracing gc it won't.

spullara · on April 28, 2022

What do you think won't show up in a profiler with GC? With the right profiler I can see the entire heap as well as where every object was allocated and all those deallocations are basically free. With a modern collector like ZGC, max pause times are now sub-millisecond for collections. Also easy to measure the total GC cost in the profiler as well. The modern GC story is extremely good. The biggest issue with GC in my opinion is that it uses more memory than you strictly need.

mirekrusin · on April 29, 2022

It won't show what we're talking about in this thread - deallocations of large number of allocations when they loose last root reference at once.

spullara · on April 30, 2022

It absolutely will show them because they cost nothing. 0 will show up in the profiler which is how much deallocations cost with GC.

mirekrusin · on April 30, 2022

Magic.

In rc/arc you can achieve the same by making release a no-op.

kragen · on April 30, 2022

I don't know what you mean by "making release a no-op", but with reference counting it inherently takes O(N) time to deallocate an object containing N references, and O(M) time to deallocate M objects containing no references. With a generational copying collector, it takes O(1) time to deallocate M unreferenced objects each containing N references in the nursery — or rather it takes O(P+Q) time, where P is the number of objects that don't get deallocated and Q is the number of root references that need to be scanned, the ones in the stack plus the cards marked by your write barrier or whatever.

You can't achieve that with reference counting, with malloc/free, or with a non-copying tracing collector. You can achieve it with regions.

On the other hand, you may not need to. Allocating and initializing those objects took O(MN) time, so your program can't get an asymptotic speedup by deallocating them in O(1) time instead of O(MN) time. So it really depends on the constant factors, which nowadays depend strongly on things like cache miss rates and cache line sharing between cores.

mirekrusin · on May 1, 2022

Allocation cost in region is just pointer update, quite cheap. Deallocation is constant. You will get speedup if you can replace large number of allocations with arena. In gc languages you don't have luxury of specifying custom allocators for selected parts of the program. But the main argument was that tracing gc hides all of this from you and you can't profile your program explicitly see which callstacks contribute to large deallocations. Just because gc hides it, defers and spreads in time, doesn't mean it's zero cost – on contrary tracing will have more overhead. In vast majority of programs power consumption overhead is not relevant though, if it's traded for programming ergonomics, it's a win.

chrisseaton · on April 28, 2022

> With reference counting there are no pauses

How are you reclaiming cycles with no pauses?

ncmncm · on April 28, 2022

I have had no occasion in the past three decades to make cycles.

It smacks of poor discipline, or poor design.

chrisseaton · on April 28, 2022

Well good for you I guess - but that’s not a general solution for language implementation as it’s not sound for a language like Python. Which is why in practice most RCs include tracing. And the reverse as well - most tracers include a degree of RC. That’s the pragmatic solution in practice.

kragen · on April 29, 2022

I think tracing GC was added to CPython after about 15 years, so I guess RC was a general enough solution for it for quite a while. Even today CPython refuses to collect cycles with finalizers. Smalltalk also only used RC from 01971 to the early 01980s. I think Perl still only does RC. Swift apparently also only does RC. People programming in those languages just design their programs to not create cycles, or include a weak reference in the cycle somewhere, like for Observer kinds of things.

I'm not a fan of RC, myself. It's space-hungry, it creates unnecessary write traffic, it's unreliable (when you can have cycles), and it's slow. But there are lots of "not sound" and "not general" but "pragmatic" language implementations out there using it, without tracing. In practice, even.

ncmncm · on April 29, 2022

Pervasive RC is the most costly GC. But it is fine where not pervasive, and even some places where it is. RC is about the last thing you would worry about slowing down your Python program.

I RC'd a thing in a program, five years back. The program had a linked list in it. Increments and decrements happened on a scale of minutes. I did not agonize over it.

kragen · on April 29, 2022

I mostly agree, except that I do think reference counting is a significant part of CPython's runtime slowness. Maybe 10%.

kragen · on April 30, 2022

Confirmed, Perl still only does reference counting. No cycle collector.

titzer · on April 29, 2022

In my work, I regularly end up with cyclic data structures, e.g. the SSA IRs in several compilers I have worked on have oodles of cycles. And interpreters run user programs, which create cycles.

I don't think this is a particularly fruitful line of discussion to strongly claim that cycles can just be avoided, or worse, that you're an idiot for having them. It's needlessly inflammatory.

ncmncm · on April 29, 2022

I think your SSA cycles can be trivially reclaimed by arena means, no?

Point is just that people get used to whatever they were directed to do for their CS class assignments, and get to thinking it is natural, and even unavoidable.

That something somebody is used to not thinking about is not actually necessary can be a novel and fruitful idea. The cheapest GC is always no GC at all.

titzer · on April 29, 2022

I've worked on 5 different SSA compilers. C1, C1X, Crankshaft, TurboFan, and Virgil. C1X and Virgil are written in safe, garbage-collected languages, while C1, Crankshaft, and TurboFan are all C++ and use arena allocators. Of these, in my totally subjective opinion, the two written in GC'd languages are a heck of a lot more productive to write in--like 10x as productive. They still run really fast, too. In fact, C1X was faster than C1, and Virgil is in the same ballpark or even better, though they do very significantly different analyses and optimizations in their pipeline. Crankshaft might have the edge on C1X in some respects, but I never did a direct comparison. TurboFan is by far the most incredibly intricate piece of software of my life and I'm convinced could not be written with ownership, acyclicity, or manual memory management without one or two suicides and a bugtail of security vulnerabilities decades long. I would have either walked out or have been forced to resign in disgrace for failing to deliver a workable product by making the call to prioritize memory management efficiency (by presumably using optimal malloc/free everywhere) over literally every other priority when building a piece of software--correctness, maintainability, team retention, extensibility, security, etc. It wouldn't have been the V8 way after all, because it was designed with regions quite extensively because even back then, because most VMs have long realized it is a waste of time to try to do anything more complicated and regions are basically el-cheapo GC for C++.

> The cheapest GC is always no GC at all.

I kind of chuckle here because the Virgil compiler proves you manifestly right and wrong simultaneously; it's a fully GC'd, semi-functional language with a self-hosted compiler written entirely in the language. Incidentally, in a careful style that doesn't allocate a whole lot or create a lot of garbage. It chooses efficient, C-like data structures for IRs and doesn't waste a lot of time on Java-like armored classes because I am sick of wasting my life on that. It bootstraps in less than 200MB of memory, compiling the entire compiler at full optimization in a single whole program compilation. Because I made the dumbest, simplest Cheney moving GC, without even generations, it doesn't even have a write barrier. But because it only allocates 200MB, when bootstrapping it doesn't even GC once. It's basically a giant, single region that is thrown away at process exit, and the mutator doesn't pay the cost of a single write barrier. (And the funny thing is, GC wouldn't even find that much garbage, because only about half of the heap does actually die, primarily because it's mostly IR that is always reachable--it's a whole program compiler, after all). It's basically your mythical perfect GC just because it has enough memory and I am damn well in my rights to spend it as I have.

ncmncm · on April 29, 2022

You seem to be agreeing that not doing even so much as RC, just leaking, is better in this application than any sort of GC. I think that is common, for compilers. In that case, what does your other language do for you that C++ does not?

titzer · on April 29, 2022

It doesn't leak. While I could give you a long rundown of why Virgil is better than C++, in all reality I designed it as a better alternative to Java--statically compiled, polymorphic specialization, delegates and first class functions, tuples, variance on overriding, simpler syntax, properly immutable fields and constants, and algebraic types--its advantages over C++ is that I literally never, ever, have to think about memory management. I don't have to decide regions, reason about errant pointers, mess with custom allocators. I don't have leaks, the object representation is efficient and straightforward, allocation is fast, and if I am a little careful, then I just don't blow a lot of memory on garbage. If I'm not careful then the program just uses more memory and GC's some.

I'm going to time this thread out now because I'd rather continue not thinking about memory management until I cycle back on GC work, and I don't really see any evidence that you're listening. Besides, I've said it better with code and the machine seems to happily hum along to my music.

kragen · on April 30, 2022

I for one have found it very educational; thank you!

ncmncm · on April 29, 2022

This was all enlightening, thank you.

deathanatos · on April 28, 2022

While I also haven't ever had a cycle in my decades as an engineer, it doesn't really answer the parent's question about pauses w.r.t. ref counting. Ref counting, in the general, isn't free.

(Also, in the sense of releasing a cycle, that would be pretty quick, since a typical RC implementation is just going to leak it. Its where a large graph resource graph is successfully ref-counted to 0 that can cause a pause, theoretically, although that's another case I've not seen.)

ncmncm · on April 29, 2022

The usual method is to have strategically placed "weak" references.

deathanatos · on April 29, 2022

… yes (and I know), but that has nothing to do with the point being made here, which is about the compute time required to manage the memory.

anamax · on April 28, 2022

Cycles can happen in lots of ways.

For example, heap allocated closures for functions with non-local references create a cycle when a pointer to the closure is stored in a non-local environment.

ncmncm · on April 29, 2022

Yes, it is and has always been easy to make cycles. My point is that it is almost always about equally easy not to make cycles, if you are paying any attention.

ahtihn · on April 29, 2022

The fact that you have to pay attention makes it significantly harder.

You have to be aware that cycles can cause issues and you have to think about the impact of cycles on whatever you're doing.

ncmncm · on April 29, 2022

So you say. But that does not match my experience. It has always been wholly trivial to know where the "weak references" go.

Ygg2 · on April 29, 2022

Using enough transition proves that difficulty of not making cycle is going to increase very, very fast.

pron · on April 28, 2022

Reference counting and tracing are the two main approaches for garbage collection. See, for example, https://courses.cs.washington.edu/courses/cse590p/05au/p50-b...

While tracing is considered to have better throughput and reference counting to have a lower footprint, as the liked paper shows, they are really on a spectrum of techniques with different tradeoffs.

throwaway894345 · on April 28, 2022

> No, it's not. Garbage collection and reference counting are "automatic memory management.

In strict computer science parlance, refcounting is a kind of GC, but to most people GC refers to "tracing GC" specifically. It's fairly pedantic.

> In a garbage collected language you need to explicitly release your resources.

I assume "resources" here must mean "non-memory resources"? Even still, GC can apply to any kind of resource, it's just most popularly applied to memory (you could have a GC for file handles, for example).

Also, so you know, the user you're replying to knows a thing or two about programming languages.

pjmlp · on April 28, 2022

I advise reading a couple of CS books on the subject.

colejohnson66 · on April 28, 2022

> With reference counting there are no pauses...

Same with GC. "Stop the world" may be a characteristic of a simple GC, but it's entirely possible to have GC that works by "coloring" objects.

> In a garbage collected language you need to explicitly release your resources. (IE, the Dispose pattern that's common in C#)

You only need to implement IDisposable if you are dealing with unmanaged resources (ie. non-GC objects). But it's the same with C/C++: you have to call `free()` or `delete`; The difference is that managed resources are auto-released for you.

zozbot234 · on April 28, 2022

Rust can also use reference counting when necessary. Many Rust novices don't realize this, and run into needless trouble when trying to learn Rust. Rc<RefCell<T>> and Arc<Mutex<T>> are not "unidiomatic" other than in a very weak sense. Sometimes they are even unavoidable.

gwbas1c · on April 29, 2022

They're significantly more time-consuming, for the developer, to work with.

Think of using something like a lambda to start a thread: You have to explicitly clone your Rc<Whatever> into another variable, and then use the second variable in your lambda. The compiler won't implicitly clone your RC<Whatever> for you.

IMO, it's one of the biggest drawbacks in Rust.

throwaway894345 · on April 28, 2022

It's still pretty painful to deal in Rc<...> and friends in Rust. I was expecting it to be a few extra keystrokes around my types, but I still ended up having to deal with the borrow checker when I would put something into an Rc or take it out to pass into some other function. It was nowhere near as easy as a GC or refcounted language.

woodruffw · on April 28, 2022

> I was expecting it to be a few extra keystrokes around my types, but I still ended up having to deal with the borrow checker when I would put something into an Rc or take it out to pass into some other function.

I'd be interested to hear where you experienced difficulty, specifically. `Rc<T>` is `Deref<T>`, so you should never have any problems calling borrowing APIs. Owning APIs require an explicit borrow-and-clone, but that's just two calls (`as_ref().clone()`).

sharpneli · on April 29, 2022

One must also remember to differentiate the deployment targets. Not every piece of software is deployed in your own machine.

Let's take games as an example. You cannot say to users "Just buy more powerful compute/phone". And even if they would you may have literally millions of machines, due to having millions of users. In which case saving 1 dollar worth of machine per user is savings of millions which is worth a lot of developer time. Naturally you as a dev won't really be paying this, so the incentive is not directly there. But the incentive comes from people just not buying a game that runs like crap.

In these cases GC is nice and dandy at start but then you run into the inevitable "Why does my game stutter?", after which object pools are introduced and even then some annoying "leaks" may remain. If the amount of time had been used from the get go to actually think about memory management the whole issue could have been avoided.

native_samples · on April 29, 2022

Yes, but GC and object re-use aren't mutually exclusive.

Sometimes people argue as if they are, which is understandable because the overlap between semi-FP programming styles and GCd languages is fairly strong, and the FP world pushes immutable state as almost a matter of moral principle. If you tried to do functional programming with purely manual memory management you'd see high overheads too simply because there'd be so many small allocations and copies. In practice though, people tend to think harder about in-place mutation when they don't have a GC to clean up after them.

kragen · on April 30, 2022

They aren't mutually exclusive, but if you're using a tracing garbage collector this millennium you're probably using a generational copying collector, and if you're using a generational copying collector you have a write barrier, and the write barrier means that allocating and initializing a new object is less expensive than overwriting all the fields of an existing object, probably by a factor of several. Also, if your reused object is only referenced from the nursery, and it references other objects in the nursery, those objects ought to be collected on a minor collection, but they won't be; instead they'll be promoted, possibly through several generations, until they reach the generation where your reused object is, which will finally allow them to die. Balanced against this you have less frequent minor collections — none at all during times when you're allocating no new objects.

So, with a tracing GC, introducing an object pool will usually make your program slower, not faster. (This is not true with reference counting.)

Sindisil · on April 28, 2022

Yes, both.

Indeed, the fundamental question is whether the costs of automatic memory management (be it tracing GC or refcount GC) are "worth it".

But it's not always as simple as a monetary consideration, and even when it is, the monetary cost often involves way more than just dev time vs cost of computer (e.g., power, cooling, distributed cost over many end users, market applicability because of system requirements).

pkulak · on April 29, 2022

Reference counting is GC. There are no pauses, but incrementing and decrementing a counter on every touch isn’t free either.

ygra · on April 29, 2022

Well, the pauses are somewhat predictable in that they may appear when the count of an object reaches zero. I had a very expensive closing brace in (admittedly not well-written) C++ code once.

pjmlp · on April 28, 2022

Reference counting is a GC algorithm as per any worthwhile CS book on GC algorithms.

xwolfi · on April 28, 2022

Well it depends a bit, in my low latency application where we cant afford GC pauses, we tolerate them from 5am to 5:30 am. We do everything preallocated and if we must tolerate new heap allocation we simply never release (so we have a lot of object pools).

I felt it was hard and exotic at first but now I feel it's quite logical: never free, and have a proper monitoring of your size per unit of information to size the heap right and you're done.

But then it also depends why you dont want GC, us it's because clients cannot tolerate a pause (we had a socket manager creating and releasing garbage that triggered one 50ms GC pause during the last 3 seconds of trading of a particularly crazy trading day and the client threatened to leave), but often they re tolerable if you just stay reasonable with your garbage.

The problem is not the collection in non critical applications, it's the garbage generation getting completely out of hand. In a C# application my team had to micro optimize because of covid-related explosive trading volume, we simply bought gigantic amount of RAM for our 200 users, only allowed GC if CPU was in low usage and run with 80GB memory usage at the end of the day for an order grid that could run with 5GB collected. That s another way: I wish Java 8 had an easier way to simply deactivate GC for us to run with hardware money.

cmrdporcupine · on April 28, 2022

One problem is that GC'd languages also tend to have programming practices that encourage the generation of garbage, while languages with manual stack/heap allocation make you think about it a bit more. The sheer number of String and other lightweight (but existent) instances in a running Java program is astounding, because the language simply doesn't encourage you to be parsimonious with creating them.

For most people's applications this is totally fine. But, yeah, you can end up with a scenario where GC pauses are just totally unacceptable, or where you need absolutely predictable behaviour and then you might resort to radical things like what you're describing here and it starts to feel like you've just chosen the wrong tool for the job.

I say this as a person who wrote an RTB ad server in Java, many moons ago... and never would again.

titzer · on April 28, 2022

> The sheer number of String and other lightweight (but existent) instances in a running Java program is astounding, because the language simply doesn't encourage you to be parsimonious with creating them.

I agree with this. Java promotes a profligate programming style where allocation (and reclamation) are considered free. (They aren't). This is baked into the JDK deeply. E.g. with standard APIs it is not even possible to parse an integer out of a string without allocating memory. Those are language and library design choices that have a huge confounding effect on trying to measure GC.

For example, I wrote the Avrora AVR emulator and sensor network simulator, which is one of the DaCapo benchmarks. It's a CPU emulator for very small devices and uses almost no standard JDK stuff in its core work. As such it allocates very little and has a pretty small heap. It often shows up weird (sometimes obnoxiously small or uninteresting) in papers that study memory management because it really isn't very Java-like in its behavior.

_old_dude_ · on April 28, 2022

> with standard APIs it is not even possible to parse an integer out of a string without allocating memory

Not true anymore :)

https://docs.oracle.com/en/java/javase/17/docs/api/java.base...

titzer · on April 28, 2022

Wonders never cease :)

xwolfi · on April 28, 2022

I d simply take a noop GC in the most recent jvm (I cant yet in my conservative bank), and make people suffer through it in prod until everything is constrained correctly, but you're right, String are banned where I work, and what a pain it has to be to do everything via char buffer but it's taking a dev like maybe 10 minutes per "String" handling instead of 5 second during code writing (to massage the buffer all the way from the input signal to the output destination, be it a log file, a storage structure, a comparator or a map key), so it's a giganormous increase of dev time relatively, but in absolute time it's quite trivial over the last 20 years. Yeah we wasted time per feature on String handling but... like maybe a few hours per feature per dev and we generate billions out of the trading backend so...

Maybe Java had the wrong assumption and really should just have made C-string optional first class citizens or something.

One thing is sure though, our C++ colleagues in the algo team are so slow to deliver, so prone to crazy bugs, so wild and wizardy, so unable to onboard new joiners, they re getting replaced this year by... a brand new Java algo engine... So not sure what to think.

ncmncm · on April 28, 2022

I know exactly what to think: salaries offered for your C++ colleagues are not competitive, so you are collecting competitors' dregs. It can be very hard to come back from that, maybe impossible.

In other shops, the notion of conducting trading in Java, GC'd or no, would get you laughed from the room. But Java coders certainly are cheaper.

xwolfi · on April 28, 2022

Well we do Java for the slow path and fpga for the fast path. C++ for the algo decisions, but even when we do java we jni everything that would be slower than C++. Nobody's laughing, everyone is just profiling and fitting constraints, so far Java worked, we ll switch if we must...

Problem with FPGA is that it s pointless to try to correct client mistakes or validate exchange regulation with it, so it's only for upstreams who have the whole chain clean, it's a slow transition.

cmrdporcupine · on April 28, 2022

See now this all sounds very interesting and exciting. I love both FPGA and C++.

Too bad it's finance :-)

ncmncm · on April 28, 2022

I had a colleague who called it "automated stealing".

I think there is a time threshold above which it defensibly is not, and another below which you would have to agree, but I am not sure of the values.

xwolfi · on April 29, 2022

There's no stealing if all parties consent. If you don't consent, convince others and vote better. Why blame finance when house representatives dont even understand they shouldnt insider trade, something even I, a "stealing automation engineer" would never even think of ?

ncmncm · on April 29, 2022

For the record, I don't think of you as a stealing automation engineer. You perform a valued service for an appreciative employer. (My employer, meantime, provides valued services for just such employers, maybe even yours.)

But your employers take home very large annual bonuses while contributing what, exactly, to society? If they and their sub-ms competitors did not do it, would anyone miss it? Suppose, e.g., trades did not register until 1000ms after entry?

Rhetorical questions.

pjmlp · on April 28, 2022

The City in London is full of people with humour and deep pockets.

xwolfi · on April 28, 2022

I m in Asia, we need the ability to validate 13 different exchange regulations and Java helps more than C++ as far as I can observe in team work, modularity, forcing some sort of okay common practice etc. The guys in the US have it easy with regard to logic so can throughput fast enough in Java to only switch fast clients to FPGA without the need for C++ in between.

The guys in London have nowhere near the volumes of APAC or the US. They can trade with excel if they want.

pjmlp · on April 28, 2022

Might be, doesn't change the fact that plenty of business in finance have moved into GC based languages, reducing their use of C++ into niche use cases.

Just like using C or C++ hasn't taken away that some niche use cases still require coding in Assembly.

It isn't only Java, C#, F#, Haskell and OCaml are also tooling they reach for.

FpUser · on April 28, 2022

Initially Java has been created with totally different application scope in mind comparatively to what it is now. I know it may look like a mess but maybe they should've added new optimized string type in addition to already available.

As for C++ - I do just that. All my servers are C++ and while I will not claim "bug free" I do not really remember when was the last time I've had memory related bug (alloc / boundary / use after free / etc). Modern C++ along with sanitizers helps greatly in this department. All fixed bugs were logical.

I did my share of Java development big while ago and got sick of trying to make it performant enough. In order for me to achieve the goal I've spent way more time caring about memory / resource related issues in Java than I would do in "dangerous languages. Granted it was high performance servers. Doing different type of task in Java could've cost way less aggravation.

initplus · on April 29, 2022

I'm curious, is there a reason why GC'd languages don't use some form of escape analysis so that only allocations that can possibly live beyond the current scope need to be heap allocated?

kragen · on April 29, 2022

Lots of GCed languages do use escape analysis. That's what escape analysis was invented for!

chihuahua · on April 28, 2022

C# has the option of background/concurrent GC. Did that not work for your application?

xwolfi · on April 28, 2022

It was less cpu intensive to buy RAM, so we did that first. We need all 20 cores for all the other stuff the desktop user uses sadly, and we can only give them 3 computers with 20 cores each.

I tried lobbying for threadrippers but... not easy to onboard a new CPU in a giga corporation :(

dragontamer · on April 28, 2022

The cost of memory management is:

1. The cost of free(), which recursively includes the cost of free()ing all "substructures" (ex: linked list free(head) includes the cost of free()ing the entire linked list)

2. The cost of malloc()

Manual memory management has both malloc() and free(). While garbage-collection memory management removes free() entirely, and does everything with malloc() alone (with "malloc" having some kind of mechanism to automatically detect which bits of RAM are safe to re-allocate without free() to guide it).

These costs can be incredibly complicated to figure out (!!!). free()ing a 4MB linked-list for example, would flush out your L1 cache and L2 cache, as well as take up a significant chunk of your L3 cache.

There is also the issue of fragmentation: malloc() and free() are cheap when the program starts, but as the heap gets more and more confounded by malloc()/free() calls, future calls to malloc() and free() can get slower-and-slower.

So there's a degree of:

* "Housekeeping" -- anti-fragmentation mostly, sorting the free()'d pointers and regions into groups more easily malloc'd() in the future.

* Synchronization -- Having all your threads pull from the same pool of memory is easier to think about, but requires multi-threaded synchronization. For long-lasting free() (ex: freeing a 4MB linked list), this could have one thread slowing down other threads when they try to malloc or free themselves.

-----------

Finally, the application code grossly determines what is faster or slower. Many people assume manual memory management is faster, but consider the following example:

Lets say our application-data consists of 1MB of misc. data + 3999MBs of linked-list data (including the internal pointers). Lets also assume 4000MBs is when the garbage collector runs.

Manual memory management would malloc() the linked list until its of 3999MB size, then the free() call would traverse the entire 3999MB structure every time to find the "deletion" pointers.

In contrast, Garbage collection of malloc() the linked list to 3999MB. The free() doesn't exist, but the linked-list will be detected as "dead data" upon the next malloc() call (only the 1MB of misc. data is traversed).

As such: the garbage-collector will never "Traverse" the entire linked list. The garbage collector only traverses the "live data" whenever it runs (which might be the 1MB of misc. data... but if this misc. data is long lived, then a generational garbage collector won't even traverse that!!).

So really, garbage collectors are best when you have small amounts of live data, and large amounts of malloc() calls. In contrast, manual memory management is best when you have lots of live-data (especially "quickly moving" live data, such that the generational-heuristic doesn't help), and relatively little dead-data.

titzer · on April 28, 2022

I've seen internal numbers from $MEGACORP that suggest large-scale C++ applications spend 15-20% of their time doing memory management. Incidentally, the memory fragmentation on those same applications can end up as high as 40%. In GC'd languages fragmentation is typically closer to 10%.

jeffbee · on April 28, 2022

This paper gives 10% for allocation in Google's fleet, and that was before significant efficiency improvements in tcmalloc. https://static.googleusercontent.com/media/research.google.c...

ahh · on April 29, 2022

https://research.google/pubs/pub50370/ gives more up to date information, as does the slide deck here: https://docs.google.com/presentation/d/1taP_gqeRE1XGB2td7f32... (We get down to 4% on the graph on slide 3.)

It's far below 10% now (thanks to the hard work of a lot of people), and Google actually gotten to the point where we chose to spend _more_ time in malloc to make overall efficiency higher.

titzer · on April 28, 2022

Thanks for the link. I am going by ballparks I've heard internally, from less than 3 years ago (so, with newer allocators), and unfortunately can't provide a reference. It would be great if there was more public information from large-scale studies.

ncmncm · on April 28, 2022

You can spend literally as much time as you like on memory management. Google has chosen 20%. They have also, by banning exceptions, chosen to spend a very substantial part of their time on checking for error status, and on two-phase initialization and cleanup.

I most typically choose 0% post start-up, with 0% fragmentation. For a different problem domain, I might choose differently.

titzer · on April 28, 2022

So you're running programs without dynamic memory allocation? That makes sense for a certain domain, like microcontrollers and embedded systems, but most programs do end up dynamically allocating memory, which leads to non-zeroes for both of those things.

ncmncm · on April 28, 2022

My programs mmap thousands of hugepages at startup, then run for years. Well, not all of them.

jeffbee · on April 28, 2022

The paper specifically discuss the cases of high mutation rates and small heaps, because it is a pathological case for the trendy new low-pause GCs in the JVM.

kevin_thibedeau · on April 28, 2022

If deallocating large datasets is a performance concern then you should adopt a different allocation strategy like arenas.

dehrmann · on April 29, 2022

Interesting. The cost of free() could be an advantage for Java in some situations because it can GC without pauses in a separate thread.

adamrezich · on April 28, 2022

in the "jai" programming language there's this thing called Temporary_Storage, which is basically an arena allocator, which lives in the "context", which is something that you can "push" to to make it different per-thread if needed, or anything else (like per-request in a web server). you can allocate anything you want in Temporary_Storage, and then when you call reset_temporary_storage() it does what is says and resets it. for a video game you will call this every frame, but for different contexts like web development you could reset this after resolving a request or whatever. you could use this exact same system in C++ even though there's no built-in language feature for it, but it probably wouldn't play nice with RAII or whatever (unless your data structures were aware of the existence of your hand-rolled Temporary_Storage system and knew not to call free() in the destructor).

after learning about the costs associated with using malloc()/free(), then being exposed to long-term arena allocators, then being exposed to a "temporary storage" system like this, when people talk about garbage collection being absolutely necessary for developer productivity or whatever plus also a way of avoiding malloc()/free() costs, I can't help but wonder if anyone who promotes garbage collection as a borderline necessity has ever tried something like Temporary_Storage. I find it to be pretty wild that instead of figuring out better memory management models like this, we collectively decided to punt on the whole issue, declare it an unsolvable problem, and decide that increasingly complex garbage collection systems are the only way forward. with Temporary_Storage in place, you can do whatever small allocations you need at any point in time (parse an integer from a string, etc.), and you don't have to worry about free()ing the result, because it's taken care of for you, without Actually Being Garbage Collection.

this entirely solves both "1." and "2." above, without the uncertainty of not knowing when exactly the garbage created by the thing you're free()ing and all of its substructures will be cleaned up. instead, you know exactly when it's going to be cleaned up, once per frame at the end of the frame (or whenever you call reset_temporary_storage()).

everything below the line break above sounds crazy to me. why are your linked list nodes all being malloc()d randomly on the heap, instead of in an arena allocator that you can also just flush all at once that's what you want to do? why would you ever traverse a complex nested structure to free it all, when you could just free() the entire hunk of memory at once (or, if you're going to be reusing that chunk of memory, just reset the arena allocator pointer to the beginning)?

dragontamer · on April 28, 2022

There's a time and place for all techniques. Obviously, the "most efficient" form of memory allocation is none-at-all. If you can statically allocate everything at the beginning, then that wins.

My example of 3999MBs of mallocs() followed by garbage collection is overly simplified. In the real world, if you knew such a common pattern existed, then you'd use a 3999MB vector/array instead. (or if a linked-list were really needed, perhaps Knuth's "Dancing Links" example is best).

--------

The issue is that the programmer nay not know the patterns of malloc/free for their code, and yet still wants O(1) performance for insertions / deletions to the middle of the chain. EX: the "Rope" data-structure for text editors is a glorified linked list. Each link has variable length, and represents a "cut" portion of the text that the user has entered. You can move the cursor and "insert" new text anywhere in the rope rather easily. For example, the sentence "This is a sample sentence" could be represented as: "Thi" -> "s is a sam" -> "ple sente" -> "nce".

If the user changes it to "This is a silly sample sentence", you'd split "s is a sam" into "s is a " -> "silly " -> "sam" ->...

"moving" the cursor left-or-right traverses the linked list. Since the "cursor" of the text-editor only exists in one location, its not very difficult to handle.

> when you could just free() the entire hunk of memory at once

In this rope example, you don't know which bits of the text will be free'able ahead of time. Only the human knows what parts they plan to delete.

kragen · on April 29, 2022

That's a piece chain, not a rope. In a binary rope your sample sentence might be (("Thi" . "s is a sam") . ("ple sente" . "nce")) and get edited to (("Thi" . ("s is a " . ("silly " . "sam"))) . ("ple sente" . "nce")), except that you'd probably consolidate some of those small leafnodes in practice. (The (A . B) thing is a Lisp notation for a binary tree node whose left child is A and whose right child is B.)

adamrezich · on April 28, 2022

but it's completely valid to have something like a 4GB linked list, if that's what your application calls for. there's nothing wrong with having a linked list whose nodes you allocate in an arena and then free all at once (or mark as "free" & reuse yourself if needed) later! I'm not familiar with how ropes work beyond the surface level so I can't comment on that, but in general, it's helpful to recognize that memory allocation and struct initialization don't have to be coupled. it's only when you go "full RAII" that you have the recursive-traversal free() issue. if you sidestep RAII (or override operator new or whatever), then you can allocate memory in pools or arenas or whatever, instantiate structs within that memory, and then later free the whole thing at once. the whole idea of garbage collection seems to me to be predicated upon the idea of not wanting to have to worry about what is going on "under the hood" in memory and instead wanting to be able to say, with code, "get me a new Foo—I don't care where you put it, figure it out yourself—and when I'm done with it, get rid of it as you see fit." and it's easy to see how one gets there, because languages like C++ encourage you to use new and delete to construct and destruct objects, allocating them in memory wherever. but by putting a little bit of thought into your program's structure and how the memory is going to be handled under the hood, you can not only get better performance but also just a better understanding about what the computer is doing with the code you write.

I used C# before I learned C/C++ and it was great to muddle my way through things with garbage collection until it wasn't. then when I learned C & C++ it was like ok, malloc()/free()/new/delete all the things, sure. but there's plenty of middle ground between the two extremes that is often overlooked when it comes to memory management, in favor of "garbage collection, yes or no?"

Karrot_Kream · on April 29, 2022

Yeah but then you're allocating 4 GB on app startup. Now sure, most of that memory will be unused and so the OS has potential to keep the memory "unallocated" until you write to a particular page, but this depends on the OS and architecture's implementation of paging. In embedded systems where the system is being used for only a single thing (or mostly a single thing), arena allocation is common. In desktop applications where the user can run heterogeneous workloads or long-lived network applications where the application could be allocating memory on every request, you run up against memory limits quite fast. If you squint closely, you can see that Java's NIO libraries basically sidestep the Java GC altogether and manage buffers in "native memory" which are essentially small arena allocators. There's certainly room for more than just "yes GC" or "no GC".

Regardless it's good to know about arena allocation because it's one of many effective techniques in a toolbelt. If you find yourself writing a CLI utility in an unmanaged language, it's often simplest to just allocate and never deallocate, because the application will only run for a small amount of time and the OS will deallocate resources on app exit.

_gabe_ · on April 29, 2022

You can allocate 4GB of virtual memory which will ensure that it doesn't get allocated until you actually hit a new page. Heck, you can, and OurMachinery has a great blog post on this, allocate 100GB of virtual memory and then use pointers as immutable ids and your app will only use as much physical memory as needed and the OS will take care of the rest.

https://ourmachinery.com/post/virtual-memory-tricks/

dragontamer · on April 29, 2022

The main problem with more precise strategies is that they only work in precise situations.

New/delete and RAII and malloc/free (be it either unique_ptr source/sink pattern or RAII pattern) are basically universal techniques, albeit at a performance cost.

There are many other strategies but they require more thought.

-------

Personally, I'm a big fan of vector + swap with the end, which covers most of my common needs.

When looking at linked structures: linked list, trees, and graphs... It's often common that your specific application has patterns that a more specific strategy (arena allocation, stack allocation/bump allocation) works out.

But there is also the chance it doesn't work out, and prototyping with generic methods (new/delete/RAII or garbage collection) is a good first step.

avgcorrection · on April 28, 2022

Linked lists? Not a useful data structure in application programming when using imperative languages. And if you are using a functional language then most people agree that a GC is necessary (FP without GC sounds like a research problem).

ncmncm · on April 28, 2022

FP dependence on GC is 100% cultural. FP languages come out of academia where not depending on GC makes you unwelcome at the lunch table. (You would be suspected of being about to mention performance.)

remexre · on April 28, 2022

That is absolutely not my experience with academic PL work, as someone in that field. From my perspective, it seems like we're currently undergoing a renaissance of more principled approaches to resource management than GC (especially for non-memory resources), including work on substructural type systems, dynamic resource reuse (I'm thinking of FBIP here), and new ways to combine effects and resource management.

ncmncm · on April 28, 2022

That would be a welcome development.

avgcorrection · on April 28, 2022

GC makes a lot of sense if you want to provide ergonomic pure functional programming. Pure functional data structures are implemented using the idea that old versions can share nodes. With dynamic reference sharing you need something like reference counting at the least, which is a form of GC. But you probably would want something more than that since saying no to cycles would be too onerous.

No-GC could be used in cases where you are just transforming one list or whatever and you don’t need to keep around any references to older nodes. That’s an optimization which is used by a research language which focuses on effects (whatever its name was). But it does use GC (reference counting).

Then you have linear types and all of that. But the intent seems to be to want to complement regular functional programming, not to invent a new GC-less FP paradigm.

You claim that it is merely “cultural” makes it seem to me that you just want to throw some weird shade and that you don’t have any particular technical argument in mind.

dragontamer · on April 28, 2022

std::map on C++ is a linked-tree structure, often a Red-and-black tree that very much looks like a "linked list" (or close to it) from a memory-management perspective.

If you delete a std::map taking up 4GB of RAM, the destructor will have to traverse all 4GBs to find all the pointers and return them to the heap.

avgcorrection · on April 28, 2022

std may not always have the best of datastructures even though they have generic-sounding names. A standard library could either be a testament to great foresight or just a list of regrets.

Chandler Carruth has discouraged its use (Google talk 2014) because it’s a cache killer.

I wouldn’t use LinkedHashMap in Java unless I needed to preserve the order.

astrange · on April 28, 2022

How does the GC know there aren't any cyclic references in there to avoid tracing all the pointers the same way?

nb if it used an arena allocator then it could simply not do anything on free as long as it was acceptable to reclaim the memory later.

dragontamer · on April 28, 2022

> How does the GC know there aren't any cyclic references in there to avoid tracing all the pointers the same way?

Mark-and-sweep, as well as generational-collectors, walk through all references starting from the root of all variables. In college-level toy garbage collectors, you usually use 1-bit of the pointers to mark where your algorithm has been to (or not). EDIT: The top 16-bits of x86_64 systems are often ignored, because x86 CPUs (AMD Ryzen or Intel i7 / Xeons) only have 48-bit physical memory space, and are a common set of bits used for this "marking" process)

Its basically just a depth-first-search or breadth-first-search over the graph of memory pointers. Its pretty simple in concept, but lots of details depending on performance considerations.

initplus · on April 29, 2022

Cyclic references are an issue for ref counting as the cycle prevents the count ever reaching 0. Not an issue for mark & sweep where if an allocation isn't accessible from the GC root it's safe to free.

astrange · on April 29, 2022

The question isn't whether they're "an issue" for GC, it's whether the GC algorithm has to read in all the memory in the first place and therefore take longer to GC it/disturb caches.

belmarca · on April 29, 2022

As sibling says. If it cannot be accessed starting from any of the GC roots, it is by definition unreachable and will be "cleaned up". What is meant by "cleaned up" varies with GC implementations.

dehrmann · on April 29, 2022

Someone else called out std::map is linked, but unlinked hash tables often store entries to wrap the key and value, so it's a pretty common problem.

ape4 · on April 28, 2022

Of course, there are benefits of GC: less development time, fewer null pointer exceptions, double free's and other other mistakes, etc

kragen · on April 29, 2022

I don't see how GC reduces null pointer exceptions. If anything, GC encourages the use of pointers (as opposed to nested child objects), which increases null pointer exceptions.

ape4 · on April 29, 2022

Sometimes I code:

    {
        MyObject obj = new MyObject(); // Make an empty but non-null thing
        ... do stuff
        obj = new MyObject("real data");
    }

This would be bad in C++ of course, but in a GC language it avoids nulls.

kragen · on April 29, 2022

In either Java or C++ it would be be better to write:

    {
        ... do stuff
        MyObject obj = new MyObject("real data");
    }

Except that in C++ to get the Java semantics you need an extra * in there.

Where applicable, this converts the run-time NullPointerException into a compile-time error if you're referencing the value of obj before you've initialized it, which is an improvement. By contrast, your suggested change converts the run-time NullPointerException into silently incorrect behavior, which seems strictly worse in any context I've ever seen.

You didn't say why your suggested change would be bad in C++, but I'm guessing you're thinking that it's because it would leak the initial empty object, which is true if you use a bare pointer. But in the more usual case where you just construct the empty object (as a "nested child object") in your stack frame

    MyObject obj;
    ... do stuff
    obj = MyObject("real data");

there's no leak and also no null pointers because no pointers at all. It's still bad in the same way your example is bad, though: if you read obj before you initialize it you still get incorrect results.

(There's also no leak in C++ if you use a smart pointer.)

josephcsible · on April 29, 2022

That's bad in Java too: http://thecodelesscode.com/case/115

ape4 · on April 29, 2022

I see wise Bazen's point. In theory that just replaces `if (obj != null) { ... }` with `if (obj.isInitialized()) { ... }` or similar nonstandard check. But it often makes for nicer and safer code.

josephcsible · on April 29, 2022

That makes the code less safe, because now if you forget an isInitialized check, your code will silently do the wrong thing and start corrupting state, instead of failing early with a detailed stack trace and error.