Speaking as someone who has benchmarked the shit out of such things, I can tell you it is most certainly not free. Building and maintaining the live graph is an insane task to do from the perspective of performance, and it shows up in CPU metrics.
For one particular task we went as far as writing the equivalent in C and the single-threaded C version was twice as fast as the Java version in wall time and 20 times as fast in total CPU usage.
Love Java for the productivity. But don’t be fooled that it is remotely optimal. Just realize that you aren’t going to rewrite that thing in C or Rust, and be content. Java is a nice Mercedes with a warranty and a service plan. You can’t afford a Bugatti. But don’t pretend the Mercedes is faster.
While I agree Java is almost never the fastest, I'll still object it's extremely hard to meaningfully benchmark these things. It's hard to benchmark even within the same language.
There are so much hidden state in the JVM. Like are you running out of TLAB space? How aggressively has the code been optimized? Does JVM elide any object creations? It may be tempting to try to compensate for these things, to isolate the benchmark from the statefulness of the JVM, but that is also isolating the results from any sort of practical applicability because real world code exists in the context of a stateful JVM.
It's also difficult to generalize benchmarks of allocation costs alone because it's rare for code to only be allocating stuff and this isn't something anyone is optimizing for. The GC will choke if you don't give it some breathing room. You'll get resource contention that realistically wouldn't be there in real-world scenarios because real code doesn't (typically) just allocate objects for minutes on end.
OP>Using something like a reference count adds contention to this high-concurrency data structure, slowing it down.
So instead we'll have an engine that walks the DAG all the time; that trashes the CPU caches; in a language where there are no nested structures or stack allocations, so that what might be a single allocation in C or C++ or Rust (if an allocation happened at all) ends up being 28 for a single API call. Then you've either got to traverse the DAG completely every cycle, or you've got to write somewhere else that the object has changed and that thing has the same (or worse) access cost as a reference count. I know that there are very clever people working on very clever algorithms, but at this point, most of these algorithms have crossed into optimizations that work great 99% of the time, and it is the 1% that brings your app down in production.
And I'm not even complaining: if I have to architect my application so that it submits to the rules of the GC algo just so I can use Java, well that's often a fair trade. But I'm not pretending it's free.
OP>Threads have to fight over updating that counter,
I've done a lot of work that involved exactly this, and measured it, and they just don't: most of the time they just don't because it is incredibly unlikely for two threads to be hitting the same object at the same time. Some languages (Swift) even identify that a refcount doesn't need to be taken at all. If you do have a "central" object with such contention then you switch to a different locking model.
Speaking as someone who has benchmarked the shit out of such things, I can tell you it is most certainly not free. Building and maintaining the live graph is an insane task to do from the perspective of performance, and it shows up in CPU metrics.
For one particular task we went as far as writing the equivalent in C and the single-threaded C version was twice as fast as the Java version in wall time and 20 times as fast in total CPU usage.
Love Java for the productivity. But don’t be fooled that it is remotely optimal. Just realize that you aren’t going to rewrite that thing in C or Rust, and be content. Java is a nice Mercedes with a warranty and a service plan. You can’t afford a Bugatti. But don’t pretend the Mercedes is faster.