> Regarding cache-line size, sysctl on macOS reports a value of 128 B, while getconf and the CTR_EL0 register on Asahi Linux returns 64 B, which is also supported by our measurements.
The cache must be physically organized as 64-byte lines. Cache-line size matters most to software for two things:
- Architectural interfaces like DC CVAU (I think, I don't really know aarch64). These don't necessarily have to reflect physical cache organization; cleaning a "line" could clean two physical lines.
- Performance. The only thing you really care about is the behavior on stores and on load misses, for avoiding false-sharing / cache-line-bouncing problems.
It's possible either that they think 128-byte lines will be helpful for performance and hope they can switch over after legacy software goes away, seeding their Mac ecosystem with 128-byte lines now, or that 128-byte-line behavior actually does offer some performance benefit and they have a mode that basically gangs two lines together (the Pentium 4 had something similar, IIRC) so it has the performance characteristics of a 128-byte line.
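For reference, here is a minimal sketch (not from the paper) of querying the cache-line size the three ways mentioned in the quote. It assumes glibc for `_SC_LEVEL1_DCACHE_LINESIZE` and that the kernel permits (or emulates) userspace reads of CTR_EL0:

```cpp
// Minimal sketch (not from the paper): three ways the reported cache-line
// size can be read, matching the quote above.
#include <cstdint>
#include <cstdio>
#if defined(__APPLE__)
#include <sys/sysctl.h>
#endif
#if defined(__linux__)
#include <unistd.h>
#endif

int main() {
#if defined(__APPLE__)
    int64_t line = 0;                      // macOS reports 128 on Apple Silicon
    size_t len = sizeof(line);
    if (sysctlbyname("hw.cachelinesize", &line, &len, nullptr, 0) == 0)
        std::printf("sysctl hw.cachelinesize: %lld B\n", (long long)line);
#endif
#if defined(__linux__)
    // What `getconf LEVEL1_DCACHE_LINESIZE` reports (64 on Asahi Linux).
    std::printf("sysconf L1 dcache line: %ld B\n",
                sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
#endif
#if defined(__aarch64__)
    // CTR_EL0.DminLine (bits 19:16) is log2 of the smallest D-cache line size
    // in 4-byte words, so 64-byte lines show up as DminLine = 4.
    uint64_t ctr;
    asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
    std::printf("CTR_EL0 DminLine: %u B\n", (1u << ((ctr >> 16) & 0xF)) * 4);
#endif
}
```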
Cool paper! The authors use the fact that the M1 chip supports both ARM's weaker memory consistency model and x86's total order to investigate the performance hit from using the latter, ceteris paribus.
They see an average of 10% degradation on SPEC and show some synthetic benchmarks with a 2x hit.
This comment is a two-sentence summary of the six-sentence Abstract at the very top of the linked article. (Though the paper claims 9%, not 10% -- to three sig figs, so rounding up to 10% is inappropriate.)
Also -- 9% is huge! I am kind of skeptical of this result (haven't yet read the paper). E.g., is it possible the ARM chip's TSO implementation isn't optimal, giving weaker relative performance than a TSO-native platform like x86?
> An application can benefit from weak MCMs if it distributes its workload across multiple threads which then access the same memory. Less-optimal access patterns might result in heavy cache-line bouncing between cores. In a weak MCM, cores can reschedule their instructions more effectively to hide cache misses while stronger MCMs might have to stall more frequently.
So to some extent, this is avoidable overhead with better design (reduced mutable sharing between threads). The impact of TSO vs WO is greater for programs with more sharing.
> The 644.nab_s benchmark consists of parallel floating point calculations for molecular modeling. ... If not properly aligned, two cores still share the same cache-line as these chunks span over two instead of one cache-line. As shown in Fig. 5, the consequence is an enormous cache-line pressure where one cache-line is permanently bouncing between two cores. This high pressure can enforce stalls on architectures with stronger MCMs like TSO, that wait until a core can exclusively claim a cache-line for writing, while weaker memory models are able to reschedule instructions more effectively. Consequently, 644.nab_s performs 24 percent better under WO compared to TSO.
Yeah, ok, so the huge magnitude observed is due to some really poor program design.
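Not the paper's code, but a minimal sketch of the shape of the problem: two threads hammering counters that land on the same cache line versus counters padded onto their own lines (via `alignas(128)` here; `std::hardware_destructive_interference_size` is the portable constant). Per the quoted passage, TSO has to wait for exclusive ownership of the bouncing line, while WO can overlap more of the misses.

```cpp
// Sketch of false sharing (illustration, not the 644.nab_s benchmark): two
// threads increment adjacent counters. If the counters share a cache line,
// the line bounces between cores on every store; padding each counter to its
// own line (alignas) removes the sharing.
#include <atomic>
#include <thread>

struct Padded {
    // 128 covers both the 64 B line size and Apple's reported 128 B value.
    alignas(128) std::atomic<long> value{0};
};

Padded counters[2];  // with alignas: one line each; without: likely shared

void worker(int i) {
    for (long n = 0; n < 100'000'000; ++n)
        counters[i].value.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::thread a(worker, 0), b(worker, 1);
    a.join();
    b.join();
}
```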
> The primary performance advantage applications might gain from running under weaker memory ordering models like WO is due to greater instruction reordering capabilities. Therefore, the performance benefit vanishes if the hardware architecture cannot sufficiently reorder the instructions (e.g., due to data dependencies).
Read the thing all the way through. It's interesting and maybe useful for thinking about WO vs TSO mode on Apple M1 Ultra chips specifically, but I don't know how much it generalizes.
My understanding is that x86 implementations use speculation to be able to reorder beyond what's allowed by the memory model. This is not free in area and power, but allows recovering some of the cost of the stronger memory model.
As TSO support is only a transitional aid for Apple, it is possible that they didn't bother to implement the full extent of the optimizations possible.
I’m not an expert… but it seems like it could be even simpler than program design. They note false sharing occurs due to data not being cacheline aligned. Yet when compiling for ARM, that’s not a big deal due to WO. When targeting x86, you would hope the compiler would work hard to align them! So the out of the box compiler behavior could be crucial. Are there extra flags that should be used when targeting ARM-TSO?
For example, modern x86 architectures still readily out-perform ARM64 in performance-engineered contexts. I don’t think that is controversial. There are a lot of ways to explain it e.g. x86 is significantly more efficient in some other unrelated areas, x86 code is tacitly designed to minimize the performance impact of TSO, or the Apple Silicon implementations nerf the TSO because it isn’t worth the cost to optimize a compatibility shim. TSO must have some value in some contexts, it wasn’t chosen arbitrarily.
Apple Silicon is also an unconventional implementation of ARM64, so I wonder the extent to which this applies to any other ARM64 implementation. I’d like to see more thorough and diverse data. It feels like there are confounding factors.
I think it is great that this is being studied, I’m just not sure it is actionable without much better and more rigorous measurement across unrelated silicon microarchitectures.
The programs that see the most benefit of WO vs TSO are poorly written multithreaded programs. Most of the software you actually use might be higher quality than that?
> TSO must have some value in some contexts, it wasn’t chosen arbitrarily.
Ehhh. I think they might have just backed themselves into it? I believe Intel initially claimed SeqCst but the chips never implemented that and the lack was observable. TSO happened to accurately describe the existing behavior of early multicore Intel chips and they can't exactly relax it now without breaking existing binaries.
Google's AI slop claims Intel published something vague in 2007, and researchers at Cambridge came up with the TSO name and observation in 2009 ("A Better x86 Memory Model: x86-TSO").
Intel initially claimed Processor Ordering, which, IIRC, allows processors doing independent reads of independent writes (IRIW) to observe different orderings. This is slightly weaker than TSO.
In practice Intel never took advantage of this and, given the guarantees provided by the memory barriers, it was hard to formally recover SC, so Intel slightly strengthened it to TSO, which is what was actually implemented in hardware anyway.
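To make the IRIW shape concrete, here is a hedged little litmus-test sketch in C++ (an illustration, not a proof of any hardware's behavior):

```cpp
// IRIW litmus test. Two writers store to independent variables; two readers
// read them in opposite orders. The outcome where the readers disagree about
// which store happened first (r1==1,r2==0 and r3==1,r4==0) is forbidden under
// TSO; with seq_cst operations the C++ model forbids it too, so this should
// never print. Weakening the loads to memory_order_acquire is roughly the
// extra freedom Processor Ordering had on paper (the C++ model then allows
// the outcome, though actual x86 hardware still won't show it).
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

int main() {
    for (int trial = 0; trial < 10000; ++trial) {
        x = 0; y = 0;
        std::thread w1([] { x.store(1, std::memory_order_seq_cst); });
        std::thread w2([] { y.store(1, std::memory_order_seq_cst); });
        std::thread rA([] { r1 = x.load(std::memory_order_seq_cst);
                            r2 = y.load(std::memory_order_seq_cst); });
        std::thread rB([] { r3 = y.load(std::memory_order_seq_cst);
                            r4 = x.load(std::memory_order_seq_cst); });
        w1.join(); w2.join(); rA.join(); rB.join();
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0)
            std::printf("readers disagreed on trial %d\n", trial);
    }
}
```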
I don't think Intel ever claimed SC since their first CPU with built-in support for cache coherency (it was the PPro, I think?); before that the memory model was not well defined and was left to external chips.
The Apple M4 CPU is pretty much king in terms of single-threaded performance. In multithreaded workloads the M4 Ultra of course loses against extreme-core-count server CPUs. But I think it's wrong to say that x86 readily outperforms ARM64. Apple essentially dominates in all the CPU segments they are in.
But x86_64 does outperform ARM64 in high-performance workloads, and high-performance workloads are not single-threaded programs. Maybe that changes if Apple one day decides to manufacture a server CPU, which I believe they will not, since they would have to open their chips to Linux. OTOH, server aarch64 implementations such as Neoverse or Graviton are not as good as x86_64 in terms of absolute performance. Their core design cannot yet compete.
> Maybe that changes if Apple one day decides to manufacture a server CPU, which I believe they will not, since they would have to open their chips to Linux.
They already are open enough to boot and run Linux, the things that Asahi struggles with are end-user peripherals.
> OTOH, server aarch64 implementations such as Neoverse or Graviton are not as good as x86_64 in terms of absolute performance. Their core design cannot yet compete.
These are manufactured on far older nodes than Apple Silicon or Intel x86, and it's a chicken-and-egg problem once again - there will be no incentive for ARM chip designers to invest in performance as long as there are no customers, and there are no customers as long as the non-Apple hardware has serious performance issues and there is no software optimized to run on ARM.
> They already are open enough to boot and run Linux, the things that Asahi struggles with are end-user peripherals.
That's for entertainment and for geeks such as ourselves but not realistically for hosting a service in a data center that millions of people would depend on.
> These are manufactured on far older nodes than Apple Silicon
True, but I don't think this is the main bottleneck, though perhaps it contributes. IMO it's the core design that is lacking.
> there will be no incentive for ARM chip designers to invest in performance as long as there are no customers
Well, AWS is hosting a multitude of EC2 instances on Graviton4 (Neoverse V2 cores). This implies that there are customers.
> Well, AWS is hosting a multitude of EC2 instances on Graviton4 (Neoverse V2 cores). This implies that there are customers.
AWS has a bit of a different cost-benefit calculation, though. For them, similar to Apple, ARM is a hedge against the AMD/Intel duopoly, and they can run their own services (for which they have ample money for development and testing) far cheaper because the power efficiency of ARM systems is better than x86's. And just as AWS in its early days started off as Amazon selling off spare compute capacity, they expose to the open market what they don't need.
Sure, there's a different cost-benefit calculation. My argument was that there is an incentive to optimize for ARM64 because that translates to $$$. It's not only Amazon but Oracle and Microsoft too.
> That's for entertainment and for geeks such as ourselves but not realistically for hosting a service in a data center that millions of people would depend on.
Why not? Form factor is an issue, but you can easily fit a few Mac Pros in a couple of Us. Support is generally better than with some HP or Dell servers.
Are you serious? Maybe you're just not aware of how such businesses are run - Linux is not officially supported by Apple, and someone has to take the liability when something goes wrong, whether you lose your data or your CPU melts down or whatever.
Do you think HP or Dell will take liability? Tell me you have never dealt with any large OEM without telling me you have never dealt with any large OEM. No way will they take any responsibility for loss of life, data loss, or literally anything at all. The best they do is send some cannon fodder to replace the hardware if it fails. Perhaps it's different if you have a few hundred thousand of their devices running, but my experience with small operations is that it's basically impossible to deal with them.
You're misrepresenting what CPUs actually do, and the opaque term "high-performance workloads" does not help, either. M-class chips have 256-bit (M4) and 512-bit (M3 Max) memory buses per "socket", and as much as 1024-bit total in the M2 Ultra, which is significantly wider than the 64-bit and 128-bit DDR5 buses you get in x86 CPUs. For example, my relatively modern datacenter AMD EPYC 8434PN CPU (based on Zen 4c cores) is a six-channel DDR5 part, effectively 384-bit at 200 GB/s bidirectional bandwidth. Apple Silicon beats it by a factor of 5x. You can get somewhat better with Turin, but not by much, and at a perhaps unreasonable premium.
Now, like with everything in life, there are of course highly specialised datapaths like AVX-512, but then again these only contribute towards single-threaded performance, and you yourself said that "High-performance workloads are not single-threaded programs." As your compute network grows larger, the implementation details of the memory fabric (NUMA, etc.) become more pronounced. Suffice to say, the SoC co-packaging of CPU and GPU cores, along with some coprocessors, did wonders for Apple Silicon. Strix Halo exists, but it's not competitive by any stretch of the imagination. You could say it's unfair, but then again, the AMD MI300A (LGA6096 socket) exists, too! Do we count 20k APUs that only come in eights, bundled up in proprietary Infinity Fabric-based chassis, towards "outperforming ARM64 in high-perf workloads"... really? Compute-bound is a far cry from high-performance, where the memory bus and the idiosyncrasies of message-passing are king as the number of cores in the compute network continues to grow.
Memory bandwidth is just a marketing term for Apple at this point. Sure, the bus is capable of reaching that bandwidth, but how much can your code actually use? You'd be mistaken if you think the CPU can make use of all that bandwidth, or even the GPU!
Are you well-read enough on the platform to attest that it requires no manual code optimisation for high-performance datapaths? I'm only familiar with the Apple Silicon-specific code in llama.cpp, and not really familiar with either Accelerate[0] or MLX[1] specifically. Have they really cracked homogeneous computing, so that you could use a single description of a computation and have it emit efficient code for whatever target in the SoC? Or are you merely referring to the full memory capacity/bandwidth being available to the CPU in normal operation?
It's solely dependent on the workload's memory access patterns. The higher you go in thread count, the more you're constrained by contention, caches, etc. The paper in OP is demonstrating how relatively subtle differences in the memory model are leading to substantial differences in performance on actual hardware. The same as having lots of FLOPS on paper doesn't necessarily mean you'll get to use all that compute, if you're waiting on memory all the time. M-series processors have packaging advantage that is very hard to beat, and indeed, is yet to be beat—in consumer and prosumer segments.
See my reply to the adjacent comment; hardware is not marketing, and LLM inference stands witness to that.
Really, 1TB/s of memory bandwidth to and from system memory?
I don't believe it, since that's impossible from a HW-limits PoV - there's no DRAM that would allow such performance, and Apple doesn't design their memory sticks ...
Their 512-, 768- or 1024-bit memory interface is also nothing special, since it is not designed by them, nor is it exclusively reserved to Apple. Intel has it. AMD has it as well.
However, regardless of that, and regardless of how you're skewing the facts, I would be happy to see a benchmark that shows, for example, a sustained load bandwidth of 1 TB/s. Do you have one? I couldn't find it.
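For what it's worth, this kind of number is easy to approximate yourself. A rough sketch (not STREAM, and not from anything cited in this thread): every hardware thread streams through its own buffer and total bytes read are divided by wall time; on most machines the result lands well below the theoretical bus figure, which is rather the point being argued here.

```cpp
// Rough sustained-load-bandwidth sketch. Buffer sizes, NUMA placement and
// compiler vectorization all matter, so treat the result as a lower bound
// rather than a peak figure.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const unsigned threads = std::max(1u, std::thread::hardware_concurrency());
    const size_t elems = 128u * 1024 * 1024 / sizeof(uint64_t);  // 128 MiB per thread
    std::vector<std::vector<uint64_t>> bufs(threads, std::vector<uint64_t>(elems, 1));
    std::vector<uint64_t> sums(threads, 0);

    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            uint64_t s = 0;
            for (uint64_t v : bufs[t]) s += v;  // pure sequential loads
            sums[t] = s;                        // keep the work observable
        });
    for (auto& th : pool) th.join();
    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();

    double bytes = double(threads) * elems * sizeof(uint64_t);
    std::printf("~%.1f GB/s sustained load bandwidth (checksum %llu)\n",
                bytes / secs / 1e9, (unsigned long long)sums[0]);
}
```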
> You can get somewhat better with Turin
High-end Intel/AMD server-grade CPUs can achieve a system memory bandwidth of 600-700GB/s. So not somewhat better but 3x better.
> The Power10 processor technology introduces the new OMI DIMMs to access main memory. This allows for increased memory bandwidth of 409 GB/s per socket. The 16 available high-speed OMI links are driven by 8 on-chip memory controller units (MCUs), providing a total aggregated bandwidth of up to 409 GBps per SCM. Compared to the Power9 processor-based technology capability, this represents a 78% increase in memory bandwidth.
And that is again a theoretical limit, which usually isn't that interesting; what matters is the practical limit the CPU is able to hit.
You're right, I looked it up; the hardware limit is actually 800 GB/s for the M2 Ultra. You're also right that the actual bandwidth in real workloads is typically lower than that, due to the aforementioned idiosyncrasies in caches, message-passing, prefetches (or lack thereof), etc. The same is the case for any high-end Intel/AMD CPU, though. If you wish to compare benchmarks, the single most relevant benchmark today is LLM inference, where M-series chips are a contender to beat. This is almost entirely due to the combination of high-bandwidth, high-capacity (192 GB) on-package DRAM available to all CPU and GPU cores. The closest x86 contender is AMD Strix Halo, and it's only somewhat competitive in high-sparsity, small-MoE setups. NVIDIA were going to produce a desktop one based on their Grace superchip, but it turned out to be a big nothing.
Now, I'm not sure whether it's fair to compare Apple Silicon to AMD's Turin architecture, where 600 GB/s is theoretically possible, considering that at this point you're talking about a 5K-euro CPU with a smidge under 600 W TDP. This is why I brought up Siena specifically, which gives comparable performance in a comparable price bracket and power envelope. Have you seen how much 12 channels of DDR5-6400 would set you back? The "high-end AMD server-grade" system, to borrow your words, would set you back 10K at a minimum, and it would still have zero GPU cores, and you would still have a little less memory bandwidth than a three-year-old M2 Ultra.
I own both a Mac Studio and a Siena-based AMD system.
There are valid reasons to go for x86, mainly its PCIe lanes, various accelerator cards, MCIO connectivity for NVMe, hardware IOMMU, SR-IOV networking and storage - in fact, anything having to do with hardware virtualisation. This is why people get "high-end" x86 CPUs, and indeed, this is why I used Siena for the comparison, as it's at least comparable in terms of price - and not some abstract idea of redline performance, where x86 CPUs, by the way, absolutely suck at the single most important general-purpose task, i.e. LLM inference. If you were going for the utmost bit of oomph, you would go for a superchip anyway. So your choice is not even whether you're getting a CPU; instead it's how big and wide you wish your APU cluster to be, and what you're using for interconnect, as that's the largest contributing factor to your setup.
Update: I was unfair in my characterisation of NVIDIA DGX Spark as "big nothing," as despite its shortcomings, it's a fascinating platform in terms of connectivity: the first prosumer motherboard to natively support 200G, if I'm not mistaken. Now, you could always use a ConnectX-6 in your normal server's PCIe 5.0 slot, but that would already set you back many thousands of euros for datacenter-grade server specs.
> Really, 1TB/s of memory bandwidth to and from system memory?
5x is false, it's more like 4x. Apple doesn't use memory sticks; they use DRAM ICs on the SoC package.
The M3 Ultra has 8 memory channels at 128 bits per channel, for a total 1024-bit memory bus. It uses LPDDR5-6400, so it has 1024 bits × 6400 MT/s ÷ 8 bits per byte = 819.2 GB/s of memory bandwidth.
You're deceiving yourself and falling for Apple marketing. Regardless of whether it's stick or on-package memory, which has been the case with pretty much every SoC in the 2010s (nowadays I have no idea), it is not possible to drive the memory at such high speeds.
This is definitely citation needed. I very much expect a combined GPU/CPU/NPU load to saturate the memory channels if necessary. This is not some marketing fluff. The channels are real, and the RAM ICs are physically there and connected.
We are talking about the memory bandwidth available to the CPU cores, not to all the co-processors/accelerators present in the SoC, so you're pulling in an argument that is not valid.
> While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is capable of.
> That begs the question, why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind, however in my testing, I’ve had extreme trouble to find workloads that would stress the GPU sufficiently to take advantage of the available bandwidth.
> Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters)
The cited article is pretty clear: the M1 Max maxes out at (approximately) 100 GB/s for a single CPU core, 243 GB/s for a CPU cluster, and 409 GB/s for the entire SoC.
They did not (or, rather, could not) measure the theoretical peak GPU core saturation for the M1 Max SoC because such benchmarks did not exist at the time, due to the sheer novelty of such wide hardware.
> The cited article is pretty clear: the M1 Max maxes out at (approximately) 100 GB/s for a single CPU core, 243 GB/s for a CPU cluster, and 409 GB/s for the entire SoC.
So which part of "We are talking about the memory bandwidth available to the CPU cores, not to all the co-processors/accelerators present in the SoC" did you not understand?
Well I think we talked about memory channels and the maximum speed reachable. And you claimed it was marketing fluff. I don't think it's unreasonable to say that if that speed is reachable using some workload it's not marketing fluff. It was not clear at all to me you limited your claims to CPU speed only. Seems like a classic motte-and-bailey to me.
You're realistically going to reach power/thermal limits before you saturate the memory bandwidth. Otherwise I'd like to hear about a workload that'll make use of the CPU, GPU, NPU, etc. to make use of Apple's marketing point.
It's quite impressive what they were able to achieve with ppc64el in recent years, including Linux support for it, too. Unfortunately, they turned the wrong way with proprietary encryption of memory, which may or may not be deliberate as far as backdoors come and go, but in all honesty so much of it is contingent on IBM's proprietary fabric (OSC, or what was it?) implementation for tiered memory anyway. There are similar setups from Samsung, even including fully transparent swapping to NVMe for persistence, which is really cool and hard to match in an open-source setting.
I think their slogan could be "unlimited, coherent, persistent, encrypted high-bandwidth memory is here, and we are the only ones that really have it."
Disclaimer: proud owner of thoroughbred OpenPOWER system from Raptor
I’ve seen the stronger x86 memory model argued as one of the things that affects its performance before.
It's neat to see real numbers on it. It didn't seem to be very big in many circumstances, which I guess would have been my guess anyway.
Of course, Apple just implemented that on the M1, while AMD/Intel had been doing it for a long time. I wonder if later M chips reduced the effect. And will they drop the feature once they drop Rosetta 2?
I'm really curious how exactly they'll wind up phasing out Rosetta 2. They seem to be a bit coy about it:
> Rosetta was designed to make the transition to Apple silicon easier, and we plan to make it available for the next two major macOS releases – through macOS 27 – as a general-purpose tool for Intel apps to help developers complete the migration of their apps. Beyond this timeframe, we will keep a subset of Rosetta functionality aimed at supporting older unmaintained gaming titles, that rely on Intel-based frameworks.
However, that leaves much unsaid. Unmaintained gaming titles? Does this mean native, old macOS games? I thought many of them were already no longer functional by this point. What about Crossover? What about Rosetta 2 inside Linux?
I wouldn't be surprised if they really do drop some x86 amenities from the SoC at the cost of performance, but I think it would be a bummer if they dropped the Rosetta 2 use cases that don't involve native apps. Those ones are useful. Rosetta 2 is faster than alternative recompilers. Maybe FEX will have bridged the gap most of the way by then?
I think they’re trying to maintain the stick for ordinary “Cocoa” app developers, but otherwise leave themselves the room to keep using the technology where it makes sense.
> However, that leaves much unsaid. Unmaintained gaming titles? Does this mean native, old macOS games? I thought many of them were already no longer functional by this point. What about Crossover? What about Rosetta 2 inside Linux?
Apple keeps trying to be a platform for games. Keeping old games running would be a step in that direction. Might include support for x86 games running through wine/apple game porting toolkit/etc
> Apple keeps trying to be a platform for games. Keeping old games running would be a step in that direction. Might include support for x86 games running through wine/apple game porting toolkit/etc
Well... They'd need to bring back 32-bit support also then. This is what killed most of my Mac-compatible Steam library....
Rosetta 1 was licensed third-party technology, back when the company wasn't exactly rolling in money.
https://www.wikipedia.org/wiki/QuickTransit
If you have to pay the licensing fee again every time you want to release a new version of the OS, you've got a fiscal incentive to sunset Rosetta early.
Rosetta 2 was developed in-house.
Apple owns it, so there is no fiscal reason to sunset it early.
Rosetta 1 wasn't really useful for much because PowerPC was a dead platform by the time Apple switched off of it. Rosetta 2 is used for much more than just compatibility with old macOS apps.
> Regarding cache-line size, sysctl on macOS reports a value of 128 B, while getconf and the CTR_EL0 register on Asahi Linux returns 64 B, which is also supported by our measurements.
How would this even be possible?