It's not stating Python is faster than C in general. This is just one very speci...

galangalalgol · on Nov 29, 2023

It does make me wonder why pymalloc and jemalloc used page-aligned memory but glibc didn't. That is odd. Another question that was never answered: why did pyo3 add so much overhead? It was over half the difference between the two.

xuanwo · on Nov 29, 2023

> It does make me wonder why pymalloc and jemalloc used page aligned memory, but glibc didn't.

The root cause is not about page alignment. In fact, all allocators are aligned.

The root cause is that AMD CPUs don't implement FSRM (Fast Short REP MOV) correctly when copying data located at addresses from 0x1000 * n to 0x1000 * n + 0x10.
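To make that address range concrete, here is a minimal sketch that checks whether an address falls in the range described above. The function name is hypothetical, and treating both endpoints as inclusive is an assumption; the comment only gives the range as 0x1000 * n ~ 0x1000 * n + 0x10.

```python
PAGE_SIZE = 0x1000   # the 0x1000 from the range above (4 KiB)
SLOW_WINDOW = 0x10   # the 0x10 from the range above


def in_reported_slow_range(addr: int) -> bool:
    """Return True if addr lies in [0x1000 * n, 0x1000 * n + 0x10] for some n.

    Hypothetical helper; inclusive endpoints are an assumption.
    """
    return addr % PAGE_SIZE <= SLOW_WINDOW


if __name__ == "__main__":
    # A page-aligned address, one at the end of the window, one just past it.
    for addr in (0x7F0000001000, 0x7F0000001010, 0x7F0000001020):
        print(hex(addr), in_reported_slow_range(addr))
```

A page-aligned buffer has offset 0 within its page, so it lands inside this range, which is why alignment looked like the cause even though it isn't.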

> Other questions never answered, why did pyo3 add so much overhead? it was over half the difference between the two.

OpenDAL Python Binding v0.42 does have many places to improve: for example, we could allocate the buffer in advance or use `read_buf` to read into an uninitialized vec. I skipped this part since these are not the root cause.
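As a rough illustration of the "allocate the buffer in advance" idea, here is a Python-side sketch using only the standard library. It is an analogy, not OpenDAL's code and not the Rust `read_buf` approach; the function name is hypothetical.

```python
import io


def read_into_preallocated(stream, size):
    """Read up to `size` bytes into a buffer allocated once, up front.

    Hypothetical helper. Returns (buffer, number_of_bytes_filled).
    """
    buf = bytearray(size)      # single allocation, no growth while reading
    view = memoryview(buf)     # lets us write at an offset without copying
    filled = 0
    while filled < size:
        n = stream.readinto(view[filled:])
        if not n:              # 0 means end of stream; None means no data available
            break
        filled += n
    return buf, filled


if __name__ == "__main__":
    data = b"x" * 10_000
    buf, filled = read_into_preallocated(io.BytesIO(data), len(data))
    print(filled, bytes(buf[:filled]) == data)
```

The point is only that the destination is sized once instead of being grown chunk by chunk.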

scottlamb · on Nov 29, 2023

> It does make me wonder why pymalloc and jemalloc used page aligned memory, but glibc didn't. That is odd.

Other way around: with glibc it was page-aligned; with the others, it wasn't.

This weird Zen performance quirk aside, I'd prefer page alignment so that an allocation like this which is a nice multiple of the page size doesn't waste anything (RAM or TLB), with the memory allocator's own bookkeeping in a separate block. Pretty surprising to me that the other allocators do something else.
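A quick sketch of the arithmetic behind that preference: an allocation that is an exact multiple of the page size touches one extra page as soon as it starts at a non-zero offset within a page. The addresses and the 0x10 offset below are made up for illustration, and 4 KiB pages are assumed.

```python
PAGE_SIZE = 0x1000  # assuming 4 KiB pages


def pages_touched(addr: int, size: int) -> int:
    """Number of distinct pages covered by the byte range [addr, addr + size)."""
    if size <= 0:
        return 0
    first_page = addr // PAGE_SIZE
    last_page = (addr + size - 1) // PAGE_SIZE
    return last_page - first_page + 1


if __name__ == "__main__":
    size = 16 * PAGE_SIZE                   # a nice multiple of the page size
    print(pages_touched(0x10000, size))     # page-aligned start
    print(pages_touched(0x10010, size))     # same size, start shifted by 0x10
```

With the aligned start the range covers exactly 16 pages; shifted by 0x10 it spills into a 17th.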

Attummm · on Nov 29, 2023

The context of my initial comment is that Python is slow, but can be fast.

From the article:

> In conclusion, the issue isn't software-related. Python outperforms C/Rust due to an AMD CPU bug.