It does make me wonder why pymalloc and jemalloc used page-aligned memory, but glibc didn't. That is odd. Other questions were never answered: why did pyo3 add so much overhead? It was over half the difference between the two.
> It does make me wonder why pymalloc and jemalloc used page-aligned memory, but glibc didn't.
The root cause is not about page alignment. In fact, all allocators are aligned.
The root cause is that AMD CPUs don't implement FSRM correctly when copying data from addresses in the range 0x1000 * n to 0x1000 * n + 0x10.
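To make that range concrete, here is a minimal sketch of a check for whether an address falls in it. The helper name `in_slow_window`, the inclusive upper bound, and the example addresses are my own assumptions for illustration, not something taken from the investigation.

```python
PAGE_SIZE = 0x1000    # 4 KiB, the "0x1000 * n" part of the range
SLOW_WINDOW = 0x10    # the "+ 0x10" part of the range


def in_slow_window(addr: int) -> bool:
    """Return True if addr lies in [0x1000 * n, 0x1000 * n + 0x10] for some n.

    Assumption: both ends of the range are treated as inclusive.
    """
    return (addr % PAGE_SIZE) <= SLOW_WINDOW


# Hypothetical addresses, chosen only to show both outcomes.
for addr in (0x7F0000001000, 0x7F0000001010, 0x7F0000001020):
    print(hex(addr), in_slow_window(addr))
```

Only the offset within the page matters here, so a buffer that starts exactly on a page boundary and one that starts 0x10 bytes past it both land in the window.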
> Other questions were never answered: why did pyo3 add so much overhead? It was over half the difference between the two.
OpenDAL Python Binding v0.42 does have many places to improve: for example, we could allocate the buffer in advance or use `read_buf` to read into an uninitialized vec. I skipped this part since it is not the root cause.
> It does make me wonder why pymalloc and jemalloc used page-aligned memory, but glibc didn't. That is odd.
Other way around: with glibc it was page-aligned; with the others, it wasn't.
This weird Zen performance quirk aside, I'd prefer page alignment so that an allocation like this which is a nice multiple of the page size doesn't waste anything (RAM or TLB), with the memory allocator's own bookkeeping in a separate block. Pretty surprising to me that the other allocators do something else.
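A rough sketch of the waste I mean, counting how many pages a buffer touches depending on where it starts. The 4 KiB page size, the 16-byte offset, and the helper name `pages_touched` are assumptions for illustration; real allocators differ in how much bookkeeping they put in front of the user's memory.

```python
PAGE_SIZE = 4096  # assumed page size


def pages_touched(start: int, size: int) -> int:
    """Number of pages spanned by `size` bytes (size > 0) starting at `start`."""
    first_page = start // PAGE_SIZE
    last_page = (start + size - 1) // PAGE_SIZE
    return last_page - first_page + 1


size = 64 * 1024     # a nice multiple of the page size
aligned = 0x10000    # hypothetical page-aligned start address
shifted = 0x10010    # same buffer pushed 16 bytes into the page

print(pages_touched(aligned, size))  # page-aligned start
print(pages_touched(shifted, size))  # start shifted by 16 bytes
```

With the shifted start, the last few bytes of the buffer spill onto one extra page, which costs both RAM and a TLB entry for what is nominally an exact multiple of the page size.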