Zeroing makes this memory the most recently used in the L1 cache, even though the allocator has just freed it, meaning it probably isn't going to be used again.
If it was freed after already being removed from the L1 cache, then you also need to evict other L1 cache contents and wait for it to be read into L1 so you can write to it.
128 cycles is a generous estimate, and ignores the costs to the rest of the program.
Nontemporal writes are substantially slower, e.g. with AVX-512 you can do one 64-byte nontemporal write every 5 or so clock cycles. For 8 KiB that is 128 writes, which puts you at >= 640 cycles.
https://uops.info/html-instr/VMOVNTPS_M512_ZMM.html
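For anyone who wants to check the numbers, here is a back-of-the-envelope sketch of the arithmetic. It assumes the 128-cycle figure comes from one regular 64-byte store per cycle (my reading of the estimate, not something measured), and it takes the "5 or so" cycles per nontemporal store at face value. The function name `zeroing_cycles` is made up for illustration.

```python
def zeroing_cycles(buffer_bytes: int, store_bytes: int, cycles_per_store: float) -> float:
    """Rough cycle count to overwrite a buffer with fixed-width stores.

    Ignores cache misses, evictions and store-buffer effects, so this is
    a lower bound at best.
    """
    stores = buffer_bytes // store_bytes
    return stores * cycles_per_store


BUFFER_BYTES = 8 * 1024  # the 8 KiB buffer discussed above
STORE_BYTES = 64         # one 512-bit (zmm) store

# Assumption: one regular 64-byte store retires per cycle.
regular = zeroing_cycles(BUFFER_BYTES, STORE_BYTES, 1)

# Assumption: roughly 5 cycles per 64-byte nontemporal store.
nontemporal = zeroing_cycles(BUFFER_BYTES, STORE_BYTES, 5)

print(f"regular stores:     {regular} cycles")
print(f"nontemporal stores: {nontemporal} cycles")
```

Both numbers only count issuing the stores; neither includes the cost of pulling evicted lines back into L1 or of pushing other data out.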
Well, the point of a non-temporal write kind of is that you don't care how fast it is. (Since if it was being read again anytime soon, you'd want it in the cache.)
The worker is already reading/writing to the buffer memory to service each incoming HTTP request, whether the memory is zeroed or not. The side effects on the CPU cache are insubstantial.