I don't know exactly why the Linux kernel does this, but the pauses do not seem to occur for reads, so the Linux kernel is marking these pages read-only. A friend suggested this is the kernel's way of reducing the write I/O rate under overloaded conditions. If you know exactly why the kernel is doing this, I would love to hear it.
What is happening is that the JVM is dirtying a previously clean page (this also happens in your test program, because the dirty pages in your mmap'ed file are being regularly written out - and therefore made clean - by background writeback).
If, at this point, the global dirty limits are exceeded (/proc/sys/vm/dirty_bytes and /proc/sys/vm/dirty_ratio) then the task will be paused to throttle the generation of dirty pages.
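As a rough illustration (not from the original comment), a short script can dump the writeback tunables the kernel consults for this decision. The `vm_tunable` helper and its `root` parameter are my own invention, there so the parsing can be exercised off-Linux:

```python
from pathlib import Path

def vm_tunable(name, root="/proc/sys/vm"):
    """Read one Linux VM writeback tunable; None if absent or unparsable."""
    try:
        return int((Path(root) / name).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None

# The *_bytes tunables override the *_ratio ones when nonzero.
for name in ("dirty_ratio", "dirty_background_ratio",
             "dirty_bytes", "dirty_background_bytes"):
    print(name, "=", vm_tunable(name))
```

On a typical default install you'd see something like `dirty_ratio = 20`, i.e. throttling kicks in when dirty pages reach 20% of reclaimable memory.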
This is by design; these are called "stable pages".
In some circumstances, the kernel has to ensure that pages aren't modified between initiating and completing the writeback for that specific page.
Btrfs is a copy-on-write filesystem, so it ends up needing to use this guarantee more often than the others. This is something the btrfs developers are actively working on improving.
BTW, I'm pretty sure that this blocking behavior is the best choice (among bad choices). If I memory map a file, I can set bits in that buffer at a much higher rate than the kernel can write those bits to disk. So the choices are: block the writing thread until the kernel can catch up, or simply drop some of those writes. At least with blocking, I have a chance to notice the issue (in my application, this would cause a chain reaction of threads to back up and ultimately block, resulting in me dropping network packets).
The only guarantee when you write to an mmap'ed page is that you wrote to the memory; whether or not it makes it to disk is up to many different things. Before you can write to the memory, it needs to have the right contents: that can mean a read has to finish, and it can also mean other pages have to get evicted to free up memory for yours to be read in. I can't think of how the write itself can actually block unless a read is required which hasn't finished (like the file is in read/write mode or something). In fact, other than a page fault, there is no way it can be a blocking operation; the pages are stitched into the process's page tables. At least I can't think of how it can block on a write right now; I've had a couple glasses of wine with dinner though. [edit] The mtime update makes some sense, but does that block?
In write only mode there are optimizations to not require the read.
> I can't think of how the write itself can actually block unless a read is required which hasn't finished (like the file is in read/write mode or something) in fact, other than a pagefault, there is no way it can be a blocking operation, the pages are stitched in to the processes page tables.
You've almost got it. The mmap'ed pages may be in the process's page table, but they may be there as read-only: when the process tries to write to such a page, it traps into the kernel. If there are few dirty pages, the kernel will mark the page dirty, make it writable, and make the process runnable again. Apparently, if there are a lot of dirty pages, the kernel will not fulfill the request immediately; it will wait. While it's waiting, the process is not runnable (other threads sharing the same address space would continue to be runnable).
I don't understand. There's no guarantee that every write to an mmap'ed region is synced to the disk. Indeed, AFAICS, the OS isn't obliged to write any changes through until you call msync() or close the file. Given that, I don't see why a memory write should ever block.
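A minimal sketch of the scenario being debated (my own, in Python for brevity): dirty pages through an mmap'ed file and time each store. With only 1 MiB dirtied here the stores will normally take microseconds; the claim upthread is that on a machine past its dirty limits, some of these stores would stall inside the write-fault handler:

```python
import mmap, os, tempfile, time

SIZE = 1 << 20  # 1 MiB; real throttling stalls need far more dirty data

fd, path = tempfile.mkstemp()
os.ftruncate(fd, SIZE)
buf = mmap.mmap(fd, SIZE)  # shared, writable mapping

worst = 0.0
for i in range(0, SIZE, mmap.PAGESIZE):
    t0 = time.perf_counter()
    buf[i] = 0xFF  # dirties a clean page; may trap into the kernel
    worst = max(worst, time.perf_counter() - t0)

first_byte = buf[0]
print(f"worst single-store latency: {worst * 1e6:.1f} us")

buf.close()
os.close(fd)
os.remove(path)
```

Nothing in this loop calls msync(), which is exactly the point of contention: the stalls, when they happen, come from the kernel's fault handler, not from any explicit sync.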
You know, it is pretty funny that some people on Reddit linked to a talk from Aaron Patterson (AKA tenderlove) about Ruby on Rails performance issues and performance regressions. In the middle of the video he cracks up and points out he went down a tangent path, because that Ruby profiler was built on a gem that called into Ruby MRI's C API. So he worked his way through the gem developer, and then the Ruby dev who wrote the C API. Neither knew what was going on.
Turns out the discrepancy between CPU times and wall times in the profiling data occurred only on OS X, because of a problem with a trap() call on OS X specifically, not on any other platform. His moral of the story: even profilers have bugs.
Profilers lie all the time. Usually it's not so much a bug as a known limitation of the profiling approach. E.g. profilers that annotate methods with entry/exit code inflate the runtime of small methods. Profilers that rely on CPU sampling can be vulnerable to correlations with the sampling schedule. And so forth. The moral is to always do a reality check on whatever your profiler is telling you.
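To make the first point concrete, here's a toy (entirely hypothetical) version of entry/exit instrumentation: the wrapper's own bookkeeping gets charged to the wrapped function, so tiny methods look slower than they really are:

```python
import time

def work():
    # A deliberately tiny method, the kind instrumentation distorts most
    s = 0
    for i in range(100):
        s += i
    return s

def instrumented(fn):
    """Toy stand-in for the entry/exit code some profilers inject."""
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Timer reads and attribute updates are overhead the profiler
            # attributes to fn itself
            wrapper.total += time.perf_counter() - t0
    wrapper.total = 0.0
    return wrapper

N = 20_000
t0 = time.perf_counter()
for _ in range(N):
    work()
plain = time.perf_counter() - t0

w = instrumented(work)
t0 = time.perf_counter()
for _ in range(N):
    w()
inst = time.perf_counter() - t0

print(f"plain: {plain:.4f}s  instrumented: {inst:.4f}s")
```

The instrumented run is consistently slower, and the gap is pure measurement overhead; for a method this small, the "profile" of `work` is dominated by the profiler itself.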
Wonder if this is dirty page writeback. That's the Linux mechanism for writing dirty pages to disk: if writes come in at a high enough rate, Linux will hard-block the writing thread until pages are written out.
Before that, it usually spawns a bunch of pdflush processes to flush data in the background, but if those can't keep up, it moves on to blocking the process. On older systems with older spinning drives, the blocking could last for seconds.
See /proc/meminfo for these two entries:
Dirty: 4 kB
Writeback: 0 kB
Dirty are the current dirty pages, and writeback is the current amount being written out.
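A quick way to watch those two numbers (my own sketch; the `meminfo_kb` helper is made up, and the parsing is split out so it works on a sample string off-Linux):

```python
def meminfo_kb(text, keys=("Dirty", "Writeback")):
    """Parse selected 'Name:   N kB' lines from /proc/meminfo text."""
    out = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys:
            out[name] = int(rest.split()[0])  # value is in kB
    return out

try:
    with open("/proc/meminfo") as f:
        print(meminfo_kb(f.read()))
except FileNotFoundError:
    pass  # not on Linux
```

Polling this in a loop while a write-heavy workload runs would show Dirty climbing until writeback (and, past the limits, throttling) kicks in.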
I had a hand-assembled binary that was always crashing on an out-of-bounds memory access. But whenever I loaded it in the debugger, it was always perfectly fine.
This has to do with the JVM's monitoring: among many other things, it monitors memory allocation/deallocation, which just happens to be automatic. It has nothing to do with automatic vs. manual memory management.
While Oracle is certainly to blame for a great deal of things in life, your comment would benefit from some explanation/background regarding this specific issue; otherwise its value is debatable.