I don't know exactly why the Linux kernel does this, but the pauses do not seem to occur for reads, so the Linux kernel is marking these pages read-only. A friend suggested this is the kernel's way of reducing the write I/O rate under overloaded conditions. If you know exactly why the kernel is doing this, I would love to hear it.
What is happening is that the JVM is dirtying a previously clean page (this also happens in your test program, because the dirty pages in your mmap'ed file are being regularly written out - and therefore made clean - by background writeback).
If, at this point, the global dirty limits are exceeded (/proc/sys/vm/dirty_bytes and /proc/sys/vm/dirty_ratio) then the task will be paused to throttle the generation of dirty pages.
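As a rough illustration (not from the original comment), a short script can dump the writeback tunables the kernel consults for this decision. The `vm_tunable` helper and its `root` parameter are my own invention, there so the parsing can be exercised off-Linux:

```python
from pathlib import Path

def vm_tunable(name, root="/proc/sys/vm"):
    """Read one Linux VM writeback tunable; None if absent or unparsable."""
    try:
        return int((Path(root) / name).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None

# The *_bytes tunables override the *_ratio ones when nonzero.
for name in ("dirty_ratio", "dirty_background_ratio",
             "dirty_bytes", "dirty_background_bytes"):
    print(name, "=", vm_tunable(name))
```

On a typical default install you'd see something like `dirty_ratio = 20`, i.e. throttling kicks in when dirty pages reach 20% of reclaimable memory.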
This is by design; these are called "stable pages".
In some circumstances, the kernel has to ensure that pages aren't modified between initiating and completing the writeback for that specific page.
Btrfs is a copy-on-write filesystem, so it ends up needing to use this guarantee more often than the others. This is something the btrfs developers are actively working on improving.
BTW, I'm pretty sure that this blocking behavior is the best choice (among bad choices). If I memory map a file, I can set bits in that buffer at a much higher rate than the kernel can write those bits to disk. So the choices are: block the writing thread until the kernel can catch up, or simply drop some of those writes. At least with blocking, I have a chance to notice the issue (in my application, this would cause a chain reaction of threads to back up and ultimately block, resulting in me dropping network packets).
The only guarantee when you write to an mmap'ed page is that you wrote to the memory; whether or not it makes it to disk is up to many different things. Before you can write to the memory, it needs to have the right contents: that can mean a read has to finish, and it can also mean other pages have to get evicted to free up memory for yours to be read in. I can't think of how the write itself can actually block unless a read is required which hasn't finished (like the file is in read/write mode or something). In fact, other than a page fault, there is no way it can be a blocking operation; the pages are stitched into the process's page tables. At least I can't think of how it can block on a write right now; I've had a couple glasses of wine with dinner though. [edit] The mtime update makes some sense, but does that block?
In write only mode there are optimizations to not require the read.
> I can't think of how the write itself can actually block unless a read is required which hasn't finished (like the file is in read/write mode or something) in fact, other than a pagefault, there is no way it can be a blocking operation, the pages are stitched in to the processes page tables.
You've almost got it. The mmap'ed pages may be in the process's page table, but they may be there as read-only: when the process tries to write to such a page, it traps into the kernel. If there are few dirty pages, the kernel will mark the page dirty, make it writable, and make the process runnable again. Apparently, if there are a lot of dirty pages, the kernel will not fulfill the request immediately; it will wait. While it's waiting, the process is not runnable (other threads sharing the same address space would continue to be runnable).
I don't understand. There's no guarantee that every write to an mmap'ed region is synced to the disk. Indeed, AFAICS, the OS isn't obliged to write any changes through until you call msync() or close the file. Given that, I don't see why a memory write should ever block.
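A minimal sketch of the scenario being debated (my own, in Python for brevity): dirty pages through an mmap'ed file and time each store. With only 1 MiB dirtied here the stores will normally take microseconds; the claim upthread is that on a machine past its dirty limits, some of these stores would stall inside the write-fault handler:

```python
import mmap, os, tempfile, time

SIZE = 1 << 20  # 1 MiB; real throttling stalls need far more dirty data

fd, path = tempfile.mkstemp()
os.ftruncate(fd, SIZE)
buf = mmap.mmap(fd, SIZE)  # shared, writable mapping

worst = 0.0
for i in range(0, SIZE, mmap.PAGESIZE):
    t0 = time.perf_counter()
    buf[i] = 0xFF  # dirties a clean page; may trap into the kernel
    worst = max(worst, time.perf_counter() - t0)

first_byte = buf[0]
print(f"worst single-store latency: {worst * 1e6:.1f} us")

buf.close()
os.close(fd)
os.remove(path)
```

Nothing in this loop calls msync(), which is exactly the point of contention: the stalls, when they happen, come from the kernel's fault handler, not from any explicit sync.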
You know, it is pretty funny that some people on Reddit linked to a talk from Aaron Patterson (AKA tenderlove) about Ruby on Rails performance issues and performance regressions. In the middle of the video he cracks up and points out he went down a tangent path, because that Ruby profiler was built on a gem that called into Ruby MRI's C API. So he worked his way through the gem developer, and then the Ruby dev who wrote the C API. Neither knew what was going on.
Turns out the discrepancy between CPU times and wall times in the profiling data occurred only on OS X, because of a problem with a trap() call on OS X specifically, not on any other platform. His moral of the story: even profilers have bugs.
Profilers lie all the time. Usually it's not so much a bug as a known limitation of the profiling approach. E.g. profilers that annotate methods with entry/exit code inflate the runtime of small methods. Profilers that rely on CPU sampling can be vulnerable to correlations with the sampling schedule. And so forth. The moral is to always do a reality check on whatever your profiler is telling you.
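To make the first point concrete, here's a toy (entirely hypothetical) version of entry/exit instrumentation: the wrapper's own bookkeeping gets charged to the wrapped function, so tiny methods look slower than they really are:

```python
import time

def work():
    # A deliberately tiny method, the kind instrumentation distorts most
    s = 0
    for i in range(100):
        s += i
    return s

def instrumented(fn):
    """Toy stand-in for the entry/exit code some profilers inject."""
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Timer reads and attribute updates are overhead the profiler
            # attributes to fn itself
            wrapper.total += time.perf_counter() - t0
    wrapper.total = 0.0
    return wrapper

N = 20_000
t0 = time.perf_counter()
for _ in range(N):
    work()
plain = time.perf_counter() - t0

w = instrumented(work)
t0 = time.perf_counter()
for _ in range(N):
    w()
inst = time.perf_counter() - t0

print(f"plain: {plain:.4f}s  instrumented: {inst:.4f}s")
```

The instrumented run is consistently slower, and the gap is pure measurement overhead; for a method this small, the "profile" of `work` is dominated by the profiler itself.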
Wonder if this is dirty page writeback. That's the Linux mechanism for writing dirty pages to disk: if writes come in at a high enough rate, Linux will hard-block the writing thread until pages are written out.
Before that, it usually spawns a bunch of pdflush processes to flush data in the background, but if those can't keep up, it moves on to blocking the process. On older systems with older spinning drives, the blocking could last for seconds.
See /proc/meminfo for these two entries:
Dirty: 4 kB
Writeback: 0 kB
Dirty are the current dirty pages, and writeback is the current amount being written out.
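A quick way to watch those two numbers (my own sketch; the `meminfo_kb` helper is made up, and the parsing is split out so it works on a sample string off-Linux):

```python
def meminfo_kb(text, keys=("Dirty", "Writeback")):
    """Parse selected 'Name:   N kB' lines from /proc/meminfo text."""
    out = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys:
            out[name] = int(rest.split()[0])  # value is in kB
    return out

try:
    with open("/proc/meminfo") as f:
        print(meminfo_kb(f.read()))
except FileNotFoundError:
    pass  # not on Linux
```

Polling this in a loop while a write-heavy workload runs would show Dirty climbing until writeback (and, past the limits, throttling) kicks in.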
I had a hand-assembled binary that was always crashing on an out-of-bounds memory access. But whenever I loaded it in the debugger, it was always perfectly fine.
This has to do with the JVM's monitoring: among many other things, it monitors memory allocation/deallocation, which just happens to be automatic. It has nothing to do with automatic vs. manual memory management.
While Oracle is certainly to blame for a great deal of things in life, your comment would benefit from some explanation/background regarding this specific issue; otherwise its value is debatable.