I've been thinking about something similar. I don't see how timed expiration would conflict with the two most important features - the filling mechanism and the replication of hot items. Am I missing something that would make timed expiration impossible?
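To make it concrete, here's a rough sketch of the kind of thing I mean (assuming the original groupcache Go API, where the context is a plain interface{}; the ttlKey helper is my own invention, not anything groupcache provides):

    package main

    import (
        "fmt"
        "time"

        "github.com/golang/groupcache"
    )

    // ttlKey is a hypothetical helper: the same logical key maps to a new
    // cache key every ttl period, so stale buckets simply stop being
    // requested and age out of the LRU on their own.
    func ttlKey(key string, ttl time.Duration) string {
        bucket := time.Now().Unix() / int64(ttl.Seconds())
        return fmt.Sprintf("%s@%d", key, bucket)
    }

    // The filling mechanism is untouched: one fill per key per peer group.
    var users = groupcache.NewGroup("users", 64<<20, groupcache.GetterFunc(
        func(ctx groupcache.Context, key string, dest groupcache.Sink) error {
            // fetch from the slow backing store here
            return dest.SetString("value-for-" + key)
        }))

    func main() {
        var v string
        // Every caller derives the key the same way, so a fill happens at
        // most once per 5-minute bucket; replication of hot items works as usual.
        if err := users.Get(nil, ttlKey("user:42", 5*time.Minute), groupcache.StringSink(&v)); err != nil {
            panic(err)
        }
        fmt.Println(v)
    }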
Yeah, on a first pass at the problem you seem right.
CAS needs an authoritative node (my mind wanders off thinking about replication and failover), but surely the key it protects - with the version baked in - can be replicated?
CAS is incompatible with the distribution architecture, which uses a best-effort distributed lock in lieu of, say, a strongly consistent distributed state machine. Supporting it would require a lot more work.
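To spell out what CAS would have to guarantee - the version check and the write must be atomic against one authoritative copy, which a best-effort lock can't promise - here's a toy single-node sketch of memcached-style gets/cas semantics (my own illustration, not groupcache code):

    package main

    import "sync"

    type entry struct {
        value   string
        version uint64
    }

    type casStore struct {
        mu sync.Mutex
        m  map[string]entry
    }

    // Gets returns the value plus the version token the caller must
    // present to Cas later.
    func (s *casStore) Gets(key string) (string, uint64, bool) {
        s.mu.Lock()
        defer s.mu.Unlock()
        e, ok := s.m[key]
        return e.value, e.version, ok
    }

    // Cas writes only if nobody else has written since `version` was read.
    func (s *casStore) Cas(key, value string, version uint64) bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        e, ok := s.m[key]
        if !ok || e.version != version {
            return false // lost the race; caller must re-read and retry
        }
        s.m[key] = entry{value: value, version: version + 1}
        return true
    }

    func main() {
        s := &casStore{m: map[string]entry{"k": {value: "v0", version: 1}}}
        _, ver, _ := s.Gets("k")
        _ = s.Cas("k", "v1", ver) // succeeds
        _ = s.Cas("k", "v2", ver) // fails: version is now stale
    }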
If a bug in another application on the server running the cache causes it to grow its memory use, your cache would suddenly disappear or underperform, and the failure could cascade on to the system the cache sits in front of. If instead you let the offending program crash because the cache is using regular memory, that wouldn't happen. Just a thought.
If you hit swap, again only the offending application or instance is punished, not everyone else (for instance by pummeling a backend database server that other services are using as well).
If program A hits swap, it means that cold pages are written to swap so that A can get the memory; this initial writing is done by program A, it's true. But A may not be the cause of the problem; A is just the straw that breaks the camel's back.
And those pages that got written to swap likely belong to others, and they pay the cost when they need those pages back...
In my practical experience, when one of my apps hits swap, the whole system becomes distressed. It is not isolated to the 'offender'.
You can of course avoid swap, but with your OS doing overcommit on memory allocations, you are just inviting a completely different way of failing, and that too is hard to manage. You end up having to know a lot about your deployment environment, ring-fence memory between components, and manage their budgets. If you want both app code and cache on the same node - and that's a central tenet of groupcache - then you have to make sure everything is under-dimensioned, because the needs of one cannot be allowed to steal from the other; your cache isn't adaptive.
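Concretely, if I'm remembering the groupcache API right, the budget is a fixed byte count you pick up front, and nothing flexes it in response to memory pressure elsewhere on the box:

    package main

    import "github.com/golang/groupcache"

    // The 256 MB here is hand-picked at startup and never adapts; if the
    // app beside it needs more memory, the cache won't give any back.
    var thumbs = groupcache.NewGroup(
        "thumbnails",
        256<<20,
        groupcache.GetterFunc(func(ctx groupcache.Context, key string, dest groupcache.Sink) error {
            // expensive fill from the origin goes here
            return dest.SetBytes([]byte("rendered thumbnail bytes"))
        }),
    )

    func main() {
        var b []byte
        _ = thumbs.Get(nil, "img/123", groupcache.AllocatingByteSliceSink(&b))
    }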
That's why I built a system to do caching centrally at the OS level.
I hope someone like Brad is browsing here and can make some kind of piercing observation I've missed.
That's rather common. If swapping can harm your application, then don't swap. On a machine where a slowdown is tolerable (temporarily, on a desktop), swap is fine. On a machine whose entire purpose is to serve as a fast cache in front of slow storage, swapoff and fall back to shedding or queuing requests at the frontend.
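A counting semaphore in front of the handler is about all the shedding takes (a minimal Go sketch; the names and the in-flight cap are made up):

    package main

    import "net/http"

    // Cap in-flight requests with a buffered channel and return 503
    // instead of letting the box dig itself into swap.
    var inflight = make(chan struct{}, 512) // capacity sized for the box

    func shed(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            select {
            case inflight <- struct{}{}:
                defer func() { <-inflight }()
                next.ServeHTTP(w, r)
            default:
                // overloaded: fail fast rather than queue into the ground
                http.Error(w, "overloaded, retry later", http.StatusServiceUnavailable)
            }
        })
    }

    func main() {
        http.Handle("/", shed(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("cached response"))
        })))
        http.ListenAndServe(":8080", nil)
    }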
That is my experience as well. In my thought experiment the 'offender' would be a server instance, not a process running among other applications on a single machine. Applications that hit swap often have memory leaks, and hitting swap is then just a matter of time. Creating a cascading failure may be preventable, however.
It's an alternative to memcache but not a direct replacement. I hope he adds CAS etc.
I hope they start using the kernel's buffer cache as the backing store, or explain why it's not a good idea: http://williamedwardscoder.tumblr.com/post/13363076806/buffc...
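The appeal, as I see it, is that if the backing store is just files, plain reads already give you an adaptive in-RAM cache for free: hot pages stay resident, the kernel reclaims them under pressure, and several processes can share them. A minimal sketch (the directory is hypothetical):

    package main

    import "net/http"

    func main() {
        // Repeated reads of hot files are served out of the page cache;
        // cold files cost a disk read, exactly like a cache miss. No
        // userspace memory budget to tune.
        http.Handle("/", http.FileServer(http.Dir("/var/cache/blobs")))
        http.ListenAndServe(":8080", nil)
    }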