This paper might seem short, but it's a great introduction to the various scheduler types available for Hadoop clusters. The Fair Scheduler and the Capacity Scheduler are the only ones I've seen in production use, but I see huge potential for longer-term improvement from some of the adaptive types.
It's amazing how many enterprise customers run into severe issues with their clusters simply because of poor scheduler configuration. Either they never changed the defaults, or they turned knobs arbitrarily without understanding how the various configuration values depend on each other.
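To make that concrete, here's a minimal sketch of what those interdependent knobs look like in a YARN-era capacity-scheduler.xml; the queue names and numbers are hypothetical, purely for illustration.

```xml
<!-- Hypothetical capacity-scheduler.xml sketch; queue names and values are
     illustrative only, not a recommendation. -->
<configuration>
  <!-- Two child queues under root; their capacities must sum to 100. -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>etl,adhoc</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.etl.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>
  </property>
  <!-- Elasticity: adhoc may borrow idle capacity, but only up to this cap,
       so this value interacts with etl's guaranteed share. -->
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
    <value>50</value>
  </property>
  <!-- A single user's share is expressed as a multiple of the queue's
       capacity, so changing capacity silently changes per-user limits too. -->
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.user-limit-factor</name>
    <value>2</value>
  </property>
</configuration>
```

Tweak any one of these in isolation and you've quietly changed the guarantees and per-user limits elsewhere, which is exactly how a cluster with a "reasonable-looking" config ends up misbehaving.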
Finding a healthy balance between having humans describe what they think they want and letting the system adjust itself will be huge.
Is anyone actually sitting on zettabyte-scale data sets, as the introduction claims?
There's a three-year-old xkcd "what if?" that guesstimates Google holds about 15 exabytes of data, roughly 1.5% of a zettabyte, and even that presumably spans lots of different types of data (e.g., nobody is going to run a single analysis across YouTube videos, Gmail messages, and self-driving-car telemetry).