This paper might seem short, but it's a great introduction to the various scheduler types available for Hadoop clusters. The Fair Scheduler and the Capacity Scheduler are the only ones I've seen in production use, but I see huge potential for longer-term improvement from some of the adaptive types.
It's amazing how many enterprise customers run into severe issues with their clusters simply because of poor scheduler configuration. Either they never changed the defaults, or they turned knobs arbitrarily without understanding how the various configuration values depend on each other.
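To make that concrete, here's a minimal sketch of what those interdependent knobs look like in a YARN-era capacity-scheduler.xml; the queue names and numbers are hypothetical, purely for illustration.

```xml
<!-- Hypothetical capacity-scheduler.xml sketch; queue names and values are
     illustrative only, not a recommendation. -->
<configuration>
  <!-- Two child queues under root; their capacities must sum to 100. -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>etl,adhoc</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.etl.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>
  </property>
  <!-- Elasticity: adhoc may borrow idle capacity, but only up to this cap,
       so this value interacts with etl's guaranteed share. -->
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
    <value>50</value>
  </property>
  <!-- A single user's share is expressed as a multiple of the queue's
       capacity, so changing capacity silently changes per-user limits too. -->
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.user-limit-factor</name>
    <value>2</value>
  </property>
</configuration>
```

Tweak any one of these in isolation and you've quietly changed the guarantees and per-user limits elsewhere, which is exactly how a cluster with a "reasonable-looking" config ends up misbehaving.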
Finding a healthy balance between having humans describe what they think they want and letting the system adjust itself will be huge.
Is anyone actually sitting on zettabyte-scale data sets, as the introduction claims?
There's a three-year-old xkcd "what if?" that guesstimates Google holds about 15 exabytes of data, roughly 1.5% of a zettabyte, and even that presumably spans lots of different types of data (e.g., nobody is going to run a single analysis across YouTube videos, Gmail messages, and self-driving-car telemetry).