You bring up some really good points about bandwidth to the data and how many of our competitors have gotten ridiculously high benchmarks. We do NOT cache on the GPU, EVER. The reason is that most of our columns are much too large to fit on one or even 8 GPUs, let alone leave room for the working space that processing needs.

That is also why we actually prefer to have only 1 GPU per server when we build our own boxes. We find that our best running environment is smaller instances with only 1 GPU each. Two smaller rigs with a GPU each benefit from increased CPU RAM throughput (roughly double that of a single rig), and you get 2x the PCIe bandwidth, since the GPUs sit in two separate machines instead of sharing one. While we don't get the enviable ops per byte loaded that some machine learning toolsets enjoy, we are able to greatly improve throughput by using compression and by doing things like running multiple arithmetic operations in one kernel call.
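
To put rough numbers on the last point, here is a back-of-the-envelope sketch in Python. It is not actual engine code. The expression `(a + b) * c - d`, the row count, the 8-byte values, the 16 GB/s link, and the 4x compression ratio are all made-up assumptions, and the function names are hypothetical. It only counts bytes moved, which is the thing that matters when you are bandwidth bound.

```python
# Back-of-the-envelope model. Every number here is an illustrative assumption.

BYTES_PER_VALUE = 8  # assume 64-bit column values


def unfused_traffic(n_rows: int, n_ops: int) -> int:
    """Bytes moved when each binary op runs as its own kernel:
    every kernel reads two arrays and writes one."""
    return n_ops * 3 * n_rows * BYTES_PER_VALUE


def fused_traffic(n_rows: int, n_inputs: int) -> int:
    """Bytes moved when a single kernel reads each input column once
    and writes one result column."""
    return (n_inputs + 1) * n_rows * BYTES_PER_VALUE


def ops_per_byte(n_rows: int, n_ops: int, traffic: int) -> float:
    """Arithmetic operations performed per byte of memory traffic."""
    return n_rows * n_ops / traffic


def effective_gb_per_s(link_gb_per_s: float, compression_ratio: float) -> float:
    """Uncompressed data delivered per second when compressed data crosses
    the link and is decompressed on the far side (decompression cost ignored)."""
    return link_gb_per_s * compression_ratio


n_rows = 1_000_000
n_ops, n_inputs = 3, 4  # (a + b) * c - d: three ops over four input columns

unfused = unfused_traffic(n_rows, n_ops)
fused = fused_traffic(n_rows, n_inputs)

print("unfused bytes:", unfused)
print("fused bytes:  ", fused)
print("ops/byte unfused:", ops_per_byte(n_rows, n_ops, unfused))
print("ops/byte fused:  ", ops_per_byte(n_rows, n_ops, fused))
print("effective GB/s at 4x compression:", effective_gb_per_s(16.0, 4.0))
```

Under these assumptions the fused version moves 40 MB instead of 72 MB for the same three operations, so the ops per byte goes up by 1.8x without the data ever living on the GPU. Compression helps in the same direction: fewer bytes cross the PCIe link for the same amount of logical data.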