The sample sounds impressive, but based on their claim -- 'Streaming inference is faster than playback even on an A100 40GB for the 3 billion parameter model' -- I don't think this could run on a standard laptop.
I think it could be. I also think it is likely that HN frequenter `dekhn` has personally spent more money on compute resources than any other living human, so maybe they will chime in on how the cost gets allocated to the research.
A big part of it is basically hard production quota: the ability to run jobs at high priority on large machines for an entire quarter. The main issue was that quota was somewhat overallocated, or otherwise went unused (for example, if you and another team both wanted a full TPUv3 with all its nodes and fabric).
From what I can tell, ads made the money, search/ads bought machines with their allocated budget, TI used their budget to run the systems, and then funny money in the form of quota was allocated to groups. The money was "funny" in the sense that the full reach-through cost of operating a TPU for a year looks completely different from the production quota allocation that gets handed out. I think Google had long been trying to create a market economy, but it was really much more like a state-funded exercise.
(I am not proud of how much CPU I wasted on protein folding/design and drug discovery, but I'm eternally thankful to Urs for giving me the opportunity to try it out, and also to compute the energy costs associated with that CPU use.)