> If you have the research paper, someone in the field could reimplement them in a few days.
Hi, I did this while I was at Google Brain, and it took our team of three more like a year. The "reimplementation" part took 3 months or so, and the rest of the time was literally spent debugging and figuring out all of the subtleties that were not quite mentioned in the paper. See https://openreview.net/forum?id=H1eerhIpLV
> The replication crisis (also called the replicability crisis and the reproducibility crisis) is an ongoing methodological crisis in which the results of many scientific studies are difficult or impossible to reproduce. Because the reproducibility of empirical results is an essential part of the scientific method,[2] such failures undermine the credibility of theories building on them and potentially call into question substantial parts of scientific knowledge.
People should publish automated tests. How does a performance optimizer know that they haven't changed the output if there are no known-good inputs and outputs documented as executable tests? pytest with Hypothesis seems like a nice, compact way to specify such tests.
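As a minimal sketch of what such an executable spec could look like (using a stand-in `softmax` function in place of whatever transform the paper's results actually depend on, with illustrative pinned values and tolerances):

```python
# Sketch only: `softmax` stands in for whatever function the paper's
# results depend on; the pinned numbers are illustrative.
import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra.numpy import arrays


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()


def test_known_good_case():
    # A pinned known-good input/output pair: if a "performance optimization"
    # changes these numbers, the test fails.
    out = softmax(np.array([1.0, 2.0, 3.0]))
    np.testing.assert_allclose(out, [0.09003057, 0.24472847, 0.66524096], atol=1e-6)


@given(arrays(np.float64, shape=st.integers(1, 50),
              elements=st.floats(-100, 100, allow_nan=False)))
def test_softmax_properties(x):
    # Properties that must hold for any valid input, not just the pinned case.
    out = softmax(x)
    assert out.shape == x.shape
    assert np.all((out >= 0) & (out <= 1))
    assert abs(out.sum() - 1.0) < 1e-6
```

The pinned case catches silent changes to a known-good output; the property test catches inputs nobody thought to pin.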
I think I agree with everything you've said here, but just want to note that while we absolutely should (where relevant) expect published code including automated tests, we should not typically consider reproduction that reuses that code to be "replication" per se. As I understand it, replication isn't merely a test for fraud (which rerunning should typically detect) and mistakes (which rerunning might sometimes detect) but also a test that the paper successfully communicates the ideas such that other human minds can work with them.
Sources of variance: experimental design, hardware, software, irrelevant environmental conditions/state, data (sample(s)), analysis.
Can you run the notebook again with the exact same data sample (input) and get the same charts and summary statistics (output)? Is there a way to test the stability of those outputs over time?
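One lightweight way to do that, assuming the notebook's core logic is factored into a callable (here a hypothetical `run_analysis`, with a hypothetical data path), is to record a golden copy of the summary statistics and compare against it on every rerun:

```python
# Sketch: pin the summary statistics from a known-good run and re-check them
# on every rerun. `run_analysis`, the file path, and the golden values are
# all placeholders for the notebook's real entry point, data, and outputs.
import numpy as np


def run_analysis(sample: np.ndarray) -> dict:
    # Placeholder for the notebook's actual computation.
    return {"mean": float(sample.mean()), "std": float(sample.std(ddof=1))}


GOLDEN = {"mean": 10.937, "std": 0.114}  # recorded from the original run


def test_outputs_are_repeatable():
    sample = np.loadtxt("data/known_good_sample.csv")  # the exact same input
    stats = run_analysis(sample)
    for key, expected in GOLDEN.items():
        assert abs(stats[key] - expected) < 1e-3, f"{key} drifted from the original run"
```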
Can you run the same experiment (the same 'experimental design'), ceteris paribus (everything else being equal), with a different sample (input) and get a very similar output? Is it stable, differentiable, independent, nonlinear, reversible? Does it converge?
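A rough way to probe that kind of stability, using a hypothetical `run_experiment` and made-up data in place of the real pipeline, is to rerun the analysis on several resampled inputs and check that the output stays within a tolerance:

```python
# Sketch: rerun the same analysis on different samples and check the output
# stays stable. `run_experiment`, the data, and the tolerance are illustrative.
import numpy as np


def run_experiment(sample: np.ndarray) -> float:
    # Placeholder: in practice, the full pipeline (fit, evaluate, summarize).
    return float(np.median(sample))


def test_output_stable_across_samples():
    rng = np.random.default_rng(0)
    population = rng.normal(loc=5.0, scale=1.0, size=100_000)
    results = [
        run_experiment(rng.choice(population, size=1_000, replace=True))
        for _ in range(20)
    ]
    # Ceteris paribus, different samples should give very similar outputs.
    assert np.std(results) < 0.1
```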
Now I have to go look up the definitions for Replication, Repeatability, Reproducibility
> Measures of reproducibility and repeatability: In chemistry, the terms reproducibility and repeatability are used with a specific quantitative meaning. [7] In inter-laboratory experiments, a concentration or other quantity of a chemical substance is measured repeatedly in different laboratories to assess the variability of the measurements. Then, the standard deviation of the difference between two values obtained within the same laboratory is called repeatability. The standard deviation for the difference between two measurements from different laboratories is called reproducibility. [8] These measures are related to the more general concept of variance components in metrology.
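For concreteness, a rough sketch (with made-up numbers) of the ISO 5725-style variance-components calculation behind those definitions, where the pooled within-lab variance gives the repeatability standard deviation and adding the between-lab component gives the reproducibility standard deviation:

```python
# Made-up inter-laboratory data: each row is one lab measuring the same
# substance several times.
import numpy as np

measurements = np.array([
    [10.1, 10.2, 10.0, 10.1],   # lab A
    [10.4, 10.5, 10.3, 10.4],   # lab B
    [ 9.9, 10.0, 10.1, 10.0],   # lab C
])

n_labs, n_reps = measurements.shape
lab_means = measurements.mean(axis=1)

# Within-lab (repeatability) variance: pooled variance of repeats per lab.
s_r_sq = measurements.var(axis=1, ddof=1).mean()

# Between-lab variance component (one-way ANOVA estimate).
s_L_sq = max(lab_means.var(ddof=1) - s_r_sq / n_reps, 0.0)

# Reproducibility variance combines both components.
s_R_sq = s_r_sq + s_L_sq

print(f"repeatability   s_r = {np.sqrt(s_r_sq):.3f}")
print(f"reproducibility s_R = {np.sqrt(s_R_sq):.3f}")
```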
> In engineering, science, and statistics, replication is the repetition of an experimental condition so that the variability associated with the phenomenon can be estimated. ASTM, in standard E1847, defines replication as "... the repetition of the set of all the treatment combinations to be compared in an experiment. Each of the repetitions is called a replicate."
> Replication is not the same as repeated measurements of the same item: they are dealt with differently in statistical experimental design and data analysis.
> For proper sampling, a process or batch of products should be in reasonable statistical control; inherent random variation is present but variation due to assignable (special) causes is not. Evaluation or testing of a single item does not allow for item-to-item variation and may not represent the batch or process. Replication is needed to account for this variation among items and treatments.
> In simpler terms, given a statistical sample or set of data points from repeated measurements of the same quantity, the sample or set can be said to be accurate if their average is close to the true value of the quantity being measured, while the set can be said to be precise if their standard deviation is relatively small.
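A tiny made-up example of that distinction:

```python
# Repeated measurements of a quantity whose true value we happen to know.
import numpy as np

true_value = 10.0
measurements = np.array([10.8, 10.9, 11.0, 10.9, 11.1])

bias = measurements.mean() - true_value   # ~0.94: far from true value, so not accurate
spread = measurements.std(ddof=1)         # ~0.11: small spread, so precise

print(f"mean = {measurements.mean():.2f}, bias = {bias:.2f}, std = {spread:.2f}")
```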
All-time classic Hacker News moment: "heh, someone in the field could write this in a few days" / "Hi, it's me, three of us literally worked at Google Brain and it took us a year"
I don't know the subject matter well enough to make the call, but it's possible the OP is making a general statement that's generally true, even if it doesn't hold in this specific case, where it took a year.
One thing that's not appreciated by many who haven't tried to implement a NN is how subtle bugs can be. When you look at code for a NN, it's generally pretty simple. However, what happens when your code doesn't produce the output you were expecting? When that happens, it can be very difficult and time consuming to find the subtle issue with your code.
How come you weren't able to just get it from DeepMind given that they are a subsidiary of Google? Is there a lot of red tape involved in exchanging IP like that?
There is Leela Zero (https://github.com/leela-zero/leela-zero) for Go and lc0/Leela Chess (https://github.com/orgs/LeelaChessZero/repositories) for Chess, both of which provide trained weights. The Leela Chess project in particular has been working for a long time on training and refining the weights for Chess, as well as providing the code -- they allow you to see the history and performance over time for the various trained models.
The secret sauce is in the ML compiler and accelerator used, but those improvements simply lower the cost of training a model. You could still do it on a regular GPU; it would just take you more time.
In the case of Google, they probably used TPU chips that you can't get direct 'bare metal' access to anyway, so none of that code would have helped.
The actual optimizer and its parameters (like the learning rate schedule) are normally published in the research paper.
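As a rough illustration of how such published settings typically map to code (placeholder numbers, not any particular paper's values):

```python
# Illustrative only: translating reported optimizer hyperparameters into code.
import torch

model = torch.nn.Linear(64, 64)  # stand-in for the real network

optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                            momentum=0.9, weight_decay=1e-4)

# Step-wise learning-rate schedule of the kind papers usually report.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100_000, 300_000, 500_000], gamma=0.1)
```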
You should pencil out on a napkin just how long "more time" is. Here, I'll get you started:
1600 inferences per move * 1 ms per inference * 250 moves/game * 30M games played = 12B seconds, or about 140k days. MuZero with Gumbel brought the 1600 down to ~40, but either way, you need some more scale.
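Spelled out in code, with the same assumed numbers:

```python
# Napkin math from above: wall-clock cost of self-play on a single device.
inferences_per_move = 1600        # MuZero-style MCTS simulations per move
seconds_per_inference = 1e-3      # 1 ms per network call
moves_per_game = 250
games = 30_000_000

total_seconds = inferences_per_move * seconds_per_inference * moves_per_game * games
print(total_seconds / 86_400 / 365, "device-years")               # ~380 years

# Gumbel MuZero brings ~1600 simulations per move down to ~40:
print(total_seconds * 40 / 1600 / 86_400 / 365, "device-years")   # ~9.5 years
```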
It turns out a lot of the difficulties, judgment calls, and implementation details involve data pipelining. Some of those choices affect the final skill ceiling you reach. Which ones? How much? Are they path dependent? Well, you'll need to run it more than once...
Depends on game/environment and—since it's using a GBDT and not a NN—how good you are at feature extraction/selection for your problem.
High level, I'd say it's a good way to test a new environment w/out spending time/effort on GPUs until you understand the problem well, and then you can switch to the time/money costly GPU world.
Models can be massive, but also totally doable. Just to put things in perspective: ProcMaze solving using DeepMind MCTX converges in <1M steps, whereas a physically based agent such as HalfCheetah may require >100M steps to learn to run. Q-learning Pac-Man on a Snapdragon ChromeOS machine takes ~1 hr for 1000 epochs ;)
If you have the research paper, someone in the field could reimplement them in a few days.
Then there is the large compute cost for training them to produce the trained weights.
So, opensourcing these bits of work without the weights isn't as major a thing as you might imagine.