Suppose there are many times more posts about something one generation of LLMs can't do (arithmetic, tic-tac-toe, whatever) than posts about how the next generation of models can do that task successfully. I think this is probably the case.
While I doubt it will happen, it would be somewhat funny if training on that text caused a future model to claim it can't do something it "should" be able to do, because it internalized that it is an LLM and "LLMs can't do X."
The new model does play very well, but when it draws the board it frequently places the moves in incorrect locations (while seemingly still keeping track of the correct ones). I can't fault it too much for that, though; I don't think rendering what is essentially ASCII art is meant to be a strength of the model.
Edit: Actually, a third game with it led to it making an illegal move and claiming a draw (which would have been inevitable given optimal play for the rest of the game, but there were still several valid moves left to make).
They evaluate their automated reviewer by comparing against human evaluations on human-written research papers, and then seem to extrapolate that their automated reviewer would align with human reviewers on AI-written research papers. It seems like there are a few major pitfalls with this.
First, if their systems aren't multimodal and their figures are lower-quality than human-created ones (which they explicitly list as a limitation), the automated reviewer would be biased in favor of AI-generated papers, since it only sees the text. This is an obvious one, but I think there could easily be other aspects of papers where the AI and human reviewers align on human-written papers but not on AI-generated ones.
Additionally, they note:
> Furthermore, the False Negative Rate (FNR) is much lower than the human baseline (0.39 vs. 0.52). Hence, the LLM-based review agent rejects fewer high-quality papers. The False Positive Rate (FNR [sic]), on the other hand, is higher (0.31 vs. 0.17)
It seems like false positive rate is the more important metric here. If a paper is truly high-quality, it is likely to succeed with a rebuttal or get accepted at another conference. On the other hand, if this system leads to more low-quality submissions or acceptances via a high FPR, we're going to have more AI slop and increased load on human reviewers.
I admit I didn't thoroughly read all 185 pages, so maybe these concerns are misplaced.
Also a concern about the paper generation process itself:
> In a similar vein to idea generation, The AI Scientist is allowed 20 rounds to poll the Semantic Scholar API looking for the most relevant sources to compare and contrast the near-completed paper against for the related work section. This process also allows The AI Scientist to select any papers it would like to discuss and additionally fill in any citations that are missing from other sections of the paper.
So... they don't look for related work until the paper is "near-completed." Seems a bit backwards to me.
Great point. I think The AI Scientist is already a winner.
If the likelihood of a false outcome is FNR + FPR, then the machine would fail 0.70 of the time and humans 0.69. Humans win, but only nominally.
In terms of costs, humans lose. For every additional 0.14 of false positives you spend (FPR: 0.31 - 0.17 = 0.14), you gain 0.13 of avoided false negatives (FNR: 0.52 - 0.39 = 0.13). The discrepancy in paper production costs is at least a factor of 100, and the value of even the least useful research typically yields a factor of two or more benefit relative to production and validation costs. So the final balance is roughly 0.014 to 0.36 -> a ~25x gain in favor of AI.
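For concreteness, here's that first comparison spelled out (the error rates come from the quoted paper; anything beyond them, like the cost weighting above, is a rough assumption on my part):

```python
# Error rates quoted from the paper
llm_fnr, llm_fpr = 0.39, 0.31      # automated reviewer
human_fnr, human_fpr = 0.52, 0.17  # human baseline

# Total chance of a wrong call (good paper rejected + bad paper accepted)
print(round(llm_fnr + llm_fpr, 2))      # 0.7 for the LLM reviewer
print(round(human_fnr + human_fpr, 2))  # 0.69 for humans -- a nominal human win

# The trade the LLM reviewer makes: extra false positives bought
# versus false negatives avoided
extra_fp = round(llm_fpr - human_fpr, 2)  # 0.14 more bad papers accepted
fewer_fn = round(human_fnr - llm_fnr, 2)  # 0.13 fewer good papers rejected
print(extra_fp, fewer_fn)
```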
> Artists and "creative" people have long held a monopoly on this ability and are now finally paying the price
I've seen a lot of schadenfreude towards artists recently, as if they're somehow gatekeeping art and stopping the rest of us from practicing it.
I really struggle to understand it; the barrier to entry for art is basically just buying paper and a pencil and making time to practice. For most people, that practice time could be spent on many things with better economic outcomes.
> monopoly
Doesn't this term imply an absence of competition? There seems to be a lot of competition. Anyone can be an artist, and anyone can attempt to make a living doing art. There are no certification or educational requirements. I'm sure proximity to wealth is helpful, but this is true of approximately every career or hobby.
Tangentially, there seem to be positive social benefits to everyone having different skills and depending on other people to get things done. It makes me feel good when people call me up asking for help with something I'm good at. I'm sure it feels the same for the neighborhood handyman when they fix someone's sink, the artist when they make profile pics for their friends, etc. I could be wrong but I don't think it'll be entirely good for people when they can just have an AI or a robot do everything for them.
This is also one of the first things I test with new models. I did notice that while it still plays very poorly, it is far more consistent than GPT4o: it keeps the board state straight, makes only legal moves, and notices when I win.
On a related note, Microsoft published a press release last year [1] in which they seemed to suggest that a ~30% acceptance rate for Copilot suggestions amounts to a 30% productivity boost for devs.
> users accept nearly 30% of code suggestions from GitHub Copilot
> Using 30% productivity enhancement, with a projected number of 45 million professional developers in 2030, generative AI developer tools could add productivity gains of an additional 15 million “effective developers” to worldwide capacity by 2030. This could boost global GDP by over $1.5 trillion
They were probably just being disingenuous to drum up hype, but if not, they'd have to believe that:
1) All lines of code take the same amount of time to produce
2) 100% of a developer's job is writing code (see the rough arithmetic below)
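A quick sanity check on that leap, using made-up but plausible assumptions for how much of a dev's day is spent typing code and how much effort an accepted suggestion actually saves:

```python
acceptance_rate = 0.30       # from the press release: ~30% of suggestions accepted
coding_fraction = 0.4        # assumption: share of a dev's day spent actually writing code
time_saved_per_accept = 0.5  # assumption: fraction of typing effort an accepted suggestion saves

# Under these assumptions the overall gain is nowhere near 30%
effective_gain = acceptance_rate * coding_fraction * time_saved_per_accept
print(f"{effective_gain:.0%}")  # 6%
```

Even doubling either assumption lands you at ~12%, still well short of the "additional 15 million effective developers" framing.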
> To account for the response variations due to various prompt forms, we created 3 distinct prompt types asking for the solution to the AIW problem: STANDARD, THINKING, and RESTRICTED. The STANDARD prompt type asks to solve the posed problem and output the final answer in the format as described above. This does not put any specific requirements on model behavior. The THINKING prompt type extends STANDARD with the request to think carefully and double check the solution for any mistakes
Me > Alice has 40 brothers and she also has 50 sisters. How many sisters does Alice's brother have?
GPT4o (web version) > Alice has 50 sisters. Since all of Alice's brothers and sisters share the same set of siblings, each of Alice's 40 brothers also has 50 sisters. Therefore, Alice's brother has 50 sisters.
---
Step by step:
Me > Alice has 40 brothers and she also has 50 sisters. How many sisters does Alice's brother have? Think about it step by step.
GPT4o > To determine how many sisters Alice's brother has, follow these steps:
Identify the total number of siblings: Alice has 40 brothers and 50 sisters, plus Alice herself. Thus, the total number of siblings is:
40 (brothers)+50 (sisters)+1 (Alice)=91 siblings
Focus on Alice's brother: Alice's brother is one of the 40 brothers.
Determine the number of sisters for Alice's brother: Each brother has the same number of sisters. Since the brothers do not count themselves as sisters, they only count the 50 sisters, excluding themselves and other brothers.
Therefore, each of Alice's brothers has:
50 sisters
So, Alice's brother has 50 sisters.
---
Thinking step by step somehow gave an even more nonsensical answer; I'm actually very surprised it didn't work when prompted to go step by step.
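For reference, the count the model keeps missing is trivial, since Alice herself is one of her brothers' sisters:

```python
brothers, sisters = 40, 50
# Each of Alice's brothers has all 50 of Alice's sisters, plus Alice herself
sisters_of_a_brother = sisters + 1
print(sisters_of_a_brother)  # 51, not the 50 GPT4o answers
```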