This test has always been so stupid since models work at the token level. Claude 3.5 already 5xs your frontend dev speed but people still say "hurr durr it can't count strawberry" as if that's a useful problem
You can rely on it for anything that you can validate quickly. And it turns out, there are a lot of problems which are trivial to validate the solution to, but difficult to build the solution.