P(X_1=x_1, X_2=x_2, X_3=x_3) = P(X_3=x_3 | X_1=x_1, X_2=x_2) • P(X_1=x_1, X_2=x_2) = P(X_3=x_3 | X_1=x_1, X_2=x_2) • P(X_2=x_2 | X_1=x_1) • P(X_1=x_1)

That is to say: having a correct conditional probability distribution over the next token, given the previous tokens, produces a correct probability distribution over sequences of tokens.
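To make that concrete, here is a small numerical check of the identity above. It is only a sketch: the two-token vocabulary, the random joint distribution, and all function names are made up for illustration. It derives the next-token conditionals from an arbitrary joint distribution over length-3 sequences, multiplies them back together, and compares the result with the joint.

```python
import itertools
import random

VOCAB = ["a", "b"]  # toy two-token vocabulary (hypothetical)
LENGTH = 3          # sequences (x_1, x_2, x_3), as in the identity above


def random_joint(seed=0):
    """Arbitrary joint distribution over all length-3 token sequences."""
    rng = random.Random(seed)
    seqs = list(itertools.product(VOCAB, repeat=LENGTH))
    weights = [rng.random() for _ in seqs]
    total = sum(weights)
    return {s: w / total for s, w in zip(seqs, weights)}


def prefix_prob(joint, prefix):
    """Marginal probability that a sequence starts with `prefix`."""
    return sum(p for s, p in joint.items() if s[:len(prefix)] == prefix)


def next_token_prob(joint, prefix, token):
    """P(X_{n+1}=token | X_1..X_n = prefix), derived from the joint."""
    return prefix_prob(joint, prefix + (token,)) / prefix_prob(joint, prefix)


def chain_rule_prob(joint, seq):
    """Multiply the next-token conditionals along `seq`."""
    prob = 1.0
    for i, token in enumerate(seq):
        prob *= next_token_prob(joint, seq[:i], token)
    return prob


joint = random_joint()
# Largest gap between the joint and the product of conditionals;
# this should be zero up to floating-point rounding.
max_error = max(abs(chain_rule_prob(joint, s) - p) for s, p in joint.items())
print(max_error)
```

Nothing here depends on the length being 3 or the vocabulary having two tokens; the same loop works for any finite sequence length.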

And “correct probability distribution over sequences of tokens” (or “correct conditional probability distribution over sequences of tokens, conditional on whatever”) can be... well, you can describe pretty much any kind of input/output behavior in those terms.

So, “it works by predicting the next token” is, at least in principle, not much of a constraint on what kinds of input/output behavior it can have?
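A toy illustration of that point, as a sketch only: string reversal stands in for "whatever input/output behavior you like", tokens are single characters, and the END token and function names are invented for this example. The behavior is packaged as a next-token distribution (here a point mass), and the output is then produced strictly one token at a time.

```python
END = "<end>"  # hypothetical end-of-sequence token


def target_behavior(prompt):
    """Any input/output behavior would do; string reversal is a stand-in."""
    return prompt[::-1]


def next_token_distribution(prompt, generated):
    """P(next token | prompt, tokens generated so far), as a dict.

    All of the probability goes on the token the target behavior emits next.
    """
    answer = list(target_behavior(prompt)) + [END]
    return {answer[len(generated)]: 1.0}


def generate(prompt):
    """Produce the output one token at a time from the next-token distribution."""
    generated = []
    while True:
        dist = next_token_distribution(prompt, generated)
        token = max(dist, key=dist.get)  # greedy; the distribution is a point mass
        if token == END:
            return "".join(generated)
        generated.append(token)


print(generate("next token"))  # same string as target_behavior("next token")
```

This says nothing about whether a trained model actually ends up with such a distribution; it only shows that the "one token at a time" interface does not by itself rule the behavior out.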

So, whatever impressive thing it does is not really in conflict with its output being produced from the probability distribution P(X_{n+1}=x_{n+1} | X_1=x_1, ..., X_n=x_n) (“predicting the next token”).