Hmm, wouldn't it sacrifice a better answer in some cases (not sure how many though)?
I'll be surprised if they hadn't specifically trained for structured "correct" output for this, in addition to picking next token following the structure.
In my experience (I've put hundreds of billions of tokens through structured outputs over the last 18 months), I think the answer is yes, but only in edge cases.
It generally happens when the grammar is highly constrained, for example if a boolean is expected next.
If the model assigns a low probability to both true and false coming next, then the sampling strategy will pick whichever one happens to score highest. Most tokens have very similar probabilities close to 0 most of the time, and if you're picking between two of these then the result will often feel random.
It's always the result of a bad prompt though, if you improve the prompt so that the model understands the task better, then there will then be a clear difference in the scores the tokens get, and so it seems less random.
It's not just the prompt that matters, it's also field order (and a bunch of other things).
Imagine you're asking your model to give you a list of tasks mentioned in a meeting, along with a boolean indicating whether the task is done. If you put the boolean first, the model must decide both what the task is and whether it is done at the same time. If you put the task description first, the model can separate that work into two distinct steps.
There are more tricks like this. It's really worth thinking about which calculations you delegate to the model and which you do in code, and how you integrate the two.
Grammars work best when aligned with prompt. That is, if your prompt gives you the right format of answer 80% of the time, the grammar will take you to a 100%. If it gives you the right answer 1% of the time, the grammar will give you syntactically correct garbage.
Sampling is already constrained with temperature, top_k, top_p, top_a, typical_p, min_p, entropy_penalty, smoothing etc. – filtering tokens to valid ones according to grammar is just yet another alternative. It does make sense and can be used for producing programming language output as well – what's the point in generating/bothering with up front know, invalid output? Better to filter it out and allow valid completions only.
I'll be surprised if they hadn't specifically trained for structured "correct" output for this, in addition to picking next token following the structure.