In general, all bets are off. The model has a specific auto-regressive structure (a factoring of the joint distribution into conditional probabilities), and any other sampling scheme should be examined with extreme suspicion. At a bare minimum, spot-check a good number of real-world problems from your target domain.
For more constrained examples like that, you're probably a lot safer than with completely unconstrained strings.
Even for very simple constraints like that, it's not hard to imagine exactly the same ellipsis problem (especially if your chosen JSON grammar isn't careful about unbounded vs. 53-bit integers). Say you really want a non-negative integer, you're using JSON, and you don't want to roll your own grammar, so you use an off-the-shelf JSON LLM CFG toolkit of some flavor and allow _numbers_ in that field. I haven't personally seen ellipsis bugs after only a few dozen characters in current-gen LLMs, but to have something concrete to talk about, let's say your model "tries" to write a transaction_number of 3141592635.... It wants to truncate the actual result with an ellipsis. The grammar accepts the first period as a decimal point, but from then on only digits are allowed (the full ellipsis can never happen). You successfully adhere to the grammar and have garbage in that field, where a simple retry loop would likely have completed correctly.
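To make that failure mode concrete, here's a minimal, self-contained sketch. The `allowed_after`/`constrained_decode` functions are toy stand-ins of my own, not from any real toolkit, but they show how a number grammar converts an attempted ellipsis into a well-formed but meaningless value:

```python
import re

# Toy JSON-number grammar state: at most one decimal point, and only
# digits are legal once the point has been used.
def allowed_after(prefix: str) -> re.Pattern:
    if "." in prefix:
        return re.compile(r"[0-9]")   # point already used: digits only
    return re.compile(r"[0-9.]")      # digits, or the one decimal point

def constrained_decode(model_preference: str, fallback: str = "1") -> str:
    """Greedy sketch: keep the model's preferred character when the grammar
    allows it; otherwise substitute the next-ranked legal character."""
    out = ""
    for ch in model_preference:
        if allowed_after(out).fullmatch(ch):
            out += ch
        else:
            # The model "wanted" another '.' of the ellipsis; the grammar
            # masks it out, so sampling picks some legal digit instead.
            out += fallback
    return out

# The model tries to truncate with an ellipsis...
print(constrained_decode("3141592635..."))  # -> 3141592635.11 (valid JSON, garbage value)
```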
Assuming the grammar actually restricts that spot to a non-empty contiguous sequence of digits, it's _harder_ to force a bug into, but not impossible. For starters, the whole thing is probabilistic. Suppose the model emits "0" as the first digit. Leading zeros aren't allowed in JSON, so the only remaining valid (integer) continuation is closing the field; the value is pinned to exactly 0. If for _any_ reason the model is more likely to produce invalid JSON after incorrectly emitting that "0", the grammar-constrained implementation will have more errors than a basic retry loop.
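The leading-zero trap falls straight out of the JSON integer production, `int = "0" | [1-9][0-9]*`. A minimal sketch (the `legal_next` function is illustrative, not from any library):

```python
# JSON integers: int = "0" | [1-9][0-9]*
# Once the first emitted digit is "0", no further digits are legal; the
# only grammatical moves close out the number, pinning the value to 0.
def legal_next(digits_so_far: str) -> set:
    terminators = set(',}] \t\n')
    if digits_so_far == "":
        return set("0123456789")          # any first digit is fine
    if digits_so_far == "0":
        return terminators                # no digit may follow a lone "0"
    return set("0123456789") | terminators

print(legal_next("0"))   # terminators only: one bad token ruined the field
print(legal_next("3"))   # digits still available: the model can continue
```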
You can probably do something clever, like varying the sampling temperature depending on where you are in the grammar, to side-step some of those issues. You might also find that for a particular problem my theoretical complaints don't apply in practice. Maybe, for business reasons, past a certain accuracy rate you value being able to write simpler code (i.e., assume valid JSON) even if accuracy dips from 99% to 98% on the task. Nothing I've outlined says "don't guide an LLM with grammars." Do be careful with it, though, and consciously weigh the task at hand against your proposed solution. It's not a free lunch.
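For comparison, the "basic retry loop" I keep referencing is only a few lines. A minimal sketch, assuming a hypothetical `generate(prompt)` that returns raw, unconstrained model text (the field name is illustrative):

```python
import json

def get_transaction(prompt: str, generate, max_tries: int = 3) -> dict:
    """Retry-loop baseline: sample unconstrained, validate, repeat."""
    for _ in range(max_tries):
        raw = generate(prompt)               # unconstrained sampling
        try:
            obj = json.loads(raw)
            n = obj["transaction_number"]    # illustrative field name
            if isinstance(n, int) and n >= 0:
                return obj                   # well-formed AND plausible
        except (json.JSONDecodeError, KeyError, TypeError):
            pass                             # malformed output: try again
    raise ValueError("no valid output after retries")
```

The point isn't that this beats constrained decoding in general; it's the baseline your grammar setup has to beat on your actual task distribution.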