Any odd number 2n+1 of "a" repetitions, longest match. The inflating first rule ...

orlp · on June 5, 2024

I'm sorry that is not correct at all. It does not match the string "aaaaa" for example.

HelloNurse · on June 5, 2024

Why not? Preferring the first rule, after seeing a you expect Str a; after seeing another a you expect Str a a; after seeing the third a you expect Str a a a (and you can already realize it cannot match) so after some backtracking you match the third a with the second rule instead, expecting a a, which are the fourth and fifth a in the input.

Do you have a different parsing algorithm in mind?

orlp · on June 5, 2024

> Do you have a different parsing algorithm in mind?

Yes, since I specified a PEG grammar I assumed a PEG parsing algorithm.

You can try the grammar yourself in a PEG parser, e.g. https://shamansir.github.io/pegjs-fn/.

HelloNurse · on June 5, 2024

This tool seems to dislike recursion, it manages to parse aaaaa with

   Str =   "a"  Str2 "a" /"a"
   Str2 =  "a"  Str3 "a" /"a"
   Str3 =  "a"

or also

   Str =   "a"  Str2 "a" 
   Str2 =  "a"  Str3 "a" 
   Str3 =  "a"

I'd call it a bug.

Note that

   Str =  "aa"*"a"

parses any odd number of "a" repetitions correctly.

orlp · on June 5, 2024

> I'd call it a bug.

It's not a bug in the tool, it's inherent to PEG. All PEG parsers will behave the same on this grammar.

My point is that PEG is highly unintuitive in how it works to most human brains, which is why I specifically asked for a plain explanation why this happens to someone who claims their brains align with how PEG works.

HelloNurse · on June 5, 2024

The ordered choice operator should try all choices, not stop earlier.

There's probably some cache confusing, for instance, not matching the first Str rule at the third a with not matching a Str at all (neither rule) at the third a.

The original example (here with parentheses, just in case)

   Str= ("a" Str "a") / "a"

matches strings of 2N-1 "a" repetitions if N is a power of 2 (i.e. length 1,3,7,15,31,63...) which is an obvious symptom of improperly constraining rule expansion. So I insist, you are using an incorrect parsing tool.

Interestingly,

   Str= ("a" Str "a") / "b"

matches strings of N "a", one "b" in the middle and N "a" for any N, without degenerate behaviour.

orlp · on June 5, 2024

> The ordered choice operator should try all choices, not stop earlier.

PEGs made this choice to ensure linear-time parsing. Another example of what I would consider a failure is the PEG grammar:

    Keyword = "else" / "elseif"

This will only match "else", it will never match "elseif" as the ordered choice will refuse to turn the successful match of "else" into a failure to try the "elseif".

> So I insist, you are using an incorrect parsing tool.

I agree that PEG is an incorrect formalization for parsing, as it has highly unintuitive and surprising behaviors like this one.

However, I have to insist that this is part of Parsing Expression Grammars, and not an issue with some particular tool implementing PEGs. It's part of the formalization, and if you wish to avoid this you need a different formalization, like Context-Free Grammars.

My personal favorite is LR(1) CFG grammars: you are guaranteed to get unambiguous and linear-time parsing grammars, or else the grammar will fail to compile.

What your intuition probably is is that PEGs are Ordered Context-Free Grammars, e.g. such as in https://arxiv.org/pdf/2309.08717, where first a full ambiguous parse forest is constructed and then using ordered rules a single canonical parse is selected. That is a lot more sensible, but also raises the time complexity of parsing from O(n) to O(n^4) for strings of length n. However, that's not what PEGs are.

aidenn0 · on June 5, 2024

> My personal favorite is LR(1) CFG grammars: you are guaranteed to get unambiguous and linear-time parsing grammars, or else the grammar will fail to compile.

We can both agree that LR(1) is great; it parses a large and useful subset of CFLs; had the minimal LR(1) parser been discovered before YACC was written, I probably would never have moved to PEGs; hopefully we can all agree that LALR parsers are terrible?

As a side note, I don't consider LR(1) parsers to be CFGs since they can only parse a subset of CFLs (actually only a subset of DCFLs if I remember my rusty CS correctly).

[edit]

Also TFA argues for Early parsers, which seems to me to be a poor choice for parsing programming languages.

orlp · on June 5, 2024

> Hopefully we can all agree that LALR parsers are terrible?

Yes, I think most of the bad reputation LR(1) gets is from stuff that is actually LALR(1) with its nonsense reduce-reduce conflicts.

Another great thing is that LR(1) parser generators are being written now that instead of just spitting out a 'oops there was a conflict on these two things in this state', they can actually give you back two conflicting prefixes giving you much more insight in what causes the defect in your grammar and how to fix it.

aidenn0 · on June 5, 2024

> Keyword = "else" / "elseif"

Are there any languages in which this rule makes sense? I can't imagine wanting to write

  ifFOOelseifBARelseBAZ

And ending your token rules with WS|EOF removes any issues, so

  if FOO elseif BAR else BAZ

works just fine.

thaumasiotes · on June 5, 2024

I worked this out by hand and I'd recommend you do the same. The paper defining the behavior of PEGs is here: https://bford.info/pub/lang/peg.pdf . Section 3.3 is the relevant one, defining all the parsing rules.

As for the simple explanation of what's going wrong:

1. We try to parse `aaaaa` as if it opened with the pattern `a STR a`. [And, with foresight, let's observe that this is true of the way we want the parse to go.]

2. We match the `a` at the beginning of the input and try to parse `aaaa` as if it opened with the pattern `STR a`.

3. This repeats; we make the same guess that we're dealing with a recursive STR, we match an `a`, and then we try to parse `aaa` as if it opened with `STR a`. I didn't mention this before, but the first step of doing this is to match the first element of the concatenation, `STR`, against the input.

4. Here's where things go wrong. We still guess that we're dealing with a recursive STR, because that rule has priority over the terminal rule `STR = a`. With foresight, let's observe that in the parse we want, this guess will fail, because we need the third STR to be a literal `a`. In that world, we'd then match the two following `a`s and our parse would succeed.

5. However, when we try to match our recursive rule `a STR a` against our input `aaa`, this succeeds, because that's a correct match. We have an outer recursive STR and an inner terminal STR, yielding `aaa`.

6. Since our first guess that the suffix `__aaa`` opens with a recursive STR succeeded, we will never experiment with what would happen if it opened with a terminal STR, the way we wish it would.

7. That was our only chance; the whole parse will now fail to match in predictable ways.

-----

> The ordered choice operator should try all choices, not stop earlier.

You can't defend PEGs by saying you wish they were defined differently. That's not a defense!

Look at the paper:

> Alternation (case 1): If (e₁, xy) => (n₁, x) then (e₁ / e₂, xy) => (n₁ + 1, x). Alternative e₁ is first tested, and if it succeeds, the expression e₁ / e₂ succeeds without testing e₂.

It's hard to be clearer than that.

kragen · on June 5, 2024

that's not how pegs work, and if you add that kind of unrestricted backtracking to pegs, you get a grammar class that nobody knows how to parse in linear time

pegs, by contrast, like ll(1) and lr grammars, have a linear-time parsing algorithm available (though its constant factor is significantly higher). that's one of the reasons people use them

you should read an introduction to pegs; you might like them once you get past 'i insist peg parsers are incorrect parsing tools' ;)

adwn · on June 5, 2024

No, orlp is right, as far as I understand. It's not a bug in the tool, but inherent to PEGs in general. Wikipedia has a section on this problem: https://en.wikipedia.org/wiki/Parsing_expression_grammar#The...

OskarS · on June 5, 2024

I’ve never touched a PEG parser in my life and parsing isn’t super my thing in general, but: a recursive descent parser could parse this just fine (not left recursive), correct? But it would have to backtrack after trying rule 1 five times, and switch to rule 2 when it backtracks to the third layer of recursion?

PEGs can’t do this kind of backtracking, is that the reason?

kragen · on June 5, 2024

yes, exactly correct

it's common to write recursive-descent parsers without this kind of backtracking support, though, because expressing that kind of backtracking in a natural way requires coroutines

redman25 · on June 5, 2024

Wow, this is surprising to me. Why doesn’t this work?

HelloNurse · on June 5, 2024

See the nearby comment by adwn: aggressive "packrat" caching fixates on long matches starting at intermediate positions, without considering any shortest matches at the same position.

Correct parsing would require discarding these long matches because they cause higher level rule expansions to fail, but smarter algorithms with an increase of runtime due to complete backtracking and/or an increase of memory usage to distinguish and remember different matches for the same rule at the same location would be required.

EDIT: two nearby comments by adwn.

kragen · on June 5, 2024

this is not specific to the packrat parsing algorithm; any algorithm for parsing pegs has the same behavior as packrat. your understanding of peg semantics is incorrect

adwn · on June 5, 2024

(I'll use position 1...5 to refer to the characters in "aaaaa", and first, second, third level to refer to the initial and then the recursive calls to the Str rule)

1. When the PEG parser enters the third level of Str, its first choice will successfully match on positions 3...5, and return to the second level of Str.

2. The first choice on the second level will fail, however, because it expects an "a" at position 6, which doesn't exist. Therefore, the first choice is discarded.

3. The second choice on the second level will succeed by matching "a" on position 2, successfully returning to the first level.

4. The first level finishes matching the first choice by consuming the "a" on position 3, successfully returning the parse tree Str = ("a" ("a") "a")

5. The remaining "aa" on positions 4...5 weren't consumed.