diff --git a/README.md b/README.md index d9afc7f..2d69420 100644 --- a/README.md +++ b/README.md @@ -106,7 +106,7 @@ Lark is great at handling ambiguity. Here is the result of parsing the phrase "f - MyPy support using type stubs - And much more! -See the full list of [features here](https://lark-parser.readthedocs.io/en/latest/features/) +See the full list of [features here](https://lark-parser.readthedocs.io/en/latest/features.html) ### Comparison to other libraries @@ -132,7 +132,7 @@ Check out the [JSON tutorial](/docs/json_tutorial.md#conclusion) for more detail |:--------|:----------|:----|:--------|:------------|:------------|:----------|:---------- | **Lark** | Earley/LALR(1) | EBNF | Yes! | Yes! | Yes! | Yes! | Yes! (LALR only) | | [PLY](http://www.dabeaz.com/ply/) | LALR(1) | BNF | No | No | No | No | No | -| [PyParsing](http://pyparsing.wikispaces.com/) | PEG | Combinators | No | No | No\* | No | No | +| [PyParsing](https://github.com/pyparsing/pyparsing) | PEG | Combinators | No | No | No\* | No | No | | [Parsley](https://pypi.python.org/pypi/Parsley) | PEG | EBNF | No | No | No\* | No | No | | [Parsimonious](https://github.com/erikrose/parsimonious) | PEG | EBNF | Yes | No | No\* | No | No | | [ANTLR](https://github.com/antlr/antlr4) | LL(*) | EBNF | Yes | No | Yes? | Yes | No | diff --git a/docs/_static/sppf/sppf.html b/docs/_static/sppf/sppf.html new file mode 100644 index 0000000..c9c3d21 --- /dev/null +++ b/docs/_static/sppf/sppf.html @@ -0,0 +1,212 @@ + + + + + + + + + +

Shared Packed Parse Forest (SPPF)


In the last decade there has been a lot of interest in generalized parsing techniques. These techniques can be used to generate a working parser for any context-free grammar. This means that we no longer have to massage our grammar to fit into restricted classes such as LL(k) or LR(k). Supporting all context-free grammars means that grammars can be written in a natural way, and grammars can be combined, since the class of context-free grammars is closed under composition.


One of the consequences of supporting the whole class of context-free grammars is that ambiguous grammars are supported as well. In an ambiguous grammar there are sentences in the language that can be derived in multiple ways. Each derivation results in a distinct parse tree. For each additional ambiguity in the input sentence, the number of derivations might grow exponentially. Therefore generalized parsers output a parse forest rather than a set of parse trees. In this parse forest, sharing is often used to reduce the total space required to represent all derivation trees: nodes which have the same subtree are shared, and nodes which correspond to different derivations of the same substring are combined. A parse forest where sharing is employed is called a shared packed parse forest (SPPF).


This article describes the SPPF data structure in more detail. More information about generating the SPPF with the GLL algorithm can be found in the paper GLL parse-tree generation by E. Scott and A. Johnstone. Right Nulled GLR parsers can also be used to generate an SPPF, as described in the paper Right Nulled GLR Parsers by E. Scott and A. Johnstone.


There are three types of nodes in an SPPF associated with a GLL parser: symbol nodes, packed nodes, and intermediate nodes. In the visualizations, symbol nodes are shown as rectangles with rounded corners, packed nodes as circles (or ovals when the label is shown), and intermediate nodes as rectangles.


Symbol nodes have labels of the form $(x,j,i)$ where $x$ is a terminal, nonterminal, or $\varepsilon$ (i.e. $x \in T \cup N \cup \lbrace \varepsilon \rbrace$), and $0 \leq j \leq i \leq m$ with $m$ being the length of the input sentence. The tuple $(j,i)$ is called the extent, and denotes that the symbol $x$ has been matched on the substring from position $j$ up to position $i$. Here $j$ is called the left extent, and $i$ is called the right extent.


Packed nodes have labels of the form $(t,k)$, where $0 \leq k \leq m$. Here $k$ is called the pivot, and $t$ is of the form $X ::= \alpha \cdot \beta$. The value of $k$ represents that the last symbol of $\alpha$ ends at position $k$ of the input string. Packed nodes are used to represent multiple derivation trees. When multiple derivations are possible with the same extent, starting from the same nonterminal symbol node, a separate packed node is added to the symbol node for each derivation.


Intermediate nodes are used to binarize the SPPF. They are introduced from the left, grouping the children of packed nodes in pairs. This binarization ensures that the size of the SPPF is worst-case cubic in the length of the input sentence. The fact that the SPPF is binarized does not mean that each node in the SPPF has at most two children: a symbol node or intermediate node can still have as many packed node children as there are ambiguities starting from it. Intermediate nodes have labels of the form $(t,j,i)$ where $t$ is a grammar slot and $(j,i)$ is the extent. There are no intermediate nodes of the shape $(A ::= \alpha \cdot, j, i)$, where the grammar pointer of the grammar slot is at the end of the alternate; these grammar slots are present in the form of symbol nodes.
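To make the three node types concrete, here is a minimal Python sketch of an SPPF representation. The class and field names are illustrative only and are not Lark's internal forest API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class SymbolNode:
    # Label (x, j, i): symbol x (terminal, nonterminal or epsilon), matched
    # over the substring from position j (left extent) to i (right extent).
    symbol: str
    start: int
    end: int
    children: List["PackedNode"] = field(default_factory=list)  # one per derivation


@dataclass
class IntermediateNode:
    # Label (t, j, i): grammar slot t, e.g. "S ::= A B . C D", with extent (j, i).
    slot: str
    start: int
    end: int
    children: List["PackedNode"] = field(default_factory=list)


@dataclass
class PackedNode:
    # Label (t, k): one derivation, where the pivot k marks the position at
    # which the last symbol before the dot ends. At most two children: the
    # right child is a symbol node with extent (k, i); the optional left
    # child is a symbol or intermediate node with extent (j, k).
    slot: str
    pivot: int
    left: Optional[Union[SymbolNode, IntermediateNode]] = None
    right: Optional[SymbolNode] = None
```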


Consider the following grammar:


$\quad S ::= ABCD \quad A ::= a \quad B ::= b \quad C ::= c \quad D ::= d. $


Then, given the input sentence $abcd$, the following SPPF will be the result:

[Figure: SPPF with intermediate nodes (sppf_abcd.svg)]

Suppose that the intermediate nodes had not been added to the SPPF. Then the nonterminal symbol nodes for $A$, $B$, $C$, and $D$ would have been attached to the nonterminal symbol node $S$:

[Figure: SPPF without intermediate nodes (sppf_abcd_noint.svg)]

This example shows how intermediate nodes ensure that the tree is binarized.
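For reference, the grammar from this example can be written and parsed with Lark as follows. This is only a sketch: Lark's Earley parser builds such a forest internally and, for this unambiguous input, returns a single tree.

```python
from lark import Lark

parser = Lark("""
    s: A B C D

    A: "a"
    B: "b"
    C: "c"
    D: "d"
""", start="s")  # the parser defaults to Earley

print(parser.parse("abcd").pretty())
```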


Adding cycles


Grammars that contain cycles can generate sentences which have infinitely many derivation trees. A context-free grammar is cyclic if there exists a nonterminal $A \in N$ and a derivation $A \overset{+}\Rightarrow A$. Note that a cyclic context-free grammar implies that the context-free grammar is left-recursive, but the converse does not hold. The derivation trees for a cyclic grammar are represented in the finite SPPF by introducing cycles in the graph.


Consider the following cyclic grammar: $S ::= SS \mid a \mid \varepsilon$.


Given input sentence $a$, there are infinitely many derivations. All these derivations are present in the following SPPF:

[Figure: SPPF containing an infinite number of derivations (sppf_cycle.svg)]

Ambiguities


A parse forest is ambiguous if and only if it contains at least one ambiguity. An ambiguity arises when a symbol node or intermediate node has at least two packed nodes as its children. Such nodes are called ambiguous. Consider for instance the following grammar with input sentence $1+1+1$: $E ::= E + E \mid 1$.


This gives the following SPPF:

[Figure: SPPF containing an ambiguous root node (sppf_111.svg)]

In this SPPF, the symbol node $(E,0,5)$ has two packed nodes as children. This means that there are at least two different parse trees starting at this node, representing the derivations $(E+(E+E))$ and $((E+E)+E)$ respectively.
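In Lark, the two derivations can be observed directly by asking the Earley parser to keep ambiguities instead of resolving them. A minimal sketch, mirroring the $E ::= E + E \mid 1$ grammar (the exact shape of the printed tree depends on Lark's token-filtering defaults):

```python
from lark import Lark

parser = Lark("""
    e: e "+" e
     | "1"
""", start="e", ambiguity="explicit")  # Earley parser, keep all derivations

tree = parser.parse("1+1+1")
# The root is an "_ambig" node with one subtree per derivation,
# corresponding to the two packed nodes below the symbol node (E, 0, 5).
print(tree.pretty())
```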


The set of all parse trees present in the SPPF is defined in the following way:


Start at the root node of the SPPF and walk the forest recursively, choosing exactly one packed node below each visited symbol or intermediate node, and all of the children below each visited packed node.
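Using the illustrative node classes sketched earlier, this procedure could look roughly as follows. It is only a sketch: it assumes the forest is acyclic (a cyclic SPPF encodes infinitely many trees), and it keeps intermediate nodes in the output, which a real implementation would flatten away.

```python
from itertools import product


def enumerate_trees(node):
    """Yield one nested (label, children) pair per parse tree below `node`."""
    if isinstance(node, SymbolNode) and not node.children:
        # Terminal or epsilon symbol node: a leaf of the forest.
        yield (node.symbol, [])
        return
    label = node.symbol if isinstance(node, SymbolNode) else node.slot
    # Choose exactly one packed node below a symbol or intermediate node...
    for packed in node.children:
        kids = [c for c in (packed.left, packed.right) if c is not None]
        # ...and take all of the children below the chosen packed node.
        for combo in product(*(list(enumerate_trees(k)) for k in kids)):
            yield (label, list(combo))
```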


Structural Properties


There are various structural properties that are useful when reasoning about SPPFs in general. First, note that each symbol node $(E,j,i)$ with $E \in T \cup N \cup \lbrace \varepsilon \rbrace$ is unique, so an SPPF does not contain two distinct symbol nodes $(A,k,l)$ and $(B,m,n)$ with $A = B$, $k = m$, and $l = n$.


Terminal symbol nodes have no children. These nodes represent the leaves of the parse forest. Nonterminal symbol nodes $(A,j,i)$ have packed node children of the form $(A ::= \gamma \cdot, k)$ with $j \leq k \leq i$, and the number of children is not limited to two.


Intermediate nodes $(t,j,i)$ have packed node children with labels of the form $(t,k)$, where $j \leq k \leq i$.


Packed nodes $(t,k)$ have one or two children. The right child is a symbol node $(x,k,i)$, and the left child (if it exists) is a symbol or intermediate node with label $(s,j,k)$, where $j \leq k \leq i$. Packed nodes always have exactly one parent, which is either a symbol node or an intermediate node.
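These invariants are easy to state as a small consistency check over the illustrative classes from above (a sketch, not part of Lark):

```python
def check_packed_node(parent, packed):
    # The right child is a symbol node spanning (pivot, parent.end)...
    assert isinstance(packed.right, SymbolNode)
    assert packed.right.start == packed.pivot
    assert packed.right.end == parent.end
    # ...and the optional left child covers (parent.start, pivot).
    if packed.left is not None:
        assert packed.left.start == parent.start
        assert packed.left.end == packed.pivot
```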


It is useful to observe that the SPPF is a bipartite graph: on one side the set of symbol and intermediate nodes, and on the other the set of packed nodes. Edges therefore always connect a node of one type to a node of the other type. As a consequence, cycles in the SPPF always have even length.


Transformation to an abstract syntax tree


In the end, we often want a single abstract syntax tree (AST) when parsing an input sentence. In order to arrive at this AST, we need disambiguation techniques that remove undesired parse trees from the SPPF, or that avoid generating them in the first place. {% cite sanden2014thesis %} describes several SPPF disambiguation filters that remove ambiguities arising in expression grammars. Furthermore, it describes a method to integrate parse-time filtering into GLL that tries to avoid embedding undesired parse trees in the SPPF at all.


Of course, other transformations might be needed as well, such as the removal of whitespace and comments from the parse forest.

+ +
+ + + + + + \ No newline at end of file diff --git a/docs/_static/sppf/sppf_111.svg b/docs/_static/sppf/sppf_111.svg new file mode 100644 index 0000000..d8af181 --- /dev/null +++ b/docs/_static/sppf/sppf_111.svg @@ -0,0 +1,765 @@ + + + +image/svg+xml(E, 0, 5) +(E ::= E + • E ,0,2) +(E, 2, 5) +(E, 4, 5) +(E ::= E + • E ,0,4) +(E, 0, 1) +(+, 1, 2) +(1, 0, 1) +(E ::= E + • E ,2,4) +(E, 2, 3) +(+, 3, 4) +(1, 2, 3) +(1, 4, 5) +(E, 0, 3) + \ No newline at end of file diff --git a/docs/_static/sppf/sppf_abcd.svg b/docs/_static/sppf/sppf_abcd.svg new file mode 100644 index 0000000..9ed8c80 --- /dev/null +++ b/docs/_static/sppf/sppf_abcd.svg @@ -0,0 +1,584 @@ + + + +image/svg+xml(S, 0, 4) +(S ::= A B C • D ,0,3) +(D, 3, 4) +(S ::= A B • C D ,0,2) +(C, 2, 3) +(A, 0, 1) +(B, 1, 2) +(a, 0, 1) +(b, 1, 2) +(c, 2, 3) +(d, 3, 4) + \ No newline at end of file diff --git a/docs/_static/sppf/sppf_abcd_noint.svg b/docs/_static/sppf/sppf_abcd_noint.svg new file mode 100644 index 0000000..ab9a46d --- /dev/null +++ b/docs/_static/sppf/sppf_abcd_noint.svg @@ -0,0 +1,522 @@ + + + +image/svg+xml(S, 0, 4) +(A, 0, 1) +(B, 1, 2) +(C, 2, 3) +(D, 3, 4) +(a, 0, 1) +(b, 1, 2) +(c, 2, 3) +(d, 3, 4) + \ No newline at end of file diff --git a/docs/_static/sppf/sppf_cycle.svg b/docs/_static/sppf/sppf_cycle.svg new file mode 100644 index 0000000..dcac54d --- /dev/null +++ b/docs/_static/sppf/sppf_cycle.svg @@ -0,0 +1,682 @@ + + + +image/svg+xml(S, 0, 1) +(S ::= a•,0) +(S ::= S S•,1) +(S ::= S S•,0) +(a, 0, 1) +(S ::= S • S ,0,1) +(S, 1, 1) +(S ::= S • S ,0,0) +(S ::= S • S ,0) +(S ::= +ε +•,1) +(S ::= S S•,1) +( +ε +, 1, 1) +(S ::= S • S ,1,1) +(S ::= S • S ,1) +(S ::= S • S ,0) +(S, 0, 0) +(S ::= +ε +•,0) +(S ::= S S•,0) +( +ε +, 0, 0) + \ No newline at end of file diff --git a/docs/features.md b/docs/features.md index cb711b3..fb272aa 100644 --- a/docs/features.md +++ b/docs/features.md @@ -23,7 +23,7 @@ ## Extra features - Import rules and tokens from other Lark grammars, for code reuse and modularity. - - Support for external regex module ([see here](classes.md#using-unicode-character-classes-with-regex)) + - Support for external regex module ([see here](classes.html#using-unicode-character-classes-with-regex)) - Import grammars from Nearley.js ([read more](nearley.md)) - CYK parser - Visualize your parse trees as dot or png files ([see_example](https://github.com/lark-parser/lark/blob/master/examples/fruitflies.py)) diff --git a/docs/grammar.md b/docs/grammar.md index 8ed0c58..f92b013 100644 --- a/docs/grammar.md +++ b/docs/grammar.md @@ -112,9 +112,13 @@ You can use flags on regexps and strings. For example: ```perl SELECT: "select"i //# Will ignore case, and match SELECT or Select, etc. MULTILINE_TEXT: /.+/s +SIGNED_INTEGER: / + [+-]? # the sign + (0|[1-9][0-9]*) # the digits + /x ``` -Supported flags are one of: `imslu`. See Python's regex documentation for more details on each one. +Supported flags are one of: `imslux`. See Python's regex documentation for more details on each one. Regexps/strings of different flags can only be concatenated in Python 3.6+ @@ -251,7 +255,7 @@ COMMENT: "#" /[^\n]/* ``` ### %import -Allows to import terminals and rules from lark grammars. +Allows one to import terminals and rules from lark grammars. When importing rules, all their dependencies will be imported into a namespace, to avoid collisions. It's not possible to override their dependencies (e.g. like you would when inheriting a class). 
@@ -264,7 +268,7 @@ When importing rules, all their dependencies will be imported into a namespace, %import (, , , ) ``` -If the module path is absolute, Lark will attempt to load it from the built-in directory (currently, only `common.lark` is available). +If the module path is absolute, Lark will attempt to load it from the built-in directory (which currently contains `common.lark`, `python.lark`, and `unicode.lark`). If the module path is relative, such as `.path.to.file`, Lark will attempt to load it from the current working directory. Grammars must have the `.lark` extension. diff --git a/docs/parsers.md b/docs/parsers.md index 13f2466..af7df6f 100644 --- a/docs/parsers.md +++ b/docs/parsers.md @@ -25,9 +25,18 @@ Lark provides the following options to combat ambiguity: 3) As an advanced feature, users may use specialized visitors to iterate the SPPF themselves. -**dynamic_complete** +**lexer="dynamic_complete"** + +Earley's "dynamic" lexer uses regular expressions in order to tokenize the text. It tries every possible combination of terminals, but it matches each terminal exactly once, returning the longest possible match. + +That means, for example, that when `lexer="dynamic"` (which is the default), the terminal `/a+/`, when given the text `"aa"`, will return one result, `aa`, even though `a` would also be correct. + +This behavior was chosen because it is much faster, and it is usually what you would expect. + +Setting `lexer="dynamic_complete"` instructs the lexer to consider every possible regexp match. This ensures that the parser will consider and resolve every ambiguity, even inside the terminals themselves. This lexer provides the same capabilities as scannerless Earley, but with different performance tradeoffs. + +Warning: This lexer can be much slower, especially for open-ended terminals such as `/.*/` -**TODO: Add documentation on dynamic_complete** ## LALR(1) @@ -37,7 +46,9 @@ Lark comes with an efficient implementation that outperforms every other parsing Lark extends the traditional YACC-based architecture with a *contextual lexer*, which automatically provides feedback from the parser to the lexer, making the LALR(1) algorithm stronger than ever. -The contextual lexer communicates with the parser, and uses the parser's lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals. It’s surprisingly effective at resolving common terminal collisions, and allows one to parse languages that LALR(1) was previously incapable of parsing. +The contextual lexer communicates with the parser, and uses the parser's lookahead prediction to narrow its choice of terminals. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals. It’s surprisingly effective at resolving common terminal collisions, and allows one to parse languages that LALR(1) was previously incapable of parsing. + +(If you're familiar with YACC, you can think of it as automatic lexer-states) This is an improvement to LALR(1) that is unique to Lark. diff --git a/docs/philosophy.md b/docs/philosophy.md index a1d8f8c..b2ce6e7 100644 --- a/docs/philosophy.md +++ b/docs/philosophy.md @@ -16,7 +16,7 @@ Lark's mission is to make the process of writing them as simple and abstract as 5. Performance is still very important -6. Follow the Zen Of Python, whenever possible and applicable +6. 
Follow the Zen of Python, whenever possible and applicable In accordance with these principles, I arrived at the following design choices: diff --git a/docs/visitors.rst b/docs/visitors.rst index 8086f7b..a0e1711 100644 --- a/docs/visitors.rst +++ b/docs/visitors.rst @@ -30,15 +30,17 @@ Example: :: class IncreaseAllNumbers(Visitor): - def number(self, tree): - assert tree.data == "number" - tree.children[0] += 1 + def number(self, tree): + assert tree.data == "number" + tree.children[0] += 1 IncreaseAllNumbers().visit(parse_tree) .. autoclass:: lark.visitors.Visitor + :members: visit, visit_topdown, __default__ .. autoclass:: lark.visitors.Visitor_Recursive + :members: visit, visit_topdown, __default__ Interpreter ----------- @@ -63,7 +65,7 @@ Transformer ----------- .. autoclass:: lark.visitors.Transformer - :members: __default__, __default_token__ + :members: transform, __default__, __default_token__, __mul__ Example: :: @@ -90,6 +92,11 @@ Example: T(visit_tokens=True).transform(tree) +.. autoclass:: lark.visitors.Transformer_NonRecursive + +.. autoclass:: lark.visitors.Transformer_InPlace + +.. autoclass:: lark.visitors.Transformer_InPlaceRecursive v_args ------ diff --git a/examples/advanced/error_reporting_earley.py b/examples/advanced/error_reporting_earley.py new file mode 100644 index 0000000..f0bcc20 --- /dev/null +++ b/examples/advanced/error_reporting_earley.py @@ -0,0 +1,79 @@ +""" +Example-Driven Error Reporting +============================== + +A demonstration of example-driven error reporting with the Earley parser +(See also: error_reporting_lalr.py) +""" +from lark import Lark, UnexpectedInput + +from _json_parser import json_grammar # Using the grammar from the json_parser example + +json_parser = Lark(json_grammar) + +class JsonSyntaxError(SyntaxError): + def __str__(self): + context, line, column = self.args + return '%s at line %s, column %s.\n\n%s' % (self.label, line, column, context) + +class JsonMissingValue(JsonSyntaxError): + label = 'Missing Value' + +class JsonMissingOpening(JsonSyntaxError): + label = 'Missing Opening' + +class JsonMissingClosing(JsonSyntaxError): + label = 'Missing Closing' + +class JsonMissingComma(JsonSyntaxError): + label = 'Missing Comma' + +class JsonTrailingComma(JsonSyntaxError): + label = 'Trailing Comma' + + +def parse(json_text): + try: + j = json_parser.parse(json_text) + except UnexpectedInput as u: + exc_class = u.match_examples(json_parser.parse, { + JsonMissingOpening: ['{"foo": ]}', + '{"foor": }}', + '{"foo": }'], + JsonMissingClosing: ['{"foo": [}', + '{', + '{"a": 1', + '[1'], + JsonMissingComma: ['[1 2]', + '[false 1]', + '["b" 1]', + '{"a":true 1:4}', + '{"a":1 1:4}', + '{"a":"b" 1:4}'], + JsonTrailingComma: ['[,]', + '[1,]', + '[1,2,]', + '{"foo":1,}', + '{"foo":false,"bar":true,}'] + }, use_accepts=True) + if not exc_class: + raise + raise exc_class(u.get_context(json_text), u.line, u.column) + + +def test(): + try: + parse('{"example1": "value"') + except JsonMissingClosing as e: + print(e) + + try: + parse('{"example2": ] ') + except JsonMissingOpening as e: + print(e) + + +if __name__ == '__main__': + test() + + diff --git a/examples/advanced/error_reporting_lalr.py b/examples/advanced/error_reporting_lalr.py index 102f7b1..c2cb239 100644 --- a/examples/advanced/error_reporting_lalr.py +++ b/examples/advanced/error_reporting_lalr.py @@ -3,7 +3,7 @@ Example-Driven Error Reporting ============================== A demonstration of example-driven error reporting with the LALR parser - +(See also: 
error_reporting_earley.py) """ from lark import Lark, UnexpectedInput diff --git a/examples/advanced/python3.lark b/examples/advanced/python3.lark index 78c9875..803a21b 100644 --- a/examples/advanced/python3.lark +++ b/examples/advanced/python3.lark @@ -23,7 +23,7 @@ decorated: decorators (classdef | funcdef | async_funcdef) async_funcdef: "async" funcdef funcdef: "def" NAME "(" parameters? ")" ["->" test] ":" suite -parameters: paramvalue ("," paramvalue)* ["," [ starparams | kwparams]] +parameters: paramvalue ("," paramvalue)* ["," "/"] ["," [starparams | kwparams]] | starparams | kwparams starparams: "*" typedparam? ("," paramvalue)* ["," kwparams] @@ -163,22 +163,14 @@ yield_arg: "from" test | testlist number: DEC_NUMBER | HEX_NUMBER | BIN_NUMBER | OCT_NUMBER | FLOAT_NUMBER | IMAG_NUMBER string: STRING | LONG_STRING -// Tokens - -NAME: /[a-zA-Z_]\w*/ -COMMENT: /#[^\n]*/ -_NEWLINE: ( /\r?\n[\t ]*/ | COMMENT )+ +// Import terminals from standard library (grammars/python.lark) +%import python (NAME, COMMENT, STRING, LONG_STRING) +%import python (DEC_NUMBER, HEX_NUMBER, OCT_NUMBER, BIN_NUMBER, FLOAT_NUMBER, IMAG_NUMBER) -STRING : /[ubf]?r?("(?!"").*?(? Tuple[PackageResource, str]: ... class Lark: - source: str - grammar_source: str + source_path: str + source_grammar: str options: LarkOptions lexer: Lexer terminals: List[TerminalDef] @@ -49,7 +63,7 @@ class Lark: *, start: Union[None, str, List[str]] = "start", parser: Literal["earley", "lalr", "cyk"] = "auto", - lexer: Union[Literal["auto", "standard", "contextual", "dynamic", "dynamic_complete"], Lexer] = "auto", + lexer: Union[Literal["auto", "standard", "contextual", "dynamic", "dynamic_complete"], Type[Lexer]] = "auto", transformer: Optional[Transformer] = None, postlex: Optional[PostLex] = None, ambiguity: Literal["explicit", "resolve"] = "resolve", @@ -62,6 +76,8 @@ class Lark: cache: Union[bool, str] = False, g_regex_flags: int = ..., use_bytes: bool = False, + import_paths: List[Union[str, Callable[[Union[None, str, PackageResource], str], Tuple[str, str]]]] = ..., + source_path: Optional[str]=None, ): ... @@ -71,6 +87,10 @@ class Lark: @classmethod def open(cls: Type[_T], grammar_filename: str, rel_to: Optional[str] = None, **options) -> _T: ... + + @classmethod + def open_from_package(cls: Type[_T], package: str, grammar_path: str, search_paths: Tuple[str, ...] = ..., **options) -> _T: + ... def lex(self, text: str) -> Iterator[Token]: ... diff --git a/lark-stubs/lexer.pyi b/lark-stubs/lexer.pyi index ae7d68a..3f246fb 100644 --- a/lark-stubs/lexer.pyi +++ b/lark-stubs/lexer.pyi @@ -1,7 +1,7 @@ # -*- coding: utf-8 -*- from types import ModuleType from typing import ( - TypeVar, Type, Tuple, List, Dict, Iterator, Collection, Callable, Optional, + TypeVar, Type, Tuple, List, Dict, Iterator, Collection, Callable, Optional, FrozenSet, Any, Pattern as REPattern, ) from abc import abstractmethod, ABC @@ -85,6 +85,9 @@ class Token(str): end_column: int end_pos: int + def __init__(self, type_: str, value: Any, pos_in_stream: int = None, line: int = None, column: int = None, end_line: int = None, end_column: int = None, end_pos: int = None): + ... + def update(self, type_: Optional[str] = None, value: Optional[str] = None) -> Token: ... 
@@ -100,10 +103,22 @@ class Lexer(ABC): lex: Callable[..., Iterator[Token]] +class LexerConf: + tokens: Collection[TerminalDef] + re_module: ModuleType + ignore: Collection[str] = () + postlex: Any =None + callbacks: Optional[Dict[str, _Callback]] = None + g_regex_flags: int = 0 + skip_validation: bool = False + use_bytes: bool = False + + + class TraditionalLexer(Lexer): terminals: Collection[TerminalDef] - ignore_types: List[str] - newline_types: List[str] + ignore_types: FrozenSet[str] + newline_types: FrozenSet[str] user_callbacks: Dict[str, _Callback] callback: Dict[str, _Callback] mres: List[Tuple[REPattern, Dict[int, str]]] @@ -111,11 +126,7 @@ class TraditionalLexer(Lexer): def __init__( self, - terminals: Collection[TerminalDef], - re_: ModuleType, - ignore: Collection[str] = ..., - user_callbacks: Dict[str, _Callback] = ..., - g_regex_flags: int = ... + conf: LexerConf ): ... @@ -128,6 +139,8 @@ class TraditionalLexer(Lexer): def lex(self, stream: str) -> Iterator[Token]: ... + def next_token(self, lex_state: Any, parser_state: Any = None) -> Token: + ... class ContextualLexer(Lexer): lexers: Dict[str, TraditionalLexer] diff --git a/lark/__init__.py b/lark/__init__.py index 5b46f1b..168a969 100644 --- a/lark/__init__.py +++ b/lark/__init__.py @@ -3,8 +3,8 @@ from .tree import Tree from .visitors import Transformer, Visitor, v_args, Discard, Transformer_NonRecursive from .visitors import InlineTransformer, inline_args # XXX Deprecated from .exceptions import (ParseError, LexError, GrammarError, UnexpectedToken, - UnexpectedInput, UnexpectedCharacters, LarkError) + UnexpectedInput, UnexpectedCharacters, UnexpectedEOF, LarkError) from .lexer import Token from .lark import Lark -__version__ = "0.10.1" +__version__ = "0.11.1" diff --git a/lark/common.py b/lark/common.py index ad6dbc2..54b33df 100644 --- a/lark/common.py +++ b/lark/common.py @@ -5,8 +5,9 @@ from .lexer import TerminalDef ###{standalone + class LexerConf(Serialize): - __serialize_fields__ = 'terminals', 'ignore', 'g_regex_flags', 'use_bytes' + __serialize_fields__ = 'terminals', 'ignore', 'g_regex_flags', 'use_bytes', 'lexer_type' __serialize_namespace__ = TerminalDef, def __init__(self, terminals, re_module, ignore=(), postlex=None, callbacks=None, g_regex_flags=0, skip_validation=False, use_bytes=False): @@ -18,19 +19,24 @@ class LexerConf(Serialize): self.re_module = re_module self.skip_validation = skip_validation self.use_bytes = use_bytes + self.lexer_type = None @property def tokens(self): warn("LexerConf.tokens is deprecated. 
Use LexerConf.terminals instead", DeprecationWarning) return self.terminals -###} -class ParserConf: + +class ParserConf(Serialize): + __serialize_fields__ = 'rules', 'start', 'parser_type' + def __init__(self, rules, callbacks, start): assert isinstance(start, list) self.rules = rules self.callbacks = callbacks self.start = start + self.parser_type = None +###} diff --git a/lark/exceptions.py b/lark/exceptions.py index 8bcc855..bf6546f 100644 --- a/lark/exceptions.py +++ b/lark/exceptions.py @@ -6,22 +6,27 @@ from .utils import STRING_TYPE, logger class LarkError(Exception): pass + +class ConfigurationError(LarkError, ValueError): + pass + + +def assert_config(value, options, msg='Got %r, expected one of %s'): + if value not in options: + raise ConfigurationError(msg % (value, options)) + + class GrammarError(LarkError): pass + class ParseError(LarkError): pass + class LexError(LarkError): pass -class UnexpectedEOF(ParseError): - def __init__(self, expected): - self.expected = expected - - message = ("Unexpected end-of-input. Expected one of: \n\t* %s\n" % '\n\t* '.join(x.name for x in self.expected)) - super(UnexpectedEOF, self).__init__(message) - class UnexpectedInput(LarkError): """UnexpectedInput Error. @@ -44,6 +49,7 @@ class UnexpectedInput(LarkError): The parser doesn't hold a copy of the text it has to parse, so you have to provide it again """ + assert self.pos_in_stream is not None, self pos = self.pos_in_stream start = max(pos - span, 0) end = pos + span @@ -88,7 +94,7 @@ class UnexpectedInput(LarkError): parse_fn(malformed) except UnexpectedInput as ut: if ut.state == self.state: - if use_accepts and ut.accepts != self.accepts: + if use_accepts and hasattr(self, 'accepts') and ut.accepts != self.accepts: logger.debug("Different accepts with same state[%d]: %s != %s at example [%s][%s]" % (self.state, self.accepts, ut.accepts, i, j)) continue @@ -105,7 +111,7 @@ class UnexpectedInput(LarkError): except AttributeError: pass - if not candidate[0]: + if candidate[0] is None: logger.debug("Same State match at example [%s][%s]" % (i, j)) candidate = label, False @@ -127,10 +133,24 @@ class UnexpectedInput(LarkError): ts = names return "Expected one of: \n\t* %s\n" % '\n\t* '.join(ts) +class UnexpectedEOF(ParseError, UnexpectedInput): + def __init__(self, expected, state=None): + self.expected = expected + self.state = state + from .lexer import Token + self.token = Token("", "") #, line=-1, column=-1, pos_in_stream=-1) + self.pos_in_stream = -1 + self.line = -1 + self.column = -1 + + message = ("Unexpected end-of-input. Expected one of: \n\t* %s\n" % '\n\t* '.join(x.name for x in self.expected)) + super(UnexpectedEOF, self).__init__(message) + class UnexpectedCharacters(LexError, UnexpectedInput): def __init__(self, seq, lex_pos, line, column, allowed=None, considered_tokens=None, state=None, token_history=None, _all_terminals=None): + # TODO considered_tokens and allowed can be figured out using state self.line = line self.column = column self.pos_in_stream = lex_pos @@ -168,7 +188,8 @@ class UnexpectedToken(ParseError, UnexpectedInput): see: :ref:`ParserPuppet`. 
""" - def __init__(self, token, expected, considered_rules=None, state=None, puppet=None, all_terminals=None): + def __init__(self, token, expected, considered_rules=None, state=None, puppet=None, all_terminals=None, token_history=None): + # TODO considered_rules and expected can be figured out using state self.line = getattr(token, 'line', '?') self.column = getattr(token, 'column', '?') self.pos_in_stream = getattr(token, 'pos_in_stream', None) @@ -179,6 +200,7 @@ class UnexpectedToken(ParseError, UnexpectedInput): self.considered_rules = considered_rules self.puppet = puppet self._all_terminals = all_terminals + self.token_history = token_history super(UnexpectedToken, self).__init__() @@ -191,6 +213,9 @@ class UnexpectedToken(ParseError, UnexpectedInput): # Be aware: Broken __str__ for Exceptions are terrible to debug. Make sure there is as little room as possible for errors message = ("Unexpected token %r at line %s, column %s.\n%s" % (self.token, self.line, self.column, self._format_terminals(self.accepts or self.expected))) + if self.token_history: + message += "Previous tokens: %r\n" % self.token_history + return message @@ -207,4 +232,6 @@ class VisitError(LarkError): message = 'Error trying to process rule "%s":\n\n%s' % (rule, orig_exc) super(VisitError, self).__init__(message) + + ###} diff --git a/lark/grammar.py b/lark/grammar.py index bb84351..405086a 100644 --- a/lark/grammar.py +++ b/lark/grammar.py @@ -40,14 +40,12 @@ class Terminal(Symbol): return '%s(%r, %r)' % (type(self).__name__, self.name, self.filter_out) - class NonTerminal(Symbol): __serialize_fields__ = 'name', is_term = False - class RuleOptions(Serialize): __serialize_fields__ = 'keep_all_tokens', 'expand1', 'priority', 'template_source', 'empty_indices' @@ -104,5 +102,4 @@ class Rule(Serialize): return self.origin == other.origin and self.expansion == other.expansion - ###} diff --git a/lark/grammars/common.lark b/lark/grammars/common.lark index a675ca4..1158026 100644 --- a/lark/grammars/common.lark +++ b/lark/grammars/common.lark @@ -1,3 +1,6 @@ +// Basic terminals for common use + + // // Numbers // @@ -21,7 +24,7 @@ SIGNED_NUMBER: ["+"|"-"] NUMBER // Strings // _STRING_INNER: /.*?/ -_STRING_ESC_INNER: _STRING_INNER /(? _STRING diff --git a/lark/grammars/python.lark b/lark/grammars/python.lark new file mode 100644 index 0000000..684193d --- /dev/null +++ b/lark/grammars/python.lark @@ -0,0 +1,19 @@ +// Python terminals + +NAME: /[a-zA-Z_]\w*/ +COMMENT: /#[^\n]*/ + +STRING : /[ubf]?r?("(?!"").*?(?' + if self.options.source_path is None: + try: + self.source_path = grammar.name + except AttributeError: + self.source_path = '' + else: + self.source_path = self.options.source_path # Drain file-like objects to get their contents try: @@ -222,29 +235,29 @@ class Lark(Serialize): grammar = read() assert isinstance(grammar, STRING_TYPE) - self.grammar_source = grammar + self.source_grammar = grammar if self.options.use_bytes: if not isascii(grammar): - raise ValueError("Grammar must be ascii only, when use_bytes=True") + raise ConfigurationError("Grammar must be ascii only, when use_bytes=True") if sys.version_info[0] == 2 and self.options.use_bytes != 'force': - raise NotImplementedError("`use_bytes=True` may have issues on python2." + raise ConfigurationError("`use_bytes=True` may have issues on python2." 
"Use `use_bytes='force'` to use it at your own risk.") cache_fn = None if self.options.cache: if self.options.parser != 'lalr': - raise NotImplementedError("cache only works with parser='lalr' for now") + raise ConfigurationError("cache only works with parser='lalr' for now") if isinstance(self.options.cache, STRING_TYPE): cache_fn = self.options.cache else: if self.options.cache is not True: - raise ValueError("cache argument must be bool or str") + raise ConfigurationError("cache argument must be bool or str") unhashable = ('transformer', 'postlex', 'lexer_callbacks', 'edit_terminals') from . import __version__ options_str = ''.join(k+str(v) for k, v in options.items() if k not in unhashable) s = grammar + options_str + __version__ md5 = hashlib.md5(s.encode()).hexdigest() - cache_fn = '.lark_cache_%s.tmp' % md5 + cache_fn = tempfile.gettempdir() + '/.lark_cache_%s.tmp' % md5 if FS.exists(cache_fn): logger.debug('Loading grammar from cache: %s', cache_fn) @@ -265,27 +278,28 @@ class Lark(Serialize): else: assert False, self.options.parser lexer = self.options.lexer - assert lexer in ('standard', 'contextual', 'dynamic', 'dynamic_complete') or issubclass(lexer, Lexer) + if isinstance(lexer, type): + assert issubclass(lexer, Lexer) # XXX Is this really important? Maybe just ensure interface compliance + else: + assert_config(lexer, ('standard', 'contextual', 'dynamic', 'dynamic_complete')) if self.options.ambiguity == 'auto': if self.options.parser == 'earley': self.options.ambiguity = 'resolve' else: - disambig_parsers = ['earley', 'cyk'] - assert self.options.parser in disambig_parsers, ( - 'Only %s supports disambiguation right now') % ', '.join(disambig_parsers) + assert_config(self.options.parser, ('earley', 'cyk'), "%r doesn't support disambiguation. Use one of these parsers instead: %s") if self.options.priority == 'auto': self.options.priority = 'normal' if self.options.priority not in _VALID_PRIORITY_OPTIONS: - raise ValueError("invalid priority option: %r. Must be one of %r" % (self.options.priority, _VALID_PRIORITY_OPTIONS)) + raise ConfigurationError("invalid priority option: %r. Must be one of %r" % (self.options.priority, _VALID_PRIORITY_OPTIONS)) assert self.options.ambiguity not in ('resolve__antiscore_sum', ), 'resolve__antiscore_sum has been replaced with the option priority="invert"' if self.options.ambiguity not in _VALID_AMBIGUITY_OPTIONS: - raise ValueError("invalid ambiguity option: %r. Must be one of %r" % (self.options.ambiguity, _VALID_AMBIGUITY_OPTIONS)) + raise ConfigurationError("invalid ambiguity option: %r. Must be one of %r" % (self.options.ambiguity, _VALID_AMBIGUITY_OPTIONS)) - # Parse the grammar file and compose the grammars (TODO) - self.grammar = load_grammar(grammar, self.source, re_module, self.options.keep_all_tokens) + # Parse the grammar file and compose the grammars + self.grammar = load_grammar(grammar, self.source_path, self.options.import_paths, self.options.keep_all_tokens) if self.options.postlex is not None: terminals_to_keep = set(self.options.postlex.always_accept) @@ -310,7 +324,7 @@ class Lark(Serialize): # Else, if the user asked to disable priorities, strip them from the # rules. This allows the Earley parsers to skip an extra forest walk # for improved performance, if you don't need them (or didn't specify any). 
- elif self.options.priority == None: + elif self.options.priority is None: for rule in self.rules: if rule.options.priority is not None: rule.options.priority = None @@ -350,7 +364,7 @@ class Lark(Serialize): self.rules, self.options.tree_class or Tree, self.options.propagate_positions, - self.options.parser!='lalr' and self.options.ambiguity=='explicit', + self.options.parser != 'lalr' and self.options.ambiguity == 'explicit', self.options.maybe_placeholders ) self._callbacks = self._parse_tree_builder.create_callback(self.options.transformer) @@ -389,18 +403,18 @@ class Lark(Serialize): memo = SerializeMemoizer.deserialize(memo, {'Rule': Rule, 'TerminalDef': TerminalDef}, {}) options = dict(data['options']) if (set(kwargs) - _LOAD_ALLOWED_OPTIONS) & set(LarkOptions._defaults): - raise ValueError("Some options are not allowed when loading a Parser: {}" + raise ConfigurationError("Some options are not allowed when loading a Parser: {}" .format(set(kwargs) - _LOAD_ALLOWED_OPTIONS)) options.update(kwargs) self.options = LarkOptions.deserialize(options, memo) self.rules = [Rule.deserialize(r, memo) for r in data['rules']] - self.source = '' + self.source_path = '' self._prepare_callbacks() self.parser = self.parser_class.deserialize( data['parser'], memo, self._callbacks, - self.options, # Not all, but multiple attributes are used + self.options, # Not all, but multiple attributes are used ) self.terminals = self.parser.lexer_conf.terminals self._terminals_dict = {t.name: t for t in self.terminals} @@ -429,8 +443,26 @@ class Lark(Serialize): with open(grammar_filename, encoding='utf8') as f: return cls(f, **options) + @classmethod + def open_from_package(cls, package, grammar_path, search_paths=("",), **options): + """Create an instance of Lark with the grammar loaded from within the package `package`. + This allows grammar loading from zipapps. + + Imports in the grammar will use the `package` and `search_paths` provided, through `FromPackageLoader` + + Example: + + Lark.open_from_package(__name__, "example.lark", ("grammars",), parser=...) 
+ """ + package = FromPackageLoader(package, search_paths) + full_path, text = package(None, grammar_path) + options.setdefault('source_path', full_path) + options.setdefault('import_paths', []) + options['import_paths'].append(package) + return cls(text, **options) + def __repr__(self): - return 'Lark(open(%r), parser=%r, lexer=%r, ...)' % (self.source, self.options.parser, self.options.lexer) + return 'Lark(open(%r), parser=%r, lexer=%r, ...)' % (self.source_path, self.options.parser, self.options.lexer) def lex(self, text): @@ -490,5 +522,23 @@ class Lark(Serialize): except UnexpectedCharacters as e2: e = e2 + @property + def source(self): + warn("Lark.source attribute has been renamed to Lark.source_path", DeprecationWarning) + return self.source_path + + @source.setter + def source(self, value): + self.source_path = value + + @property + def grammar_source(self): + warn("Lark.grammar_source attribute has been renamed to Lark.source_grammar", DeprecationWarning) + return self.source_grammar + + @grammar_source.setter + def grammar_source(self, value): + self.source_grammar = value + ###} diff --git a/lark/lexer.py b/lark/lexer.py index 3a3a42e..43176ac 100644 --- a/lark/lexer.py +++ b/lark/lexer.py @@ -1,4 +1,4 @@ -## Lexer Implementation +# Lexer Implementation import re @@ -8,6 +8,7 @@ from .exceptions import UnexpectedCharacters, LexError, UnexpectedToken ###{standalone from copy import copy + class Pattern(Serialize): def __init__(self, value, flags=(), raw=None): @@ -21,6 +22,7 @@ class Pattern(Serialize): # Pattern Hashing assumes all subclasses have a different priority! def __hash__(self): return hash((type(self), self.value, self.flags)) + def __eq__(self, other): return type(self) == type(other) and self.value == other.value and self.flags == other.flags @@ -54,6 +56,7 @@ class PatternStr(Pattern): return len(self.value) max_width = min_width + class PatternRE(Pattern): __serialize_fields__ = 'value', 'flags', '_width' @@ -71,6 +74,7 @@ class PatternRE(Pattern): @property def min_width(self): return self._get_width()[0] + @property def max_width(self): return self._get_width()[1] @@ -141,7 +145,7 @@ class Token(Str): return cls(type_, value, borrow_t.pos_in_stream, borrow_t.line, borrow_t.column, borrow_t.end_line, borrow_t.end_column, borrow_t.end_pos) def __reduce__(self): - return (self.__class__, (self.type, self.value, self.pos_in_stream, self.line, self.column, )) + return (self.__class__, (self.type, self.value, self.pos_in_stream, self.line, self.column)) def __repr__(self): return 'Token(%r, %r)' % (self.type, self.value) @@ -195,6 +199,7 @@ class UnlessCallback: break return t + class CallChain: def __init__(self, callback1, callback2, cond): self.callback1 = callback1 @@ -206,16 +211,13 @@ class CallChain: return self.callback2(t) if self.cond(t2) else t2 - - - def _create_unless(terminals, g_regex_flags, re_, use_bytes): tokens_by_type = classify(terminals, lambda t: type(t.pattern)) assert len(tokens_by_type) <= 2, tokens_by_type.keys() embedded_strs = set() callback = {} for retok in tokens_by_type.get(PatternRE, []): - unless = [] # {} + unless = [] for strtok in tokens_by_type.get(PatternStr, []): if strtok.priority > retok.priority: continue @@ -247,13 +249,15 @@ def _build_mres(terminals, max_size, g_regex_flags, match_whole, re_, use_bytes) except AssertionError: # Yes, this is what Python provides us.. 
:/ return _build_mres(terminals, max_size//2, g_regex_flags, match_whole, re_, use_bytes) - mres.append((mre, {i:n for n,i in mre.groupindex.items()} )) + mres.append((mre, {i: n for n, i in mre.groupindex.items()})) terminals = terminals[max_size:] return mres + def build_mres(terminals, g_regex_flags, re_, use_bytes, match_whole=False): return _build_mres(terminals, len(terminals), g_regex_flags, match_whole, re_, use_bytes) + def _regexp_has_newline(r): r"""Expressions that may indicate newlines in a regexp: - newlines (\n) @@ -264,6 +268,7 @@ def _regexp_has_newline(r): """ return '\n' in r or '\\n' in r or '\\s' in r or '[^' in r or ('(?s' in r and '.' in r) + class Lexer(object): """Lexer interface @@ -302,7 +307,7 @@ class TraditionalLexer(Lexer): self.newline_types = frozenset(t.name for t in terminals if _regexp_has_newline(t.pattern.to_regexp())) self.ignore_types = frozenset(conf.ignore) - terminals.sort(key=lambda x:(-x.priority, -x.pattern.max_width, -len(x.pattern.value), x.name)) + terminals.sort(key=lambda x: (-x.priority, -x.pattern.max_width, -len(x.pattern.value), x.name)) self.terminals = terminals self.user_callbacks = conf.callbacks self.g_regex_flags = conf.g_regex_flags @@ -311,7 +316,7 @@ class TraditionalLexer(Lexer): self._mres = None def _build(self): - terminals, self.callback = _create_unless(self.terminals, self.g_regex_flags, re_=self.re, use_bytes=self.use_bytes) + terminals, self.callback = _create_unless(self.terminals, self.g_regex_flags, self.re, self.use_bytes) assert all(self.callback.values()) for type_, f in self.user_callbacks.items(): @@ -338,9 +343,9 @@ class TraditionalLexer(Lexer): def lex(self, state, parser_state): with suppress(EOFError): while True: - yield self.next_token(state) + yield self.next_token(state, parser_state) - def next_token(self, lex_state): + def next_token(self, lex_state, parser_state=None): line_ctr = lex_state.line_ctr while line_ctr.char_pos < len(lex_state.text): res = self.match(lex_state.text, line_ctr.char_pos) @@ -350,7 +355,7 @@ class TraditionalLexer(Lexer): allowed = {""} raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column, allowed=allowed, token_history=lex_state.last_token and [lex_state.last_token], - _all_terminals=self.terminals) + state=parser_state, _all_terminals=self.terminals) value, type_ = res @@ -363,7 +368,7 @@ class TraditionalLexer(Lexer): if t.type in self.callback: t = self.callback[t.type](t) if not isinstance(t, Token): - raise ValueError("Callbacks must return a token (returned %r)" % t) + raise LexError("Callbacks must return a token (returned %r)" % t) lex_state.last_token = t return t else: @@ -375,6 +380,7 @@ class TraditionalLexer(Lexer): # EOF raise EOFError(self) + class LexerState: __slots__ = 'text', 'line_ctr', 'last_token' @@ -386,6 +392,7 @@ class LexerState: def __copy__(self): return type(self)(self.text, copy(self.line_ctr), self.last_token) + class ContextualLexer(Lexer): def __init__(self, conf, states, always_accept=()): @@ -424,17 +431,17 @@ class ContextualLexer(Lexer): try: while True: lexer = self.lexers[parser_state.position] - yield lexer.next_token(lexer_state) + yield lexer.next_token(lexer_state, parser_state) except EOFError: pass except UnexpectedCharacters as e: # In the contextual lexer, UnexpectedCharacters can mean that the terminal is defined, but not in the current context. # This tests the input against the global context, to provide a nicer error. 
token = self.root_lexer.next_token(lexer_state) - raise UnexpectedToken(token, e.allowed, state=parser_state.position, all_terminals=self.root_lexer.terminals) + raise UnexpectedToken(token, e.allowed, state=parser_state.position, token_history=[lexer_state.last_token], all_terminals=self.root_lexer.terminals) class LexerThread: - "A thread that ties a lexer instance and a lexer state, to be used by the parser" + """A thread that ties a lexer instance and a lexer state, to be used by the parser""" def __init__(self, lexer, text): self.lexer = lexer diff --git a/lark/load_grammar.py b/lark/load_grammar.py index 413f921..9f6bf2e 100644 --- a/lark/load_grammar.py +++ b/lark/load_grammar.py @@ -1,15 +1,17 @@ -"Parses and creates Grammar objects" +"""Parses and creates Grammar objects""" import os.path import sys from copy import copy, deepcopy from io import open +import pkgutil +from ast import literal_eval -from .utils import bfs, eval_escaping, Py36, logger, classify_bool, isascii +from .utils import bfs, Py36, logger, classify_bool from .lexer import Token, TerminalDef, PatternStr, PatternRE from .parse_tree_builder import ParseTreeBuilder -from .parser_frontends import LALR_TraditionalLexer +from .parser_frontends import ParsingFrontend from .common import LexerConf, ParserConf from .grammar import RuleOptions, Rule, Terminal, NonTerminal, Symbol from .utils import classify, suppress, dedup_list, Str @@ -20,7 +22,7 @@ from .visitors import Transformer, Visitor, v_args, Transformer_InPlace, Transfo inline_args = v_args(inline=True) __path__ = os.path.dirname(__file__) -IMPORT_PATHS = [os.path.join(__path__, 'grammars')] +IMPORT_PATHS = ['grammars'] EXT = '.lark' @@ -165,6 +167,7 @@ RULES = { 'literal': ['REGEXP', 'STRING'], } + @inline_args class EBNF_to_BNF(Transformer_InPlace): def __init__(self): @@ -258,9 +261,9 @@ class SimplifyRule_Visitor(Visitor): for i, child in enumerate(tree.children): if isinstance(child, Tree) and child.data == 'expansions': tree.data = 'expansions' - tree.children = [self.visit(ST('expansion', [option if i==j else other - for j, other in enumerate(tree.children)])) - for option in dedup_list(child.children)] + tree.children = [self.visit(ST('expansion', [option if i == j else other + for j, other in enumerate(tree.children)])) + for option in dedup_list(child.children)] self._flatten(tree) break @@ -283,8 +286,10 @@ class SimplifyRule_Visitor(Visitor): class RuleTreeToText(Transformer): def expansions(self, x): return x + def expansion(self, symbols): return symbols, None + def alias(self, x): (expansion, _alias), alias = x assert _alias is None, (alias, expansion, '-', _alias) # Double alias not allowed @@ -299,8 +304,9 @@ class CanonizeTree(Transformer_InPlace): tokenmods, value = args return tokenmods + [value] + class PrepareAnonTerminals(Transformer_InPlace): - "Create a unique list of anonymous terminals. Attempt to give meaningful names to them when we add them" + """Create a unique list of anonymous terminals. 
Attempt to give meaningful names to them when we add them""" def __init__(self, terminals): self.terminals = terminals @@ -309,7 +315,6 @@ class PrepareAnonTerminals(Transformer_InPlace): self.i = 0 self.rule_options = None - @inline_args def pattern(self, p): value = p.value @@ -328,14 +333,16 @@ class PrepareAnonTerminals(Transformer_InPlace): try: term_name = _TERMINAL_NAMES[value] except KeyError: - if value.isalnum() and value[0].isalpha() and value.upper() not in self.term_set and isascii(value): - term_name = value.upper() + if value.isalnum() and value[0].isalpha() and value.upper() not in self.term_set: + with suppress(UnicodeEncodeError): + value.upper().encode('ascii') # Make sure we don't have unicode in our terminal names + term_name = value.upper() if term_name in self.term_set: term_name = None elif isinstance(p, PatternRE): - if p in self.term_reverse: # Kind of a weird placement.name + if p in self.term_reverse: # Kind of a weird placement.name term_name = self.term_reverse[p].name else: assert False, p @@ -357,7 +364,7 @@ class PrepareAnonTerminals(Transformer_InPlace): class _ReplaceSymbols(Transformer_InPlace): - " Helper for ApplyTemplates " + """Helper for ApplyTemplates""" def __init__(self): self.names = {} @@ -372,8 +379,9 @@ class _ReplaceSymbols(Transformer_InPlace): return self.__default__('template_usage', [self.names[c[0]].name] + c[1:], None) return self.__default__('template_usage', c, None) + class ApplyTemplates(Transformer_InPlace): - " Apply the templates, creating new rules that represent the used templates " + """Apply the templates, creating new rules that represent the used templates""" def __init__(self, rule_defs): self.rule_defs = rule_defs @@ -399,6 +407,30 @@ def _rfind(s, choices): return max(s.rfind(c) for c in choices) +def eval_escaping(s): + w = '' + i = iter(s) + for n in i: + w += n + if n == '\\': + try: + n2 = next(i) + except StopIteration: + raise GrammarError("Literal ended unexpectedly (bad escaping): `%r`" % s) + if n2 == '\\': + w += '\\\\' + elif n2 not in 'uxnftr': + w += '\\' + w += n2 + w = w.replace('\\"', '"').replace("'", "\\'") + + to_eval = "u'''%s'''" % w + try: + s = literal_eval(to_eval) + except SyntaxError as e: + raise GrammarError(s, e) + + return s def _literal_to_pattern(literal): @@ -439,7 +471,7 @@ class PrepareLiterals(Transformer_InPlace): assert start.type == end.type == 'STRING' start = start.value[1:-1] end = end.value[1:-1] - assert len(eval_escaping(start)) == len(eval_escaping(end)) == 1, (start, end, len(eval_escaping(start)), len(eval_escaping(end))) + assert len(eval_escaping(start)) == len(eval_escaping(end)) == 1 regexp = '[%s-%s]' % (start, end) return ST('pattern', [PatternRE(regexp)]) @@ -458,6 +490,7 @@ def _make_joined_pattern(regexp, flags_set): return PatternRE(regexp, flags) + class TerminalTreeToPattern(Transformer): def pattern(self, ps): p ,= ps @@ -501,6 +534,7 @@ class TerminalTreeToPattern(Transformer): def value(self, v): return v[0] + class PrepareSymbols(Transformer_InPlace): def value(self, v): v ,= v @@ -512,13 +546,16 @@ class PrepareSymbols(Transformer_InPlace): return Terminal(Str(v.value), filter_out=v.startswith('_')) assert False + def _choice_of_rules(rules): return ST('expansions', [ST('expansion', [Token('RULE', name)]) for name in rules]) + def nr_deepcopy_tree(t): - "Deepcopy tree `t` without recursion" + """Deepcopy tree `t` without recursion""" return Transformer_NonRecursive(False).transform(t) + class Grammar: def __init__(self, rule_defs, term_defs, ignore): 
self.term_defs = term_defs @@ -545,7 +582,7 @@ class Grammar: raise GrammarError("Terminals cannot be empty (%s)" % name) transformer = PrepareLiterals() * TerminalTreeToPattern() - terminals = [TerminalDef(name, transformer.transform( term_tree ), priority) + terminals = [TerminalDef(name, transformer.transform(term_tree), priority) for name, (term_tree, priority) in term_defs if term_tree] # ================= @@ -564,10 +601,10 @@ class Grammar: ebnf_to_bnf = EBNF_to_BNF() rules = [] i = 0 - while i < len(rule_defs): # We have to do it like this because rule_defs might grow due to templates + while i < len(rule_defs): # We have to do it like this because rule_defs might grow due to templates name, params, rule_tree, options = rule_defs[i] i += 1 - if len(params) != 0: # Dont transform templates + if len(params) != 0: # Dont transform templates continue rule_options = RuleOptions(keep_all_tokens=True) if options and options.keep_all_tokens else None ebnf_to_bnf.rule_options = rule_options @@ -592,7 +629,7 @@ class Grammar: for i, (expansion, alias) in enumerate(expansions): if alias and name.startswith('_'): - raise GrammarError("Rule %s is marked for expansion (it starts with an underscore) and isn't allowed to have aliases (alias=%s)" % (name, alias)) + raise GrammarError("Rule %s is marked for expansion (it starts with an underscore) and isn't allowed to have aliases (alias=%s)"% (name, alias)) empty_indices = [x==_EMPTY for x in expansion] if any(empty_indices): @@ -621,14 +658,13 @@ class Grammar: # Remove duplicates compiled_rules = list(set(compiled_rules)) - # Filter out unused rules while True: c = len(compiled_rules) used_rules = {s for r in compiled_rules - for s in r.expansion - if isinstance(s, NonTerminal) - and s != r.origin} + for s in r.expansion + if isinstance(s, NonTerminal) + and s != r.origin} used_rules |= {NonTerminal(s) for s in start} compiled_rules, unused = classify_bool(compiled_rules, lambda r: r.origin in used_rules) for r in unused: @@ -647,9 +683,63 @@ class Grammar: return terminals, compiled_rules, self.ignore +class PackageResource(object): + """ + Represents a path inside a Package. Used by `FromPackageLoader` + """ + def __init__(self, pkg_name, path): + self.pkg_name = pkg_name + self.path = path + + def __str__(self): + return "<%s: %s>" % (self.pkg_name, self.path) + + def __repr__(self): + return "%s(%r, %r)" % (type(self).__name__, self.pkg_name, self.path) + + +class FromPackageLoader(object): + """ + Provides a simple way of creating custom import loaders that load from packages via ``pkgutil.get_data`` instead of using `open`. + This allows them to be compatible even from within zip files. + + Relative imports are handled, so you can just freely use them. + + pkg_name: The name of the package. You can probably provide `__name__` most of the time + search_paths: All the path that will be search on absolute imports. + """ + def __init__(self, pkg_name, search_paths=("", )): + self.pkg_name = pkg_name + self.search_paths = search_paths + + def __repr__(self): + return "%s(%r, %r)" % (type(self).__name__, self.pkg_name, self.search_paths) + + def __call__(self, base_path, grammar_path): + if base_path is None: + to_try = self.search_paths + else: + # Check whether or not the importing grammar was loaded by this module. 
+ if not isinstance(base_path, PackageResource) or base_path.pkg_name != self.pkg_name: + # Technically false, but FileNotFound doesn't exist in python2.7, and this message should never reach the end user anyway + raise IOError() + to_try = [base_path.path] + for path in to_try: + full_path = os.path.join(path, grammar_path) + try: + text = pkgutil.get_data(self.pkg_name, full_path) + except IOError: + continue + else: + return PackageResource(self.pkg_name, full_path), text.decode() + raise IOError() + + +stdlib_loader = FromPackageLoader('lark', IMPORT_PATHS) _imported_grammars = {} + def import_from_grammar_into_namespace(grammar, namespace, aliases): """Returns all rules and terminals of grammar, prepended with a 'namespace' prefix, except for those which are aliased. @@ -670,8 +760,6 @@ def import_from_grammar_into_namespace(grammar, namespace, aliases): raise GrammarError("Missing symbol '%s' in grammar %s" % (symbol, namespace)) return _find_used_symbols(tree) - set(params) - - def get_namespace_name(name, params): if params is not None: try: @@ -692,19 +780,17 @@ def import_from_grammar_into_namespace(grammar, namespace, aliases): else: assert symbol.type == 'RULE' _, params, tree, options = imported_rules[symbol] - params_map = {p: ('%s__%s' if p[0]!='_' else '_%s__%s' ) % (namespace, p) for p in params} + params_map = {p: ('%s__%s' if p[0]!='_' else '_%s__%s') % (namespace, p) for p in params} for t in tree.iter_subtrees(): for i, c in enumerate(t.children): if isinstance(c, Token) and c.type in ('RULE', 'TERMINAL'): t.children[i] = Token(c.type, get_namespace_name(c, params_map)) - params = [params_map[p] for p in params] # We can not rely on ordered dictionaries + params = [params_map[p] for p in params] # We can not rely on ordered dictionaries rule_defs.append((get_namespace_name(symbol, params_map), params, tree, options)) - return term_defs, rule_defs - def resolve_term_references(term_defs): # TODO Solve with transitive closure (maybe) @@ -744,7 +830,7 @@ def options_from_rule(name, params, *x): else: expansions ,= x priority = None - params = [t.value for t in params.children] if params is not None else [] # For the grammar parser + params = [t.value for t in params.children] if params is not None else [] # For the grammar parser keep_all_tokens = name.startswith('!') name = name.lstrip('!') @@ -758,10 +844,12 @@ def options_from_rule(name, params, *x): def symbols_from_strcase(expansion): return [Terminal(x, filter_out=x.startswith('_')) if x.isupper() else NonTerminal(x) for x in expansion] + @inline_args class PrepareGrammar(Transformer_InPlace): def terminal(self, name): return name + def nonterminal(self, name): return name @@ -771,10 +859,11 @@ def _find_used_symbols(tree): return {t for x in tree.find_data('expansion') for t in x.scan_values(lambda t: t.type in ('RULE', 'TERMINAL'))} + class GrammarLoader: ERRORS = [ ('Unclosed parenthesis', ['a: (\n']), - ('Umatched closing parenthesis', ['a: )\n', 'a: [)\n', 'a: (]\n']), + ('Unmatched closing parenthesis', ['a: )\n', 'a: [)\n', 'a: (]\n']), ('Expecting rule or terminal definition (missing colon)', ['a\n', 'A\n', 'a->\n', 'A->\n', 'a A\n']), ('Illegal name for rules or terminals', ['Aa:\n']), ('Alias expects lowercase name', ['a: -> "a"\n']), @@ -786,43 +875,53 @@ class GrammarLoader: ('%ignore expects a value', ['%ignore %import\n']), ] - def __init__(self, re_module, global_keep_all_tokens): + def __init__(self, global_keep_all_tokens): terminals = [TerminalDef(name, PatternRE(value)) for name, value in 
TERMINALS.items()] - rules = [options_from_rule(name, None, x) for name, x in RULES.items()] - rules = [Rule(NonTerminal(r), symbols_from_strcase(x.split()), i, None, o) for r, _p, xs, o in rules for i, x in enumerate(xs)] + rules = [options_from_rule(name, None, x) for name, x in RULES.items()] + rules = [Rule(NonTerminal(r), symbols_from_strcase(x.split()), i, None, o) + for r, _p, xs, o in rules for i, x in enumerate(xs)] callback = ParseTreeBuilder(rules, ST).create_callback() - lexer_conf = LexerConf(terminals, re_module, ['WS', 'COMMENT']) - + import re + lexer_conf = LexerConf(terminals, re, ['WS', 'COMMENT']) parser_conf = ParserConf(rules, callback, ['start']) - self.parser = LALR_TraditionalLexer(lexer_conf, parser_conf) + lexer_conf.lexer_type = 'standard' + parser_conf.parser_type = 'lalr' + self.parser = ParsingFrontend(lexer_conf, parser_conf, {}) self.canonize_tree = CanonizeTree() - self.re_module = re_module self.global_keep_all_tokens = global_keep_all_tokens - def import_grammar(self, grammar_path, base_paths=[]): + def import_grammar(self, grammar_path, base_path=None, import_paths=[]): if grammar_path not in _imported_grammars: - import_paths = base_paths + IMPORT_PATHS - for import_path in import_paths: - with suppress(IOError): - joined_path = os.path.join(import_path, grammar_path) - with open(joined_path, encoding='utf8') as f: - text = f.read() - grammar = self.load_grammar(text, joined_path) + # import_paths take priority over base_path since they should handle relative imports and ignore everything else. + to_try = import_paths + ([base_path] if base_path is not None else []) + [stdlib_loader] + for source in to_try: + try: + if callable(source): + joined_path, text = source(base_path, grammar_path) + else: + joined_path = os.path.join(source, grammar_path) + with open(joined_path, encoding='utf8') as f: + text = f.read() + except IOError: + continue + else: + grammar = self.load_grammar(text, joined_path, import_paths) _imported_grammars[grammar_path] = grammar break else: - open(grammar_path, encoding='utf8') # Force a file not found error + # Search failed. Make Python throw a nice error. + open(grammar_path, encoding='utf8') assert False return _imported_grammars[grammar_path] - def load_grammar(self, grammar_text, grammar_name=''): - "Parse grammar_text, verify, and create Grammar object. Display nice messages on error." + def load_grammar(self, grammar_text, grammar_name='', import_paths=[]): + """Parse grammar_text, verify, and create Grammar object. 
Display nice messages on error.""" try: - tree = self.canonize_tree.transform( self.parser.parse(grammar_text+'\n') ) + tree = self.canonize_tree.transform(self.parser.parse(grammar_text+'\n')) except UnexpectedCharacters as e: context = e.get_context(grammar_text) raise GrammarError("Unexpected input at line %d column %d in %s: \n\n%s" % @@ -872,7 +971,7 @@ class GrammarLoader: aliases = {name: arg1 or name} # Aliases if exist if path_node.data == 'import_lib': # Import from library - base_paths = [] + base_path = None else: # Relative import if grammar_name == '': # Import relative to script file path if grammar is coded in script try: @@ -882,16 +981,19 @@ class GrammarLoader: else: base_file = grammar_name # Import relative to grammar file path if external grammar file if base_file: - base_paths = [os.path.split(base_file)[0]] + if isinstance(base_file, PackageResource): + base_path = PackageResource(base_file.pkg_name, os.path.split(base_file.path)[0]) + else: + base_path = os.path.split(base_file)[0] else: - base_paths = [os.path.abspath(os.path.curdir)] + base_path = os.path.abspath(os.path.curdir) try: - import_base_paths, import_aliases = imports[dotted_path] - assert base_paths == import_base_paths, 'Inconsistent base_paths for %s.' % '.'.join(dotted_path) + import_base_path, import_aliases = imports[dotted_path] + assert base_path == import_base_path, 'Inconsistent base_path for %s.' % '.'.join(dotted_path) import_aliases.update(aliases) except KeyError: - imports[dotted_path] = base_paths, aliases + imports[dotted_path] = base_path, aliases elif stmt.data == 'declare': for t in stmt.children: @@ -900,9 +1002,9 @@ class GrammarLoader: assert False, stmt # import grammars - for dotted_path, (base_paths, aliases) in imports.items(): + for dotted_path, (base_path, aliases) in imports.items(): grammar_path = os.path.join(*dotted_path) + EXT - g = self.import_grammar(grammar_path, base_paths=base_paths) + g = self.import_grammar(grammar_path, base_path=base_path, import_paths=import_paths) new_td, new_rd = import_from_grammar_into_namespace(g, '__'.join(dotted_path), aliases) term_defs += new_td @@ -972,7 +1074,7 @@ class GrammarLoader: raise GrammarError("Template '%s' used but not defined (in rule %s)" % (sym, name)) if len(args) != rule_names[sym]: raise GrammarError("Wrong number of template arguments used for %s " - "(expected %s, got %s) (in rule %s)"%(sym, rule_names[sym], len(args), name)) + "(expected %s, got %s) (in rule %s)" % (sym, rule_names[sym], len(args), name)) for sym in _find_used_symbols(expansions): if sym.type == 'TERMINAL': if sym not in terminal_names: @@ -981,10 +1083,8 @@ class GrammarLoader: if sym not in rule_names and sym not in params: raise GrammarError("Rule '%s' used but not defined (in rule %s)" % (sym, name)) - return Grammar(rules, term_defs, ignore_names) - -def load_grammar(grammar, source, re_, global_keep_all_tokens): - return GrammarLoader(re_, global_keep_all_tokens).load_grammar(grammar, source) +def load_grammar(grammar, source, import_paths, global_keep_all_tokens): + return GrammarLoader(global_keep_all_tokens).load_grammar(grammar, source, import_paths) diff --git a/lark/parse_tree_builder.py b/lark/parse_tree_builder.py index a4c4330..569761a 100644 --- a/lark/parse_tree_builder.py +++ b/lark/parse_tree_builder.py @@ -1,7 +1,7 @@ from .exceptions import GrammarError from .lexer import Token from .tree import Tree -from .visitors import InlineTransformer # XXX Deprecated +from .visitors import InlineTransformer # XXX Deprecated from 
.visitors import Transformer_InPlace from .visitors import _vargs_meta, _vargs_meta_inline @@ -20,6 +20,7 @@ class ExpandSingleChild: else: return self.node_builder(children) + class PropagatePositions: def __init__(self, node_builder): self.node_builder = node_builder @@ -87,8 +88,9 @@ class ChildFilter: return self.node_builder(filtered) + class ChildFilterLALR(ChildFilter): - "Optimized childfilter for LALR (assumes no duplication in parse tree, so it's safe to change it)" + """Optimized childfilter for LALR (assumes no duplication in parse tree, so it's safe to change it)""" def __call__(self, children): filtered = [] @@ -108,6 +110,7 @@ class ChildFilterLALR(ChildFilter): return self.node_builder(filtered) + class ChildFilterLALR_NoPlaceholders(ChildFilter): "Optimized childfilter for LALR (assumes no duplication in parse tree, so it's safe to change it)" def __init__(self, to_include, node_builder): @@ -126,9 +129,11 @@ class ChildFilterLALR_NoPlaceholders(ChildFilter): filtered.append(children[i]) return self.node_builder(filtered) + def _should_expand(sym): return not sym.is_term and sym.name.startswith('_') + def maybe_create_child_filter(expansion, keep_all_tokens, ambiguous, _empty_indices): # Prepare empty_indices as: How many Nones to insert at each index? if _empty_indices: @@ -156,21 +161,22 @@ def maybe_create_child_filter(expansion, keep_all_tokens, ambiguous, _empty_indi # LALR without placeholders return partial(ChildFilterLALR_NoPlaceholders, [(i, x) for i,x,_ in to_include]) + class AmbiguousExpander: """Deal with the case where we're expanding children ('_rule') into a parent but the children are ambiguous. i.e. (parent->_ambig->_expand_this_rule). In this case, make the parent itself ambiguous with as many copies as their are ambiguous children, and then copy the ambiguous children - into the right parents in the right places, essentially shifting the ambiguiuty up the tree.""" + into the right parents in the right places, essentially shifting the ambiguity up the tree.""" def __init__(self, to_expand, tree_class, node_builder): self.node_builder = node_builder self.tree_class = tree_class self.to_expand = to_expand def __call__(self, children): - def _is_ambig_tree(child): - return hasattr(child, 'data') and child.data == '_ambig' + def _is_ambig_tree(t): + return hasattr(t, 'data') and t.data == '_ambig' - #### When we're repeatedly expanding ambiguities we can end up with nested ambiguities. + # -- When we're repeatedly expanding ambiguities we can end up with nested ambiguities. # All children of an _ambig node should be a derivation of that ambig node, hence # it is safe to assume that if we see an _ambig node nested within an ambig node # it is safe to simply expand it into the parent _ambig node as an alternative derivation. 
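
Aside on the `load_grammar.py` changes above: the new `FromPackageLoader`, together with the `import_paths` option now threaded through `GrammarLoader.import_grammar`, resolves `%import` statements via `pkgutil.get_data`, so grammars can be loaded from installed packages, including zipped ones. A minimal usage sketch, mirroring `test_import_custom_sources` near the end of this diff; the `tests`/`grammars` layout is the one shipped with the test suite, so substitute your own package and directory in practice:

```python
from lark import Lark
from lark.load_grammar import FromPackageLoader

# Resolve %import statements inside the installed 'tests' package,
# searching its 'grammars/' directory (as the new test does).
custom_loader = FromPackageLoader('tests', ('grammars', ))

parser = Lark("""
    start: startab

    %import ab.startab
""", import_paths=[custom_loader])

tree = parser.parse('ab')
# -> Tree('start', [Tree('startab', [Tree('ab__expr', [Token('ab__A', 'a'), Token('ab__B', 'b')])])])
```
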
@@ -186,15 +192,17 @@ class AmbiguousExpander: if not ambiguous: return self.node_builder(children) - expand = [ iter(child.children) if i in ambiguous else repeat(child) for i, child in enumerate(children) ] + expand = [iter(child.children) if i in ambiguous else repeat(child) for i, child in enumerate(children)] return self.tree_class('_ambig', [self.node_builder(list(f[0])) for f in product(zip(*expand))]) + def maybe_create_ambiguous_expander(tree_class, expansion, keep_all_tokens): to_expand = [i for i, sym in enumerate(expansion) if keep_all_tokens or ((not (sym.is_term and sym.filter_out)) and _should_expand(sym))] if to_expand: return partial(AmbiguousExpander, to_expand, tree_class) + class AmbiguousIntermediateExpander: """ Propagate ambiguous intermediate nodes and their derivations up to the @@ -275,12 +283,14 @@ class AmbiguousIntermediateExpander: return self.node_builder(children) + def ptb_inline_args(func): @wraps(func) def f(children): return func(*children) return f + def inplace_transformer(func): @wraps(func) def f(children): @@ -289,9 +299,11 @@ def inplace_transformer(func): return func(tree) return f + def apply_visit_wrapper(func, name, wrapper): if wrapper is _vargs_meta or wrapper is _vargs_meta_inline: raise NotImplementedError("Meta args not supported for internal transformer") + @wraps(func) def f(children): return wrapper(func, name, children, None) @@ -323,7 +335,6 @@ class ParseTreeBuilder: yield rule, wrapper_chain - def create_callback(self, transformer=None): callbacks = {} diff --git a/lark/parser_frontends.py b/lark/parser_frontends.py index 739f9b5..0fab159 100644 --- a/lark/parser_frontends.py +++ b/lark/parser_frontends.py @@ -1,12 +1,11 @@ +from .exceptions import ConfigurationError, GrammarError, assert_config from .utils import get_regexp_width, Serialize from .parsers.grammar_analysis import GrammarAnalyzer from .lexer import LexerThread, TraditionalLexer, ContextualLexer, Lexer, Token, TerminalDef from .parsers import earley, xearley, cyk from .parsers.lalr_parser import LALR_Parser -from .grammar import Rule from .tree import Tree -from .common import LexerConf -from .exceptions import UnexpectedInput +from .common import LexerConf, ParserConf try: import regex except ImportError: @@ -15,63 +14,118 @@ import re ###{standalone -def get_frontend(parser, lexer): - if parser=='lalr': - if lexer is None: - raise ValueError('The LALR parser requires use of a lexer') - elif lexer == 'standard': - return LALR_TraditionalLexer - elif lexer == 'contextual': - return LALR_ContextualLexer - elif issubclass(lexer, Lexer): - class CustomLexerWrapper(Lexer): - def __init__(self, lexer_conf): - self.lexer = lexer(lexer_conf) - def lex(self, lexer_state, parser_state): - return self.lexer.lex(lexer_state.text) - - class LALR_CustomLexerWrapper(LALR_CustomLexer): - def __init__(self, lexer_conf, parser_conf, options=None): - super(LALR_CustomLexerWrapper, self).__init__( - lexer, lexer_conf, parser_conf, options=options) - def init_lexer(self): - future_interface = getattr(lexer, '__future_interface__', False) - if future_interface: - self.lexer = lexer(self.lexer_conf) - else: - self.lexer = CustomLexerWrapper(self.lexer_conf) - - return LALR_CustomLexerWrapper - else: - raise ValueError('Unknown lexer: %s' % lexer) - elif parser=='earley': - if lexer=='standard': - return Earley - elif lexer=='dynamic': - return XEarley - elif lexer=='dynamic_complete': - return XEarley_CompleteLex - elif lexer=='contextual': - raise ValueError('The Earley parser does not 
support the contextual parser') +def _wrap_lexer(lexer_class): + future_interface = getattr(lexer_class, '__future_interface__', False) + if future_interface: + return lexer_class + else: + class CustomLexerWrapper(Lexer): + def __init__(self, lexer_conf): + self.lexer = lexer_class(lexer_conf) + def lex(self, lexer_state, parser_state): + return self.lexer.lex(lexer_state.text) + return CustomLexerWrapper + + +class MakeParsingFrontend: + def __init__(self, parser_type, lexer_type): + self.parser_type = parser_type + self.lexer_type = lexer_type + + def __call__(self, lexer_conf, parser_conf, options): + assert isinstance(lexer_conf, LexerConf) + assert isinstance(parser_conf, ParserConf) + parser_conf.parser_type = self.parser_type + lexer_conf.lexer_type = self.lexer_type + return ParsingFrontend(lexer_conf, parser_conf, options) + + @classmethod + def deserialize(cls, data, memo, callbacks, options): + lexer_conf = LexerConf.deserialize(data['lexer_conf'], memo) + parser_conf = ParserConf.deserialize(data['parser_conf'], memo) + parser = LALR_Parser.deserialize(data['parser'], memo, callbacks, options.debug) + parser_conf.callbacks = callbacks + + terminals = [item for item in memo.values() if isinstance(item, TerminalDef)] + + lexer_conf.callbacks = _get_lexer_callbacks(options.transformer, terminals) + lexer_conf.re_module = regex if options.regex else re + lexer_conf.use_bytes = options.use_bytes + lexer_conf.g_regex_flags = options.g_regex_flags + lexer_conf.skip_validation = True + lexer_conf.postlex = options.postlex + + return ParsingFrontend(lexer_conf, parser_conf, options, parser=parser) + + + + +class ParsingFrontend(Serialize): + __serialize_fields__ = 'lexer_conf', 'parser_conf', 'parser', 'options' + + def __init__(self, lexer_conf, parser_conf, options, parser=None): + self.parser_conf = parser_conf + self.lexer_conf = lexer_conf + self.options = options + + # Set-up parser + if parser: # From cache + self.parser = parser else: - raise ValueError('Unknown lexer: %s' % lexer) - elif parser == 'cyk': - if lexer == 'standard': - return CYK + create_parser = { + 'lalr': create_lalr_parser, + 'earley': create_earley_parser, + 'cyk': CYK_FrontEnd, + }[parser_conf.parser_type] + self.parser = create_parser(lexer_conf, parser_conf, options) + + # Set-up lexer + lexer_type = lexer_conf.lexer_type + self.skip_lexer = False + if lexer_type in ('dynamic', 'dynamic_complete'): + self.skip_lexer = True + return + + try: + create_lexer = { + 'standard': create_traditional_lexer, + 'contextual': create_contextual_lexer, + }[lexer_type] + except KeyError: + assert issubclass(lexer_type, Lexer), lexer_type + self.lexer = _wrap_lexer(lexer_type)(lexer_conf) else: - raise ValueError('CYK parser requires using standard parser.') - else: - raise ValueError('Unknown parser: %s' % parser) + self.lexer = create_lexer(lexer_conf, self.parser, lexer_conf.postlex) + if lexer_conf.postlex: + self.lexer = PostLexConnector(self.lexer, lexer_conf.postlex) -class _ParserFrontend(Serialize): - def _parse(self, start, input, *args): + + def parse(self, text, start=None): if start is None: - start = self.start + start = self.parser_conf.start if len(start) > 1: - raise ValueError("Lark initialized with more than 1 possible start rule. Must specify which start rule to parse", start) + raise ConfigurationError("Lark initialized with more than 1 possible start rule. 
Must specify which start rule to parse", start) start ,= start - return self.parser.parse(input, start, *args) + + if self.skip_lexer: + return self.parser.parse(text, start) + + lexer_thread = LexerThread(self.lexer, text) + return self.parser.parse(lexer_thread, start) + + +def get_frontend(parser, lexer): + assert_config(parser, ('lalr', 'earley', 'cyk')) + if not isinstance(lexer, type): # not custom lexer? + expected = { + 'lalr': ('standard', 'contextual'), + 'earley': ('standard', 'dynamic', 'dynamic_complete'), + 'cyk': ('standard', ), + }[parser] + assert_config(lexer, expected, 'Parser %r does not support lexer %%r, expected one of %%s' % parser) + + return MakeParsingFrontend(parser, lexer) def _get_lexer_callbacks(transformer, terminals): @@ -95,174 +149,86 @@ class PostLexConnector: return self.postlexer.process(i) -class WithLexer(_ParserFrontend): - lexer = None - parser = None - lexer_conf = None - start = None - - __serialize_fields__ = 'parser', 'lexer_conf', 'start' - __serialize_namespace__ = LexerConf, - - def __init__(self, lexer_conf, parser_conf, options=None): - self.lexer_conf = lexer_conf - self.start = parser_conf.start - self.postlex = lexer_conf.postlex - - @classmethod - def deserialize(cls, data, memo, callbacks, options): - inst = super(WithLexer, cls).deserialize(data, memo) - - inst.postlex = options.postlex - inst.parser = LALR_Parser.deserialize(inst.parser, memo, callbacks, options.debug) - - terminals = [item for item in memo.values() if isinstance(item, TerminalDef)] - inst.lexer_conf.callbacks = _get_lexer_callbacks(options.transformer, terminals) - inst.lexer_conf.re_module = regex if options.regex else re - inst.lexer_conf.use_bytes = options.use_bytes - inst.lexer_conf.g_regex_flags = options.g_regex_flags - inst.lexer_conf.skip_validation = True - inst.init_lexer() - - return inst - - def _serialize(self, data, memo): - data['parser'] = data['parser'].serialize(memo) - def make_lexer(self, text): - lexer = self.lexer - if self.postlex: - lexer = PostLexConnector(self.lexer, self.postlex) - return LexerThread(lexer, text) +def create_traditional_lexer(lexer_conf, parser, postlex): + return TraditionalLexer(lexer_conf) - def parse(self, text, start=None): - try: - return self._parse(start, self.make_lexer(text)) - except UnexpectedInput as e: - if e._all_terminals is None: - e._all_terminals = self.lexer_conf.terminals - raise e - - def init_traditional_lexer(self): - self.lexer = TraditionalLexer(self.lexer_conf) - -class LALR_WithLexer(WithLexer): - def __init__(self, lexer_conf, parser_conf, options=None): - debug = options.debug if options else False - self.parser = LALR_Parser(parser_conf, debug=debug) - WithLexer.__init__(self, lexer_conf, parser_conf, options) - - self.init_lexer() +def create_contextual_lexer(lexer_conf, parser, postlex): + states = {idx:list(t.keys()) for idx, t in parser._parse_table.states.items()} + always_accept = postlex.always_accept if postlex else () + return ContextualLexer(lexer_conf, states, always_accept=always_accept) - def init_lexer(self, **kw): - raise NotImplementedError() +def create_lalr_parser(lexer_conf, parser_conf, options=None): + debug = options.debug if options else False + return LALR_Parser(parser_conf, debug=debug) -class LALR_TraditionalLexer(LALR_WithLexer): - def init_lexer(self): - self.init_traditional_lexer() - -class LALR_ContextualLexer(LALR_WithLexer): - def init_lexer(self): - states = {idx:list(t.keys()) for idx, t in self.parser._parse_table.states.items()} - always_accept = 
self.postlex.always_accept if self.postlex else () - self.lexer = ContextualLexer(self.lexer_conf, states, always_accept=always_accept) +create_earley_parser = NotImplemented +CYK_FrontEnd = NotImplemented ###} -class LALR_CustomLexer(LALR_WithLexer): - def __init__(self, lexer_cls, lexer_conf, parser_conf, options=None): - self.lexer = lexer_cls(lexer_conf) - debug = options.debug if options else False - self.parser = LALR_Parser(parser_conf, debug=debug) - WithLexer.__init__(self, lexer_conf, parser_conf, options) - - -class Earley(WithLexer): - def __init__(self, lexer_conf, parser_conf, options=None): - WithLexer.__init__(self, lexer_conf, parser_conf, options) - self.init_traditional_lexer() - - resolve_ambiguity = options.ambiguity == 'resolve' - debug = options.debug if options else False - tree_class = options.tree_class or Tree if options.ambiguity != 'forest' else None - self.parser = earley.Parser(parser_conf, self.match, resolve_ambiguity=resolve_ambiguity, debug=debug, tree_class=tree_class) - - def make_lexer(self, text): - return WithLexer.make_lexer(self, text).lex(None) - - def match(self, term, token): - return term.name == token.type - - -class XEarley(_ParserFrontend): - def __init__(self, lexer_conf, parser_conf, options=None, **kw): - self.terminals_by_name = {t.name:t for t in lexer_conf.terminals} - self.start = parser_conf.start - - self._prepare_match(lexer_conf) - resolve_ambiguity = options.ambiguity == 'resolve' - debug = options.debug if options else False - tree_class = options.tree_class or Tree if options.ambiguity != 'forest' else None - self.parser = xearley.Parser(parser_conf, - self.match, - ignore=lexer_conf.ignore, - resolve_ambiguity=resolve_ambiguity, - debug=debug, - tree_class=tree_class, - **kw - ) - - def match(self, term, text, index=0): - return self.regexps[term.name].match(text, index) - - def _prepare_match(self, lexer_conf): +class EarleyRegexpMatcher: + def __init__(self, lexer_conf): self.regexps = {} for t in lexer_conf.terminals: if t.priority != 1: - raise ValueError("Dynamic Earley doesn't support weights on terminals", t, t.priority) + raise GrammarError("Dynamic Earley doesn't support weights on terminals", t, t.priority) regexp = t.pattern.to_regexp() try: width = get_regexp_width(regexp)[0] except ValueError: - raise ValueError("Bad regexp in token %s: %s" % (t.name, regexp)) + raise GrammarError("Bad regexp in token %s: %s" % (t.name, regexp)) else: if width == 0: - raise ValueError("Dynamic Earley doesn't allow zero-width regexps", t) + raise GrammarError("Dynamic Earley doesn't allow zero-width regexps", t) if lexer_conf.use_bytes: regexp = regexp.encode('utf-8') self.regexps[t.name] = lexer_conf.re_module.compile(regexp, lexer_conf.g_regex_flags) - def parse(self, text, start): - try: - return self._parse(start, text) - except UnexpectedInput as e: - if e._all_terminals is None: - e._all_terminals = self.terminals_by_name - raise e + def match(self, term, text, index=0): + return self.regexps[term.name].match(text, index) -class XEarley_CompleteLex(XEarley): - def __init__(self, *args, **kw): - XEarley.__init__(self, *args, complete_lex=True, **kw) +def create_earley_parser__dynamic(lexer_conf, parser_conf, options=None, **kw): + earley_matcher = EarleyRegexpMatcher(lexer_conf) + return xearley.Parser(parser_conf, earley_matcher.match, ignore=lexer_conf.ignore, **kw) +def _match_earley_basic(term, token): + return term.name == token.type -class CYK(WithLexer): +def create_earley_parser__basic(lexer_conf, parser_conf, options, 
**kw): + return earley.Parser(parser_conf, _match_earley_basic, **kw) - def __init__(self, lexer_conf, parser_conf, options=None): - WithLexer.__init__(self, lexer_conf, parser_conf, options) - self.init_traditional_lexer() +def create_earley_parser(lexer_conf, parser_conf, options): + resolve_ambiguity = options.ambiguity == 'resolve' + debug = options.debug if options else False + tree_class = options.tree_class or Tree if options.ambiguity != 'forest' else None + + extra = {} + if lexer_conf.lexer_type == 'dynamic': + f = create_earley_parser__dynamic + elif lexer_conf.lexer_type == 'dynamic_complete': + extra['complete_lex'] =True + f = create_earley_parser__dynamic + else: + f = create_earley_parser__basic + return f(lexer_conf, parser_conf, options, resolve_ambiguity=resolve_ambiguity, debug=debug, tree_class=tree_class, **extra) + + + +class CYK_FrontEnd: + def __init__(self, lexer_conf, parser_conf, options=None): self._analysis = GrammarAnalyzer(parser_conf) self.parser = cyk.Parser(parser_conf.rules) self.callbacks = parser_conf.callbacks - def parse(self, text, start): - tokens = list(self.make_lexer(text).lex(None)) - parse = self._parse(start, tokens) - parse = self._transform(parse) - return parse + def parse(self, lexer_thread, start): + tokens = list(lexer_thread.lex(None)) + tree = self.parser.parse(tokens, start) + return self._transform(tree) def _transform(self, tree): subtrees = list(tree.iter_subtrees()) diff --git a/lark/parsers/earley.py b/lark/parsers/earley.py index 42542c2..320b59a 100644 --- a/lark/parsers/earley.py +++ b/lark/parsers/earley.py @@ -1,4 +1,4 @@ -"""This module implements an scanerless Earley parser. +"""This module implements an Earley parser. The core Earley algorithm used here is based on Elizabeth Scott's implementation, here: https://www.sciencedirect.com/science/article/pii/S1571066108001497 @@ -6,8 +6,7 @@ The core Earley algorithm used here is based on Elizabeth Scott's implementation That is probably the best reference for understanding the algorithm here. The Earley parser outputs an SPPF-tree as per that document. The SPPF tree format -is better documented here: - http://www.bramvandersanden.com/post/2014/06/shared-packed-parse-forest/ +is explained here: https://lark-parser.readthedocs.io/en/latest/_static/sppf/sppf.html """ from collections import deque @@ -147,7 +146,7 @@ class Parser: column.add(new_item) items.append(new_item) - def _parse(self, stream, columns, to_scan, start_symbol=None): + def _parse(self, lexer, columns, to_scan, start_symbol=None): def is_quasi_complete(item): if item.is_complete: return True @@ -246,7 +245,7 @@ class Parser: if not next_set and not next_to_scan: expect = {i.expect.name for i in to_scan} - raise UnexpectedToken(token, expect, considered_rules = set(to_scan)) + raise UnexpectedToken(token, expect, considered_rules=set(to_scan), state=frozenset(i.s for i in to_scan)) return next_to_scan @@ -262,20 +261,24 @@ class Parser: # Completions will be added to the SPPF tree, and predictions will be recursively # processed down to terminals/empty nodes to be added to the scanner for the next # step. + expects = {i.expect for i in to_scan} i = 0 - for token in stream: + for token in lexer.lex(expects): self.predict_and_complete(i, to_scan, columns, transitives) to_scan = scan(i, token, to_scan) i += 1 + expects.clear() + expects |= {i.expect for i in to_scan} + self.predict_and_complete(i, to_scan, columns, transitives) ## Column is now the final column in the parse. 
assert i == len(columns)-1 return to_scan - def parse(self, stream, start): + def parse(self, lexer, start): assert start, start start_symbol = NonTerminal(start) @@ -292,12 +295,16 @@ class Parser: else: columns[0].add(item) - to_scan = self._parse(stream, columns, to_scan, start_symbol) + to_scan = self._parse(lexer, columns, to_scan, start_symbol) # If the parse was successful, the start # symbol should have been completed in the last step of the Earley cycle, and will be in # this column. Find the item for the start_symbol, which is the root of the SPPF tree. solutions = [n.node for n in columns[-1] if n.is_complete and n.node is not None and n.s == start_symbol and n.start == 0] + if not solutions: + expected_terminals = [t.expect for t in to_scan] + raise UnexpectedEOF(expected_terminals, state=frozenset(i.s for i in to_scan)) + if self.debug: from .earley_forest import ForestToPyDotVisitor try: @@ -308,10 +315,7 @@ class Parser: debug_walker.visit(solutions[0], "sppf.png") - if not solutions: - expected_tokens = [t.expect for t in to_scan] - raise UnexpectedEOF(expected_tokens) - elif len(solutions) > 1: + if len(solutions) > 1: assert False, 'Earley should not generate multiple start symbol items!' if self.tree_class is not None: diff --git a/lark/parsers/earley_forest.py b/lark/parsers/earley_forest.py index 532dedf..03c4573 100644 --- a/lark/parsers/earley_forest.py +++ b/lark/parsers/earley_forest.py @@ -459,15 +459,20 @@ class PackedData(): that comes from the left child and the right child. """ + class _NoData(): + pass + + NO_DATA = _NoData() + def __init__(self, node, data): - self.left = None - self.right = None + self.left = self.NO_DATA + self.right = self.NO_DATA if data: - if node.left: + if node.left is not None: self.left = data[0] - if len(data) > 1 and node.right: + if len(data) > 1: self.right = data[1] - elif node.right: + else: self.right = data[0] class ForestToParseTree(ForestTransformer): @@ -490,19 +495,22 @@ class ForestToParseTree(ForestTransformer): self.prioritizer = prioritizer self.resolve_ambiguity = resolve_ambiguity self._on_cycle_retreat = False + self._cycle_node = None + self._successful_visits = set() def on_cycle(self, node, path): - logger.warning("Cycle encountered in the SPPF at node: %s. " + logger.debug("Cycle encountered in the SPPF at node: %s. 
" "As infinite ambiguities cannot be represented in a tree, " "this family of derivations will be discarded.", node) - if self.resolve_ambiguity: - # TODO: choose a different path if cycle is encountered - logger.warning("At this time, using ambiguity resolution for SPPFs " - "with cycles may result in None being returned.") + self._cycle_node = node self._on_cycle_retreat = True def _check_cycle(self, node): if self._on_cycle_retreat: + if id(node) == id(self._cycle_node): + self._cycle_node = None + self._on_cycle_retreat = False + return raise Discard() def _collapse_ambig(self, children): @@ -531,11 +539,17 @@ class ForestToParseTree(ForestTransformer): raise Discard() def transform_symbol_node(self, node, data): + if id(node) not in self._successful_visits: + raise Discard() + self._successful_visits.remove(id(node)) self._check_cycle(node) data = self._collapse_ambig(data) return self._call_ambig_func(node, data) def transform_intermediate_node(self, node, data): + if id(node) not in self._successful_visits: + raise Discard() + self._successful_visits.remove(id(node)) self._check_cycle(node) if len(data) > 1: children = [self.tree_class('_inter', c) for c in data] @@ -544,36 +558,40 @@ class ForestToParseTree(ForestTransformer): def transform_packed_node(self, node, data): self._check_cycle(node) + if self.resolve_ambiguity and id(node.parent) in self._successful_visits: + raise Discard() children = [] assert len(data) <= 2 data = PackedData(node, data) - if data.left is not None: + if data.left is not PackedData.NO_DATA: if node.left.is_intermediate and isinstance(data.left, list): children += data.left else: children.append(data.left) - if data.right is not None: + if data.right is not PackedData.NO_DATA: children.append(data.right) if node.parent.is_intermediate: return children return self._call_rule_func(node, children) def visit_symbol_node_in(self, node): - self._on_cycle_retreat = False super(ForestToParseTree, self).visit_symbol_node_in(node) + if self._on_cycle_retreat: + return if self.prioritizer and node.is_ambiguous and isinf(node.priority): self.prioritizer.visit(node) - if self.resolve_ambiguity: - return node.children[0] return node.children def visit_packed_node_in(self, node): self._on_cycle_retreat = False - return super(ForestToParseTree, self).visit_packed_node_in(node) + to_visit = super(ForestToParseTree, self).visit_packed_node_in(node) + if not self.resolve_ambiguity or id(node.parent) not in self._successful_visits: + return to_visit - def visit_token_node(self, node): - self._on_cycle_retreat = False - return super(ForestToParseTree, self).visit_token_node(node) + def visit_packed_node_out(self, node): + super(ForestToParseTree, self).visit_packed_node_out(node) + if not self._on_cycle_retreat: + self._successful_visits.add(id(node.parent)) def handles_ambiguity(func): """Decorator for methods of subclasses of ``TreeForestTransformer``. 
@@ -679,7 +697,10 @@ class ForestToPyDotVisitor(ForestVisitor): def visit(self, root, filename): super(ForestToPyDotVisitor, self).visit(root) - self.graph.write_png(filename) + try: + self.graph.write_png(filename) + except FileNotFoundError as e: + logger.error("Could not write png: ", e) def visit_token_node(self, node): graph_node_id = str(id(node)) diff --git a/lark/parsers/lalr_parser.py b/lark/parsers/lalr_parser.py index 99b3672..9f08b81 100644 --- a/lark/parsers/lalr_parser.py +++ b/lark/parsers/lalr_parser.py @@ -3,15 +3,16 @@ # Author: Erez Shinan (2017) # Email : erezshin@gmail.com from copy import deepcopy, copy -from ..exceptions import UnexpectedCharacters, UnexpectedInput, UnexpectedToken +from ..exceptions import UnexpectedInput, UnexpectedToken from ..lexer import Token +from ..utils import Serialize from .lalr_analysis import LALR_Analyzer, Shift, Reduce, IntParseTable from .lalr_puppet import ParserPuppet ###{standalone -class LALR_Parser(object): +class LALR_Parser(Serialize): def __init__(self, parser_conf, debug=False): analysis = LALR_Analyzer(parser_conf, debug=debug) analysis.compute_lalr() @@ -62,6 +63,12 @@ class ParserState(object): def position(self): return self.state_stack[-1] + # Necessary for match_examples() to work + def __eq__(self, other): + if not isinstance(other, ParserState): + return False + return self.position == other.position + def __copy__(self): return type(self)( self.parse_conf, @@ -86,7 +93,7 @@ class ParserState(object): action, arg = states[state][token.type] except KeyError: expected = {s for s in states[state].keys() if s.isupper()} - raise UnexpectedToken(token, expected, state=state, puppet=None) + raise UnexpectedToken(token, expected, state=self, puppet=None) assert arg != end_state @@ -95,7 +102,7 @@ class ParserState(object): assert not is_end state_stack.append(arg) value_stack.append(token) - return arg + return else: # reduce+shift as many times as necessary rule = arg diff --git a/lark/parsers/lalr_puppet.py b/lark/parsers/lalr_puppet.py index e1496f9..93ba287 100644 --- a/lark/parsers/lalr_puppet.py +++ b/lark/parsers/lalr_puppet.py @@ -22,7 +22,7 @@ class ParserPuppet(object): Note that ``token`` has to be an instance of ``Token``. """ - return self.parser_state.feed_token(token) + return self.parser_state.feed_token(token, token.type == '$END') def __copy__(self): """Create a new puppet with a separate state. 
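
The `ParserState.__eq__` added above, together with `state=self` now being attached to `UnexpectedToken`, is what allows `UnexpectedInput.match_examples` to compare the state of a fresh parse error against the states produced by known bad inputs. A condensed sketch of that usage, adapted from `test_match_examples` at the end of this diff; the labels are illustrative:

```python
from lark import Lark
from lark.exceptions import UnexpectedInput

parser = Lark(r'start: "a" "b" "c"', parser='lalr')

# Label -> example inputs that produce that kind of error.
error_examples = {
    'bad third char': ['abe'],
    'premature end': ['ab'],
    'bad first char': ['cbc', 'dbc'],
}

def classify_error(text):
    try:
        parser.parse(text)
    except UnexpectedInput as u:
        # Re-parses each example and returns the label whose error state
        # matches this one (compared via the new ParserState.__eq__),
        # or None if nothing matches.
        return u.match_examples(parser.parse, error_examples)
```
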
@@ -35,15 +35,18 @@ class ParserPuppet(object): copy(self.lexer_state), ) + def copy(self): + return copy(self) + def __eq__(self, other): if not isinstance(other, ParserPuppet): return False return self.parser_state == other.parser_state and self.lexer_state == other.lexer_state - # TODO Provide with an immutable puppet instance - # def __hash__(self): - # return hash((self.parser_state, self.lexer_state)) + def as_immutable(self): + p = copy(self) + return ImmutableParserPuppet(p.parser, p.parser_state, p.lexer_state) def pretty(self): """Print the output of ``choices()`` in a way that's easier to read.""" @@ -78,3 +81,16 @@ class ParserPuppet(object): def resume_parse(self): """Resume parsing from the current puppet state.""" return self.parser.parse_from_state(self.parser_state) + + + +class ImmutableParserPuppet(ParserPuppet): + result = None + + def __hash__(self): + return hash((self.parser_state, self.lexer_state)) + + def feed_token(self, token): + c = copy(self) + c.result = ParserPuppet.feed_token(c, token) + return c \ No newline at end of file diff --git a/lark/parsers/xearley.py b/lark/parsers/xearley.py index 256fc2c..79ac82f 100644 --- a/lark/parsers/xearley.py +++ b/lark/parsers/xearley.py @@ -63,9 +63,10 @@ class Parser(BaseParser): t = Token(item.expect.name, m.group(0), i, text_line, text_column) delayed_matches[i+m.end()].append( (item, i, t) ) - # Remove any items that successfully matched in this pass from the to_scan buffer. - # This ensures we don't carry over tokens that already matched, if we're ignoring below. - to_scan.remove(item) + # XXX The following 3 lines were commented out for causing a bug. See issue #768 + # # Remove any items that successfully matched in this pass from the to_scan buffer. + # # This ensures we don't carry over tokens that already matched, if we're ignoring below. + # to_scan.remove(item) # 3) Process any ignores. This is typically used for e.g. whitespace. # We carry over any unmatched items from the to_scan buffer to be matched again after @@ -113,7 +114,8 @@ class Parser(BaseParser): del delayed_matches[i+1] # No longer needed, so unburden memory if not next_set and not delayed_matches and not next_to_scan: - raise UnexpectedCharacters(stream, i, text_line, text_column, {item.expect.name for item in to_scan}, set(to_scan)) + raise UnexpectedCharacters(stream, i, text_line, text_column, {item.expect.name for item in to_scan}, + set(to_scan), state=frozenset(i.s for i in to_scan)) return next_to_scan diff --git a/lark/tools/nearley.py b/lark/tools/nearley.py index af2789e..faf2b69 100644 --- a/lark/tools/nearley.py +++ b/lark/tools/nearley.py @@ -35,7 +35,9 @@ nearley_grammar = r""" COMMENT: /#[^\n]*/ REGEXP: /\[.*?\]/ - %import common.ESCAPED_STRING -> STRING + STRING: _STRING "i"? 
+ + %import common.ESCAPED_STRING -> _STRING %import common.WS %ignore WS %ignore COMMENT @@ -183,7 +185,7 @@ def main(fn, start, nearley_lib, es6=False): return create_code_for_nearley_grammar(grammar, start, os.path.join(nearley_lib, 'builtin'), os.path.abspath(os.path.dirname(fn)), es6=es6) def get_arg_parser(): - parser = argparse.ArgumentParser('Reads Nearley grammar (with js functions) outputs an equivalent lark parser.') + parser = argparse.ArgumentParser(description='Reads a Nearley grammar (with js functions), and outputs an equivalent lark parser.') parser.add_argument('nearley_grammar', help='Path to the file containing the nearley grammar') parser.add_argument('start_rule', help='Rule within the nearley grammar to make the base rule') parser.add_argument('nearley_lib', help='Path to root directory of nearley codebase (used for including builtins)') diff --git a/lark/tree.py b/lark/tree.py index 0b7114b..9d95015 100644 --- a/lark/tree.py +++ b/lark/tree.py @@ -46,14 +46,14 @@ class Tree(object): def _pretty(self, level, indent_str): if len(self.children) == 1 and not isinstance(self.children[0], Tree): - return [ indent_str*level, self._pretty_label(), '\t', '%s' % (self.children[0],), '\n'] + return [indent_str*level, self._pretty_label(), '\t', '%s' % (self.children[0],), '\n'] - l = [ indent_str*level, self._pretty_label(), '\n' ] + l = [indent_str*level, self._pretty_label(), '\n'] for n in self.children: if isinstance(n, Tree): l += n._pretty(level+1, indent_str) else: - l += [ indent_str*(level+1), '%s' % (n,), '\n' ] + l += [indent_str*(level+1), '%s' % (n,), '\n'] return l @@ -102,8 +102,8 @@ class Tree(object): ###} def expand_kids_by_index(self, *indices): - "Expand (inline) children at the given indices" - for i in sorted(indices, reverse=True): # reverse so that changing tail won't affect indices + """Expand (inline) children at the given indices""" + for i in sorted(indices, reverse=True): # reverse so that changing tail won't affect indices kid = self.children[i] self.children[i:i+1] = kid.children @@ -144,12 +144,15 @@ class Tree(object): @property def line(self): return self.meta.line + @property def column(self): return self.meta.column + @property def end_line(self): return self.meta.end_line + @property def end_column(self): return self.meta.end_column @@ -168,6 +171,7 @@ def pydot__tree_to_dot(tree, filename, rankdir="LR", **kwargs): graph = pydot__tree_to_graph(tree, rankdir, **kwargs) graph.write(filename) + def pydot__tree_to_graph(tree, rankdir="LR", **kwargs): """Creates a colorful image that represents the tree (data+children, without meta) @@ -196,7 +200,7 @@ def pydot__tree_to_graph(tree, rankdir="LR", **kwargs): subnodes = [_to_pydot(child) if isinstance(child, Tree) else new_leaf(child) for child in subtree.children] - node = pydot.Node(i[0], style="filled", fillcolor="#%x"%color, label=subtree.data) + node = pydot.Node(i[0], style="filled", fillcolor="#%x" % color, label=subtree.data) i[0] += 1 graph.add_node(node) diff --git a/lark/tree_matcher.py b/lark/tree_matcher.py index 8c1f17a..c9d9fde 100644 --- a/lark/tree_matcher.py +++ b/lark/tree_matcher.py @@ -69,6 +69,14 @@ def parse_rulename(s): return name, args + +class ChildrenLexer: + def __init__(self, children): + self.children = children + + def lex(self, parser_state): + return self.children + class TreeMatcher: """Match the elements of a tree node, based on an ontology provided by a Lark grammar. 
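
`ChildrenLexer` above is a small adapter to the parsers' new input contract: they now consume an object exposing `lex(parser_state)` (normally a `LexerThread`) instead of a raw token iterator. User-supplied lexer classes target the `Lexer` interface instead; new-style ones set `__future_interface__ = True` and implement `lex(lexer_state, parser_state)`, as `CustomLexerNew` does in the updated tests, while old-style `lex(text)` lexers are adapted by `_wrap_lexer` in `parser_frontends.py`. A toy sketch of a new-style custom lexer; the grammar and the `WORD` terminal are invented for illustration:

```python
from lark import Lark, Token
from lark.lexer import Lexer

class WordLexer(Lexer):
    """Toy lexer: every whitespace-separated word becomes an upper-cased WORD token."""
    __future_interface__ = True   # opt in to the new lex(lexer_state, parser_state) signature

    def __init__(self, lexer_conf):
        pass                      # nothing to compile for this toy example

    def lex(self, lexer_state, parser_state):
        for word in lexer_state.text.split():
            yield Token('WORD', word.upper())

parser = Lark("""
    start: WORD+
    %declare WORD
""", parser='lalr', lexer=WordLexer)

print(parser.parse("hello world"))
# -> Tree('start', [Token('WORD', 'HELLO'), Token('WORD', 'WORLD')])
```
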
@@ -173,6 +181,6 @@ class TreeMatcher: self._parser_cache[rulename] = parser # find a full derivation - unreduced_tree = parser.parse(tree.children, rulename) + unreduced_tree = parser.parse(ChildrenLexer(tree.children), rulename) assert unreduced_tree.data == rulename return unreduced_tree diff --git a/lark/utils.py b/lark/utils.py index cfd4306..3b5b8a8 100644 --- a/lark/utils.py +++ b/lark/utils.py @@ -1,10 +1,9 @@ -import sys import os from functools import reduce -from ast import literal_eval from collections import deque ###{standalone +import sys, re import logging logger = logging.getLogger("lark") logger.addHandler(logging.StreamHandler()) @@ -12,6 +11,8 @@ logger.addHandler(logging.StreamHandler()) # By default, we should not output any log messages logger.setLevel(logging.CRITICAL) +Py36 = (sys.version_info[:2] >= (3, 6)) + def classify(seq, key=None, value=None): d = {} @@ -27,7 +28,7 @@ def classify(seq, key=None, value=None): def _deserialize(data, namespace, memo): if isinstance(data, dict): - if '__type__' in data: # Object + if '__type__' in data: # Object class_ = namespace[data['__type__']] return class_.deserialize(data, memo) elif '@' in data: @@ -105,7 +106,6 @@ class SerializeMemoizer(Serialize): return _deserialize(data, namespace, memo) - try: STRING_TYPE = basestring except NameError: # Python 3 @@ -118,10 +118,11 @@ from contextlib import contextmanager Str = type(u'') try: - classtype = types.ClassType # Python2 + classtype = types.ClassType # Python2 except AttributeError: classtype = type # Python3 + def smart_decorator(f, create_decorator): if isinstance(f, types.FunctionType): return wraps(f)(create_decorator(f, True)) @@ -139,17 +140,16 @@ def smart_decorator(f, create_decorator): else: return create_decorator(f.__func__.__call__, True) + try: import regex except ImportError: regex = None -import sys, re -Py36 = (sys.version_info[:2] >= (3, 6)) - import sre_parse import sre_constants categ_pattern = re.compile(r'\\p{[A-Za-z_]+}') + def get_regexp_width(expr): if regex: # Since `sre_parse` cannot deal with Unicode categories of the form `\p{Mn}`, we replace these with @@ -173,9 +173,7 @@ def dedup_list(l): preserving the original order of the list. Assumes that the list entries are hashable.""" dedup = set() - return [ x for x in l if not (x in dedup or dedup.add(x))] - - + return [x for x in l if not (x in dedup or dedup.add(x))] try: @@ -197,8 +195,6 @@ except ImportError: pass - - try: compare = cmp except NameError: @@ -210,7 +206,6 @@ except NameError: return -1 - class Enumerator(Serialize): def __init__(self): self.enums = {} @@ -229,31 +224,6 @@ class Enumerator(Serialize): return r -def eval_escaping(s): - w = '' - i = iter(s) - for n in i: - w += n - if n == '\\': - try: - n2 = next(i) - except StopIteration: - raise ValueError("Literal ended unexpectedly (bad escaping): `%r`" % s) - if n2 == '\\': - w += '\\\\' - elif n2 not in 'uxnftr': - w += '\\' - w += n2 - w = w.replace('\\"', '"').replace("'", "\\'") - - to_eval = "u'''%s'''" % w - try: - s = literal_eval(to_eval) - except SyntaxError as e: - raise ValueError(s, e) - - return s - def combine_alternatives(lists): """ @@ -332,4 +302,5 @@ def _serialize(value, memo): return list(value) # TODO reversible? 
elif isinstance(value, dict): return {key:_serialize(elem, memo) for key, elem in value.items()} + # assert value is None or isinstance(value, (int, float, str, tuple)), value return value diff --git a/lark/visitors.py b/lark/visitors.py index 14896e5..7e3bae4 100644 --- a/lark/visitors.py +++ b/lark/visitors.py @@ -8,6 +8,7 @@ from .lexer import Token ###{standalone from inspect import getmembers, getmro + class Discard(Exception): """When raising the Discard exception in a transformer callback, that node is discarded and won't appear in the parent. @@ -16,6 +17,7 @@ class Discard(Exception): # Transformers + class _Decoratable: "Provides support for decorating methods with @v_args" @@ -47,7 +49,7 @@ class _Decoratable: class Transformer(_Decoratable): """Transformers visit each node of the tree, and run the appropriate method on it according to the node's data. - Calls its methods (provided by user via inheritance) according to ``tree.data``. + Calls its methods (provided by the user via inheritance) according to ``tree.data``. The returned value replaces the old one in the structure. They work bottom-up (or depth-first), starting with the leaves and ending at the root of the tree. @@ -64,12 +66,11 @@ class Transformer(_Decoratable): - ``Transformer_InPlaceRecursive`` - Recursive. Changes the tree in-place instead of returning new instances Parameters: - visit_tokens: By default, transformers only visit rules. - visit_tokens=True will tell ``Transformer`` to visit tokens - as well. This is a slightly slower alternative to lexer_callbacks - but it's easier to maintain and works for all algorithms - (even when there isn't a lexer). + visit_tokens (bool, optional): Should the transformer visit tokens in addition to rules. + Setting this to ``False`` is slightly faster. Defaults to ``True``. + (For processing ignored tokens, use the ``lexer_callbacks`` options) + NOTE: A transformer without methods essentially performs a non-memoized deepcopy. """ __visit_tokens__ = True # For backwards compatibility @@ -108,7 +109,6 @@ class Transformer(_Decoratable): except Exception as e: raise VisitError(token.type, token, e) - def _transform_children(self, children): for c in children: try: @@ -126,29 +126,29 @@ class Transformer(_Decoratable): return self._call_userfunc(tree, children) def transform(self, tree): + "Transform the given tree, and return the final result" return self._transform_tree(tree) def __mul__(self, other): + """Chain two transformers together, returning a new transformer. + """ return TransformerChain(self, other) def __default__(self, data, children, meta): - """Default operation on tree (for override) + """Default function that is called if there is no attribute matching ``data`` - Function that is called on if a function with a corresponding name has not been found. - Defaults to reconstruct the Tree. + Can be overridden. Defaults to creating a new copy of the tree node (i.e. ``return Tree(data, children, meta)``) """ return Tree(data, children, meta) def __default_token__(self, token): - """Default operation on token (for override) + """Default function that is called if there is no attribute matching ``token.type`` - Function that is called on if a function with a corresponding name has not been found. - Defaults to just return the argument. + Can be overridden. Defaults to returning the token as-is. 
""" return token - class InlineTransformer(Transformer): # XXX Deprecated def _call_userfunc(self, tree, new_children=None): # Assumes tree is already transformed @@ -175,7 +175,10 @@ class TransformerChain(object): class Transformer_InPlace(Transformer): - "Non-recursive. Changes the tree in-place instead of returning new instances" + """Same as Transformer, but non-recursive, and changes the tree in-place instead of returning new instances + + Useful for huge trees. Conservative in memory. + """ def _transform_tree(self, tree): # Cancel recursion return self._call_userfunc(tree) @@ -187,7 +190,12 @@ class Transformer_InPlace(Transformer): class Transformer_NonRecursive(Transformer): - "Non-recursive. Doesn't change the original tree." + """Same as Transformer but non-recursive. + + Like Transformer, it doesn't change the original tree. + + Useful for huge trees. + """ def transform(self, tree): # Tree to postfix @@ -195,7 +203,7 @@ class Transformer_NonRecursive(Transformer): q = [tree] while q: t = q.pop() - rev_postfix.append( t ) + rev_postfix.append(t) if isinstance(t, Tree): q += t.children @@ -217,15 +225,13 @@ class Transformer_NonRecursive(Transformer): return t - class Transformer_InPlaceRecursive(Transformer): - "Recursive. Changes the tree in-place instead of returning new instances" + "Same as Transformer, recursive, but changes the tree in-place instead of returning new instances" def _transform_tree(self, tree): tree.children = list(self._transform_children(tree.children)) return self._call_userfunc(tree) - # Visitors class VisitorBase: @@ -233,7 +239,10 @@ class VisitorBase: return getattr(self, tree.data, self.__default__)(tree) def __default__(self, tree): - "Default operation on tree (for override)" + """Default function that is called if there is no attribute matching ``tree.data`` + + Can be overridden. Defaults to doing nothing. + """ return tree def __class_getitem__(cls, _): @@ -241,18 +250,19 @@ class VisitorBase: class Visitor(VisitorBase): - """Bottom-up visitor, non-recursive. + """Tree visitor, non-recursive (can handle huge trees). - Visits the tree, starting with the leaves and finally the root (bottom-up) - Calls its methods (provided by user via inheritance) according to ``tree.data`` + Visiting a node calls its methods (provided by the user via inheritance) according to ``tree.data`` """ def visit(self, tree): + "Visits the tree, starting with the leaves and finally the root (bottom-up)" for subtree in tree.iter_subtrees(): self._call_userfunc(subtree) return tree def visit_topdown(self,tree): + "Visit the tree, starting at the root, and ending at the leaves (top-down)" for subtree in tree.iter_subtrees_topdown(): self._call_userfunc(subtree) return tree @@ -261,11 +271,13 @@ class Visitor(VisitorBase): class Visitor_Recursive(VisitorBase): """Bottom-up visitor, recursive. - Visits the tree, starting with the leaves and finally the root (bottom-up) - Calls its methods (provided by user via inheritance) according to ``tree.data`` + Visiting a node calls its methods (provided by the user via inheritance) according to ``tree.data`` + + Slightly faster than the non-recursive version. 
""" def visit(self, tree): + "Visits the tree, starting with the leaves and finally the root (bottom-up)" for child in tree.children: if isinstance(child, Tree): self.visit(child) @@ -274,6 +286,7 @@ class Visitor_Recursive(VisitorBase): return tree def visit_topdown(self,tree): + "Visit the tree, starting at the root, and ending at the leaves (top-down)" self._call_userfunc(tree) for child in tree.children: @@ -283,7 +296,6 @@ class Visitor_Recursive(VisitorBase): return tree - def visit_children_decor(func): "See Interpreter" @wraps(func) @@ -324,8 +336,6 @@ class Interpreter(_Decoratable): return self.visit_children(tree) - - # Decorators def _apply_decorator(obj, decorator, **kwargs): @@ -337,7 +347,6 @@ def _apply_decorator(obj, decorator, **kwargs): return _apply(decorator, **kwargs) - def _inline_args__func(func): @wraps(func) def create_decorator(_f, with_self): @@ -356,7 +365,6 @@ def inline_args(obj): # XXX Deprecated return _apply_decorator(obj, _inline_args__func) - def _visitor_args_func_dec(func, visit_wrapper=None, static=False): def create_decorator(_f, with_self): if with_self: @@ -376,11 +384,11 @@ def _visitor_args_func_dec(func, visit_wrapper=None, static=False): return f -def _vargs_inline(f, data, children, meta): +def _vargs_inline(f, _data, children, _meta): return f(*children) -def _vargs_meta_inline(f, data, children, meta): +def _vargs_meta_inline(f, _data, children, meta): return f(meta, *children) -def _vargs_meta(f, data, children, meta): +def _vargs_meta(f, _data, children, meta): return f(children, meta) # TODO swap these for consistency? Backwards incompatible! def _vargs_tree(f, data, children, meta): return f(Tree(data, children, meta)) @@ -394,10 +402,14 @@ def v_args(inline=False, meta=False, tree=False, wrapper=None): ``v_args`` can modify this behavior. When used on a transformer/visitor class definition, it applies to all the callback methods inside it. + ``v_args`` can be applied to a single method, or to an entire class. When applied to both, + the options given to the method take precedence. + Parameters: - inline: Children are provided as ``*args`` instead of a list argument (not recommended for very long lists). - meta: Provides two arguments: ``children`` and ``meta`` (instead of just the first) - tree: Provides the entire tree as the argument, instead of the children. + inline (bool, optional): Children are provided as ``*args`` instead of a list argument (not recommended for very long lists). + meta (bool, optional): Provides two arguments: ``children`` and ``meta`` (instead of just the first) + tree (bool, optional): Provides the entire tree as the argument, instead of the children. + wrapper (function, optional): Provide a function to decorate all methods. 
Example: :: @@ -440,7 +452,7 @@ def v_args(inline=False, meta=False, tree=False, wrapper=None): ###} -#--- Visitor Utilities --- +# --- Visitor Utilities --- class CollapseAmbiguities(Transformer): """ @@ -454,7 +466,9 @@ class CollapseAmbiguities(Transformer): """ def _ambig(self, options): return sum(options, []) + def __default__(self, data, children_lists, meta): return [Tree(data, children, meta) for children in combine_alternatives(children_lists)] + def __default_token__(self, t): return [t] diff --git a/setup.py b/setup.py index 382943e..b3897c5 100644 --- a/setup.py +++ b/setup.py @@ -29,8 +29,8 @@ setup( description = "a modern parsing library", license = "MIT", keywords = "Earley LALR parser parsing ast", - url = "https://github.com/erezsh/lark", - download_url = "https://github.com/erezsh/lark/tarball/master", + url = "https://github.com/lark-parser/lark", + download_url = "https://github.com/lark-parser/lark/tarball/master", long_description=''' Lark is a modern general-purpose parsing library for Python. diff --git a/tests/__main__.py b/tests/__main__.py index 5ec89e3..b779457 100644 --- a/tests/__main__.py +++ b/tests/__main__.py @@ -9,6 +9,7 @@ from .test_tools import TestStandalone from .test_cache import TestCache from .test_grammar import TestGrammar from .test_reconstructor import TestReconstructor +from .test_tree_forest_transformer import TestTreeForestTransformer try: from .test_nearley.test_nearley import TestNearley @@ -20,20 +21,7 @@ except ImportError: from .test_logger import Testlogger -from .test_parser import ( - TestLalrStandard, - TestEarleyStandard, - TestCykStandard, - TestLalrContextual, - TestEarleyDynamic, - TestLalrCustom, - - # TestFullEarleyStandard, - TestFullEarleyDynamic, - TestFullEarleyDynamic_complete, - - TestParsers, - ) +from .test_parser import * # We define __all__ to list which TestSuites to run logger.setLevel(logging.INFO) diff --git a/tests/test_parser.py b/tests/test_parser.py index 38399cf..9b011f7 100644 --- a/tests/test_parser.py +++ b/tests/test_parser.py @@ -11,6 +11,7 @@ from copy import copy, deepcopy from lark.utils import Py36, isascii from lark import Token +from lark.load_grammar import FromPackageLoader try: from cStringIO import StringIO as cStringIO @@ -29,6 +30,7 @@ try: except ImportError: regex = None +import lark from lark import logger from lark.lark import Lark from lark.exceptions import GrammarError, ParseError, UnexpectedToken, UnexpectedInput, UnexpectedCharacters @@ -36,9 +38,9 @@ from lark.tree import Tree from lark.visitors import Transformer, Transformer_InPlace, v_args from lark.grammar import Rule from lark.lexer import TerminalDef, Lexer, TraditionalLexer +from lark.indenter import Indenter -logger.setLevel(logging.INFO) - +__all__ = ['TestParsers'] __path__ = os.path.dirname(__file__) def _read(n, *args): @@ -322,6 +324,22 @@ class TestParsers(unittest.TestCase): def test_alias(self): Lark("""start: ["a"] "b" ["c"] "e" ["f"] ["g"] ["h"] "x" -> d """) + def test_backwards_custom_lexer(self): + class OldCustomLexer(Lexer): + def __init__(self, lexer_conf): + pass + + def lex(self, text): + yield Token('A', 'A') + + p = Lark(""" + start: A + %declare A + """, parser='lalr', lexer=OldCustomLexer) + + r = p.parse('') + self.assertEqual(r, Tree('start', [Token('A', 'A')])) + def _make_full_earley_test(LEXER): @@ -745,6 +763,76 @@ def _make_full_earley_test(LEXER): tree = parser.parse(text) self.assertEqual(tree.children, ['foo', 'bar']) + def test_cycle(self): + grammar = """ + start: start? 
+ """ + + l = Lark(grammar, ambiguity='resolve', lexer=LEXER) + tree = l.parse('') + self.assertEqual(tree, Tree('start', [])) + + l = Lark(grammar, ambiguity='explicit', lexer=LEXER) + tree = l.parse('') + self.assertEqual(tree, Tree('start', [])) + + def test_cycles(self): + grammar = """ + a: b + b: c* + c: a + """ + + l = Lark(grammar, start='a', ambiguity='resolve', lexer=LEXER) + tree = l.parse('') + self.assertEqual(tree, Tree('a', [Tree('b', [])])) + + l = Lark(grammar, start='a', ambiguity='explicit', lexer=LEXER) + tree = l.parse('') + self.assertEqual(tree, Tree('a', [Tree('b', [])])) + + def test_many_cycles(self): + grammar = """ + start: a? | start start + !a: "a" + """ + + l = Lark(grammar, ambiguity='resolve', lexer=LEXER) + tree = l.parse('a') + self.assertEqual(tree, Tree('start', [Tree('a', ['a'])])) + + l = Lark(grammar, ambiguity='explicit', lexer=LEXER) + tree = l.parse('a') + self.assertEqual(tree, Tree('start', [Tree('a', ['a'])])) + + def test_cycles_with_child_filter(self): + grammar = """ + a: _x + _x: _x? b + b: + """ + + grammar2 = """ + a: x + x: x? b + b: + """ + + l = Lark(grammar, start='a', ambiguity='resolve', lexer=LEXER) + tree = l.parse('') + self.assertEqual(tree, Tree('a', [Tree('b', [])])) + + l = Lark(grammar, start='a', ambiguity='explicit', lexer=LEXER) + tree = l.parse(''); + self.assertEqual(tree, Tree('a', [Tree('b', [])])) + + l = Lark(grammar2, start='a', ambiguity='resolve', lexer=LEXER) + tree = l.parse(''); + self.assertEqual(tree, Tree('a', [Tree('x', [Tree('b', [])])])) + + l = Lark(grammar2, start='a', ambiguity='explicit', lexer=LEXER) + tree = l.parse(''); + self.assertEqual(tree, Tree('a', [Tree('x', [Tree('b', [])])])) @@ -768,16 +856,32 @@ def _make_full_earley_test(LEXER): _NAME = "TestFullEarley" + LEXER.capitalize() _TestFullEarley.__name__ = _NAME globals()[_NAME] = _TestFullEarley + __all__.append(_NAME) -class CustomLexer(Lexer): +class CustomLexerNew(Lexer): """ Purpose of this custom lexer is to test the integration, so it uses the traditionalparser as implementation without custom lexing behaviour. """ def __init__(self, lexer_conf): self.lexer = TraditionalLexer(copy(lexer_conf)) - def lex(self, *args, **kwargs): - return self.lexer.lex(*args, **kwargs) + def lex(self, lexer_state, parser_state): + return self.lexer.lex(lexer_state, parser_state) + + __future_interface__ = True + +class CustomLexerOld(Lexer): + """ + Purpose of this custom lexer is to test the integration, + so it uses the traditionalparser as implementation without custom lexing behaviour. + """ + def __init__(self, lexer_conf): + self.lexer = TraditionalLexer(copy(lexer_conf)) + def lex(self, text): + ls = self.lexer.make_lexer_state(text) + return self.lexer.lex(ls, None) + + __future_interface__ = False def _tree_structure_check(a, b): """ @@ -851,12 +955,31 @@ class DualBytesLark: self.bytes_lark.load(f) def _make_parser_test(LEXER, PARSER): - lexer_class_or_name = CustomLexer if LEXER == 'custom' else LEXER + lexer_class_or_name = { + 'custom_new': CustomLexerNew, + 'custom_old': CustomLexerOld, + }.get(LEXER, LEXER) + def _Lark(grammar, **kwargs): return Lark(grammar, lexer=lexer_class_or_name, parser=PARSER, propagate_positions=True, **kwargs) def _Lark_open(gfilename, **kwargs): return Lark.open(gfilename, lexer=lexer_class_or_name, parser=PARSER, propagate_positions=True, **kwargs) + if (LEXER, PARSER) == ('standard', 'earley'): + # Check that the `lark.lark` grammar represents can parse every example used in these tests. 
+        # Standard-Earley was an arbitrary choice, to make sure it only ran once.
+        lalr_parser = Lark.open(os.path.join(os.path.dirname(lark.__file__), 'grammars/lark.lark'), parser='lalr')
+        def wrap_with_test_grammar(f):
+            def _f(x, **kwargs):
+                inst = f(x, **kwargs)
+                lalr_parser.parse(inst.source_grammar)  # Test the source grammar only after the instance was created successfully; grammars that are expected to fail are not checked.
+                return inst
+            return _f
+
+        _Lark = wrap_with_test_grammar(_Lark)
+        _Lark_open = wrap_with_test_grammar(_Lark_open)
+
+
     class _TestParser(unittest.TestCase):
         def test_basic1(self):
             g = _Lark("""start: a+ b a* "b" a*
@@ -1412,7 +1535,7 @@ def _make_parser_test(LEXER, PARSER):
             %s""" % (' '.join(tokens), '\n'.join("%s: %s"%x for x in tokens.items())))

     def test_float_without_lexer(self):
-        expected_error = UnexpectedCharacters if LEXER.startswith('dynamic') else UnexpectedToken
+        expected_error = UnexpectedCharacters if 'dynamic' in LEXER else UnexpectedToken
         if PARSER == 'cyk':
             expected_error = ParseError

@@ -1545,13 +1668,13 @@ def _make_parser_test(LEXER, PARSER):
         self.assertEqual(d.line, 2)
         self.assertEqual(d.column, 2)

-        if LEXER != 'dynamic':
-            self.assertEqual(a.end_line, 1)
-            self.assertEqual(a.end_column, 2)
-            self.assertEqual(bc.end_line, 2)
-            self.assertEqual(bc.end_column, 2)
-            self.assertEqual(d.end_line, 2)
-            self.assertEqual(d.end_column, 3)
+        # if LEXER != 'dynamic':
+        self.assertEqual(a.end_line, 1)
+        self.assertEqual(a.end_column, 2)
+        self.assertEqual(bc.end_line, 2)
+        self.assertEqual(bc.end_column, 2)
+        self.assertEqual(d.end_line, 2)
+        self.assertEqual(d.end_column, 3)

@@ -1782,7 +1905,7 @@ def _make_parser_test(LEXER, PARSER):
         """
         self.assertRaises(IOError, _Lark, grammar)

-    @unittest.skipIf(LEXER=='dynamic', "%declare/postlex doesn't work with dynamic")
+    @unittest.skipIf('dynamic' in LEXER, "%declare/postlex doesn't work with dynamic")
     def test_postlex_declare(self): # Note: this test does a lot. maybe split it up?
         class TestPostLexer:
             def process(self, stream):
@@ -1805,6 +1928,59 @@ def _make_parser_test(LEXER, PARSER):
             tree = parser.parse(test_file)
             self.assertEqual(tree.children, [Token('B', 'A')])

+    @unittest.skipIf('dynamic' in LEXER, "%declare/postlex doesn't work with dynamic")
+    def test_postlex_indenter(self):
+        class CustomIndenter(Indenter):
+            NL_type = 'NEWLINE'
+            OPEN_PAREN_types = []
+            CLOSE_PAREN_types = []
+            INDENT_type = 'INDENT'
+            DEDENT_type = 'DEDENT'
+            tab_len = 8
+
+        grammar = r"""
+        start: "a" NEWLINE INDENT "b" NEWLINE DEDENT
+
+        NEWLINE: ( /\r?\n */ )+
+
+        %ignore " "+
+        %declare INDENT DEDENT
+        """
+
+        parser = _Lark(grammar, postlex=CustomIndenter())
+        parser.parse("a\n b\n")
+
+    def test_import_custom_sources(self):
+        custom_loader = FromPackageLoader('tests', ('grammars', ))
+
+        grammar = """
+        start: startab
+
+        %import ab.startab
+        """
+
+        p = _Lark(grammar, import_paths=[custom_loader])
+        self.assertEqual(p.parse('ab'),
+                         Tree('start', [Tree('startab', [Tree('ab__expr', [Token('ab__A', 'a'), Token('ab__B', 'b')])])]))
+
+        grammar = """
+        start: rule_to_import
+
+        %import test_relative_import_of_nested_grammar__grammar_to_import.rule_to_import
+        """
+        p = _Lark(grammar, import_paths=[custom_loader])
+        x = p.parse('N')
+        self.assertEqual(next(x.find_data('rule_to_import')).children, ['N'])
+
+        custom_loader2 = FromPackageLoader('tests')
+        grammar = """
+        %import .test_relative_import (start, WS)
+        %ignore WS
+        """
+        p = _Lark(grammar, import_paths=[custom_loader2], source_path=__file__)  # import relative to current file
+        x = p.parse('12 capybaras')
+        self.assertEqual(x.children, ['12', 'capybaras'])
+
     @unittest.skipIf(PARSER == 'cyk', "Doesn't work for CYK")
     def test_prioritization(self):
         "Tests effect of priority on result"
@@ -1849,7 +2025,7 @@ def _make_parser_test(LEXER, PARSER):


-    @unittest.skipIf(PARSER != 'earley' or LEXER == 'standard', "Currently only Earley supports priority sum in rules")
+    @unittest.skipIf(PARSER != 'earley' or 'dynamic' not in LEXER, "Currently only Earley supports priority sum in rules")
     def test_prioritization_sum(self):
         "Tests effect of priority on result"

@@ -2060,9 +2236,9 @@ def _make_parser_test(LEXER, PARSER):
         self.assertEqual(tok, text)
         self.assertEqual(tok.line, 1)
         self.assertEqual(tok.column, 1)
-        if _LEXER != 'dynamic':
-            self.assertEqual(tok.end_line, 2)
-            self.assertEqual(tok.end_column, 6)
+        # if _LEXER != 'dynamic':
+        self.assertEqual(tok.end_line, 2)
+        self.assertEqual(tok.end_column, 6)

     @unittest.skipIf(PARSER=='cyk', "Empty rules")
     def test_empty_end(self):
@@ -2153,7 +2329,7 @@ def _make_parser_test(LEXER, PARSER):
         parser = _Lark(grammar)

-    @unittest.skipIf(PARSER!='lalr' or LEXER=='custom', "Serialize currently only works for LALR parsers without custom lexers (though it should be easy to extend)")
+    @unittest.skipIf(PARSER!='lalr' or 'custom' in LEXER, "Serialize currently only works for LALR parsers without custom lexers (though it should be easy to extend)")
     def test_serialize(self):
         grammar = """
         start: _ANY b "C"
@@ -2199,6 +2375,31 @@ def _make_parser_test(LEXER, PARSER):
         self.assertEqual(a.line, 1)
         self.assertEqual(b.line, 2)

+    @unittest.skipIf(PARSER=='cyk' or LEXER=='custom_old', "match_examples() not supported for CYK/old custom lexer")
+    def test_match_examples(self):
+        p = _Lark(r"""
+        start: "a" "b" "c"
+        """)
+
+        def match_error(s):
+            try:
+                _ = p.parse(s)
+            except UnexpectedInput as u:
+                return u.match_examples(p.parse, {
+                    0: ['abe'],
+                    1: ['ab'],
+                    2: ['cbc', 'dbc'],
+                })
+            assert False
+
+        assert match_error("abe") == 0
+        assert match_error("ab") == 1
+        assert match_error("bbc") == 2
+        assert match_error("cbc") == 2
+        self.assertEqual( match_error("dbc"), 2 )
+        self.assertEqual( match_error("ebc"), 2 )
+
+
     @unittest.skipIf(not regex or sys.version_info[0] == 2, 'Unicode and Python 2 do not place nicely together.')
     def test_unicode_class(self):
         "Tests that character classes from the `regex` module work correctly."
@@ -2257,17 +2458,21 @@ def _make_parser_test(LEXER, PARSER):
     _TestParser.__name__ = _NAME
     _TestParser.__qualname__ = "tests.test_parser." + _NAME
     globals()[_NAME] = _TestParser
+    __all__.append(_NAME)

-# Note: You still have to import them in __main__ for the tests to run
 _TO_TEST = [
         ('standard', 'earley'),
         ('standard', 'cyk'),
+        ('standard', 'lalr'),
+
         ('dynamic', 'earley'),
         ('dynamic_complete', 'earley'),
-        ('standard', 'lalr'),
+
         ('contextual', 'lalr'),
-        ('custom', 'lalr'),
-        # (None, 'earley'),
+
+        ('custom_new', 'lalr'),
+        ('custom_new', 'cyk'),
+        ('custom_old', 'earley'),
 ]

 for _LEXER, _PARSER in _TO_TEST: