
Merge pull request #959 from lark-parser/v1.0-merge-master

Erez Shinan committed 4 years ago (via GitHub)
commit b79c449dc3
20 changed files with 510 additions and 236 deletions
1. README.md (+1 -1)
2. docs/classes.rst (+2 -0)
3. docs/index.rst (+1 -1)
4. docs/json_tutorial.md (+4 -4)
5. docs/visitors.rst (+5 -0)
6. examples/advanced/python3.lark (+88 -53)
7. examples/standalone/json_parser_main.py (+3 -1)
8. lark/ast_utils.py (+2 -2)
9. lark/common.py (+12 -0)
10. lark/exceptions.py (+32 -13)
11. lark/lark.py (+15 -9)
12. lark/lexer.py (+74 -61)
13. lark/load_grammar.py (+123 -12)
14. lark/parse_tree_builder.py (+33 -28)
15. lark/parser_frontends.py (+9 -10)
16. lark/parsers/lalr_interactive_parser.py (+1 -1)
17. lark/parsers/lalr_parser.py (+2 -2)
18. lark/utils.py (+35 -17)
19. tests/test_grammar.py (+48 -1)
20. tests/test_parser.py (+20 -20)

README.md (+1 -1)

@@ -26,7 +26,7 @@ Most importantly, Lark will save you time and prevent you from getting parsing h


- [Documentation @readthedocs](https://lark-parser.readthedocs.io/)
- [Cheatsheet (PDF)](/docs/_static/lark_cheatsheet.pdf)
- [Online IDE (very basic)](https://lark-parser.github.io/lark/ide/app.html)
- [Online IDE](https://lark-parser.github.io/ide)
- [Tutorial](/docs/json_tutorial.md) for writing a JSON parser.
- Blog post: [How to write a DSL with Lark](http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/)
- [Gitter chat](https://gitter.im/lark-parser/Lobby)


docs/classes.rst (+2 -0)

@@ -66,6 +66,8 @@ UnexpectedInput


.. autoclass:: lark.exceptions.UnexpectedCharacters


.. autoclass:: lark.exceptions.UnexpectedEOF

InteractiveParser
-----------------




docs/index.rst (+1 -1)

@@ -113,7 +113,7 @@ Resources


.. _Examples: https://github.com/lark-parser/lark/tree/master/examples
.. _Third-party examples: https://github.com/ligurio/lark-grammars
.. _Online IDE: https://lark-parser.github.io/lark/ide/app.html
.. _Online IDE: https://lark-parser.github.io/ide
.. _How to write a DSL: http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/
.. _Program Synthesis is Possible: https://www.cs.cornell.edu/~asampson/blog/minisynth.html
.. _Cheatsheet (PDF): _static/lark_cheatsheet.pdf


docs/json_tutorial.md (+4 -4)

@@ -427,9 +427,9 @@ I measured memory consumption using a little script called [memusg](https://gist
| Lark - Earley *(with lexer)* | 42s | 4s | 1167M | 608M |
| Lark - LALR(1) | 8s | 1.53s | 453M | 266M |
| Lark - LALR(1) tree-less | 4.76s | 1.23s | 70M | 134M |
| PyParsing ([Parser](http://pyparsing.wikispaces.com/file/view/jsonParser.py)) | 32s | 3.53s | 443M | 225M |
| funcparserlib ([Parser](https://github.com/vlasovskikh/funcparserlib/blob/master/funcparserlib/tests/json.py)) | 8.5s | 1.3s | 483M | 293M |
| Parsimonious ([Parser](https://gist.githubusercontent.com/reclosedev/5222560/raw/5e97cf7eb62c3a3671885ec170577285e891f7d5/parsimonious_json.py)) | ? | 5.7s | ? | 1545M |
| PyParsing ([Parser](https://github.com/pyparsing/pyparsing/blob/master/examples/jsonParser.py)) | 32s | 3.53s | 443M | 225M |
| funcparserlib ([Parser](https://github.com/vlasovskikh/funcparserlib/blob/master/tests/json.py)) | 8.5s | 1.3s | 483M | 293M |
| Parsimonious ([Parser](https://gist.github.com/reclosedev/5222560)) | ? | 5.7s | ? | 1545M |




I added a few other parsers for comparison. PyParsing and funcparselib fair pretty well in their memory usage (they don't build a tree), but they can't compete with the run-time speed of LALR(1).
@@ -442,7 +442,7 @@ Once again, shout-out to PyPy for being so effective.


This is the end of the tutorial. I hoped you liked it and learned a little about Lark.


To see what else you can do with Lark, check out the [examples](examples).
To see what else you can do with Lark, check out the [examples](/examples).


For questions or any other subject, feel free to email me at erezshin at gmail dot com.



docs/visitors.rst (+5 -0)

@@ -107,3 +107,8 @@ Discard
-------


.. autoclass:: lark.visitors.Discard

VisitError
-------

.. autoclass:: lark.exceptions.VisitError

examples/advanced/python3.lark (+88 -53)

@@ -21,7 +21,7 @@ decorators: decorator+
decorated: decorators (classdef | funcdef | async_funcdef) decorated: decorators (classdef | funcdef | async_funcdef)


async_funcdef: "async" funcdef async_funcdef: "async" funcdef
funcdef: "def" NAME "(" parameters? ")" ["->" test] ":" suite
funcdef: "def" NAME "(" [parameters] ")" ["->" test] ":" suite


parameters: paramvalue ("," paramvalue)* ["," SLASH] ["," [starparams | kwparams]]
| starparams
@@ -29,25 +29,36 @@ parameters: paramvalue ("," paramvalue)* ["," SLASH] ["," [starparams | kwparams


SLASH: "/" // Otherwise the it will completely disappear and it will be undisguisable in the result SLASH: "/" // Otherwise the it will completely disappear and it will be undisguisable in the result
starparams: "*" typedparam? ("," paramvalue)* ["," kwparams] starparams: "*" typedparam? ("," paramvalue)* ["," kwparams]
kwparams: "**" typedparam
kwparams: "**" typedparam ","?


?paramvalue: typedparam ["=" test]
?typedparam: NAME [":" test]
?paramvalue: typedparam ("=" test)?
?typedparam: NAME (":" test)?


varargslist: (vfpdef ["=" test] ("," vfpdef ["=" test])* ["," [ "*" [vfpdef] ("," vfpdef ["=" test])* ["," ["**" vfpdef [","]]] | "**" vfpdef [","]]]
| "*" [vfpdef] ("," vfpdef ["=" test])* ["," ["**" vfpdef [","]]]
| "**" vfpdef [","])


vfpdef: NAME
lambdef: "lambda" [lambda_params] ":" test
lambdef_nocond: "lambda" [lambda_params] ":" test_nocond
lambda_params: lambda_paramvalue ("," lambda_paramvalue)* ["," [lambda_starparams | lambda_kwparams]]
| lambda_starparams
| lambda_kwparams
?lambda_paramvalue: NAME ("=" test)?
lambda_starparams: "*" [NAME] ("," lambda_paramvalue)* ["," [lambda_kwparams]]
lambda_kwparams: "**" NAME ","?



?stmt: simple_stmt | compound_stmt ?stmt: simple_stmt | compound_stmt
?simple_stmt: small_stmt (";" small_stmt)* [";"] _NEWLINE ?simple_stmt: small_stmt (";" small_stmt)* [";"] _NEWLINE
?small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt | import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
?expr_stmt: testlist_star_expr (annassign | augassign (yield_expr|testlist)
| ("=" (yield_expr|testlist_star_expr))*)
annassign: ":" test ["=" test]
?testlist_star_expr: (test|star_expr) ("," (test|star_expr))* [","]
!augassign: ("+=" | "-=" | "*=" | "@=" | "/=" | "%=" | "&=" | "|=" | "^=" | "<<=" | ">>=" | "**=" | "//=")
?small_stmt: (expr_stmt | assign_stmt | del_stmt | pass_stmt | flow_stmt | import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
expr_stmt: testlist_star_expr
assign_stmt: annassign | augassign | assign

annassign: testlist_star_expr ":" test ["=" test]
assign: testlist_star_expr ("=" (yield_expr|testlist_star_expr))+
augassign: testlist_star_expr augassign_op (yield_expr|testlist)
!augassign_op: "+=" | "-=" | "*=" | "@=" | "/=" | "%=" | "&=" | "|=" | "^=" | "<<=" | ">>=" | "**=" | "//="
?testlist_star_expr: test_or_star_expr
| test_or_star_expr ("," test_or_star_expr)+ ","? -> tuple
| test_or_star_expr "," -> tuple

// For normal and annotated assignments, additional restrictions enforced by the interpreter
del_stmt: "del" exprlist
pass_stmt: "pass"
@@ -71,43 +82,52 @@ global_stmt: "global" NAME ("," NAME)*
nonlocal_stmt: "nonlocal" NAME ("," NAME)* nonlocal_stmt: "nonlocal" NAME ("," NAME)*
assert_stmt: "assert" test ["," test] assert_stmt: "assert" test ["," test]


compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
?compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
async_stmt: "async" (funcdef | with_stmt | for_stmt) async_stmt: "async" (funcdef | with_stmt | for_stmt)
if_stmt: "if" test ":" suite ("elif" test ":" suite)* ["else" ":" suite]
if_stmt: "if" test ":" suite elifs ["else" ":" suite]
elifs: elif_*
elif_: "elif" test ":" suite
while_stmt: "while" test ":" suite ["else" ":" suite] while_stmt: "while" test ":" suite ["else" ":" suite]
for_stmt: "for" exprlist "in" testlist ":" suite ["else" ":" suite] for_stmt: "for" exprlist "in" testlist ":" suite ["else" ":" suite]
try_stmt: ("try" ":" suite ((except_clause ":" suite)+ ["else" ":" suite] ["finally" ":" suite] | "finally" ":" suite))
with_stmt: "with" with_item ("," with_item)* ":" suite
try_stmt: "try" ":" suite except_clauses ["else" ":" suite] [finally]
| "try" ":" suite finally -> try_finally
finally: "finally" ":" suite
except_clauses: except_clause+
except_clause: "except" [test ["as" NAME]] ":" suite

with_stmt: "with" with_items ":" suite
with_items: with_item ("," with_item)*
with_item: test ["as" expr] with_item: test ["as" expr]
// NB compile.c makes sure that the default except clause is last // NB compile.c makes sure that the default except clause is last
except_clause: "except" [test ["as" NAME]]
suite: simple_stmt | _NEWLINE _INDENT stmt+ _DEDENT suite: simple_stmt | _NEWLINE _INDENT stmt+ _DEDENT


?test: or_test ("if" or_test "else" test)? | lambdef
?test: or_test ("if" or_test "else" test)?
| lambdef
?test_nocond: or_test | lambdef_nocond ?test_nocond: or_test | lambdef_nocond
lambdef: "lambda" [varargslist] ":" test
lambdef_nocond: "lambda" [varargslist] ":" test_nocond

?or_test: and_test ("or" and_test)* ?or_test: and_test ("or" and_test)*
?and_test: not_test ("and" not_test)* ?and_test: not_test ("and" not_test)*
?not_test: "not" not_test -> not
?not_test: "not" not_test -> not_test
| comparison | comparison
?comparison: expr (_comp_op expr)*
?comparison: expr (comp_op expr)*
star_expr: "*" expr star_expr: "*" expr
?expr: xor_expr ("|" xor_expr)*

?expr: or_expr
?or_expr: xor_expr ("|" xor_expr)*
?xor_expr: and_expr ("^" and_expr)*
?and_expr: shift_expr ("&" shift_expr)*
?shift_expr: arith_expr (_shift_op arith_expr)*
?arith_expr: term (_add_op term)*
?term: factor (_mul_op factor)*
?factor: _factor_op factor | power
?factor: _unary_op factor | power


!_factor_op: "+"|"-"|"~"
!_unary_op: "+"|"-"|"~"
!_add_op: "+"|"-"
!_shift_op: "<<"|">>"
!_mul_op: "*"|"@"|"/"|"%"|"//"
// <> isn't actually a valid comparison operator in Python. It's here for the
// sake of a __future__ import described in PEP 401 (which really works :-)
!_comp_op: "<"|">"|"=="|">="|"<="|"<>"|"!="|"in"|"not" "in"|"is"|"is" "not"
!comp_op: "<"|">"|"=="|">="|"<="|"<>"|"!="|"in"|"not" "in"|"is"|"is" "not"


?power: await_expr ("**" factor)? ?power: await_expr ("**" factor)?
?await_expr: AWAIT? atom_expr ?await_expr: AWAIT? atom_expr
@@ -118,61 +138,75 @@ AWAIT: "await"
| atom_expr "." NAME -> getattr | atom_expr "." NAME -> getattr
| atom | atom


?atom: "(" [yield_expr|tuplelist_comp] ")" -> tuple
| "[" [testlist_comp] "]" -> list
| "{" [dict_comp] "}" -> dict
| "{" set_comp "}" -> set
?atom: "(" yield_expr ")"
| "(" _tuple_inner? ")" -> tuple
| "(" comprehension{test_or_star_expr} ")" -> tuple_comprehension
| "[" _testlist_comp? "]" -> list
| "[" comprehension{test_or_star_expr} "]" -> list_comprehension
| "{" _dict_exprlist? "}" -> dict
| "{" comprehension{key_value} "}" -> dict_comprehension
| "{" _set_exprlist "}" -> set
| "{" comprehension{test} "}" -> set_comprehension
| NAME -> var | NAME -> var
| number | string+
| number
| string_concat
| "(" test ")" | "(" test ")"
| "..." -> ellipsis | "..." -> ellipsis
| "None" -> const_none | "None" -> const_none
| "True" -> const_true | "True" -> const_true
| "False" -> const_false | "False" -> const_false


?testlist_comp: test | tuplelist_comp
tuplelist_comp: (test|star_expr) (comp_for | ("," (test|star_expr))+ [","] | ",")

?string_concat: string+

_testlist_comp: test | _tuple_inner
_tuple_inner: test_or_star_expr (("," test_or_star_expr)+ [","] | ",")

?test_or_star_expr: test
| star_expr

?subscriptlist: subscript ?subscriptlist: subscript
| subscript (("," subscript)+ [","] | ",") -> subscript_tuple | subscript (("," subscript)+ [","] | ",") -> subscript_tuple
subscript: test | ([test] ":" [test] [sliceop]) -> slice
?subscript: test | ([test] ":" [test] [sliceop]) -> slice
sliceop: ":" [test] sliceop: ":" [test]
exprlist: (expr|star_expr)
| (expr|star_expr) (("," (expr|star_expr))+ [","]|",") -> exprlist_tuple
testlist: test | testlist_tuple
?exprlist: (expr|star_expr)
| (expr|star_expr) (("," (expr|star_expr))+ [","]|",")
?testlist: test | testlist_tuple
testlist_tuple: test (("," test)+ [","] | ",") testlist_tuple: test (("," test)+ [","] | ",")
dict_comp: key_value comp_for
| (key_value | "**" expr) ("," (key_value | "**" expr))* [","]
_dict_exprlist: (key_value | "**" expr) ("," (key_value | "**" expr))* [","]


key_value: test ":" test key_value: test ":" test


set_comp: test comp_for
| (test|star_expr) ("," (test | star_expr))* [","]
_set_exprlist: test_or_star_expr ("," test_or_star_expr)* [","]


classdef: "class" NAME ["(" [arguments] ")"] ":" suite classdef: "class" NAME ["(" [arguments] ")"] ":" suite




arguments: argvalue ("," argvalue)* ("," [ starargs | kwargs])?
| starargs
| kwargs
| test comp_for
| comprehension{test}


starargs: "*" test ("," "*" test)* ("," argvalue)* ["," kwargs]
starargs: stararg ("," stararg)* ("," argvalue)* ["," kwargs]
stararg: "*" test
kwargs: "**" test kwargs: "**" test


?argvalue: test ("=" test)? ?argvalue: test ("=" test)?




comp_iter: comp_for | comp_if | async_for
async_for: "async" "for" exprlist "in" or_test [comp_iter]
comp_for: "for" exprlist "in" or_test [comp_iter]
comp_if: "if" test_nocond [comp_iter]
comprehension{comp_result}: comp_result comp_fors [comp_if]
comp_fors: comp_for+
comp_for: [ASYNC] "for" exprlist "in" or_test
ASYNC: "async"
?comp_if: "if" test_nocond


// not used in grammar, but may appear in "node" passed from Parser to Compiler
encoding_decl: NAME


yield_expr: "yield" [yield_arg]
yield_arg: "from" test | testlist

yield_expr: "yield" [testlist]
| "yield" "from" test -> yield_from


number: DEC_NUMBER | HEX_NUMBER | BIN_NUMBER | OCT_NUMBER | FLOAT_NUMBER | IMAG_NUMBER
string: STRING | LONG_STRING
@@ -181,6 +215,7 @@ string: STRING | LONG_STRING
%import python (NAME, COMMENT, STRING, LONG_STRING)
%import python (DEC_NUMBER, HEX_NUMBER, OCT_NUMBER, BIN_NUMBER, FLOAT_NUMBER, IMAG_NUMBER)



// Other terminals // Other terminals


_NEWLINE: ( /\r?\n[\t ]*/ | COMMENT )+ _NEWLINE: ( /\r?\n[\t ]*/ | COMMENT )+


examples/standalone/json_parser_main.py (+3 -1)

@@ -10,7 +10,9 @@ Standalone Parser


import sys


from json_parser import Lark_StandAlone, Transformer, inline_args
from json_parser import Lark_StandAlone, Transformer, v_args

inline_args = v_args(inline=True)


class TreeToJson(Transformer):
@inline_args


lark/ast_utils.py (+2 -2)

@@ -38,8 +38,8 @@ def create_transformer(ast_module: types.ModuleType, transformer: Optional[Trans
Classes starting with an underscore (`_`) will be skipped.


Parameters:
ast_module - A Python module containing all the subclasses of `ast_utils.Ast`
transformer (Optional[Transformer]) - An initial transformer. Its attributes may be overwritten.
ast_module: A Python module containing all the subclasses of ``ast_utils.Ast``
transformer (Optional[Transformer]): An initial transformer. Its attributes may be overwritten.
""" """
t = transformer or Transformer() t = transformer or Transformer()




lark/common.py (+12 -0)

@@ -1,4 +1,5 @@
from types import ModuleType
from copy import deepcopy


from .utils import Serialize
from .lexer import TerminalDef, Token
@@ -40,6 +41,17 @@ class LexerConf(Serialize):
def _deserialize(self):
self.terminals_by_name = {t.name: t for t in self.terminals}


def __deepcopy__(self, memo=None):
return type(self)(
deepcopy(self.terminals, memo),
self.re_module,
deepcopy(self.ignore, memo),
deepcopy(self.postlex, memo),
deepcopy(self.callbacks, memo),
deepcopy(self.g_regex_flags, memo),
deepcopy(self.skip_validation, memo),
deepcopy(self.use_bytes, memo),
)




class ParserConf(Serialize):


lark/exceptions.py (+32 -13)

@@ -41,8 +41,9 @@ class UnexpectedInput(LarkError):


Used as a base class for the following exceptions: Used as a base class for the following exceptions:


- ``UnexpectedToken``: The parser received an unexpected token
- ``UnexpectedCharacters``: The lexer encountered an unexpected string - ``UnexpectedCharacters``: The lexer encountered an unexpected string
- ``UnexpectedToken``: The parser received an unexpected token
- ``UnexpectedEOF``: The parser expected a token, but the input ended


After catching one of these exceptions, you may call the following helper methods to create a nicer error message.
"""
@@ -136,10 +137,13 @@ class UnexpectedInput(LarkError):




class UnexpectedEOF(ParseError, UnexpectedInput): class UnexpectedEOF(ParseError, UnexpectedInput):

"""An exception that is raised by the parser, when the input ends while it still expects a token.
"""
expected: 'List[Token]' expected: 'List[Token]'


def __init__(self, expected, state=None, terminals_by_name=None): def __init__(self, expected, state=None, terminals_by_name=None):
super(UnexpectedEOF, self).__init__()

self.expected = expected
self.state = state
from .lexer import Token
@@ -149,7 +153,6 @@ class UnexpectedEOF(ParseError, UnexpectedInput):
self.column = -1
self._terminals_by_name = terminals_by_name


super(UnexpectedEOF, self).__init__()


def __str__(self): def __str__(self):
message = "Unexpected end-of-input. " message = "Unexpected end-of-input. "
@@ -158,12 +161,17 @@ class UnexpectedEOF(ParseError, UnexpectedInput):




class UnexpectedCharacters(LexError, UnexpectedInput): class UnexpectedCharacters(LexError, UnexpectedInput):
"""An exception that is raised by the lexer, when it cannot match the next
string of characters to any of its terminals.
"""


allowed: Set[str] allowed: Set[str]
considered_tokens: Set[Any] considered_tokens: Set[Any]


def __init__(self, seq, lex_pos, line, column, allowed=None, considered_tokens=None, state=None, token_history=None,
terminals_by_name=None, considered_rules=None):
super(UnexpectedCharacters, self).__init__()

# TODO considered_tokens and allowed can be figured out using state
self.line = line
self.column = column
@@ -182,7 +190,6 @@ class UnexpectedCharacters(LexError, UnexpectedInput):
self.char = seq[lex_pos] self.char = seq[lex_pos]
self._context = self.get_context(seq) self._context = self.get_context(seq)


super(UnexpectedCharacters, self).__init__()


def __str__(self):
message = "No terminal matches '%s' in the current parser context, at line %d col %d" % (self.char, self.line, self.column)
@@ -198,10 +205,15 @@ class UnexpectedToken(ParseError, UnexpectedInput):
"""An exception that is raised by the parser, when the token it received """An exception that is raised by the parser, when the token it received
doesn't match any valid step forward. doesn't match any valid step forward.


The parser provides an interactive instance through `interactive_parser`,
which is initialized to the point of failture, and can be used for debugging and error handling.
Parameters:
token: The mismatched token
expected: The set of expected tokens
considered_rules: Which rules were considered, to deduce the expected tokens
state: A value representing the parser state. Do not rely on its value or type.
interactive_parser: An instance of ``InteractiveParser``, that is initialized to the point of failure,
and can be used for debugging and error handling.


see: ``InteractiveParser``.
Note: These parameters are available as attributes of the instance.
""" """


expected: Set[str] expected: Set[str]
@@ -209,6 +221,8 @@ class UnexpectedToken(ParseError, UnexpectedInput):
interactive_parser: 'InteractiveParser' interactive_parser: 'InteractiveParser'


def __init__(self, token, expected, considered_rules=None, state=None, interactive_parser=None, terminals_by_name=None, token_history=None): def __init__(self, token, expected, considered_rules=None, state=None, interactive_parser=None, terminals_by_name=None, token_history=None):
super(UnexpectedToken, self).__init__()
# TODO considered_rules and expected can be figured out using state
self.line = getattr(token, 'line', '?')
self.column = getattr(token, 'column', '?')
@@ -223,7 +237,6 @@ class UnexpectedToken(ParseError, UnexpectedInput):
self._terminals_by_name = terminals_by_name
self.token_history = token_history


super(UnexpectedToken, self).__init__()


@property @property
def accepts(self) -> Set[str]: def accepts(self) -> Set[str]:
@@ -245,18 +258,24 @@ class VisitError(LarkError):
"""VisitError is raised when visitors are interrupted by an exception """VisitError is raised when visitors are interrupted by an exception


It provides the following attributes for inspection: It provides the following attributes for inspection:
- obj: the tree node or token it was processing when the exception was raised
- orig_exc: the exception that cause it to fail

Parameters:
rule: the name of the visit rule that failed
obj: the tree-node or token that was being processed
orig_exc: the exception that cause it to fail

Note: These parameters are available as attributes
""" """


obj: 'Union[Tree, Token]'
orig_exc: Exception


def __init__(self, rule, obj, orig_exc): def __init__(self, rule, obj, orig_exc):
self.obj = obj
self.orig_exc = orig_exc

message = 'Error trying to process rule "%s":\n\n%s' % (rule, orig_exc)
super(VisitError, self).__init__(message)


self.rule = rule
self.obj = obj
self.orig_exc = orig_exc

###} ###}
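How the reworked hierarchy looks from the caller's side, as a minimal sketch (the grammar and inputs are illustrative assumptions, not taken from this diff):

    from lark import Lark
    from lark.exceptions import UnexpectedCharacters, UnexpectedEOF, UnexpectedToken

    parser = Lark('start: "a" "b"', parser='lalr')

    for text in ("ab", "ax", "a"):
        try:
            parser.parse(text)
        except UnexpectedCharacters as e:   # lexer error: no terminal matches
            print(text, "-> UnexpectedCharacters at", e.line, e.column)
        except UnexpectedEOF as e:          # input ended while a token was still expected
            print(text, "-> UnexpectedEOF, expected", e.expected)
        except UnexpectedToken as e:        # parser error: the token fits no valid step
            print(text, "-> UnexpectedToken", e.token.type, "expected", e.expected)

All three inherit from ``UnexpectedInput``, so a single ``except UnexpectedInput`` also works when the distinction doesn't matter.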

lark/lark.py (+15 -9)

@@ -79,7 +79,7 @@ class LarkOptions(Serialize):
Applies the transformer to every parse tree (equivalent to applying it after the parse, but faster)
propagate_positions
Propagates (line, column, end_line, end_column) attributes into all tree branches.
Accepts ``False``, ``True``, or "ignore_ws", which will trim the whitespace around your trees.
Accepts ``False``, ``True``, or a callable, which will filter which nodes to ignore when propagating.
maybe_placeholders maybe_placeholders
When ``True``, the ``[]`` operator returns ``None`` when not matched. When ``True``, the ``[]`` operator returns ``None`` when not matched.
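A minimal sketch of the new callable form (the grammar and filter below are illustrative assumptions): the callable receives each child node and returns whether it may contribute to its parent's propagated positions.

    from lark import Lark, Token

    def ignore_comments(child):
        # Only children for which this returns True are considered when
        # computing the parent's line/column/end_* meta attributes.
        return not (isinstance(child, Token) and child.type == 'COMMENT')

    parser = Lark(r'''
        start: COMMENT? NAME
        COMMENT: /#[^\n]*/
        %import common.CNAME -> NAME
        %import common.WS
        %ignore WS
    ''', parser='lalr', propagate_positions=ignore_comments)

    tree = parser.parse("# a comment\nvalue")
    print(tree.meta.line)   # 2 -- the leading COMMENT token was filtered out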


@@ -137,7 +137,7 @@ class LarkOptions(Serialize):
A List of either paths or loader functions to specify from where grammars are imported
source_path
Override the source of from where the grammar was loaded. Useful for relative imports and unconventional grammar loading
**=== End Options ===**
**=== End of Options ===**
""" """
if __doc__: if __doc__:
__doc__ += OPTIONS_DOC __doc__ += OPTIONS_DOC
@@ -195,7 +195,7 @@ class LarkOptions(Serialize):
assert_config(self.parser, ('earley', 'lalr', 'cyk', None)) assert_config(self.parser, ('earley', 'lalr', 'cyk', None))


if self.parser == 'earley' and self.transformer: if self.parser == 'earley' and self.transformer:
raise ConfigurationError('Cannot specify an embedded transformer when using the Earley algorithm.'
raise ConfigurationError('Cannot specify an embedded transformer when using the Earley algorithm. '
'Please use your transformer on the resulting parse tree, or use a different algorithm (i.e. LALR)') 'Please use your transformer on the resulting parse tree, or use a different algorithm (i.e. LALR)')


if o: if o:
@@ -484,11 +484,11 @@ class Lark(Serialize):
d = f
else:
d = pickle.load(f)
memo = d['memo']
memo_json = d['memo']
data = d['data'] data = d['data']


assert memo
memo = SerializeMemoizer.deserialize(memo, {'Rule': Rule, 'TerminalDef': TerminalDef}, {})
assert memo_json
memo = SerializeMemoizer.deserialize(memo_json, {'Rule': Rule, 'TerminalDef': TerminalDef}, {})
options = dict(data['options'])
if (set(kwargs) - _LOAD_ALLOWED_OPTIONS) & set(LarkOptions._defaults):
raise ConfigurationError("Some options are not allowed when loading a Parser: {}"
@@ -545,11 +545,11 @@ class Lark(Serialize):


Lark.open_from_package(__name__, "example.lark", ("grammars",), parser=...) Lark.open_from_package(__name__, "example.lark", ("grammars",), parser=...)
""" """
package = FromPackageLoader(package, search_paths)
full_path, text = package(None, grammar_path)
package_loader = FromPackageLoader(package, search_paths)
full_path, text = package_loader(None, grammar_path)
options.setdefault('source_path', full_path) options.setdefault('source_path', full_path)
options.setdefault('import_paths', []) options.setdefault('import_paths', [])
options['import_paths'].append(package)
options['import_paths'].append(package_loader)
return cls(text, **options) return cls(text, **options)


def __repr__(self): def __repr__(self):
@@ -560,6 +560,8 @@ class Lark(Serialize):
"""Only lex (and postlex) the text, without parsing it. Only relevant when lexer='standard' """Only lex (and postlex) the text, without parsing it. Only relevant when lexer='standard'


When dont_ignore=True, the lexer will return all tokens, even those marked for %ignore. When dont_ignore=True, the lexer will return all tokens, even those marked for %ignore.

:raises UnexpectedCharacters: In case the lexer cannot find a suitable match.
""" """
if not hasattr(self, 'lexer') or dont_ignore: if not hasattr(self, 'lexer') or dont_ignore:
lexer = self._build_lexer(dont_ignore) lexer = self._build_lexer(dont_ignore)
@@ -602,6 +604,10 @@ class Lark(Serialize):
If a transformer is supplied to ``__init__``, returns whatever is the
result of the transformation. Otherwise, returns a Tree instance.


:raises UnexpectedInput: On a parse error, one of these sub-exceptions will rise:
``UnexpectedCharacters``, ``UnexpectedToken``, or ``UnexpectedEOF``.
For convenience, these sub-exceptions also inherit from ``ParserError`` and ``LexerError``.

""" """
return self.parser.parse(text, start=start, on_error=on_error) return self.parser.parse(text, start=start, on_error=on_error)




lark/lexer.py (+74 -61)

@@ -158,20 +158,20 @@ class Token(str):


def __new__(cls, type_, value, start_pos=None, line=None, column=None, end_line=None, end_column=None, end_pos=None):
try:
self = super(Token, cls).__new__(cls, value)
inst = super(Token, cls).__new__(cls, value)
except UnicodeDecodeError: except UnicodeDecodeError:
value = value.decode('latin1') value = value.decode('latin1')
self = super(Token, cls).__new__(cls, value)
self.type = type_
self.start_pos = start_pos
self.value = value
self.line = line
self.column = column
self.end_line = end_line
self.end_column = end_column
self.end_pos = end_pos
return self
inst = super(Token, cls).__new__(cls, value)
inst.type = type_
inst.start_pos = start_pos
inst.value = value
inst.line = line
inst.column = column
inst.end_line = end_line
inst.end_column = end_column
inst.end_pos = end_pos
return inst


def update(self, type_: Optional[str]=None, value: Optional[Any]=None) -> 'Token': def update(self, type_: Optional[str]=None, value: Optional[Any]=None) -> 'Token':
return Token.new_borrow_pos( return Token.new_borrow_pos(
@@ -234,15 +234,13 @@ class LineCounter:




class UnlessCallback: class UnlessCallback:
def __init__(self, mres):
self.mres = mres
def __init__(self, scanner):
self.scanner = scanner


def __call__(self, t): def __call__(self, t):
for mre, type_from_index in self.mres:
m = mre.match(t.value)
if m:
t.type = type_from_index[m.lastindex]
break
res = self.scanner.match(t.value, 0)
if res:
_value, t.type = res
return t return t




@@ -257,6 +255,11 @@ class CallChain:
return self.callback2(t) if self.cond(t2) else t2 return self.callback2(t) if self.cond(t2) else t2




def _get_match(re_, regexp, s, flags):
m = re_.match(regexp, s, flags)
if m:
return m.group(0)

def _create_unless(terminals, g_regex_flags, re_, use_bytes):
tokens_by_type = classify(terminals, lambda t: type(t.pattern))
assert len(tokens_by_type) <= 2, tokens_by_type.keys()
@@ -268,40 +271,54 @@ def _create_unless(terminals, g_regex_flags, re_, use_bytes):
if strtok.priority > retok.priority:
continue
s = strtok.pattern.value
m = re_.match(retok.pattern.to_regexp(), s, g_regex_flags)
if m and m.group(0) == s:
if s == _get_match(re_, retok.pattern.to_regexp(), s, g_regex_flags):
unless.append(strtok)
if strtok.pattern.flags <= retok.pattern.flags:
embedded_strs.add(strtok)
if unless:
callback[retok.name] = UnlessCallback(build_mres(unless, g_regex_flags, re_, match_whole=True, use_bytes=use_bytes))

terminals = [t for t in terminals if t not in embedded_strs]
return terminals, callback


def _build_mres(terminals, max_size, g_regex_flags, match_whole, re_, use_bytes):
# Python sets an unreasonable group limit (currently 100) in its re module
# Worse, the only way to know we reached it is by catching an AssertionError!
# This function recursively tries less and less groups until it's successful.
postfix = '$' if match_whole else ''
mres = []
while terminals:
pattern = u'|'.join(u'(?P<%s>%s)' % (t.name, t.pattern.to_regexp() + postfix) for t in terminals[:max_size])
if use_bytes:
pattern = pattern.encode('latin-1')
try:
mre = re_.compile(pattern, g_regex_flags)
except AssertionError: # Yes, this is what Python provides us.. :/
return _build_mres(terminals, max_size//2, g_regex_flags, match_whole, re_, use_bytes)
callback[retok.name] = UnlessCallback(Scanner(unless, g_regex_flags, re_, match_whole=True, use_bytes=use_bytes))


mres.append((mre, {i: n for n, i in mre.groupindex.items()}))
terminals = terminals[max_size:]
return mres
new_terminals = [t for t in terminals if t not in embedded_strs]
return new_terminals, callback




def build_mres(terminals, g_regex_flags, re_, use_bytes, match_whole=False):
return _build_mres(terminals, len(terminals), g_regex_flags, match_whole, re_, use_bytes)

class Scanner:
def __init__(self, terminals, g_regex_flags, re_, use_bytes, match_whole=False):
self.terminals = terminals
self.g_regex_flags = g_regex_flags
self.re_ = re_
self.use_bytes = use_bytes
self.match_whole = match_whole

self.allowed_types = {t.name for t in self.terminals}

self._mres = self._build_mres(terminals, len(terminals))

def _build_mres(self, terminals, max_size):
# Python sets an unreasonable group limit (currently 100) in its re module
# Worse, the only way to know we reached it is by catching an AssertionError!
# This function recursively tries less and less groups until it's successful.
postfix = '$' if self.match_whole else ''
mres = []
while terminals:
pattern = u'|'.join(u'(?P<%s>%s)' % (t.name, t.pattern.to_regexp() + postfix) for t in terminals[:max_size])
if self.use_bytes:
pattern = pattern.encode('latin-1')
try:
mre = self.re_.compile(pattern, self.g_regex_flags)
except AssertionError: # Yes, this is what Python provides us.. :/
return self._build_mres(terminals, max_size//2)

mres.append((mre, {i: n for n, i in mre.groupindex.items()}))
terminals = terminals[max_size:]
return mres

def match(self, text, pos):
for mre, type_from_index in self._mres:
m = mre.match(text, pos)
if m:
return m.group(0), type_from_index[m.lastindex]
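The idea behind ``Scanner`` in a stripped-down form (plain ``re``, not lark's API): every terminal becomes a named group in one alternation, and the group that matched identifies the terminal. The terminal names and patterns below are made up for illustration.

    import re

    terminals = {'NUMBER': r'\d+', 'NAME': r'[a-zA-Z_]\w*', 'PLUS': r'\+'}
    mre = re.compile('|'.join('(?P<%s>%s)' % kv for kv in terminals.items()))

    m = mre.match('foo+1', 0)
    print(m.group(0), m.lastgroup)   # foo NAME

Scanner additionally splits the terminals into several such regexes when Python's ``re`` group limit is hit, and maps ``m.lastindex`` back to the terminal name.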




def _regexp_has_newline(r): def _regexp_has_newline(r):
@@ -390,9 +407,9 @@ class TraditionalLexer(Lexer):
self.use_bytes = conf.use_bytes self.use_bytes = conf.use_bytes
self.terminals_by_name = conf.terminals_by_name self.terminals_by_name = conf.terminals_by_name


self._mres = None
self._scanner = None


def _build(self) -> None:
def _build_scanner(self):
terminals, self.callback = _create_unless(self.terminals, self.g_regex_flags, self.re, self.use_bytes)
assert all(self.callback.values())


@@ -403,20 +420,16 @@ class TraditionalLexer(Lexer):
else: else:
self.callback[type_] = f self.callback[type_] = f


self._mres = build_mres(terminals, self.g_regex_flags, self.re, self.use_bytes)
self._scanner = Scanner(terminals, self.g_regex_flags, self.re, self.use_bytes)


@property @property
def mres(self) -> List[Tuple[REPattern, Dict[int, str]]]:
if self._mres is None:
self._build()
assert self._mres is not None
return self._mres

def match(self, text: str, pos: int) -> Optional[Tuple[str, str]]:
for mre, type_from_index in self.mres:
m = mre.match(text, pos)
if m:
return m.group(0), type_from_index[m.lastindex]
def scanner(self):
if self._scanner is None:
self._build_scanner()
return self._scanner

def match(self, text, pos):
return self.scanner.match(text, pos)


def lex(self, state: LexerState, parser_state: Any) -> Iterator[Token]:
with suppress(EOFError):
@@ -428,7 +441,7 @@ class TraditionalLexer(Lexer):
while line_ctr.char_pos < len(lex_state.text):
res = self.match(lex_state.text, line_ctr.char_pos)
if not res:
allowed = {v for m, tfi in self.mres for v in tfi.values()} - self.ignore_types
allowed = self.scanner.allowed_types - self.ignore_types
if not allowed:
allowed = {"<END-OF-FILE>"}
raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column,


lark/load_grammar.py (+123 -12)

@@ -10,7 +10,7 @@ from numbers import Integral
from contextlib import suppress
from typing import List, Tuple, Union, Callable, Dict, Optional


from .utils import bfs, logger, classify_bool, is_id_continue, is_id_start, bfs_all_unique
from .utils import bfs, logger, classify_bool, is_id_continue, is_id_start, bfs_all_unique, small_factors
from .lexer import Token, TerminalDef, PatternStr, PatternRE from .lexer import Token, TerminalDef, PatternStr, PatternRE


from .parse_tree_builder import ParseTreeBuilder from .parse_tree_builder import ParseTreeBuilder
@@ -176,27 +176,136 @@ RULES = {
} }




# Value 5 keeps the number of states in the lalr parser somewhat minimal
# It isn't optimal, but close to it. See PR #949
SMALL_FACTOR_THRESHOLD = 5
# The threshold for whether a repeat via ~ is split up into different rules
# 50 is chosen since it keeps the number of states low and therefore lalr analysis time low,
# while not being too overaggressive and unnecessarily creating rules that might create shift/reduce conflicts.
# (See PR #949)
REPEAT_BREAK_THRESHOLD = 50


@inline_args
class EBNF_to_BNF(Transformer_InPlace):
def __init__(self):
self.new_rules = []
self.rules_by_expr = {}
self.rules_cache = {}
self.prefix = 'anon'
self.i = 0
self.rule_options = None


def _add_recurse_rule(self, type_, expr):
if expr in self.rules_by_expr:
return self.rules_by_expr[expr]

new_name = '__%s_%s_%d' % (self.prefix, type_, self.i)
def _name_rule(self, inner):
new_name = '__%s_%s_%d' % (self.prefix, inner, self.i)
self.i += 1 self.i += 1
t = NonTerminal(new_name)
tree = ST('expansions', [ST('expansion', [expr]), ST('expansion', [t, expr])])
self.new_rules.append((new_name, tree, self.rule_options))
self.rules_by_expr[expr] = t
return new_name

def _add_rule(self, key, name, expansions):
t = NonTerminal(name)
self.new_rules.append((name, expansions, self.rule_options))
self.rules_cache[key] = t
return t return t


def _add_recurse_rule(self, type_, expr):
try:
return self.rules_cache[expr]
except KeyError:
new_name = self._name_rule(type_)
t = NonTerminal(new_name)
tree = ST('expansions', [
ST('expansion', [expr]),
ST('expansion', [t, expr])
])
return self._add_rule(expr, new_name, tree)

def _add_repeat_rule(self, a, b, target, atom):
"""Generate a rule that repeats target ``a`` times, and repeats atom ``b`` times.

When called recursively (into target), it repeats atom for x(n) times, where:
x(0) = 1
x(n) = a(n) * x(n-1) + b

Example rule when a=3, b=4:

new_rule: target target target atom atom atom atom

"""
key = (a, b, target, atom)
try:
return self.rules_cache[key]
except KeyError:
new_name = self._name_rule('repeat_a%d_b%d' % (a, b))
tree = ST('expansions', [ST('expansion', [target] * a + [atom] * b)])
return self._add_rule(key, new_name, tree)

def _add_repeat_opt_rule(self, a, b, target, target_opt, atom):
"""Creates a rule that matches atom 0 to (a*n+b)-1 times.

When target matches n times atom, and target_opt 0 to n-1 times target_opt,

First we generate target * i followed by target_opt, for i from 0 to a-1
These match 0 to n*a - 1 times atom

Then we generate target * a followed by atom * i, for i from 0 to b-1
These match n*a to n*a + b-1 times atom

The created rule will not have any shift/reduce conflicts so that it can be used with lalr

Example rule when a=3, b=4:

new_rule: target_opt
| target target_opt
| target target target_opt

| target target target
| target target target atom
| target target target atom atom
| target target target atom atom atom

"""
key = (a, b, target, atom, "opt")
try:
return self.rules_cache[key]
except KeyError:
new_name = self._name_rule('repeat_a%d_b%d_opt' % (a, b))
tree = ST('expansions', [
ST('expansion', [target]*i + [target_opt]) for i in range(a)
] + [
ST('expansion', [target]*a + [atom]*i) for i in range(b)
])
return self._add_rule(key, new_name, tree)

def _generate_repeats(self, rule, mn, mx):
"""Generates a rule tree that repeats ``rule`` exactly between ``mn`` to ``mx`` times.
"""
# For a small number of repeats, we can take the naive approach
if mx < REPEAT_BREAK_THRESHOLD:
return ST('expansions', [ST('expansion', [rule] * n) for n in range(mn, mx + 1)])

# For large repeat values, we break the repetition into sub-rules.
# We treat ``rule~mn..mx`` as ``rule~mn rule~0..(diff=mx-mn)``.
# We then use small_factors to split mn and diff up into values [(a, b), ...]
# These values are used with the help of _add_repeat_rule and _add_repeat_opt_rule
# to generate a complete rule/expression that matches the corresponding number of repeats
mn_target = rule
for a, b in small_factors(mn, SMALL_FACTOR_THRESHOLD):
mn_target = self._add_repeat_rule(a, b, mn_target, rule)
if mx == mn:
return mn_target

diff = mx - mn + 1 # We add one because _add_repeat_opt_rule generates rules that match one less
diff_factors = small_factors(diff, SMALL_FACTOR_THRESHOLD)
diff_target = rule # Match rule 1 times
diff_opt_target = ST('expansion', []) # match rule 0 times (e.g. up to 1 -1 times)
for a, b in diff_factors[:-1]:
diff_opt_target = self._add_repeat_opt_rule(a, b, diff_target, diff_opt_target, rule)
diff_target = self._add_repeat_rule(a, b, diff_target, rule)

a, b = diff_factors[-1]
diff_opt_target = self._add_repeat_opt_rule(a, b, diff_target, diff_opt_target, rule)

return ST('expansions', [ST('expansion', [mn_target] + [diff_opt_target])])

def expr(self, rule, op, *args):
if op.value == '?':
empty = ST('expansion', [])
@@ -221,7 +330,9 @@ class EBNF_to_BNF(Transformer_InPlace):
mn, mx = map(int, args)
if mx < mn or mn < 0:
raise GrammarError("Bad Range for %s (%d..%d isn't allowed)" % (rule, mn, mx))
return ST('expansions', [ST('expansion', [rule] * n) for n in range(mn, mx+1)])

return self._generate_repeats(rule, mn, mx)

assert False, op assert False, op


def maybe(self, rule):
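The effect of the new expansion is easiest to see from the grammar side; a small sketch in the spirit of the tests added in this PR (the grammar itself is an assumption):

    from lark import Lark
    from lark.exceptions import UnexpectedInput

    # "A"~15..100 is now decomposed into helper rules via small_factors(),
    # instead of one alternative per possible repeat count.
    parser = Lark('start: "A"~15..100', parser='lalr')

    print(parser.parse("A" * 40))    # Tree('start', []) -- anonymous tokens are filtered out
    try:
        parser.parse("A" * 14)       # one "A" short of the allowed range
    except UnexpectedInput as e:
        print("rejected:", type(e).__name__)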


lark/parse_tree_builder.py (+33 -28)

@@ -22,54 +22,59 @@ class ExpandSingleChild:




class PropagatePositions: class PropagatePositions:
def __init__(self, node_builder):
def __init__(self, node_builder, node_filter=None):
self.node_builder = node_builder self.node_builder = node_builder
self.node_filter = node_filter


def __call__(self, children):
res = self.node_builder(children)


# local reference to Tree.meta reduces number of presence checks
if isinstance(res, Tree): if isinstance(res, Tree):
res_meta = res.meta
# Calculate positions while the tree is streaming, according to the rule:
# - nodes start at the start of their first child's container,
# and end at the end of their last child's container.
# Containers are nodes that take up space in text, but have been inlined in the tree.


src_meta = self._pp_get_meta(children)
if src_meta is not None:
res_meta.line = src_meta.line
res_meta.column = src_meta.column
res_meta.start_pos = src_meta.start_pos
res_meta.empty = False
res_meta = res.meta


src_meta = self._pp_get_meta(reversed(children))
if src_meta is not None:
res_meta.end_line = src_meta.end_line
res_meta.end_column = src_meta.end_column
res_meta.end_pos = src_meta.end_pos
res_meta.empty = False
first_meta = self._pp_get_meta(children)
if first_meta is not None:
if not hasattr(res_meta, 'line'):
# meta was already set, probably because the rule has been inlined (e.g. `?rule`)
res_meta.line = getattr(first_meta, 'container_line', first_meta.line)
res_meta.column = getattr(first_meta, 'container_column', first_meta.column)
res_meta.start_pos = getattr(first_meta, 'container_start_pos', first_meta.start_pos)
res_meta.empty = False

res_meta.container_line = getattr(first_meta, 'container_line', first_meta.line)
res_meta.container_column = getattr(first_meta, 'container_column', first_meta.column)

last_meta = self._pp_get_meta(reversed(children))
if last_meta is not None:
if not hasattr(res_meta, 'end_line'):
res_meta.end_line = getattr(last_meta, 'container_end_line', last_meta.end_line)
res_meta.end_column = getattr(last_meta, 'container_end_column', last_meta.end_column)
res_meta.end_pos = getattr(last_meta, 'container_end_pos', last_meta.end_pos)
res_meta.empty = False

res_meta.container_end_line = getattr(last_meta, 'container_end_line', last_meta.end_line)
res_meta.container_end_column = getattr(last_meta, 'container_end_column', last_meta.end_column)


return res return res


def _pp_get_meta(self, children):
for c in children:
if self.node_filter is not None and not self.node_filter(c):
continue
if isinstance(c, Tree):
if not c.meta.empty:
return c.meta
elif isinstance(c, Token):
return c


class PropagatePositions_IgnoreWs(PropagatePositions):
def _pp_get_meta(self, children):
for c in children:
if isinstance(c, Tree):
if not c.meta.empty:
return c.meta
elif isinstance(c, Token):
if c and not c.isspace(): # Disregard whitespace-only tokens
return c


def make_propagate_positions(option): def make_propagate_positions(option):
if option == "ignore_ws":
return PropagatePositions_IgnoreWs
if callable(option):
return partial(PropagatePositions, node_filter=option)
elif option is True:
return PropagatePositions
elif option is False:


lark/parser_frontends.py (+9 -10)

@@ -39,8 +39,7 @@ class MakeParsingFrontend:
lexer_conf.lexer_type = self.lexer_type
return ParsingFrontend(lexer_conf, parser_conf, options)


@classmethod
def deserialize(cls, data, memo, lexer_conf, callbacks, options):
def deserialize(self, data, memo, lexer_conf, callbacks, options):
parser_conf = ParserConf.deserialize(data['parser_conf'], memo)
parser = LALR_Parser.deserialize(data['parser'], memo, callbacks, options.debug)
parser_conf.callbacks = callbacks
@@ -92,26 +91,26 @@ class ParsingFrontend(Serialize):
def _verify_start(self, start=None):
if start is None:
start = self.parser_conf.start
if len(start) > 1:
raise ConfigurationError("Lark initialized with more than 1 possible start rule. Must specify which start rule to parse", start)
start ,= start
start_decls = self.parser_conf.start
if len(start_decls) > 1:
raise ConfigurationError("Lark initialized with more than 1 possible start rule. Must specify which start rule to parse", start_decls)
start ,= start_decls
elif start not in self.parser_conf.start:
raise ConfigurationError("Unknown start rule %s. Must be one of %r" % (start, self.parser_conf.start))
return start


def parse(self, text, start=None, on_error=None): def parse(self, text, start=None, on_error=None):
start = self._verify_start(start)
chosen_start = self._verify_start(start)
stream = text if self.skip_lexer else LexerThread(self.lexer, text) stream = text if self.skip_lexer else LexerThread(self.lexer, text)
kw = {} if on_error is None else {'on_error': on_error} kw = {} if on_error is None else {'on_error': on_error}
return self.parser.parse(stream, start, **kw)
return self.parser.parse(stream, chosen_start, **kw)
def parse_interactive(self, text=None, start=None): def parse_interactive(self, text=None, start=None):
start = self._verify_start(start)
chosen_start = self._verify_start(start)
if self.parser_conf.parser_type != 'lalr':
raise ConfigurationError("parse_interactive() currently only works with parser='lalr' ")
stream = text if self.skip_lexer else LexerThread(self.lexer, text)
return self.parser.parse_interactive(stream, start)
return self.parser.parse_interactive(stream, chosen_start)




def get_frontend(parser, lexer): def get_frontend(parser, lexer):


lark/parsers/lalr_interactive_parser.py (+1 -1)

@@ -65,7 +65,7 @@ class InteractiveParser(object):
"""Print the output of ``choices()`` in a way that's easier to read.""" """Print the output of ``choices()`` in a way that's easier to read."""
out = ["Parser choices:"] out = ["Parser choices:"]
for k, v in self.choices().items(): for k, v in self.choices().items():
out.append('\t- %s -> %s' % (k, v))
out.append('\t- %s -> %r' % (k, v))
out.append('stack size: %s' % len(self.parser_state.state_stack))
return '\n'.join(out)




lark/parsers/lalr_parser.py (+2 -2)

@@ -178,8 +178,8 @@ class _Parser(object):
for token in state.lexer.lex(state):
state.feed_token(token)


token = Token.new_borrow_pos('$END', '', token) if token else Token('$END', '', 0, 1, 1)
return state.feed_token(token, True)
end_token = Token.new_borrow_pos('$END', '', token) if token else Token('$END', '', 0, 1, 1)
return state.feed_token(end_token, True)
except UnexpectedInput as e:
try:
e.interactive_parser = InteractiveParser(self, state, state.lexer)


lark/utils.py (+35 -17)

@@ -61,14 +61,13 @@ class Serialize(object):
fields = getattr(self, '__serialize_fields__')
res = {f: _serialize(getattr(self, f), memo) for f in fields}
res['__type__'] = type(self).__name__
postprocess = getattr(self, '_serialize', None)
if postprocess:
postprocess(res, memo)
if hasattr(self, '_serialize'):
self._serialize(res, memo)
return res return res


@classmethod @classmethod
def deserialize(cls, data, memo): def deserialize(cls, data, memo):
namespace = getattr(cls, '__serialize_namespace__', {})
namespace = getattr(cls, '__serialize_namespace__', [])
namespace = {c.__name__:c for c in namespace} namespace = {c.__name__:c for c in namespace}


fields = getattr(cls, '__serialize_fields__') fields = getattr(cls, '__serialize_fields__')
@@ -82,9 +81,10 @@ class Serialize(object):
setattr(inst, f, _deserialize(data[f], namespace, memo))
except KeyError as e:
raise KeyError("Cannot find key for class", cls, e)
postprocess = getattr(inst, '_deserialize', None)
if postprocess:
postprocess()

if hasattr(inst, '_deserialize'):
inst._deserialize()

return inst return inst




@@ -163,7 +163,7 @@ def get_regexp_width(expr):
return 1, sre_constants.MAXREPEAT
else:
return 0, sre_constants.MAXREPEAT
###}




@@ -198,14 +198,6 @@ def dedup_list(l):
return [x for x in l if not (x in dedup or dedup.add(x))] return [x for x in l if not (x in dedup or dedup.add(x))]




def compare(a, b):
if a == b:
return 0
elif a > b:
return 1
return -1


class Enumerator(Serialize):
def __init__(self):
self.enums = {}
@@ -253,7 +245,7 @@ except ImportError:


class FS:
exists = os.path.exists
@staticmethod
def open(name, mode="r", **kwargs):
if atomicwrites and "w" in mode:
@@ -324,3 +316,29 @@ def _serialize(value, memo):
return {key:_serialize(elem, memo) for key, elem in value.items()}
# assert value is None or isinstance(value, (int, float, str, tuple)), value
return value




def small_factors(n, max_factor):
"""
Splits n up into smaller factors and summands <= max_factor.
Returns a list of [(a, b), ...]
so that the following code returns n:

n = 1
for a, b in values:
n = n * a + b

Currently, we also keep a + b <= max_factor, but that might change
"""
assert n >= 0
assert max_factor > 2
if n <= max_factor:
return [(n, 0)]

for a in range(max_factor, 1, -1):
r, b = divmod(n, a)
if a + b <= max_factor:
return small_factors(r, max_factor) + [(a, b)]
assert False, "Failed to factorize %s" % n
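A quick way to sanity-check the documented invariant: folding the returned (a, b) pairs left-to-right must reconstruct n. This assumes the ``small_factors`` defined above is importable from ``lark.utils`` on this branch.

    from lark.utils import small_factors

    def rebuild(pairs):
        n = 1
        for a, b in pairs:
            n = n * a + b
        return n

    for n in (7, 50, 8191):
        pairs = small_factors(n, 5)   # 5 == SMALL_FACTOR_THRESHOLD in load_grammar.py
        assert rebuild(pairs) == n, (n, pairs)
        print(n, pairs)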

tests/test_grammar.py (+48 -1)

@@ -3,7 +3,7 @@ from __future__ import absolute_import
import sys
from unittest import TestCase, main


from lark import Lark, Token, Tree
from lark import Lark, Token, Tree, ParseError, UnexpectedInput
from lark.load_grammar import GrammarError, GRAMMAR_ERRORS, find_grammar_errors
from lark.load_grammar import FromPackageLoader


@@ -198,6 +198,53 @@ class TestGrammar(TestCase):
x = find_grammar_errors(text)
assert [e.line for e, _s in find_grammar_errors(text)] == [2, 6]


def test_ranged_repeat_terms(self):
g = u"""!start: AAA
AAA: "A"~3
"""
l = Lark(g, parser='lalr')
self.assertEqual(l.parse(u'AAA'), Tree('start', ["AAA"]))
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AA')
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAAA')

g = u"""!start: AABB CC
AABB: "A"~0..2 "B"~2
CC: "C"~1..2
"""
l = Lark(g, parser='lalr')
self.assertEqual(l.parse(u'AABBCC'), Tree('start', ['AABB', 'CC']))
self.assertEqual(l.parse(u'BBC'), Tree('start', ['BB', 'C']))
self.assertEqual(l.parse(u'ABBCC'), Tree('start', ['ABB', 'CC']))
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAAB')
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAABBB')
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'ABB')
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAAABB')

def test_ranged_repeat_large(self):
g = u"""!start: "A"~60
"""
l = Lark(g, parser='lalr')
self.assertGreater(len(l.rules), 1, "Expected that more than one rule will be generated")
self.assertEqual(l.parse(u'A' * 60), Tree('start', ["A"] * 60))
self.assertRaises(ParseError, l.parse, u'A' * 59)
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'A' * 61)

g = u"""!start: "A"~15..100
"""
l = Lark(g, parser='lalr')
for i in range(0, 110):
if 15 <= i <= 100:
self.assertEqual(l.parse(u'A' * i), Tree('start', ['A']*i))
else:
self.assertRaises(UnexpectedInput, l.parse, u'A' * i)

# 8191 is a Mersenne prime
g = u"""start: "A"~8191
"""
l = Lark(g, parser='lalr')
self.assertEqual(l.parse(u'A' * 8191), Tree('start', []))
self.assertRaises(UnexpectedInput, l.parse, u'A' * 8190)
self.assertRaises(UnexpectedInput, l.parse, u'A' * 8192)




if __name__ == '__main__':


tests/test_parser.py (+20 -20)

@@ -94,6 +94,26 @@ class TestParsers(unittest.TestCase):
r = g.parse('a')
self.assertEqual( r.children[0].meta.line, 1 )


def test_propagate_positions2(self):
g = Lark("""start: a
a: b
?b: "(" t ")"
!t: "t"
""", propagate_positions=True)

start = g.parse("(t)")
a ,= start.children
t ,= a.children
assert t.children[0] == "t"

assert t.meta.column == 2
assert t.meta.end_column == 3

assert start.meta.column == a.meta.column == 1
assert start.meta.end_column == a.meta.end_column == 4



def test_expand1(self): def test_expand1(self):


g = Lark("""start: a g = Lark("""start: a
@@ -2183,27 +2203,7 @@ def _make_parser_test(LEXER, PARSER):
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAAABB') self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAAABB')




def test_ranged_repeat_terms(self):
g = u"""!start: AAA
AAA: "A"~3
"""
l = _Lark(g)
self.assertEqual(l.parse(u'AAA'), Tree('start', ["AAA"]))
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AA')
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAAA')


g = u"""!start: AABB CC
AABB: "A"~0..2 "B"~2
CC: "C"~1..2
"""
l = _Lark(g)
self.assertEqual(l.parse(u'AABBCC'), Tree('start', ['AABB', 'CC']))
self.assertEqual(l.parse(u'BBC'), Tree('start', ['BB', 'C']))
self.assertEqual(l.parse(u'ABBCC'), Tree('start', ['ABB', 'CC']))
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAAB')
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAABBB')
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'ABB')
self.assertRaises((ParseError, UnexpectedInput), l.parse, u'AAAABB')


@unittest.skipIf(PARSER=='earley', "Priority not handled correctly right now") # TODO XXX
def test_priority_vs_embedded(self):

