API reference#
The API surface is intentionally minimal. The package provides a simple token class, a couple of exceptions, and the main TokenStream abstraction. There are no third-party dependencies.
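As a quick orientation before the detailed reference, here is a minimal sketch of the typical workflow. It assumes the public names are re-exported from the top-level tokenstream package, which is how the examples below use them.
from tokenstream import TokenStream, InvalidSyntax

stream = TokenStream("hello world")
with stream.syntax(word=r"[a-z]+"):
    try:
        words = [token.value for token in stream.collect("word")]  # ['hello', 'world']
    except InvalidSyntax as exc:
        print(exc.format("example.txt"))  # report the error with its source location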
TokenStream#
- class tokenstream.stream.TokenStream(source, preprocessor=None, regex_module=re)#
A versatile token stream for handwritten parsers.
The stream is iterable and will yield all the extracted tokens one after the other.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"\w+"): ... print([token.value for token in stream]) ['hello', 'world']
- source#
The input string.
>>> stream = TokenStream("hello world") >>> stream.source 'hello world'
- Type
str
- preprocessor#
A preprocessor that will emit source location mappings for the transformed input.
- Type
Optional[Callable[[str], tuple[str, Sequence[tokenstream.location.SourceLocation], Sequence[tokenstream.location.SourceLocation]]]]
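For illustration, here is a minimal sketch of a preprocessor that matches the declared signature: an identity transform with empty location mappings (the name identity_preprocessor is just for this example).
def identity_preprocessor(source: str):
    # Return the transformed input plus the input/output source location mappings.
    # An identity transform changes nothing, so both mapping sequences are empty.
    return source, [], []

stream = TokenStream("hello world", preprocessor=identity_preprocessor)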
- syntax_rules#
A tuple of (token_type, pattern) pairs that define the recognizable tokens.
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     print(stream.syntax_rules)
(('word', '[a-z]+'),)
- regex#
The compiled regular expression generated from the syntax rules.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... print(stream.regex.pattern) (?P<word>[a-z]+)|(?P<newline>\r?\n)|(?P<whitespace>[ \t]+)|(?P<invalid>.+)
- Type
re.Pattern[str]
- index#
The index of the current token in the list of extracted tokens.
You can technically mutate this attribute directly if you want to reset the stream back to a specific token, but you should probably use the higher-level checkpoint() method for this.
- Type
int
- tokens#
A list accumulating all the extracted tokens.
The list contains all the extracted tokens, even the ones ignored when using the ignore() method. For this reason you shouldn't try to index into the list directly. Use methods like expect(), peek(), or collect() instead.
- Type
list[tokenstream.token.Token]
- indentation#
A list that keeps track of the indentation levels when indentation is enabled. The list is empty when indentation is disabled.
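As a rough sketch of how the attribute behaves (the exact values stored in the list are an implementation detail, so this snippet only prints them without asserting anything):
with stream.syntax(word=r"[a-z]+"), stream.indent():
    stream.expect("word")
    stream.expect("indent")
    print(stream.indentation)  # reflects the indentation levels currently on the stack
    stream.expect("word")
    stream.expect("dedent")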
- indentation_skip#
A set of token types for which the token stream shouldn't emit indentation changes.
Can be set using the skip argument of the indent() method.
- generator#
An instance of the generate_tokens() generator that the stream iterates through to extract and emit tokens. Should be considered internal.
- Type
Iterator[tokenstream.token.Token]
- ignored_tokens#
A set of token types that the stream skips over when iterating, peeking, and expecting tokens.
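As a small sketch, assuming the documented defaults where whitespace and newline are skipped unless intercepted:
print(stream.ignored_tokens)  # typically includes "whitespace" and "newline" by default
with stream.ignore("comment"):
    print(stream.ignored_tokens)  # now also contains "comment"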
- regex_module#
The module to use for compiling regex patterns. Uses the built-in re module by default. It's possible to swap it out for https://github.com/mrabarnett/mrab-regex by specifying the module as a keyword argument when creating a new TokenStream.
- Type
Any
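For example, assuming the third-party regex package is installed, swapping it in might look like this:
import regex  # https://github.com/mrabarnett/mrab-regex

stream = TokenStream("hello world", regex_module=regex)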
- regex_cache#
A cache that keeps a reference to the compiled regular expression associated to each set of syntax rules.
- regex_module: Any = re#
- bake_regex()#
Compile the syntax rules.
Called automatically upon instantiation and when the syntax rules change. Should be considered internal.
- crop()#
Clear upcoming precomputed tokens.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... word = stream.expect("word") ... with stream.checkpoint(): ... word = stream.expect("word") ... print(stream.tokens[-1].value) ... stream.crop() ... print(stream.tokens[-1].value) world hello
Mostly used to ensure consistency in some of the provided context managers. Should be considered internal.
- syntax(**kwargs)#
Extend token syntax using regular expressions.
The keyword arguments associate regular expression patterns to token types. The method returns a context manager during which the specified tokens will be recognized.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... stream.expect("word").value ... stream.expect("word").value ... stream.expect("number").value 'hello' 'world' '123'
Nesting multiple syntax() calls will combine the rules.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+"):
...     with stream.syntax(number=r"[0-9]+"):
...         stream.expect("word").value
...         stream.expect("word").value
...         stream.expect("number").value
'hello'
'world'
'123'
You can also disable a previous rule by using None.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+"):
...     with stream.syntax(number=r"[0-9]+", word=None):
...         stream.expect("word").value
Traceback (most recent call last):
UnexpectedToken: Expected word but got invalid 'hello world 123'.
- reset_syntax(**kwargs)#
Overwrite the existing syntax rules.
This method lets you temporarily overwrite the existing rules instead of extending them.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+"): ... with stream.reset_syntax(number=r"[0-9]+"): ... stream.expect("word").value ... stream.expect("word").value ... stream.expect("number").value Traceback (most recent call last): UnexpectedToken: Expected word but got invalid 'hello world 123'.
- indent(enable=True, skip=None)#
Enable or disable indentation.
When indentation is enabled the token stream will track the current indentation level and emit indent and dedent tokens when the indentation level changes. The indent and dedent tokens are always balanced: every indent token will ultimately be paired with a dedent token.
>>> stream = TokenStream("hello\n\tworld")
>>> with stream.syntax(word=r"[a-z]+"), stream.indent():
...     stream.expect("word").value
...     stream.expect("indent").value
...     stream.expect("word").value
...     stream.expect("dedent").value
'hello'
''
'world'
''
The skip argument allows you to prevent some types of tokens from triggering indentation changes. The most common use-case would be ignoring indentation introduced by comments.
with stream.syntax(word=r"[a-z]+", comment=r"#.+$"), stream.indent(skip=["comment"]):
    stream.expect("word")
    stream.expect("indent")
    stream.expect("word")
    stream.expect("dedent")
You can also use the indent() method to temporarily disable indentation by specifying enable=False. This is different from simply ignoring indent and dedent tokens with the ignore() method because it clears the indentation stack, and if you decide to re-enable indentation afterwards the indentation level will start back at 0.
with stream.indent(enable=False):
    ...
- intercept(*token_types)#
Intercept tokens matching the given types.
This tells the stream not to skip over previously ignored tokens or tokens ignored by default like newline and whitespace.
>>> stream = TokenStream("hello world\n")
>>> with stream.syntax(word=r"[a-z]+"), stream.intercept("newline", "whitespace"):
...     stream.expect("word").value
...     stream.expect("whitespace").value
...     stream.expect("word").value
...     stream.expect("newline").value
'hello'
' '
'world'
'\n'
You can use the ignore() method to ignore previously intercepted tokens.
- ignore(*token_types)#
Ignore tokens matching the given types.
This tells the stream to skip over tokens matching any of the given types.
>>> stream = TokenStream("hello 123 world") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"), stream.ignore("number"): ... stream.expect("word").value ... stream.expect("word").value 'hello' 'world'
You can use the intercept() method to stop ignoring tokens.
- property current: Token#
The current token.
Can only be accessed if the stream started extracting tokens.
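For example, after extracting the first token, current refers to it (a small sketch following the pattern of the other examples):
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     stream.expect("word").value
...     stream.current.value
'hello'
'hello'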
- property previous: Token#
The previous token.
This is the token extracted immediately before the current one, so it's not affected by the ignore() method.
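For example, after extracting two words separated by a space, the previous token is the whitespace that was skipped over (compare with the peek(-1) example further down):
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     stream.expect("word").value
...     stream.expect("word").value
...     stream.previous.value
'hello'
'world'
' '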
- property leftover: str#
The remaining input.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.expect("word").value ... stream.leftover 'hello' ' world'
- head(characters=50)#
Preview the characters ahead of the current token.
This is useful for error messages and visualizing the input following the current token.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.expect("word").value ... stream.head() 'hello' ' world'
The generated string is truncated to 50 characters by default but you can change this with the characters argument.
- emit_token(token_type, value='')#
Generate a token in the token stream.
Should be considered internal. Used by the generate_tokens() method.
- emit_error(exc)#
Add location information to invalid syntax exceptions.
>>> stream = TokenStream("hello world") >>> raise stream.emit_error(InvalidSyntax("foo")) Traceback (most recent call last): InvalidSyntax: foo >>> with stream.syntax(word=r"[a-z]+"): ... stream.expect().value 'hello' >>> exc = stream.emit_error(InvalidSyntax("foo")) >>> exc.location SourceLocation(pos=5, lineno=1, colno=6)
- generate_tokens()#
Extract tokens from the input string.
Should be considered internal. This is the underlying generator being driven by the stream.
- peek(n=1)#
Peek around the current token.
The method returns the next token in the stream without advancing the stream to the next token.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.peek().value ... stream.expect("word").value 'hello' 'hello'
You can also peek multiple tokens ahead.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.peek(2).value ... stream.expect("word").value 'world' 'hello'
Negative values will let you peek backwards. It's generally better to use peek(-1) over the previous attribute because the peek() method will take ignored tokens into account.
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     stream.expect("word").value
...     stream.expect("word").value
...     stream.peek(-1).value
'hello'
'world'
'hello'
>>> stream.previous.value
' '
- peek_until(*patterns)#
Collect tokens until one of the given patterns matches.
>>> stream = TokenStream("hello world; foo") >>> with stream.syntax(word=r"[a-z]+", semi=r";"): ... for token in stream.peek_until("semi"): ... stream.expect("word").value ... stream.current.value ... stream.leftover 'hello' 'world' ';' ' foo'
The method will raise an error if the end of the stream is reached before encountering any of the given patterns.
>>> stream = TokenStream("hello world foo") >>> with stream.syntax(word=r"[a-z]+", semi=r";"): ... for token in stream.peek_until("semi"): ... stream.expect("word").value Traceback (most recent call last): UnexpectedEOF: Expected semi but reached end of file.
If the method is called without any pattern the iterator will yield tokens until the end of the stream.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... print([stream.expect("word").value for _ in stream.peek_until()]) ['hello', 'world']
- collect() → Iterator[Token]#
- collect(pattern: str | tuple[str, str], /) → Iterator[Token]
- collect(pattern1: str | tuple[str, str], pattern2: str | tuple[str, str], /, *patterns: str | tuple[str, str]) → Iterator[list[tokenstream.token.Token | None]]
Collect tokens matching the given patterns.
Calling the method without any arguments is similar to iterating over the stream directly. If you provide one or more arguments the iterator will stop if it encounters a token that doesn't match any of the given patterns.
>>> stream = TokenStream("hello world; foo") >>> with stream.syntax(word=r"[a-z]+", semi=r";"): ... for token in stream.collect("word"): ... token.value ... stream.leftover 'hello' 'world' '; foo'
If you provide more than one pattern the method will yield a sequence of the same size, with the token placed at the index of the pattern it matched.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... for word, number in stream.collect("word", "number"): ... if word: ... print("word", word.value) ... elif number: ... print("number", number.value) word hello word world number 123
There is one small difference between iterating over the stream directly and using the method without any argument. The collect() method will raise an exception if it encounters an invalid token.
>>> stream = TokenStream("foo")
>>> with stream.syntax(number=r"[0-9]+"):
...     for token in stream.collect():
...         token
Traceback (most recent call last):
UnexpectedToken: Expected anything but got invalid 'foo'.
When you iterate over the stream directly the tokens are unfiltered.
>>> stream = TokenStream("foo") >>> with stream.syntax(number=r"[0-9]+"): ... for token in stream: ... token Token(type='invalid', value='foo', ...)
- collect_any(*patterns)#
Collect tokens matching one of the given patterns.
The method is similar to collect() but will always return a single value. This works pretty nicely with Python 3.10+ match statements.
for token in stream.collect_any("word", "number"):
    match token:
        case Token(type="word"):
            print("word", token.value)
        case Token(type="number"):
            print("number", token.value)
- expect() → Token#
- expect(pattern: str | tuple[str, str], /) → Token
- expect(pattern1: str | tuple[str, str], pattern2: str | tuple[str, str], /, *patterns: str | tuple[str, str]) → list[tokenstream.token.Token | None]
Match the given patterns and raise an exception if the next token doesn't match.
The expect() method lets you retrieve tokens one at a time.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
...     stream.expect().value
...     stream.expect().value
...     stream.expect().value
'hello'
'world'
'123'
You can provide a pattern and if the extracted token doesn't match the method will raise an exception.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... stream.expect("number").value Traceback (most recent call last): UnexpectedToken: Expected number but got word 'hello'.
The method will also raise an exception if the stream has ended.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.expect("word").value ... stream.expect("word").value ... stream.expect("word").value Traceback (most recent call last): UnexpectedEOF: Expected word but reached end of file.
The method works a bit like collect() and lets you know which pattern matched the extracted token if you provide more than one pattern.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
...     word, number = stream.expect("word", "number")
...     if word:
...         print("word", word.value)
...     elif number:
...         print("number", number.value)
word hello
- get(*patterns)#
Return the next token if it matches any of the given patterns.
The method works a bit like expect() but will return None instead of raising an exception if none of the given patterns match. If there are no more tokens the method will also return None.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
...     stream.get("word").value
...     stream.get("number") is None
...     stream.get("word").value
...     stream.get("number").value
...     stream.get() is None
'hello'
True
'world'
'123'
True
- expect_any(*patterns)#
Make sure that the next token matches one of the given patterns or raise an exception.
The method is similar to expect() but will always return a single value. This works pretty nicely with Python 3.10+ match statements.
match stream.expect_any("word", "number"):
    case Token(type="word") as word:
        print("word", word.value)
    case Token(type="number") as number:
        print("number", number.value)
- expect_eof()#
Raise an exception if there is leftover input.
>>> stream = TokenStream("hello world 123 foo") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... for token in stream.collect("word"): ... token.value ... stream.expect("number").value 'hello' 'world' '123' >>> stream.expect_eof() Traceback (most recent call last): UnexpectedToken: Expected eof but got invalid 'foo'.
- checkpoint()#
Reset the stream to the current token at the end of the with statement.
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     with stream.checkpoint():
...         stream.expect("word").value
...     stream.expect("word").value
'hello'
'hello'
You can use the returned handle to keep the state of the stream at the end of the with statement. For more details check out CheckpointCommit.
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     with stream.checkpoint() as commit:
...         stream.expect("word").value
...         commit()
...     stream.expect("word").value
'hello'
'world'
The context manager will swallow syntax errors until the handle commits the checkpoint.
- alternative(active=True)#
Keep going if the code within the with statement raises a syntax error.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
...     stream.expect("word").value
...     stream.expect("word").value
...     with stream.alternative():
...         stream.expect("word").value
...     stream.expect("number").value
'hello'
'world'
'123'
You can optionally provide a boolean to deactivate the context manager dynamically.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... with stream.alternative(False): ... stream.expect("number").value Traceback (most recent call last): UnexpectedToken: Expected number but got word 'hello'.
- choose(*args)#
Iterate over each argument until one of the alternatives succeeds.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... while stream.peek(): ... for token_type, alternative in stream.choose("word", "number"): ... with alternative: ... stream.expect(token_type).value 'hello' 'world' '123'
- provide(**data)#
Provide arbitrary user data.
>>> stream = TokenStream("hello world") >>> with stream.provide(foo=123): ... stream.data["foo"] 123
- reset(*args)#
Temporarily reset arbitrary user data.
>>> stream = TokenStream("hello world") >>> with stream.provide(foo=123): ... stream.data["foo"] ... with stream.reset("foo"): ... stream.data ... stream.data 123 {} {'foo': 123}
- copy()#
Return a copy of the stream.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.expect("word").value ... stream_copy = stream.copy() ... stream.expect("word").value 'hello' 'world' >>> with stream_copy.syntax(letter=r"[a-z]"): ... [token.value for token in stream_copy] ['w', 'o', 'r', 'l', 'd']
- class tokenstream.stream.CheckpointCommit(index, rollback=True)#
Handle for managing checkpoints.
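The handle exposes the index and rollback values from the constructor signature above. A brief sketch of the assumed usage (see checkpoint() for the full behavior):
with stream.checkpoint() as commit:
    stream.expect("word")
    commit()  # keep the stream state instead of rolling back
# after committing, commit.rollback is False and commit.index still records
# where the stream would have been reset to (assumed behavior)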
Token#
- class tokenstream.token.Token(type, value, location, end_location)#
Class representing a token.
- location: SourceLocation#
The start location of the token within the input (namedtuple field 2).
- end_location: SourceLocation#
The end location of the token within the input (namedtuple field 3).
- match(*patterns)#
Match the token against one or more patterns.
Each argument can be either a string corresponding to a token type or a tuple with a token type and a token value.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... print(f"{stream.expect().match(('word', 'hello')) = }") ... print(f"{stream.expect().match('word') = }") stream.expect().match(('word', 'hello')) = True stream.expect().match('word') = True
- emit_error(exc)#
Add location information to invalid syntax exceptions.
This works exactly like tokenstream.stream.TokenStream.emit_error() but it associates the location of the token with the syntax error instead of the head of the stream.
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     token = stream.expect()
...     exc = token.emit_error(InvalidSyntax("goodbye"))
...     raise exc
Traceback (most recent call last):
InvalidSyntax: goodbye
>>> exc.location
SourceLocation(pos=0, lineno=1, colno=1)
>>> exc.end_location
SourceLocation(pos=5, lineno=1, colno=6)
Location#
- class tokenstream.location.SourceLocation(pos, lineno, colno)#
Class representing a location within an input string.
- property unknown: bool#
Whether the location is unknown.
>>> location = UNKNOWN_LOCATION
>>> location.unknown
True
- format(filename, message)#
Return a message formatted with the given filename and the current location.
>>> SourceLocation(42, 3, 12).format("path/to/file.txt", "Some error message")
'path/to/file.txt:3:12: Some error message'
- with_horizontal_offset(offset)#
Create a modified source location along the horizontal axis.
>>> INITIAL_LOCATION.with_horizontal_offset(41)
SourceLocation(pos=41, lineno=1, colno=42)
- skip_over(value)#
Return the source location after skipping over a piece of text.
>>> INITIAL_LOCATION.skip_over("hello\nworld")
SourceLocation(pos=11, lineno=2, colno=6)
- map(input_mappings, output_mappings)#
Map a source location.
The mappings must contain corresponding source locations in order.
>>> INITIAL_LOCATION.map([], [])
SourceLocation(pos=0, lineno=1, colno=1)
>>> mappings1 = [SourceLocation(16, 2, 27), SourceLocation(19, 2, 30)]
>>> mappings2 = [SourceLocation(24, 3, 8), SourceLocation(67, 4, 12)]
>>> INITIAL_LOCATION.map(mappings1, mappings2)
SourceLocation(pos=0, lineno=1, colno=1)
>>> SourceLocation(15, 2, 26).map(mappings1, mappings2)
SourceLocation(pos=15, lineno=2, colno=26)
>>> SourceLocation(16, 2, 27).map(mappings1, mappings2)
SourceLocation(pos=24, lineno=3, colno=8)
>>> SourceLocation(18, 2, 29).map(mappings1, mappings2)
SourceLocation(pos=26, lineno=3, colno=10)
>>> SourceLocation(19, 2, 30).map(mappings1, mappings2)
SourceLocation(pos=67, lineno=4, colno=12)
>>> SourceLocation(31, 3, 6).map(mappings1, mappings2)
SourceLocation(pos=79, lineno=5, colno=6)
- relocate(base_location, target_location)#
Return the current location transformed relative to the target location.
- tokenstream.location.set_location(obj, location=SourceLocation(pos=-1, lineno=0, colno=0), end_location=SourceLocation(pos=-1, lineno=0, colno=0))#
Set the location and end_location attributes.
The function returns the given object or a new instance if the object is a namedtuple or a frozen dataclass. The location can be copied from another object with location and end_location attributes.
>>> token = Token("number", "123", UNKNOWN_LOCATION, UNKNOWN_LOCATION) >>> updated_token = set_location(token, SourceLocation(15, 6, 1)) >>> updated_token Token(type='number', value='123', location=SourceLocation(pos=15, lineno=6, colno=1), end_location=SourceLocation(pos=15, lineno=6, colno=1)) >>> updated_token = set_location( ... updated_token, ... end_location=updated_token.location.with_horizontal_offset(len(updated_token.value)), ... ) >>> set_location(token, updated_token) Token(type='number', value='123', location=SourceLocation(pos=15, lineno=6, colno=1), end_location=SourceLocation(pos=18, lineno=6, colno=4))
Exceptions#
- class tokenstream.error.InvalidSyntax(*args)#
Bases: Exception
Raised when the input contains invalid syntax.
- location#
The location of the error.
- end_location#
The end location of the error.
- alternatives#
A dictionary holding other alternative errors associated with the exception.
- format(filename)#
Return a string representing the error and its location in a given file.
>>> try:
...     TokenStream("hello").expect()
... except InvalidSyntax as exc:
...     print(exc.format("path/to/my_file.txt"))
path/to/my_file.txt:1:1: Expected anything but got invalid 'hello'.
- add_alternative(exc)#
Associate an alternative error.
- class tokenstream.error.UnexpectedEOF(expected_patterns=())#
Bases: InvalidSyntax
Raised when the input ends unexpectedly.
- expected_patterns#
The patterns that the parser was expecting instead of reaching the end of the file.
- add_alternative(exc)#
Associate an alternative error.
- class tokenstream.error.UnexpectedToken(token, expected_patterns=())#
Bases: InvalidSyntax
Raised when the input contains an unexpected token.
- token#
The unexpected token that was encountered.
- expected_patterns#
The patterns that the parser was expecting instead.
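For instance, both attributes can be inspected when catching the exception. A small illustrative sketch (the exact representation of expected_patterns is not asserted here):
stream = TokenStream("hello world")
with stream.syntax(number=r"[0-9]+"):
    try:
        stream.expect("number")
    except UnexpectedToken as exc:
        print(exc.token.value)        # the offending token, matched as invalid
        print(exc.expected_patterns)  # the patterns that were expected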
- add_alternative(exc)#
Associate an alternative error.