API reference#
The API surface is intentionally minimal. The package provides a simple token class, a couple of exceptions, and the main TokenStream abstraction. There are no third-party dependencies.
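As a quick orientation before the detailed reference, here is a minimal sketch of the typical workflow. It assumes the public names are re-exported from the top-level tokenstream package, which is how the examples below use them.
from tokenstream import TokenStream, InvalidSyntax

stream = TokenStream("hello world")
with stream.syntax(word=r"[a-z]+"):
    try:
        words = [token.value for token in stream.collect("word")]  # ['hello', 'world']
    except InvalidSyntax as exc:
        print(exc.format("example.txt"))  # report the error with its source location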
TokenStream#
- class tokenstream.stream.TokenStream(source, preprocessor=None, regex_module=re)#
A versatile token stream for handwritten parsers.
The stream is iterable and will yield all the extracted tokens one after the other.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"\w+"): ... print([token.value for token in stream]) ['hello', 'world']
- source#
The input string.
>>> stream = TokenStream("hello world") >>> stream.source 'hello world'
- Type
str
- preprocessor#
A preprocessor that will emit source location mappings for the transformed input.
- Type
Optional[Callable[[str], tuple[str, Sequence[tokenstream.location.SourceLocation], Sequence[tokenstream.location.SourceLocation]]]]
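For illustration, here is a minimal sketch of a preprocessor that matches the declared signature: an identity transform with empty location mappings (the name identity_preprocessor is just for this example).
def identity_preprocessor(source: str):
    # Return the transformed input plus the input/output source location mappings.
    # An identity transform changes nothing, so both mapping sequences are empty.
    return source, [], []

stream = TokenStream("hello world", preprocessor=identity_preprocessor)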
- syntax_rules#
A tuple of (token_type, pattern) pairs that define the recognizable tokens.
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     print(stream.syntax_rules)
(('word', '[a-z]+'),)
- regex#
The compiled regular expression generated from the syntax rules.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... print(stream.regex.pattern) (?P<word>[a-z]+)|(?P<newline>\r?\n)|(?P<whitespace>[ \t]+)|(?P<invalid>.+)
- Type
re.Pattern[str]
- index#
The index of the current token in the list of extracted tokens.
You can technically mutate this attribute directly if you want to reset the stream back to a specific token, but you should probably use the higher-level checkpoint() method for this.
- Type
int
- tokens#
A list accumulating all the extracted tokens.
The list contains all the extracted tokens, even the ones ignored when using the ignore() method. For this reason you shouldn't try to index into the list directly. Use methods like expect(), peek(), or collect() instead.
- Type
list[tokenstream.token.Token]
- indentation#
A list that keeps track of the indentation levels when indentation is enabled. The list is empty when indentation is disabled.
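As a rough sketch of how the attribute behaves (the exact values stored in the list are an implementation detail, so this snippet only prints them without asserting anything):
with stream.syntax(word=r"[a-z]+"), stream.indent():
    stream.expect("word")
    stream.expect("indent")
    print(stream.indentation)  # reflects the indentation levels currently on the stack
    stream.expect("word")
    stream.expect("dedent")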
- indentation_skip#
A set of token types for which the token stream shouldn't emit indentation changes.
Can be set using the skip argument of the indent() method.
- generator#
An instance of the generate_tokens() generator that the stream iterates through to extract and emit tokens. Should be considered internal.
- Type
Iterator[tokenstream.token.Token]
- ignored_tokens#
A set of token types that the stream skips over when iterating, peeking, and expecting tokens.
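As a small sketch, assuming the documented defaults where whitespace and newline are skipped unless intercepted:
print(stream.ignored_tokens)  # typically includes "whitespace" and "newline" by default
with stream.ignore("comment"):
    print(stream.ignored_tokens)  # now also contains "comment"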
- regex_module#
The module to use for compiling regex patterns. Uses the built-in re module by default. It's possible to swap it out for https://github.com/mrabarnett/mrab-regex by specifying the module as a keyword argument when creating a new TokenStream.
- Type
Any
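For example, assuming the third-party regex package is installed, swapping it in might look like this:
import regex  # https://github.com/mrabarnett/mrab-regex

stream = TokenStream("hello world", regex_module=regex)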
- regex_cache#
A cache that keeps a reference to the compiled regular expression associated to each set of syntax rules.
- regex_module: Any = re#
- bake_regex()#
Compile the syntax rules.
Called automatically upon instantiation and when the syntax rules change. Should be considered internal.
- crop()#
Clear upcoming precomputed tokens.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... word = stream.expect("word") ... with stream.checkpoint(): ... word = stream.expect("word") ... print(stream.tokens[-1].value) ... stream.crop() ... print(stream.tokens[-1].value) world hello
Mostly used to ensure consistency in some of the provided context managers. Should be considered internal.
- syntax(**kwargs)#
Extend token syntax using regular expressions.
The keyword arguments associate regular expression patterns to token types. The method returns a context manager during which the specified tokens will be recognized.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... stream.expect("word").value ... stream.expect("word").value ... stream.expect("number").value 'hello' 'world' '123'
Nesting multiple syntax() calls will combine the rules.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+"):
...     with stream.syntax(number=r"[0-9]+"):
...         stream.expect("word").value
...         stream.expect("word").value
...         stream.expect("number").value
'hello'
'world'
'123'
You can also disable a previous rule by using None.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+"):
...     with stream.syntax(number=r"[0-9]+", word=None):
...         stream.expect("word").value
Traceback (most recent call last):
UnexpectedToken: Expected word but got invalid 'hello world 123'.
- reset_syntax(**kwargs)#
Overwrite the existing syntax rules.
This method lets you temporarily overwrite the existing rules instead of extending them.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+"): ... with stream.reset_syntax(number=r"[0-9]+"): ... stream.expect("word").value ... stream.expect("word").value ... stream.expect("number").value Traceback (most recent call last): UnexpectedToken: Expected word but got invalid 'hello world 123'.
- indent(enable=True, skip=None)#
Enable or disable indentation.
When indentation is enabled the token stream will track the current indentation level and emit indent and dedent tokens when the indentation level changes. The indent and dedent tokens are always balanced: every indent token will ultimately be paired with a dedent token.
>>> stream = TokenStream("hello\n\tworld")
>>> with stream.syntax(word=r"[a-z]+"), stream.indent():
...     stream.expect("word").value
...     stream.expect("indent").value
...     stream.expect("word").value
...     stream.expect("dedent").value
'hello'
''
'world'
''
The skip argument allows you to prevent some types of tokens from triggering indentation changes. The most common use-case would be ignoring indentation introduced by comments.
with stream.syntax(word=r"[a-z]+", comment=r"#.+$"), stream.indent(skip=["comment"]):
    stream.expect("word")
    stream.expect("indent")
    stream.expect("word")
    stream.expect("dedent")
You can also use the indent() method to temporarily disable indentation by specifying enable=False. This is different from simply ignoring indent and dedent tokens with the ignore() method because it clears the indentation stack, and if you decide to re-enable indentation afterwards the indentation level will start back at 0.
with stream.indent(enable=False):
    ...
- intercept(*token_types)#
Intercept tokens matching the given types.
This tells the stream not to skip over previously ignored tokens or tokens ignored by default like newline and whitespace.
>>> stream = TokenStream("hello world\n")
>>> with stream.syntax(word=r"[a-z]+"), stream.intercept("newline", "whitespace"):
...     stream.expect("word").value
...     stream.expect("whitespace").value
...     stream.expect("word").value
...     stream.expect("newline").value
'hello'
' '
'world'
'\n'
You can use the ignore() method to ignore previously intercepted tokens.
- ignore(*token_types)#
Ignore tokens matching the given types.
This tells the stream to skip over tokens matching any of the given types.
>>> stream = TokenStream("hello 123 world") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"), stream.ignore("number"): ... stream.expect("word").value ... stream.expect("word").value 'hello' 'world'
You can use the intercept() method to stop ignoring tokens.
- property current: Token#
The current token.
Can only be accessed if the stream started extracting tokens.
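For example, after extracting the first token, current refers to it (a small sketch following the pattern of the other examples):
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     stream.expect("word").value
...     stream.current.value
'hello'
'hello'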
- property previous: Token#
The previous token.
This is the token extracted immediately before the current one, so it's not affected by the ignore() method.
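For example, after extracting two words separated by a space, the previous token is the whitespace that was skipped over (compare with the peek(-1) example further down):
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     stream.expect("word").value
...     stream.expect("word").value
...     stream.previous.value
'hello'
'world'
' '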
- property leftover: str#
The remaining input.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.expect("word").value ... stream.leftover 'hello' ' world'
- head(characters=50)#
Preview the characters ahead of the current token.
This is useful for error messages and visualizing the input following the current token.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.expect("word").value ... stream.head() 'hello' ' world'
The generated string is truncated to 50 characters by default but you can change this with the characters argument.
- emit_token(token_type, value='')#
Generate a token in the token stream.
Should be considered internal. Used by the generate_tokens() method.
- emit_error(exc)#
Add location information to invalid syntax exceptions.
>>> stream = TokenStream("hello world") >>> raise stream.emit_error(InvalidSyntax("foo")) Traceback (most recent call last): InvalidSyntax: foo >>> with stream.syntax(word=r"[a-z]+"): ... stream.expect().value 'hello' >>> exc = stream.emit_error(InvalidSyntax("foo")) >>> exc.location SourceLocation(pos=5, lineno=1, colno=6)
- generate_tokens()#
Extract tokens from the input string.
Should be considered internal. This is the underlying generator being driven by the stream.
- peek(n=1)#
Peek around the current token.
The method returns the next token in the stream without advancing the stream to the next token.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.peek().value ... stream.expect("word").value 'hello' 'hello'
You can also peek multiple tokens ahead.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.peek(2).value ... stream.expect("word").value 'world' 'hello'
Negative values will let you peek backwards. It's generally better to use peek(-1) over the previous attribute because the peek() method will take ignored tokens into account.
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     stream.expect("word").value
...     stream.expect("word").value
...     stream.peek(-1).value
'hello'
'world'
'hello'
>>> stream.previous.value
' '
- peek_until(*patterns)#
Collect tokens until one of the given patterns matches.
>>> stream = TokenStream("hello world; foo") >>> with stream.syntax(word=r"[a-z]+", semi=r";"): ... for token in stream.peek_until("semi"): ... stream.expect("word").value ... stream.current.value ... stream.leftover 'hello' 'world' ';' ' foo'
The method will raise an error if the end of the stream is reached before encountering any of the given patterns.
>>> stream = TokenStream("hello world foo") >>> with stream.syntax(word=r"[a-z]+", semi=r";"): ... for token in stream.peek_until("semi"): ... stream.expect("word").value Traceback (most recent call last): UnexpectedEOF: Expected semi but reached end of file.
If the method is called without any pattern the iterator will yield tokens until the end of the stream.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... print([stream.expect("word").value for _ in stream.peek_until()]) ['hello', 'world']
- collect() → Iterator[Token]#
- collect(pattern: str | tuple[str, str], /) → Iterator[Token]
- collect(pattern1: str | tuple[str, str], pattern2: str | tuple[str, str], /, *patterns: str | tuple[str, str]) → Iterator[list[tokenstream.token.Token | None]]
Collect tokens matching the given patterns.
Calling the method without any arguments is similar to iterating over the stream directly. If you provide one or more arguments the iterator will stop if it encounters a token that doesn't match any of the given patterns.
>>> stream = TokenStream("hello world; foo") >>> with stream.syntax(word=r"[a-z]+", semi=r";"): ... for token in stream.collect("word"): ... token.value ... stream.leftover 'hello' 'world' '; foo'
If you provide more than one pattern the method will yield a sequence of the same size, with the token placed at the index of the pattern it matched.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... for word, number in stream.collect("word", "number"): ... if word: ... print("word", word.value) ... elif number: ... print("number", number.value) word hello word world number 123
There is one small difference between iterating over the stream directly and using the method without any argument. The collect() method will raise an exception if it encounters an invalid token.
>>> stream = TokenStream("foo")
>>> with stream.syntax(number=r"[0-9]+"):
...     for token in stream.collect():
...         token
Traceback (most recent call last):
UnexpectedToken: Expected anything but got invalid 'foo'.
When you iterate over the stream directly the tokens are unfiltered.
>>> stream = TokenStream("foo") >>> with stream.syntax(number=r"[0-9]+"): ... for token in stream: ... token Token(type='invalid', value='foo', ...)
- collect_any(*patterns)#
Collect tokens matching one of the given patterns.
The method is similar to collect() but will always return a single value. This works pretty nicely with Python 3.10+ match statements.
for token in stream.collect_any("word", "number"):
    match token:
        case Token(type="word"):
            print("word", token.value)
        case Token(type="number"):
            print("number", token.value)
- expect() → Token#
- expect(pattern: str | tuple[str, str], /) → Token
- expect(pattern1: str | tuple[str, str], pattern2: str | tuple[str, str], /, *patterns: str | tuple[str, str]) → list[tokenstream.token.Token | None]
Match the given patterns and raise an exception if the next token doesn't match.
The expect() method lets you retrieve tokens one at a time.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
...     stream.expect().value
...     stream.expect().value
...     stream.expect().value
'hello'
'world'
'123'
You can provide a pattern and if the extracted token doesn't match the method will raise an exception.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... stream.expect("number").value Traceback (most recent call last): UnexpectedToken: Expected number but got word 'hello'.
The method will also raise an exception if the stream has ended.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.expect("word").value ... stream.expect("word").value ... stream.expect("word").value Traceback (most recent call last): UnexpectedEOF: Expected word but reached end of file.
The method works a bit like collect() and lets you know which pattern matched the extracted token if you provide more than one pattern.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
...     word, number = stream.expect("word", "number")
...     if word:
...         print("word", word.value)
...     elif number:
...         print("number", number.value)
word hello
- get(*patterns)#
Return the next token if it matches any of the given patterns.
The method works a bit like expect() but will return None instead of raising an exception if none of the given patterns match. If there are no more tokens the method will also return None.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
...     stream.get("word").value
...     stream.get("number") is None
...     stream.get("word").value
...     stream.get("number").value
...     stream.get() is None
'hello'
True
'world'
'123'
True
- expect_any(*patterns)#
Make sure that the next token matches one of the given patterns or raise an exception.
The method is similar to expect() but will always return a single value. This works pretty nicely with Python 3.10+ match statements.
match stream.expect_any("word", "number"):
    case Token(type="word") as word:
        print("word", word.value)
    case Token(type="number") as number:
        print("number", number.value)
- expect_eof()#
Raise an exception if there is leftover input.
>>> stream = TokenStream("hello world 123 foo") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... for token in stream.collect("word"): ... token.value ... stream.expect("number").value 'hello' 'world' '123' >>> stream.expect_eof() Traceback (most recent call last): UnexpectedToken: Expected eof but got invalid 'foo'.
- checkpoint()#
Reset the stream to the current token at the end of the with statement.
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     with stream.checkpoint():
...         stream.expect("word").value
...     stream.expect("word").value
'hello'
'hello'
You can use the returned handle to keep the state of the stream at the end of the with statement. For more details check out CheckpointCommit.
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     with stream.checkpoint() as commit:
...         stream.expect("word").value
...         commit()
...     stream.expect("word").value
'hello'
'world'
The context manager will swallow syntax errors until the handle commits the checkpoint.
- alternative(active=True)#
Keep going if the code within the with statement raises a syntax error.
>>> stream = TokenStream("hello world 123")
>>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
...     stream.expect("word").value
...     stream.expect("word").value
...     with stream.alternative():
...         stream.expect("word").value
...     stream.expect("number").value
'hello'
'world'
'123'
You can optionally provide a boolean to deactivate the context manager dynamically.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... with stream.alternative(False): ... stream.expect("number").value Traceback (most recent call last): UnexpectedToken: Expected number but got word 'hello'.
- choose(*args)#
Iterate over each argument until one of the alternatives succeeds.
>>> stream = TokenStream("hello world 123") >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"): ... while stream.peek(): ... for token_type, alternative in stream.choose("word", "number"): ... with alternative: ... stream.expect(token_type).value 'hello' 'world' '123'
- provide(**data)#
Provide arbitrary user data.
>>> stream = TokenStream("hello world") >>> with stream.provide(foo=123): ... stream.data["foo"] 123
- reset(*args)#
Temporarily reset arbitrary user data.
>>> stream = TokenStream("hello world") >>> with stream.provide(foo=123): ... stream.data["foo"] ... with stream.reset("foo"): ... stream.data ... stream.data 123 {} {'foo': 123}
- copy()#
Return a copy of the stream.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... stream.expect("word").value ... stream_copy = stream.copy() ... stream.expect("word").value 'hello' 'world' >>> with stream_copy.syntax(letter=r"[a-z]"): ... [token.value for token in stream_copy] ['w', 'o', 'r', 'l', 'd']
- class tokenstream.stream.CheckpointCommit(index, rollback=True)#
Handle for managing checkpoints.
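The handle exposes the index and rollback values from the constructor signature above. A brief sketch of the assumed usage (see checkpoint() for the full behavior):
with stream.checkpoint() as commit:
    stream.expect("word")
    commit()  # keep the stream state instead of rolling back
# after committing, commit.rollback is False and commit.index still records
# where the stream would have been reset to (assumed behavior)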
Token#
- class tokenstream.token.Token(type, value, location, end_location)#
Class representing a token.
- location: SourceLocation#
The start location of the token within the input (namedtuple field 2).
- end_location: SourceLocation#
The end location of the token within the input (namedtuple field 3).
- match(*patterns)#
Match the token against one or more patterns.
Each argument can be either a string corresponding to a token type or a tuple with a token type and a token value.
>>> stream = TokenStream("hello world") >>> with stream.syntax(word=r"[a-z]+"): ... print(f"{stream.expect().match(('word', 'hello')) = }") ... print(f"{stream.expect().match('word') = }") stream.expect().match(('word', 'hello')) = True stream.expect().match('word') = True
- emit_error(exc)#
Add location information to invalid syntax exceptions.
This works exactly like tokenstream.stream.TokenStream.emit_error() but it associates the location of the token with the syntax error instead of the head of the stream.
>>> stream = TokenStream("hello world")
>>> with stream.syntax(word=r"[a-z]+"):
...     token = stream.expect()
...     exc = token.emit_error(InvalidSyntax("goodbye"))
...     raise exc
Traceback (most recent call last):
InvalidSyntax: goodbye
>>> exc.location
SourceLocation(pos=0, lineno=1, colno=1)
>>> exc.end_location
SourceLocation(pos=5, lineno=1, colno=6)
Location#
- class tokenstream.location.SourceLocation(pos, lineno, colno)#
Class representing a location within an input string.
- property unknown: bool#
Whether the location is unknown.
>>> location = UNKNOWN_LOCATION
>>> location.unknown
True
- format(filename, message)#
Return a message formatted with the given filename and the current location.
>>> SourceLocation(42, 3, 12).format("path/to/file.txt", "Some error message")
'path/to/file.txt:3:12: Some error message'
- with_horizontal_offset(offset)#
Create a modified source location along the horizontal axis.
>>> INITIAL_LOCATION.with_horizontal_offset(41)
SourceLocation(pos=41, lineno=1, colno=42)
- skip_over(value)#
Return the source location after skipping over a piece of text.
>>> INITIAL_LOCATION.skip_over("hello\nworld")
SourceLocation(pos=11, lineno=2, colno=6)
- map(input_mappings, output_mappings)#
Map a source location.
The mappings must contain corresponding source locations in order.
>>> INITIAL_LOCATION.map([], [])
SourceLocation(pos=0, lineno=1, colno=1)
>>> mappings1 = [SourceLocation(16, 2, 27), SourceLocation(19, 2, 30)]
>>> mappings2 = [SourceLocation(24, 3, 8), SourceLocation(67, 4, 12)]
>>> INITIAL_LOCATION.map(mappings1, mappings2)
SourceLocation(pos=0, lineno=1, colno=1)
>>> SourceLocation(15, 2, 26).map(mappings1, mappings2)
SourceLocation(pos=15, lineno=2, colno=26)
>>> SourceLocation(16, 2, 27).map(mappings1, mappings2)
SourceLocation(pos=24, lineno=3, colno=8)
>>> SourceLocation(18, 2, 29).map(mappings1, mappings2)
SourceLocation(pos=26, lineno=3, colno=10)
>>> SourceLocation(19, 2, 30).map(mappings1, mappings2)
SourceLocation(pos=67, lineno=4, colno=12)
>>> SourceLocation(31, 3, 6).map(mappings1, mappings2)
SourceLocation(pos=79, lineno=5, colno=6)
- relocate(base_location, target_location)#
Return the current location transformed relative to the target location.
- tokenstream.location.set_location(obj, location=SourceLocation(pos=-1, lineno=0, colno=0), end_location=SourceLocation(pos=-1, lineno=0, colno=0))#
Set the location and end_location attributes.
The function returns the given object or a new instance if the object is a namedtuple or a frozen dataclass. The location can be copied from another object with location and end_location attributes.
>>> token = Token("number", "123", UNKNOWN_LOCATION, UNKNOWN_LOCATION) >>> updated_token = set_location(token, SourceLocation(15, 6, 1)) >>> updated_token Token(type='number', value='123', location=SourceLocation(pos=15, lineno=6, colno=1), end_location=SourceLocation(pos=15, lineno=6, colno=1)) >>> updated_token = set_location( ... updated_token, ... end_location=updated_token.location.with_horizontal_offset(len(updated_token.value)), ... ) >>> set_location(token, updated_token) Token(type='number', value='123', location=SourceLocation(pos=15, lineno=6, colno=1), end_location=SourceLocation(pos=18, lineno=6, colno=4))
Exceptions#
- class tokenstream.error.InvalidSyntax(*args)#
Bases: Exception
Raised when the input contains invalid syntax.
- location#
The location of the error.
- end_location#
The end location of the error.
- alternatives#
A dictionary holding other alternative errors associated with the exception.
- format(filename)#
Return a string representing the error and its location in a given file.
>>> try:
...     TokenStream("hello").expect()
... except InvalidSyntax as exc:
...     print(exc.format("path/to/my_file.txt"))
path/to/my_file.txt:1:1: Expected anything but got invalid 'hello'.
- add_alternative(exc)#
Associate an alternative error.
- class tokenstream.error.UnexpectedEOF(expected_patterns=())#
Bases: InvalidSyntax
Raised when the input ends unexpectedly.
- expected_patterns#
The patterns that the parser was expecting instead of reaching the end of the file.
- add_alternative(exc)#
Associate an alternative error.
- class tokenstream.error.UnexpectedToken(token, expected_patterns=())#
Bases: InvalidSyntax
Raised when the input contains an unexpected token.
- token#
The unexpected token that was encountered.
- expected_patterns#
The patterns that the parser was expecting instead.
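For instance, both attributes can be inspected when catching the exception. A small illustrative sketch (the exact representation of expected_patterns is not asserted here):
stream = TokenStream("hello world")
with stream.syntax(number=r"[0-9]+"):
    try:
        stream.expect("number")
    except UnexpectedToken as exc:
        print(exc.token.value)        # the offending token, matched as invalid
        print(exc.expected_patterns)  # the patterns that were expected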
- add_alternative(exc)#
Associate an alternative error.