API reference#
The API surface is intentionally minimal. The package provides a simple token class, a couple of exceptions, and the main TokenStream abstraction. There are no third-party dependencies.
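The examples below use the public names directly; a minimal setup, assuming the usual top-level re-exports of the package, looks like this:

from tokenstream import TokenStream, Token, SourceLocation, InvalidSyntax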
TokenStream#
- class tokenstream.stream.TokenStream(source, preprocessor=None, regex_module=re)#
- A versatile token stream for handwritten parsers.

  The stream is iterable and will yield all the extracted tokens one after the other.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"\w+"):
  ...     print([token.value for token in stream])
  ['hello', 'world']

 - source#
- The input string.

  >>> stream = TokenStream("hello world")
  >>> stream.source
  'hello world'

  Type: str
 
 - preprocessor#
- A preprocessor that will emit source location mappings for the transformed input.

  Type: Optional[Callable[[str], tuple[str, Sequence[tokenstream.location.SourceLocation], Sequence[tokenstream.location.SourceLocation]]]]
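  As a minimal sketch, a preprocessor that performs no transformation can simply return the source unchanged together with two empty mapping sequences, which makes source locations map through unchanged (the function name is illustrative):

  def identity_preprocessor(source: str):
      # No transformation: empty input/output mappings leave source locations untouched.
      return source, [], []

  stream = TokenStream("hello world", preprocessor=identity_preprocessor)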
 
 - syntax_rules#
- A tuple of (token_type, pattern) pairs that define the recognizable tokens.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     print(stream.syntax_rules)
  (('word', '[a-z]+'),)
 - regex#
- The compiled regular expression generated from the syntax rules.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     print(stream.regex.pattern)
  (?P<word>[a-z]+)|(?P<newline>\r?\n)|(?P<whitespace>[ \t]+)|(?P<invalid>.+)

  Type: re.Pattern[str]
 
 - index#
- The index of the current token in the list of extracted tokens.

  You can technically mutate this attribute directly if you want to reset the stream back to a specific token, but you should probably use the higher-level checkpoint() method for this.

  Type: int
 
 - tokens#
- A list accumulating all the extracted tokens.

  The list contains all the extracted tokens, even the ones ignored when using the ignore() method. For this reason you shouldn't try to index into the list directly. Use methods like expect(), peek(), or collect() instead.

  Type: List[Token]
 
 - indentation#
- A list that keeps track of the indentation levels when indentation is enabled. The list is empty when indentation is disabled. 
 - indentation_skip#
- A set of token types for which the token stream shouldn't emit indentation changes.

  Can be set using the skip argument of the indent() method.
 - generator#
- An instance of the generate_tokens() generator that the stream iterates through to extract and emit tokens.

  Should be considered internal.

  Type: Iterator[tokenstream.token.Token]
 
 - ignored_tokens#
- A set of token types that the stream skips over when iterating, peeking, and expecting tokens. 
 - regex_module#
- The module to use for compiling regex patterns. Uses the built-in re module by default. It's possible to swap it out for https://github.com/mrabarnett/mrab-regex by specifying the module as a keyword argument when creating a new TokenStream (see the sketch below).

  Type: Any
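  A minimal sketch of swapping in the third-party module, assuming the regex package is installed:

  import regex

  stream = TokenStream("hello world", regex_module=regex)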
 
 - regex_cache#
- A cache that keeps a reference to the compiled regular expression associated with each set of syntax rules.
 - regex_module: Any = re#
 - bake_regex()#
- Compile the syntax rules.

  Called automatically upon instantiation and when the syntax rules change. Should be considered internal.
 - crop()#
- Clear upcoming precomputed tokens.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     word = stream.expect("word")
  ...     with stream.checkpoint():
  ...         word = stream.expect("word")
  ...     print(stream.tokens[-1].value)
  ...     stream.crop()
  ...     print(stream.tokens[-1].value)
  world
  hello

  Mostly used to ensure consistency in some of the provided context managers. Should be considered internal.
 - syntax(**kwargs)#
- Extend token syntax using regular expressions.

  The keyword arguments associate regular expression patterns with token types. The method returns a context manager during which the specified tokens will be recognized.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
  ...     stream.expect("word").value
  ...     stream.expect("word").value
  ...     stream.expect("number").value
  'hello'
  'world'
  '123'

  Nesting multiple syntax() calls will combine the rules.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     with stream.syntax(number=r"[0-9]+"):
  ...         stream.expect("word").value
  ...         stream.expect("word").value
  ...         stream.expect("number").value
  'hello'
  'world'
  '123'

  You can also disable a previous rule by using None.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     with stream.syntax(number=r"[0-9]+", word=None):
  ...         stream.expect("word").value
  Traceback (most recent call last):
  UnexpectedToken: Expected word but got invalid 'hello world 123'.
 - reset_syntax(**kwargs)#
- Overwrite the existing syntax rules.

  This method lets you temporarily overwrite the existing rules instead of extending them.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     with stream.reset_syntax(number=r"[0-9]+"):
  ...         stream.expect("word").value
  ...         stream.expect("word").value
  ...         stream.expect("number").value
  Traceback (most recent call last):
  UnexpectedToken: Expected word but got invalid 'hello world 123'.
 - indent(enable=True, skip=None)#
- Enable or disable indentation.

  When indentation is enabled the token stream will track the current indentation level and emit indent and dedent tokens when the indentation level changes. The indent and dedent tokens are always balanced; every indent token will ultimately be paired with a dedent token.

  >>> stream = TokenStream("hello\n\tworld")
  >>> with stream.syntax(word=r"[a-z]+"), stream.indent():
  ...     stream.expect("word").value
  ...     stream.expect("indent").value
  ...     stream.expect("word").value
  ...     stream.expect("dedent").value
  'hello'
  ''
  'world'
  ''

  The skip argument allows you to prevent some types of tokens from triggering indentation changes. The most common use-case would be ignoring indentation introduced by comments.

  with stream.syntax(word=r"[a-z]+", comment=r"#.+$"), stream.indent(skip=["comment"]):
      stream.expect("word")
      stream.expect("indent")
      stream.expect("word")
      stream.expect("dedent")

  You can also use the indent() method to temporarily disable indentation by specifying enable=False. This is different from simply ignoring indent and dedent tokens with the ignore() method because it clears the indentation stack, and if you decide to re-enable indentation afterwards the indentation level will start back at 0.

  with stream.indent(enable=False):
      ...
 - intercept(*token_types)#
- Intercept tokens matching the given types.

  This tells the stream to not skip over previously ignored tokens or tokens ignored by default like newline and whitespace.

  >>> stream = TokenStream("hello world\n")
  >>> with stream.syntax(word=r"[a-z]+"), stream.intercept("newline", "whitespace"):
  ...     stream.expect("word").value
  ...     stream.expect("whitespace").value
  ...     stream.expect("word").value
  ...     stream.expect("newline").value
  'hello'
  ' '
  'world'
  '\n'

  You can use the ignore() method to ignore previously intercepted tokens.
 - ignore(*token_types)#
- Ignore tokens matching the given types.

  This tells the stream to skip over tokens matching any of the given types.

  >>> stream = TokenStream("hello 123 world")
  >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"), stream.ignore("number"):
  ...     stream.expect("word").value
  ...     stream.expect("word").value
  'hello'
  'world'

  You can use the intercept() method to stop ignoring tokens.
 - property current: Token#
- The current token.

  Can only be accessed once the stream has started extracting tokens.
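  For instance, right after a successful expect() the current token is the token that was just extracted (a small illustrative sketch):

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     stream.expect("word").value
  ...     stream.current.value
  'hello'
  'hello'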
 - property previous: Token#
- The previous token.

  This is the token extracted immediately before the current one, so it's not affected by the ignore() method.
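  For example, since whitespace is ignored by default, after expecting both words the previous token is the whitespace separating them (compare with peek(-1) below):

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     stream.expect("word").value
  ...     stream.expect("word").value
  ...     stream.previous.value
  'hello'
  'world'
  ' '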
 - property leftover: str#
- The remaining input.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     stream.expect("word").value
  ...     stream.leftover
  'hello'
  ' world'
 - head(characters=50)#
- Preview the characters ahead of the current token.

  This is useful for error messages and for visualizing the input following the current token.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     stream.expect("word").value
  ...     stream.head()
  'hello'
  ' world'

  The generated string is truncated to 50 characters by default but you can change this with the characters argument.
 - emit_token(token_type, value='')#
- Generate a token in the token stream.

  Should be considered internal. Used by the generate_tokens() method.
 - emit_error(exc)#
- Add location information to invalid syntax exceptions.

  >>> stream = TokenStream("hello world")
  >>> raise stream.emit_error(InvalidSyntax("foo"))
  Traceback (most recent call last):
  InvalidSyntax: foo
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     stream.expect().value
  'hello'
  >>> exc = stream.emit_error(InvalidSyntax("foo"))
  >>> exc.location
  SourceLocation(pos=5, lineno=1, colno=6)
 - generate_tokens()#
- Extract tokens from the input string.

  Should be considered internal. This is the underlying generator being driven by the stream.
 - peek(n=1)#
- Peek around the current token.

  The method returns the next token in the stream without advancing the stream to the next token.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     stream.peek().value
  ...     stream.expect("word").value
  'hello'
  'hello'

  You can also peek multiple tokens ahead.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     stream.peek(2).value
  ...     stream.expect("word").value
  'world'
  'hello'

  Negative values will let you peek backwards. It's generally better to use peek(-1) over the previous attribute because the peek() method will take ignored tokens into account.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     stream.expect("word").value
  ...     stream.expect("word").value
  ...     stream.peek(-1).value
  'hello'
  'world'
  'hello'
  >>> stream.previous.value
  ' '
 - peek_until(*patterns)#
- Collect tokens until one of the given patterns matches.

  >>> stream = TokenStream("hello world; foo")
  >>> with stream.syntax(word=r"[a-z]+", semi=r";"):
  ...     for token in stream.peek_until("semi"):
  ...         stream.expect("word").value
  ...     stream.current.value
  ...     stream.leftover
  'hello'
  'world'
  ';'
  ' foo'

  The method will raise an error if the end of the stream is reached before encountering any of the given patterns.

  >>> stream = TokenStream("hello world foo")
  >>> with stream.syntax(word=r"[a-z]+", semi=r";"):
  ...     for token in stream.peek_until("semi"):
  ...         stream.expect("word").value
  Traceback (most recent call last):
  UnexpectedEOF: Expected semi but reached end of file.

  If the method is called without any pattern the iterator will yield tokens until the end of the stream.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     print([stream.expect("word").value for _ in stream.peek_until()])
  ['hello', 'world']
 - collect() → Iterator[Token]#
- collect(pattern: str | tuple[str, str], /) → Iterator[Token]
- collect(pattern1: str | tuple[str, str], pattern2: str | tuple[str, str], /, *patterns: str | tuple[str, str]) → Iterator[list[tokenstream.token.Token | None]]
- Collect tokens matching the given patterns.

  Calling the method without any arguments is similar to iterating over the stream directly. If you provide one or more arguments the iterator will stop if it encounters a token that doesn't match any of the given patterns.

  >>> stream = TokenStream("hello world; foo")
  >>> with stream.syntax(word=r"[a-z]+", semi=r";"):
  ...     for token in stream.collect("word"):
  ...         token.value
  ...     stream.leftover
  'hello'
  'world'
  '; foo'

  If you provide more than one pattern the method will yield a sequence of the same size where the token will be at the index of the pattern that matched the token.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
  ...     for word, number in stream.collect("word", "number"):
  ...         if word:
  ...             print("word", word.value)
  ...         elif number:
  ...             print("number", number.value)
  word hello
  word world
  number 123

  There is one small difference between iterating over the stream directly and using the method without any argument. The collect() method will raise an exception if it encounters an invalid token.

  >>> stream = TokenStream("foo")
  >>> with stream.syntax(number=r"[0-9]+"):
  ...     for token in stream.collect():
  ...         token
  Traceback (most recent call last):
  UnexpectedToken: Expected anything but got invalid 'foo'.

  When you iterate over the stream directly the tokens are unfiltered.

  >>> stream = TokenStream("foo")
  >>> with stream.syntax(number=r"[0-9]+"):
  ...     for token in stream:
  ...         token
  Token(type='invalid', value='foo', ...)
 - collect_any(*patterns)#
- Collect tokens matching one of the given patterns.

  The method is similar to collect() but will always return a single value. This works pretty nicely with Python 3.10+ match statements.

  for token in stream.collect_any("word", "number"):
      match token:
          case Token(type="word"):
              print("word", token.value)
          case Token(type="number"):
              print("number", token.value)
 - expect() → Token#
- expect(pattern: str | tuple[str, str], /) → Token
- expect(pattern1: str | tuple[str, str], pattern2: str | tuple[str, str], /, *patterns: str | tuple[str, str]) → list[tokenstream.token.Token | None]
- Match the given patterns and raise an exception if the next token doesn't match.

  The expect() method lets you retrieve tokens one at a time.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
  ...     stream.expect().value
  ...     stream.expect().value
  ...     stream.expect().value
  'hello'
  'world'
  '123'

  You can provide a pattern and if the extracted token doesn't match the method will raise an exception.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
  ...     stream.expect("number").value
  Traceback (most recent call last):
  UnexpectedToken: Expected number but got word 'hello'.

  The method will also raise an exception if the stream ended.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     stream.expect("word").value
  ...     stream.expect("word").value
  ...     stream.expect("word").value
  Traceback (most recent call last):
  UnexpectedEOF: Expected word but reached end of file.

  The method works a bit like collect() and lets you know which pattern matched the extracted token if you provide more than one pattern.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
  ...     word, number = stream.expect("word", "number")
  ...     if word:
  ...         print("word", word.value)
  ...     elif number:
  ...         print("number", number.value)
  word hello
 - get(*patterns)#
- Return the next token if it matches any of the given patterns.

  The method works a bit like expect() but will return None instead of raising an exception if none of the given patterns match. If there are no more tokens the method will also return None.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
  ...     stream.get("word").value
  ...     stream.get("number") is None
  ...     stream.get("word").value
  ...     stream.get("number").value
  ...     stream.get() is None
  'hello'
  True
  'world'
  '123'
  True
 - expect_any(*patterns)#
- Make sure that the next token matches one of the given patterns or raise an exception.

  The method is similar to expect() but will always return a single value. This works pretty nicely with Python 3.10+ match statements.

  match stream.expect_any("word", "number"):
      case Token(type="word") as word:
          print("word", word.value)
      case Token(type="number") as number:
          print("number", number.value)
 - expect_eof()#
- Raise an exception if there is leftover input.

  >>> stream = TokenStream("hello world 123 foo")
  >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
  ...     for token in stream.collect("word"):
  ...         token.value
  ...     stream.expect("number").value
  'hello'
  'world'
  '123'
  >>> stream.expect_eof()
  Traceback (most recent call last):
  UnexpectedToken: Expected eof but got invalid 'foo'.
 - checkpoint()#
- Reset the stream to the current token at the end of the with statement.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     with stream.checkpoint():
  ...         stream.expect("word").value
  ...     stream.expect("word").value
  'hello'
  'hello'

  You can use the returned handle to keep the state of the stream at the end of the with statement. For more details check out CheckpointCommit.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     with stream.checkpoint() as commit:
  ...         stream.expect("word").value
  ...         commit()
  ...     stream.expect("word").value
  'hello'
  'world'

  The context manager will swallow syntax errors until the handle commits the checkpoint.
 - alternative(active=True)#
- Keep going if the code within the with statement raises a syntax error.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
  ...     stream.expect("word").value
  ...     stream.expect("word").value
  ...     with stream.alternative():
  ...         stream.expect("word").value
  ...     stream.expect("number").value
  'hello'
  'world'
  '123'

  You can optionally provide a boolean to deactivate the context manager dynamically.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
  ...     with stream.alternative(False):
  ...         stream.expect("number").value
  Traceback (most recent call last):
  UnexpectedToken: Expected number but got word 'hello'.
 - choose(*args)#
- Iterate over each argument until one of the alternatives succeeds.

  >>> stream = TokenStream("hello world 123")
  >>> with stream.syntax(word=r"[a-z]+", number=r"[0-9]+"):
  ...     while stream.peek():
  ...         for token_type, alternative in stream.choose("word", "number"):
  ...             with alternative:
  ...                 stream.expect(token_type).value
  'hello'
  'world'
  '123'
 - provide(**data)#
- Provide arbitrary user data.

  >>> stream = TokenStream("hello world")
  >>> with stream.provide(foo=123):
  ...     stream.data["foo"]
  123
 - reset(*args)#
- Temporarily reset arbitrary user data.

  >>> stream = TokenStream("hello world")
  >>> with stream.provide(foo=123):
  ...     stream.data["foo"]
  ...     with stream.reset("foo"):
  ...         stream.data
  ...     stream.data
  123
  {}
  {'foo': 123}
 - copy()#
- Return a copy of the stream.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     stream.expect("word").value
  ...     stream_copy = stream.copy()
  ...     stream.expect("word").value
  'hello'
  'world'
  >>> with stream_copy.syntax(letter=r"[a-z]"):
  ...     [token.value for token in stream_copy]
  ['w', 'o', 'r', 'l', 'd']
 
- class tokenstream.stream.CheckpointCommit(index, rollback=True)#
- Handle for managing checkpoints. 
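  The handle is normally obtained from TokenStream.checkpoint(). A small sketch of typical usage, assuming that calling the handle clears its rollback flag:

  with stream.syntax(word=r"[a-z]+"):
      with stream.checkpoint() as commit:
          stream.expect("word")
          # commit.index records the token index captured when the checkpoint
          # was created; commit.rollback starts out True, and calling the handle
          # presumably sets it to False so the stream keeps its position at the
          # end of the with statement.
          commit()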
Token#
- class tokenstream.token.Token(type, value, location, end_location)#
- Class representing a token.

 - location: SourceLocation#
- Alias for field number 2
 - end_location: SourceLocation#
- Alias for field number 3 
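  For instance, mirroring the emit_error() example below, the first word of the input starts at column 1 and ends at column 6:

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     token = stream.expect("word")
  ...     token.location
  ...     token.end_location
  SourceLocation(pos=0, lineno=1, colno=1)
  SourceLocation(pos=5, lineno=1, colno=6)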
 - match(*patterns)#
- Match the token against one or more patterns.

  Each argument can be either a string corresponding to a token type or a tuple with a token type and a token value.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     print(f"{stream.expect().match(('word', 'hello')) = }")
  ...     print(f"{stream.expect().match('word') = }")
  stream.expect().match(('word', 'hello')) = True
  stream.expect().match('word') = True
 - emit_error(exc)#
- Add location information to invalid syntax exceptions.

  This works exactly like tokenstream.stream.TokenStream.emit_error() but it associates the location of the token with the syntax error instead of the head of the stream.

  >>> stream = TokenStream("hello world")
  >>> with stream.syntax(word=r"[a-z]+"):
  ...     token = stream.expect()
  ...     exc = token.emit_error(InvalidSyntax("goodbye"))
  ...     raise exc
  Traceback (most recent call last):
  InvalidSyntax: goodbye
  >>> exc.location
  SourceLocation(pos=0, lineno=1, colno=1)
  >>> exc.end_location
  SourceLocation(pos=5, lineno=1, colno=6)
 
Location#
- class tokenstream.location.SourceLocation(pos, lineno, colno)#
- Class representing a location within an input string.

 - property unknown: bool#
- Whether the location is unknown.

  >>> location = UNKNOWN_LOCATION
  >>> location.unknown
  True
 - format(filename, message)#
- Return a message formatted with the given filename and the current location.

  >>> SourceLocation(42, 3, 12).format("path/to/file.txt", "Some error message")
  'path/to/file.txt:3:12: Some error message'
 - with_horizontal_offset(offset)#
- Create a modified source location along the horizontal axis.

  >>> INITIAL_LOCATION.with_horizontal_offset(41)
  SourceLocation(pos=41, lineno=1, colno=42)
 - skip_over(value)#
- Return the source location after skipping over a piece of text.

  >>> INITIAL_LOCATION.skip_over("hello\nworld")
  SourceLocation(pos=11, lineno=2, colno=6)
 - map(input_mappings, output_mappings)#
- Map a source location.

  The mappings must contain corresponding source locations in order.

  >>> INITIAL_LOCATION.map([], [])
  SourceLocation(pos=0, lineno=1, colno=1)
  >>> mappings1 = [SourceLocation(16, 2, 27), SourceLocation(19, 2, 30)]
  >>> mappings2 = [SourceLocation(24, 3, 8), SourceLocation(67, 4, 12)]
  >>> INITIAL_LOCATION.map(mappings1, mappings2)
  SourceLocation(pos=0, lineno=1, colno=1)
  >>> SourceLocation(15, 2, 26).map(mappings1, mappings2)
  SourceLocation(pos=15, lineno=2, colno=26)
  >>> SourceLocation(16, 2, 27).map(mappings1, mappings2)
  SourceLocation(pos=24, lineno=3, colno=8)
  >>> SourceLocation(18, 2, 29).map(mappings1, mappings2)
  SourceLocation(pos=26, lineno=3, colno=10)
  >>> SourceLocation(19, 2, 30).map(mappings1, mappings2)
  SourceLocation(pos=67, lineno=4, colno=12)
  >>> SourceLocation(31, 3, 6).map(mappings1, mappings2)
  SourceLocation(pos=79, lineno=5, colno=6)
 - relocate(base_location, target_location)#
- Return the current location transformed relative to the target location. 
 
- tokenstream.location.set_location(obj, location=SourceLocation(pos=-1, lineno=0, colno=0), end_location=SourceLocation(pos=-1, lineno=0, colno=0))#
- Set the location and end_location attributes.

  The function returns the given object or a new instance if the object is a namedtuple or a frozen dataclass. The location can be copied from another object with location and end_location attributes.

  >>> token = Token("number", "123", UNKNOWN_LOCATION, UNKNOWN_LOCATION)
  >>> updated_token = set_location(token, SourceLocation(15, 6, 1))
  >>> updated_token
  Token(type='number', value='123', location=SourceLocation(pos=15, lineno=6, colno=1), end_location=SourceLocation(pos=15, lineno=6, colno=1))
  >>> updated_token = set_location(
  ...     updated_token,
  ...     end_location=updated_token.location.with_horizontal_offset(len(updated_token.value)),
  ... )
  >>> set_location(token, updated_token)
  Token(type='number', value='123', location=SourceLocation(pos=15, lineno=6, colno=1), end_location=SourceLocation(pos=18, lineno=6, colno=4))
Exceptions#
- class tokenstream.error.InvalidSyntax(*args)#
- Bases: Exception

  Raised when the input contains invalid syntax.

 - location#
- The location of the error. 
 - end_location#
- The end location of the error. 
 - alternatives#
- A dictionary holding other alternative errors associated with the exception. 
 - format(filename)#
- Return a string representing the error and its location in a given file.

  >>> try:
  ...     TokenStream("hello").expect()
  ... except InvalidSyntax as exc:
  ...     print(exc.format("path/to/my_file.txt"))
  path/to/my_file.txt:1:1: Expected anything but got invalid 'hello'.
 - add_alternative(exc)#
- Associate an alternative error. 
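  A hedged sketch of how a parser might attach the original failure to a more descriptive error before re-raising (the message and pattern names are illustrative):

  try:
      stream.expect("number")
  except InvalidSyntax as original:
      error = stream.emit_error(InvalidSyntax("expected a numeric literal"))
      error.add_alternative(original)  # keep the original failure attached to the new error
      raise error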
 
- class tokenstream.error.UnexpectedEOF(expected_patterns=())#
- Bases: InvalidSyntax

  Raised when the input ends unexpectedly.

 - expected_patterns#
- The patterns that the parser was expecting instead of reaching the end of the file.
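  For example, a parsing loop might catch the error to detect truncated input (an illustrative sketch):

  try:
      stream.expect("word")
  except UnexpectedEOF as exc:
      print(exc.expected_patterns)  # the patterns that failed to match before the input ended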
 - add_alternative(exc)#
- Associate an alternative error. 
 
- class tokenstream.error.UnexpectedToken(token, expected_patterns=())#
- Bases: InvalidSyntax

  Raised when the input contains an unexpected token.

 - token#
- The unexpected token that was encountered. 
 - expected_patterns#
- The patterns that the parser was expecting instead. 
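  For instance, building on the format() example above (the exact contents of expected_patterns depend on what was passed to expect()):

  try:
      TokenStream("hello").expect("number")  # no syntax rules, so the input is tokenized as "invalid"
  except UnexpectedToken as exc:
      print(exc.token.type, exc.token.value)  # invalid hello
      print(exc.expected_patterns)            # presumably ("number",)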
 - add_alternative(exc)#
- Associate an alternative error.