A Deep Dive into Python's Tokenizer

September 18, 2019

Python's reference implementation, CPython, makes a concerted effort to minimize complexity. While this has some consequences around performance of the language and short-term feature development, it means that CPython's source code is surprisingly approachable for a language of its size and stature.

By pulling apart tokenization, the first stage in the execution of any Python program, I hope to show just how approachable CPython's internals are.

The job of a tokenizer, lexer, or scanner is to convert a stream of characters or bytes into a stream of words, or "tokens".
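To make that concrete, here's what Python's own pure-Python tokenize module (which we'll look at more closely later in this post) produces for a tiny, made-up one-line program; the snippet itself is just for illustration.

import io
import tokenize

# Turn the characters of "x = 1" into a stream of tokens and print each
# token's type and text.
for tok in tokenize.generate_tokens(io.StringIO("x = 1\n").readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))

Running this prints NAME 'x', EQUAL '=', and NUMBER '1', followed by NEWLINE and ENDMARKER tokens that mark the end of the logical line and the end of the input.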


Some compilers don't explicitly separate the processes of tokenization and parsing. Certain parsers are capable of implicitly forming tokens themselves. Python uses the separation of these two stages to its advantage, both to simplify the parser, and to apply a few "lexer hacks", which we'll discuss later.

While it's undocumented and unsupported, CPython exports its internal C-based tokenizer API. This is the same tokenizer used to parse Python source code prior to execution. The complete implementation lives in Parser/tokenizer.c.

Let's start by making sure we have CPython's development headers installed on our machine. Some operating systems or distributions may install this by default, but on Debian and Ubuntu, we have to install a separate package.

$ apt install python3-dev

Once that's done, we can cross-reference Parser/tokenizer.h to find the functions we care about, and call them.

#include <stdio.h>
#include <Python.h>
#include <token.h>

typedef void tok_state_t;

// We need to declare these ourselves since they're undocumented and CPython
// doesn't expose any header files with their signatures.
extern tok_state_t *PyTokenizer_FromFile(FILE *, const char *, const char *, const char *);
extern void PyTokenizer_Free(tok_state_t *);
extern int PyTokenizer_Get(tok_state_t *, char **, char **);

// This is a toy. Don't use this in production environments.
int main() {
    tok_state_t *tok = PyTokenizer_FromFile(stdin, NULL, NULL, NULL);
    char *start, *end;

    // Fetch each token and print out its type.
    while (1) {
        int tok_type = PyTokenizer_Get(tok, &start, &end);
        printf("%s\n", _PyParser_TokenNames[tok_type]);
        if (tok_type == ENDMARKER) {
            break;
        }
    }

    PyTokenizer_Free(tok);
}

We can compile this with gcc or clang. I'm using pkg-config in this example to generate the flags needed for my C compiler.

$ cc $(pkg-config --cflags --libs python3) main.c -o pytokenize

Once that's done, we're left with a pytokenize binary that will accept source code on stdin, and will print a sequence of token types to stdout. [1]

$ ./pytokenize <<< 'if test: fn() # comment'
NAME
NAME
COLON
NAME
LPAR
RPAR
NEWLINE
ENDMARKER

You'll notice that this tokenizer drops comments [2] and whitespace. Neither of these artifacts matters to the execution of the program, so discarding them now will make the later step of parsing easier.

The tokenizer's central function, tok_get, fetches a single token and advances the tokenizer's position to the end of that token. It's implemented as a state machine written with a series of conditionals and gotos.

The function takes the current state and returns a token type (represented with an int). It also takes two pointers which it will update to point to the start and end positions of the current token.

static int
tok_get(struct tok_state *tok, char **p_start, char **p_end)
{

A variable, "c", is declared. This will store our current byte. tok_get processes a single UTF-8 byte at a time, but an int is used here instead of a single-byte unsigned char so that it can optionally store a sentinel EOF (end of file) value encoded as -1.

    int c;
    /* ... */

Before attempting to find tokens, tok_get first strips out whitespace and comments until the first tokenizable character is found.

    tok->start = NULL;

    /* Skip spaces */
    do {
        c = tok_nextc(tok);
    } while (c == ' ' || c == '\t' || c == '\014');

    /* Set start of current token */
    tok->start = tok->cur - 1;

    /* Skip comment */
    if (c == '#') {
        while (c != EOF && c != '\n') {
            c = tok_nextc(tok);
        }
    }

    /* ... */

This is followed by a series of top-level conditional statements where the next character, c, is tested to guess the next type of token that needs to be produced.

The order of these conditionals roughly matches the frequency of each token type in a normal piece of source code, reducing the average number of branches that need to be evaluated. Since identifiers (names) are the most common type of token, that test comes first.


    if (is_potential_identifier_start(c)) {
        /* ... */
    }

    /* ... */

    if (c == '\n') {
        /* ... */
    }

    /* ... */

    if (c == '\'' || c == '"') {
        /* ... */
    }

However, a single character isn't always enough to determine the type of a token. Let's process the byte-string b"abc".

The first character, b, is a valid starting character for an identifier, so we optimistically expect an identifier and take the first branch. Once we encounter the second character, a quote, we realize that b"abc" is actually a byte-string, and we must jump to the string handling logic, breaking out of the identifier processing block.

That effectively gives us a state machine built out of these conditional blocks.

These jumps between conditional blocks are implemented with gotos. While goto statements are often frowned upon, they're perfect for representing a state machine, and their use makes the code easier to follow.

    if (is_potential_identifier_start(c)) {
        /* ... */
        while (1) {
            /* ... */
            c = tok_nextc(tok);
            if (c == '"' || c == '\'') {
                goto letter_quote;
            }
        }
        return NAME;
    }

    /* ... */

  letter_quote:
    /* String */
    if (c == '\'' || c == '"') {
        /* ... */
        return STRING;
    }
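Python has no goto, but we can sketch the same identifier-to-string hand-off in a few lines of Python. This is a toy illustration under simplified assumptions, not CPython's actual logic: real prefix handling, triple quotes, escapes, and error recovery are all omitted, and the scan_token/scan_string helpers are made up for this example.

STRING_PREFIXES = {"b", "r", "u", "f", "rb", "br", "fr", "rf"}

def scan_string(text, start, quote_pos):
    # Naive string scan: find the matching closing quote (escapes ignored).
    quote = text[quote_pos]
    end = text.index(quote, quote_pos + 1) + 1
    return ("STRING", text[start:end]), end

def scan_token(text, pos):
    c = text[pos]
    if c.isalpha() or c == "_":  # potential identifier start
        end = pos
        while end < len(text) and (text[end].isalnum() or text[end] == "_"):
            end += 1
        # The "goto letter_quote" case: what looked like an identifier was
        # really a string prefix, so hand off to the string scanner.
        if end < len(text) and text[end] in "'\"" and text[pos:end].lower() in STRING_PREFIXES:
            return scan_string(text, pos, end)
        return ("NAME", text[pos:end]), end
    if c in "'\"":
        return scan_string(text, pos, pos)
    raise SyntaxError(f"unexpected character: {c!r}")

print(scan_token('b"abc"', 0))   # (('STRING', 'b"abc"'), 6)
print(scan_token('banana', 0))   # (('NAME', 'banana'), 6)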

For convenience, tok_get is wrapped in PyTokenizer_Get, which we saw earlier in our usage example.

int
PyTokenizer_Get(struct tok_state *tok, char **p_start, char **p_end)
{
    int result = tok_get(tok, p_start, p_end);
    if (tok->decoding_erred) {
        result = ERRORTOKEN;
        tok->done = E_DECODE;
    }
    return result;
}

PyTokenizer_Get is then called in a loop until every token in the file is extracted.

Reading the C implementation is useful for understanding how CPython works, but because it's undocumented, unsupported, and only accessible through a C API, it's impractical for most real-world uses.

Fortunately, Python's standard library contains a pure-Python reimplementation of the C-based tokenizer: the tokenize module. Its source code may be an easier read, since it replaces some of the C code's state management with regular expressions.
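For instance, the module assembles its scanner from regular-expression strings. A few of those pattern strings happen to be visible as module attributes; they're undocumented internals, so this peek is illustrative only and may break between releases.

import re
import tokenize

# Undocumented internals: tokenize.py builds its scanner out of regex
# pattern strings like these. Don't rely on them in real code.
print(tokenize.Name)      # the pattern used for identifiers
print(tokenize.Number)    # the pattern used for numeric literals
print(re.match(tokenize.Number, "0xff").group())  # matches '0xff'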

We can run this tokenizer over our example from earlier:

$ python3 -m tokenize --exact <<< 'if test: fn() # comment'
1,0-1,2: NAME 'if'
1,3-1,7: NAME 'test'
1,7-1,8: COLON ':'
1,9-1,11: NAME 'fn'
1,11-1,12: LPAR '('
1,12-1,13: RPAR ')'
1,15-1,24: COMMENT '# comment'
1,24-1,25: NEWLINE '\n'
2,0-2,0: ENDMARKER ''

You may notice that while it still excludes whitespace, this tokenizer emits COMMENT tokens. The pure-Python tokenize module aims to be useful as a standalone library, whereas the internal tokenizer.c implementation is only designed to track the semantic details of code.

Tools that read information from comments will sometimes use the pure-Python tokenize module to fetch those comments. For example, Flake8 uses it to find "# noqa" comments, which tell Flake8 to ignore errors found on those lines.
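Here is a rough sketch of that idea, built on generate_tokens; it is not Flake8's actual implementation, which handles many more cases.

import io
import tokenize

def noqa_lines(source):
    # Collect the line numbers of comments that mention "noqa".
    lines = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and "noqa" in tok.string.lower():
            lines.add(tok.start[0])
    return lines

print(noqa_lines("import os  # noqa\nimport sys\n"))  # {1}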

Token-Based Refactoring

The tokenize module has a cool trick up its sleeve: untokenize. This hidden gem can enable simple refactoring operations. Let's say that we wanted to rename old_function to new_function. We just need to extract all of the NAME tokens and modify them before calling untokenize.

from io import BytesIO
from tokenize import NAME, tokenize, untokenize

example_program = """
def old_function():  # should be updated to `new_function`
    ...

def another_old_function():  # should not get touched
    ...

old_function()  # should be updated to `new_function`
another_old_function()  # should not get touched
"""

old_tokens = tokenize(BytesIO(example_program.encode("utf-8")).readline)
new_tokens = []
for tok in old_tokens:
    if tok.type == NAME and tok.string == "old_function":
        new_tokens.append(tok._replace(string="new_function"))
    else:
        new_tokens.append(tok)

print(untokenize(new_tokens).decode("utf-8"))

Running this code gives us the updated program:

def new_function():  # should be updated to `new_function`
    ...

def another_old_function():  # should not get touched
    ...

new_function()  # should be updated to `new_function`
another_old_function()  # should not get touched

This is safer than a basic textual find-replace operation, because it won't accidentally touch strings, comments, or other identifiers with old_function in their name, like another_old_function.

However, because the token stream is low-level, and it's hard to derive a proper understanding of a program's meaning from it, more serious refactoring operations should be left to higher-level libraries, such as LibCST, Bowler, or RedBaron. You can read more about LibCST on the Instagram Engineering blog.

I mentioned earlier that CPython uses the separation of tokenization and parsing to its advantage to simplify the parser. This is evident in some of CPython's tokenizer "hacks".

Indent/Dedent Tokens

The first thing new developers tend to notice about Python is its indentation-sensitive syntax. While most languages use explicit markers (usually curly braces) to denote the beginning and end of a logical block, Python infers it from indentation.

This simple act transforms Python's grammar from "context-free", which is relatively easy and fast to parse, to one that is "context-sensitive". However, CPython uses an LL(1) parser, a context-free parsing algorithm. What gives?

CPython moves the complexity of managing the indentation's context out of the parser and into the tokenizer, where the separation of concerns allows it to be handled much more easily. If we take a program with indentation and process it with the tokenize module, we can see that it emits fake INDENT and DEDENT tokens.

$ python3 -m tokenize --exact <<EOF
if test:
    fn()
EOF
1,0-1,2: NAME 'if'
1,3-1,7: NAME 'test'
1,7-1,8: COLON ':'
1,8-1,9: NEWLINE '\n'
2,0-2,4: INDENT '    '
2,4-2,6: NAME 'fn'
2,6-2,7: LPAR '('
2,7-2,8: RPAR ')'
2,8-2,9: NEWLINE '\n'
3,0-3,0: DEDENT ''
3,0-3,0: ENDMARKER ''

The tokenizer implements this by pushing and popping indentation levels onto and off of a stack. The top of the stack (indents[-1]) always represents the current indentation level. We can see this in tokenize.py's implementation:

if column > indents[-1]:           # count indents or dedents
    indents.append(column)
    yield TokenInfo(INDENT, line[:pos], (lnum, 0), (lnum, pos), line)
while column < indents[-1]:
    ...
    indents = indents[:-1]
    yield TokenInfo(DEDENT, '', (lnum, pos), (lnum, pos), line)
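Because the levels live on a stack, dedenting out of several nested blocks at once pops the stack several times, emitting one DEDENT per popped level. A quick illustration using the tokenize module (the snippet is made up for this example):

import io
import tokenize

src = (
    "if a:\n"
    "    if b:\n"
    "        fn()\n"
    "x = 1\n"
)

# Falling from two nested blocks straight back to column 0 emits two
# consecutive DEDENT tokens before the final NAME token.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))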

Because indentation is only pertinent to statements, and not to expressions, this behavior is disabled inside of parentheses. The opening and closing parentheses are tracked by incrementing and decrementing a simple counter. When the counter is non-zero, indentation is not tracked and INDENT and DEDENT tokens are omitted.
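We can see this with the tokenize module too: inside parentheses, continuation lines show up as non-logical NL tokens, and no INDENT or DEDENT tokens are produced (again, the snippet is just for illustration).

import io
import tokenize

src = (
    "total = (1 +\n"
    "         2 +\n"
    "         3)\n"
)

# No INDENT/DEDENT appears here, even though the continuation lines are
# indented; the open parenthesis disables indentation tracking.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))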

async and await

Python 3.5 introduced the async and await keywords:

async def read_data(db):
    data = await db.fetch('SELECT ...')
    ...

This created a compatibility problem. Existing code could use async and await as variable names, but an identifier can't be both a keyword and a variable name. For example, while this code clearly uses async and await as variable names:

async = db.fetch('SELECT ...')
await = async.run()

This code is ambiguous:

await(db.fetch('SELECT ...'))

It's not possible to tell whether await should be a function call or a keyword.

This was mitigated with a tokenizer hack. await is only recognized as a keyword inside of an async function, and async is only recognized as a keyword when immediately in front of def.

This requires the tokenizer to be able to look ahead a single token for def when async is encountered. It also requires tracking indentation to determine when a function starts and ends, but that's not a problem, since the tokenizer already special-cases indentation.

Because async wasn't valid in front of a def keyword in older releases of Python, this change was perfectly backwards compatible.

Let's tokenize an example that uses async and await:

$ python3.6 -m tokenize <<< 'async def fn(): await value'
1,0-1,5: ASYNC 'async'
1,6-1,9: NAME 'def'
1,10-1,12: NAME 'fn'
1,12-1,13: OP '('
1,13-1,14: OP ')'
1,14-1,15: OP ':'
1,16-1,21: AWAIT 'await'
1,22-1,27: NAME 'value'
1,27-1,28: NEWLINE '\n'
2,0-2,0: ENDMARKER ''

You'll notice that async and await generate corresponding ASYNC and AWAIT tokens.

Let's try a different example, where async and await aren't used inside of an async function:

$ python3.6 -m tokenize <<< 'await = async.run()'
1,0-1,5: NAME 'await'
1,6-1,7: OP '='
1,8-1,13: NAME 'async'
1,13-1,14: OP '.'
1,14-1,17: NAME 'run'
1,17-1,18: OP '('
1,18-1,19: OP ')'
1,19-1,20: NEWLINE '\n'
2,0-2,0: ENDMARKER ''

Here, async and await just generate normal NAME tokens!

We ran these examples with Python 3.6. This hack was removed in Python 3.7, where async and await are always treated as keywords. This breaks backwards compatibility, but it also allows async and await to be used in more places, such as async generators inside synchronous functions.

Python 3.8 adds back some of the code needed to support async and await as variable names, but only when a special feature_version flag is passed to the ast module. This enables tools (such as linters and type-checkers) running on Python 3.8 to inspect code written for an older Python release.
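As a quick sketch of how that flag is used (assuming Python 3.8, where feature_version can be set below (3, 7); newer releases have raised the minimum supported value and may reject this call), the ambiguous example from earlier parses cleanly when we ask for an older grammar:

import ast

# On Python 3.8, requesting an older grammar lets async and await be
# parsed as ordinary identifiers again. This is an illustrative sketch;
# check the ast docs for the feature_version values your release accepts.
tree = ast.parse("await = async.run()\n", feature_version=(3, 6))
print(ast.dump(tree))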

We already mentioned two tokenizers in Python's reference implementation, but it turns out that there's a third. The undocumented but popular lib2to3 library uses a fork of the pure-Python tokenizer we saw earlier.

This tokenizer generates token objects in a slightly different format, and is designed to support Python 2 syntax in addition to some Python 3 syntax. lib2to3's tokenizer isn't as well supported as the standard library's tokenizer, so unless you need to work with Python 2 or lib2to3, you should steer clear of it.

Outside of CPython, there's a ton of different tokenizers, many of which are forks of each other. Here's just a small selection of interesting Python tokenizers:

The tokenizer is just the first in a series of steps. Like the tokenizer, each step in this process is designed to iteratively simplify the daunting problem of program execution.

Now that our source code has been converted to a sequence of tokens, CPython can use that to form a parse tree, which will eventually be transformed into a similar but higher-level abstract syntax tree. The parse tree is generated using an LL(1) parsing algorithm that forms the basis of another state machine.

A symbol table is generated by stepping through the abstract syntax tree. The symbol table step handles the logic required for dealing with scopes, tracking where a given local variable name is stored.

The abstract syntax tree and symbol table then serve as input to a compiler that generates a sequence of bytecode instructions that can be executed inside of CPython's virtual machine.
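All of these later stages can be poked at from Python itself. Here's a small sketch using the standard library's ast, symtable, and dis modules (it skips the intermediate parse tree, which isn't as conveniently exposed):

import ast
import dis
import symtable

src = "if test:\n    fn()\n"

# The abstract syntax tree for the snippet.
print(ast.dump(ast.parse(src)))

# The module-level symbol table: the names the snippet references.
table = symtable.symtable(src, "<example>", "exec")
print([sym.get_name() for sym in table.get_symbols()])

# The bytecode the compiler emits for the same snippet.
dis.dis(compile(src, "<example>", "exec"))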

[1] If you're unfamiliar with the <<< syntax used here, that's because it's a here string. The examples in this post make heavy use of them, along with here documents. Here strings make it easy to provide input to a process's stdin.

[2] If you're using Python 3.8 (or a later release), it's possible to configure the tokenizer and parser to retain type comments. This feature makes it easier for static type checkers like mypy to re-use CPython's parser.

The views expressed on this site are my own and do not reflect those of my employer.
