A Deep Dive into Python's Tokenizer

September 18, 2019

Python's reference implementation, CPython, makes a concerted effort to minimize complexity. While this has some consequences around performance of the language and short-term feature development, it means that CPython's source code is surprisingly approachable for a language of its size and stature.

By pulling apart tokenization, the first stage in the execution of any Python program, I hope to show just how approachable CPython's internals are.

The job of a tokenizer, lexer, or scanner is to convert a stream of characters or bytes into a stream of words, or "tokens".
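To make that concrete, here's what Python's own pure-Python tokenize module (which we'll look at more closely later in this post) produces for a tiny, made-up one-line program; the snippet itself is just for illustration.

import io
import tokenize

# Turn the characters of "x = 1" into a stream of tokens and print each
# token's type and text.
for tok in tokenize.generate_tokens(io.StringIO("x = 1\n").readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))

Running this prints NAME 'x', EQUAL '=', and NUMBER '1', followed by NEWLINE and ENDMARKER tokens that mark the end of the logical line and the end of the input.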


Some compilers don't explicitly separate the processes of tokenization and parsing. Certain parsers are capable of implicitly forming tokens themselves. Python uses the separation of these two stages to its advantage, both to simplify the parser, and to apply a few "lexer hacks", which we'll discuss later.

While it's undocumented and unsupported, CPython exports its internal C-based tokenizer API. This is the same tokenizer used to parse Python source code prior to execution. The complete implementation lives in Parser/tokenizer.c.

Let's start by making sure we have CPython's development headers installed on our machine. Some operating systems or distributions may install this by default, but on Debian and Ubuntu, we have to install a separate package.

$ apt install python3-dev

Once that's done, we can cross-reference Parser/tokenizer.h to find the functions we care about, and call them.

#include <stdio.h>
#include <Python.h>
#include <token.h>

typedef void tok_state_t;

// We need to declare these ourselves since they're undocumented and CPython
// doesn't expose any header files with their signatures.
extern tok_state_t *PyTokenizer_FromFile(FILE *, const char *, const char *, const char *);
extern void PyTokenizer_Free(tok_state_t *);
extern int PyTokenizer_Get(tok_state_t *, char **, char **);

// This is a toy. Don't use this in production environments.
int main() {
    tok_state_t *tok = PyTokenizer_FromFile(stdin, NULL, NULL, NULL);
    char *start, *end;

    // Fetch each token and print out its type.
    while (1) {
        int tok_type = PyTokenizer_Get(tok, &start, &end);
        printf("%s\n", _PyParser_TokenNames[tok_type]);
        if (tok_type == ENDMARKER) {
            break;
        }
    }

    PyTokenizer_Free(tok);
}

We can compile this with gcc or clang. I'm using pkg-config in this example to generate the flags needed for my C compiler.

$ cc $(pkg-config --cflags --libs python3) main.c -o pytokenize

Once that's done, we're left with a pytokenize binary that will accept source code on stdin, and will print a sequence of token types to stdout. [1]

$ ./pytokenize <<< 'if test: fn() # comment'
NAME
NAME
COLON
NAME
LPAR
RPAR
NEWLINE
ENDMARKER

You'll notice that this tokenizer drops comments [2] and whitespace. Neither of these artifacts matters to the execution of the program, so discarding them now will make the later step of parsing easier.

The tokenizer's central function, tok_get, fetches a single token and advances the tokenizer's position to the end of that token. It's implemented as a state machine written with a series of conditionals and gotos.

The function takes the current state and returns a token type (represented with an int). It also takes two pointers which it will update to point to the start and end positions of the current token.

static int
tok_get(struct tok_state *tok, char **p_start, char **p_end)
{

A variable, "c", is declared. This will store our current byte. tok_get processes a single UTF-8 byte at a time, but an int is used here instead of a single-byte unsigned char so that it can optionally store a sentinel EOF (end of file) value encoded as -1.

    int c;
    /* ... */

Before attempting to find tokens, tok_get first strips out whitespace and comments until the first tokenizable character is found.

    tok->start = NULL;

    /* Skip spaces */
    do {
        c = tok_nextc(tok);
    } while (c == ' ' || c == '\t' || c == '\014');

    /* Set start of current token */
    tok->start = tok->cur - 1;

    /* Skip comment */
    if (c == '#') {
        while (c != EOF && c != '\n') {
            c = tok_nextc(tok);
        }
    }

    /* ... */

This is followed by a series of top-level conditional statements where the next character, c, is tested to guess the next type of token that needs to be produced.

The order of these conditionals roughly matches the frequency of each token type in a normal piece of source code, reducing the average number of branches that need to be evaluated. Since identifiers (names) are the most common type of token, that test comes first.


    if (is_potential_identifier_start(c)) {
        /* ... */
    }

    /* ... */

    if (c == '\n') {
        /* ... */
    }

    /* ... */

    if (c == '\'' || c == '"') {
        /* ... */
    }

However, a single character isn't always enough to determine the type of a token. Let's process the byte-string b"abc".

The first character, b, is a valid starting character for an identifier, so we optimistically expect an identifier and take the first branch. Once we encounter the second character, a quote, we realize that b"abc" is actually a byte-string, and we must jump to the string handling logic, breaking out of the identifier processing block.

That effectively gives us a state machine built out of these conditional blocks.

These jumps between conditional blocks are implemented with gotos. While goto statements are often frowned upon, they're perfect for representing a state machine, and their use makes the code easier to follow.

    if (is_potential_identifier_start(c)) {
        /* ... */
        while (1) {
            /* ... */
            c = tok_nextc(tok);
            if (c == '"' || c == '\'') {
                goto letter_quote;
            }
        }
        return NAME;
    }

    /* ... */

  letter_quote:
    /* String */
    if (c == '\'' || c == '"') {
        /* ... */
        return STRING;
    }
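Python has no goto, but we can sketch the same identifier-to-string hand-off in a few lines of Python. This is a toy illustration under simplified assumptions, not CPython's actual logic: real prefix handling, triple quotes, escapes, and error recovery are all omitted, and the scan_token/scan_string helpers are made up for this example.

STRING_PREFIXES = {"b", "r", "u", "f", "rb", "br", "fr", "rf"}

def scan_string(text, start, quote_pos):
    # Naive string scan: find the matching closing quote (escapes ignored).
    quote = text[quote_pos]
    end = text.index(quote, quote_pos + 1) + 1
    return ("STRING", text[start:end]), end

def scan_token(text, pos):
    c = text[pos]
    if c.isalpha() or c == "_":  # potential identifier start
        end = pos
        while end < len(text) and (text[end].isalnum() or text[end] == "_"):
            end += 1
        # The "goto letter_quote" case: what looked like an identifier was
        # really a string prefix, so hand off to the string scanner.
        if end < len(text) and text[end] in "'\"" and text[pos:end].lower() in STRING_PREFIXES:
            return scan_string(text, pos, end)
        return ("NAME", text[pos:end]), end
    if c in "'\"":
        return scan_string(text, pos, pos)
    raise SyntaxError(f"unexpected character: {c!r}")

print(scan_token('b"abc"', 0))   # (('STRING', 'b"abc"'), 6)
print(scan_token('banana', 0))   # (('NAME', 'banana'), 6)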

For convenience, tok_get is wrapped in PyTokenizer_Get, which we saw earlier in our usage example.

int
PyTokenizer_Get(struct tok_state *tok, char **p_start, char **p_end)
{
    int result = tok_get(tok, p_start, p_end);
    if (tok->decoding_erred) {
        result = ERRORTOKEN;
        tok->done = E_DECODE;
    }
    return result;
}

PyTokenizer_Get is then called in a loop until every token in the file is extracted.

Reading the C implementation is useful for understanding how CPython works, but because it's undocumented, unsupported, and only accessible through a C API, it's impractical for most real-world uses.

Fortunately, Python's standard library contains a pure-Python reimplementation of the C-based tokenizer: the tokenize module. Its source code may be an easier read, since it replaces some of the C code's state management with regular expressions.
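For instance, the module assembles its scanner from regular-expression strings. A few of those pattern strings happen to be visible as module attributes; they're undocumented internals, so this peek is illustrative only and may break between releases.

import re
import tokenize

# Undocumented internals: tokenize.py builds its scanner out of regex
# pattern strings like these. Don't rely on them in real code.
print(tokenize.Name)      # the pattern used for identifiers
print(tokenize.Number)    # the pattern used for numeric literals
print(re.match(tokenize.Number, "0xff").group())  # matches '0xff'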

We can run this tokenizer over our example from earlier:

$ python3 -m tokenize --exact <<< 'if test: fn() # comment'
1,0-1,2: NAME 'if'
1,3-1,7: NAME 'test'
1,7-1,8: COLON ':'
1,9-1,11: NAME 'fn'
1,11-1,12: LPAR '('
1,12-1,13: RPAR ')'
1,15-1,24: COMMENT '# comment'
1,24-1,25: NEWLINE '\n'
2,0-2,0: ENDMARKER ''

You may notice that while it still excludes whitespace, this tokenizer emits COMMENT tokens. The pure-Python tokenize module aims to be useful as a standalone library, whereas the internal tokenizer.c implementation is only designed to track the semantic details of code.

Tools that read information from comments will sometimes use the pure-Python tokenize module to fetch those comments. For example, Flake8 uses it to find "# noqa" comments, which tell Flake8 to ignore errors found on those lines.
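Here is a rough sketch of that idea, built on generate_tokens; it is not Flake8's actual implementation, which handles many more cases.

import io
import tokenize

def noqa_lines(source):
    # Collect the line numbers of comments that mention "noqa".
    lines = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and "noqa" in tok.string.lower():
            lines.add(tok.start[0])
    return lines

print(noqa_lines("import os  # noqa\nimport sys\n"))  # {1}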

Token-Based Refactoring

The tokenize module has a cool trick up its sleeve: untokenize. This hidden gem can enable simple refactoring operations. Let's say that we wanted to rename old_function to new_function. We just need to extract all of the NAME tokens and modify them before calling untokenize.

from io import BytesIO
from tokenize import NAME, tokenize, untokenize

example_program = """
def old_function():  # should be updated to `new_function`
    ...

def another_old_function():  # should not get touched
    ...

old_function()  # should be updated to `new_function`
another_old_function()  # should not get touched
"""

old_tokens = tokenize(BytesIO(example_program.encode("utf-8")).readline)
new_tokens = []
for tok in old_tokens:
    if tok.type == NAME and tok.string == "old_function":
        new_tokens.append(tok._replace(string="new_function"))
    else:
        new_tokens.append(tok)

print(untokenize(new_tokens).decode("utf-8"))

Running this code gives us the updated program:

def new_function():  # should be updated to `new_function`
    ...

def another_old_function():  # should not get touched
    ...

new_function()  # should be updated to `new_function`
another_old_function()  # should not get touched

This is safer than a basic textual find-replace operation, because it won't accidentally touch strings, comments, or other identifiers with old_function in their name, like another_old_function.

However, because the token stream is low-level, and it's hard to derive a proper understanding of a program's meaning from it, more serious refactoring operations should be left to higher-level libraries, such as LibCST, Bowler, or RedBaron. You can read more about LibCST on the Instagram Engineering blog.

I mentioned earlier that CPython uses the separation of tokenization and parsing to its advantage to simplify the parser. This is evident in some of CPython's tokenizer "hacks".

Indent/Dedent Tokens

The first thing new developers tend to notice about Python is its indentation-sensitive syntax. While most languages use explicit markers (usually curly braces) to denote the beginning and end of a logical block, Python infers it from indentation.

This simple act transforms Python's grammar from "context-free", which is relatively easy and fast to parse, to one that is "context-sensitive". However, CPython uses an LL(1) parser, a context-free parsing algorithm. What gives?

CPython moves the complexity of managing the indentation's context out of the parser and into the tokenizer, where the separation of concerns allows it to be handled much more easily. If we take a program with indentation and process it with the tokenize module, we can see that it emits fake INDENT and DEDENT tokens.

$ python3 -m tokenize --exact <<EOF
if test:
    fn()
EOF
1,0-1,2: NAME 'if'
1,3-1,7: NAME 'test'
1,7-1,8: COLON ':'
1,8-1,9: NEWLINE '\n'
2,0-2,4: INDENT '    '
2,4-2,6: NAME 'fn'
2,6-2,7: LPAR '('
2,7-2,8: RPAR ')'
2,8-2,9: NEWLINE '\n'
3,0-3,0: DEDENT ''
3,0-3,0: ENDMARKER ''

The tokenizer implements this by pushing and popping indentation levels onto and off of a stack. The top of the stack (indents[-1]) always represents the current indentation level. We can see this in tokenize.py's implementation:

if column > indents[-1]:           # count indents or dedents
    indents.append(column)
    yield TokenInfo(INDENT, line[:pos], (lnum, 0), (lnum, pos), line)
while column < indents[-1]:
    ...
    indents = indents[:-1]
    yield TokenInfo(DEDENT, '', (lnum, pos), (lnum, pos), line)
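Because the levels live on a stack, dedenting out of several nested blocks at once pops the stack several times, emitting one DEDENT per popped level. A quick illustration using the tokenize module (the snippet is made up for this example):

import io
import tokenize

src = (
    "if a:\n"
    "    if b:\n"
    "        fn()\n"
    "x = 1\n"
)

# Falling from two nested blocks straight back to column 0 emits two
# consecutive DEDENT tokens before the final NAME token.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))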

Because indentation is only pertinent to statements, and not to expressions, this behavior is disabled inside of parentheses. The opening and closing parentheses are tracked by incrementing and decrementing a simple counter. When the counter is non-zero, indentation is not tracked and INDENT and DEDENT tokens are omitted.
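We can see this with the tokenize module too: inside parentheses, continuation lines show up as non-logical NL tokens, and no INDENT or DEDENT tokens are produced (again, the snippet is just for illustration).

import io
import tokenize

src = (
    "total = (1 +\n"
    "         2 +\n"
    "         3)\n"
)

# No INDENT/DEDENT appears here, even though the continuation lines are
# indented; the open parenthesis disables indentation tracking.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))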

async and await

Python 3.5 introduced the async and await keywords:

async def read_data(db):
    data = await db.fetch('SELECT ...')
    ...

This created a compatibility problem. Existing code could use async and await as variable names, but an identifier can't be both a keyword and a variable name. For example, while this code clearly uses async and await as variable names:

async = db.fetch('SELECT ...')
await = async.run()

This code is ambiguous:

await(db.fetch('SELECT ...'))

It's not possible to tell whether await should be a function call or a keyword.

This was mitigated with a tokenizer hack. await is only recognized as a keyword inside of an async function, and async is only recognized as a keyword when immediately in front of def.

This requires the tokenizer to be able to look ahead a single token for def when async is encountered. It also requires tracking indentation to determine when a function starts and ends, but that's not a problem, since the tokenizer already special-cases indentation.

Because async wasn't valid in front of a def keyword in older releases of Python, this change was perfectly backwards compatible.

Let's tokenize an example that uses async and await:

$ python3.6 -m tokenize <<< 'async def fn(): await value'
1,0-1,5: ASYNC 'async'
1,6-1,9: NAME 'def'
1,10-1,12: NAME 'fn'
1,12-1,13: OP '('
1,13-1,14: OP ')'
1,14-1,15: OP ':'
1,16-1,21: AWAIT 'await'
1,22-1,27: NAME 'value'
1,27-1,28: NEWLINE '\n'
2,0-2,0: ENDMARKER ''

You'll notice that async and await generate corresponding ASYNC and AWAIT tokens.

Let's try a different example, where async and await aren't used inside of an async function:

$ python3.6 -m tokenize <<< 'await = async.run()'
1,0-1,5: NAME 'await'
1,6-1,7: OP '='
1,8-1,13: NAME 'async'
1,13-1,14: OP '.'
1,14-1,17: NAME 'run'
1,17-1,18: OP '('
1,18-1,19: OP ')'
1,19-1,20: NEWLINE '\n'
2,0-2,0: ENDMARKER ''

Here, async and await just generate normal NAME tokens!

We ran these examples with Python 3.6. This hack was removed in Python 3.7, where async and await are always treated as keywords. This breaks backwards compatibility, but it also allows async and await to be used in more places, such as async generators inside synchronous functions.

Python 3.8 adds back some of the code needed to support async and await as variable names, but only when a special feature_version flag is passed to the ast module. This enables tools (such as linters and type-checkers) running on Python 3.8 to inspect code written for an older Python release.
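As a quick sketch of how that flag is used (assuming Python 3.8, where feature_version can be set below (3, 7); newer releases have raised the minimum supported value and may reject this call), the ambiguous example from earlier parses cleanly when we ask for an older grammar:

import ast

# On Python 3.8, requesting an older grammar lets async and await be
# parsed as ordinary identifiers again. This is an illustrative sketch;
# check the ast docs for the feature_version values your release accepts.
tree = ast.parse("await = async.run()\n", feature_version=(3, 6))
print(ast.dump(tree))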

We already mentioned two tokenizers in Python's reference implementation, but it turns out that there's a third. The undocumented but popular lib2to3 library uses a fork of the pure-Python tokenizer we saw earlier.

This tokenizer generates token objects in a slightly different format, and is designed to support Python 2 syntax in addition to some Python 3 syntax. lib2to3's tokenizer isn't as well supported as the standard library's tokenizer, so unless you need to work with Python 2 or lib2to3, you should steer clear of it.

Outside of CPython, there's a ton of different tokenizers, many of which are forks of each other. Here's just a small selection of interesting Python tokenizers:

The tokenizer is just the first in a series of steps. Like the tokenizer, each step in this process is designed to iteratively simplify the daunting problem of program execution.

Now that our source code has been converted to a sequence of tokens, CPython can use that to form a parse tree, which will eventually be transformed into a similar but higher-level abstract syntax tree. The parse tree is generated using an LL(1) parsing algorithm that forms the basis of another state machine.

A symbol table is generated by stepping through the abstract syntax tree. The symbol table step handles the logic required for dealing with scopes, tracking where a given local variable name is stored.

The abstract syntax tree and symbol table then serve as input to a compiler that generates a sequence of bytecode instructions that can be executed inside of CPython's virtual machine.
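All of these later stages can be poked at from Python itself. Here's a small sketch using the standard library's ast, symtable, and dis modules (it skips the intermediate parse tree, which isn't as conveniently exposed):

import ast
import dis
import symtable

src = "if test:\n    fn()\n"

# The abstract syntax tree for the snippet.
print(ast.dump(ast.parse(src)))

# The module-level symbol table: the names the snippet references.
table = symtable.symtable(src, "<example>", "exec")
print([sym.get_name() for sym in table.get_symbols()])

# The bytecode the compiler emits for the same snippet.
dis.dis(compile(src, "<example>", "exec"))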

[1] If you're unfamiliar with the <<< syntax used here, that's because it's a here string. The examples in this post make heavy use of them, along with here documents. Here strings make it easy to provide input to a process's stdin.

[2] If you're using Python 3.8 (or a later release), it's possible to configure the tokenizer and parser to retain type comments. This feature makes it easier for static type checkers like mypy to re-use CPython's parser.

The views expressed on this site are my own and do not reflect those of my employer.
