The very first thing the compiler does is take the program (in UTF-8 Unicode text) and turn it into a data format the compiler can work with more conveniently than strings. This happens in two stages: Lexing and Parsing.
foo.bar + buz
would be turned into the tokens foo
, .
, bar
, +
, and buz
. This is implemented in rustc_lexer
.The AST mirrors the structure of a Rust program in memory, using a Span
to link a particular AST node back to its source text. The AST is defined in rustc_ast
, along with some definitions for tokens and token streams, data structures/traits for mutating ASTs, and shared definitions for other AST-related parts of the compiler (like the lexer and macro-expansion).
Every node in the AST has its own NodeId
, including top-level items such as structs, but also individual statements and expressions. A NodeId
is an identifier number that uniquely identifies an AST node within a crate.
However, because they are absolute within a crate, adding or removing a single node in the AST causes all the subsequent NodeId
s to change. This renders NodeId
s pretty much useless for incremental compilation, where you want as few things as possible to change.
NodeId
s are used in all the rustc
bits that operate directly on the AST, like macro expansion and name resolution (more on these over the next couple chapters).
The parser is defined in rustc_parse
, along with a high-level interface to the lexer and some validation routines that run after macro expansion. In particular, the rustc_parse::parser
contains the parser implementation.
The main entrypoint to the parser is via the various parse_*
functions and others in rustc_parse. They let you do things like turn a SourceFile
(e.g. the source in a single file) into a token stream, create a parser from the token stream, and then execute the parser to get a Crate
(the root AST node).
To minimize the amount of copying that is done, both StringReader
and Parser
have lifetimes which bind them to the parent ParseSess
. This contains all the information needed while parsing, as well as the SourceMap
itself.
Note that while parsing, we may encounter macro definitions or invocations. We set these aside to be expanded (see Macro Expansion). Expansion itself may require parsing the output of a macro, which may reveal more macros to be expanded, and so on.
Code for lexical analysis is split between two crates:
rustc_lexer
crate is responsible for breaking a &str
into chunks constituting tokens. Although it is popular to implement lexers as generated finite state machines, the lexer in rustc_lexer
is hand-written.
StringReader
integrates rustc_lexer
with data structures specific to rustc
. Specifically, it adds Span
information to tokens returned by rustc_lexer
and interns identifiers.