src/queries/incremental-compilation.md - rust-lang/rustc-dev-guide - Git at Google

 # Incremental compilation

 The incremental compilation scheme is, in essence, a surprisingly
 simple extension to the overall query system. We'll start by describing
 a slightly simplified variant of the real thing – the "basic algorithm" –
 and then describe some possible improvements.

 ## The basic algorithm

 The basic algorithm is
 called the **red-green** algorithm[^salsa]. The high-level idea is
 that, after each run of the compiler, we will save the results of all
 the queries that we do, as well as the **query DAG**. The
 **query DAG** is a [DAG] that indexes which queries executed which
 other queries. So, for example, there would be an [edge] from a query Q1
 to another query Q2 if computing Q1 required computing Q2 (note that
 because queries cannot depend on themselves, this results in a DAG and
 not a general graph).

 [DAG]: https://en.wikipedia.org/wiki/Directed_acyclic_graph

 > **NOTE**: You might think of a query as simply the definition of a query.
 > A thing that you can invoke, a bit like a function,
 > and which either returns a cached result or actually executes the code.
 >
 > If that's the way you think about queries,
 > it's good to know that in the following text, queries will be said to have colours.
 > Keep in mind though, that here the word query also refers to a certain invocation of
 > the query for a certain input. As you will read later, queries are fingerprinted based
 > on their arguments. The result of a query might change when we give it one argument
 > and be coloured red, while it stays the same for another argument and is thus green.
 >
 > In short, the word query is here not just used to mean the definition of a query,
 > but also for a specific instance of that query with given arguments.

 On the next run of the compiler, then, we can sometimes reuse these
 query results to avoid re-executing a query. We do this by assigning
 every query a **color**:

 - If a query is colored **red**, that means that its result during
   this compilation has **changed** from the previous compilation.
 - If a query is colored **green**, that means that its result is
   the **same** as the previous compilation.

 There are two key insights here:

 - First, if all the inputs to query Q are colored green, then the
   query Q **must** result in the same value as last time and hence
   need not be re-executed (or else the compiler is not deterministic).
 - Second, even if some inputs to a query changes, it may be that it
   **still** produces the same result as the previous compilation. In
   particular, the query may only use part of its input.
   - Therefore, after executing a query, we always check whether it
     produced the same result as the previous time. **If it did,** we
     can still mark the query as green, and hence avoid re-executing
     dependent queries.

 ### The try-mark-green algorithm

 At the core of incremental compilation is an algorithm called
 "try-mark-green". It has the job of determining the color of a given
 query Q (which must not have yet been executed). In cases where Q has
 red inputs, determining Q's color may involve re-executing Q so that
 we can compare its output, but if all of Q's inputs are green, then we
 can conclude that Q must be green without re-executing it or inspecting
 its value at all. In the compiler, this allows us to avoid
 deserializing the result from disk when we don't need it, and in fact
 enables us to sometimes skip *serializing* the result as well
 (see the refinements section below).

 Try-mark-green works as follows:

 - First check if the query Q was executed during the previous compilation.
   - If not, we can just re-execute the query as normal, and assign it the
     color of red.
 - If yes, then load the 'dependent queries' of Q.
 - If there is a saved result, then we load the `reads(Q)` vector from the
   query DAG. The "reads" is the set of queries that Q executed during
   its execution.
   - For each query R in `reads(Q)`, we recursively demand the color
     of R using try-mark-green.
     - Note: it is important that we visit each node in `reads(Q)` in same order
       as they occurred in the original compilation. See [the section on the
       query DAG below](#dag).
     - If **any** of the nodes in `reads(Q)` wind up colored **red**, then Q is
       dirty.
       - We re-execute Q and compare the hash of its result to the hash of the
         result from the previous compilation.
       - If the hash has not changed, we can mark Q as **green** and return.
     - Otherwise, **all** of the nodes in `reads(Q)` must be **green**. In that
       case, we can color Q as **green** and return.

 <a id="dag"></a>

 ### The query DAG

 The query DAG code is stored in
 [`compiler/rustc_middle/src/dep_graph`][dep_graph]. Construction of the DAG is done
 by instrumenting the query execution.

 One key point is that the query DAG also tracks ordering; that is, for
 each query Q, we not only track the queries that Q reads, we track the
 **order** in which they were read.  This allows try-mark-green to walk
 those queries back in the same order. This is important because once a
 subquery comes back as red, we can no longer be sure that Q will continue
 along the same path as before. That is, imagine a query like this:

 ```rust,ignore
 fn main_query(tcx) {
     if tcx.subquery1() {
         tcx.subquery2()
     } else {
         tcx.subquery3()
     }
 }
 ```

 Now imagine that in the first compilation, `main_query` starts by
 executing `subquery1`, and this returns true. In that case, the next
 query `main_query` executes will be `subquery2`, and `subquery3` will
 not be executed at all.

 But now imagine that in the **next** compilation, the input has
 changed such that `subquery1` returns **false**. In this case, `subquery2`
 would never execute. If try-mark-green were to visit `reads(main_query)` out
 of order, however, it might visit `subquery2` before `subquery1`, and hence
 execute it.
 This can lead to ICEs and other problems in the compiler.

 [dep_graph]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/dep_graph/index.html

 ## Improvements to the basic algorithm

 In the description of the basic algorithm, we said that at the end of
 compilation we would save the results of all the queries that were
 performed.  In practice, this can be quite wasteful – many of those
 results are very cheap to recompute, and serializing and deserializing
 them is not a particular win. In practice, what we would do is to save
 **the hashes** of all the subqueries that we performed. Then, in select cases,
 we **also** save the results.

 This is why the incremental algorithm separates computing the
 **color** of a node, which often does not require its value, from
 computing the **result** of a node. Computing the result is done via a simple
 algorithm like so:

 - Check if a saved result for Q is available. If so, compute the color of Q.
   If Q is green, deserialize and return the saved result.
 - Otherwise, execute Q.
   - We can then compare the hash of the result and color Q as green if
     it did not change.

 ## Resources
 The initial design document can be found [here][initial-design], which expands
 on the memoization details, provides more high-level overview and motivation
 for this system.

 # Footnotes

 [^salsa]: I have long wanted to rename it to the Salsa algorithm, but it never caught on. -@nikomatsakis

 [edge]: https://en.wikipedia.org/wiki/Glossary_of_graph_theory_terms#edge
 [initial-design]: https://github.com/nikomatsakis/rustc-on-demand-incremental-design-doc/blob/master/0000-rustc-on-demand-and-incremental.md
	# Incremental compilation

	The incremental compilation scheme is, in essence, a surprisingly
	simple extension to the overall query system. We'll start by describing
	a slightly simplified variant of the real thing – the "basic algorithm" –
	and then describe some possible improvements.

	## The basic algorithm

	The basic algorithm is
	called the red-green algorithm[^salsa]. The high-level idea is
	that, after each run of the compiler, we will save the results of all
	the queries that we do, as well as the query DAG. The
	query DAG is a [DAG] that indexes which queries executed which
	other queries. So, for example, there would be an [edge] from a query Q1
	to another query Q2 if computing Q1 required computing Q2 (note that
	because queries cannot depend on themselves, this results in a DAG and
	not a general graph).

	[DAG]: https://en.wikipedia.org/wiki/Directed_acyclic_graph

	> NOTE: You might think of a query as simply the definition of a query.
	> A thing that you can invoke, a bit like a function,
	> and which either returns a cached result or actually executes the code.
	>
	> If that's the way you think about queries,
	> it's good to know that in the following text, queries will be said to have colours.
	> Keep in mind though, that here the word query also refers to a certain invocation of
	> the query for a certain input. As you will read later, queries are fingerprinted based
	> on their arguments. The result of a query might change when we give it one argument
	> and be coloured red, while it stays the same for another argument and is thus green.
	>
	> In short, the word query is here not just used to mean the definition of a query,
	> but also for a specific instance of that query with given arguments.

	On the next run of the compiler, then, we can sometimes reuse these
	query results to avoid re-executing a query. We do this by assigning
	every query a color:

	- If a query is colored red, that means that its result during
	this compilation has changed from the previous compilation.
	- If a query is colored green, that means that its result is
	the same as the previous compilation.

	There are two key insights here:

	- First, if all the inputs to query Q are colored green, then the
	query Q must result in the same value as last time and hence
	need not be re-executed (or else the compiler is not deterministic).
	- Second, even if some inputs to a query changes, it may be that it
	still produces the same result as the previous compilation. In
	particular, the query may only use part of its input.
	- Therefore, after executing a query, we always check whether it
	produced the same result as the previous time. If it did, we
	can still mark the query as green, and hence avoid re-executing
	dependent queries.

	### The try-mark-green algorithm

	At the core of incremental compilation is an algorithm called
	"try-mark-green". It has the job of determining the color of a given
	query Q (which must not have yet been executed). In cases where Q has
	red inputs, determining Q's color may involve re-executing Q so that
	we can compare its output, but if all of Q's inputs are green, then we
	can conclude that Q must be green without re-executing it or inspecting
	its value at all. In the compiler, this allows us to avoid
	deserializing the result from disk when we don't need it, and in fact
	enables us to sometimes skip serializing the result as well
	(see the refinements section below).

	Try-mark-green works as follows:

	- First check if the query Q was executed during the previous compilation.
	- If not, we can just re-execute the query as normal, and assign it the
	color of red.
	- If yes, then load the 'dependent queries' of Q.
	- If there is a saved result, then we load the `reads(Q)` vector from the
	query DAG. The "reads" is the set of queries that Q executed during
	its execution.
	- For each query R in `reads(Q)`, we recursively demand the color
	of R using try-mark-green.
	- Note: it is important that we visit each node in `reads(Q)` in same order
	as they occurred in the original compilation. See [the section on the
	query DAG below](#dag).
	- If any of the nodes in `reads(Q)` wind up colored red, then Q is
	dirty.
	- We re-execute Q and compare the hash of its result to the hash of the
	result from the previous compilation.
	- If the hash has not changed, we can mark Q as green and return.
	- Otherwise, all of the nodes in `reads(Q)` must be green. In that
	case, we can color Q as green and return.

	<a id="dag"></a>

	### The query DAG

	The query DAG code is stored in
	[`compiler/rustc_middle/src/dep_graph`][dep_graph]. Construction of the DAG is done
	by instrumenting the query execution.

	One key point is that the query DAG also tracks ordering; that is, for
	each query Q, we not only track the queries that Q reads, we track the
	order in which they were read. This allows try-mark-green to walk
	those queries back in the same order. This is important because once a
	subquery comes back as red, we can no longer be sure that Q will continue
	along the same path as before. That is, imagine a query like this:

	```rust,ignore
	fn main_query(tcx) {
	if tcx.subquery1() {
	tcx.subquery2()
	} else {
	tcx.subquery3()
	}
	}
	```

	Now imagine that in the first compilation, `main_query` starts by
	executing `subquery1`, and this returns true. In that case, the next
	query `main_query` executes will be `subquery2`, and `subquery3` will
	not be executed at all.

	But now imagine that in the next compilation, the input has
	changed such that `subquery1` returns false. In this case, `subquery2`
	would never execute. If try-mark-green were to visit `reads(main_query)` out
	of order, however, it might visit `subquery2` before `subquery1`, and hence
	execute it.
	This can lead to ICEs and other problems in the compiler.

	[dep_graph]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/dep_graph/index.html

	## Improvements to the basic algorithm

	In the description of the basic algorithm, we said that at the end of
	compilation we would save the results of all the queries that were
	performed. In practice, this can be quite wasteful – many of those
	results are very cheap to recompute, and serializing and deserializing
	them is not a particular win. In practice, what we would do is to save
	the hashes of all the subqueries that we performed. Then, in select cases,
	we also save the results.

	This is why the incremental algorithm separates computing the
	color of a node, which often does not require its value, from
	computing the result of a node. Computing the result is done via a simple
	algorithm like so:

	- Check if a saved result for Q is available. If so, compute the color of Q.
	If Q is green, deserialize and return the saved result.
	- Otherwise, execute Q.
	- We can then compare the hash of the result and color Q as green if
	it did not change.

	## Resources
	The initial design document can be found [here][initial-design], which expands
	on the memoization details, provides more high-level overview and motivation
	for this system.

	# Footnotes

	[^salsa]: I have long wanted to rename it to the Salsa algorithm, but it never caught on. -@nikomatsakis

	[edge]: https://en.wikipedia.org/wiki/Glossary_of_graph_theory_terms#edge
	[initial-design]: https://github.com/nikomatsakis/rustc-on-demand-incremental-design-doc/blob/master/0000-rustc-on-demand-and-incremental.md