
 r[input]
 # Input format

 r[input.syntax]
 ```grammar,lexer
 CHAR -> [U+0000-U+D7FF U+E000-U+10FFFF] // a Unicode scalar value

 ASCII -> [U+0000-U+007F]

 NUL -> U+0000

 EOF -> !CHAR  // End of file or input
 ```
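
As an illustration of the `CHAR` range, Rust's own `char` type covers the same set of Unicode scalar values, so `char::from_u32` can serve as a membership check. The helper name `is_scalar_value` below is hypothetical and is not part of the grammar.

```rust
/// Hypothetical helper: returns true if `cp` lies in the CHAR range,
/// i.e. U+0000 to U+D7FF or U+E000 to U+10FFFF.
fn is_scalar_value(cp: u32) -> bool {
    // `char::from_u32` returns `None` for surrogates and for
    // code points above U+10FFFF.
    char::from_u32(cp).is_some()
}
```

Surrogate code points (`U+D800` to `U+DFFF`) fall outside the range, which is why they are rejected here.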

 r[input.intro]
 This chapter describes how a source file is interpreted as a sequence of tokens.

 See [Crates and source files] for a description of how programs are organised into files.

 r[input.encoding]
 ## Source encoding

 r[input.encoding.utf8]
 Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.

 r[input.encoding.invalid]
 It is an error if the file is not valid UTF-8.
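
A minimal sketch of this check, assuming the file contents are already in memory as bytes. The function name `decode_source` is hypothetical; `std::str::from_utf8` is the standard library's UTF-8 validator, and this is not necessarily how a compiler reports the error.

```rust
use std::str::Utf8Error;

/// Hypothetical helper: interpret raw file bytes as UTF-8 source text.
/// Returns an error if the bytes are not valid UTF-8.
fn decode_source(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}
```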

 r[input.byte-order-mark]
 ## Byte order mark removal

 If the first character in the sequence is `U+FEFF` ([BYTE ORDER MARK]), it is removed.
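
The rule can be sketched as follows, assuming the input has already been decoded to a `&str`. The name `strip_bom` is hypothetical.

```rust
/// Hypothetical helper: drop a single leading U+FEFF, if present.
fn strip_bom(text: &str) -> &str {
    // Only the first character is considered; a U+FEFF anywhere
    // else in the input is left in place.
    text.strip_prefix('\u{FEFF}').unwrap_or(text)
}
```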

 r[input.crlf]
 ## CRLF normalization

 Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF). The replacement is performed in a single pass and is not repeated, so the normalized input can still contain `U+000D` (CR) immediately followed by `U+000A` (LF), for example when the raw input contains "CR CR LF LF".

 Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
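
A minimal sketch of the normalization, assuming decoded input. The name `normalize_crlf` is hypothetical. `str::replace` scans the input once and replaces non-overlapping matches, which matches the single-pass behavior described above.

```rust
/// Hypothetical helper: replace each CR LF pair with a single LF.
fn normalize_crlf(text: &str) -> String {
    // One left-to-right pass; the output is not rescanned, so a CR LF
    // pair produced by the replacement itself is not normalized again.
    text.replace("\r\n", "\n")
}
```

With the raw input "CR CR LF LF", only the middle pair is replaced, so the result is "CR LF LF". A CR that is not followed by LF is left unchanged.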

 r[input.shebang]
 ## Shebang removal

 r[input.shebang.removal]
 If a [shebang] is present, it is removed from the input sequence (and is therefore ignored).

 r[input.frontmatter]
 ## Frontmatter removal

 r[input.frontmatter.removal]
 If the remaining input begins with a [frontmatter] fence, optionally preceded by lines containing only [whitespace], the [frontmatter] and any preceding whitespace are removed.

 For example, given the following file:

 <!-- ignore: test runner doesn't support frontmatter -->
 ```rust,ignore
 --- cargo
 package.edition = "2024"
 ---

 fn main() {}
 ```

 The first three lines (the opening fence, body, and closing fence) would be removed, leaving an empty line followed by `fn main() {}`.
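
The removal in this example can be approximated with the simplified sketch below. It rests on several assumptions that are not stated in this chapter: an opening fence is a line starting with three or more `-` characters, the closing fence is a line consisting of exactly the same number of dashes, and whitespace-only lines are detected with `str::trim`. The name `strip_frontmatter` is hypothetical, and the full rules are given in [frontmatter].

```rust
/// Simplified, hypothetical sketch: remove a leading frontmatter block.
/// Returns the input unchanged if no complete block is found.
fn strip_frontmatter(src: &str) -> &str {
    let mut pos = 0; // byte offset of the first line not yet consumed
    let mut lines = src.split_inclusive('\n');

    // Skip leading lines that contain only whitespace.
    let mut opening = None;
    for line in lines.by_ref() {
        if line.trim().is_empty() {
            pos += line.len();
        } else {
            opening = Some(line);
            break;
        }
    }
    let Some(opening) = opening else { return src };

    // Assumption: the opening fence starts with at least three dashes.
    let dashes = opening.bytes().take_while(|&b| b == b'-').count();
    if dashes < 3 {
        return src;
    }
    pos += opening.len();

    // Assumption: the closing fence is a line of exactly `dashes` dashes.
    for line in lines {
        pos += line.len();
        let body = line.trim_end();
        if body.len() == dashes && body.bytes().all(|b| b == b'-') {
            return &src[pos..];
        }
    }
    src // no closing fence found; this sketch leaves the input as-is
}
```

Applied to the file above, the sketch returns an empty line followed by `fn main() {}`.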

 r[input.tokenization]
 ## Tokenization

 The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.

 > [!NOTE]
 > The standard library [`include!`] macro applies the following transformations to the file it reads:
 >
 > - Byte order mark removal.
 > - CRLF normalization.
 > - Shebang and frontmatter removal when invoked in an item context (as opposed to expression or statement contexts).
 >
 > The [`include_str!`] and [`include_bytes!`] macros do not apply these transformations.

 [BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
 [Crates and source files]: crates-and-source-files.md
 [frontmatter]: frontmatter.md
 [shebang]: shebang.md
 [whitespace]: whitespace.md