r[input]
# Input format
r[input.syntax]
```grammar,lexer
CHAR -> [U+0000-U+D7FF U+E000-U+10FFFF] // a Unicode scalar value
ASCII -> [U+0000-U+007F]
NUL -> U+0000
EOF -> !CHAR // End of file or input
```
r[input.intro]
This chapter describes how a source file is interpreted as a sequence of tokens.
See [Crates and source files] for a description of how programs are organised into files.
r[input.encoding]
## Source encoding
r[input.encoding.utf8]
Each source file is interpreted as a sequence of Unicode characters encoded in UTF-8.
r[input.encoding.invalid]
It is an error if the file is not valid UTF-8.
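As a rough illustration of the encoding rule, the check can be sketched with the standard library's `std::str::from_utf8`. The helper name `decode_source` is hypothetical and is not part of any compiler API; it only shows that invalid byte sequences are rejected rather than repaired.

```rust
/// Interprets the raw bytes of a source file as UTF-8 text.
///
/// `decode_source` is a hypothetical helper used for illustration only.
/// It returns an error instead of substituting replacement characters,
/// mirroring the rule that a file that is not valid UTF-8 is an error.
fn decode_source(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}
```

Under this sketch, `decode_source(b"fn main() {}")` succeeds, while a byte such as `0xFF`, which never appears in well-formed UTF-8, produces an error.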
r[input.byte-order-mark]
## Byte order mark removal
If the first character in the sequence is `U+FEFF` ([BYTE ORDER MARK]), it is removed.
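A minimal sketch of this step, assuming the input has already been decoded to a `&str`. The function name `strip_bom` is hypothetical; `str::strip_prefix` is the standard library method used to drop at most one leading `U+FEFF`.

```rust
/// Removes a single leading U+FEFF (BYTE ORDER MARK), if present.
///
/// `strip_bom` is a hypothetical helper for illustration. Only the first
/// character is considered; any later U+FEFF is left untouched.
fn strip_bom(source: &str) -> &str {
    source.strip_prefix('\u{FEFF}').unwrap_or(source)
}
```

Input without a leading byte order mark is returned unchanged.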
r[input.crlf]
## CRLF normalization
Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF). The replacement is performed in a single pass and is not repeated, so the normalized input can still contain `U+000D` (CR) immediately followed by `U+000A` (LF) (for example, if the raw input contained "CR CR LF LF").
Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
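The single-pass behavior can be sketched with `str::replace`, which substitutes non-overlapping matches from left to right and does not rescan its own output. The function name `normalize_crlf` is hypothetical and is used only for illustration.

```rust
/// Replaces each CR LF pair with a single LF in one left-to-right pass.
///
/// `normalize_crlf` is a hypothetical helper for illustration. Because the
/// output is not rescanned, a CR that ends up next to an LF after the
/// replacement is left as it is, and a lone CR is never removed.
fn normalize_crlf(source: &str) -> String {
    source.replace("\r\n", "\n")
}
```

With this sketch, the raw input "CR CR LF LF" (`"\r\r\n\n"`) becomes `"\r\n\n"`: the middle CR LF pair collapses to LF, and the leading CR now sits directly before an LF but is not normalized again.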
r[input.shebang]
## Shebang removal
r[input.shebang.removal]
If a [shebang] is present, it is removed from the input sequence (and is therefore ignored).
r[input.frontmatter]
## Frontmatter removal
r[input.frontmatter.removal]
If the remaining input begins with a [frontmatter] fence, optionally preceded by lines containing only [whitespace], the [frontmatter] and any preceding whitespace are removed.
For example, given the following file:
<!-- ignore: test runner doesn't support frontmatter -->
```rust,ignore
--- cargo
package.edition = "2024"
---
fn main() {}
```
The first three lines (the opening fence, body, and closing fence) would be removed, leaving an empty line followed by `fn main() {}`.
r[input.tokenization]
## Tokenization
The resulting sequence of characters is then converted into tokens as described in the remainder of this chapter.
> [!NOTE]
> The standard library [`include!`] macro applies the following transformations to the file it reads:
>
> - Byte order mark removal.
> - CRLF normalization.
> - Shebang and frontmatter removal when invoked in an item context (as opposed to expression or statement contexts).
>
> The [`include_str!`] and [`include_bytes!`] macros do not apply these transformations.
[BYTE ORDER MARK]: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
[Crates and source files]: crates-and-source-files.md
[frontmatter]: frontmatter.md
[shebang]: shebang.md
[whitespace]: whitespace.md