src/debuginfo/rust-codegen.md - rust-lang/rustc-dev-guide - Git at Google

 # Rust Codegen

 The first phase in debug info generation requires Rust to inspect the MIR of the program and
 communicate it to LLVM. This is primarily done in [`rustc_codegen_llvm/debuginfo`][llvm_di], though
 some type-name processing exists in [`rustc_codegen_ssa/debuginfo`][ssa_di]. Rust communicates to
 LLVM via the `DIBuilder` API - a thin wrapper around LLVM's internals that exists in
 [rustc_llvm][rustc_llvm].

 [llvm_di]: https://github.com/rust-lang/rust/tree/main/compiler/rustc_codegen_llvm/src/debuginfo
 [ssa_di]: https://github.com/rust-lang/rust/tree/main/compiler/rustc_codegen_ssa/src/debuginfo
 [rustc_llvm]: https://github.com/rust-lang/rust/tree/main/compiler/rustc_llvm

 # Type Information

 Type information typically consists of the type name, size, alignment, as well as things like
 fields, generic parameters, and storage modifiers if they are relevant. Much of this work happens in
 [rustc_codegen_llvm/src/debuginfo/metadata][di_metadata].

 [di_metadata]: https://github.com/rust-lang/rust/blob/main/compiler/rustc_codegen_llvm/src/debuginfo/metadata.rs

 It is important to keep in mind that the goal is not necessarily "represent types exactly how they
 appear in Rust", rather it is to represent them in a way that allows debuggers to most accurately
 reconstruct the data during debugging. This distinction is vital to understanding the core work that
 occurs on this layer; many changes made here will be for the purpose of working around debugger
 limitations when no other option will work.

 ## Quirks

 Rust's generated DI nodes "pretend" to be C/C++ for both CDB and LLDB's sake. This can result in
 some unintuitive and non-idiomatic debug info.

 ### Pointers and Reference

 Wide pointers/references/`Box` are treated as a struct with 2 fields: `data_ptr` and `length`.

 All non-wide pointers, references, and `Box` pointers are output as pointer nodes, and no
 distinction is made between `mut` and non-`mut`. Several attempts have been made to rectify this,
 but unfortunately there is not a straightforward solution. Using the `reference` DI nodes of the
 respective formats has pitfalls. There is a semantic difference between C++ references and Rust
 references that is unreconcilable.

 >From [cppreference](https://en.cppreference.com/w/cpp/language/reference.html):
 >
 >References are not objects; **they do not necessarily occupy storage**, although the compiler may
 >allocate storage if it is necessary to implement the desired semantics (e.g. a non-static data
 >member of reference type usually increases the size of the class by the amount necessary to store
 >a memory address).
 >
 >Because references are not objects, **there are no arrays of references, no pointers to references, and no references to references**

 The current proposed solution is to simply [typedef the pointer nodes][issue_144394].

 [issue_144394]: https://github.com/rust-lang/rust/pull/144394

 Using the `const` qualifier to denote non-`mut` poses potential issues due to LLDB's internal
 optimizations. In short, LLDB attempts to cache the child-values of variables (e.g. struct fields,
 array elements) when stepping through code. A heuristic is used to determine which values are safely
 cache-able, and `const` is part of that heuristic. Research has not been done into how this would
 interact with things like Rust's interior mutability constructs.

 ### DWARF vs PDB

 While most of the type information is fairly straight forward, one notable issue is the debug info
 format of the target. Each format has different semantics and limitations, as such they require
 slightly different debug info in some cases. This is gated by calls to
 [`cpp_like_debuginfo`][cpp_like].

 [cpp_like]: https://github.com/rust-lang/rust/blob/main/compiler/rustc_codegen_ssa/src/debuginfo/type_names.rs#L813

 ### Naming

 Rust attempts to communicate type names as accurately as possible, but debuggers and debug info
 formats do not always respect that.

 Due to limitations in MSVC's expression parser, the following name transformations are made for PDB
 debug info:

 | Rust name | MSVC name |
 | --- | --- |
 | `&str`/`&mut str` | `ref$<str$>`/`ref_mut$<str$>` |
 | `&[T]`/`&mut [T]` | `ref$<slice$<T> >`/`ref_mut$<slice$<T> >`[^1] |
 | `[T; N]` | `array$<T, N>` |
 | `RustEnum` | `enum2$<RustEnum>` |
 | `(T1, T2)` | `tuple$<T1, T2>`|
 | `*const T` | `ptr_const$<T>` |
 | `*mut T` | `ptr_mut$<T>` |
 | `usize` | `size_t`[^2] |
 | `isize` | `ptrdiff_t`[^2] |
 | `uN` | `unsigned __intN`[^2] |
 | `iN` | `__intN`[^2] |
 | `f32` | `float`[^2] |
 | `f64` | `double`[^2] |
 | `f128` | `fp128`[^2] |

 [^1]: MSVC's expression parser will treat `>>` as a right shift. It is necessary to separate
 consecutive `>`'s with a space (`> >`) in type names.

 [^2]: While these type names are generated as part of the debug info node (which is then wrapped in
 a typedef node with the Rust name), once the LLVM-IR node is converted to a CodeView node, the type
 name information is lost. This is because CodeView has special shorthand nodes for primitive types,
 and those shorthand nodes to not have a "name" field.

 ### Generics

 Rust outputs generic *type* information (`T` in `ArrayVec<T, N: usize>`), but not generic *value*
 information (`N` in `ArrayVec<T, N: usize>`).

 CodeView does not have a leaf node for generics/C++ templates, so all generic information is lost
 when generating PDB debug info. There are workarounds that allow the debugger to retrieve the
 generic arguments via the type name, but it is fragile solution at best. Efforts are being made to
 contact Microsoft to correct this deficiency, and/or to use one of the unused CodeView node types as
 a suitable equivalent.

 ### Type aliases

 Rust outputs typedef nodes in several cases to help account for debugger limitiations, but it does
 not currently output nodes for [type aliases in the source code][type_aliases].

 [type_aliases]: https://doc.rust-lang.org/reference/items/type-aliases.html

 ### Enums

 Enum DI nodes are generated in [rustc_codegen_llvm/src/debuginfo/metadata/enums][di_metadata_enums]

 [di_metadata_enums]: https://github.com/rust-lang/rust/tree/main/compiler/rustc_codegen_llvm/src/debuginfo/metadata/enums

 #### DWARF

 DWARF has a dedicated node for discriminated unions: `DW_TAG_variant`. It is a container that
 references `DW_TAG_variant_part` nodes that may or may not contain a discriminant value. The
 hierarchy looks as follows:

 ```txt
 DW_TAG_structure_type      (top-level type for the coroutine)
   DW_TAG_variant_part      (variant part)
     DW_AT_discr            (reference to discriminant DW_TAG_member)
     DW_TAG_member          (discriminant member)
     DW_TAG_variant         (variant 1)
     DW_TAG_variant         (variant 2)
     DW_TAG_variant         (variant 3)
   DW_TAG_structure_type    (type of variant 1)
   DW_TAG_structure_type    (type of variant 2)
   DW_TAG_structure_type    (type of variant 3)
 ```

 #### PDB
 PDB does not have a dedicated node, so it generates the C equivalent of a discriminated union:

 ```c
 union enum2$<RUST_ENUM_NAME> {
     enum VariantNames {
         First,
         Second
     };
     struct Variant0 {
         struct First {
             // fields
         };
         static const enum2$<RUST_ENUM_NAME>::VariantNames NAME;
         static const unsigned long DISCR_EXACT;
         enum2$<RUST_ENUM_NAME>::Variant0::First value;
     };
     struct Variant1 {
         struct Second {
             // fields
         };
         static enum2$<RUST_ENUM_NAME>::VariantNames NAME;
         static unsigned long DISCR_EXACT;
         enum2$<RUST_ENUM_NAME>::Variant1::Second value;
     };
     enum2$<RUST_ENUM_NAME>::Variant0 variant0;
     enum2$<RUST_ENUM_NAME>::Variant1 variant1;
     unsigned long tag;
 }
 ```

 An important note is that due to limitations in LLDB, the `DISCR_*` value generated is always a
 `u64` even if the value is not `#[repr(u64)]`. This is largely a non-issue for LLDB because the
 `DISCR_*` value and the `tag` are read into `uint64_t` values regardless of their type.

 # Source Information

 TODO
	# Rust Codegen

	The first phase in debug info generation requires Rust to inspect the MIR of the program and
	communicate it to LLVM. This is primarily done in [`rustc_codegen_llvm/debuginfo`][llvm_di], though
	some type-name processing exists in [`rustc_codegen_ssa/debuginfo`][ssa_di]. Rust communicates to
	LLVM via the `DIBuilder` API - a thin wrapper around LLVM's internals that exists in
	[rustc_llvm][rustc_llvm].

	[llvm_di]: https://github.com/rust-lang/rust/tree/main/compiler/rustc_codegen_llvm/src/debuginfo
	[ssa_di]: https://github.com/rust-lang/rust/tree/main/compiler/rustc_codegen_ssa/src/debuginfo
	[rustc_llvm]: https://github.com/rust-lang/rust/tree/main/compiler/rustc_llvm

	# Type Information

	Type information typically consists of the type name, size, alignment, as well as things like
	fields, generic parameters, and storage modifiers if they are relevant. Much of this work happens in
	[rustc_codegen_llvm/src/debuginfo/metadata][di_metadata].

	[di_metadata]: https://github.com/rust-lang/rust/blob/main/compiler/rustc_codegen_llvm/src/debuginfo/metadata.rs

	It is important to keep in mind that the goal is not necessarily "represent types exactly how they
	appear in Rust", rather it is to represent them in a way that allows debuggers to most accurately
	reconstruct the data during debugging. This distinction is vital to understanding the core work that
	occurs on this layer; many changes made here will be for the purpose of working around debugger
	limitations when no other option will work.

	## Quirks

	Rust's generated DI nodes "pretend" to be C/C++ for both CDB and LLDB's sake. This can result in
	some unintuitive and non-idiomatic debug info.

	### Pointers and Reference

	Wide pointers/references/`Box` are treated as a struct with 2 fields: `data_ptr` and `length`.

	All non-wide pointers, references, and `Box` pointers are output as pointer nodes, and no
	distinction is made between `mut` and non-`mut`. Several attempts have been made to rectify this,
	but unfortunately there is not a straightforward solution. Using the `reference` DI nodes of the
	respective formats has pitfalls. There is a semantic difference between C++ references and Rust
	references that is unreconcilable.

	>From [cppreference](https://en.cppreference.com/w/cpp/language/reference.html):
	>
	>References are not objects; they do not necessarily occupy storage, although the compiler may
	>allocate storage if it is necessary to implement the desired semantics (e.g. a non-static data
	>member of reference type usually increases the size of the class by the amount necessary to store
	>a memory address).
	>
	>Because references are not objects, there are no arrays of references, no pointers to references, and no references to references

	The current proposed solution is to simply [typedef the pointer nodes][issue_144394].

	[issue_144394]: https://github.com/rust-lang/rust/pull/144394

	Using the `const` qualifier to denote non-`mut` poses potential issues due to LLDB's internal
	optimizations. In short, LLDB attempts to cache the child-values of variables (e.g. struct fields,
	array elements) when stepping through code. A heuristic is used to determine which values are safely
	cache-able, and `const` is part of that heuristic. Research has not been done into how this would
	interact with things like Rust's interior mutability constructs.

	### DWARF vs PDB

	While most of the type information is fairly straight forward, one notable issue is the debug info
	format of the target. Each format has different semantics and limitations, as such they require
	slightly different debug info in some cases. This is gated by calls to
	[`cpp_like_debuginfo`][cpp_like].

	[cpp_like]: https://github.com/rust-lang/rust/blob/main/compiler/rustc_codegen_ssa/src/debuginfo/type_names.rs#L813

	### Naming

	Rust attempts to communicate type names as accurately as possible, but debuggers and debug info
	formats do not always respect that.

	Due to limitations in MSVC's expression parser, the following name transformations are made for PDB
	debug info:

	\| Rust name \| MSVC name \|
	\| --- \| --- \|
	\| `&str`/`&mut str` \| `ref$<str$>`/`ref_mut$<str$>` \|
	\| `&[T]`/`&mut [T]` \| `ref$<slice$<T> >`/`ref_mut$<slice$<T> >`[^1] \|
	\| `[T; N]` \| `array$<T, N>` \|
	\| `RustEnum` \| `enum2$<RustEnum>` \|
	\| `(T1, T2)` \| `tuple$<T1, T2>`\|
	\| `*const T` \| `ptr_const$<T>` \|
	\| `*mut T` \| `ptr_mut$<T>` \|
	\| `usize` \| `size_t`[^2] \|
	\| `isize` \| `ptrdiff_t`[^2] \|
	\| `uN` \| `unsigned __intN`[^2] \|
	\| `iN` \| `__intN`[^2] \|
	\| `f32` \| `float`[^2] \|
	\| `f64` \| `double`[^2] \|
	\| `f128` \| `fp128`[^2] \|

	[^1]: MSVC's expression parser will treat `>>` as a right shift. It is necessary to separate
	consecutive `>`'s with a space (`> >`) in type names.

	[^2]: While these type names are generated as part of the debug info node (which is then wrapped in
	a typedef node with the Rust name), once the LLVM-IR node is converted to a CodeView node, the type
	name information is lost. This is because CodeView has special shorthand nodes for primitive types,
	and those shorthand nodes to not have a "name" field.

	### Generics

	Rust outputs generic type information (`T` in `ArrayVec<T, N: usize>`), but not generic value
	information (`N` in `ArrayVec<T, N: usize>`).

	CodeView does not have a leaf node for generics/C++ templates, so all generic information is lost
	when generating PDB debug info. There are workarounds that allow the debugger to retrieve the
	generic arguments via the type name, but it is fragile solution at best. Efforts are being made to
	contact Microsoft to correct this deficiency, and/or to use one of the unused CodeView node types as
	a suitable equivalent.

	### Type aliases

	Rust outputs typedef nodes in several cases to help account for debugger limitiations, but it does
	not currently output nodes for [type aliases in the source code][type_aliases].

	[type_aliases]: https://doc.rust-lang.org/reference/items/type-aliases.html

	### Enums

	Enum DI nodes are generated in [rustc_codegen_llvm/src/debuginfo/metadata/enums][di_metadata_enums]

	[di_metadata_enums]: https://github.com/rust-lang/rust/tree/main/compiler/rustc_codegen_llvm/src/debuginfo/metadata/enums

	#### DWARF

	DWARF has a dedicated node for discriminated unions: `DW_TAG_variant`. It is a container that
	references `DW_TAG_variant_part` nodes that may or may not contain a discriminant value. The
	hierarchy looks as follows:

	```txt
	DW_TAG_structure_type (top-level type for the coroutine)
	DW_TAG_variant_part (variant part)
	DW_AT_discr (reference to discriminant DW_TAG_member)
	DW_TAG_member (discriminant member)
	DW_TAG_variant (variant 1)
	DW_TAG_variant (variant 2)
	DW_TAG_variant (variant 3)
	DW_TAG_structure_type (type of variant 1)
	DW_TAG_structure_type (type of variant 2)
	DW_TAG_structure_type (type of variant 3)
	```

	#### PDB
	PDB does not have a dedicated node, so it generates the C equivalent of a discriminated union:

	```c
	union enum2$<RUST_ENUM_NAME> {
	enum VariantNames {
	First,
	Second
	};
	struct Variant0 {
	struct First {
	// fields
	};
	static const enum2$<RUST_ENUM_NAME>::VariantNames NAME;
	static const unsigned long DISCR_EXACT;
	enum2$<RUST_ENUM_NAME>::Variant0::First value;
	};
	struct Variant1 {
	struct Second {
	// fields
	};
	static enum2$<RUST_ENUM_NAME>::VariantNames NAME;
	static unsigned long DISCR_EXACT;
	enum2$<RUST_ENUM_NAME>::Variant1::Second value;
	};
	enum2$<RUST_ENUM_NAME>::Variant0 variant0;
	enum2$<RUST_ENUM_NAME>::Variant1 variant1;
	unsigned long tag;
	}
	```

	An important note is that due to limitations in LLDB, the `DISCR_*` value generated is always a
	`u64` even if the value is not `#[repr(u64)]`. This is largely a non-issue for LLDB because the
	`DISCR_*` value and the `tag` are read into `uint64_t` values regardless of their type.

	# Source Information

	TODO