| # 'linalg' Dialect |
| |
| [TOC] |
| |
| ## Rationale |
| |
| <img width="90" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png"> |
| |
| Linalg is designed to solve the High-level Hierarchical Optimization (HHO box) |
| in MLIR and to interoperate nicely within a *Mixture Of Expert Compilers* |
| environment (i.e. the *CGSel* box). |
| |
The [Rationale Document](../../Rationale/RationaleLinalgDialect.md) goes into
significantly more detail on the design and architectural decisions.
| |
| ## Set of Key Transformations<a name="key_transformations"></a> |
| |
| The following key transformations have been central to driving the design of |
| Linalg. They are all implemented in terms of the properties of the |
| `linalg.generic` OpInterface and avoid the pitfall of relying on hardcoded |
| one-off op knowledge. |
| |
| The textual form description of these transformations is left for future work. |
| Still, it is useful to list the key transformations that are performed on the |
| Linalg IR and that have influenced its design: |
| |
| 1. Progressive Buffer Allocation. |
| 1. Parametric Tiling. |
| 1. Promotion to Temporary Buffer in Fast Memory. |
| 1. Tiled Producer-Consumer Fusion with Parametric Tile-And-Fuse. |
| 1. Map to Parallel and Reduction Loops and Hardware. |
| 1. Vectorization: Rewrite in Vector Form. |
| 1. Lower to Loops (Affine, Generic, and Parallel). |
| 1. Lower to Library Calls or Special Instructions, Intrinsics or ISA. |
| 1. Partially Lower to Iterations Over a Finer-Grained Linalg Op. |
| |
| ## High-Level Description of Linalg Ops<a name="linalg_ops"></a> |
| |
| Linalg takes at least some inspiration from all previously |
| [listed prior art](../../Rationale/RationaleLinalgDialect.md/#prior-art). The |
| design enables the definition of ***CustomOps*** with generic properties that |
| enable [key transformations](#key_transformations), including lowering to scalar |
| load/store and other operations or to external library calls and intrinsics. |
| |
These ops can have ***either tensor or buffer*** as both input and output
operands. Output tensor operands serve the purpose of providing a unifying
abstraction and of giving a shape to the results. Output tensors can come in
two flavors and are always associated with a corresponding op result:
| |
1. an "init tensor" output value, which provides an initial value for a tensor
   that is created by iteratively updating the result (also called a
   "destructive update"). Such a tensor is always materialized in some form. If
   enough fusion occurs it may end up being materialized only as a
   register-level SSA value. It is expected (but not required) that the
   destructive update pattern can be rewritten as an in-place update on
   buffers.
| |
2. a "shape-only" tensor output value, whose underlying elements are not used
   in the payload computation and which only serves the purpose of carrying
   shape information to lower levels of abstraction. In the future this will be
   replaced by an appropriate shape type when it is available as a builtin type
   (see the Discourse discussion
   [Linalg and Shapes](https://llvm.discourse.group/t/linalg-and-shapes/2421)
   for more details).
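As a plain-Python analogy (illustrative only; the names here are ours, not an MLIR API), an "init tensor" can be thought of as the starting accumulator of a destructive-update loop: the result is built by repeatedly producing the "next" value of the same tensor.

```python
def reduce_sum(values, init):
    """Model of an "init tensor" destructive update: the result starts from
    an initial value and is updated once per iteration."""
    acc = init  # the init operand provides the starting value of the result
    for v in values:
        acc = acc + v  # each step yields the next version of the result
    return acc

print(reduce_sum([1.0, 2.0, 3.0], 0.0))  # 6.0
```

After bufferization, such a chain of SSA updates is expected to become an in-place update of a single buffer.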
| |
| ### Payload-Carrying Ops<a name="payload_ops"></a> |
| |
| Linalg defines a payload carrying operation that implements the |
| [structured op](https://docs.google.com/presentation/d/1P-j1GrH6Q5gLBjao0afQ-GfvcAeF-QU4GXXeSy0eJ9I/edit#slide=id.p) |
| abstraction on tensors and buffers. This `linalg.generic` operation can express |
| custom operations that optionally have *indexing semantics* (by accessing the |
| iteration indices using the `linalg.index` operation). The properties of |
| `linalg.generic` are the result of applying the guiding principles described in |
| the [Rationale Document](../../Rationale/RationaleLinalgDialect.md). They are |
| listed next, with a brief example and discussion for each. |
| |
| #### Property 1: Input and Output Operands Define The Iteration Space<a name="prop1"></a> |
| |
| A `linalg.generic` op fully *derives* the specification of its iteration space |
| from its operands. The property enforces that a localized IR element (the op) |
| *has* all the information needed to synthesize the control-flow required to |
| iterate over its operands, according to their type. This notion of IR |
| localization bears some resemblance to |
| [URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf). |
| |
| Consider the following fully specified `linalg.generic` example. Here, the first |
| operand is a `memref` of `f32` scalar elements that has an ordinary identity |
| layout, and the second one is a `memref` of 4-element vectors with a 2-strided, |
| 1-offset layout. |
| |
| ```mlir |
| // File name: example1.mlir |
| #accesses = [ |
| affine_map<(m) -> (m)>, |
| affine_map<(m) -> (m)> |
| ] |
| |
| #attrs = { |
| indexing_maps = #accesses, |
| iterator_types = ["parallel"] |
| } |
| |
| func.func @example(%A: memref<?xf32, strided<[1]>>, |
| %B: memref<?xvector<4xf32>, strided<[2], offset: 1>>) { |
| linalg.generic #attrs |
| ins(%A: memref<?xf32, strided<[1]>>) |
| outs(%B: memref<?xvector<4xf32>, strided<[2], offset: 1>>) { |
| ^bb0(%a: f32, %b: vector<4xf32>): |
| %c = "some_compute"(%a, %b): (f32, vector<4xf32>) -> (vector<4xf32>) |
| linalg.yield %c: vector<4xf32> |
| } |
| return |
| } |
| ``` |
| |
| The property "*Input and Output Operands Define The Iteration Space*" is |
| materialized by a lowering into a form that will resemble: |
| |
| ```mlir |
| // Run: mlir-opt example1.mlir -allow-unregistered-dialect -convert-linalg-to-loops |
| // This converted representation is in the `scf` dialect. |
// Its syntax can be found here: https://mlir.llvm.org/docs/Dialects/SCFDialect/
| |
| func.func @example(%arg0: memref<?xf32>, |
| %arg1: memref<?xvector<4xf32>, strided<[2], offset: 1>>) { |
| %c0 = arith.constant 0 : index |
| %c1 = arith.constant 1 : index |
| %0 = memref.dim %arg0, %c0 : memref<?xf32> |
| scf.for %arg2 = %c0 to %0 step %c1 { |
| %1 = memref.load %arg0[%arg2] : memref<?xf32> |
| %2 = memref.load %arg1[%arg2] |
| : memref<?xvector<4xf32>, strided<[2], offset: 1>> |
| %3 = "some_compute"(%1, %2) : (f32, vector<4xf32>) -> vector<4xf32> |
| memref.store %3, %arg1[%arg2] |
| : memref<?xvector<4xf32>, strided<[2], offset: 1>> |
| } |
| return |
| } |
| ``` |
| |
The property participates in simplifying analyses and transformations. For
instance, it guarantees that no out-of-bounds access can occur by construction
(assuming dynamic operand dimensions agree with each other, which is the
purpose of the `assert` runtime check).
| |
| Before lowering to loop form, loop induction variables and iterators are |
| implicit (i.e. *not yet materialized*). |
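As an illustration only (plain Python, not compiler code; the helper name is hypothetical), the loop bounds of the lowered form can be derived mechanically from the operand shapes and their indexing maps, which is exactly what the property guarantees:

```python
def derive_bounds(shapes, indexing_maps, num_dims):
    """Derive iteration-space bounds from operand shapes.
    indexing_maps[i][j] gives the loop dimension indexing dimension j of
    operand i (identity/permutation maps only, for simplicity)."""
    bounds = [None] * num_dims
    for shape, amap in zip(shapes, indexing_maps):
        for size, d in zip(shape, amap):
            if bounds[d] is None:
                bounds[d] = size
            elif bounds[d] != size:
                # mismatching dynamic sizes: the runtime `assert` check
                raise ValueError("operand shapes disagree")
    return bounds

# One parallel loop `m`; both operands are indexed by (m) -> (m).
print(derive_bounds([(10,), (10,)], [(0,), (0,)], 1))  # [10]
```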
| |
| The main implications are that: |
| |
| 1. The semantics of the ops are *restricted to operate on structured data |
| types*, on which we can define an iterator. |
| |
| 2. This does not model arbitrary code with side-effects. |
| |
| We do not think these are serious limitations in practice because MLIR is all |
| about mixing different levels of abstractions in the same IR. As long as Linalg |
| can progressively lower to the next level of abstraction, it can also be just |
| bypassed for things that do not fit. |
| |
| At the same time, conditioning op semantics on structured data types is a very |
| promising path towards extensibility to non-dense tensors as experience with |
| LIFT abstractions for |
| [sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf) and |
| [position-dependent arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf), |
| as well as [TACO](http://tensor-compiler.org/), has shown. |
| |
| #### Property 2: Reversible Mappings Between Control and Data Structures<a name="prop2"></a> |
| |
| A `linalg.generic` *defines* the mapping between the iteration space (i.e. the |
| loops) and the data. |
| |
| Consider the following fully specified `linalg.generic` example. Here, the first |
| `memref` is a 2-strided one on both of its dimensions, and the second `memref` |
| uses an identity layout. |
| |
| ```mlir |
| // File name: example2.mlir |
| #indexing_maps = [ |
| affine_map<(i, j) -> (j, i)>, |
| affine_map<(i, j) -> (j)> |
| ] |
| |
| #attrs = { |
| indexing_maps = #indexing_maps, |
| iterator_types = ["parallel", "parallel"] |
| } |
| |
| func.func @example(%A: memref<8x?xf32, strided<[2, 2], offset: 0>>, |
| %B: memref<?xvector<4xf32>>) { |
| linalg.generic #attrs |
| ins(%A: memref<8x?xf32, strided<[2, 2], offset: 0>>) |
| outs(%B: memref<?xvector<4xf32>>) { |
| ^bb0(%a: f32, %b: vector<4xf32>): |
| %c = "some_compute"(%a, %b): (f32, vector<4xf32>) -> (vector<4xf32>) |
| linalg.yield %c: vector<4xf32> |
| } |
| return |
| } |
| ``` |
| |
| The property "*Reversible Mappings Between Control and Data Structures*" is |
| materialized by a lowering into a form that will resemble: |
| |
| ```mlir |
| // Run: mlir-opt example2.mlir -allow-unregistered-dialect -convert-linalg-to-loops |
| |
| func.func @example(%arg0: memref<8x?xf32, strided<[2, 2]>>, %arg1: memref<?xvector<4xf32>>) { |
| %c8 = arith.constant 8 : index |
| %c0 = arith.constant 0 : index |
| %c1 = arith.constant 1 : index |
| %0 = memref.dim %arg0, %c1 : memref<8x?xf32, strided<[2, 2]>> |
| scf.for %arg2 = %c0 to %0 step %c1 { |
| scf.for %arg3 = %c0 to %c8 step %c1 { |
| %1 = memref.load %arg0[%arg3, %arg2] : memref<8x?xf32, strided<[2, 2]>> |
| %2 = memref.load %arg1[%arg3] : memref<?xvector<4xf32>> |
| %3 = "some_compute"(%1, %2) : (f32, vector<4xf32>) -> vector<4xf32> |
| memref.store %3, %arg1[%arg3] : memref<?xvector<4xf32>> |
| } |
| } |
| return |
| } |
| ``` |
| |
| This mapping needs to be reversible because we want to be able to go back and |
| forth between the two and answer questions such as: |
| |
| - Given a subset of the iteration space, what subset of data does it read and |
| write? |
| - Given a subset of data read or written, what subset of the iteration space |
| is responsible for this read or write? |
| |
Answering these two questions is one of the main analyses that Linalg uses to
implement transformations such as tiling, tiled producer-consumer fusion, and
promotion to temporary buffers in fast memory.
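For permutation-style indexing maps such as `(i, j) -> (j, i)` above, both directions of the mapping are easy to compute. A minimal Python sketch (the function names are ours, purely illustrative):

```python
def apply_map(perm, iters):
    """Forward: iteration-space point -> data-space point for a permutation
    map; e.g. (i, j) -> (j, i) is modeled as perm = (1, 0)."""
    return tuple(iters[d] for d in perm)

def invert_map(perm, data):
    """Inverse: data-space point -> iteration-space point that touches it."""
    iters = [None] * len(perm)
    for pos, d in enumerate(perm):
        iters[d] = data[pos]
    return tuple(iters)

p = (1, 0)  # models affine_map<(i, j) -> (j, i)>
print(apply_map(p, (2, 5)))   # (5, 2)
print(invert_map(p, (5, 2)))  # (2, 5)
```

Tiling a subset of the iteration space and asking which data it touches is the forward direction; cache promotion of a data tile uses the inverse.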
| |
| In the current implementation, `linalg.generic` uses a list of |
| [AffineMaps](https://mlir.llvm.org/docs/LangRef/#affinemap-attribute) (see the |
`#indexing_maps` attribute in the previous examples). This is a pragmatic
short-term solution, but note that in the longer term this property could even
be evaluated dynamically, similarly to inspector-executor algorithms.
| |
| #### Property 3: The Type Of Iterators is Defined Explicitly<a name="prop3"></a> |
| |
| A `linalg.generic` op fully *declares* the type of its iterators. This |
| information is used in transformations. |
| |
| These properties are derived from established practice in the field and mirror |
| the properties from Ken Kennedy's |
| [Optimizing Compilers for Modern Architectures](https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9). |
| The key idea of legality of loop transformations expressed by Kennedy is that |
| ***the lexicographic order of all dependence vectors must be preserved***. |
| |
| This can be better captured directly at the loop level thanks to specific |
| iterator types, among which: *parallel*, *reduction*, *partition*, |
| *permutable/monotonic*, *sequential*, *dependence distance*, ... |
| |
| These types are traditionally the result of complex dependence analyses and have |
| been referred to as "*bands*" in the polyhedral community (e.g. *parallel |
| bands*, *permutable bands*, etc, in |
| [ISL](https://en.wikipedia.org/wiki/Integer_set_library) schedule tree |
| parlance). |
| |
| Specifying the information declaratively in a `linalg.generic` allows conveying |
| properties that may be hard (or even impossible) to derive from lower-level |
| information. These properties can be brought all the way to the moment when they |
| are useful for transformations, used and then discarded. |
| |
| Additionally, these properties may also be viewed as a contract that the |
| frontend/user guarantees and that the compiler may take advantage of. The common |
| example is the use of data-dependent reduction semantics for specifying |
| histogram computations. If the frontend has additional knowledge that proper |
atomic operations are available, it may be better to specify parallel semantics
and use a special atomic operation in the computation region.
| |
| At this time, Linalg only has an explicit use for *parallel* and *reduction* |
| loops but previous experience shows that the abstraction generalizes. |
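A plain-Python illustration (hypothetical helper, not Linalg code) of what the *parallel* iterator type licenses: since iterations of a parallel dimension are independent, the compiler may visit them in any order, or distribute them, without changing the result.

```python
import random

def generic_2d(A, B, out, payload):
    """Execute a 2-D op whose two iterators are both *parallel*: iterations
    are independent, so any visiting order yields the same result."""
    points = [(i, j) for i in range(len(A)) for j in range(len(A[0]))]
    random.shuffle(points)  # legal precisely because both dims are parallel
    for i, j in points:
        out[i][j] = payload(A[i][j], B[i][j])
    return out

A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
print(generic_2d(A, B, [[0, 0], [0, 0]], lambda a, b: a + b))
# [[11, 22], [33, 44]]
```

A *reduction* iterator would not allow this freedom in general: its iterations communicate through the accumulator, so only associativity/commutativity assumptions (or atomics, as discussed above) justify reordering.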
| |
| #### Property 4: The Compute Payload is Specified With a Region<a name="prop4"></a> |
| |
| A `linalg.generic` op has a compute payload that is fully generic thanks to the |
| use of |
| [Regions](https://github.com/llvm/llvm-project/blob/58265ad42a90ae8905be6a447cb42e53529a54a0/mlir/docs/LangRef.md/#regions). |
| |
| The region takes as arguments the scalar elemental types of the tensor or buffer |
| operands of the `linalg.generic`. For flexibility and ability to match library |
| calls, additional special values may be passed. For instance, a `linalg.fill` |
| operation takes a buffer and an additional scalar value. |
| |
| At this time there are no additional restrictions to the region semantics. This |
| is meant to allow the exploration of various design tradeoffs at the |
| intersection of regions and iterator types. In particular, the frontend is |
| responsible for the semantics of iterator types to correspond to the operations |
| inside the region: the region can capture buffers arbitrarily and write into |
| them. If this conflicts with some parallel iterator requirement, this is |
| undefined behavior. |
| |
| Previous examples already elaborate compute payloads with an unregistered |
| function `"some_compute"`. The following code snippet shows what the result will |
| be when using a concrete operation `addf`: |
| |
| ```mlir |
| // File name: example3.mlir |
| #map = affine_map<(i, j) -> (i, j)> |
| |
| #attrs = { |
| indexing_maps = [#map, #map, #map], |
| iterator_types = ["parallel", "parallel"] |
| } |
| |
| func.func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) { |
| linalg.generic #attrs |
| ins(%A, %B: memref<?x?xf32>, memref<?x?xf32>) |
| outs(%C: memref<?x?xf32>) { |
| ^bb0(%a: f32, %b: f32, %c: f32): |
| %d = arith.addf %a, %b : f32 |
| linalg.yield %d : f32 |
| } |
| |
| return |
| } |
| ``` |
| |
This function performs an element-wise addition of the two matrices `%A` and
`%B` and stores the result into a third one (`%C`).
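Conceptually, `linalg.generic` here acts as a higher-order function: the region is a scalar function applied once per iteration point. A minimal Python model (illustrative only; all names are ours):

```python
def generic_elementwise(payload, A, B, C):
    """Model of the op above: the region `payload` is invoked once per
    iteration point, on operands scalarized through the (identity)
    indexing maps; its yielded value is stored into the output."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] = payload(A[i][j], B[i][j], C[i][j])
    return C

# The region ^bb0(%a, %b, %c) { yield arith.addf(%a, %b) } becomes:
add_region = lambda a, b, c: a + b
print(generic_elementwise(add_region, [[1., 2.]], [[3., 4.]], [[0., 0.]]))
# [[4.0, 6.0]]
```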
| |
| The property "*The Compute Payload is Specified With a Region*" is materialized |
| by a lowering into a form that will resemble: |
| |
| ```mlir |
| func.func @example(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) { |
| %c0 = arith.constant 0 : index |
| %c1 = arith.constant 1 : index |
| %0 = memref.dim %arg0, %c0 : memref<?x?xf32> |
| %1 = memref.dim %arg0, %c1 : memref<?x?xf32> |
| scf.for %arg3 = %c0 to %0 step %c1 { |
| scf.for %arg4 = %c0 to %1 step %c1 { |
| %2 = memref.load %arg0[%arg3, %arg4] : memref<?x?xf32> |
| %3 = memref.load %arg1[%arg3, %arg4] : memref<?x?xf32> |
| %4 = arith.addf %2, %3 : f32 |
| memref.store %4, %arg2[%arg3, %arg4] : memref<?x?xf32> |
| } |
| } |
| return |
| } |
| ``` |
| |
In the process of lowering to loops and lower-level constructs, similar
requirements are encountered, as discussed in the
[inlined call op proposal](https://llvm.discourse.group/t/introduce-std-inlined-call-op-proposal/282/2).
| We expect to be able to reuse the common lower-level infrastructure provided it |
| evolves to support both region arguments and captures. |
| |
| #### Property 5: May Map To an External Library Call<a name="prop5"></a> |
| |
| A `linalg.generic` op may map to an external library call by specifying a |
| `SymbolAttr`. At this level of abstraction, the important glue is the ability to |
| perform transformations that preserve the structure necessary to ***call the |
| external library after different transformations have been applied***. |
| |
| This involves considerations related to preservation of op semantics and |
| integration at the ABI level. Regardless of whether one wants to use external |
| library calls or a custom ISA, the problem for codegen is similar: preservation |
| of a fixed granularity. |
| |
| Consider the following example that adds an additional attribute |
| `library_call="pointwise_add"` that specifies the name of an external library |
| call we intend to use: |
| |
| ```mlir |
| // File name: example4.mlir |
| #indexing_maps = [ |
| affine_map<(i, j) -> (i, j)>, |
| affine_map<(i, j) -> (i, j)>, |
| affine_map<(i, j) -> (i, j)> |
| ] |
| |
| #attrs = { |
| indexing_maps = #indexing_maps, |
| iterator_types = ["parallel", "parallel"], |
| library_call = "pointwise_add" |
| } |
| |
| func.func @example(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) { |
| linalg.generic #attrs |
| ins(%A, %B: memref<?x?xf32>, memref<?x?xf32>) |
| outs(%C: memref<?x?xf32>) { |
| ^bb0(%a: f32, %b: f32, %c: f32): |
| %d = arith.addf %a, %b : f32 |
| linalg.yield %d : f32 |
| } |
| return |
| } |
| ``` |
| |
| The property "*Map To an External Library Call*" is materialized by a lowering |
| into a form that will resemble: |
| |
| ```mlir |
| // Run: mlir-opt example4.mlir -convert-linalg-to-std |
| |
| func.func @example(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) { |
| %0 = memref.cast %arg0 : memref<?x?xf32> to memref<?x?xf32, strided<[?, ?], offset: ?>> |
| %1 = memref.cast %arg1 : memref<?x?xf32> to memref<?x?xf32, strided<[?, ?], offset: ?>> |
| %2 = memref.cast %arg2 : memref<?x?xf32> to memref<?x?xf32, strided<[?, ?], offset: ?>> |
| call @pointwise_add(%0, %1, %2) : (memref<?x?xf32, strided<[?, ?], offset: ?>>, |
| memref<?x?xf32, strided<[?, ?], offset: ?>>, memref<?x?xf32, strided<[?, ?], offset: ?>>) -> () |
| return |
| } |
| func.func @pointwise_add(memref<?x?xf32, strided<[?, ?], offset: ?>>, |
| memref<?x?xf32, strided<[?, ?], offset: ?>>, |
| memref<?x?xf32, strided<[?, ?], offset: ?>>) attributes {llvm.emit_c_interface} |
| ``` |
| |
| Which, after lowering to LLVM resembles: |
| |
| ```mlir |
| // Run: mlir-opt example4.mlir -convert-linalg-to-std | mlir-opt -convert-func-to-llvm |
// Some of the generated code is omitted here.
| func.func @example(%arg0: !llvm<"float*">, ...) { |
| ... |
| llvm.call @pointwise_add(...) : (!llvm<"float*">, ...) -> () |
| return |
| } |
| |
| llvm.func @pointwise_add(%arg0: !llvm<"float*">, ...) attributes {llvm.emit_c_interface} { |
| ... |
  llvm.call @_mlir_ciface_pointwise_add(%9, %19, %29) : (!llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">) -> ()
| llvm.return |
| } |
llvm.func @_mlir_ciface_pointwise_add(!llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">, !llvm<"{ float*, float*, i64, [2 x i64], [2 x i64] }*">) attributes {llvm.emit_c_interface}
| ``` |
| |
| ##### Convention For External Library Interoperability |
| |
| The `linalg` dialect adopts a convention that is similar to `BLAS` when |
| offloading operations to fast library implementations: pass a non-owning pointer |
| to input and output data with additional metadata. This convention is also found |
in libraries such as `MKL`, `OpenBLAS`, `BLIS`, `cuBLAS`, `cuDNN`, etc., and
| more generally at interface points across language boundaries (e.g. C++ / |
| Python). |
| |
| Generally, `linalg` passes non-owning pointers to View data structures to |
| pre-compiled library calls linked externally. |
| |
| There is an |
| [ongoing discussion](https://llvm.discourse.group/t/lowering-optional-attributes-in-linalg-structuredops-to-standard-dialect/333/3) |
| on the topic of extending interoperability in the presence of key attributes. |
| |
| #### Property 6: Perfectly Nested Writes To The Whole Output Operands<a name="prop6"></a> |
| |
| Perfectly nested loops form a particularly important class of structure that |
| enables key loop transformations such as tiling and mapping to library calls. |
| Unfortunately, this type of structure is easily broken by transformations such |
| as partial loop fusion. Tiling and mapping to library calls become more |
| challenging, or even infeasible. Linalg ops adopt perfect-nestedness as a |
| first-class property: the structure cannot be broken and is transported in the |
| IR by construction. |
| |
| A `linalg.generic` op represents a perfectly nested loop nest that writes the |
| entire memory region. This is a structural constraint across regions and loops |
| that has proven to be key in simplifying transformations. |
| |
| One particular point to mention is that converting imperfectly nested code into |
| perfectly nested code can often be done with enough loop distribution and |
| embedding of conditionals down to the innermost loop level. |
| |
| Previous experience with Tensor Comprehensions gave us the intuition that |
| forcing innermost control-flow nesting is a lot like writing data-parallel code |
| with arrays of boolean values and predication. This type of trick has also been |
| used before in polyhedral compilers to convert non-affine control into affine |
| compute dependencies. |
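The rewrite from imperfectly nested code to perfectly nested code plus predication can be illustrated in plain Python (not Linalg code; the functions are hypothetical):

```python
def imperfect(A):
    """Imperfectly nested: a statement sits between the two loop levels."""
    total = 0
    for i in range(len(A)):
        total += 1  # outer-level statement breaks perfect nesting
        for j in range(len(A[0])):
            total += A[i][j]
    return total

def perfectly_nested(A):
    """Same computation, with the outer-level statement sunk to the
    innermost level under a predicate."""
    total = 0
    for i in range(len(A)):
        for j in range(len(A[0])):
            if j == 0:        # predicate replaces the outer-level statement
                total += 1
            total += A[i][j]
    return total

A = [[1, 2], [3, 4]]
print(imperfect(A), perfectly_nested(A))  # 12 12
```

Both functions compute the same value; the second form is what "deep predication" produces, and it is this predication that later canonicalizations and LICM must clean up.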
| |
| While it may be possible to automate such rewrites from generic IR, |
| `linalg.generic` just forces the semantics for now. |
| |
The key implication is that this conversion to deep predication needs to be
undone once we are done with Linalg transformations. After iterators and
induction variables are materialized (i.e. after lowering out of
`linalg.generic` has occurred), the overall performance will be greatly
influenced by the quality of canonicalizations, foldings and *Loop-Invariant
Code Motion* (LICM).
| |
| In the grander scheme, the reliance on late LICM was deemed a necessary risk. |
| |
| #### Putting it Together<a name="summary"></a> |
| |
| As it stands, the six properties above define the semantics of a |
| `linalg.generic` op. It is an open question whether all of these semantics are |
| strictly necessary in practice and whether some should or could be derived |
| automatically while still maintaining the |
| [core guiding principles](../../Rationale/RationaleLinalgDialect.md/#core-guiding-principlesa-nameguiding_principlesa). |
| |
| For the time being, we have settled on the combination of these properties |
| because of empirical evidence building and working on multiple high-level |
| compilers. As we lay those down and engage more with the community, we expect |
| multiple rounds of discussions and design changes to the original architecture. |
| |
| ### Data Representation: Views<a name="views"></a> |
| |
| The current implementation uses the |
| [Strided MemRef (a.k.a View)](https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/MaL8m2nXuio) |
| abstraction. The name *View* is used interchangeably in `linalg` to signify |
| *Strided MemRef*. In the future we expect to use other structured data types and |
| support ragged, mixed-sparse and other types. We expect to draw on the |
| experience from existing LIFT abstractions for |
| [sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf) and |
| [position-dependent arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf). |
| |
| ### Metadata Ops<a name="metadata_ops"></a> |
| |
| A set of ops that manipulate metadata but do not move memory. These ops take |
| `view` operands + extra attributes and return new `view`s. The returned `view`s |
| generally alias the operand `view`. At the moment the existing ops are: |
| |
| ``` |
* `memref.view`,
* `memref.subview`,
* `memref.transpose`,
* `linalg.slice`,
* `linalg.reshape`.
| ``` |
| |
| Future ops are added on a per-need basis but should include: |
| |
| ``` |
| * `linalg.tile`, |
| * `linalg.intersection`, |
| * `linalg.convex_union`, |
| * `linalg.difference` (would need to work on a list of views). |
| ``` |
| |
| These additional operations correspond to abstractions that have been known to |
| work in the field of large-scale distributed stencil computations. |
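A small Python model of the metadata-only behavior (all names here are ours, loosely mirroring `memref.transpose` / `memref.subview` semantics, not any MLIR API): a view is just `(buffer, offset, sizes, strides)`, and metadata ops compute new tuples over the *same* buffer.

```python
class View:
    """A strided-memref-style view: metadata over a flat buffer."""
    def __init__(self, buf, offset, sizes, strides):
        self.buf, self.offset = buf, offset
        self.sizes, self.strides = sizes, strides

    def load(self, idx):
        pos = self.offset + sum(i * s for i, s in zip(idx, self.strides))
        return self.buf[pos]

def transpose(v):
    """Metadata-only transpose: no data moves; the result aliases v."""
    return View(v.buf, v.offset, v.sizes[::-1], v.strides[::-1])

def subview(v, offsets):
    """Metadata-only subview starting at `offsets`; sizes shrink accordingly."""
    off = v.offset + sum(o * s for o, s in zip(offsets, v.strides))
    sizes = [sz - o for sz, o in zip(v.sizes, offsets)]
    return View(v.buf, off, sizes, v.strides)

buf = list(range(12))              # a 3x4 row-major buffer
v = View(buf, 0, [3, 4], [4, 1])
print(v.load([1, 2]))              # 6
print(transpose(v).load([2, 1]))   # 6 (same element, swapped indices)
```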
| |
| In a longer-term future, the abstractions from |
| [Legion data-centric programming model](https://legion.stanford.edu/overview/) |
| seem generally appealing. |
| |
| ### Named Payload-Carrying Ops<a name="named_ops"></a> |
| |
| Additionally, `linalg` provides a small subset of commonly named operations: |
| |
| ``` |
| * `linalg.fill`, |
| * `linalg.dot`, |
| * `linalg.matmul`, |
| * `linalg.conv`. |
| ``` |
| |
| These named operations adhere to the `linalg.generic` op interface. Work is in |
| progress to define declarative mechanisms to automatically generate named ops |
| from a description in terms of only the generic op interface. |
| |
| This is the main reason there are only a small number of ops today: we expect |
| them to be auto-generated from Tablegen soon. |
| |
| ### Named Payload Ops Specification |
| |
| Linalg provides a declarative specification and a generation tool |
| (`mlir-linalg-ods-gen`) to automatically produce named ops from a notation that |
| is inspired by Einstein notation. |
| |
| The syntax and semantics used in `mlir-linalg-ods-gen` are very much in flight |
| and borrow from Tensor Comprehensions (TC) but differ in a few dimensions, to |
| better adapt to Linalg: |
| |
| 1. The input and output tensor parameters are specified as `id : |
| type(symbolic-affine-expression-list)` (e.g. `A : f32(M, N + M)`) and each |
| new symbol is discovered eagerly. TC on the other hand does not allow |
| general symbolic affine expressions. |
| 1. The output shapes are specified explicitly, in TC they are always derived |
| from the input shapes. |
| 1. The operations used to specify computations use EDSC intrinsics so that they |
| can easily be parsed and emitted into a simple region builder without |
| resorting to more general MLIR parsing. |
1. Reduction dimensions are specified with angle bracket notation on the
   operation they apply to (e.g. `std_add<k>` specifies that `k` is a reduction
   dimension). In TC, the reduction dimensions are inferred. If one of the
   operands is not used in any expression, it is considered a shape-only
   operand, and the result dimensions of its indexing map are treated as
   reduction dimensions.
1. The parallel and reduction dimensions are ordered by the textual program
| order. For instance, in the comprehension `O(i, j) = std_add<k, l>(...)`, |
| `i` (resp. `j`) is a parallel iterator encoded by affine dimension of |
| position `0` (resp. `1`); `k` (resp. `l`) is a reduction iterator encoded by |
| an affine dimension of position `2` (resp. `3`). |
1. A list of attributes can be defined for the op with the format
   `attr(strides: 2xi32)` and referenced in a comprehension like `strides[0]`.
   These attribute uses are parsed as affine symbols when generating the op
   definition and implementation. For a concrete op instance, the runtime
   constant values of the attributes replace the affine symbols and simplify
   the indexing maps.
| |
| These decisions and syntax are subject to evolution and change. In particular, |
| op-specific attributes, dynamic ranks, some form of templating, shape |
| calculation function specification, etc. may be added in the future. |
| |
| At this time, the following restrictions are imposed on the syntax and |
| semantics: |
| |
| 1. Each def may only contain a single comprehension but each comprehension may |
| perform multiple updates. |
| 2. Each tensor may only be used with a single indexing expression. |
| |
A `"""`-wrapped doc string can be attached to the named op. It should contain a
one-line summary first, followed by a lengthier description.
| |
| The following specification may be used to define a named `batchmatmul` op: |
| |
| ``` |
| def batchmatmul(A: f32(Batch, M, K), B: f32(K, N)) -> (C: f32(Batch, M, N)) |
| """Batch matrix-multiply operation. |
| |
| This operation performs batch matrix-multiply over ... |
| """ |
| { |
| C(b, m, n) = std_addf<k>(std_mulf(A(b, m, k), B(k, n))); |
| } |
| ``` |
| |
| When `mlir-linalg-ods-gen -gen-ods-decl=1` is called, the following ODS is |
| produced: |
| |
| ``` |
| def batchmatmulOp : LinalgNamedStructured_Op<"batchmatmul", [ |
| NInputs<2>, |
| NOutputs<1>, |
| NamedStructuredOpTrait]> { ... } |
| ``` |
| |
| When `mlir-linalg-ods-gen -gen-impl=1` is called, the following C++ is produced: |
| |
| ``` |
| std::optional<SmallVector<StringRef, 8>> batchmatmul::referenceIterators() { |
| return SmallVector<StringRef, 8>{ |
| getParallelIteratorTypeName(), |
| getParallelIteratorTypeName(), |
| getParallelIteratorTypeName(), |
| getReductionIteratorTypeName() }; |
| } |
| std::optional<SmallVector<AffineMap, 8>> batchmatmul::referenceIndexingMaps() { |
| MLIRContext *context = getContext(); |
| AffineExpr d0, d1, d2, d3; |
| bindDims(context, d0, d1, d2, d3); |
| return SmallVector<AffineMap, 8>{ |
| AffineMap::get(4, 0, {d0, d1, d3}), |
| AffineMap::get(4, 0, {d3, d2}), |
| AffineMap::get(4, 0, {d0, d1, d2}) }; |
| } |
| void batchmatmul::regionBuilder(ArrayRef<BlockArgument> args) { |
| using namespace edsc; |
| using namespace intrinsics; |
| Value _0(args[0]), _1(args[1]), _2(args[2]); |
| Value _4 = std_mulf(_0, _1); |
| Value _5 = std_addf(_2, _4); |
| (linalg_yield(ValueRange{ _5 })); |
| } |
| ``` |
| |
| ### YAML Based Named Structured Ops<a name="yaml-gen"></a> |
| |
| Linalg provides a declarative generation tool (`mlir-linalg-ods-yaml-gen`) to |
| automatically produce named ops from a YAML-based op description format intended |
| to capture the structure of the named ops. The YAML-based op descriptions are |
| generated from a higher level [DSL](OpDSL.md) and are not meant to be edited |
| directly. |
| |
| This facility is currently in flight and is intended to subsume the above when |
| ready. See the C++ class to YAML mapping traits in |
| `mlir-linalg-ods-yaml-gen.cpp` as the source of truth for the schema. |
| |
| Most of the above documentation roughly applies to this path and will be ported |
| as migration continues. |
| |
| ## Open Issues and Design Alternatives<a name="open_issues"></a> |
| |
| Multiple open issues and design alternatives are in flight and it is time to lay |
| them out for the community to discuss and pick apart: |
| |
| 1. Should `linalg.generic` support nesting? |
| 1. Should `linalg.generic` regions take views or only scalars? |
| 1. Should we try to solve automatic differentiation at this level of |
| abstraction? |
| 1. Are all the six properties really necessary? |
| 1. Is this relying too much on declarative specification and would we be better |
| off relying more on analyses? |
| 1. Is this general enough for the community's needs? If not how should this be |
| extended, if at all? ... |
| |
| These key questions (and much more) should be really thought of in the general |
| context of MLIR in which different levels of IR interoperate seamlessly. In |
| practice, it is not necessary (or beneficial) to try and solve all problems in |
| the same IR. |
| |
| ## Operations |
| |
| [include "Dialects/LinalgOps.md"] |