# MLIR Rationale

This document is intended to capture some of the alternatives considered and
open debates in the design of MLIR, along with the rationale for certain
decisions we made. This is not intended to be a "finely groomed" document - we
prefer the ability to dump in interesting tidbits without worrying too much
about their consistency or readability.

[TOC]

## Abstract

MLIR is a compiler intermediate representation with similarities to traditional
three-address SSA representations (like
[LLVM IR](http://llvm.org/docs/LangRef.html) or
[SIL](https://github.com/apple/swift/blob/main/docs/SIL.rst)), but which
introduces notions from polyhedral loop optimization as first class concepts.
This hybrid design is optimized to represent, analyze, and transform high level
dataflow graphs as well as target-specific code generated for high performance
data parallel systems. Beyond its representational capabilities, its single
continuous design provides a framework to lower from dataflow graphs to high
performance target specific code.

MLIR stands for one of "Multi-Level IR", "Multi-dimensional Loop IR", "Machine
Learning IR", or "Mid Level IR"; we prefer the first. This document only
provides the rationale behind MLIR -- its actual
[specification document](../LangRef.md) and other content is hosted elsewhere.

## Introduction and Motivation

The Multi-Level Intermediate Representation (MLIR) is intended for easy
expression and optimization of computations involving deep loop nests and dense
matrices of high dimensionality. It is thus well-suited to deep learning
computations in particular. Yet it is general enough to also represent
arbitrary sequential computation. The representation allows high-level
optimization and parallelization for a wide range of parallel architectures
including those with deep memory hierarchies --- general-purpose multicores,
GPUs, and specialized neural network accelerators.

MLIR uses ideas drawn from the IRs of LLVM and Swift for lower level constructs
while combining them with ideas from the polyhedral abstraction to represent
loop nests, multidimensional data (tensors), and transformations on these
entities as first class concepts in the IR.

MLIR is a multi-level IR: it can represent code at multiple levels of
abstraction, from domain-specific representations such as HLO or TensorFlow
graphs all the way down to the machine level. MLIR is able to represent
arbitrary control flow and arbitrary data accesses, and is general enough to
represent nearly all sequential computation. This is a key distinction from
existing polyhedral representation implementations (such as LLVM
[Polly](https://polly.llvm.org/)) that are able to use the polyhedral
abstraction in a way isolated from the LLVM IR and only for affine loop nests,
i.e., portions of the code where array accesses, loop bounds, and conditionals
are regular (involve linear functions of loop iterators and constant symbols).
The presence of statically unpredictable data accesses or control flow does not
preclude representation in MLIR, but only limits to a certain extent the
ability to reason about and apply transformations using the polyhedral
abstraction.

Maps, sets, and relations with affine constraints are the core structures
underlying a polyhedral representation of high-dimensional loop nests and
multidimensional arrays. These structures are represented as textual
expressions in a form close to their mathematical form. They are used to
capture loop nests, tensor data structures, and how they are reordered and
mapped for a target architecture. All structured or "conforming" loops are
captured as part of the polyhedral information, and so are tensor variables,
their layouts, and subscripted accesses to these tensors in memory.

The information captured in the IR allows a compact expression of all loop
transformations, data remappings, explicit copying necessary for explicitly
addressed memory in accelerators, mapping to pre-tuned expert-written
primitives, and mapping to specialized vector instructions. Loop
transformations that can be easily implemented include the full set of affine
transformations: these subsume all traditional loop transformations (unimodular
and non-unimodular) such as loop tiling, interchange, permutation, skewing,
scaling, relative shifting, reversal, fusion, and distribution/fission.
Transformations on data layout such as padding and transforming to blocked
layouts are also represented well via affine layout maps.

MLIR's design allows a progressive lowering to target-specific forms. Besides
high-level transformations for loop nests and data layouts that a typical
mid-level optimizer is expected to deal with, MLIR is also designed to perform
certain low-level scheduling and mapping decisions that a typical backend IR is
entrusted with: these include mapping to specialized vector instructions,
auto-vectorization, and software pipelining. The need to support these
transformations stems from the fact that neural network accelerators have
specialized units that deal with large chunks of data whose computation maps
back to multiple loops of the loop nest, as viewed by a program at a level
closer to the original specification. Such specialized units or instructions
operate on multidimensional data chunks from a programmer's viewpoint, which
makes it hard or infeasible for a backend operating on a very low-level IR
close to assembly to lift and reconstruct loops and perform such a mapping.
This is in contrast to classic instruction selection and scheduling in today's
compilers, which primarily deal only with the body of the innermost loop. MLIR
also facilitates automatic mapping to expert pre-tuned primitives or vendor
libraries operating on data at higher levels (or at the highest level) of the
memory hierarchy.

In summary, MLIR is convenient for and closed under the kind of transformations
needed to lower to general-purpose as well as specialized accelerators. It also
allows one to build modular and reusable target independent and target
dependent passes.

## Design Decisions

This section sheds light on some of the design decisions -- some of these are
indirectly implied by the specification document.

### Loads and stores

The 'load' and 'store' instructions are specifically crafted to fully resolve
to an element of a memref. For an n-ranked memref, these instructions take n+1
arguments: the memref itself and n indices. This disallows the equivalent of
pointer arithmetic or the ability to index into the same memref in other ways
(something which C arrays allow for example). Furthermore, for the affine
constructs, the compiler can follow use-def chains (e.g. through
[affine.apply operations](../Dialects/Affine.md/#affineapply-affineapplyop) or
through the map attributes of
[affine operations](../Dialects/Affine.md/#operations)) to precisely analyze
references at compile-time using polyhedral techniques. This is possible
because of the
[restrictions on dimensions and symbols](../Dialects/Affine.md/#restrictions-on-dimensions-and-symbols).
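
As a sketch of what this analysis relies on (assuming a `%A` of type
`memref<100x100xf32>` is in scope), the indices of an `affine.load` are either
loop induction variables or results of `affine.apply`, so each access below is
an exact affine function of the enclosing loop counters:

```mlir
affine.for %i = 0 to 99 {
  affine.for %j = 0 to 100 {
    // The index is produced by affine.apply, so the access A[i + 1][j]
    // can be analyzed precisely at compile time.
    %ip1 = affine.apply affine_map<(d0) -> (d0 + 1)>(%i)
    %v = affine.load %A[%ip1, %j] : memref<100x100xf32>
    affine.store %v, %A[%i, %j] : memref<100x100xf32>
  }
}
```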

A scalar of element-type (a primitive type or a vector type) that is stored in
memory is modeled as a 0-d memref. This is also necessary for scalars that are
live out of for loops and if conditionals in a function, for which we don't yet
have an SSA representation --
[an extension](#affineif-and-affinefor-extensions-for-escaping-scalars) to
allow that is described later in this doc.
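
For example, a sketch of a scalar modeled as a 0-d memref (note the empty
index list on the load and store):

```mlir
%sum = memref.alloc() : memref<f32>      // 0-d memref holding one scalar
%zero = arith.constant 0.0 : f32
memref.store %zero, %sum[] : memref<f32>
%v = memref.load %sum[] : memref<f32>
```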

### Symbols and types

The current MLIR disallows use of symbols in types. For example, when a tensor
or memref dimension is statically unknown, it is denoted in the type as '?'. An
SSA symbol is then bound to it when a memref is created. The actual value of
the unknown dimension can be queried using the `memref.dim` operation as shown
below.

Example:

```mlir
func.func @foo(...) {
  %A = memref.alloc(%N) : memref<8x?xf32, #lmap>
  ...
  func.call @bar(%A) : (memref<8x?xf32, #lmap>) -> ()
}

func.func @bar(%A : memref<8x?xf32, #lmap>) {
  // The type of %A indicates that %A has dynamic shape with 8 rows
  // and an unknown number of columns. The number of columns is queried
  // dynamically using the memref.dim operation.
  %c1 = arith.constant 1 : index
  %N = memref.dim %A, %c1 : memref<8x?xf32, #lmap>

  affine.for %i = 0 to 8 {
    affine.for %j = 0 to %N {
      // A[i, j] += 1
      %one = arith.constant 1.0 : f32
      %s1 = affine.load %A[%i, %j] : memref<8x?xf32, #lmap>
      %s2 = arith.addf %s1, %one : f32
      affine.store %s2, %A[%i, %j] : memref<8x?xf32, #lmap>
    }
  }
  return
}
```

An alternative design is to embed the reference to symbols directly in the
type - memref<8x%Nxf32>. We went for the current approach in MLIR because it
simplifies the design --- types remain immutable when the values of symbols
change.

### Block Arguments vs PHI nodes

MLIR Regions represent SSA using "[block arguments](../LangRef.md/#blocks)"
rather than [PHI instructions](http://llvm.org/docs/LangRef.html#i-phi) used in
LLVM. This choice is representationally identical (the same constructs can be
represented in either form) but block arguments have several advantages:

1.  LLVM PHI nodes always have to be kept at the top of a block, and
    transformations frequently have to manually skip over them. This is
    defined away with BB arguments.
1.  LLVM has a separate function Argument node. This is defined away with BB
    arguments, because the arguments to the entry block serve this purpose.
1.  Blocks of PHI nodes in LLVM execute atomically, which is surprising to
    compiler engineers and a common source of bugs (closely related to the
    "[lost copy](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.524.5461&rep=rep1&type=pdf)"
    problem in the SSA lowering literature). With the BB argument
    representation, this confusion is defined away.
1.  The entry list of PHI nodes in LLVM is unordered, and some blocks have
    thousands of predecessors (e.g. unwind blocks). This can cause long
    compile times because transformations have to linearly scan this list.
    This is defined away with the BB argument representation.
1.  LLVM has no way to represent values that are available only in one
    successor but not the other, e.g. its invoke instruction cannot produce
    the exception value JUST on the exception edge. Instead, the
    [landingpad instruction](http://llvm.org/docs/LangRef.html#landingpad-instruction)
    is a hack used to represent this. MLIR doesn't make use of this
    capability, but SIL uses it extensively, e.g. in the
    [switch_enum instruction](https://github.com/apple/swift/blob/main/docs/SIL.rst#switch-enum).

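As a small sketch of the block-argument form (using standard `cf` dialect
ops), the merge point receives the incoming value as a block argument instead
of starting with a PHI instruction:

```mlir
func.func @select_max(%a: i32, %b: i32) -> i32 {
  %cond = arith.cmpi sgt, %a, %b : i32
  cf.cond_br %cond, ^bb1, ^bb2
^bb1:
  cf.br ^bb3(%a : i32)
^bb2:
  cf.br ^bb3(%b : i32)
^bb3(%max: i32):  // LLVM equivalent: %max = phi [%a, %bb1], [%b, %bb2]
  return %max : i32
}
```
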
For more context, block arguments were previously used in the Swift
[SIL Intermediate Representation](https://github.com/apple/swift/blob/main/docs/SIL.rst),
and described in
[a talk on YouTube](https://www.youtube.com/watch?v=Ntj8ab-5cvE). The section
of interest
[starts here](https://www.youtube.com/watch?v=Ntj8ab-5cvE&t=596s).

### Index type usage and limitations

Index types are intended to be used for platform-specific "size" values and may
appear in subscripts, sizes of aggregate types and affine expressions. They are
also tightly coupled with `affine.apply` and affine.load/store operations; a
value must have `index` type to be acceptable to these operations.

We allow `index` types in tensors, vectors, and memrefs as a code generation
strategy has to map `index` to an implementation type and hence needs to be
able to materialize corresponding values. However, the target might lack
support for `vector` values with the target specific equivalent of the `index`
type.

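A small sketch of these uses (assuming a `%A` of type `memref<?xf32>` is in
scope):

```mlir
%c0 = arith.constant 0 : index
%n = memref.dim %A, %c0 : memref<?xf32>     // size query yields an index
%v = memref.load %A[%c0] : memref<?xf32>    // index used as a subscript
%is = arith.constant dense<[0, 1, 2, 3]> : vector<4xindex>  // index element type
```
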
### Data layout of non-primitive types

Data layout information such as the bit width or the alignment of types may be
target and ABI-specific and thus should be configurable rather than imposed by
the compiler. In particular, the layout of compound or `index` types may vary.
MLIR specifies default bit widths for certain primitive *types*, in particular
for integers and floats. It is equal to the number that appears in the type
definition, e.g. the bit width of `i32` is `32`, so is the bit width of `f32`.
The bit width is not *necessarily* related to the amount of memory (in bytes)
or the register size (in bits) that is necessary to store the value of the
given type. For example, `vector<3xi57>` is likely to be lowered to a vector of
four 64-bit integers, so that its storage requirement is `4 x 64 / 8 = 32`
bytes, rather than `(3 x 57) ceildiv 8 = 22` bytes as can be naively computed
from the bit width. MLIR makes such
[data layout information](../DataLayout.md) configurable using attributes that
can be queried during lowering, for example, when allocating a compound type.

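As a hedged sketch of what such a specification can look like (the `dlti`
attribute syntax shown here may differ across MLIR versions), a module can
carry a data layout spec that overrides the defaults, e.g. the bitwidth used
when lowering `index`:

```mlir
// Lower `index` to a 32-bit integer type on this (hypothetical) target.
module attributes {dlti.dl_spec = #dlti.dl_spec<
    #dlti.dl_entry<index, 32 : i64>
>} {
}
```
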
The data layout of dialect-specific types is undefined at the MLIR level. Yet
dialects are free to define their own quantities and make them available via
the data layout infrastructure.

### Integer signedness semantics

Integers in the builtin MLIR type system have a bitwidth (note that the `index`
type has a symbolic width equal to the machine word size), and they *may*
additionally have signedness semantics. The purpose is to satisfy the needs of
different dialects, which can model different levels of abstraction.
Abstractions closer to source languages might want to differentiate signedness
with integer types, while those closer to machine instructions might want
signless integers. Instead of forcing each abstraction to adopt the same
integer modelling or develop its own in house, the builtin integer type
provides this choice to help code reuse and consistency.

For the standard dialect, the choice is to have signless integer types. An
integer value does not have an intrinsic sign, and it's up to the specific op
to interpret it. For example, ops like `arith.addi` and `arith.muli` do two's
complement arithmetic, while some other operations get a sign, e.g.
`arith.divsi` vs `arith.divui`.

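A small sketch of this: the operands below are the same signless `i8` values,
and the op supplies the interpretation:

```mlir
%a = arith.constant -8 : i8            // bit pattern 0xF8
%b = arith.constant 3 : i8
%sq = arith.divsi %a, %b : i8          // signed:   -8 / 3 = -2
%uq = arith.divui %a, %b : i8          // unsigned: 248 / 3 = 82
```
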
LLVM uses the [same design](http://llvm.org/docs/LangRef.html#integer-type),
which was introduced in a revamp rolled out
[in the LLVM 2.0 integer type](http://releases.llvm.org/2.0/docs/LangRef.html#t_derived).
Prior to that, from
[LLVM 1.0](http://releases.llvm.org/1.0/docs/LangRef.html#t_classifications) to
[1.9](http://releases.llvm.org/1.9/docs/LangRef.html#t_classifications), LLVM
used signed types like "sbyte" and "ubyte". This shift was important and has
served LLVM well over the years. The reason it matters is that it is a good
thing for an intermediate representation to represent the same computation
with the same instruction. Signed types got in the way, because (e.g.) an "add
of an sbyte" does the same computation as an "add of a ubyte", but the type
system made them look artificially different. This split also required casts
like "cast from sbyte to ubyte" which do nothing at the machine level.
Removing signs from the type system eliminated these problems, making the
compiler simpler.

More information about this split is available in an old
[talk on YouTube](https://www.youtube.com/watch?v=VeRaLPupGks) talking about
LLVM 2.0.

Note that this rationale only applies to the "standard ops" dialect in which we
can express an opinion about its design. Other dialects generally try to model
an external system, and should aim to reflect its design as closely as
possible.

### Splitting floating point vs integer operations

The MLIR "Arith" dialect splits many integer and floating point operations
into different categories, for example `arith.addf` vs `arith.addi` and
`arith.cmpf` vs `arith.cmpi`
([following the design of LLVM](http://llvm.org/docs/LangRef.html#binary-operations)).
These instructions *are* polymorphic on the number of elements in the type
though, for example `addf` is used with scalar floats, vectors of floats, and
tensors of floats (LLVM does the same thing with its scalar/vector types).

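For instance, a sketch of the same op applied across shapes:

```mlir
%0 = arith.addf %a, %b : f32
%1 = arith.addf %v, %w : vector<4xf32>
%2 = arith.addf %s, %t : tensor<8x16xf32>
```
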
This split is important because floating point and integer operations are quite
different in practice: for example, floating point values include NaN's, so
[integer comparisons](http://llvm.org/docs/LangRef.html#icmp-instruction) and
[floating point comparisons](http://llvm.org/docs/LangRef.html#fcmp-instruction)
should use different comparison opcodes. On the arithmetic side of things,
floating point operations support rounding modes, floating point contractions,
["fast math"](http://llvm.org/docs/LangRef.html#fadd-instruction), and integers
may want to have two's complement overflow behavior or be undefined on
[various forms of wrapping](http://llvm.org/docs/LangRef.html#add-instruction)
for performance.

We are a long way from this sort of thing being a priority to care about in
MLIR, but since we have experience and know the right way to do this, we'd
rather design it in from the beginning.

Note that this rationale only applies to the "standard ops" dialect in which we
can express an opinion about its design. Other dialects generally try to model
an external system, and should aim to reflect its design as closely as
possible.

### Specifying sign in integer comparison operations

Since integers are [signless](#integer-signedness-semantics), it is necessary
to define the sign for integer comparison operations. This sign indicates how
to interpret the most significant bit of the integer: as a sign bit or as a
regular value bit. For example, comparing two `i4` values `0b1000` and
`0b0010` yields different results for unsigned (`8 > 2`) and signed (`-8 < 2`)
interpretations. This difference is only significant for *order* comparisons,
but not for *equality* comparisons. Indeed, for the latter all bits must have
the same value independently of the sign. Since both arguments have exactly
the same bit width and cannot be padded by this operation, it is impossible to
compare two values whose bit representations would differ while the values are
interpreted as equal.

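A sketch of the same bits compared under both interpretations:

```mlir
%x = arith.constant -8 : i4          // bit pattern 0b1000
%y = arith.constant 2 : i4           // bit pattern 0b0010
%s = arith.cmpi sgt, %x, %y : i4     // signed:   -8 > 2 -> false
%u = arith.cmpi ugt, %x, %y : i4     // unsigned:  8 > 2 -> true
%e = arith.cmpi eq, %x, %y : i4      // equality ignores sign -> false
```
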
### Specifying comparison kind as attribute

Unlike arithmetic, comparison operators share several common properties, e.g.
they cannot be considered associative. In practice, comparisons are sometimes
implemented by the same instruction or its variants so it makes sense to group
them together at the IR level.

An alternative would be introducing ten distinct operators for all currently
supported kinds of integer comparisons. These operators would have increased
the number of "reserved" names used by standard operations as well as the size
of the C++ API while their implementations would have been mostly identical.

The comparison kind is internally an integer attribute. However, for the sake
of readability by humans, custom assembly form accepts string literals that
are mapped to the underlying integer values: `cmpi "eq", %lhs, %rhs` better
implies integer equality comparison than `cmpi 0, %lhs, %rhs` where it is
unclear what gets compared to what else. This syntactic sugar is possible
thanks to parser logic redefinitions for custom assembly form of non-builtin
operations. Supporting it in the full notation would have required changing
how the main parsing algorithm works and may have unexpected repercussions.
While it would have been possible to store the predicate as a string
attribute, that would have made it impossible to implement switching logic
based on the comparison kind and would have made attribute validity checks
(one out of ten possible kinds) more complex.

### Regions

#### Attributes of type 'Block'

We considered representing regions through `ArrayAttr`s containing a list of a
special type `IRBlockAttr`, which in turn would contain a list of operations.
All attributes in MLIR are uniqued within the context, which would make the IR
inside the regions immortal for no good reason.

#### Use "inlined" functions as regions

We considered attaching a "force-inline" attribute on a function and/or a
function `call` operation. Even the minimal region support (the affine.for and
affine.if use cases that existed before regions) requires access to the values
defined in the dominating block, which is not supported by functions.
Conceptually, function bodies are instances of regions rather than the
inverse; regions can also be device kernels, alternative sections, etc.

#### Dedicated `region` operation

This would mean we have a special kind of operation that is allowed to have
regions while other operations are not. Such a distinction is similar to the
Stmt/Op difference we have had and chose to remove to make the IR simpler and
more flexible. It would also require analyses and passes to consider the
interplay between operations (e.g., an `affine.for` operation must be followed
by a region operation). Finally, a region operation can be introduced using
the current implementation, among other operations and without being special
in any sense.

#### Explicit capture of the values used in a region

Being able to use values defined outside the region implies that use-def
chains may contain uses from different nested regions. Consequently, IR
transformations and analyses can pull the instruction defining the value
across region boundaries, for example in case of TableGen-defined
canonicalization patterns. This would not be the case if all used values had
been passed as region arguments. One of the motivations for introducing
regions in the IR is precisely to enable cross-region analyses and
transformations that are simpler than inter-procedural transformations. Having
uses from different regions appear in the same use-def chain, contrary to an
additional data structure maintaining correspondence between function call
arguments as uses of the original definitions and formal arguments as new
definitions, enables such simplification. Since individual operations now
belong to blocks, which belong to regions, it is always possible to check if
the definition of the value belongs to the same region as its particular use.
The risk is that any IR traversal will need to handle this situation
explicitly, and it is easy to forget such a check (or, conversely, it is not
easy to design the right check in a TableGen pattern): traversing use-def
chains potentially crosses implicit semantic barriers, making it possible to
unknowingly break region semantics. This is expected to be caught in the
verifier after the transformation.

At the same time, one may choose to pass certain or all values as region
arguments to explicitly break the use-def chains in the current proposal. This
can be combined with an attribute-imposed semantic requirement disallowing the
body of the region to refer to any value from outside it.

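A minimal sketch of such an implicit capture (using the `scf` dialect for
illustration):

```mlir
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c8 = arith.constant 8 : index
%x = arith.constant 42 : i32
scf.for %i = %c0 to %c8 step %c1 {
  // %x is defined outside the region: its use-def chain crosses the
  // region boundary instead of %x being passed as a region argument.
  %y = arith.addi %x, %x : i32
}
```
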
### Dialect type extensions

This section describes the design decisions that shaped the dialect extensible
type system present in MLIR.

#### Interactions between dialects

There are two different interactions between dialects that are important to
understand. When types of a dialect are:

*   In operations of other dialects

    -   For standard/builtin operations, only builtin types are allowed. This
        restriction allows for operations to clearly understand the invariants
        that they are working under.
    -   Outside of standard/builtin operations, dialects are expected to
        verify the allowable operation types per operation.

*   In types of other dialects

    -   For builtin types, these types are allowed to contain types from other
        dialects. This simplifies the type system and removes the need for
        dialects to redefine all of the builtin aggregate types, e.g. tensor,
        as well as the memref type. Dialects are expected to verify that a
        specific type is valid within a builtin type, e.g. if a type can be an
        element of a tensor.
    -   For dialect types, the dialect is expected to verify any type
        invariants, e.g. if the tensor type can contain a specific type of
        that dialect.

#### Separating builtin and standard types

Following the separation between the built-in and standard dialect, it makes
sense to separate built-in types and standard dialect types. Built-in types
are required for the validity of the IR itself, e.g. the function type (which
appears in function signatures and generic assembly forms of operations).
Integer, float, vector, memref and tensor types, while important, are not
necessary for IR validity.

#### Unregistered types

MLIR supports unregistered operations in generic assembly form. MLIR also
supports a similar concept for types. When parsing, if the dialect for a
dialect type has not been registered, the type is modeled as an 'OpaqueType'.
This allows for types to be round-tripped without needing to link in the
dialect library that defined them. No additional information about opaque
types, outside of parsing/printing, will be available.

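For example, a sketch of a type from a hypothetical unregistered "bar" dialect
that still round-trips through the parser and printer:

```mlir
%0 = "foo.op"() : () -> !bar<"wrapped<i32>">
```
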
#### Dialect type syntax

Dialect extended types are represented as string literals wrapped inside of
the dialect namespace. This means that the parser delegates to the dialect for
parsing specific type instances. This differs from the representation of
dialect defined operations, which have an identifier name that the parser uses
to identify and parse them.

This representation was chosen for several reasons:

##### Dialects must provide custom type parsers

Dialect type parsing cannot plug into the existing parser infrastructure as
operations do with the OpAsmParser/Printer. Operations have a defined syntax
structure that is the same across all dialects. Types, on the other hand, may
have many different, and sometimes conflicting, parsing constraints that would
be difficult/unmaintainable to provide within a single interface.

This also has the added benefit of encouraging dialects to reuse existing
external type parsers. For example, an LLVM dialect may provide an MLIR LLVM
type that is simply a wrapper around LLVM types. The LLVM dialect would then
use the existing LLVM type parsing infrastructure.

Example:

```mlir
%s = "foo"() : () -> !llvm<"i32*">
```

##### Types do not always have canonical names

Unlike operations, types generally do not have a formal canonical name. For
example, function types have no defined keyword and integer types are defined
by a regular expression to support arbitrary bitwidth. Dialects with existing
type systems, e.g. LLVM, are likely to provide wrappers around their existing
type systems. For these wrapper types there is no simple canonical name; it's
logical to think of these types as existing within the namespace of the
dialect. If a dialect wishes to assign a canonical name to a type, it can be
done via [type aliases](../LangRef.md/#type-aliases).

### Tuple types

The MLIR type system provides first class support for defining
[tuple types](../Dialects/Builtin/#tupletype). This is due to the fact that
`Tuple` represents a universal concept that is likely to, and has already
begun to, present itself in many different dialects. Though this type is first
class in the type system, it merely serves to provide a common mechanism with
which to represent this concept in MLIR. As such, MLIR provides no standard
operations for interfacing with `tuple` types. It is up to dialect authors to
provide operations, e.g. extract_tuple_element, to interpret and manipulate
them. When possible, operations should prefer to use multiple results instead.
These provide a myriad of benefits, such as alleviating any need for
tuple-extract operations that merely get in the way of analysis and
transformation.

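A sketch of what a dialect-provided tuple op might look like in generic form
(the "mydialect" ops here are hypothetical, not standard operations):

```mlir
%t = "mydialect.make_tuple"(%a, %b) : (i32, f32) -> tuple<i32, f32>
%e = "mydialect.extract_tuple_element"(%t) {index = 0 : i32}
    : (tuple<i32, f32>) -> i32
```
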
### Assembly forms

MLIR supports both generic and custom assembly forms based on the following
considerations:

MLIR is an open system; it is designed to support modular and pluggable
dialects. Depending on whether there exists a corresponding dialect and whether
the dialect is plugged in, operations may or may not be registered into the
MLIR system. Yet we still need a way to investigate these operations. So the
generic assembly form is mandated by this aspect of the MLIR system. It
provides a default textual form for operations.

On the other hand, an assembly form is for assisting developers to investigate
the IR. The generic form serves as a safe fallback but it can be too verbose
for certain ops. Therefore, MLIR gives each dialect the choice to define a
custom assembly form for each operation according to the operation's semantics
and specific needs. The custom assembly form can de-duplicate information from
the operation to derive a more concise form, thus better facilitating the
comprehension of the IR.

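For instance, a sketch of the same operation written in both forms:

```mlir
// Custom (pretty) assembly form:
%sum = arith.addi %a, %b : i32
// Equivalent generic form, available even for unregistered operations:
%sum2 = "arith.addi"(%a, %b) : (i32, i32) -> i32
```
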
## Examples

This section describes a few very simple examples that help understand how
MLIR represents computation.

### Non-affine control flow

```c
// A simple linear search in every row of a matrix
for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    // dynamic control flow
    if (a[i][j] == key) {
      s[i] = j;
      break;
    }
  }
}
```

The presence of dynamic control flow leads to an inner non-affine function
nested in an outer function that uses affine loops.

```mlir
func.func @search(%A: memref<?x?xi32>, %S: memref<?xi32>, %key : i32) {
  %c0 = arith.constant 0 : index
  %ni = memref.dim %A, %c0 : memref<?x?xi32>
  // This loop can be parallelized
  affine.for %i = 0 to %ni {
    func.call @search_body(%A, %S, %key, %i)
        : (memref<?x?xi32>, memref<?xi32>, i32, index) -> ()
  }
  return
}

func.func @search_body(%A: memref<?x?xi32>, %S: memref<?xi32>, %key: i32,
                       %i : index) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %nj = memref.dim %A, %c1 : memref<?x?xi32>
  cf.br ^bb1(%c0 : index)

^bb1(%j: index):
  %p1 = arith.cmpi slt, %j, %nj : index
  cf.cond_br %p1, ^bb2, ^bb5

^bb2:
  %v = memref.load %A[%i, %j] : memref<?x?xi32>
  %p2 = arith.cmpi eq, %v, %key : i32
  cf.cond_br %p2, ^bb3(%j : index), ^bb4

^bb3(%pos: index):
  %pos_i32 = arith.index_cast %pos : index to i32
  memref.store %pos_i32, %S[%i] : memref<?xi32>
  cf.br ^bb5

^bb4:
  %jinc = arith.addi %j, %c1 : index
  cf.br ^bb1(%jinc : index)

^bb5:
  return
}
```

As per the [MLIR spec](../LangRef.md), the restrictions on dimensions and
symbol identifiers to be used with the affine.apply operation only apply to
accesses inside `affine.for` and `affine.if` operations. However, an analysis
of accesses inside the called function (`@search_body`) is necessary to
determine if the `%i` loop could be parallelized: such function access
analysis is calling context sensitive.

### Non-affine loop bounds

Loop bounds that are not affine lead to a nesting of functions as shown below.

```c
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    // Non-affine loop bound for k loop.
    for (k = 0; k < pow(2, j); k++)
      for (l = 0; l < N; l++) {
        // block loop body
        ...
      }
```

```mlir
func.func @outer_nest(%n : index) {
  %c2 = arith.constant 2 : index
  affine.for %i = 0 to %n {
    affine.for %j = 0 to %n {
      %pow = func.call @pow(%c2, %j) : (index, index) -> index
      func.call @inner_nest(%pow, %n) : ...
    }
  }
  return
}

func.func @inner_nest(%m : index, %n : index) {
  affine.for %k = 0 to %m {
    affine.for %l = 0 to %n {
      ...
    }
  }
  return
}
```

### Reference 2D Convolution

The following example illustrates a reference implementation of a 2D
convolution, which uses an integer set `#domain` to represent valid input data
in a dilated convolution.

```mlir
// Dilation factors S0 and S1 can be constant folded if constant at compile
// time.
#domain = (d0, d1)[S0,S1,S2,S3] : (d0 % S0 == 0, d1 % S1 == 0, d0 >= 0,
                                   d1 >= 0, S2 - d0 - 1 >= 0, S3 - d1 - 1 >= 0)
// Identity map (shown here for illustration).
#map0 = (d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4, d5, d6)

// Affine map from output to input coordinate space.
// d0 = output_h, d1 = output_w, d2 = kernel_h, d3 = kernel_w
// S0 = h_stride, S1 = w_stride, S2 = h_kernel_dilation, S3 = w_kernel_dilation
// S4 = h_pad_low, S5 = w_pad_low
//   input_h = output_h * h_stride + kernel_h * h_kernel_dilation - h_pad_low
//   input_w = output_w * w_stride + kernel_w * w_kernel_dilation - w_pad_low
#map1_0 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d0 * S0 + d2 * S2 - S4)
#map1_1 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d1 * S1 + d3 * S3 - S5)

// Semi-affine map to undilated input coordinate space.
// d0 = input_h, d1 = input_w, S0 = h_base_dilation, S1 = w_base_dilation.
#map2_0 = (d0, d1) [S0, S1] -> (d0 / S0)
#map2_1 = (d0, d1) [S0, S1] -> (d1 / S1)

// Conv2D shapes:
// input:  [batch, input_height, input_width, input_feature]
// kernel: [kernel_height, kernel_width, input_feature, output_feature]
// output: [batch, output_height, output_width, output_feature]
func.func @conv2d(%input: memref<16x1024x1024x3xf32, #lm0, /*scratchpad=*/1>,
                  %kernel: memref<5x5x3x32xf32, #lm0, /*scratchpad=*/1>,
                  %output: memref<16x512x512x32xf32, #lm0, /*scratchpad=*/1>) {
  affine.for %b = 0 to %batch {
    affine.for %oh = 0 to %output_height {
      affine.for %ow = 0 to %output_width {
        affine.for %of = 0 to %output_feature {
          affine.for %kh = 0 to %kernel_height {
            affine.for %kw = 0 to %kernel_width {
              affine.for %if = 0 to %input_feature {
                // Calculate input indices.
                %1_0 = affine.apply #map1_0 (%oh, %ow, %kh, %kw)
                    [%h_stride, %w_stride, %h_kernel_dilation,
                     %w_kernel_dilation, %h_pad_low, %w_pad_low]
                %1_1 = affine.apply #map1_1 (%oh, %ow, %kh, %kw)
                    [%h_stride, %w_stride, %h_kernel_dilation,
                     %w_kernel_dilation, %h_pad_low, %w_pad_low]

                // Check if access is not in padding.
                affine.if #domain(%1_0, %1_1)
                    [%h_base_dilation, %w_base_dilation, %h_bound, %w_bound] {
                  %2_0 = affine.apply #map2_0 (%1_0, %1_1)
                  %2_1 = affine.apply #map2_1 (%1_0, %1_1)
                  // Compute: output[output_indices] +=
                  //     input[input_indices] * kernel[kernel_indices]
                  call @multiply_accumulate(%input, %kernel, %output,
                      %b, %oh, %ow, %of, %kh, %kw, %if, %2_0, %2_1)
                }
              }
            }
          }
        }
      }
    }
  }
  return
}
```

TODO: (Add more examples showing the IR for a variety of interesting cases)

## Design alternatives and extensions

This is a list of some design alternatives and extensions that we discussed in
detail but did not include in the spec or postponed for future consideration
on demand. We will revisit these discussions when we have more implementation
experience and learn more about the challenges and limitations of our current
design in practice.

### Polyhedral code representation alternatives: schedule lists vs schedule trees vs affine loop/if forms

The current MLIR uses a representation of polyhedral schedules using a tree of
if/for loops. We extensively debated the tradeoffs involved in the typical
unordered polyhedral instruction representation (where each instruction has
multidimensional schedule information), discussed the benefits of schedule tree
forms, and eventually decided to go with a syntactic tree of affine if/else
conditionals and affine for loops. Discussion of the tradeoff was captured in
this document:
[MLIR: The case for a simplified polyhedral form](RationaleSimplifiedPolyhedralForm.md).

At a high level, we have two alternatives here:

1.  Schedule tree representation instead of an affine loop AST form: The
    current proposal uses an affine loop and conditional tree form, which is
    syntactic and with no separation of domains as sets and schedules as
    multidimensional affine functions. A schedule tree form however makes
    polyhedral domains and schedules a first class concept in the IR allowing
    compact expression of transformations through the schedule tree without
    changing the domains of instructions. Such a representation also hides
    prologues, epilogues, partial tiles, complex loop bounds and conditionals
    making loop nests free of "syntax". Cost models instead look at domains
    and schedules. In addition, if necessary such a domain schedule
    representation can be normalized to explicitly propagate the schedule into
    domains and model all the cleanup code. An example and more detail on the
    schedule tree form is in the next section.
1.  Having two different forms of "affine regions": an affine loop tree form
    and a polyhedral schedule tree form. In the latter, ops could carry
    attributes capturing domain, scheduling, and other polyhedral code
    generation options with IntegerSet, AffineMap, and other attributes.

#### Schedule Tree Representation for Affine Regions

This representation is based on a simplified form of the domain/schedule
representation used by the polyhedral compiler community. Domains represent
what has to be executed while schedules represent the order in which domain
elements are interleaved. We model domains as non-piece-wise convex integer
sets, and schedules as affine functions; however, the former can be
disjunctive, and the latter can be piece-wise affine relations. In the
schedule tree representation, domain and schedules for instructions are
represented in a tree-like structure which is called a schedule tree. Each
non-leaf node of the tree is an abstract polyhedral dimension corresponding to
an abstract fused loop for each ML instruction that appears in that branch.
Each leaf node is an ML Instruction.

```mlir
// A tiled matmul code (128x128x128) represented in schedule tree form

// #map0 = (d0, d1, d2, d3, d4, d5) -> (128*d0 + d3, 128*d1 + d4, 128*d2 + d5)
#intset_ij = (i, j) [M, N, K] : (i >= 0, -i + M - 1 >= 0, j >= 0,
                                 -j + N - 1 >= 0)
#intset_ijk = (i, j, k) [M, N, K] : (i >= 0, -i + M - 1 >= 0, j >= 0,
                                     -j + N - 1 >= 0, k >= 0, -k + K - 1 >= 0)
func.func @matmul(%A, %B, %C, %M, %N, %K) : (...) { // %M, N, K are symbols
  // t1, t2, t3, t4, t5, t6 are abstract polyhedral loops
  mldim %t1 : {S1,S2,S3,S4,S5} floordiv (i, 128) {
    mldim %t2 : {S1,S2,S3,S4,S5} floordiv (j, 128) {
      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
      call dma_mem_to_scratchpad(%C, %i, %j, %M, %N, %K)
          with #intset_ij(%i, %j) [%M, %N, %K]
      mldim %t3 : {S2,S3,S4,S5} floordiv (k, 128) {
        // (%i, %j, %k) = affine.apply (d0, d1, d2)
        //                    -> (128*d0, 128*d1, 128*d2) (%t1, %t2, %t3)
        call dma_mem_to_scratchpad(%A, ...) with #intset_ijk (%i, %j, %k) [%M, %N, %K]
        // (%i, %j, %k) = affine.apply (d0, d1, d2)
        //                    -> (128*d0, 128*d1, 128*d2) (%t1, %t2, %t3)
        call dma_mem_to_scratchpad(%B, ...) with #intset_ijk (%i, %j, %k) [%M, %N, %K]
        mldim %t4 : {S4} i mod 128 {
          mldim %t5 : {S4} j mod 128 {
            mldim %t6 : {S4} k mod 128 {
              // (%i, %j, %k) = affine.apply #map0 (%t1, %t2, %t3, %t4, %t5, %t6)
              call matmul_body(%A, %B, %C, %i, %j, %k, %M, %N, %K)
                  with #intset_ijk(%i, %j, %k) [%M, %N, %K]
            } // end mldim t6
          } // end mldim t5
        } // end mldim t4
      } // end mldim t3
      // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
      call @dma_scratchpad_to_mem_C ... with #intset_ij(%i, %j) [%M, %N, %K]
    } // end mldim t2
  } // end mldim t1
  return
}
```

### Affine Relations

The current MLIR spec includes affine maps and integer sets, but not affine
relations. Affine relations are a natural way to model read and write access
information, which can be very useful to capture the behavior of external
library calls where no implementation is available, high-performance vendor
libraries, or user-provided / user-tuned routines.

An affine relation is a relation between input and output dimension
identifiers, parameterized by a list of symbolic identifiers, with affine
constraints on the identifiers.

Syntax:

```
// Affine relation definition at the top of file
affine-rel-def ::= affine-rel-id `=` affine-relation-inline

affine-rel-id ::= `##` prefixed-id

affine-relation-inline ::=
  `(` input-dims `)` (`[` symbols `]`)? `->`
  `(` output-dims `)` : affine-constraint-conjunction

input-dims ::= bare-id-list
output-dims ::= bare-id-list
symbols ::= bare-id-list

affine-rel ::= affine-rel-id | affine-relation-inline

// Usage
affine-rel-spec ::= affine-rel dim-and-symbol-use-list
```

All identifiers appearing in input-dims, output-dims, and symbol-dims are
pairwise distinct. All affine-constraint non-terminals in the above syntax are
allowed to contain identifiers only from input-dims, output-dims, and
symbol-dims.

Affine relations are used to model read, write, may_read, and may_write sets
of functions in the IR. The output dimension identifiers correspond to the
data dimensions.

Example:

```mlir
// read relation: two elements ( d0 <= r0 <= d0+1 )
##aff_rel9 = (d0) -> (r0) : r0 - d0 >= 0, d0 - r0 + 1 >= 0

func.func @count (%A : memref<128xf32>, %pos : i32) -> f32
  reads: {%A ##aff_rel9 (%pos)}
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */ {
  %val0 = affine.load %A [%pos]
  %val1 = affine.load %A [%pos + 1]
  %p = arith.mulf %val0, %val1 : f32
  return %p : f32
}
```

### Regions

#### Making function definition an operation

MLIR supports values of a Function type. Instead of having a first-class IR
concept for functions, one could define an operation with a body region that
defines a function value. The particularity of functions is that their names
are globally visible and can be referred to before being defined, unlike SSA
values that must be defined first. Implementing a "function definition"
operation would require relaxing some of the SSA constraints in a region, and
also making the IR Module a region as well. It would also affect the core
infrastructure (e.g., function passes) only for the sake of concept
unification.

#### Having types on a region

Instead of inspecting the types of arguments of the first block, one could
give the region itself a type. This type would be redundant with block
argument types, which must be specified in any case, and would create room for
type mismatches. While functions do have types that are partly redundant with
the arguments of the first block in the function, this is necessary to support
function declarations that do not have a body which we can refer to in order
to obtain the argument types. A region is always contained in an operation or
a function that can be queried to obtain the "type" of the region if
necessary.

A type on a region can be justified if Regions were to be considered
separately from the enclosing entity (operation or function) and had their own
semantics that should be checked.

#### Attaching attributes to regions

Regions could be annotated with dialect attributes to use attribute
verification hooks. An operation could take multiple regions as arguments, and
each of them may require different attributes. However, there are currently
very few practical cases where this would be necessary. Instead, one could
simulate per-region attributes with array attributes attached to the entity
containing the region (operation or function). This decreases the overall
complexity of the IR and enables more concise and op-specific forms, e.g.,
when all regions of an op have the same attribute that can be only mentioned
once. Since the semantics of the region is entirely defined by the enclosing
entity, it also makes sense to have attributes attached to that entity rather
than to the region itself.

This can be reconsidered in the future if we see a non-negligible amount of
use cases.

### Read/Write/May_Read/May_Write sets for External Functions

Having read, write, may_read, and may_write sets for external functions
(including opaque ones, high-performance vendor libraries such as CuDNN, CuB,
MKL, and FFT libraries, user-provided/optimized functions, or data movement
runtimes such as DMA ones) is a powerful feature. It allows the compiler to
perform analysis and composition/transformation in the presence of such calls
and with loops around such calls on sub-tensors. For user-provided or custom
hand-tuned functions, the read/write/may_read/may_write sets could be provided
a priori by a user as part of the external function signature or they could be
part of a database.

TODO: Design this, and update to use function attribute syntax.

Example:

```mlir
##rel9 ( ) [s0] -> (r0, r1) : 0 <= r0 <= 1023, 0 <= r1 <= s0 - 1

func.func @cblas_reduce_ffi(%M: memref<1024 x ? x f32, #layout_map0, /*mem=*/0>)
    -> f32 [
  reads: {%M, ##rel9() }
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */
]

func.func @dma_mem_to_scratchpad(%a : memref<1024 x f32, #layout_map0, /*mem=*/0>,
    %b : memref<1024 x f32, #layout_map0, 1>, %c : memref<1024 x f32,
    #layout_map0>) [
  reads: {%a, ##rel9() }
  writes: /* empty */
  may_reads: /* empty */
  may_writes: /* empty */
]

```

### Memref Extensions

1.  Arbitrary polyhedral shapes for tensors: e.g., triangular shapes in tensor
    dimensions where there is symmetry: use integer set (affine constraints)
    to model tensor data space (instead of just extents). Requires some
    changes to the IR and the in-memory form.
1.  Layout maps

    1.  Allow piece-wise affine maps for layouts: allows clean modeling of
        boundary cases for images/tensors through padding, wrapping,
        mirroring, padding where padded values are the results of computation
        as opposed to data, padding in the interior as opposed to just
        boundaries.
    1.  Allow many-to-one layout maps: Index and layout maps in the current
        proposal are bijective. Extending them to many-to-one layout maps
        allows cleaner(?) modeling of broadcast/reduce style computations
        while reusing memory.

Proposal 2(a) requires non-trivial changes to the IR and the in-memory
representation. 2(b) requires no change, but impacts how cost models look at
index and layout maps.

### `affine.if` and `affine.for` Extensions for "Escaping Scalars"

We considered providing a representation for SSA values that are live out of
`if/else` conditional bodies and loop carried in `affine.for` loops. We
ultimately abandoned this approach due to its complexity. In the current
design of MLIR, scalar variables cannot escape for loops or if instructions.
In situations where escaping is necessary, we use zero-dimensional tensors and
memrefs instead of scalars.

**TODO**: This whole section is obsolete and should be updated to use block
arguments and a yield like terminator in for/if instructions.

The abandoned design of supporting escaping scalars is as follows:

#### affine.for Instruction

Syntax:

```
[<out-var-list> =]
for %<index-variable-name> = <lower-bound> ... <upper-bound> step <step>
  [with <in-var-list>] { <loop-instruction-list> }
```

out-var-list is a comma separated list of SSA values defined in the loop body
and used outside the loop body. in-var-list is a comma separated list of SSA
values used inside the loop body and their initializers.
loop-instruction-list is a list of instructions that may also include a yield
instruction.

Example:

```mlir
// Return sum of elements in 1-dimensional memref A
func.func @sum(%A : memref<?xi32>, %N : i32) -> (i32) {
  %init = 0
  %result = affine.for %i = 0 to %N with %tmp(%init) {
    %value = affine.load %A[%i]
    %sum = %value + %tmp
    yield %sum
  }
  return %result : i32
}
```

#### affine.if/else Instruction

Syntax:

```
<out-var-list> = affine.if (<cond-list>) {...} [else {...}]
```

out-var-list is a list of SSA values defined by the if-instruction. The values
are arguments to the yield-instruction that occurs in both then and else
clauses when the else clause is present. When the if instruction contains only
the then clause, the escaping value defined there should be merged with the
value the variable had before the if instruction. The design captured here
does not handle this situation.

Example:

```mlir
// Compute sum of half of the array
func.func @sum_half(%A : memref<?xi32>, %N : i32) -> (i32) {
  %s0 = 0
  %s1 = affine.for %i = 1 ... %N step 1 with %s2 (%s0) {
    %s3 = affine.if (%i >= %N / 2) {
      %v0 = affine.load %A[%i]
      %s4 = %s2 + %v0
      yield %s4
    }
    yield %s3
  }
  return %s1 : i32
}
```

### Multithreading the compiler

People want compilers to go fast, and one simple way to do that is to
multi-thread them. There are multiple strategies for this, but a simple one is
to optimize and compile separate functions in parallel. LLVM's original pass
manager anticipated this demand, and the CallGraphSCCPass manager is even
designed to support this as well, but unfortunately, a few early design
decisions in LLVM prevent this from ever happening. Instead, things like
ThinLTO are forced to split programs into separate LLVM modules/contexts and
optimize those chunks independently.

The problem is that LLVM has several objects in its IR that are globally
uniqued and also mutable: notably constants like `i32 0`. In LLVM, these
constants are `Value`s, which allows them to be used as operands to
instructions and means they also have SSA use lists. Because these things are
uniqued, every `i32 0` in any function shares a use list. This means that
optimizing multiple functions in parallel won't work (at least without some
sort of synchronization on the use lists, which would be unbearably
inefficient).

MLIR now supports a multithreaded pass manager. We do this through several
design choices:

1.  MLIR makes use of extensive uniqued immutable data structures (affine
    expressions, types, etc are all immutable, uniqued, and immortal).
2.  Constants are defined in per-operation pools, instead of being globally
    uniqued.
3.  Functions, and other global-like operations, themselves are not SSA values
    either, so they don't have the same problem as constants.
4.  Passes are copied (through their copy ctor) into one instance per thread,
    avoiding sharing of local state across threads.

This allows MLIR passes to support efficient multithreaded compilation and
code generation.