| # Chapter 5: Partial Lowering to Lower-Level Dialects for Optimization |
| |
| [TOC] |
| |
| At this point, we are eager to generate actual code and see our Toy language |
| take life. We will use LLVM to generate code, but just showing the LLVM builder |
| interface here wouldn't be very exciting. Instead, we will show how to perform |
| progressive lowering through a mix of dialects coexisting in the same function. |
| |
| To make it more interesting, in this chapter we will consider that we want to |
| reuse existing optimizations implemented in a dialect optimizing affine |
| transformations: `Affine`. This dialect is tailored to the computation-heavy |
| part of the program and is limited: it doesn't support representing our |
`toy.print` builtin, for instance, nor should it! Instead, we can target
| `Affine` for the computation heavy part of Toy, and in the |
| [next chapter](Ch-6.md) directly target the `LLVM IR` dialect for lowering |
| `print`. As part of this lowering, we will be lowering from the |
| [TensorType](../../Dialects/Builtin.md/#rankedtensortype) that `Toy` operates on |
| to the [MemRefType](../../Dialects/Builtin.md/#memreftype) that is indexed via |
| an affine loop-nest. Tensors represent an abstract value-typed sequence of data, |
| meaning that they don't live in any memory. MemRefs, on the other hand, |
| represent lower level buffer access, as they are concrete references to a region |
| of memory. |
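To make the distinction concrete, here is a rough, hand-written illustration of the two type worlds (the shapes are arbitrary examples):

```mlir
// A tensor is a value: it has no associated storage, and each operation
// produces a fresh immutable result.
%t = toy.transpose(%arg : tensor<2x3xf64>) to tensor<3x2xf64>

// A memref is a reference to a region of memory: it must be allocated, and
// its contents are read and written through explicit loads and stores.
%m = memref.alloc() : memref<3x2xf64>
%v = affine.load %m[0, 1] : memref<3x2xf64>
```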
| |
| # Dialect Conversions |
| |
| MLIR has many different dialects, so it is important to have a unified framework |
| for [converting](../../../getting_started/Glossary.md/#conversion) between them. |
| This is where the `DialectConversion` framework comes into play. This framework |
| allows for transforming a set of *illegal* operations to a set of *legal* ones. |
| To use this framework, we need to provide two things (and an optional third): |
| |
| * A [Conversion Target](../../DialectConversion.md/#conversion-target) |
| |
| - This is the formal specification of what operations or dialects are |
| legal for the conversion. Operations that aren't legal will require |
| rewrite patterns to perform |
| [legalization](../../../getting_started/Glossary.md/#legalization). |
| |
| * A set of |
| [Rewrite Patterns](../../DialectConversion.md/#rewrite-pattern-specification) |
| |
| - This is the set of [patterns](../QuickstartRewrites.md) used to convert |
| *illegal* operations into a set of zero or more *legal* ones. |
| |
| * Optionally, a [Type Converter](../../DialectConversion.md/#type-conversion). |
| |
| - If provided, this is used to convert the types of block arguments. We |
| won't be needing this for our conversion. |
| |
| ## Conversion Target |
| |
| For our purposes, we want to convert the compute-intensive `Toy` operations into |
| a combination of operations from the `Affine`, `Arith`, `Func`, and `MemRef` dialects |
| for further optimization. To start off the lowering, we first define our |
| conversion target: |
| |
| ```c++ |
| void ToyToAffineLoweringPass::runOnOperation() { |
| // The first thing to define is the conversion target. This will define the |
| // final target for this lowering. |
| mlir::ConversionTarget target(getContext()); |
| |
| // We define the specific operations, or dialects, that are legal targets for |
| // this lowering. In our case, we are lowering to a combination of the |
| // `Affine`, `Arith`, `Func`, and `MemRef` dialects. |
| target.addLegalDialect<affine::AffineDialect, arith::ArithDialect, |
| func::FuncDialect, memref::MemRefDialect>(); |
| |
| // We also define the Toy dialect as Illegal so that the conversion will fail |
| // if any of these operations are *not* converted. Given that we actually want |
  // a partial lowering, we explicitly mark the Toy operations that we don't
  // want to lower, `toy.print`, as *legal*. `toy.print` will still need its
  // operands to be updated though (as we convert from TensorType to
  // MemRefType), so we only treat it as `legal` if its operands are legal.
| target.addIllegalDialect<ToyDialect>(); |
| target.addDynamicallyLegalOp<toy::PrintOp>([](toy::PrintOp op) { |
| return llvm::none_of(op->getOperandTypes(), |
                         [](Type type) { return llvm::isa<TensorType>(type); });
| }); |
| ... |
| } |
| ``` |
| |
Above, we first mark the Toy dialect as illegal, and then the print operation
as legal. We could have done this the other way around. Individual operations
| always take precedence over the (more generic) dialect definitions, so the order |
| doesn't matter. See `ConversionTarget::getOpInfo` for the details. |
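Concretely, under this target a `toy.print` is treated as follows (hypothetical snippets, assuming `%buf` and `%t` are defined elsewhere):

```mlir
// Dynamically legal: the operand has already been converted to a memref.
toy.print %buf : memref<3x2xf64>

// Not yet legal: the tensor-typed operand means the conversion must keep
// applying patterns (to the producer of %t) before this can succeed.
toy.print %t : tensor<3x2xf64>
```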
| |
| ## Conversion Patterns |
| |
| After the conversion target has been defined, we can define how to convert the |
| *illegal* operations into *legal* ones. Similarly to the canonicalization |
| framework introduced in [chapter 3](Ch-3.md), the |
| [`DialectConversion` framework](../../DialectConversion.md) also uses |
| [RewritePatterns](../QuickstartRewrites.md) to perform the conversion logic. |
These patterns may be the `RewritePatterns` seen before, or a new type of
pattern specific to the conversion framework: `ConversionPattern`. `ConversionPatterns`
| are different from traditional `RewritePatterns` in that they accept an |
| additional `operands` parameter containing operands that have been |
| remapped/replaced. This is used when dealing with type conversions, as the |
| pattern will want to operate on values of the new type but match against the |
| old. For our lowering, this invariant will be useful as it translates from the |
| [TensorType](../../Dialects/Builtin.md/#rankedtensortype) currently being |
| operated on to the [MemRefType](../../Dialects/Builtin.md/#memreftype). Let's |
| look at a snippet of lowering the `toy.transpose` operation: |
| |
| ```c++ |
| /// Lower the `toy.transpose` operation to an affine loop nest. |
| struct TransposeOpLowering : public mlir::ConversionPattern { |
| TransposeOpLowering(mlir::MLIRContext *ctx) |
| : mlir::ConversionPattern(TransposeOp::getOperationName(), 1, ctx) {} |
| |
| /// Match and rewrite the given `toy.transpose` operation, with the given |
| /// operands that have been remapped from `tensor<...>` to `memref<...>`. |
| llvm::LogicalResult |
| matchAndRewrite(mlir::Operation *op, ArrayRef<mlir::Value> operands, |
| mlir::ConversionPatternRewriter &rewriter) const final { |
| auto loc = op->getLoc(); |
| |
| // Call to a helper function that will lower the current operation to a set |
| // of affine loops. We provide a functor that operates on the remapped |
    // operands, as well as the loop induction variables for the innermost
    // loop body.
| lowerOpToLoops( |
| op, operands, rewriter, |
| [loc](mlir::PatternRewriter &rewriter, |
| ArrayRef<mlir::Value> memRefOperands, |
| ArrayRef<mlir::Value> loopIvs) { |
| // Generate an adaptor for the remapped operands of the TransposeOp. |
| // This allows for using the nice named accessors that are generated |
| // by the ODS. This adaptor is automatically provided by the ODS |
| // framework. |
| TransposeOpAdaptor transposeAdaptor(memRefOperands); |
          mlir::Value input = transposeAdaptor.getInput();
| |
| // Transpose the elements by generating a load from the reverse |
| // indices. |
| SmallVector<mlir::Value, 2> reverseIvs(llvm::reverse(loopIvs)); |
          return rewriter.create<mlir::affine::AffineLoadOp>(loc, input,
                                                             reverseIvs);
| }); |
| return success(); |
| } |
| }; |
| ``` |
| |
| Now we can prepare the list of patterns to use during the lowering process: |
| |
| ```c++ |
| void ToyToAffineLoweringPass::runOnOperation() { |
| ... |
| |
| // Now that the conversion target has been defined, we just need to provide |
| // the set of patterns that will lower the Toy operations. |
| mlir::RewritePatternSet patterns(&getContext()); |
| patterns.add<..., TransposeOpLowering>(&getContext()); |
| |
| ... |
| ``` |
| |
| ## Partial Lowering |
| |
| Once the patterns have been defined, we can perform the actual lowering. The |
| `DialectConversion` framework provides several different modes of lowering, but, |
| for our purposes, we will perform a partial lowering, as we will not convert |
| `toy.print` at this time. |
| |
| ```c++ |
| void ToyToAffineLoweringPass::runOnOperation() { |
| ... |
| |
| // With the target and rewrite patterns defined, we can now attempt the |
| // conversion. The conversion will signal failure if any of our *illegal* |
| // operations were not converted successfully. |
  if (mlir::failed(mlir::applyPartialConversion(getOperation(), target,
                                                std::move(patterns))))
    signalPassFailure();
| } |
| ``` |
| |
| ### Design Considerations With Partial Lowering |
| |
| Before diving into the result of our lowering, this is a good time to discuss |
| potential design considerations when it comes to partial lowering. In our |
| lowering, we transform from a value-type, TensorType, to an allocated |
| (buffer-like) type, MemRefType. However, given that we do not lower the |
| `toy.print` operation, we need to temporarily bridge these two worlds. There are |
| many ways to go about this, each with their own tradeoffs: |
| |
| * Generate `load` operations from the buffer |
| |
| One option is to generate `load` operations from the buffer type to |
| materialize an instance of the value type. This allows for the definition of |
| the `toy.print` operation to remain unchanged. The downside to this approach |
| is that the optimizations on the `affine` dialect are limited, because the |
| `load` will actually involve a full copy that is only visible *after* our |
| optimizations have been performed. |
| |
| * Generate a new version of `toy.print` that operates on the lowered type |
| |
| Another option would be to have another, lowered, variant of `toy.print` |
| that operates on the lowered type. The benefit of this option is that there |
| is no hidden, unnecessary copy to the optimizer. The downside is that |
| another operation definition is needed that may duplicate many aspects of |
| the first. Defining a base class in [ODS](../../DefiningDialects/Operations.md) may |
| simplify this, but you still need to treat these operations separately. |
| |
| * Update `toy.print` to allow for operating on the lowered type |
| |
| A third option is to update the current definition of `toy.print` to allow |
    for operating on the lowered type. The benefit of this approach is that
| it is simple, does not introduce an additional hidden copy, and does not |
| require another operation definition. The downside to this option is that it |
| requires mixing abstraction levels in the `Toy` dialect. |
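For contrast, the first option would bridge the two worlds with a materialization such as the following, where `toy.from_memref` is a hypothetical op (written in MLIR's generic form); the full copy it implies is invisible to the affine optimizations:

```mlir
%0 = memref.alloc() : memref<3x2xf64>
// ... affine loop nests compute the result into %0 ...
%tensor = "toy.from_memref"(%0) : (memref<3x2xf64>) -> tensor<3x2xf64>
toy.print %tensor : tensor<3x2xf64>
```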
| |
| For the sake of simplicity, we will use the third option for this lowering. This |
| involves updating the type constraints on the PrintOp in the operation |
| definition file: |
| |
| ```tablegen |
| def PrintOp : Toy_Op<"print"> { |
| ... |
| |
| // The print operation takes an input tensor to print. |
| // We also allow a F64MemRef to enable interop during partial lowering. |
| let arguments = (ins AnyTypeOf<[F64Tensor, F64MemRef]>:$input); |
| } |
| ``` |
| |
| ## Complete Toy Example |
| |
| Let's take a concrete example: |
| |
| ```mlir |
| toy.func @main() { |
| %0 = toy.constant dense<[[1.000000e+00, 2.000000e+00, 3.000000e+00], [4.000000e+00, 5.000000e+00, 6.000000e+00]]> : tensor<2x3xf64> |
| %2 = toy.transpose(%0 : tensor<2x3xf64>) to tensor<3x2xf64> |
| %3 = toy.mul %2, %2 : tensor<3x2xf64> |
| toy.print %3 : tensor<3x2xf64> |
| toy.return |
| } |
| ``` |
| |
| With affine lowering added to our pipeline, we can now generate: |
| |
| ```mlir |
| func.func @main() { |
| %cst = arith.constant 1.000000e+00 : f64 |
| %cst_0 = arith.constant 2.000000e+00 : f64 |
| %cst_1 = arith.constant 3.000000e+00 : f64 |
| %cst_2 = arith.constant 4.000000e+00 : f64 |
| %cst_3 = arith.constant 5.000000e+00 : f64 |
| %cst_4 = arith.constant 6.000000e+00 : f64 |
| |
| // Allocating buffers for the inputs and outputs. |
| %0 = memref.alloc() : memref<3x2xf64> |
| %1 = memref.alloc() : memref<3x2xf64> |
| %2 = memref.alloc() : memref<2x3xf64> |
| |
| // Initialize the input buffer with the constant values. |
| affine.store %cst, %2[0, 0] : memref<2x3xf64> |
| affine.store %cst_0, %2[0, 1] : memref<2x3xf64> |
| affine.store %cst_1, %2[0, 2] : memref<2x3xf64> |
| affine.store %cst_2, %2[1, 0] : memref<2x3xf64> |
| affine.store %cst_3, %2[1, 1] : memref<2x3xf64> |
| affine.store %cst_4, %2[1, 2] : memref<2x3xf64> |
| |
| // Load the transpose value from the input buffer and store it into the |
| // next input buffer. |
| affine.for %arg0 = 0 to 3 { |
| affine.for %arg1 = 0 to 2 { |
| %3 = affine.load %2[%arg1, %arg0] : memref<2x3xf64> |
| affine.store %3, %1[%arg0, %arg1] : memref<3x2xf64> |
| } |
| } |
| |
| // Multiply and store into the output buffer. |
| affine.for %arg0 = 0 to 3 { |
| affine.for %arg1 = 0 to 2 { |
| %3 = affine.load %1[%arg0, %arg1] : memref<3x2xf64> |
| %4 = affine.load %1[%arg0, %arg1] : memref<3x2xf64> |
| %5 = arith.mulf %3, %4 : f64 |
| affine.store %5, %0[%arg0, %arg1] : memref<3x2xf64> |
| } |
| } |
| |
| // Print the value held by the buffer. |
| toy.print %0 : memref<3x2xf64> |
| memref.dealloc %2 : memref<2x3xf64> |
| memref.dealloc %1 : memref<3x2xf64> |
| memref.dealloc %0 : memref<3x2xf64> |
| return |
| } |
| ``` |
| |
| ## Taking Advantage of Affine Optimization |
| |
Our naive lowering is correct, but it leaves a lot to be desired with regard to
| efficiency. For example, the lowering of `toy.mul` has generated some redundant |
| loads. Let's look at how adding a few existing optimizations to the pipeline can |
| help clean this up. Adding the `LoopFusion` and `AffineScalarReplacement` passes |
| to the pipeline gives the following result: |
| |
| ```mlir |
| func.func @main() { |
| %cst = arith.constant 1.000000e+00 : f64 |
| %cst_0 = arith.constant 2.000000e+00 : f64 |
| %cst_1 = arith.constant 3.000000e+00 : f64 |
| %cst_2 = arith.constant 4.000000e+00 : f64 |
| %cst_3 = arith.constant 5.000000e+00 : f64 |
| %cst_4 = arith.constant 6.000000e+00 : f64 |
| |
| // Allocating buffers for the inputs and outputs. |
| %0 = memref.alloc() : memref<3x2xf64> |
| %1 = memref.alloc() : memref<2x3xf64> |
| |
| // Initialize the input buffer with the constant values. |
| affine.store %cst, %1[0, 0] : memref<2x3xf64> |
| affine.store %cst_0, %1[0, 1] : memref<2x3xf64> |
| affine.store %cst_1, %1[0, 2] : memref<2x3xf64> |
| affine.store %cst_2, %1[1, 0] : memref<2x3xf64> |
| affine.store %cst_3, %1[1, 1] : memref<2x3xf64> |
| affine.store %cst_4, %1[1, 2] : memref<2x3xf64> |
| |
| affine.for %arg0 = 0 to 3 { |
| affine.for %arg1 = 0 to 2 { |
| // Load the transpose value from the input buffer. |
| %2 = affine.load %1[%arg1, %arg0] : memref<2x3xf64> |
| |
| // Multiply and store into the output buffer. |
| %3 = arith.mulf %2, %2 : f64 |
| affine.store %3, %0[%arg0, %arg1] : memref<3x2xf64> |
| } |
| } |
| |
| // Print the value held by the buffer. |
| toy.print %0 : memref<3x2xf64> |
| memref.dealloc %1 : memref<2x3xf64> |
| memref.dealloc %0 : memref<3x2xf64> |
| return |
| } |
| ``` |
| |
| Here, we can see that a redundant allocation was removed, the two loop nests |
were fused, and some unnecessary `load`s were removed. You can build `toyc-ch5`
and try it yourself: `toyc-ch5 test/Examples/Toy/Ch5/affine-lowering.mlir
-emit=mlir-affine`. We can also check our optimizations by adding `-opt`.
| |
| In this chapter we explored some aspects of partial lowering, with the intent to |
| optimize. In the [next chapter](Ch-6.md) we will continue the discussion about |
| dialect conversion by targeting LLVM for code generation. |