| # Chapter 5: Partial Lowering to Lower-Level Dialects for Optimization |
| |
| [TOC] |
| |
| At this point, we are eager to generate actual code and see our Toy language |
| take life. We will use LLVM to generate code, but just showing the LLVM builder |
| interface here wouldn't be very exciting. Instead, we will show how to perform |
| progressive lowering through a mix of dialects coexisting in the same function. |
| |
| To make it more interesting, in this chapter we will consider that we want to |
| reuse existing optimizations implemented in a dialect optimizing affine |
| transformations: `Affine`. This dialect is tailored to the computation-heavy |
| part of the program and is limited: it doesn't support representing our |
`toy.print` builtin, for instance, nor should it! Instead, we can target
| `Affine` for the computation heavy part of Toy, and in the |
| [next chapter](Ch-6.md) directly target the `LLVM IR` dialect for lowering |
| `print`. As part of this lowering, we will be lowering from the |
| [TensorType](../../Dialects/Builtin.md/#rankedtensortype) that `Toy` operates on |
| to the [MemRefType](../../Dialects/Builtin.md/#memreftype) that is indexed via |
| an affine loop-nest. Tensors represent an abstract value-typed sequence of data, |
| meaning that they don't live in any memory. MemRefs, on the other hand, |
| represent lower level buffer access, as they are concrete references to a region |
| of memory. |
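To make the distinction concrete, here is a rough, hand-written illustration of the two type worlds (the shapes are arbitrary examples):

```mlir
// A tensor is a value: it has no associated storage, and each operation
// produces a fresh immutable result.
%t = toy.transpose(%arg : tensor<2x3xf64>) to tensor<3x2xf64>

// A memref is a reference to a region of memory: it must be allocated, and
// its contents are read and written through explicit loads and stores.
%m = memref.alloc() : memref<3x2xf64>
%v = affine.load %m[0, 1] : memref<3x2xf64>
```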
| |
| # Dialect Conversions |
| |
| MLIR has many different dialects, so it is important to have a unified framework |
| for [converting](../../../getting_started/Glossary.md/#conversion) between them. |
| This is where the `DialectConversion` framework comes into play. This framework |
| allows for transforming a set of *illegal* operations to a set of *legal* ones. |
| To use this framework, we need to provide two things (and an optional third): |
| |
| * A [Conversion Target](../../DialectConversion.md/#conversion-target) |
| |
| - This is the formal specification of what operations or dialects are |
| legal for the conversion. Operations that aren't legal will require |
| rewrite patterns to perform |
| [legalization](../../../getting_started/Glossary.md/#legalization). |
| |
| * A set of |
| [Rewrite Patterns](../../DialectConversion.md/#rewrite-pattern-specification) |
| |
| - This is the set of [patterns](../QuickstartRewrites.md) used to convert |
| *illegal* operations into a set of zero or more *legal* ones. |
| |
| * Optionally, a [Type Converter](../../DialectConversion.md/#type-conversion). |
| |
| - If provided, this is used to convert the types of block arguments. We |
| won't be needing this for our conversion. |
| |
| ## Conversion Target |
| |
| For our purposes, we want to convert the compute-intensive `Toy` operations into |
| a combination of operations from the `Affine`, `Arith`, `Func`, and `MemRef` dialects |
| for further optimization. To start off the lowering, we first define our |
| conversion target: |
| |
| ```c++ |
| void ToyToAffineLoweringPass::runOnOperation() { |
| // The first thing to define is the conversion target. This will define the |
| // final target for this lowering. |
| mlir::ConversionTarget target(getContext()); |
| |
| // We define the specific operations, or dialects, that are legal targets for |
| // this lowering. In our case, we are lowering to a combination of the |
| // `Affine`, `Arith`, `Func`, and `MemRef` dialects. |
| target.addLegalDialect<affine::AffineDialect, arith::ArithDialect, |
| func::FuncDialect, memref::MemRefDialect>(); |
| |
| // We also define the Toy dialect as Illegal so that the conversion will fail |
| // if any of these operations are *not* converted. Given that we actually want |
  // a partial lowering, we explicitly mark the Toy operations that we don't
  // want to lower, `toy.print`, as *legal*. `toy.print` will still need its
  // operands to be updated though (as we convert from TensorType to
  // MemRefType), so we only treat it as `legal` if its operands are legal.
| target.addIllegalDialect<ToyDialect>(); |
| target.addDynamicallyLegalOp<toy::PrintOp>([](toy::PrintOp op) { |
| return llvm::none_of(op->getOperandTypes(), |
                         [](Type type) { return llvm::isa<TensorType>(type); });
| }); |
| ... |
| } |
| ``` |
| |
Above, we first mark the Toy dialect as illegal, and then the print operation
as legal. We could have done this the other way around. Individual operations
| always take precedence over the (more generic) dialect definitions, so the order |
| doesn't matter. See `ConversionTarget::getOpInfo` for the details. |
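Concretely, under this target a `toy.print` is treated as follows (hypothetical snippets, assuming `%buf` and `%t` are defined elsewhere):

```mlir
// Dynamically legal: the operand has already been converted to a memref.
toy.print %buf : memref<3x2xf64>

// Not yet legal: the tensor-typed operand means the conversion must keep
// applying patterns (to the producer of %t) before this can succeed.
toy.print %t : tensor<3x2xf64>
```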
| |
| ## Conversion Patterns |
| |
| After the conversion target has been defined, we can define how to convert the |
| *illegal* operations into *legal* ones. Similarly to the canonicalization |
| framework introduced in [chapter 3](Ch-3.md), the |
| [`DialectConversion` framework](../../DialectConversion.md) also uses |
| [RewritePatterns](../QuickstartRewrites.md) to perform the conversion logic. |
These patterns may be the `RewritePatterns` seen before, or a new type of
pattern specific to the conversion framework: `ConversionPattern`. `ConversionPatterns`
| are different from traditional `RewritePatterns` in that they accept an |
| additional `operands` parameter containing operands that have been |
| remapped/replaced. This is used when dealing with type conversions, as the |
| pattern will want to operate on values of the new type but match against the |
| old. For our lowering, this invariant will be useful as it translates from the |
| [TensorType](../../Dialects/Builtin.md/#rankedtensortype) currently being |
| operated on to the [MemRefType](../../Dialects/Builtin.md/#memreftype). Let's |
| look at a snippet of lowering the `toy.transpose` operation: |
| |
| ```c++ |
| /// Lower the `toy.transpose` operation to an affine loop nest. |
| struct TransposeOpLowering : public mlir::ConversionPattern { |
| TransposeOpLowering(mlir::MLIRContext *ctx) |
| : mlir::ConversionPattern(TransposeOp::getOperationName(), 1, ctx) {} |
| |
| /// Match and rewrite the given `toy.transpose` operation, with the given |
| /// operands that have been remapped from `tensor<...>` to `memref<...>`. |
| llvm::LogicalResult |
| matchAndRewrite(mlir::Operation *op, ArrayRef<mlir::Value> operands, |
| mlir::ConversionPatternRewriter &rewriter) const final { |
| auto loc = op->getLoc(); |
| |
| // Call to a helper function that will lower the current operation to a set |
| // of affine loops. We provide a functor that operates on the remapped |
    // operands, as well as the loop induction variables for the innermost
    // loop body.
| lowerOpToLoops( |
| op, operands, rewriter, |
| [loc](mlir::PatternRewriter &rewriter, |
| ArrayRef<mlir::Value> memRefOperands, |
| ArrayRef<mlir::Value> loopIvs) { |
| // Generate an adaptor for the remapped operands of the TransposeOp. |
| // This allows for using the nice named accessors that are generated |
| // by the ODS. This adaptor is automatically provided by the ODS |
| // framework. |
| TransposeOpAdaptor transposeAdaptor(memRefOperands); |
          mlir::Value input = transposeAdaptor.getInput();
| |
| // Transpose the elements by generating a load from the reverse |
| // indices. |
| SmallVector<mlir::Value, 2> reverseIvs(llvm::reverse(loopIvs)); |
          return rewriter.create<mlir::affine::AffineLoadOp>(loc, input,
                                                             reverseIvs);
| }); |
| return success(); |
| } |
| }; |
| ``` |
| |
| Now we can prepare the list of patterns to use during the lowering process: |
| |
| ```c++ |
| void ToyToAffineLoweringPass::runOnOperation() { |
| ... |
| |
| // Now that the conversion target has been defined, we just need to provide |
| // the set of patterns that will lower the Toy operations. |
| mlir::RewritePatternSet patterns(&getContext()); |
| patterns.add<..., TransposeOpLowering>(&getContext()); |
| |
| ... |
| ``` |
| |
| ## Partial Lowering |
| |
| Once the patterns have been defined, we can perform the actual lowering. The |
| `DialectConversion` framework provides several different modes of lowering, but, |
| for our purposes, we will perform a partial lowering, as we will not convert |
| `toy.print` at this time. |
| |
| ```c++ |
| void ToyToAffineLoweringPass::runOnOperation() { |
| ... |
| |
| // With the target and rewrite patterns defined, we can now attempt the |
| // conversion. The conversion will signal failure if any of our *illegal* |
| // operations were not converted successfully. |
  if (mlir::failed(mlir::applyPartialConversion(getOperation(), target,
                                                std::move(patterns))))
    signalPassFailure();
| } |
| ``` |
| |
| ### Design Considerations With Partial Lowering |
| |
| Before diving into the result of our lowering, this is a good time to discuss |
| potential design considerations when it comes to partial lowering. In our |
| lowering, we transform from a value-type, TensorType, to an allocated |
| (buffer-like) type, MemRefType. However, given that we do not lower the |
| `toy.print` operation, we need to temporarily bridge these two worlds. There are |
| many ways to go about this, each with their own tradeoffs: |
| |
| * Generate `load` operations from the buffer |
| |
| One option is to generate `load` operations from the buffer type to |
| materialize an instance of the value type. This allows for the definition of |
| the `toy.print` operation to remain unchanged. The downside to this approach |
| is that the optimizations on the `affine` dialect are limited, because the |
| `load` will actually involve a full copy that is only visible *after* our |
| optimizations have been performed. |
| |
| * Generate a new version of `toy.print` that operates on the lowered type |
| |
| Another option would be to have another, lowered, variant of `toy.print` |
| that operates on the lowered type. The benefit of this option is that there |
| is no hidden, unnecessary copy to the optimizer. The downside is that |
| another operation definition is needed that may duplicate many aspects of |
| the first. Defining a base class in [ODS](../../DefiningDialects/Operations.md) may |
| simplify this, but you still need to treat these operations separately. |
| |
| * Update `toy.print` to allow for operating on the lowered type |
| |
| A third option is to update the current definition of `toy.print` to allow |
    for operating on the lowered type. The benefit of this approach is that
| it is simple, does not introduce an additional hidden copy, and does not |
| require another operation definition. The downside to this option is that it |
| requires mixing abstraction levels in the `Toy` dialect. |
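For contrast, the first option would bridge the two worlds with a materialization such as the following, where `toy.from_memref` is a hypothetical op (written in MLIR's generic form); the full copy it implies is invisible to the affine optimizations:

```mlir
%0 = memref.alloc() : memref<3x2xf64>
// ... affine loop nests compute the result into %0 ...
%tensor = "toy.from_memref"(%0) : (memref<3x2xf64>) -> tensor<3x2xf64>
toy.print %tensor : tensor<3x2xf64>
```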
| |
| For the sake of simplicity, we will use the third option for this lowering. This |
| involves updating the type constraints on the PrintOp in the operation |
| definition file: |
| |
| ```tablegen |
| def PrintOp : Toy_Op<"print"> { |
| ... |
| |
| // The print operation takes an input tensor to print. |
| // We also allow a F64MemRef to enable interop during partial lowering. |
| let arguments = (ins AnyTypeOf<[F64Tensor, F64MemRef]>:$input); |
| } |
| ``` |
| |
| ## Complete Toy Example |
| |
| Let's take a concrete example: |
| |
| ```mlir |
| toy.func @main() { |
| %0 = toy.constant dense<[[1.000000e+00, 2.000000e+00, 3.000000e+00], [4.000000e+00, 5.000000e+00, 6.000000e+00]]> : tensor<2x3xf64> |
| %2 = toy.transpose(%0 : tensor<2x3xf64>) to tensor<3x2xf64> |
| %3 = toy.mul %2, %2 : tensor<3x2xf64> |
| toy.print %3 : tensor<3x2xf64> |
| toy.return |
| } |
| ``` |
| |
| With affine lowering added to our pipeline, we can now generate: |
| |
| ```mlir |
| func.func @main() { |
| %cst = arith.constant 1.000000e+00 : f64 |
| %cst_0 = arith.constant 2.000000e+00 : f64 |
| %cst_1 = arith.constant 3.000000e+00 : f64 |
| %cst_2 = arith.constant 4.000000e+00 : f64 |
| %cst_3 = arith.constant 5.000000e+00 : f64 |
| %cst_4 = arith.constant 6.000000e+00 : f64 |
| |
| // Allocating buffers for the inputs and outputs. |
| %0 = memref.alloc() : memref<3x2xf64> |
| %1 = memref.alloc() : memref<3x2xf64> |
| %2 = memref.alloc() : memref<2x3xf64> |
| |
| // Initialize the input buffer with the constant values. |
| affine.store %cst, %2[0, 0] : memref<2x3xf64> |
| affine.store %cst_0, %2[0, 1] : memref<2x3xf64> |
| affine.store %cst_1, %2[0, 2] : memref<2x3xf64> |
| affine.store %cst_2, %2[1, 0] : memref<2x3xf64> |
| affine.store %cst_3, %2[1, 1] : memref<2x3xf64> |
| affine.store %cst_4, %2[1, 2] : memref<2x3xf64> |
| |
| // Load the transpose value from the input buffer and store it into the |
| // next input buffer. |
| affine.for %arg0 = 0 to 3 { |
| affine.for %arg1 = 0 to 2 { |
| %3 = affine.load %2[%arg1, %arg0] : memref<2x3xf64> |
| affine.store %3, %1[%arg0, %arg1] : memref<3x2xf64> |
| } |
| } |
| |
| // Multiply and store into the output buffer. |
| affine.for %arg0 = 0 to 3 { |
| affine.for %arg1 = 0 to 2 { |
| %3 = affine.load %1[%arg0, %arg1] : memref<3x2xf64> |
| %4 = affine.load %1[%arg0, %arg1] : memref<3x2xf64> |
| %5 = arith.mulf %3, %4 : f64 |
| affine.store %5, %0[%arg0, %arg1] : memref<3x2xf64> |
| } |
| } |
| |
| // Print the value held by the buffer. |
| toy.print %0 : memref<3x2xf64> |
| memref.dealloc %2 : memref<2x3xf64> |
| memref.dealloc %1 : memref<3x2xf64> |
| memref.dealloc %0 : memref<3x2xf64> |
| return |
| } |
| ``` |
| |
| ## Taking Advantage of Affine Optimization |
| |
Our naive lowering is correct, but it leaves a lot to be desired with regard to
| efficiency. For example, the lowering of `toy.mul` has generated some redundant |
| loads. Let's look at how adding a few existing optimizations to the pipeline can |
| help clean this up. Adding the `LoopFusion` and `AffineScalarReplacement` passes |
| to the pipeline gives the following result: |
| |
| ```mlir |
| func.func @main() { |
| %cst = arith.constant 1.000000e+00 : f64 |
| %cst_0 = arith.constant 2.000000e+00 : f64 |
| %cst_1 = arith.constant 3.000000e+00 : f64 |
| %cst_2 = arith.constant 4.000000e+00 : f64 |
| %cst_3 = arith.constant 5.000000e+00 : f64 |
| %cst_4 = arith.constant 6.000000e+00 : f64 |
| |
| // Allocating buffers for the inputs and outputs. |
| %0 = memref.alloc() : memref<3x2xf64> |
| %1 = memref.alloc() : memref<2x3xf64> |
| |
| // Initialize the input buffer with the constant values. |
| affine.store %cst, %1[0, 0] : memref<2x3xf64> |
| affine.store %cst_0, %1[0, 1] : memref<2x3xf64> |
| affine.store %cst_1, %1[0, 2] : memref<2x3xf64> |
| affine.store %cst_2, %1[1, 0] : memref<2x3xf64> |
| affine.store %cst_3, %1[1, 1] : memref<2x3xf64> |
| affine.store %cst_4, %1[1, 2] : memref<2x3xf64> |
| |
| affine.for %arg0 = 0 to 3 { |
| affine.for %arg1 = 0 to 2 { |
| // Load the transpose value from the input buffer. |
| %2 = affine.load %1[%arg1, %arg0] : memref<2x3xf64> |
| |
| // Multiply and store into the output buffer. |
| %3 = arith.mulf %2, %2 : f64 |
| affine.store %3, %0[%arg0, %arg1] : memref<3x2xf64> |
| } |
| } |
| |
| // Print the value held by the buffer. |
| toy.print %0 : memref<3x2xf64> |
| memref.dealloc %1 : memref<2x3xf64> |
| memref.dealloc %0 : memref<3x2xf64> |
| return |
| } |
| ``` |
| |
| Here, we can see that a redundant allocation was removed, the two loop nests |
were fused, and some unnecessary `load`s were removed. You can build `toyc-ch5`
and try it yourself: `toyc-ch5 test/Examples/Toy/Ch5/affine-lowering.mlir
-emit=mlir-affine`. We can also check our optimizations by adding `-opt`.
| |
| In this chapter we explored some aspects of partial lowering, with the intent to |
| optimize. In the [next chapter](Ch-6.md) we will continue the discussion about |
| dialect conversion by targeting LLVM for code generation. |