R. Yadav et al., “Composing Distributed Computations Through Task and Kernel Fusion,” 2024. DOI: 10.48550/arxiv.2406.18109
N. Vasilache et al., “Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction,” 2022. DOI: 10.48550/arxiv.2202.03293
Polyhedral (affine) level (see the affine-dialect sketch below)
loop tiling
interchange
parallelization
A. George, “Portable Sparse Polyhedral Framework Code Generation Using Multi Level Intermediate Representation.” DOI: 10.18122/td.2081.boisestate
S. Thangamani, “Optimized Code Generation for Parallel and Polyhedral Loop Nests using MLIR.”
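A minimal sketch of the kind of loop nest these affine-level transformations operate on: a naive matmul written against the MLIR affine dialect. The function name and the 1024 sizes are made up for illustration; upstream `mlir-opt` passes such as `-affine-loop-tile` and `-affine-parallelize` restructure nests of this form.

```mlir
// Naive matmul as an affine loop nest (illustrative sizes).
// Passes like -affine-loop-tile="tile-size=32" and -affine-parallelize
// tile and parallelize this nest without changing its semantics.
func.func @matmul(%A: memref<1024x1024xf32>, %B: memref<1024x1024xf32>,
                  %C: memref<1024x1024xf32>) {
  affine.for %i = 0 to 1024 {
    affine.for %j = 0 to 1024 {
      affine.for %k = 0 to 1024 {
        %a = affine.load %A[%i, %k] : memref<1024x1024xf32>
        %b = affine.load %B[%k, %j] : memref<1024x1024xf32>
        %c = affine.load %C[%i, %j] : memref<1024x1024xf32>
        %p = arith.mulf %a, %b : f32
        %s = arith.addf %c, %p : f32
        affine.store %s, %C[%i, %j] : memref<1024x1024xf32>
      }
    }
  }
  return
}
```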
hardware-specific
memory coalescing
register allocation
instruction scheduling
T. Gysi et al., “Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation,” ACM Transactions on Architecture and Code Optimization, 2021. DOI: 10.1145/3469030
N. Katel et al., “MLIR-based code generation for GPU tensor cores,” in International Conference on Compiler Construction, 2022. DOI: 10.1145/3497776.3517770
Current status
CPU-GPU
GPU Codegen (simplifying the GPU programming model; see the kernel sketch below)
thread/parallelism model
MLIR gpu dialect
tensor core utilization
N. Katel et al., “High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results,” arXiv preprint (cs.DC), 2021.
N. Katel et al., “MLIR-based code generation for GPU tensor cores,” in International Conference on Compiler Construction, 2022. DOI: 10.1145/3497776.3517770
memory management
T. Gysi et al., “Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation,” ACM Transactions on Architecture and Code Optimization, 2021. DOI: 10.1145/3469030
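A minimal sketch of the gpu-dialect programming model referenced above: block/thread/dimension ids are first-class ops and kernels live in a `gpu.module`. The kernel name, buffer size, and body are illustrative only, not taken from the cited papers.

```mlir
// SAXPY expressed with the MLIR gpu dialect (illustrative size).
module attributes {gpu.container_module} {
  gpu.module @kernels {
    gpu.func @saxpy(%a: f32, %x: memref<4096xf32>, %y: memref<4096xf32>) kernel {
      // Flatten (block, thread) coordinates into a linear index,
      // mirroring blockIdx.x * blockDim.x + threadIdx.x in CUDA.
      %bid  = gpu.block_id x
      %bdim = gpu.block_dim x
      %tid  = gpu.thread_id x
      %base = arith.muli %bid, %bdim : index
      %i    = arith.addi %base, %tid : index
      %xi = memref.load %x[%i] : memref<4096xf32>
      %yi = memref.load %y[%i] : memref<4096xf32>
      %ax = arith.mulf %a, %xi : f32
      %r  = arith.addf %ax, %yi : f32
      memref.store %r, %y[%i] : memref<4096xf32>
      gpu.return
    }
  }
}
```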
CPU-GPU Coordination (heterogeneous computing; see the data-movement sketch below)
MLIR-based SYCL
memory management is a major concern:
minimizing data transfer overhead
maximizing memory bandwidth utilization
ensuring correct synchronization
W. Moses et al., “High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs,” in ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2023. DOI: 10.1145/3572848.3577475
W. Moses et al., “High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs,” 2022. DOI: 10.48550/arxiv.2207.00257
unified memory & managed memory
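A minimal sketch, at the gpu-dialect level, of the two coordination styles noted above: explicit host-device copies versus registering host memory for direct device access (the unified/managed-memory path). The op names are upstream MLIR; the function name and buffer size are made up.

```mlir
func.func @coordinate(%host: memref<4096xf32>) {
  // Style 1: explicit device allocation plus host<->device copies.
  %dev = gpu.alloc () : memref<4096xf32>
  gpu.memcpy %dev, %host : memref<4096xf32>, memref<4096xf32>
  // ... launch kernels that read/write %dev ...
  gpu.memcpy %host, %dev : memref<4096xf32>, memref<4096xf32>
  gpu.dealloc %dev : memref<4096xf32>

  // Style 2: register the host buffer so kernels can access it directly
  // (maps to unified/managed or pinned memory on supporting runtimes).
  %any = memref.cast %host : memref<4096xf32> to memref<*xf32>
  gpu.host_register %any : memref<*xf32>
  return
}
```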
Performance Portability (cross-platform / cross-architecture; write once, run anywhere)
codegen for specific platforms
programming model abstraction
SYCL
CUDA
OpenMP
transpilation
CUDA to CPU-threaded code via MLIR (see the scf.parallel sketch below)
W. Moses et al., “High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs,” in ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2023. DOI: 10.1145/3572848.3577475
W. Moses et al., “High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs,” 2022. DOI: 10.48550/arxiv.2207.00257
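A minimal sketch of the transpilation idea, not the cited tool's actual output: a 1-D CUDA-style grid becomes a nested parallel iteration space over (block, thread) indices expressed with `scf.parallel`, which can then be lowered to CPU threads (e.g. through the OpenMP dialect). The function name and copy body are made up.

```mlir
// CUDA-style grid re-expressed as a CPU-parallelizable loop nest.
func.func @grid_as_parallel_loops(%x: memref<?xf32>, %y: memref<?xf32>,
                                  %num_blocks: index, %block_size: index) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  scf.parallel (%b, %t) = (%c0, %c0) to (%num_blocks, %block_size)
      step (%c1, %c1) {
    // Recover the linear thread index the CUDA kernel would have used.
    %base = arith.muli %b, %block_size : index
    %i    = arith.addi %base, %t : index
    %v = memref.load %x[%i] : memref<?xf32>
    memref.store %v, %y[%i] : memref<?xf32>
  }
  return
}
```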
adaptive compilation
Infrastructure
Rewrite case studies
IREE
flang: Fortran compiler built on MLIR
domain-specific compilers
MLIR-Forge: B. Ates et al., “MLIR-Forge: A Modular Framework for Language Smiths.”
SODA-OPT: N. Agostini et al., “An MLIR-based Compiler Flow for System-Level Design and Hardware Acceleration,” in Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, 2022. DOI: 10.1145/3508352.3549424
R. Yadav et al., “Composing Distributed Computations Through Task and Kernel Fusion,” 2024. DOI: 10.48550/arxiv.2406.18109
MLIR-based Mojo
W. Godoy et al., “Mojo: MLIR-Based Performance-Portable HPC Science Kernels on GPUs for the Python Ecosystem,” 2025. DOI: 10.1145/3731599.3767573
MLIR optimizations for loop vectorization (FFT libraries; see the vectorization sketch below)
R. He et al., “Leveraging MLIR for Loop Vectorization and GPU Porting of FFT Libraries,” Lecture Notes in Computer Science, 2024. DOI: 10.1007/978-3-031-50684-0_16
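A minimal sketch of the kind of loop that MLIR-level vectorization targets, not taken from the cited FFT work: a unit-stride affine loop that upstream `mlir-opt` can vectorize with `-affine-super-vectorize`, turning scalar loads/stores into vector ones. The function name and sizes are illustrative.

```mlir
// Element-wise scale loop; -affine-super-vectorize="virtual-vector-size=8"
// rewrites the body to operate on vector<8xf32> values.
func.func @scale(%x: memref<4096xf32>, %s: f32) {
  affine.for %i = 0 to 4096 {
    %v = affine.load %x[%i] : memref<4096xf32>
    %r = arith.mulf %v, %s : f32
    affine.store %r, %x[%i] : memref<4096xf32>
  }
  return
}
```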