Kernel Development
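The following examples introduce kernel development step by step, from a simple DRAM copy to an optimized multi-core matrix multiplication.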
Basic Usage
Step 1: DRAM Loopback
Learn the basic structure of a Metalium application by implementing a DRAM loopback that copies data from one DRAM buffer to another.
What you’ll learn:
- Basic host and kernel structure
- Buffer management
- Data transfer
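The data flow of a loopback can be sketched on the host in plain C++. This is only a conceptual model, not TT-Metalium code: the vectors stand in for DRAM buffers, and the function name `loopback_copy`, the `words_per_tile` parameter, and the idea of staging each chunk through a small on-core buffer are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model: plain vectors stand in for the two DRAM
// buffers and for a small on-core staging buffer. No TT-Metalium calls.
std::vector<std::uint32_t> loopback_copy(const std::vector<std::uint32_t>& src_dram,
                                         std::size_t words_per_tile) {
    assert(words_per_tile > 0);
    std::vector<std::uint32_t> dst_dram(src_dram.size());
    std::vector<std::uint32_t> staging(words_per_tile);  // assumed staging area

    for (std::size_t offset = 0; offset < src_dram.size(); offset += words_per_tile) {
        // The last chunk may be shorter than a full tile.
        const std::size_t count = std::min(words_per_tile, src_dram.size() - offset);
        const auto pos = static_cast<std::ptrdiff_t>(offset);
        // "Read" step: source DRAM -> staging buffer.
        std::copy_n(src_dram.begin() + pos, count, staging.begin());
        // "Write" step: staging buffer -> destination DRAM.
        std::copy_n(staging.begin(), count, dst_dram.begin() + pos);
    }
    return dst_dram;
}
```

Calling `loopback_copy(data, 1024)` returns a copy of `data`. The point of the sketch is the two-step read-then-write pattern per chunk, which the real example expresses with a host program and a device kernel.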
Step 2: Eltwise Binary Kernel
Build on the loopback example by implementing an Eltwise binary kernel that performs addition on two buffers. This introduces computation using the matrix engine (FPU) and data passing within a Tensix core.
What you’ll learn:
- Circular buffers for data passing
- Compute kernels
- Using the matrix engine (FPU)
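The producer and consumer relationship behind circular buffers can be modeled in plain C++. This is a conceptual sketch under stated assumptions, not the TT-Metalium circular buffer API: `TileFifo`, `eltwise_add`, and the four-element `Tile` are hypothetical names and sizes chosen to keep the example small.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

constexpr std::size_t kTileElems = 4;  // tiny illustrative tile size (assumption)
using Tile = std::array<float, kTileElems>;

// Hypothetical bounded FIFO standing in for a circular buffer.
class TileFifo {
public:
    explicit TileFifo(std::size_t capacity) : capacity_(capacity) {}
    bool can_push() const { return tiles_.size() < capacity_; }
    bool can_pop() const { return !tiles_.empty(); }
    void push(const Tile& t) { assert(can_push()); tiles_.push_back(t); }
    Tile pop() {
        assert(can_pop());
        Tile t = tiles_.front();
        tiles_.pop_front();
        return t;
    }

private:
    std::size_t capacity_;
    std::deque<Tile> tiles_;
};

// Single-threaded simulation: a "reader" fills two FIFOs, a "compute" step
// drains one tile from each and adds them element by element.
std::vector<Tile> eltwise_add(const std::vector<Tile>& a, const std::vector<Tile>& b,
                              std::size_t fifo_capacity) {
    assert(a.size() == b.size() && fifo_capacity > 0);
    TileFifo in0(fifo_capacity), in1(fifo_capacity);
    std::vector<Tile> out;
    std::size_t produced = 0;
    while (out.size() < a.size()) {
        // Producer: push input tiles while both FIFOs have free space.
        while (produced < a.size() && in0.can_push() && in1.can_push()) {
            in0.push(a[produced]);
            in1.push(b[produced]);
            ++produced;
        }
        // Consumer: take one tile from each input and add them.
        if (in0.can_pop() && in1.can_pop()) {
            const Tile x = in0.pop();
            const Tile y = in1.pop();
            Tile sum{};
            for (std::size_t i = 0; i < kTileElems; ++i) sum[i] = x[i] + y[i];
            out.push_back(sum);
        }
    }
    return out;
}
```

The bounded capacity is what matters here: the producer can only run ahead of the consumer by `fifo_capacity` tiles, which is how a fixed amount of on-core memory can stream an arbitrarily long input.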
Step 3: Eltwise SFPU
Extend the previous example to implement an Eltwise SFPU kernel that performs element-wise addition using the SFPU (Special Function Processing Unit, the vector engine). This introduces the SFPU and how to use it for vectorized operations.
What you’ll learn:
- Vectorized operations using the SFPU
Intermediate Usage
Step 4: Single-core Matrix Multiplication
Implement a matrix multiplication on a single Tensix core using the matrix engine.
What you’ll learn:
- Complex dataflow
- Tiled operations
- Matrix engine utilization
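A tiled matrix multiplication can be sketched in plain C++ to show the loop structure. This is a host-side illustration only, not a TT-Metalium kernel: the function name `tiled_matmul`, the row-major layout, and the requirement that every dimension is a multiple of the tile edge are assumptions made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C[M x N] = A[M x K] * B[K x N] on row-major data, visiting the
// matrices one square tile at a time. `tile` is the tile edge length.
std::vector<float> tiled_matmul(const std::vector<float>& A, const std::vector<float>& B,
                                std::size_t M, std::size_t K, std::size_t N,
                                std::size_t tile) {
    assert(tile > 0 && M % tile == 0 && K % tile == 0 && N % tile == 0);
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);

    for (std::size_t mt = 0; mt < M; mt += tile) {          // output tile row
        for (std::size_t nt = 0; nt < N; nt += tile) {      // output tile column
            for (std::size_t kt = 0; kt < K; kt += tile) {  // inner (reduction) tiles
                // Multiply one tile of A by one tile of B and accumulate
                // the partial product into the current output tile.
                for (std::size_t i = mt; i < mt + tile; ++i) {
                    for (std::size_t j = nt; j < nt + tile; ++j) {
                        float acc = C[i * N + j];
                        for (std::size_t k = kt; k < kt + tile; ++k) {
                            acc += A[i * K + k] * B[k * N + j];
                        }
                        C[i * N + j] = acc;
                    }
                }
            }
        }
    }
    return C;
}
```

The three outer loops are the dataflow part (which tiles are needed and in what order), and the three inner loops are the per-tile computation. On the device, that split roughly corresponds to data movement versus compute work.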
Advanced Usage
Step 5: Multi-core Matrix Multiplication
Extend the single-core matrix multiplication to a multi-core implementation by distributing the computation across multiple Tensix cores.
What you’ll learn:
- Parallel processing
- Workload distribution across cores
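One common way to distribute work is to give each core a contiguous range of output tiles, with the remainder spread over the first few cores. The sketch below shows that arithmetic in plain C++. It is an assumption about how the split could be done, not the library's own helper, and `TileRange` and `split_tiles` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range of output tiles [start, start + count) owned by one core.
struct TileRange {
    std::size_t start;
    std::size_t count;
};

// Splits `num_tiles` output tiles across `num_cores` cores as evenly as
// possible. The first (num_tiles % num_cores) cores get one extra tile.
std::vector<TileRange> split_tiles(std::size_t num_tiles, std::size_t num_cores) {
    assert(num_cores > 0);
    const std::size_t base = num_tiles / num_cores;
    const std::size_t extra = num_tiles % num_cores;

    std::vector<TileRange> ranges;
    ranges.reserve(num_cores);
    std::size_t start = 0;
    for (std::size_t core = 0; core < num_cores; ++core) {
        const std::size_t count = base + (core < extra ? 1 : 0);
        ranges.push_back({start, count});
        start += count;
    }
    return ranges;
}
```

For example, `split_tiles(10, 4)` assigns 3, 3, 2, and 2 tiles to the four cores, starting at tiles 0, 3, 6, and 8. Each core can then run the same kernel with its own start and count as arguments.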
Step 6: Optimized Multi-core Matrix Multiplication
Optimize the multi-core implementation by leveraging the processor grid, reducing redundant DRAM access, and minimizing NoC congestion through data sharing.
What you’ll learn:
- Performance optimization techniques
- Efficient data movement and reuse
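A rough count of DRAM tile reads shows why data sharing helps. The sketch below is a back-of-the-envelope model, not a measurement and not TT-Metalium code. It assumes one core per output tile on an Mt x Nt grid, and it assumes that with sharing each input tile is fetched from DRAM once and then forwarded to the other cores that need it. `DramReads` and `count_tile_reads` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// DRAM tile reads for C = A * B, where A is Mt x Kt tiles and B is Kt x Nt tiles.
struct DramReads {
    std::size_t without_sharing;
    std::size_t with_sharing;
};

DramReads count_tile_reads(std::size_t Mt, std::size_t Kt, std::size_t Nt) {
    DramReads r{};
    // Without sharing: each of the Mt * Nt cores fetches its own Kt tiles
    // of A and Kt tiles of B from DRAM.
    r.without_sharing = Mt * Nt * 2 * Kt;
    // With sharing (assumed model): every tile of A and every tile of B is
    // read from DRAM exactly once and passed between cores afterwards.
    r.with_sharing = Mt * Kt + Kt * Nt;
    return r;
}
```

Under these assumptions, `count_tile_reads(4, 8, 4)` gives 256 reads without sharing and 64 with sharing. The forwarded tiles still travel over the NoC, so the saving is in DRAM traffic, not in total data movement.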
Reference: https://docs.tenstorrent.com/tt-metal/latest/tt-metalium/get_started/get_started.html