Kernel Development

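The following examples walk through Metalium kernel development step by step, from a basic DRAM loopback to an optimized multi-core matrix multiplication.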

Basic Usage

Step 1: DRAM Loopback

Learn the basic structure of a Metalium application by implementing a DRAM loopback that copies data from one DRAM buffer to another.

What you’ll learn:

  • Basic host and kernel structure
  • Buffer management
  • Data transfer
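
To make the loopback data flow concrete, here is a minimal host-only sketch in plain C++. It does not use the Metalium API: the "DRAM" buffers are ordinary vectors, and the function name `dram_loopback`, the `tile_elems` parameter, and the single-tile staging buffer are all assumptions made for illustration. It assumes the copy is staged one tile at a time through a small on-core buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: "DRAM" buffers are plain vectors, and the staging
// buffer (standing in for on-core memory) holds exactly one tile.
std::vector<uint32_t> dram_loopback(const std::vector<uint32_t>& src_dram,
                                    std::size_t tile_elems) {
    assert(tile_elems > 0);
    assert(src_dram.size() % tile_elems == 0);  // whole tiles only

    std::vector<uint32_t> dst_dram(src_dram.size());
    std::vector<uint32_t> staging(tile_elems);

    const std::size_t num_tiles = src_dram.size() / tile_elems;
    for (std::size_t t = 0; t < num_tiles; ++t) {
        const std::size_t offset = t * tile_elems;
        // "Read": source DRAM -> staging buffer.
        std::copy_n(src_dram.begin() + offset, tile_elems, staging.begin());
        // "Write": staging buffer -> destination DRAM.
        std::copy_n(staging.begin(), tile_elems, dst_dram.begin() + offset);
    }
    return dst_dram;
}
```

Calling `dram_loopback(src, 4)` on a buffer whose length is a multiple of 4 returns an element-for-element copy of `src`. In a real application the read and write would be issued by a kernel on the device rather than by a host loop.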

Step 2: Eltwise Binary Kernel

Build on the loopback example by implementing an Eltwise binary kernel that performs addition on two buffers. This introduces computation using the matrix engine (FPU) and data passing within a Tensix core.

What you’ll learn:

  • Circular buffers for data passing
  • Compute kernels
  • Using the matrix engine (FPU)
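
The role of circular buffers as queues between a reader, a compute stage, and a writer can be sketched in plain C++ as follows. This is a single-threaded host model, not Metalium code: the `CircularBuffer` class, its two-tile capacity, and the `eltwise_add` function are hypothetical names chosen for this example, and the three stages are interleaved in one loop instead of running concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

using Tile = std::vector<float>;

// Hypothetical fixed-capacity FIFO of tiles standing in for a circular buffer.
class CircularBuffer {
public:
    explicit CircularBuffer(std::size_t capacity) : capacity_(capacity) {}
    bool full() const { return tiles_.size() == capacity_; }
    bool empty() const { return tiles_.empty(); }
    void push(Tile t) {
        assert(!full());
        tiles_.push_back(std::move(t));
    }
    Tile pop() {
        assert(!empty());
        Tile t = std::move(tiles_.front());
        tiles_.pop_front();
        return t;
    }

private:
    std::size_t capacity_;
    std::deque<Tile> tiles_;
};

// Adds two tile streams element-wise, passing tiles between stages
// through circular buffers.
std::vector<Tile> eltwise_add(const std::vector<Tile>& a, const std::vector<Tile>& b) {
    assert(a.size() == b.size());
    CircularBuffer cb_in0(2), cb_in1(2), cb_out(2);  // capacity 2 is an assumption
    std::vector<Tile> result;
    std::size_t next_read = 0;

    while (result.size() < a.size()) {
        // Reader stage: push one tile per input when both buffers have room.
        if (next_read < a.size() && !cb_in0.full() && !cb_in1.full()) {
            cb_in0.push(a[next_read]);
            cb_in1.push(b[next_read]);
            ++next_read;
        }
        // Compute stage: pop one tile from each input, add, push the sum.
        if (!cb_in0.empty() && !cb_in1.empty() && !cb_out.full()) {
            Tile x = cb_in0.pop();
            Tile y = cb_in1.pop();
            assert(x.size() == y.size());
            for (std::size_t i = 0; i < x.size(); ++i) x[i] += y[i];
            cb_out.push(std::move(x));
        }
        // Writer stage: drain one finished tile to the output.
        if (!cb_out.empty()) result.push_back(cb_out.pop());
    }
    return result;
}
```

The bounded capacity is the point of the design: a producer can only run ahead of its consumer by as many tiles as the buffer holds, which keeps on-core memory use fixed regardless of the total input size.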

Step 3: Eltwise SFPU

Extend the previous example to implement an Eltwise SFPU kernel that performs element-wise addition using the SFPU (Special Function Processing Unit, the vector engine). This introduces the SFPU and how to use it for vectorized operations.

What you’ll learn:

  • Vectorized operations using the SFPU
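
As a rough illustration of what "vectorized" means here, the sketch below adds two arrays one fixed-width group of lanes at a time. It is plain C++ running on the host, not SFPU code: the lane count `kLanes = 8`, the `Vec` type, and the function names are hypothetical and do not reflect the actual hardware width or API.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 8;  // hypothetical lane count, not the real SFPU width
using Vec = std::array<float, kLanes>;

// One "vector instruction": every lane performs the same add.
Vec vec_add(const Vec& a, const Vec& b) {
    Vec r{};
    for (std::size_t i = 0; i < kLanes; ++i) r[i] = a[i] + b[i];
    return r;
}

// Element-wise addition processed kLanes elements at a time.
std::vector<float> eltwise_add_vectorized(const std::vector<float>& a,
                                          const std::vector<float>& b) {
    assert(a.size() == b.size());
    assert(a.size() % kLanes == 0);  // whole vectors only, for simplicity

    std::vector<float> out(a.size());
    for (std::size_t base = 0; base < a.size(); base += kLanes) {
        Vec va{}, vb{};
        std::copy_n(a.begin() + base, kLanes, va.begin());  // load
        std::copy_n(b.begin() + base, kLanes, vb.begin());  // load
        const Vec vr = vec_add(va, vb);                     // compute
        std::copy_n(vr.begin(), kLanes, out.begin() + base);  // store
    }
    return out;
}
```

The load, compute, store shape of the loop body is the part that carries over; the inner per-lane loop in `vec_add` is what a vector engine would execute as a single operation.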

Intermediate Usage

Step 4: Single-core Matrix Multiplication

Implement a matrix multiplication on a single Tensix core using the matrix engine.

What you’ll learn:

  • Complex dataflow
  • Tiled operations
  • Matrix engine utilization
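
The idea behind tiled operations can be sketched with a blocked matrix multiplication in plain C++. This is a host-side model under stated assumptions, not the Metalium implementation: matrices are row-major `float` vectors, the `tile` parameter is a free choice rather than the hardware tile size, and `tiled_matmul` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C[M x N] = A[M x K] * B[K x N] for row-major matrices,
// one (tile x tile) block at a time.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t M, std::size_t K, std::size_t N,
                                std::size_t tile) {
    assert(tile > 0 && M % tile == 0 && K % tile == 0 && N % tile == 0);
    assert(A.size() == M * K && B.size() == K * N);

    std::vector<float> C(M * N, 0.0f);
    for (std::size_t mt = 0; mt < M; mt += tile) {          // output tile row
        for (std::size_t nt = 0; nt < N; nt += tile) {      // output tile column
            for (std::size_t kt = 0; kt < K; kt += tile) {  // inner (reduction) tiles
                // Multiply one A tile by one B tile and accumulate into the C tile.
                for (std::size_t i = 0; i < tile; ++i) {
                    for (std::size_t j = 0; j < tile; ++j) {
                        float acc = 0.0f;
                        for (std::size_t k = 0; k < tile; ++k) {
                            acc += A[(mt + i) * K + (kt + k)] * B[(kt + k) * N + (nt + j)];
                        }
                        C[(mt + i) * N + (nt + j)] += acc;
                    }
                }
            }
        }
    }
    return C;
}
```

Each output tile is produced by accumulating over the tiles along the inner dimension, so only one A tile, one B tile, and one C tile need to be resident at any moment. That access pattern is what makes the dataflow for this step more involved than in the element-wise examples.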

Advanced Usage

Step 5: Multi-core Matrix Multiplication

Extend the single-core matrix multiplication to a multi-core implementation by distributing computation across multiple Tensix cores.

What you’ll learn:

  • Parallel processing
  • Workload distribution across cores
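
One simple way to distribute work is to give each core a contiguous range of output tiles, with the remainder spread over the first few cores. The sketch below shows that arithmetic in plain C++; `WorkRange` and `split_work` are hypothetical names for this example, and the even contiguous split is an assumption, not necessarily the scheme the Metalium example uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Contiguous range of output tiles assigned to one core.
struct WorkRange {
    std::size_t start;
    std::size_t count;
};

// Splits num_tiles across num_cores as evenly as possible. The first
// (num_tiles % num_cores) cores each receive one extra tile.
std::vector<WorkRange> split_work(std::size_t num_tiles, std::size_t num_cores) {
    assert(num_cores > 0);
    const std::size_t base = num_tiles / num_cores;
    const std::size_t extra = num_tiles % num_cores;

    std::vector<WorkRange> ranges;
    ranges.reserve(num_cores);
    std::size_t start = 0;
    for (std::size_t core = 0; core < num_cores; ++core) {
        const std::size_t count = base + (core < extra ? 1 : 0);
        ranges.push_back({start, count});
        start += count;
    }
    return ranges;
}
```

For example, `split_work(10, 4)` assigns 3, 3, 2, and 2 tiles starting at offsets 0, 3, 6, and 8. No core differs from another by more than one tile, which keeps the slowest core from dominating the total runtime.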

Step 6: Optimized Multi-core Matrix Multiplication

Optimize the multi-core implementation by leveraging the processor grid, reducing redundant DRAM access, and minimizing NoC congestion through data sharing.

What you’ll learn:

  • Performance optimization techniques
  • Efficient data movement and reuse
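
To see why data sharing reduces DRAM traffic, consider a simplified counting model. It assumes one output tile per core, that without sharing each core fetches every input tile it needs directly from DRAM, and that with ideal sharing each input tile is read from DRAM exactly once and then forwarded between cores. The names `DramReads` and `count_dram_tile_reads` are hypothetical, and the model ignores NoC cost entirely.

```cpp
#include <cassert>
#include <cstddef>

// DRAM tile reads under the two simplified schemes.
struct DramReads {
    std::size_t naive;   // every core reads its own inputs from DRAM
    std::size_t shared;  // each input tile is read from DRAM once, then shared
};

// Mt, Kt, Nt are the matrix dimensions measured in tiles:
// A is Mt x Kt tiles, B is Kt x Nt tiles, C is Mt x Nt tiles.
DramReads count_dram_tile_reads(std::size_t Mt, std::size_t Kt, std::size_t Nt) {
    DramReads reads{};
    // Each of the Mt * Nt output tiles needs Kt tiles of A and Kt tiles of B.
    reads.naive = Mt * Nt * (2 * Kt);
    // With ideal sharing, A and B are each read from DRAM exactly once.
    reads.shared = Mt * Kt + Kt * Nt;
    return reads;
}
```

With `Mt = 4`, `Kt = 8`, `Nt = 4`, the model gives 256 tile reads without sharing and 64 with it. Real gains depend on how the sharing itself is carried out over the NoC, which this count does not capture.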

Reference: https://docs.tenstorrent.com/tt-metal/latest/tt-metalium/get_started/get_started.html