AI Models SDK
This section introduces the BOS AI Models SDK and its end-to-end workflow, illustrated in the accompanying diagram. It outlines how models move from high-level frameworks through a transparent, hardware-aware compilation pipeline into efficient execution on BOS NPUs. The workflow is supported by a rich set of tools for validation, optimization, and performance analysis. Together, these components provide developers with full visibility and control, enabling efficient deployment, fine-grained tuning, and deep insight into both model behavior and hardware execution.
AI Models Workflow
AI Model Compiler
BOS delivers a modern, end-to-end AI model compiler built on top of Tenstorrent’s software stack, designed to seamlessly bridge framework-level models and efficient NPU execution. It is a transparent, debuggable, and hardware-aware AI compiler stack that gives developers full visibility and control, from model ingestion to final execution, while remaining flexible across frameworks and deployment targets.
BOS proposes a three-stage compilation workflow, detailed in the following sections.
Debugging
While the BOS SDK emphasizes pre-compilation preparation, ttnn-standalone enables debugging and validation after compilation.
ttnn-standalone is a runtime-level tool that allows developers to execute compiled TTNN models independently of the full runtime stack. It is primarily used to:
- Run compiled models and validate correctness against reference outputs
- Inspect execution behavior and experiment with runtime configurations
- Perform post-compilation tuning and debugging of model execution
- Isolate issues between compilation and hardware/runtime execution
This makes it particularly useful for identifying discrepancies that only appear after lowering and compilation, complementing IR-level debugging tools such as TT-Explorer.
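The first use case above — validating a compiled model against reference outputs — boils down to an element-wise tolerance check. The sketch below illustrates that check in plain Python; the function names and tolerance value are illustrative, not part of the ttnn-standalone API.

```python
# Illustrative only: a minimal post-compilation validation check.
# Compares a compiled model's output against a golden reference
# within an absolute tolerance. Names and tolerances are hypothetical.

def max_abs_error(reference, actual):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(r - a) for r, a in zip(reference, actual))

def validate(reference, actual, atol=1e-2):
    """Pass/fail check of device output against a golden reference."""
    if len(reference) != len(actual):
        return False
    return max_abs_error(reference, actual) <= atol

golden = [0.5, 1.25, -2.0]
device_out = [0.5009, 1.2494, -1.9991]
print(validate(golden, device_out))  # True (small numeric drift within tolerance)
```

A failing check at this stage, with matching IR-level results in TT-Explorer, points the investigation toward lowering or runtime execution rather than the model itself.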
Runtime
TT-NN runtime
With TT-NN, users can:
- Build and run AI models using a PyTorch-like API
- Execute neural network operations without managing low-level hardware details
TT-Metalium runtime
With TT-Metalium, users can:
- Develop custom kernels and integrate them into model execution
- Control data movement, memory layout, and execution scheduling
- Optimize performance by tuning parallelism, tiling, and compute patterns
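Tiling, mentioned in the last bullet, has concrete arithmetic behind it: Tensix compute operates on 32x32 tiles, so tensors are padded up to tile boundaries before execution. The sketch below shows that bookkeeping; the shapes are made-up examples, not a real kernel configuration.

```python
# Illustrative tiling arithmetic for 32x32 Tensix tiles.
# Shapes below are examples only, not a real kernel configuration.

import math

TILE = 32  # Tensix tile dimension

def tiles_needed(rows, cols, tile=TILE):
    """Number of tile x tile tiles covering a rows x cols tensor (with padding)."""
    return math.ceil(rows / tile) * math.ceil(cols / tile)

def padded_shape(rows, cols, tile=TILE):
    """Tensor shape after padding each dimension up to a tile multiple."""
    return (math.ceil(rows / tile) * tile, math.ceil(cols / tile) * tile)

print(tiles_needed(70, 100))   # 3 row-tiles * 4 col-tiles = 12 tiles
print(padded_shape(70, 100))   # (96, 128)
```

Padding waste like the 70→96 jump above is exactly the kind of overhead that tuning tiling and memory layout in TT-Metalium aims to reduce.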
TT-LLK (Low-Level Kernels)
With TT-LLK, users can:
- Access bare-metal compute primitives on Tensix cores
- Implement custom operations using:
  - data unpacking
  - compute (math kernels)
  - data packing
- Write highly optimized kernels with fine-grained hardware control
- Maximize performance by directly managing compute and data flow
- Extend or customize the foundation of the runtime and compiler stack
- Develop and validate kernels for different Tenstorrent architectures
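The unpack → compute → pack structure listed above can be sketched conceptually in plain Python. Real LLKs run on dedicated Tensix hardware threads; the functions below are an illustrative model of the three phases, not TT-LLK API.

```python
# Conceptual model of the three-phase LLK kernel structure
# (unpack -> math -> pack). Names are illustrative, not TT-LLK API.

def unpack(raw_tiles):
    """Unpack: stage raw operand data into working 'registers'."""
    return [list(tile) for tile in raw_tiles]

def math_eltwise_add(a_regs, b_regs):
    """Compute: element-wise add over the unpacked tiles."""
    return [[x + y for x, y in zip(a, b)] for a, b in zip(a_regs, b_regs)]

def pack(result_regs):
    """Pack: write results back out in the destination layout."""
    return [tuple(tile) for tile in result_regs]

a = [(1, 2), (3, 4)]
b = [(10, 20), (30, 40)]
out = pack(math_eltwise_add(unpack(a), unpack(b)))
print(out)  # [(11, 22), (33, 44)]
```

On hardware, the payoff of this split is pipelining: unpack, math, and pack can overlap across tiles, which is where the fine-grained control over compute and data flow comes in.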
Performance Profilers
Tracy profiler
With Tracy, users can:
- Profile host-side execution (C++ and Python in TT-Metal runtime)
- Visualize timeline of model execution and system activity
- Identify performance bottlenecks and hotspots
- Measure latency and throughput of model operations
- Analyze kernel execution behavior across Tensix cores
- Inspect data movement and scheduling interactions
- Correlate CPU-side activity with accelerator execution
- Optimize end-to-end performance across host + device
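For intuition about what host-side profiling measures, the stand-in below times a function with the standard library. This is not Tracy — Tracy instruments the TT-Metal runtime automatically and adds timeline visualization — but it shows the latency measurement the bullets above refer to.

```python
# Not Tracy itself: a minimal stand-in for host-side latency measurement,
# using only the standard library. fake_op is a hypothetical workload.

import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

def fake_op(n):
    """Stand-in for a model operation: sum of squares below n."""
    return sum(i * i for i in range(n))

result, elapsed = timed(fake_op, 10_000)
print(f"fake_op took {elapsed * 1e3:.3f} ms")
```

A profiler like Tracy performs this kind of measurement across every instrumented zone and correlates it with device activity, rather than requiring manual wrapping.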
Memory visualizer
With the Memory Visualizer, users can:
- Inspect SRAM, DRAM, and circular buffer usage over time
- Identify peak memory consumption and bottlenecks
- Analyze tensor allocation and buffer usage per operation
- Understand how tensors are sharded and distributed across cores
- Visualize data movement and operation sequencing
- Explore per-tensor details interactively
- Optimize memory layout and buffer reuse strategies
- Improve overall memory efficiency and model performance
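The sharding behavior surfaced by the visualizer follows simple arithmetic: a tensor dimension is split as evenly as possible across a set of cores. The sketch below height-shards a row count over a core count; the numbers are made-up examples, not values read from the tool.

```python
# Illustrative arithmetic behind height-sharding a tensor's rows across
# cores. Core counts and shapes are hypothetical examples.

def height_shard(rows, num_cores):
    """Rows assigned to each core when height-sharding.
    Earlier cores take ceil(rows / num_cores); the last core may get fewer."""
    per_core = -(-rows // num_cores)  # ceiling division
    shards = []
    assigned = 0
    for _ in range(num_cores):
        take = min(per_core, rows - assigned)
        shards.append(take)
        assigned += take
    return shards

print(height_shard(100, 8))  # [13, 13, 13, 13, 13, 13, 13, 9]
```

Uneven tails like the final shard of 9 rows show up in the visualizer as imbalanced per-core buffer usage, which is one signal for adjusting the shard spec.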
NoC visualizer
With the NoC Visualizer, users can:
- Analyze network-on-chip (NoC) traffic patterns
- Track data movement between cores and memory
- Identify bandwidth bottlenecks and congestion points
- Understand inter-core communication behavior
- Optimize data routing and communication efficiency
- Correlate NoC activity with model execution phases
- Improve performance of distributed compute and data movement
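As a simplified model of the traffic the visualizer exposes, the sketch below computes the path between two cores on a 2D mesh with dimension-ordered (X-then-Y) routing. This is an illustrative abstraction, not the actual BOS/Tenstorrent NoC topology or routing policy.

```python
# Simplified 2D-mesh NoC model with X-then-Y (dimension-ordered) routing.
# Coordinates and routing policy are illustrative assumptions.

def xy_route(src, dst):
    """Return the list of (x, y) coordinates visited from src to dst,
    moving along X first, then Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

route = xy_route((0, 0), (3, 2))
print(len(route) - 1)  # 5 hops: 3 along X, then 2 along Y
```

Per-link hop counts like this, aggregated over all transfers, are what make congestion points visible: links shared by many routes show proportionally higher traffic.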
:::