AI Models SDK
This section introduces the BOS AI Models SDK and its end-to-end workflow, illustrated in the accompanying diagram. It outlines how models move from high-level frameworks through a transparent, hardware-aware compilation pipeline into efficient execution on BOS NPUs. The workflow is supported by a rich set of tools for validation, optimization, and performance analysis. Together, these components provide developers with full visibility and control, enabling efficient deployment, fine-grained tuning, and deep insight into both model behavior and hardware execution.
AI Models Workflow
AI Model Compiler
BOS delivers a modern, end-to-end AI model compiler built on top of Tenstorrent’s software stack, designed to seamlessly bridge framework-level models and efficient NPU execution. It is a transparent, debuggable, and hardware-aware AI compiler stack that gives developers full visibility and control, from model ingestion to final execution, while remaining flexible across frameworks and deployment targets.
BOS proposes a three-stage compilation workflow, detailed in the following sections.
Debugging
While the BOS SDK emphasizes pre-compilation preparation, ttnn-standalone enables debugging and validation after compilation.
ttnn-standalone is a runtime-level tool that allows developers to execute compiled TTNN models independently of the full runtime stack. It is primarily used to:
- Run compiled models and validate correctness against reference outputs
- Inspect execution behavior and experiment with runtime configurations
- Perform post-compilation tuning and debugging of model execution
- Isolate issues between compilation and hardware/runtime execution
This makes it particularly useful for identifying discrepancies that only appear after lowering and compilation, complementing IR-level debugging tools such as TT-Explorer.
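The first use case above — validating a compiled model against reference outputs — boils down to an element-wise tolerance check. The sketch below illustrates that check in plain Python; the function names and tolerance value are illustrative, not part of the ttnn-standalone API.

```python
# Illustrative only: a minimal post-compilation validation check.
# Compares a compiled model's output against a golden reference
# within an absolute tolerance. Names and tolerances are hypothetical.

def max_abs_error(reference, actual):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(r - a) for r, a in zip(reference, actual))

def validate(reference, actual, atol=1e-2):
    """Pass/fail check of device output against a golden reference."""
    if len(reference) != len(actual):
        return False
    return max_abs_error(reference, actual) <= atol

golden = [0.5, 1.25, -2.0]
device_out = [0.5009, 1.2494, -1.9991]
print(validate(golden, device_out))  # True (small numeric drift within tolerance)
```

A failing check at this stage, with matching IR-level results in TT-Explorer, points the investigation toward lowering or runtime execution rather than the model itself.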
Runtime
TT-NN runtime
With TT-NN, users can:
- Build and run AI models using a PyTorch-like API
- Execute neural network operations without managing low-level hardware details
TT-Metalium runtime
With TT-Metalium, users can:
- Develop custom kernels and integrate them into model execution
- Control data movement, memory layout, and execution scheduling
- Optimize performance by tuning parallelism, tiling, and compute patterns
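Tiling, mentioned in the last bullet, has concrete arithmetic behind it: Tensix compute operates on 32x32 tiles, so tensors are padded up to tile boundaries before execution. The sketch below shows that bookkeeping; the shapes are made-up examples, not a real kernel configuration.

```python
# Illustrative tiling arithmetic for 32x32 Tensix tiles.
# Shapes below are examples only, not a real kernel configuration.

import math

TILE = 32  # Tensix tile dimension

def tiles_needed(rows, cols, tile=TILE):
    """Number of tile x tile tiles covering a rows x cols tensor (with padding)."""
    return math.ceil(rows / tile) * math.ceil(cols / tile)

def padded_shape(rows, cols, tile=TILE):
    """Tensor shape after padding each dimension up to a tile multiple."""
    return (math.ceil(rows / tile) * tile, math.ceil(cols / tile) * tile)

print(tiles_needed(70, 100))   # 3 row-tiles * 4 col-tiles = 12 tiles
print(padded_shape(70, 100))   # (96, 128)
```

Padding waste like the 70→96 jump above is exactly the kind of overhead that tuning tiling and memory layout in TT-Metalium aims to reduce.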
TT-LLK (Low-Level Kernels)
With TT-LLK, users can:
- Access bare-metal compute primitives on Tensix cores
- Implement custom operations using:
  - data unpacking
  - compute (math kernels)
  - data packing
- Write highly optimized kernels with fine-grained hardware control
- Maximize performance by directly managing compute and data flow
- Extend or customize the foundation of the runtime and compiler stack
- Develop and validate kernels for different Tenstorrent architectures
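The unpack → compute → pack structure listed above can be sketched conceptually in plain Python. Real LLKs run on dedicated Tensix hardware threads; the functions below are an illustrative model of the three phases, not TT-LLK API.

```python
# Conceptual model of the three-phase LLK kernel structure
# (unpack -> math -> pack). Names are illustrative, not TT-LLK API.

def unpack(raw_tiles):
    """Unpack: stage raw operand data into working 'registers'."""
    return [list(tile) for tile in raw_tiles]

def math_eltwise_add(a_regs, b_regs):
    """Compute: element-wise add over the unpacked tiles."""
    return [[x + y for x, y in zip(a, b)] for a, b in zip(a_regs, b_regs)]

def pack(result_regs):
    """Pack: write results back out in the destination layout."""
    return [tuple(tile) for tile in result_regs]

a = [(1, 2), (3, 4)]
b = [(10, 20), (30, 40)]
out = pack(math_eltwise_add(unpack(a), unpack(b)))
print(out)  # [(11, 22), (33, 44)]
```

On hardware, the payoff of this split is pipelining: unpack, math, and pack can overlap across tiles, which is where the fine-grained control over compute and data flow comes in.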
Performance Profilers
Tracy profiler
With Tracy, users can:
- Profile host-side execution (C++ and Python in TT-Metal runtime)
- Visualize timeline of model execution and system activity
- Identify performance bottlenecks and hotspots
- Measure latency and throughput of model operations
- Analyze kernel execution behavior across Tensix cores
- Inspect data movement and scheduling interactions
- Correlate CPU-side activity with accelerator execution
- Optimize end-to-end performance across host + device
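For intuition about what host-side profiling measures, the stand-in below times a function with the standard library. This is not Tracy — Tracy instruments the TT-Metal runtime automatically and adds timeline visualization — but it shows the latency measurement the bullets above refer to.

```python
# Not Tracy itself: a minimal stand-in for host-side latency measurement,
# using only the standard library. fake_op is a hypothetical workload.

import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

def fake_op(n):
    """Stand-in for a model operation: sum of squares below n."""
    return sum(i * i for i in range(n))

result, elapsed = timed(fake_op, 10_000)
print(f"fake_op took {elapsed * 1e3:.3f} ms")
```

A profiler like Tracy performs this kind of measurement across every instrumented zone and correlates it with device activity, rather than requiring manual wrapping.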
Memory visualizer
With the Memory Visualizer, users can:
- Inspect SRAM, DRAM, and circular buffer usage over time
- Identify peak memory consumption and bottlenecks
- Analyze tensor allocation and buffer usage per operation
- Understand how tensors are sharded and distributed across cores
- Visualize data movement and operation sequencing
- Explore per-tensor details interactively
- Optimize memory layout and buffer reuse strategies
- Improve overall memory efficiency and model performance
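The sharding behavior surfaced by the visualizer follows simple arithmetic: a tensor dimension is split as evenly as possible across a set of cores. The sketch below height-shards a row count over a core count; the numbers are made-up examples, not values read from the tool.

```python
# Illustrative arithmetic behind height-sharding a tensor's rows across
# cores. Core counts and shapes are hypothetical examples.

def height_shard(rows, num_cores):
    """Rows assigned to each core when height-sharding.
    Earlier cores take ceil(rows / num_cores); the last core may get fewer."""
    per_core = -(-rows // num_cores)  # ceiling division
    shards = []
    assigned = 0
    for _ in range(num_cores):
        take = min(per_core, rows - assigned)
        shards.append(take)
        assigned += take
    return shards

print(height_shard(100, 8))  # [13, 13, 13, 13, 13, 13, 13, 9]
```

Uneven tails like the final shard of 9 rows show up in the visualizer as imbalanced per-core buffer usage, which is one signal for adjusting the shard spec.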
NoC visualizer
With the NoC Visualizer, users can:
- Analyze network-on-chip (NoC) traffic patterns
- Track data movement between cores and memory
- Identify bandwidth bottlenecks and congestion points
- Understand inter-core communication behavior
- Optimize data routing and communication efficiency
- Correlate NoC activity with model execution phases
- Improve performance of distributed compute and data movement
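As a simplified model of the traffic the visualizer exposes, the sketch below computes the path between two cores on a 2D mesh with dimension-ordered (X-then-Y) routing. This is an illustrative abstraction, not the actual BOS/Tenstorrent NoC topology or routing policy.

```python
# Simplified 2D-mesh NoC model with X-then-Y (dimension-ordered) routing.
# Coordinates and routing policy are illustrative assumptions.

def xy_route(src, dst):
    """Return the list of (x, y) coordinates visited from src to dst,
    moving along X first, then Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

route = xy_route((0, 0), (3, 2))
print(len(route) - 1)  # 5 hops: 3 along X, then 2 along Y
```

Per-link hop counts like this, aggregated over all transfers, are what make congestion points visible: links shared by many routes show proportionally higher traffic.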
:::