Qwen2.5-VL
Complete guide for deploying the Qwen2.5-VL multimodal model (3B and 7B variants) on BOS Eagle-N hardware. This guide covers model setup, Hugging Face authentication, inference execution, batch processing, and performance profiling with Tracy.
Introduction
This codebase includes the Qwen2.5-VL family of models and currently supports the following model variants:
- Qwen2.5-VL-3B-Instruct: Qwen/Qwen2.5-VL-3B-Instruct
- Qwen2.5-VL-7B-Instruct: Qwen/Qwen2.5-VL-7B-Instruct
- Qwen2.5-VL-3B-Instruct-AWQ: Qwen/Qwen2.5-VL-3B-Instruct-AWQ
- Qwen2.5-VL-7B-Instruct-AWQ: Qwen/Qwen2.5-VL-7B-Instruct-AWQ
Set environment variables
# at $TT_METAL_HOME
source env_set.sh
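Some Hugging Face model repositories require authentication before the weights can be downloaded. A minimal sketch of the two standard options (the token value is a placeholder for your own access token):

# interactive login, stores the token for future runs
huggingface-cli login
# or export the token directly for non-interactive use
export HF_TOKEN=<your_token>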
How to Run
For a single-user example:
HF_MODEL=<model_name> pytest models/bos_model/qwen25_vl/demo/vision_demo.py -k 'accuracy and batch1-trace'
Notes:
- `<model_name>` is the Hugging Face model repo string, e.g. `Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-7B-Instruct`, `Qwen/Qwen2.5-VL-3B-Instruct-AWQ`, or `Qwen/Qwen2.5-VL-7B-Instruct-AWQ`.
- `-k` is the pytest filter; to run a specific test, use `-k <test_name>`. Additional test names are listed in `models/bos_model/qwen25_vl/demo/vision_demo.py`.
- `models/bos_model/qwen25_vl/demo/outputs` is the path to the directory containing dumped vision outputs.
- `--res` is an optional flag that specifies the input resolution for vision tests. It currently supports `128x128` and `224x224`, and defaults to `224x224` (see the example below).
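For example, a concrete single-user run of the 3B model at the lower resolution might look like this (the exact `--res` value format is an assumption based on the note above):

HF_MODEL=Qwen/Qwen2.5-VL-3B-Instruct pytest models/bos_model/qwen25_vl/demo/vision_demo.py -k 'accuracy and batch1-trace' --res 128x128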
For a batch-user example:
HF_MODEL=<model_name> pytest models/bos_model/qwen25_vl/demo/vision_demo.py -k 'accuracy and batch2-trace'
- Note: The current implementation supports a batch size of 2.
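For example, to run two users on the 7B model:

HF_MODEL=Qwen/Qwen2.5-VL-7B-Instruct pytest models/bos_model/qwen25_vl/demo/vision_demo.py -k 'accuracy and batch2-trace'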
To capture a Tracy report:
HF_MODEL=Qwen/Qwen2.5-VL-7B-Instruct python -m tracy -m -r -p -v "pytest models/bos_model/qwen25_vl/demo/vision_demo.py -k 'accuracy and profiler'"
Notes:
- The model name `Qwen/Qwen2.5-VL-7B-Instruct` can be changed to `Qwen/Qwen2.5-VL-3B-Instruct` if you want to record Tracy for the 3B model.
- `-k` is the pytest filter. `profiler` is a special test case for Tracy recording and `ttnn-visualizer`; use this test case for profiling.
- Profiling parameters: `res = [224, 224]`, `max_batch_size = 1`, `warmup_iters = 0`, `include_text_only_prompts = False`.
- The `accuracy` mode (BF16) can be changed to `performance` (BFP8-mixed).
- `generated/profiler/reports` is the path to the directory containing the Tracy report (see the snippet below). Refer to tt-perf-report to read the report.
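A quick way to locate the most recent report directory after a profiling run (plain shell, nothing model-specific):

# list report directories, newest first; the top entry is the latest Tracy run
ls -t generated/profiler/reports/ | head -1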
For a live chat demo example:
HF_MODEL=<model_name> python models/bos_model/qwen25_vl/demo_qwen25_vl.py -i <image_path>
Notes:
- Use `-i` to pass the input image path to the model, for example `models/bos_model/qwen25_vl/demo/images/dog.jpg` (full invocation shown below).
- Use `-c` to enable Qwen to remember context.
- Since image tokens are large, the context grows with each interaction. Over time, this can exceed memory limits, so for longer chats it is recommended to run Qwen without context.
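For example, a chat session on the 3B model using the bundled sample image, with context enabled:

HF_MODEL=Qwen/Qwen2.5-VL-3B-Instruct python models/bos_model/qwen25_vl/demo_qwen25_vl.py -i models/bos_model/qwen25_vl/demo/images/dog.jpg -c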
Details
- On the first execution of each model, TTNN will create weight cache files for that model to speed up future runs.
- These cache files only need to be created once for each model and set of weights. New fine-tuned weights must be cached separately, and caches are stored per machine (see the timing sketch below).
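To see the effect of the cache, you can time the same test twice; the second run should start noticeably faster because weight conversion is skipped (a sketch reusing the single-user command from above):

# first run: converts and caches the weights (slow)
time HF_MODEL=Qwen/Qwen2.5-VL-3B-Instruct pytest models/bos_model/qwen25_vl/demo/vision_demo.py -k 'accuracy and batch1-trace'
# second run: loads from the cache (faster startup)
time HF_MODEL=Qwen/Qwen2.5-VL-3B-Instruct pytest models/bos_model/qwen25_vl/demo/vision_demo.py -k 'accuracy and batch1-trace'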
Run the ttnn-visualizer Profiler
- First, export environment variables using the script file.
  - `$EXPERIMENT_NAME`: any string, for example `qwen`.

source models/bos_model/export_l1_vis.sh $EXPERIMENT_NAME
- Second, run the model.

HF_MODEL=Qwen/Qwen2.5-VL-7B-Instruct pytest models/bos_model/qwen25_vl/demo/vision_demo.py -k 'accuracy and profiler'

- If the model finishes running successfully, the result report will be generated in `generated/ttnn/reports/$EXPERIMENT_NAME_MMDD_hhmm/`.
- Third, run `ttnn-visualizer`.
  - `$REPORT_PATH`: the path mentioned in the previous step.

ttnn-visualizer --profiler-path $REPORT_PATH

- Visit `http://localhost:8000/` using your web browser.
- Once the experiment has finished, run the following command to clear the environment variables.
source models/bos_model/unset_l1_vis.sh
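Putting the steps together, a full profiling session might look like this (the report directory name is illustrative; substitute the actual `MMDD_hhmm` timestamp generated for your run):

# 1. set the experiment name and export the visualizer environment variables
source models/bos_model/export_l1_vis.sh qwen
# 2. run the model with the profiler test case
HF_MODEL=Qwen/Qwen2.5-VL-7B-Instruct pytest models/bos_model/qwen25_vl/demo/vision_demo.py -k 'accuracy and profiler'
# 3. open the generated report in ttnn-visualizer (directory name is an example)
ttnn-visualizer --profiler-path generated/ttnn/reports/qwen_0101_1200/
# 4. clear the environment variables
source models/bos_model/unset_l1_vis.sh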