⚡ Performance and Efficiency Benchmarks

This section reports the performance of Qwen 3.5 on NPU with FastFlowLM (FLM).

Note:

  • Results are based on FastFlowLM v0.9.38.
  • Tests were run under FLM’s default NPU power mode (Performance).
  • Newer versions may deliver improved performance.
  • Fine-tuned models show performance comparable to their base models.

Test System 1:

AMD Ryzen™ AI 7 350 (Kraken Point) with 32 GB DRAM; performance is comparable to other Kraken Point systems.


🚀 Decoding Speed (TPS, or Tokens per Second, starting at different context lengths)

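| Model | HW | 1k | 2k | 4k | 8k | 16k | 32k |
|---|---|---|---|---|---|---|---|
| Qwen3.5-0.8B | NPU (FLM) | 39.2 | 38.0 | 36.3 | 33.1 | 28.1 | 21.6 |
| Qwen3.5-2B | NPU (FLM) | 26.8 | 26.2 | 25.4 | 23.7 | 21.3 | 17.0 |
| Qwen3.5-4B | NPU (FLM) | 15.0 | 14.6 | 14.2 | 13.3 | 11.8 | 9.6 |
| Qwen3.5-9B | NPU (FLM) | 9.3 | 9.2 | 9.0 | 8.5 | 7.8 | 6.9 |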
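To make the context-length effect in the decoding table easier to compare across models, the relative slowdown from 1k to 32k can be computed directly from the reported numbers. The following is a minimal Python sketch. The dictionary and function names are illustrative and not part of FastFlowLM.

```python
# Decoding speeds (tokens/s) copied from the decoding table above,
# ordered by starting context length: 1k, 2k, 4k, 8k, 16k, 32k.
DECODE_TPS = {
    "Qwen3.5-0.8B": [39.2, 38.0, 36.3, 33.1, 28.1, 21.6],
    "Qwen3.5-2B": [26.8, 26.2, 25.4, 23.7, 21.3, 17.0],
    "Qwen3.5-4B": [15.0, 14.6, 14.2, 13.3, 11.8, 9.6],
    "Qwen3.5-9B": [9.3, 9.2, 9.0, 8.5, 7.8, 6.9],
}


def slowdown_percent(tps_row):
    """Percentage drop in decoding speed from the first to the last column."""
    return 100.0 * (1.0 - tps_row[-1] / tps_row[0])


if __name__ == "__main__":
    for model, row in DECODE_TPS.items():
        print(f"{model}: {slowdown_percent(row):.1f}% slower at 32k than at 1k")
```

On these numbers the relative drop shrinks as model size grows (roughly 45% for the 0.8B model versus roughly 26% for the 9B model), although the larger models start from a lower absolute speed.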

🚀 Prefill Speed (TPS, or Tokens per Second, with different prompt lengths)

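| Model | HW | 1k | 2k | 4k | 8k | 16k | 32k |
|---|---|---|---|---|---|---|---|
| Qwen3.5-0.8B | NPU (FLM) | 983 | 1257 | 1471 | 1584 | 1579 | 1447 |
| Qwen3.5-2B | NPU (FLM) | 803 | 1004 | 1142 | 1223 | 1225 | 1151 |
| Qwen3.5-4B | NPU (FLM) | 378 | 440 | 479 | 493 | 487 | 450 |
| Qwen3.5-9B | NPU (FLM) | 284 | 333 | 362 | 379 | 378 | 357 |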
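The prefill speeds above can be turned into a rough time-to-first-token estimate for text-only prompts by dividing the prompt length by the prefill rate. The sketch below rests on two assumptions that the source does not state: that "1k" means 1024 tokens, and that each reported TPS value is an average over the whole prompt. The names are illustrative only.

```python
# Prefill speeds (tokens/s) copied from the prefill table above.
LENGTH_LABELS = ["1k", "2k", "4k", "8k", "16k", "32k"]
PREFILL_TPS = {
    "Qwen3.5-0.8B": [983, 1257, 1471, 1584, 1579, 1447],
    "Qwen3.5-2B": [803, 1004, 1142, 1223, 1225, 1151],
    "Qwen3.5-4B": [378, 440, 479, 493, 487, 450],
    "Qwen3.5-9B": [284, 333, 362, 379, 378, 357],
}

# Assumption: "1k" denotes 1024 tokens (it could also mean 1000).
TOKENS_PER_K = 1024


def estimate_ttft_seconds(model, length_label):
    """Approximate text-prompt TTFT as prompt tokens / prefill tokens per second."""
    column = LENGTH_LABELS.index(length_label)
    prompt_tokens = int(length_label.rstrip("k")) * TOKENS_PER_K
    return prompt_tokens / PREFILL_TPS[model][column]


if __name__ == "__main__":
    for model in PREFILL_TPS:
        estimates = ", ".join(
            f"{label}: {estimate_ttft_seconds(model, label):.1f}s"
            for label in LENGTH_LABELS
        )
        print(f"{model} -> {estimates}")
```

Under these assumptions, a 32k prompt takes on the order of 23 seconds to prefill on Qwen3.5-0.8B and about 92 seconds on Qwen3.5-9B. These are derived estimates, not measured values.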

🚀 Prefill TTFT with Image Input (Seconds)

Prefill time-to-first-token (TTFT) for Qwen3.5 models on NPU (FastFlowLM) with different image resolutions.

Mid Resolution Images:

| Model | HW | 720p (1280×720) | 1080p (1920×1080) |
|---|---|---|---|
| Qwen3.5-0.8B | NPU (FLM) | 1.6 | 2.9 |
| Qwen3.5-2B | NPU (FLM) | 2.4 | 4.8 |
| Qwen3.5-4B | NPU (FLM) | 3.7 | 7.5 |
| Qwen3.5-9B | NPU (FLM) | 4.8 | 9.6 |

High Resolution Images:

| Model | HW | 2K (2560×1440) | 4K (3840×2160) |
|---|---|---|---|
| Qwen3.5-0.8B | NPU (FLM) | 5.3 | 15.2 |
| Qwen3.5-2B | NPU (FLM) | 9.6 | 30.5 |
| Qwen3.5-4B | NPU (FLM) | 14.7 | 41.3 |
| Qwen3.5-9B | NPU (FLM) | 18.0 | 50.8 |

This test uses a short prompt: “Describe this image.”
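One way to compare the image results across resolutions is to normalize TTFT by pixel count. The Python sketch below computes seconds per megapixel from the two image tables above. It is purely descriptive: the source does not say how images are resized or tokenized, so the ratio should not be read as a model of the underlying pipeline. All names are illustrative.

```python
# Image resolutions (width, height) and TTFT in seconds, copied from the
# mid- and high-resolution tables above.
RESOLUTIONS = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "2K": (2560, 1440),
    "4K": (3840, 2160),
}
IMAGE_TTFT_S = {
    "Qwen3.5-0.8B": {"720p": 1.6, "1080p": 2.9, "2K": 5.3, "4K": 15.2},
    "Qwen3.5-2B": {"720p": 2.4, "1080p": 4.8, "2K": 9.6, "4K": 30.5},
    "Qwen3.5-4B": {"720p": 3.7, "1080p": 7.5, "2K": 14.7, "4K": 41.3},
    "Qwen3.5-9B": {"720p": 4.8, "1080p": 9.6, "2K": 18.0, "4K": 50.8},
}


def seconds_per_megapixel(model, resolution):
    """TTFT divided by the image size in megapixels (descriptive ratio only)."""
    width, height = RESOLUTIONS[resolution]
    megapixels = width * height / 1_000_000
    return IMAGE_TTFT_S[model][resolution] / megapixels


if __name__ == "__main__":
    for model in IMAGE_TTFT_S:
        ratios = ", ".join(
            f"{res}: {seconds_per_megapixel(model, res):.2f} s/MP"
            for res in RESOLUTIONS
        )
        print(f"{model} -> {ratios}")
```

For every model in the tables, the 4K ratio is higher than the 720p ratio, so on these measurements TTFT grows somewhat faster than pixel count between the smallest and largest resolutions.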