⚡ Performance and Efficiency Benchmarks

This section reports the performance on NPU with FastFlowLM (FLM).

Note:

  • Results are based on FastFlowLM v0.9.33.
  • Under FLM’s default NPU power mode (Performance)
  • Newer versions may deliver improved performance.
  • Fine-tuned models show performance comparable to their base models.

Test System 1:

AMD Ryzen™ AI 7 350 (Kraken Point) with 32 GB DRAM; performance is comparable to other Kraken Point systems.


🚀 Decoding Speed (TPS, or Tokens per Second, starting @ different context lengths)

Model HW 1k 2k 4k 8k 16k 32k
Qwen2.5-3B-Instruct NPU (FLM) 23.5 22.5 19.8 16.8 12.5 8.4
Qwen2.5-VL-3B-Instruct NPU (FLM) 23.5 22.5 19.8 16.8 12.5 8.4

🚀 Prefill Speed (TPS, or Tokens per Second, with different prompt lengths)

Model HW 1k 2k 4k 8k 16k 32k
Qwen2.5-3B-Instruct NPU (FLM) 660 809 899 891 741 532
Qwen2.5-VL-3B-Instruct NPU (FLM) 660 809 899 891 741 532

🚀 Prefill TTFT with Image Input (Seconds)

Prefill time-to-first-token (TTFT) for Qwen2.5-VL-3B-Instruct on NPU (FastFlowLM) with different image resolutions.

Mid Resolution Images:

Model HW 720p (1280×720) 1080p (1920×1080)
Qwen2.5-VL-3B-Instruct NPU (FLM) 4.3 7.9

High Resolution Images:

Model HW 2K (2560×1440) 4K (3840×2160)
Qwen2.5-VL-3B-Instruct NPU (FLM) 13.3 36.4

This test uses a short prompt: “Describe this image.”