Qwen 3 · FastFlowLM

⚡ Performance and Efficiency Benchmarks

This section reports the performance of Qwen 3 on NPU with FastFlowLM (FLM).

Note:

Results are based on FastFlowLM v0.9.31.

Tests were run under FLM’s default NPU power mode (Performance).

Newer versions may deliver improved performance.

Fine-tuned models show performance comparable to their base models.

Test System 1:

AMD Ryzen™ AI 7 350 (Kraken Point) with 32 GB DRAM; performance is comparable to other Kraken Point systems.

🚀 Decoding Speed (TPS, or Tokens per Second, starting at different context lengths)

Model	HW	1k	2k	4k	8k	16k	32k
Qwen 3 0.6B	NPU (FLM)	66.5	57.5	44.5	31.0	19.6	14.1
Qwen 3 1.7B	NPU (FLM)	40.2	35.8	30.8	23.7	16.4	12.5
Qwen 3 4B	NPU (FLM)	19.6	18.1	16.3	13.7	10.6	8.5
Qwen 3 8B	NPU (FLM)	11.9	11.5	11.1	10.4	8.7	7.2
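
To see how much decoding slows as the context grows, the table above can be reduced to one ratio per model. The snippet below is a minimal sketch: `DECODE_TPS` and `slowdown` are illustrative names, not part of FastFlowLM, and the numbers are copied from the table.

```python
# Decoding speed in tokens/s, copied from the table above.
# Keys are the starting context lengths as labeled in the table.
DECODE_TPS = {
    "Qwen 3 0.6B": {"1k": 66.5, "2k": 57.5, "4k": 44.5, "8k": 31.0, "16k": 19.6, "32k": 14.1},
    "Qwen 3 1.7B": {"1k": 40.2, "2k": 35.8, "4k": 30.8, "8k": 23.7, "16k": 16.4, "32k": 12.5},
    "Qwen 3 4B": {"1k": 19.6, "2k": 18.1, "4k": 16.3, "8k": 13.7, "16k": 10.6, "32k": 8.5},
    "Qwen 3 8B": {"1k": 11.9, "2k": 11.5, "4k": 11.1, "8k": 10.4, "16k": 8.7, "32k": 7.2},
}


def slowdown(model: str, short_ctx: str = "1k", long_ctx: str = "32k") -> float:
    """Return how many times slower decoding is at long_ctx than at short_ctx."""
    speeds = DECODE_TPS[model]
    return speeds[short_ctx] / speeds[long_ctx]


if __name__ == "__main__":
    for model in DECODE_TPS:
        print(f"{model}: {slowdown(model):.2f}x slower at 32k than at 1k")
```

On these numbers the ratio is roughly 4.7x for Qwen 3 0.6B and roughly 1.7x for Qwen 3 8B, so the smaller models lose proportionally more decoding speed at long contexts.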

🚀 Prefill Speed (TPS, or Tokens per Second, with different prompt lengths)

Model	HW	1k	2k	4k	8k	16k	32k
Qwen 3 0.6B	NPU (FLM)	1494	2003	2165	1981	1485	907
Qwen 3 1.7B	NPU (FLM)	956	1263	1434	1411	1143	768
Qwen 3 4B	NPU (FLM)	509	582	615	576	448	303
Qwen 3 8B	NPU (FLM)	357	435	457	442	367	260
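
For text-only prompts, prefill speed gives a rough time-to-first-token estimate: prompt length divided by prefill TPS. The following is a minimal sketch under two assumptions that this page does not state: that the reported TPS is an average over the whole prompt, and that "1k" means 1024 tokens. `PREFILL_TPS` and `estimate_ttft_seconds` are illustrative names, not FastFlowLM APIs.

```python
# Prefill speed in tokens/s, copied from the table above.
PREFILL_TPS = {
    "Qwen 3 0.6B": {"1k": 1494, "2k": 2003, "4k": 2165, "8k": 1981, "16k": 1485, "32k": 907},
    "Qwen 3 1.7B": {"1k": 956, "2k": 1263, "4k": 1434, "8k": 1411, "16k": 1143, "32k": 768},
    "Qwen 3 4B": {"1k": 509, "2k": 582, "4k": 615, "8k": 576, "16k": 448, "32k": 303},
    "Qwen 3 8B": {"1k": 357, "2k": 435, "4k": 457, "8k": 442, "16k": 367, "32k": 260},
}

TOKENS_PER_K = 1024  # assumption: "1k" in the table means 1024 tokens


def estimate_ttft_seconds(model: str, prompt_label: str) -> float:
    """Estimate prefill time as prompt tokens divided by prefill tokens/s."""
    prompt_tokens = int(prompt_label.rstrip("k")) * TOKENS_PER_K
    return prompt_tokens / PREFILL_TPS[model][prompt_label]


if __name__ == "__main__":
    for model, speeds in PREFILL_TPS.items():
        estimates = ", ".join(
            f"{label}: {estimate_ttft_seconds(model, label):.1f}s" for label in speeds
        )
        print(f"{model} -> {estimates}")
```

These figures are estimates derived from the table, not separate measurements; they ignore any fixed startup cost before prefill begins.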

🚀 Prefill TTFT with Image Input (Seconds)

Prefill time-to-first-token (TTFT) for Qwen3-VL-4B on NPU (FastFlowLM) with different image resolutions.

Mid Resolution Images:

Model	HW	720p (1280×720)	1080p (1920×1080)
Qwen3-VL-4B	NPU (FLM)	3.3	7.4

High Resolution Images:

Model	HW	2K (2560×1440)	4K (3840×2160)
Qwen3-VL-4B	NPU (FLM)	13.7	41.2

This test uses a short prompt: “Describe this image.”
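
One way to compare the image results across resolutions is to normalize TTFT by pixel count. The sketch below computes seconds per megapixel from the two tables above; `IMAGE_TTFT` and `seconds_per_megapixel` are illustrative names, and treating pixel count as the cost driver is an assumption, since this page does not describe how Qwen3-VL-4B preprocesses or tiles images.

```python
# (width, height, TTFT in seconds) for Qwen3-VL-4B, copied from the tables above.
IMAGE_TTFT = {
    "720p": (1280, 720, 3.3),
    "1080p": (1920, 1080, 7.4),
    "2K": (2560, 1440, 13.7),
    "4K": (3840, 2160, 41.2),
}


def seconds_per_megapixel(label: str) -> float:
    """Return TTFT divided by the image size in megapixels (1 MP = 1,000,000 px)."""
    width, height, ttft = IMAGE_TTFT[label]
    megapixels = width * height / 1_000_000
    return ttft / megapixels


if __name__ == "__main__":
    for label in IMAGE_TTFT:
        print(f"{label}: {seconds_per_megapixel(label):.2f} s/MP")
```

On these numbers the cost stays near 3.6 to 3.7 s per megapixel from 720p through 2K and rises to about 5.0 s per megapixel at 4K, which suggests TTFT grows faster than linearly with pixel count at the highest resolution.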