Benchmarks

Measured on real laptops.

Every FastFlowLM release is validated on Ryzen™ AI NPUs. We publish the results in docs/benchmarks so teams can compare apples-to-apples.

The runtime extends AMD’s native 2K context limit to 256K tokens for long-context LLMs and VLMs and, in power efficiency tests, consumes 67.2× less energy per token than the integrated GPU and 222.9× less energy per token than the CPU on the same chip while holding higher throughput.

GPT-OSS 20B

19 tps

AMD Ryzen™ AI 7 350 with 32 GB DRAM

Qwen 3 0.6B

80 tps

Prefill speed: 1,356 tps

Gemma3 1B

43 tps

Prefill speed: 1,657 tps

Gemma3 4B Vision

On-device power efficiency (Tokens/s/Watt or Tokens/Joule)

Gemma3 4B benchmark overview of power efficiency (TPS/W) for both prefill and decoding

Ultra-high power efficiency