⚡ Performance and Efficiency Benchmarks
This section reports the performance of LLaMA 3.x models running on the NPU with FastFlowLM (FLM).
Note:
- Results are based on FastFlowLM v0.9.30.
- Measured under FLM’s default NPU power mode (Performance).
- Newer versions may deliver improved performance.
- Fine-tuned models show performance comparable to their base models.
Test System 1:
AMD Ryzen™ AI 7 350 (Krackan Point) with 32 GB DRAM; performance is comparable to that of other Krackan Point systems.
🚀 Decoding Speed (TPS, or Tokens per Second, at different starting context lengths)
| Model | HW | 1k | 2k | 4k | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA 3.2 1B | NPU (FLM) | 64.5 | 62.2 | 58.9 | 53.9 | 45.5 | 35.0 | 24.1 | 13.6 |
| LLaMA 3.2 3B | NPU (FLM) | 26.3 | 25.5 | 24.1 | 21.7 | 18.0 | 13.6 | 9.0 | OOM |
| LLaMA 3.1 8B | NPU (FLM) | 12.8 | 12.6 | 12.2 | 11.5 | 10.2 | 8.5 | OOM | OOM |
- OOM: Out of Memory. Only <50% of system DRAM can be accessed by the NPU.
- On systems with more than 32 GB of DRAM, longer context lengths are supported; FLM supports the full context length available for each model.
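For readers unfamiliar with the metric, the sketch below shows how a decode-TPS number like those in the table is typically computed: generate a batch of new tokens starting from a context of a given length, time only that decode phase, and divide token count by elapsed seconds. This is a generic illustration, not the FastFlowLM benchmarking harness; the `generate` callable is a hypothetical stand-in for whatever runtime produces the tokens.

```python
import time

def measure_decode_tps(generate, context_tokens, n_new_tokens=256):
    """Measure decode throughput (tokens/second) starting from a given context.

    `generate` is a hypothetical callable that decodes `n_new_tokens` tokens
    one at a time, conditioned on `context_tokens`; it is not an FLM API.
    """
    start = time.perf_counter()
    generate(context_tokens, n_new_tokens)      # decode phase only
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed               # tokens per second
```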
🚀 Prefill Speed (TPS, or Tokens per Second, with different prompt lengths)
| Model | HW | 1k | 2k | 4k | 8k | 16k | 32k |
|---|---|---|---|---|---|---|---|
| LLaMA 3.2 1B | NPU (FLM) | 1686 | 2136 | 2339 | 2212 | 1706 | 1157 |
| LLaMA 3.2 3B | NPU (FLM) | 766 | 910 | 991 | 933 | 721 | 500 |
| LLaMA 3.1 8B | NPU (FLM) | 403 | 472 | 495 | 467 | 381 | 281 |
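A practical way to read the prefill table is as an estimate of how long a prompt takes to process before the first token is generated: divide the prompt length by the prefill TPS for that length. The sketch below does this arithmetic for the LLaMA 3.1 8B row, assuming "1k" means 1024 tokens; it is a back-of-the-envelope estimate, not a measured time-to-first-token.

```python
# Prefill TPS for LLaMA 3.1 8B, taken from the table above (prompt length -> TPS).
prefill_tps_8b = {1024: 403, 2048: 472, 4096: 495, 8192: 467, 16384: 381, 32768: 281}

def estimated_prefill_seconds(prompt_tokens: int, tps: float) -> float:
    """Rough prompt-processing time: prompt length divided by prefill throughput."""
    return prompt_tokens / tps

# Example: an 8k-token prompt on LLaMA 3.1 8B -> 8192 / 467 ≈ 17.5 s before decoding starts.
print(f"{estimated_prefill_seconds(8192, prefill_tps_8b[8192]):.1f} s")
```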