# llama.cpp/tools/llama-bench

Performance testing tool for llama.cpp.

## Table of contents

1. [Syntax](#syntax)
2. [Examples](#examples)
    1. [Text generation with different models](#text-generation-with-different-models)
    2. [Prompt processing with different batch sizes](#prompt-processing-with-different-batch-sizes)
    3. [Different numbers of threads](#different-numbers-of-threads)
    4. [Different numbers of layers offloaded to the GPU](#different-numbers-of-layers-offloaded-to-the-gpu)
    5. [Different prefilled context](#different-prefilled-context)
3. [Output formats](#output-formats)
    1. [Markdown](#markdown)
    2. [CSV](#csv)
    3. [JSON](#json)
    4. [JSONL](#jsonl)
    5. [SQL](#sql)

## Syntax

```
usage: llama-bench [options]

options:
  -h, --help
  --numa <distribute|isolate|numactl>       numa mode (default: disabled)
  -r, --repetitions <n>                     number of times to repeat each test (default: 5)
  --prio <0|1|2|3>                          process/thread priority (default: 0)
  --delay <0...N> (seconds)                 delay between each test (default: 0)
  -o, --output <csv|json|jsonl|md|sql>      output format printed to stdout (default: md)
  -oe, --output-err <csv|json|jsonl|md|sql> output format printed to stderr (default: none)
  --list-devices                            list available devices and exit
  -v, --verbose                             verbose output
  --progress                                print test progress indicators
  -rpc, --rpc <rpc_servers>                 register RPC devices (comma separated)

test parameters:
  -m, --model <filename>                    (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt <n>                        (default: 512)
  -n, --n-gen <n>                           (default: 128)
  -pg <pp,tg>                               (default: )
  -d, --n-depth <n>                         (default: 0)
  -b, --batch-size <n>                      (default: 2048)
  -ub, --ubatch-size <n>                    (default: 512)
  -ctk, --cache-type-k <t>                  (default: f16)
  -ctv, --cache-type-v <t>                  (default: f16)
  -t, --threads <n>                         (default: system dependent)
  -C, --cpu-mask <hex,hex>                  (default: 0x0)
  --cpu-strict <0|1>                        (default: 0)
  --poll <0...100>                          (default: 50)
  -ngl, --n-gpu-layers <n>                  (default: 99)
  -ncmoe, --n-cpu-moe <n>                   (default: 0)
  -sm, --split-mode <none|layer|row>        (default: layer)
  -mg, --main-gpu <i>                       (default: 0)
  -nkvo, --no-kv-offload <0|1>              (default: 0)
  -fa, --flash-attn <0|1>                   (default: 0)
  -dev, --device <dev0/dev1/...>            (default: auto)
  -mmp, --mmap <0|1>                        (default: 1)
  -embd, --embeddings <0|1>                 (default: 0)
  -ts, --tensor-split <ts0/ts1/..>          (default: 0)
  -ot --override-tensors <tensor name pattern>=<buffer type>;...
                                            (default: disabled)
  -nopo, --no-op-offload <0|1>              (default: 0)

Multiple values can be given for each parameter by separating them with ','
or by specifying the parameter multiple times. Ranges can be given as
'first-last' or 'first-last+step' or 'first-last*mult'.
```

llama-bench can perform three types of tests:

- Prompt processing (pp): processing a prompt in batches (`-p`)
- Text generation (tg): generating a sequence of tokens (`-n`)
- Prompt processing + text generation (pg): processing a prompt followed by generating a sequence of tokens (`-pg`)

With the exception of `-r`, `-o` and `-v`, all options can be specified multiple times to run multiple tests. Each pp and tg test is run with all combinations of the specified options. To specify multiple values for an option, the values can be separated by commas (e.g. `-n 16,32`), or the option can be specified multiple times (e.g. `-n 16 -n 32`).

Each test is repeated the number of times given by `-r`, and the results are averaged. The results are given in average tokens per second (t/s) and standard deviation. Some output formats (e.g. json) also include the individual results of each repetition.

Using the `-d <n>` option, each test can be run at a specified context depth, prefilling the KV cache with `<n>` tokens.

For a description of the other options, see the [completion example](../completion/README.md).

> [!NOTE]
> The measurements with `llama-bench` do not include the times for tokenization and for sampling.

## Examples

### Text generation with different models

```sh
$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -m models/13B/ggml-model-q4_0.gguf -p 0 -n 128,256,512
```

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 132.19 ± 0.55 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 256 | 129.37 ± 0.54 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 512 | 123.83 ± 0.25 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 128 | 82.17 ± 0.31 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 256 | 80.74 ± 0.23 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 512 | 78.08 ± 0.07 |

### Prompt processing with different batch sizes

```sh
$ ./llama-bench -n 0 -p 1024 -b 128,256,512,1024
```

| model | size | params | backend | ngl | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | pp 1024 | 1436.51 ± 3.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | pp 1024 | 1932.43 ± 23.48 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | pp 1024 | 2254.45 ± 15.59 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | pp 1024 | 2498.61 ± 13.58 |

### Different numbers of threads

```sh
$ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
```

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | pp 64 | 6.17 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | tg 16 | 4.05 ± 0.02 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | pp 64 | 12.31 ± 0.13 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | tg 16 | 7.80 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | pp 64 | 23.18 ± 0.06 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | tg 16 | 12.22 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp 64 | 32.29 ± 1.21 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 16 | 16.71 ± 0.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp 64 | 33.52 ± 0.03 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 16 | 15.32 ± 0.05 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | pp 64 | 59.00 ± 1.11 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | tg 16 | 16.41 ± 0.79 |

### Different numbers of layers offloaded to the GPU

```sh
$ ./llama-bench -ngl 10,20,30,31,32,33,34,35
```

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | pp 512 | 373.36 ± 2.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | tg 128 | 13.45 ± 0.93 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | pp 512 | 472.65 ± 1.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | tg 128 | 21.36 ± 1.94 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | pp 512 | 631.87 ± 11.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | tg 128 | 40.04 ± 1.82 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 31 | pp 512 | 657.89 ± 5.08 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 31 | tg 128 | 48.19 ± 0.81 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 32 | pp 512 | 688.26 ± 3.29 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 32 | tg 128 | 54.78 ± 0.65 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 33 | pp 512 | 704.27 ± 2.24 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 33 | tg 128 | 60.62 ± 1.76 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 34 | pp 512 | 881.34 ± 5.40 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 34 | tg 128 | 71.76 ± 0.23 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 35 | pp 512 | 2400.01 ± 7.72 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 35 | tg 128 | 131.66 ± 0.49 |

### Different prefilled context

```sh
$ ./llama-bench -d 0,512
```

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | pp512 | 7340.20 ± 23.45 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | tg128 | 120.60 ± 0.59 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | pp512 @ d512 | 6425.91 ± 18.88 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | tg128 @ d512 | 116.71 ± 0.60 |

## Output formats

By default, llama-bench outputs the results in markdown format. The results can be output in other formats by using the `-o` option.

### Markdown

```sh
$ ./llama-bench -o md
```

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2368.80 ± 93.24 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 131.42 ± 0.59 |

### CSV

```sh
$ ./llama-bench -o csv
```

```csv
build_commit,build_number,cpu_info,gpu_info,backends,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,n_depth,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"8cf427ff","5163","AMD Ryzen 7 7800X3D 8-Core Processor","NVIDIA GeForce RTX 4080","CUDA","models/Qwen2.5-7B-Instruct-Q4_K_M.gguf","qwen2 7B Q4_K - Medium","4677120000","7615616512","2048","512","8","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","0","2025-04-24T11:57:09Z","70285660","982040","7285.676949","100.064434"
"8cf427ff","5163","AMD Ryzen 7 7800X3D 8-Core Processor","NVIDIA GeForce RTX 4080","CUDA","models/Qwen2.5-7B-Instruct-Q4_K_M.gguf","qwen2 7B Q4_K - Medium","4677120000","7615616512","2048","512","8","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","0","2025-04-24T11:57:10Z","1067431600","3834831","119.915244","0.430617"
```
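
Since the CSV goes to stdout, it can be redirected to a file and loaded with any standard CSV reader. A minimal Python sketch using only the standard library (the header and row below are abbreviated from the sample output above; the real output contains all the columns listed in the header):

```python
import csv
import io

# Abbreviated sample of llama-bench CSV output (header + one pp512 row).
sample = (
    "model_type,n_prompt,n_gen,avg_ts,stddev_ts\n"
    '"qwen2 7B Q4_K - Medium","512","0","7285.676949","100.064434"\n'
)

# n_gen == 0 means a prompt-processing (pp) test; n_prompt == 0 means tg.
reader = csv.DictReader(io.StringIO(sample))
for row in reader:
    test = f"pp{row['n_prompt']}" if int(row["n_gen"]) == 0 else f"tg{row['n_gen']}"
    print(f"{row['model_type']}: {test} = {float(row['avg_ts']):.2f} t/s")
```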

### JSON

```sh
$ ./llama-bench -o json
```

```json
[
  {
    "build_commit": "8cf427ff",
    "build_number": 5163,
    "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor",
    "gpu_info": "NVIDIA GeForce RTX 4080",
    "backends": "CUDA",
    "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "model_type": "qwen2 7B Q4_K - Medium",
    "model_size": 4677120000,
    "model_n_params": 7615616512,
    "n_batch": 2048,
    "n_ubatch": 512,
    "n_threads": 8,
    "cpu_mask": "0x0",
    "cpu_strict": false,
    "poll": 50,
    "type_k": "f16",
    "type_v": "f16",
    "n_gpu_layers": 99,
    "split_mode": "layer",
    "main_gpu": 0,
    "no_kv_offload": false,
    "flash_attn": false,
    "tensor_split": "0.00",
    "use_mmap": true,
    "embeddings": false,
    "n_prompt": 512,
    "n_gen": 0,
    "n_depth": 0,
    "test_time": "2025-04-24T11:58:50Z",
    "avg_ns": 72135640,
    "stddev_ns": 1453752,
    "avg_ts": 7100.002165,
    "stddev_ts": 140.341520,
    "samples_ns": [ 74601900, 71632900, 71745200, 71952700, 70745500 ],
    "samples_ts": [ 6863.1, 7147.55, 7136.37, 7115.79, 7237.21 ]
  },
  {
    "build_commit": "8cf427ff",
    "build_number": 5163,
    "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor",
    "gpu_info": "NVIDIA GeForce RTX 4080",
    "backends": "CUDA",
    "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "model_type": "qwen2 7B Q4_K - Medium",
    "model_size": 4677120000,
    "model_n_params": 7615616512,
    "n_batch": 2048,
    "n_ubatch": 512,
    "n_threads": 8,
    "cpu_mask": "0x0",
    "cpu_strict": false,
    "poll": 50,
    "type_k": "f16",
    "type_v": "f16",
    "n_gpu_layers": 99,
    "split_mode": "layer",
    "main_gpu": 0,
    "no_kv_offload": false,
    "flash_attn": false,
    "tensor_split": "0.00",
    "use_mmap": true,
    "embeddings": false,
    "n_prompt": 0,
    "n_gen": 128,
    "n_depth": 0,
    "test_time": "2025-04-24T11:58:51Z",
    "avg_ns": 1076767880,
    "stddev_ns": 9449585,
    "avg_ts": 118.881588,
    "stddev_ts": 1.041811,
    "samples_ns": [ 1075361300, 1065089400, 1071761200, 1081934900, 1089692600 ],
    "samples_ts": [ 119.03, 120.178, 119.43, 118.307, 117.464 ]
  }
]
```
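
The per-repetition samples relate to the aggregate fields in a simple way: each `samples_ts` entry is the tokens-per-second rate of one repetition (the number of processed or generated tokens divided by that repetition's time), and `avg_ts`/`stddev_ts` are their mean and sample standard deviation. A small Python sketch reproducing the aggregates from the `pp512` record above:

```python
import statistics

# samples_ns from the pp512 record above; each repetition processed 512 prompt tokens.
n_tokens = 512
samples_ns = [74601900, 71632900, 71745200, 71952700, 70745500]

# Tokens per second for each repetition, then their mean and sample
# standard deviation, matching the avg_ts / stddev_ts fields.
samples_ts = [n_tokens / (ns * 1e-9) for ns in samples_ns]
avg_ts = statistics.mean(samples_ts)
stddev_ts = statistics.stdev(samples_ts)
print(f"{avg_ts:.2f} ± {stddev_ts:.2f} t/s")
```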

### JSONL

```sh
$ ./llama-bench -o jsonl
```

```json lines
{"build_commit": "8cf427ff", "build_number": 5163, "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor", "gpu_info": "NVIDIA GeForce RTX 4080", "backends": "CUDA", "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf", "model_type": "qwen2 7B Q4_K - Medium", "model_size": 4677120000, "model_n_params": 7615616512, "n_batch": 2048, "n_ubatch": 512, "n_threads": 8, "cpu_mask": "0x0", "cpu_strict": false, "poll": 50, "type_k": "f16", "type_v": "f16", "n_gpu_layers": 99, "split_mode": "layer", "main_gpu": 0, "no_kv_offload": false, "flash_attn": false, "tensor_split": "0.00", "use_mmap": true, "embeddings": false, "n_prompt": 512, "n_gen": 0, "n_depth": 0, "test_time": "2025-04-24T11:59:33Z", "avg_ns": 70497220, "stddev_ns": 883196, "avg_ts": 7263.609157, "stddev_ts": 90.940578, "samples_ns": [ 71551000, 71222800, 70364100, 69439100, 69909100 ],"samples_ts": [ 7155.74, 7188.71, 7276.44, 7373.37, 7323.8 ]}
{"build_commit": "8cf427ff", "build_number": 5163, "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor", "gpu_info": "NVIDIA GeForce RTX 4080", "backends": "CUDA", "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf", "model_type": "qwen2 7B Q4_K - Medium", "model_size": 4677120000, "model_n_params": 7615616512, "n_batch": 2048, "n_ubatch": 512, "n_threads": 8, "cpu_mask": "0x0", "cpu_strict": false, "poll": 50, "type_k": "f16", "type_v": "f16", "n_gpu_layers": 99, "split_mode": "layer", "main_gpu": 0, "no_kv_offload": false, "flash_attn": false, "tensor_split": "0.00", "use_mmap": true, "embeddings": false, "n_prompt": 0, "n_gen": 128, "n_depth": 0, "test_time": "2025-04-24T11:59:33Z", "avg_ns": 1068078400, "stddev_ns": 6279455, "avg_ts": 119.844681, "stddev_ts": 0.699739, "samples_ns": [ 1066331700, 1064864900, 1079042600, 1063328400, 1066824400 ],"samples_ts": [ 120.038, 120.203, 118.624, 120.377, 119.982 ]}
```
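
Because each line is a self-contained JSON object, the JSONL format is convenient for appending the results of multiple runs to one file and parsing it line by line. A minimal Python sketch (the records are abbreviated from the sample above; the real lines contain all the fields shown there):

```python
import json

# Two abbreviated lines of llama-bench JSONL output (one record per line).
lines = """\
{"model_type": "qwen2 7B Q4_K - Medium", "n_prompt": 512, "n_gen": 0, "avg_ts": 7263.609157}
{"model_type": "qwen2 7B Q4_K - Medium", "n_prompt": 0, "n_gen": 128, "avg_ts": 119.844681}
"""

# Parse each line independently; a partial read never corrupts earlier records.
results = [json.loads(line) for line in lines.splitlines()]
for r in results:
    test = f"pp{r['n_prompt']}" if r["n_gen"] == 0 else f"tg{r['n_gen']}"
    print(f"{r['model_type']}: {test} = {r['avg_ts']:.2f} t/s")
```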

### SQL

SQL output is suitable for importing into a SQLite database. The output can be piped into the `sqlite3` command line tool to add the results to a database.

```sh
$ ./llama-bench -o sql
```

```sql
CREATE TABLE IF NOT EXISTS test (
  build_commit TEXT,
  build_number INTEGER,
  cpu_info TEXT,
  gpu_info TEXT,
  backends TEXT,
  model_filename TEXT,
  model_type TEXT,
  model_size INTEGER,
  model_n_params INTEGER,
  n_batch INTEGER,
  n_ubatch INTEGER,
  n_threads INTEGER,
  cpu_mask TEXT,
  cpu_strict INTEGER,
  poll INTEGER,
  type_k TEXT,
  type_v TEXT,
  n_gpu_layers INTEGER,
  split_mode TEXT,
  main_gpu INTEGER,
  no_kv_offload INTEGER,
  flash_attn INTEGER,
  tensor_split TEXT,
  use_mmap INTEGER,
  embeddings INTEGER,
  n_prompt INTEGER,
  n_gen INTEGER,
  n_depth INTEGER,
  test_time TEXT,
  avg_ns INTEGER,
  stddev_ns INTEGER,
  avg_ts REAL,
  stddev_ts REAL
);

INSERT INTO test (build_commit, build_number, cpu_info, gpu_info, backends, model_filename, model_type, model_size, model_n_params, n_batch, n_ubatch, n_threads, cpu_mask, cpu_strict, poll, type_k, type_v, n_gpu_layers, split_mode, main_gpu, no_kv_offload, flash_attn, tensor_split, use_mmap, embeddings, n_prompt, n_gen, n_depth, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('8cf427ff', '5163', 'AMD Ryzen 7 7800X3D 8-Core Processor', 'NVIDIA GeForce RTX 4080', 'CUDA', 'models/Qwen2.5-7B-Instruct-Q4_K_M.gguf', 'qwen2 7B Q4_K - Medium', '4677120000', '7615616512', '2048', '512', '8', '0x0', '0', '50', 'f16', 'f16', '99', 'layer', '0', '0', '0', '0.00', '1', '0', '512', '0', '0', '2025-04-24T12:00:08Z', '69905000', '519516', '7324.546977', '54.032613');
INSERT INTO test (build_commit, build_number, cpu_info, gpu_info, backends, model_filename, model_type, model_size, model_n_params, n_batch, n_ubatch, n_threads, cpu_mask, cpu_strict, poll, type_k, type_v, n_gpu_layers, split_mode, main_gpu, no_kv_offload, flash_attn, tensor_split, use_mmap, embeddings, n_prompt, n_gen, n_depth, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('8cf427ff', '5163', 'AMD Ryzen 7 7800X3D 8-Core Processor', 'NVIDIA GeForce RTX 4080', 'CUDA', 'models/Qwen2.5-7B-Instruct-Q4_K_M.gguf', 'qwen2 7B Q4_K - Medium', '4677120000', '7615616512', '2048', '512', '8', '0x0', '0', '50', 'f16', 'f16', '99', 'layer', '0', '0', '0', '0.00', '1', '0', '0', '128', '0', '2025-04-24T12:00:09Z', '1063608780', '4464130', '120.346696', '0.504647');
```
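
For example, the output can be piped straight into the CLI tool (`./llama-bench -o sql | sqlite3 llama-bench.sqlite`, with any database filename you like), or executed from Python's built-in `sqlite3` module. A minimal sketch using an abbreviated version of the schema above (the real output has many more columns; the values come from the sample INSERT):

```python
import sqlite3

# Abbreviated schema and one row from llama-bench's SQL output.
sql = """
CREATE TABLE IF NOT EXISTS test (
  model_type TEXT,
  n_prompt INTEGER,
  n_gen INTEGER,
  avg_ts REAL
);
INSERT INTO test (model_type, n_prompt, n_gen, avg_ts)
VALUES ('qwen2 7B Q4_K - Medium', 512, 0, 7324.546977);
"""

con = sqlite3.connect(":memory:")
con.executescript(sql)  # runs the multi-statement script as-is
rows = con.execute(
    "SELECT model_type, n_prompt, n_gen, avg_ts FROM test ORDER BY avg_ts DESC"
).fetchall()
for model_type, n_prompt, n_gen, avg_ts in rows:
    print(model_type, n_prompt, n_gen, avg_ts)
con.close()
```

Because the script uses `CREATE TABLE IF NOT EXISTS`, repeated runs can keep appending rows to the same database for later comparison queries.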