# llama.cpp/tools/llama-bench

Performance testing tool for llama.cpp.

## Table of contents

1. [Syntax](#syntax)
2. [Examples](#examples)
    1. [Text generation with different models](#text-generation-with-different-models)
    2. [Prompt processing with different batch sizes](#prompt-processing-with-different-batch-sizes)
    3. [Different numbers of threads](#different-numbers-of-threads)
    4. [Different numbers of layers offloaded to the GPU](#different-numbers-of-layers-offloaded-to-the-gpu)
    5. [Different prefilled context](#different-prefilled-context)
3. [Output formats](#output-formats)
    1. [Markdown](#markdown)
    2. [CSV](#csv)
    3. [JSON](#json)
    4. [JSONL](#jsonl)
    5. [SQL](#sql)

## Syntax

```
usage: llama-bench [options]

options:
  -h, --help
  --numa <distribute|isolate|numactl>       numa mode (default: disabled)
  -r, --repetitions <n>                     number of times to repeat each test (default: 5)
  --prio <0|1|2|3>                          process/thread priority (default: 0)
  --delay <0...N> (seconds)                 delay between each test (default: 0)
  -o, --output <csv|json|jsonl|md|sql>      output format printed to stdout (default: md)
  -oe, --output-err <csv|json|jsonl|md|sql> output format printed to stderr (default: none)
  --list-devices                            list available devices and exit
  -v, --verbose                             verbose output
  --progress                                print test progress indicators
  -rpc, --rpc <rpc_servers>                 register RPC devices (comma separated)

test parameters:
  -m, --model <filename>                    (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt <n>                        (default: 512)
  -n, --n-gen <n>                           (default: 128)
  -pg <pp,tg>                               (default: )
  -d, --n-depth <n>                         (default: 0)
  -b, --batch-size <n>                      (default: 2048)
  -ub, --ubatch-size <n>                    (default: 512)
  -ctk, --cache-type-k <t>                  (default: f16)
  -ctv, --cache-type-v <t>                  (default: f16)
  -t, --threads <n>                         (default: system dependent)
  -C, --cpu-mask <hex,hex>                  (default: 0x0)
  --cpu-strict <0|1>                        (default: 0)
  --poll <0...100>                          (default: 50)
  -ngl, --n-gpu-layers <n>                  (default: 99)
  -ncmoe, --n-cpu-moe <n>                   (default: 0)
  -sm, --split-mode <none|layer|row>        (default: layer)
  -mg, --main-gpu <i>                       (default: 0)
  -nkvo, --no-kv-offload <0|1>              (default: 0)
  -fa, --flash-attn <0|1>                   (default: 0)
  -dev, --device <dev0/dev1/...>            (default: auto)
  -mmp, --mmap <0|1>                        (default: 1)
  -embd, --embeddings <0|1>                 (default: 0)
  -ts, --tensor-split <ts0/ts1/..>          (default: 0)
  -ot --override-tensors <tensor name pattern>=<buffer type>;...
                                            (default: disabled)
  -nopo, --no-op-offload <0|1>              (default: 0)

Multiple values can be given for each parameter by separating them with ','
or by specifying the parameter multiple times. Ranges can be given as
'first-last' or 'first-last+step' or 'first-last*mult'.
```

llama-bench can perform three types of tests:

- Prompt processing (pp): processing a prompt in batches (`-p`)
- Text generation (tg): generating a sequence of tokens (`-n`)
- Prompt processing + text generation (pg): processing a prompt followed by generating a sequence of tokens (`-pg`)

With the exception of `-r`, `-o` and `-v`, all options can be specified multiple times to run multiple tests. Each pp and tg test is run with all combinations of the specified options. To specify multiple values for an option, the values can be separated by commas (e.g. `-n 16,32`), or the option can be specified multiple times (e.g. `-n 16 -n 32`).

Each test is repeated the number of times given by `-r`, and the results are averaged. The results are given in average tokens per second (t/s) and standard deviation. Some output formats (e.g. json) also include the individual results of each repetition.

Using the `-d <n>` option, each test can be run at a specified context depth, prefilling the KV cache with `<n>` tokens.

For a description of the other options, see the [completion example](../completion/README.md).

> [!NOTE]
> The measurements with `llama-bench` do not include the times for tokenization and for sampling.

## Examples

### Text generation with different models

```sh
$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -m models/13B/ggml-model-q4_0.gguf -p 0 -n 128,256,512
```

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 132.19 ± 0.55 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 256 | 129.37 ± 0.54 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 512 | 123.83 ± 0.25 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 128 | 82.17 ± 0.31 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 256 | 80.74 ± 0.23 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 512 | 78.08 ± 0.07 |

### Prompt processing with different batch sizes

```sh
$ ./llama-bench -n 0 -p 1024 -b 128,256,512,1024
```

| model | size | params | backend | ngl | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | pp 1024 | 1436.51 ± 3.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | pp 1024 | 1932.43 ± 23.48 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | pp 1024 | 2254.45 ± 15.59 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | pp 1024 | 2498.61 ± 13.58 |

### Different numbers of threads

```sh
$ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
```

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | pp 64 | 6.17 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | tg 16 | 4.05 ± 0.02 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | pp 64 | 12.31 ± 0.13 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | tg 16 | 7.80 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | pp 64 | 23.18 ± 0.06 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | tg 16 | 12.22 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp 64 | 32.29 ± 1.21 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 16 | 16.71 ± 0.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp 64 | 33.52 ± 0.03 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 16 | 15.32 ± 0.05 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | pp 64 | 59.00 ± 1.11 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | tg 16 | 16.41 ± 0.79 |

### Different numbers of layers offloaded to the GPU

```sh
$ ./llama-bench -ngl 10,20,30,31,32,33,34,35
```

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | pp 512 | 373.36 ± 2.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | tg 128 | 13.45 ± 0.93 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | pp 512 | 472.65 ± 1.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | tg 128 | 21.36 ± 1.94 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | pp 512 | 631.87 ± 11.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | tg 128 | 40.04 ± 1.82 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 31 | pp 512 | 657.89 ± 5.08 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 31 | tg 128 | 48.19 ± 0.81 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 32 | pp 512 | 688.26 ± 3.29 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 32 | tg 128 | 54.78 ± 0.65 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 33 | pp 512 | 704.27 ± 2.24 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 33 | tg 128 | 60.62 ± 1.76 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 34 | pp 512 | 881.34 ± 5.40 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 34 | tg 128 | 71.76 ± 0.23 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 35 | pp 512 | 2400.01 ± 7.72 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 35 | tg 128 | 131.66 ± 0.49 |

### Different prefilled context

```sh
$ ./llama-bench -d 0,512
```

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | pp512 | 7340.20 ± 23.45 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | tg128 | 120.60 ± 0.59 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | pp512 @ d512 | 6425.91 ± 18.88 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | tg128 @ d512 | 116.71 ± 0.60 |

## Output formats

By default, llama-bench outputs the results in markdown format. The results can be output in other formats by using the `-o` option.

### Markdown

```sh
$ ./llama-bench -o md
```

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2368.80 ± 93.24 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 131.42 ± 0.59 |

### CSV

```sh
$ ./llama-bench -o csv
```

```csv
build_commit,build_number,cpu_info,gpu_info,backends,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,n_depth,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"8cf427ff","5163","AMD Ryzen 7 7800X3D 8-Core Processor","NVIDIA GeForce RTX 4080","CUDA","models/Qwen2.5-7B-Instruct-Q4_K_M.gguf","qwen2 7B Q4_K - Medium","4677120000","7615616512","2048","512","8","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","0","2025-04-24T11:57:09Z","70285660","982040","7285.676949","100.064434"
"8cf427ff","5163","AMD Ryzen 7 7800X3D 8-Core Processor","NVIDIA GeForce RTX 4080","CUDA","models/Qwen2.5-7B-Instruct-Q4_K_M.gguf","qwen2 7B Q4_K - Medium","4677120000","7615616512","2048","512","8","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","0","2025-04-24T11:57:10Z","1067431600","3834831","119.915244","0.430617"
```
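
Since the CSV goes to stdout, it can be redirected to a file and loaded with any standard CSV reader. A minimal Python sketch using only the standard library (the header and row below are abbreviated from the sample output above; the real output contains all the columns listed in the header):

```python
import csv
import io

# Abbreviated sample of llama-bench CSV output (header + one pp512 row).
sample = (
    "model_type,n_prompt,n_gen,avg_ts,stddev_ts\n"
    '"qwen2 7B Q4_K - Medium","512","0","7285.676949","100.064434"\n'
)

# n_gen == 0 means a prompt-processing (pp) test; n_prompt == 0 means tg.
reader = csv.DictReader(io.StringIO(sample))
for row in reader:
    test = f"pp{row['n_prompt']}" if int(row["n_gen"]) == 0 else f"tg{row['n_gen']}"
    print(f"{row['model_type']}: {test} = {float(row['avg_ts']):.2f} t/s")
```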

### JSON

```sh
$ ./llama-bench -o json
```

```json
[
  {
    "build_commit": "8cf427ff",
    "build_number": 5163,
    "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor",
    "gpu_info": "NVIDIA GeForce RTX 4080",
    "backends": "CUDA",
    "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "model_type": "qwen2 7B Q4_K - Medium",
    "model_size": 4677120000,
    "model_n_params": 7615616512,
    "n_batch": 2048,
    "n_ubatch": 512,
    "n_threads": 8,
    "cpu_mask": "0x0",
    "cpu_strict": false,
    "poll": 50,
    "type_k": "f16",
    "type_v": "f16",
    "n_gpu_layers": 99,
    "split_mode": "layer",
    "main_gpu": 0,
    "no_kv_offload": false,
    "flash_attn": false,
    "tensor_split": "0.00",
    "use_mmap": true,
    "embeddings": false,
    "n_prompt": 512,
    "n_gen": 0,
    "n_depth": 0,
    "test_time": "2025-04-24T11:58:50Z",
    "avg_ns": 72135640,
    "stddev_ns": 1453752,
    "avg_ts": 7100.002165,
    "stddev_ts": 140.341520,
    "samples_ns": [ 74601900, 71632900, 71745200, 71952700, 70745500 ],
    "samples_ts": [ 6863.1, 7147.55, 7136.37, 7115.79, 7237.21 ]
  },
  {
    "build_commit": "8cf427ff",
    "build_number": 5163,
    "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor",
    "gpu_info": "NVIDIA GeForce RTX 4080",
    "backends": "CUDA",
    "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "model_type": "qwen2 7B Q4_K - Medium",
    "model_size": 4677120000,
    "model_n_params": 7615616512,
    "n_batch": 2048,
    "n_ubatch": 512,
    "n_threads": 8,
    "cpu_mask": "0x0",
    "cpu_strict": false,
    "poll": 50,
    "type_k": "f16",
    "type_v": "f16",
    "n_gpu_layers": 99,
    "split_mode": "layer",
    "main_gpu": 0,
    "no_kv_offload": false,
    "flash_attn": false,
    "tensor_split": "0.00",
    "use_mmap": true,
    "embeddings": false,
    "n_prompt": 0,
    "n_gen": 128,
    "n_depth": 0,
    "test_time": "2025-04-24T11:58:51Z",
    "avg_ns": 1076767880,
    "stddev_ns": 9449585,
    "avg_ts": 118.881588,
    "stddev_ts": 1.041811,
    "samples_ns": [ 1075361300, 1065089400, 1071761200, 1081934900, 1089692600 ],
    "samples_ts": [ 119.03, 120.178, 119.43, 118.307, 117.464 ]
  }
]
```
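
The per-repetition samples relate to the aggregate fields in a simple way: each `samples_ts` entry is the tokens-per-second rate of one repetition (the number of processed or generated tokens divided by that repetition's time), and `avg_ts`/`stddev_ts` are their mean and sample standard deviation. A small Python sketch reproducing the aggregates from the `pp512` record above:

```python
import statistics

# samples_ns from the pp512 record above; each repetition processed 512 prompt tokens.
n_tokens = 512
samples_ns = [74601900, 71632900, 71745200, 71952700, 70745500]

# Tokens per second for each repetition, then their mean and sample
# standard deviation, matching the avg_ts / stddev_ts fields.
samples_ts = [n_tokens / (ns * 1e-9) for ns in samples_ns]
avg_ts = statistics.mean(samples_ts)
stddev_ts = statistics.stdev(samples_ts)
print(f"{avg_ts:.2f} ± {stddev_ts:.2f} t/s")
```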

### JSONL

```sh
$ ./llama-bench -o jsonl
```

```json lines
{"build_commit": "8cf427ff", "build_number": 5163, "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor", "gpu_info": "NVIDIA GeForce RTX 4080", "backends": "CUDA", "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf", "model_type": "qwen2 7B Q4_K - Medium", "model_size": 4677120000, "model_n_params": 7615616512, "n_batch": 2048, "n_ubatch": 512, "n_threads": 8, "cpu_mask": "0x0", "cpu_strict": false, "poll": 50, "type_k": "f16", "type_v": "f16", "n_gpu_layers": 99, "split_mode": "layer", "main_gpu": 0, "no_kv_offload": false, "flash_attn": false, "tensor_split": "0.00", "use_mmap": true, "embeddings": false, "n_prompt": 512, "n_gen": 0, "n_depth": 0, "test_time": "2025-04-24T11:59:33Z", "avg_ns": 70497220, "stddev_ns": 883196, "avg_ts": 7263.609157, "stddev_ts": 90.940578, "samples_ns": [ 71551000, 71222800, 70364100, 69439100, 69909100 ],"samples_ts": [ 7155.74, 7188.71, 7276.44, 7373.37, 7323.8 ]}
{"build_commit": "8cf427ff", "build_number": 5163, "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor", "gpu_info": "NVIDIA GeForce RTX 4080", "backends": "CUDA", "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf", "model_type": "qwen2 7B Q4_K - Medium", "model_size": 4677120000, "model_n_params": 7615616512, "n_batch": 2048, "n_ubatch": 512, "n_threads": 8, "cpu_mask": "0x0", "cpu_strict": false, "poll": 50, "type_k": "f16", "type_v": "f16", "n_gpu_layers": 99, "split_mode": "layer", "main_gpu": 0, "no_kv_offload": false, "flash_attn": false, "tensor_split": "0.00", "use_mmap": true, "embeddings": false, "n_prompt": 0, "n_gen": 128, "n_depth": 0, "test_time": "2025-04-24T11:59:33Z", "avg_ns": 1068078400, "stddev_ns": 6279455, "avg_ts": 119.844681, "stddev_ts": 0.699739, "samples_ns": [ 1066331700, 1064864900, 1079042600, 1063328400, 1066824400 ],"samples_ts": [ 120.038, 120.203, 118.624, 120.377, 119.982 ]}
```
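
Because each line is a self-contained JSON object, the JSONL format is convenient for appending the results of multiple runs to one file and parsing it line by line. A minimal Python sketch (the records are abbreviated from the sample above; the real lines contain all the fields shown there):

```python
import json

# Two abbreviated lines of llama-bench JSONL output (one record per line).
lines = """\
{"model_type": "qwen2 7B Q4_K - Medium", "n_prompt": 512, "n_gen": 0, "avg_ts": 7263.609157}
{"model_type": "qwen2 7B Q4_K - Medium", "n_prompt": 0, "n_gen": 128, "avg_ts": 119.844681}
"""

# Parse each line independently; a partial read never corrupts earlier records.
results = [json.loads(line) for line in lines.splitlines()]
for r in results:
    test = f"pp{r['n_prompt']}" if r["n_gen"] == 0 else f"tg{r['n_gen']}"
    print(f"{r['model_type']}: {test} = {r['avg_ts']:.2f} t/s")
```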

### SQL

SQL output is suitable for importing into a SQLite database. The output can be piped into the `sqlite3` command line tool to add the results to a database.

```sh
$ ./llama-bench -o sql
```

```sql
CREATE TABLE IF NOT EXISTS test (
  build_commit TEXT,
  build_number INTEGER,
  cpu_info TEXT,
  gpu_info TEXT,
  backends TEXT,
  model_filename TEXT,
  model_type TEXT,
  model_size INTEGER,
  model_n_params INTEGER,
  n_batch INTEGER,
  n_ubatch INTEGER,
  n_threads INTEGER,
  cpu_mask TEXT,
  cpu_strict INTEGER,
  poll INTEGER,
  type_k TEXT,
  type_v TEXT,
  n_gpu_layers INTEGER,
  split_mode TEXT,
  main_gpu INTEGER,
  no_kv_offload INTEGER,
  flash_attn INTEGER,
  tensor_split TEXT,
  use_mmap INTEGER,
  embeddings INTEGER,
  n_prompt INTEGER,
  n_gen INTEGER,
  n_depth INTEGER,
  test_time TEXT,
  avg_ns INTEGER,
  stddev_ns INTEGER,
  avg_ts REAL,
  stddev_ts REAL
);

INSERT INTO test (build_commit, build_number, cpu_info, gpu_info, backends, model_filename, model_type, model_size, model_n_params, n_batch, n_ubatch, n_threads, cpu_mask, cpu_strict, poll, type_k, type_v, n_gpu_layers, split_mode, main_gpu, no_kv_offload, flash_attn, tensor_split, use_mmap, embeddings, n_prompt, n_gen, n_depth, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('8cf427ff', '5163', 'AMD Ryzen 7 7800X3D 8-Core Processor', 'NVIDIA GeForce RTX 4080', 'CUDA', 'models/Qwen2.5-7B-Instruct-Q4_K_M.gguf', 'qwen2 7B Q4_K - Medium', '4677120000', '7615616512', '2048', '512', '8', '0x0', '0', '50', 'f16', 'f16', '99', 'layer', '0', '0', '0', '0.00', '1', '0', '512', '0', '0', '2025-04-24T12:00:08Z', '69905000', '519516', '7324.546977', '54.032613');
INSERT INTO test (build_commit, build_number, cpu_info, gpu_info, backends, model_filename, model_type, model_size, model_n_params, n_batch, n_ubatch, n_threads, cpu_mask, cpu_strict, poll, type_k, type_v, n_gpu_layers, split_mode, main_gpu, no_kv_offload, flash_attn, tensor_split, use_mmap, embeddings, n_prompt, n_gen, n_depth, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('8cf427ff', '5163', 'AMD Ryzen 7 7800X3D 8-Core Processor', 'NVIDIA GeForce RTX 4080', 'CUDA', 'models/Qwen2.5-7B-Instruct-Q4_K_M.gguf', 'qwen2 7B Q4_K - Medium', '4677120000', '7615616512', '2048', '512', '8', '0x0', '0', '50', 'f16', 'f16', '99', 'layer', '0', '0', '0', '0.00', '1', '0', '0', '128', '0', '2025-04-24T12:00:09Z', '1063608780', '4464130', '120.346696', '0.504647');
```
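
For example, the output can be piped straight into the CLI tool (`./llama-bench -o sql | sqlite3 llama-bench.sqlite`, with any database filename you like), or executed from Python's built-in `sqlite3` module. A minimal sketch using an abbreviated version of the schema above (the real output has many more columns; the values come from the sample INSERT):

```python
import sqlite3

# Abbreviated schema and one row from llama-bench's SQL output.
sql = """
CREATE TABLE IF NOT EXISTS test (
  model_type TEXT,
  n_prompt INTEGER,
  n_gen INTEGER,
  avg_ts REAL
);
INSERT INTO test (model_type, n_prompt, n_gen, avg_ts)
VALUES ('qwen2 7B Q4_K - Medium', 512, 0, 7324.546977);
"""

con = sqlite3.connect(":memory:")
con.executescript(sql)  # runs the multi-statement script as-is
rows = con.execute(
    "SELECT model_type, n_prompt, n_gen, avg_ts FROM test ORDER BY avg_ts DESC"
).fetchall()
for model_type, n_prompt, n_gen, avg_ts in rows:
    print(model_type, n_prompt, n_gen, avg_ts)
con.close()
```

Because the script uses `CREATE TABLE IF NOT EXISTS`, repeated runs can keep appending rows to the same database for later comparison queries.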