### Server benchmark tools

The benchmark uses [k6](https://k6.io/).

#### Install k6 and the SSE extension

SSE is not supported by default in k6; you have to build k6 with the [xk6-sse](https://github.com/phymbert/xk6-sse) extension.

Example (assuming Go >= 1.21 is installed):
```shell
go install go.k6.io/xk6/cmd/xk6@latest
$GOPATH/bin/xk6 build master \
--with github.com/phymbert/xk6-sse
```
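
To verify that the resulting binary includes the extension, you can inspect its version output (a minimal check; the exact output format varies across k6 versions):

```shell
# an xk6-built binary lists its bundled extensions in the version output
./k6 version
```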

#### Download a dataset

This dataset was originally proposed in [vLLM benchmarks](https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md).

```shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
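
As a quick sanity check, you can count the conversations in the downloaded file. This sketch assumes the dataset is a top-level JSON array:

```shell
# print the number of conversation entries in the dataset
python3 -c "import json; print(len(json.load(open('ShareGPT_V3_unfiltered_cleaned_split.json'))))"
```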

#### Download a model
Example for Phi-2:

```shell
../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf
```

#### Start the server
The server must answer OAI chat completion requests on `http://localhost:8080/v1`, or on the URL set in the `SERVER_BENCH_URL` environment variable.

Example:
```shell
llama-server --host localhost --port 8080 \
  --model ggml-model-q4_0.gguf \
  --cont-batching \
  --metrics \
  --parallel 8 \
  --batch-size 512 \
  --ctx-size 4096 \
  -ngl 33
```
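
Before launching the benchmark you can check that the server is up and the model is loaded, e.g. via the `/health` endpoint of `llama-server`:

```shell
# should report an OK status once the model is loaded
curl http://localhost:8080/health
```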

#### Run the benchmark

To run 500 chat completion requests with 8 concurrent users, capped at 10 minutes:
```shell
./k6 run script.js --duration 10m --iterations 500 --vus 8
```

The benchmark values can be overridden with:
- `SERVER_BENCH_URL` server url prefix for chat completions, default: `http://localhost:8080/v1`
- `SERVER_BENCH_N_PROMPTS` total prompts to randomly select in the benchmark, default: `480`
- `SERVER_BENCH_MODEL_ALIAS` model alias to pass in the completion request, default: `my-model`
- `SERVER_BENCH_MAX_TOKENS` max tokens to predict, default: `512`
- `SERVER_BENCH_DATASET` path to the benchmark dataset file
- `SERVER_BENCH_MAX_PROMPT_TOKENS` prompts with more tokens are filtered out of the dataset, default: `1024`
- `SERVER_BENCH_MAX_CONTEXT` requests whose context size (prompt + predicted tokens) exceeds this are filtered out of the dataset, default: `2048`

Note: the local tokenizer is a simple whitespace split, so the real number of tokens will differ.

The environment variables can be combined with the [k6 options](https://k6.io/docs/using-k6/k6-options/reference/):

```shell
SERVER_BENCH_N_PROMPTS=500 k6 run script.js --duration 10m --iterations 500 --vus 8
```

To [debug http requests](https://k6.io/docs/using-k6/http-debugging/), use `--http-debug="full"`.
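
For example, to trace a single request end to end (the extra flags below just limit the run to a single iteration):

```shell
# run a single iteration with full HTTP request/response tracing
./k6 run script.js --http-debug="full" --iterations 1 --vus 1
```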

#### Metrics

The following metrics are computed from the `usage` field of the OAI chat completion responses:
- `llamacpp_tokens_second` Trend of `usage.total_tokens / request duration`
- `llamacpp_prompt_tokens` Trend of `usage.prompt_tokens`
- `llamacpp_prompt_tokens_total_counter` Counter of `usage.prompt_tokens`
- `llamacpp_completion_tokens` Trend of `usage.completion_tokens`
- `llamacpp_completion_tokens_total_counter` Counter of `usage.completion_tokens`
- `llamacpp_completions_truncated_rate` Rate of truncated completions, i.e. `finish_reason === 'length'`
- `llamacpp_completions_stop_rate` Rate of completions stopped by the model, i.e. `finish_reason === 'stop'`

The script will fail if too many completions are truncated; see `llamacpp_completions_truncated_rate`.

The k6 metrics can be compared against the [server metrics](../README.md), with:

```shell
curl http://localhost:8080/metrics
```
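
To narrow the output to the llama.cpp-specific series (assuming the metric names exposed by the server carry a `llamacpp` prefix):

```shell
# keep only the llama.cpp metrics from the Prometheus endpoint
curl -s http://localhost:8080/metrics | grep llamacpp
```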

### Using the CI python script
The `bench.py` script performs several steps:
- start the server
- set the appropriate environment variables for k6
- run the k6 script
- extract metrics from Prometheus

It is primarily meant to run in CI, but you can also run it manually:

```shell
LLAMA_SERVER_BIN_PATH=../../../cmake-build-release/bin/llama-server python bench.py \
  --runner-label local \
  --name local \
  --branch `git rev-parse --abbrev-ref HEAD` \
  --commit `git rev-parse HEAD` \
  --scenario script.js \
  --duration 5m \
  --hf-repo ggml-org/models \
  --hf-file phi-2/ggml-model-q4_0.gguf \
  --model-path-prefix models \
  --parallel 4 \
  -ngl 33 \
  --batch-size 2048 \
  --ubatch-size 256 \
  --ctx-size 4096 \
  --n-prompts 200 \
  --max-prompt-tokens 256 \
  --max-tokens 256
```
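
To list all supported arguments (assuming the script uses a standard `--help`-capable argument parser):

```shell
python bench.py --help
```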