1# LLaMA.cpp HTTP Server
   2
   3Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**.
   4
   5Set of LLM REST APIs and a web UI to interact with llama.cpp.
   6
   7**Features:**
   8 * LLM inference of F16 and quantized models on GPU and CPU
   9 * [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions, responses, and embeddings routes
  10 * [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) compatible chat completions
  11 * Reranking endpoint (https://github.com/ggml-org/llama.cpp/pull/9510)
  12 * Parallel decoding with multi-user support
  13 * Continuous batching
  14 * Multimodal ([documentation](../../docs/multimodal.md)) / with OpenAI-compatible API support
  15 * Monitoring endpoints
  16 * Schema-constrained JSON response format
  17 * Prefilling of assistant messages similar to the Claude API
  18 * [Function calling](../../docs/function-calling.md) / tool use for ~any model
  19 * Speculative decoding
  20 * Easy-to-use web UI
  21
For the full list of features, please refer to the [server's changelog](https://github.com/ggml-org/llama.cpp/issues/9291).
  23
  24## Usage
  25
  26<!-- HELP_START -->
  27
  28<!-- IMPORTANT: The list below is auto-generated by llama-gen-docs; do NOT modify it manually -->
  29
  30### Common params
  31
  32| Argument | Explanation |
  33| -------- | ----------- |
  34| `-h, --help, --usage` | print usage and exit |
  35| `--version` | show version and build info |
  36| `--license` | show source code license and dependencies |
  37| `-cl, --cache-list` | show list of models in cache |
  38| `--completion-bash` | print source-able bash completion script for llama.cpp |
  39| `--verbose-prompt` | print a verbose prompt before generation (default: false) |
  40| `-t, --threads N` | number of CPU threads to use during generation (default: -1)<br/>(env: LLAMA_ARG_THREADS) |
  41| `-tb, --threads-batch N` | number of threads to use during batch and prompt processing (default: same as --threads) |
  42| `-C, --cpu-mask M` | CPU affinity mask: arbitrarily long hex. Complements cpu-range (default: "") |
  43| `-Cr, --cpu-range lo-hi` | range of CPUs for affinity. Complements --cpu-mask |
  44| `--cpu-strict <0\|1>` | use strict CPU placement (default: 0) |
  45| `--prio N` | set process/thread priority : low(-1), normal(0), medium(1), high(2), realtime(3) (default: 0) |
  46| `--poll <0...100>` | use polling level to wait for work (0 - no polling, default: 50) |
  47| `-Cb, --cpu-mask-batch M` | CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch (default: same as --cpu-mask) |
  48| `-Crb, --cpu-range-batch lo-hi` | ranges of CPUs for affinity. Complements --cpu-mask-batch |
  49| `--cpu-strict-batch <0\|1>` | use strict CPU placement (default: same as --cpu-strict) |
  50| `--prio-batch N` | set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime (default: 0) |
  51| `--poll-batch <0\|1>` | use polling to wait for work (default: same as --poll) |
  52| `-c, --ctx-size N` | size of the prompt context (default: 0, 0 = loaded from model)<br/>(env: LLAMA_ARG_CTX_SIZE) |
  53| `-n, --predict, --n-predict N` | number of tokens to predict (default: -1, -1 = infinity)<br/>(env: LLAMA_ARG_N_PREDICT) |
  54| `-b, --batch-size N` | logical maximum batch size (default: 2048)<br/>(env: LLAMA_ARG_BATCH) |
  55| `-ub, --ubatch-size N` | physical maximum batch size (default: 512)<br/>(env: LLAMA_ARG_UBATCH) |
  56| `--keep N` | number of tokens to keep from the initial prompt (default: 0, -1 = all) |
  57| `--swa-full` | use full-size SWA cache (default: false)<br/>[(more info)](https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)<br/>(env: LLAMA_ARG_SWA_FULL) |
  58| `-fa, --flash-attn [on\|off\|auto]` | set Flash Attention use ('on', 'off', or 'auto', default: 'auto')<br/>(env: LLAMA_ARG_FLASH_ATTN) |
  59| `--perf, --no-perf` | whether to enable internal libllama performance timings (default: false)<br/>(env: LLAMA_ARG_PERF) |
  60| `-e, --escape, --no-escape` | whether to process escapes sequences (\n, \r, \t, \', \", \\) (default: true) |
  61| `--rope-scaling {none,linear,yarn}` | RoPE frequency scaling method, defaults to linear unless specified by the model<br/>(env: LLAMA_ARG_ROPE_SCALING_TYPE) |
  62| `--rope-scale N` | RoPE context scaling factor, expands context by a factor of N<br/>(env: LLAMA_ARG_ROPE_SCALE) |
  63| `--rope-freq-base N` | RoPE base frequency, used by NTK-aware scaling (default: loaded from model)<br/>(env: LLAMA_ARG_ROPE_FREQ_BASE) |
  64| `--rope-freq-scale N` | RoPE frequency scaling factor, expands context by a factor of 1/N<br/>(env: LLAMA_ARG_ROPE_FREQ_SCALE) |
  65| `--yarn-orig-ctx N` | YaRN: original context size of model (default: 0 = model training context size)<br/>(env: LLAMA_ARG_YARN_ORIG_CTX) |
  66| `--yarn-ext-factor N` | YaRN: extrapolation mix factor (default: -1.00, 0.0 = full interpolation)<br/>(env: LLAMA_ARG_YARN_EXT_FACTOR) |
  67| `--yarn-attn-factor N` | YaRN: scale sqrt(t) or attention magnitude (default: -1.00)<br/>(env: LLAMA_ARG_YARN_ATTN_FACTOR) |
  68| `--yarn-beta-slow N` | YaRN: high correction dim or alpha (default: -1.00)<br/>(env: LLAMA_ARG_YARN_BETA_SLOW) |
  69| `--yarn-beta-fast N` | YaRN: low correction dim or beta (default: -1.00)<br/>(env: LLAMA_ARG_YARN_BETA_FAST) |
  70| `-kvo, --kv-offload, -nkvo, --no-kv-offload` | whether to enable KV cache offloading (default: enabled)<br/>(env: LLAMA_ARG_KV_OFFLOAD) |
  71| `--repack, -nr, --no-repack` | whether to enable weight repacking (default: enabled)<br/>(env: LLAMA_ARG_REPACK) |
  72| `--no-host` | bypass host buffer allowing extra buffers to be used<br/>(env: LLAMA_ARG_NO_HOST) |
  73| `-ctk, --cache-type-k TYPE` | KV cache data type for K<br/>allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1<br/>(default: f16)<br/>(env: LLAMA_ARG_CACHE_TYPE_K) |
  74| `-ctv, --cache-type-v TYPE` | KV cache data type for V<br/>allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1<br/>(default: f16)<br/>(env: LLAMA_ARG_CACHE_TYPE_V) |
  75| `-dt, --defrag-thold N` | KV cache defragmentation threshold (DEPRECATED)<br/>(env: LLAMA_ARG_DEFRAG_THOLD) |
  76| `--mlock` | force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
  77| `--mmap, --no-mmap` | whether to memory-map model. Explicitly enabling mmap disables direct-io. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)<br/>(env: LLAMA_ARG_MMAP) |
  78| `-dio, --direct-io, -ndio, --no-direct-io` | use DirectIO if available. Takes precedence over --mmap (default: enabled)<br/>(env: LLAMA_ARG_DIO) |
  79| `--numa TYPE` | attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggml-org/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
  80| `-dev, --device <dev1,dev2,..>` | comma-separated list of devices to use for offloading (none = don't offload)<br/>use --list-devices to see a list of available devices<br/>(env: LLAMA_ARG_DEVICE) |
  81| `--list-devices` | print list of available devices and exit |
  82| `-ot, --override-tensor <tensor name pattern>=<buffer type>,...` | override tensor buffer type<br/>(env: LLAMA_ARG_OVERRIDE_TENSOR) |
  83| `-cmoe, --cpu-moe` | keep all Mixture of Experts (MoE) weights in the CPU<br/>(env: LLAMA_ARG_CPU_MOE) |
  84| `-ncmoe, --n-cpu-moe N` | keep the Mixture of Experts (MoE) weights of the first N layers in the CPU<br/>(env: LLAMA_ARG_N_CPU_MOE) |
  85| `-ngl, --gpu-layers, --n-gpu-layers N` | max. number of layers to store in VRAM, either an exact number, 'auto', or 'all' (default: auto)<br/>(env: LLAMA_ARG_N_GPU_LAYERS) |
  86| `-sm, --split-mode {none,layer,row}` | how to split the model across multiple GPUs, one of:<br/>- none: use one GPU only<br/>- layer (default): split layers and KV across GPUs<br/>- row: split rows across GPUs<br/>(env: LLAMA_ARG_SPLIT_MODE) |
  87| `-ts, --tensor-split N0,N1,N2,...` | fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1<br/>(env: LLAMA_ARG_TENSOR_SPLIT) |
  88| `-mg, --main-gpu INDEX` | the GPU to use for the model (with split-mode = none), or for intermediate results and KV (with split-mode = row) (default: 0)<br/>(env: LLAMA_ARG_MAIN_GPU) |
  89| `-fit, --fit [on\|off]` | whether to adjust unset arguments to fit in device memory ('on' or 'off', default: 'on')<br/>(env: LLAMA_ARG_FIT) |
  90| `-fitt, --fit-target MiB0,MiB1,MiB2,...` | target margin per device for --fit, comma-separated list of values, single value is broadcast across all devices, default: 1024<br/>(env: LLAMA_ARG_FIT_TARGET) |
  91| `-fitc, --fit-ctx N` | minimum ctx size that can be set by --fit option, default: 4096<br/>(env: LLAMA_ARG_FIT_CTX) |
  92| `--check-tensors` | check model tensor data for invalid values (default: false) |
  93| `--override-kv KEY=TYPE:VALUE,...` | advanced option to override model metadata by key. to specify multiple overrides, either use comma-separated values.<br/>types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false,tokenizer.ggml.add_eos_token=bool:false |
  94| `--op-offload, --no-op-offload` | whether to offload host tensor operations to device (default: true) |
  95| `--lora FNAME` | path to LoRA adapter (use comma-separated values to load multiple adapters) |
  96| `--lora-scaled FNAME:SCALE,...` | path to LoRA adapter with user defined scaling (format: FNAME:SCALE,...)<br/>note: use comma-separated values |
  97| `--control-vector FNAME` | add a control vector<br/>note: use comma-separated values to add multiple control vectors |
  98| `--control-vector-scaled FNAME:SCALE,...` | add a control vector with user defined scaling SCALE<br/>note: use comma-separated values (format: FNAME:SCALE,...) |
  99| `--control-vector-layer-range START END` | layer range to apply the control vector(s) to, start and end inclusive |
 100| `-m, --model FNAME` | model path to load<br/>(env: LLAMA_ARG_MODEL) |
 101| `-mu, --model-url MODEL_URL` | model download url (default: unused)<br/>(env: LLAMA_ARG_MODEL_URL) |
 102| `-dr, --docker-repo [<repo>/]<model>[:quant]` | Docker Hub model repository. repo is optional, default to ai/. quant is optional, default to :latest.<br/>example: gemma3<br/>(default: unused)<br/>(env: LLAMA_ARG_DOCKER_REPO) |
 103| `-hf, -hfr, --hf-repo <user>/<model>[:quant]` | Hugging Face model repository; quant is optional, case-insensitive, default to Q4_K_M, or falls back to the first file in the repo if Q4_K_M doesn't exist.<br/>mmproj is also downloaded automatically if available. to disable, add --no-mmproj<br/>example: unsloth/phi-4-GGUF:q4_k_m<br/>(default: unused)<br/>(env: LLAMA_ARG_HF_REPO) |
 104| `-hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]` | Same as --hf-repo, but for the draft model (default: unused)<br/>(env: LLAMA_ARG_HFD_REPO) |
 105| `-hff, --hf-file FILE` | Hugging Face model file. If specified, it will override the quant in --hf-repo (default: unused)<br/>(env: LLAMA_ARG_HF_FILE) |
 106| `-hfv, -hfrv, --hf-repo-v <user>/<model>[:quant]` | Hugging Face model repository for the vocoder model (default: unused)<br/>(env: LLAMA_ARG_HF_REPO_V) |
 107| `-hffv, --hf-file-v FILE` | Hugging Face model file for the vocoder model (default: unused)<br/>(env: LLAMA_ARG_HF_FILE_V) |
 108| `-hft, --hf-token TOKEN` | Hugging Face access token (default: value from HF_TOKEN environment variable)<br/>(env: HF_TOKEN) |
 109| `--log-disable` | Log disable |
 110| `--log-file FNAME` | Log to file<br/>(env: LLAMA_LOG_FILE) |
 111| `--log-colors [on\|off\|auto]` | Set colored logging ('on', 'off', or 'auto', default: 'auto')<br/>'auto' enables colors when output is to a terminal<br/>(env: LLAMA_LOG_COLORS) |
 112| `-v, --verbose, --log-verbose` | Set verbosity level to infinity (i.e. log all messages, useful for debugging) |
 113| `--offline` | Offline mode: forces use of cache, prevents network access<br/>(env: LLAMA_OFFLINE) |
 114| `-lv, --verbosity, --log-verbosity N` | Set the verbosity threshold. Messages with a higher verbosity will be ignored. Values:<br/> - 0: generic output<br/> - 1: error<br/> - 2: warning<br/> - 3: info<br/> - 4: debug<br/>(default: 3)<br/><br/>(env: LLAMA_LOG_VERBOSITY) |
 115| `--log-prefix` | Enable prefix in log messages<br/>(env: LLAMA_LOG_PREFIX) |
 116| `--log-timestamps` | Enable timestamps in log messages<br/>(env: LLAMA_LOG_TIMESTAMPS) |
 117| `-ctkd, --cache-type-k-draft TYPE` | KV cache data type for K for the draft model<br/>allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1<br/>(default: f16)<br/>(env: LLAMA_ARG_CACHE_TYPE_K_DRAFT) |
 118| `-ctvd, --cache-type-v-draft TYPE` | KV cache data type for V for the draft model<br/>allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1<br/>(default: f16)<br/>(env: LLAMA_ARG_CACHE_TYPE_V_DRAFT) |
 119
 120
 121### Sampling params
 122
 123| Argument | Explanation |
 124| -------- | ----------- |
 125| `--samplers SAMPLERS` | samplers that will be used for generation in the order, separated by ';'<br/>(default: penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature) |
 126| `-s, --seed SEED` | RNG seed (default: -1, use random seed for -1) |
 127| `--sampler-seq, --sampling-seq SEQUENCE` | simplified sequence for samplers that will be used (default: edskypmxt) |
 128| `--ignore-eos` | ignore end of stream token and continue generating (implies --logit-bias EOS-inf) |
 129| `--temp N` | temperature (default: 0.80) |
 130| `--top-k N` | top-k sampling (default: 40, 0 = disabled)<br/>(env: LLAMA_ARG_TOP_K) |
 131| `--top-p N` | top-p sampling (default: 0.95, 1.0 = disabled) |
 132| `--min-p N` | min-p sampling (default: 0.05, 0.0 = disabled) |
 133| `--top-nsigma N` | top-n-sigma sampling (default: -1.00, -1.0 = disabled) |
 134| `--xtc-probability N` | xtc probability (default: 0.00, 0.0 = disabled) |
 135| `--xtc-threshold N` | xtc threshold (default: 0.10, 1.0 = disabled) |
 136| `--typical N` | locally typical sampling, parameter p (default: 1.00, 1.0 = disabled) |
 137| `--repeat-last-n N` | last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size) |
 138| `--repeat-penalty N` | penalize repeat sequence of tokens (default: 1.00, 1.0 = disabled) |
 139| `--presence-penalty N` | repeat alpha presence penalty (default: 0.00, 0.0 = disabled) |
 140| `--frequency-penalty N` | repeat alpha frequency penalty (default: 0.00, 0.0 = disabled) |
 141| `--dry-multiplier N` | set DRY sampling multiplier (default: 0.00, 0.0 = disabled) |
 142| `--dry-base N` | set DRY sampling base value (default: 1.75) |
 143| `--dry-allowed-length N` | set allowed length for DRY sampling (default: 2) |
 144| `--dry-penalty-last-n N` | set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 = context size) |
 145| `--dry-sequence-breaker STRING` | add sequence breaker for DRY sampling, clearing out default breakers ('\n', ':', '"', '*') in the process; use "none" to not use any sequence breakers |
 146| `--adaptive-target N` | adaptive-p: select tokens near this probability (valid range 0.0 to 1.0; negative = disabled) (default: -1.00)<br/>[(more info)](https://github.com/ggml-org/llama.cpp/pull/17927) |
 147| `--adaptive-decay N` | adaptive-p: decay rate for target adaptation over time. lower values are more reactive, higher values are more stable.<br/>(valid range 0.0 to 0.99) (default: 0.90) |
 148| `--dynatemp-range N` | dynamic temperature range (default: 0.00, 0.0 = disabled) |
 149| `--dynatemp-exp N` | dynamic temperature exponent (default: 1.00) |
 150| `--mirostat N` | use Mirostat sampling.<br/>Top K, Nucleus and Locally Typical samplers are ignored if used.<br/>(default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) |
 151| `--mirostat-lr N` | Mirostat learning rate, parameter eta (default: 0.10) |
 152| `--mirostat-ent N` | Mirostat target entropy, parameter tau (default: 5.00) |
 153| `-l, --logit-bias TOKEN_ID(+/-)BIAS` | modifies the likelihood of token appearing in the completion,<br/>i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',<br/>or `--logit-bias 15043-1` to decrease likelihood of token ' Hello' |
 154| `--grammar GRAMMAR` | BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '') |
 155| `--grammar-file FNAME` | file to read grammar from |
 156| `-j, --json-schema SCHEMA` | JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object<br/>For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
 157| `-jf, --json-schema-file FILE` | File containing a JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object<br/>For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
 158| `-bs, --backend-sampling` | enable backend sampling (experimental) (default: disabled)<br/>(env: LLAMA_ARG_BACKEND_SAMPLING) |
 159
 160
 161### Server-specific params
 162
 163| Argument | Explanation |
 164| -------- | ----------- |
 165| `--ctx-checkpoints, --swa-checkpoints N` | max number of context checkpoints to create per slot (default: 8)[(more info)](https://github.com/ggml-org/llama.cpp/pull/15293)<br/>(env: LLAMA_ARG_CTX_CHECKPOINTS) |
 166| `-cram, --cache-ram N` | set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 - disable)[(more info)](https://github.com/ggml-org/llama.cpp/pull/16391)<br/>(env: LLAMA_ARG_CACHE_RAM) |
 167| `-kvu, --kv-unified` | use single unified KV buffer shared across all sequences (default: enabled if number of slots is auto)<br/>(env: LLAMA_ARG_KV_UNIFIED) |
 168| `--context-shift, --no-context-shift` | whether to use context shift on infinite text generation (default: disabled)<br/>(env: LLAMA_ARG_CONTEXT_SHIFT) |
 169| `-r, --reverse-prompt PROMPT` | halt generation at PROMPT, return control in interactive mode |
 170| `-sp, --special` | special tokens output enabled (default: false) |
 171| `--warmup, --no-warmup` | whether to perform warmup with an empty run (default: enabled) |
 172| `--spm-infill` | use Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this. (default: disabled) |
 173| `--pooling {none,mean,cls,last,rank}` | pooling type for embeddings, use model default if unspecified<br/>(env: LLAMA_ARG_POOLING) |
 174| `-np, --parallel N` | number of server slots (default: -1, -1 = auto)<br/>(env: LLAMA_ARG_N_PARALLEL) |
 175| `-cb, --cont-batching, -nocb, --no-cont-batching` | whether to enable continuous batching (a.k.a dynamic batching) (default: enabled)<br/>(env: LLAMA_ARG_CONT_BATCHING) |
 176| `-mm, --mmproj FILE` | path to a multimodal projector file. see tools/mtmd/README.md<br/>note: if -hf is used, this argument can be omitted<br/>(env: LLAMA_ARG_MMPROJ) |
 177| `-mmu, --mmproj-url URL` | URL to a multimodal projector file. see tools/mtmd/README.md<br/>(env: LLAMA_ARG_MMPROJ_URL) |
 178| `--mmproj-auto, --no-mmproj, --no-mmproj-auto` | whether to use multimodal projector file (if available), useful when using -hf (default: enabled)<br/>(env: LLAMA_ARG_MMPROJ_AUTO) |
 179| `--mmproj-offload, --no-mmproj-offload` | whether to enable GPU offloading for multimodal projector (default: enabled)<br/>(env: LLAMA_ARG_MMPROJ_OFFLOAD) |
 180| `--image-min-tokens N` | minimum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MIN_TOKENS) |
 181| `--image-max-tokens N` | maximum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MAX_TOKENS) |
 182| `-otd, --override-tensor-draft <tensor name pattern>=<buffer type>,...` | override tensor buffer type for draft model |
 183| `-cmoed, --cpu-moe-draft` | keep all Mixture of Experts (MoE) weights in the CPU for the draft model<br/>(env: LLAMA_ARG_CPU_MOE_DRAFT) |
 184| `-ncmoed, --n-cpu-moe-draft N` | keep the Mixture of Experts (MoE) weights of the first N layers in the CPU for the draft model<br/>(env: LLAMA_ARG_N_CPU_MOE_DRAFT) |
 185| `-a, --alias STRING` | set alias for model name (to be used by REST API)<br/>(env: LLAMA_ARG_ALIAS) |
 186| `--host HOST` | ip address to listen, or bind to an UNIX socket if the address ends with .sock (default: 127.0.0.1)<br/>(env: LLAMA_ARG_HOST) |
 187| `--port PORT` | port to listen (default: 8080)<br/>(env: LLAMA_ARG_PORT) |
 188| `--path PATH` | path to serve static files from (default: )<br/>(env: LLAMA_ARG_STATIC_PATH) |
 189| `--api-prefix PREFIX` | prefix path the server serves from, without the trailing slash (default: )<br/>(env: LLAMA_ARG_API_PREFIX) |
 190| `--webui-config JSON` | JSON that provides default WebUI settings (overrides WebUI defaults)<br/>(env: LLAMA_ARG_WEBUI_CONFIG) |
 191| `--webui-config-file PATH` | JSON file that provides default WebUI settings (overrides WebUI defaults)<br/>(env: LLAMA_ARG_WEBUI_CONFIG_FILE) |
 192| `--webui, --no-webui` | whether to enable the Web UI (default: enabled)<br/>(env: LLAMA_ARG_WEBUI) |
 193| `--embedding, --embeddings` | restrict to only support embedding use case; use only with dedicated embedding models (default: disabled)<br/>(env: LLAMA_ARG_EMBEDDINGS) |
 194| `--rerank, --reranking` | enable reranking endpoint on server (default: disabled)<br/>(env: LLAMA_ARG_RERANKING) |
 195| `--api-key KEY` | API key to use for authentication, multiple keys can be provided as a comma-separated list (default: none)<br/>(env: LLAMA_API_KEY) |
 196| `--api-key-file FNAME` | path to file containing API keys (default: none) |
 197| `--ssl-key-file FNAME` | path to file a PEM-encoded SSL private key<br/>(env: LLAMA_ARG_SSL_KEY_FILE) |
 198| `--ssl-cert-file FNAME` | path to file a PEM-encoded SSL certificate<br/>(env: LLAMA_ARG_SSL_CERT_FILE) |
 199| `--chat-template-kwargs STRING` | sets additional params for the json template parser, must be a valid json object string, e.g. '{"key1":"value1","key2":"value2"}'<br/>(env: LLAMA_CHAT_TEMPLATE_KWARGS) |
 200| `-to, --timeout N` | server read/write timeout in seconds (default: 600)<br/>(env: LLAMA_ARG_TIMEOUT) |
 201| `--threads-http N` | number of threads used to process HTTP requests (default: -1)<br/>(env: LLAMA_ARG_THREADS_HTTP) |
 202| `--cache-prompt, --no-cache-prompt` | whether to enable prompt caching (default: enabled)<br/>(env: LLAMA_ARG_CACHE_PROMPT) |
 203| `--cache-reuse N` | min chunk size to attempt reusing from the cache via KV shifting, requires prompt caching to be enabled (default: 0)<br/>[(card)](https://ggml.ai/f0.png)<br/>(env: LLAMA_ARG_CACHE_REUSE) |
 204| `--metrics` | enable prometheus compatible metrics endpoint (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_METRICS) |
 205| `--props` | enable changing global properties via POST /props (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_PROPS) |
 206| `--slots, --no-slots` | expose slots monitoring endpoint (default: enabled)<br/>(env: LLAMA_ARG_ENDPOINT_SLOTS) |
 207| `--slot-save-path PATH` | path to save slot kv cache (default: disabled) |
 208| `--media-path PATH` | directory for loading local media files; files can be accessed via file:// URLs using relative paths (default: disabled) |
 209| `--models-dir PATH` | directory containing models for the router server (default: disabled)<br/>(env: LLAMA_ARG_MODELS_DIR) |
 210| `--models-preset PATH` | path to INI file containing model presets for the router server (default: disabled)<br/>(env: LLAMA_ARG_MODELS_PRESET) |
 211| `--models-max N` | for router server, maximum number of models to load simultaneously (default: 4, 0 = unlimited)<br/>(env: LLAMA_ARG_MODELS_MAX) |
 212| `--models-autoload, --no-models-autoload` | for router server, whether to automatically load models (default: enabled)<br/>(env: LLAMA_ARG_MODELS_AUTOLOAD) |
 213| `--jinja, --no-jinja` | whether to use jinja template engine for chat (default: enabled)<br/>(env: LLAMA_ARG_JINJA) |
 214| `--reasoning-format FORMAT` | controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:<br/>- none: leaves thoughts unparsed in `message.content`<br/>- deepseek: puts thoughts in `message.reasoning_content`<br/>- deepseek-legacy: keeps `<think>` tags in `message.content` while also populating `message.reasoning_content`<br/>(default: auto)<br/>(env: LLAMA_ARG_THINK) |
 215| `--reasoning-budget N` | controls the amount of thinking allowed; currently only one of: -1 for unrestricted thinking budget, or 0 to disable thinking (default: -1)<br/>(env: LLAMA_ARG_THINK_BUDGET) |
 216| `--chat-template JINJA_TEMPLATE` | set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
 217| `--chat-template-file JINJA_TEMPLATE_FILE` | set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone-moe, exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite, grok-2, hunyuan-dense, hunyuan-moe, kimi-k2, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |
 218| `--prefill-assistant, --no-prefill-assistant` | whether to prefill the assistant's response if the last message is an assistant message (default: prefill enabled)<br/>when this flag is set, if the last message is an assistant message then it will be treated as a full message and not prefilled<br/><br/>(env: LLAMA_ARG_PREFILL_ASSISTANT) |
 219| `-sps, --slot-prompt-similarity SIMILARITY` | how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.10, 0.0 = disabled) |
 220| `--lora-init-without-apply` | load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled) |
 221| `--sleep-idle-seconds SECONDS` | number of seconds of idleness after which the server will sleep (default: -1; -1 = disabled) |
 222| `-td, --threads-draft N` | number of threads to use during generation (default: same as --threads) |
 223| `-tbd, --threads-batch-draft N` | number of threads to use during batch and prompt processing (default: same as --threads-draft) |
 224| `--draft, --draft-n, --draft-max N` | number of tokens to draft for speculative decoding (default: 16)<br/>(env: LLAMA_ARG_DRAFT_MAX) |
 225| `--draft-min, --draft-n-min N` | minimum number of draft tokens to use for speculative decoding (default: 0)<br/>(env: LLAMA_ARG_DRAFT_MIN) |
 226| `--draft-p-min P` | minimum speculative decoding probability (greedy) (default: 0.75)<br/>(env: LLAMA_ARG_DRAFT_P_MIN) |
 227| `-cd, --ctx-size-draft N` | size of the prompt context for the draft model (default: 0, 0 = loaded from model)<br/>(env: LLAMA_ARG_CTX_SIZE_DRAFT) |
 228| `-devd, --device-draft <dev1,dev2,..>` | comma-separated list of devices to use for offloading the draft model (none = don't offload)<br/>use --list-devices to see a list of available devices |
 229| `-ngld, --gpu-layers-draft, --n-gpu-layers-draft N` | max. number of draft model layers to store in VRAM, either an exact number, 'auto', or 'all' (default: auto)<br/>(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT) |
 230| `-md, --model-draft FNAME` | draft model for speculative decoding (default: unused)<br/>(env: LLAMA_ARG_MODEL_DRAFT) |
 231| `--spec-replace TARGET DRAFT` | translate the string in TARGET into DRAFT if the draft model and main model are not compatible |
 232| `-mv, --model-vocoder FNAME` | vocoder model for audio generation (default: unused) |
 233| `--tts-use-guide-tokens` | Use guide tokens to improve TTS word recall |
 234| `--embd-gemma-default` | use default EmbeddingGemma model (note: can download weights from the internet) |
 235| `--fim-qwen-1.5b-default` | use default Qwen 2.5 Coder 1.5B (note: can download weights from the internet) |
 236| `--fim-qwen-3b-default` | use default Qwen 2.5 Coder 3B (note: can download weights from the internet) |
 237| `--fim-qwen-7b-default` | use default Qwen 2.5 Coder 7B (note: can download weights from the internet) |
 238| `--fim-qwen-7b-spec` | use Qwen 2.5 Coder 7B + 0.5B draft for speculative decoding (note: can download weights from the internet) |
 239| `--fim-qwen-14b-spec` | use Qwen 2.5 Coder 14B + 0.5B draft for speculative decoding (note: can download weights from the internet) |
 240| `--fim-qwen-30b-default` | use default Qwen 3 Coder 30B A3B Instruct (note: can download weights from the internet) |
 241| `--gpt-oss-20b-default` | use gpt-oss-20b (note: can download weights from the internet) |
 242| `--gpt-oss-120b-default` | use gpt-oss-120b (note: can download weights from the internet) |
 243| `--vision-gemma-4b-default` | use Gemma 3 4B QAT (note: can download weights from the internet) |
 244| `--vision-gemma-12b-default` | use Gemma 3 12B QAT (note: can download weights from the internet) |
 245
 246<!-- HELP_END -->
 247
Note: If a command line argument and an environment variable are both set for the same param, the argument takes precedence over the env var.
 249
 250For boolean options like `--mmap` or `--kv-offload`, the environment variable is handled as shown in this example:
 251- `LLAMA_ARG_MMAP=true` means enabled, other accepted values are: `1`, `on`, `enabled`
 252- `LLAMA_ARG_MMAP=false` means disabled, other accepted values are: `0`, `off`, `disabled`
 253- If `LLAMA_ARG_NO_MMAP` is present (no matter the value), it means disabling mmap
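
For example, the following is equivalent to passing `--no-mmap` on the command line (the model path is illustrative):

```bash
# disable memory mapping of the model via the environment instead of --no-mmap
LLAMA_ARG_MMAP=off ./llama-server -m models/7B/ggml-model.gguf
```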
 254
 255Example usage of docker compose with environment variables:
 256
 257```yml
 258services:
 259  llamacpp-server:
 260    image: ghcr.io/ggml-org/llama.cpp:server
 261    ports:
 262      - 8080:8080
 263    volumes:
 264      - ./models:/models
 265    environment:
 266      # alternatively, you can use "LLAMA_ARG_MODEL_URL" to download the model
 267      LLAMA_ARG_MODEL: /models/my_model.gguf
 268      LLAMA_ARG_CTX_SIZE: 4096
 269      LLAMA_ARG_N_PARALLEL: 2
 270      LLAMA_ARG_ENDPOINT_METRICS: 1
 271      LLAMA_ARG_PORT: 8080
 272```
 273
 274### Multimodal support
 275
 276Multimodal support was added in [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) and is currently an experimental feature.
 277It is currently available in the following endpoints:
 278- The OAI-compatible chat endpoint.
 279- The non-OAI-compatible completions endpoint.
 280- The non-OAI-compatible embeddings endpoint.
 281
 282For more details, please refer to [multimodal documentation](../../docs/multimodal.md)
 283
 284## Build
 285
 286`llama-server` is built alongside everything else from the root of the project
 287
 288- Using `CMake`:
 289
 290  ```bash
 291  cmake -B build
 292  cmake --build build --config Release -t llama-server
 293  ```
 294
 295  Binary is at `./build/bin/llama-server`
 296
 297## Build with SSL
 298
 299`llama-server` can also be built with SSL support using OpenSSL 3
 300
 301- Using `CMake`:
 302
 303  ```bash
 304  cmake -B build -DLLAMA_OPENSSL=ON
 305  cmake --build build --config Release -t llama-server
 306  ```
 307
 308## Quick Start
 309
 310To get started right away, run the following command, making sure to use the correct path for the model you have:
 311
 312### Unix-based systems (Linux, macOS, etc.)
 313
 314```bash
 315./llama-server -m models/7B/ggml-model.gguf -c 2048
 316```
 317
 318### Windows
 319
 320```powershell
 321llama-server.exe -m models\7B\ggml-model.gguf -c 2048
 322```
 323
The above command starts a server that by default listens on `127.0.0.1:8080`.
You can consume the endpoints with any HTTP client, for example Postman or Node.js with the axios library. You can visit the web front end at the same URL.
 326
 327### Docker
 328
 329```bash
 330docker run -p 8080:8080 -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080
 331
 332# or, with CUDA:
 333docker run -p 8080:8080 -v /path/to/models:/models --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
 334```
 335
 336## Using with CURL
 337
 338Using [curl](https://curl.se/). On Windows, `curl.exe` should be available in the base OS.
 339
 340```sh
 341curl --request POST \
 342    --url http://localhost:8080/completion \
 343    --header "Content-Type: application/json" \
 344    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
 345```
 346
 347## API Endpoints
 348
 349### GET `/health`: Returns health check result
 350
 351This endpoint is public (no API key check). `/v1/health` also works.
 352
 353**Response format**
 354
 355- HTTP status code 503
 356  - Body: `{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}`
 357  - Explanation: the model is still being loaded.
 358- HTTP status code 200
 359  - Body: `{"status": "ok" }`
 360  - Explanation: the model is successfully loaded and the server is ready.
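
A minimal check from the command line (assuming the default `127.0.0.1:8080` address):

```bash
# returns {"status":"ok"} once the model has finished loading
curl http://localhost:8080/health
```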
 361
 362### POST `/completion`: Given a `prompt`, it returns the predicted completion.
 363
 364> [!IMPORTANT]
 365>
 366> This endpoint is **not** OAI-compatible. For OAI-compatible client, use `/v1/completions` instead.
 367
 368*Options:*
 369
 370`prompt`: Provide the prompt for this completion as a string or as an array of strings or numbers representing tokens. Internally, if `cache_prompt` is `true`, the prompt is compared to the previous completion and only the "unseen" suffix is evaluated. A `BOS` token is inserted at the start, if all of the following conditions are true:
 371
 372  - The prompt is a string or an array with the first element given as a string
 373  - The model's `tokenizer.ggml.add_bos_token` metadata is `true`
 374
 375These input shapes and data type are allowed for `prompt`:
 376
 377  - Single string: `"string"`
 378  - Single sequence of tokens: `[12, 34, 56]`
 379  - Mixed tokens and strings: `[12, 34, "string", 56, 78]`
 380  - A JSON object which optionally contains multimodal data: `{ "prompt_string": "string", "multimodal_data": ["base64"] }`
 381
 382Multiple prompts are also supported. In this case, the completion result will be an array.
 383
 384  - Only strings: `["string1", "string2"]`
 385  - Strings, JSON objects, and sequences of tokens: `["string1", [12, 34, 56], { "prompt_string": "string", "multimodal_data": ["base64"]}]`
 386  - Mixed types: `[[12, 34, "string", 56, 78], [12, 34, 56], "string", { "prompt_string": "string" }]`
 387
Note on `multimodal_data` in JSON object prompts: this should be an array of strings containing base64-encoded multimodal data such as images and audio. The string prompt element must contain an identical number of MTMD media markers, which act as placeholders for the data provided in this parameter; the multimodal data files are substituted in order. The marker string (e.g. `<__media__>`) can be found by calling `mtmd_default_marker()` defined in [the MTMD C API](https://github.com/ggml-org/llama.cpp/blob/5fd160bbd9d70b94b5b11b0001fd7f477005e4a0/tools/mtmd/mtmd.h#L87). A client *must not* specify this field unless the server has the multimodal capability. Clients should check `/models` or `/v1/models` for the `multimodal` capability before making a multimodal request.
 389
 390`temperature`: Adjust the randomness of the generated text. Default: `0.8`
 391
 392`dynatemp_range`: Dynamic temperature range. The final temperature will be in the range of `[temperature - dynatemp_range; temperature + dynatemp_range]` Default: `0.0`, which is disabled.
 393
 394`dynatemp_exponent`: Dynamic temperature exponent. Default: `1.0`
 395
 396`top_k`: Limit the next token selection to the K most probable tokens.  Default: `40`
 397
 398`top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P. Default: `0.95`
 399
 400`min_p`: The minimum probability for a token to be considered, relative to the probability of the most likely token. Default: `0.05`
 401
 402`n_predict`: Set the maximum number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. Default: `-1`, where `-1` is infinity.
 403
 404`n_indent`: Specify the minimum line indentation for the generated text in number of whitespace characters. Useful for code completion tasks. Default: `0`
 405
 406`n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. The number excludes the BOS token.
 407By default, this value is set to `0`, meaning no tokens are kept. Use `-1` to retain all tokens from the prompt.
 408
 409`n_cmpl`: Number of completions to generate from the current prompt. If input has multiple prompts, the output will have N prompts times `n_cmpl` entries.
 410
 411`n_cache_reuse`: Min chunk size to attempt reusing from the cache via KV shifting. For more info, see `--cache-reuse` arg. Default: `0`, which is disabled.
 412
 413`stream`: Allows receiving each predicted token in real-time instead of waiting for the completion to finish (uses a different response format). To enable this, set to `true`.
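
For example, a streamed variant of the earlier curl request (prompt and `n_predict` are illustrative):

```bash
# responses arrive incrementally as server-sent events ("data: {...}" lines)
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 64, "stream": true}'
```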
 414
 415`stop`: Specify a JSON array of stopping strings.
 416These words will not be included in the completion, so make sure to add them to the prompt for the next iteration. Default: `[]`
 417
 418`typical_p`: Enable locally typical sampling with parameter p. Default: `1.0`, which is disabled.
 419
`repeat_penalty`: Control the repetition of token sequences in the generated text. Default: `1.0`, which is disabled.
 421
 422`repeat_last_n`: Last n tokens to consider for penalizing repetition. Default: `64`, where `0` is disabled and `-1` is ctx-size.
 423
 424`presence_penalty`: Repeat alpha presence penalty. Default: `0.0`, which is disabled.
 425
 426`frequency_penalty`: Repeat alpha frequency penalty. Default: `0.0`, which is disabled.
 427
 428`dry_multiplier`: Set the DRY (Don't Repeat Yourself) repetition penalty multiplier. Default: `0.0`, which is disabled.
 429
 430`dry_base`: Set the DRY repetition penalty base value. Default: `1.75`
 431
 432`dry_allowed_length`: Tokens that extend repetition beyond this receive exponentially increasing penalty: multiplier * base ^ (length of repeating sequence before token - allowed length). Default: `2`
 433
 434`dry_penalty_last_n`: How many tokens to scan for repetitions. Default: `-1`, where `0` is disabled and `-1` is context size.
 435
 436`dry_sequence_breakers`: Specify an array of sequence breakers for DRY sampling. Only a JSON array of strings is accepted. Default: `['\n', ':', '"', '*']`
 437
 438`xtc_probability`: Set the chance for token removal via XTC sampler. Default: `0.0`, which is disabled.
 439
 440`xtc_threshold`: Set a minimum probability threshold for tokens to be removed via XTC sampler. Default: `0.1` (> `0.5` disables XTC)
 441
 442`mirostat`: Enable Mirostat sampling, controlling perplexity during text generation. Default: `0`, where `0` is disabled, `1` is Mirostat, and `2` is Mirostat 2.0.
 443
 444`mirostat_tau`: Set the Mirostat target entropy, parameter tau. Default: `5.0`
 445
 446`mirostat_eta`: Set the Mirostat learning rate, parameter eta.  Default: `0.1`
 447
 448`grammar`: Set grammar for grammar-based sampling.  Default: no grammar
 449
 450`json_schema`: Set a JSON schema for grammar-based sampling (e.g. `{"items": {"type": "string"}, "minItems": 10, "maxItems": 100}` of a list of strings, or `{}` for any JSON). See [tests](../../tests/test-json-schema-to-grammar.cpp) for supported features.  Default: no JSON schema.
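
As a sketch, a request constrained by the schema from the example above (prompt and `n_predict` are illustrative):

```bash
# the generated text is forced to be a JSON array of strings
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "List some fruits:", "n_predict": 128, "json_schema": {"items": {"type": "string"}, "minItems": 10, "maxItems": 100}}'
```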
 451
 452`seed`: Set the random number generator (RNG) seed.  Default: `-1`, which is a random seed.
 453
 454`ignore_eos`: Ignore end of stream token and continue generating.  Default: `false`
 455
 456`logit_bias`: Modify the likelihood of a token appearing in the generated text completion. For example, use `"logit_bias": [[15043,1.0]]` to increase the likelihood of the token 'Hello', or `"logit_bias": [[15043,-1.0]]` to decrease its likelihood. Setting the value to false, `"logit_bias": [[15043,false]]` ensures that the token `Hello` is never produced. The tokens can also be represented as strings, e.g. `[["Hello, World!",-0.5]]` will reduce the likelihood of all the individual tokens that represent the string `Hello, World!`, just like the `presence_penalty` does. For compatibility with the OpenAI API, a JSON object {"<string or token id>": bias, ...} can also be passed. Default: `[]`
 457
 458`n_probs`: If greater than 0, the response also contains the probabilities of top N tokens for each generated token given the sampling settings. Note that for temperature < 0 the tokens are sampled greedily but token probabilities are still being calculated via a simple softmax of the logits without considering any other sampler settings. Default: `0`
 459
 460`min_keep`: If greater than 0, force samplers to return N possible tokens at minimum. Default: `0`
 461
 462`t_max_predict_ms`: Set a time limit in milliseconds for the prediction (a.k.a. text-generation) phase. The timeout will trigger if the generation takes more than the specified time (measured since the first token was generated) and if a new-line character has already been generated. Useful for FIM applications. Default: `0`, which is disabled.
 463
`id_slot`: Assign the completion task to a specific slot. If set to `-1`, the task will be assigned to an idle slot. Default: `-1`
 465
`cache_prompt`: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation), enabling this option can cause nondeterministic results. Default: `true`
 467
 468`return_tokens`: Return the raw generated token ids in the `tokens` field. Otherwise `tokens` remains empty. Default: `false`
 469
 470`samplers`: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. Default: `["dry", "top_k", "typ_p", "top_p", "min_p", "xtc", "temperature"]` - these are all the available values.
 471
 472`timings_per_token`: Include prompt processing and text generation speed information in each response.  Default: `false`
 473
 474`return_progress`: Include prompt processing progress in `stream` mode. The progress will be contained inside `prompt_progress` with 4 values: `total`, `cache`, `processed`, and `time_ms`. The overall progress is `processed/total`, while the actual timed progress is `(processed-cache)/(total-cache)`. The `time_ms` field contains the elapsed time in milliseconds since prompt processing started. Default: `false`
 475
 476`post_sampling_probs`: Returns the probabilities of top `n_probs` tokens after applying sampling chain.
 477
 478`response_fields`: A list of response fields, for example: `"response_fields": ["content", "generation_settings/n_predict"]`. If the specified field is missing, it will simply be omitted from the response without triggering an error. Note that fields with a slash will be unnested; for example, `generation_settings/n_predict` will move the field `n_predict` from the `generation_settings` object to the root of the response and give it a new name.
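
For instance, a request that keeps only the two fields mentioned above (prompt and `n_predict` are illustrative):

```bash
# the response contains only "content" and an unnested "n_predict"
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Hello", "n_predict": 8, "response_fields": ["content", "generation_settings/n_predict"]}'
```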
 479
 480`lora`: A list of LoRA adapters to be applied to this specific request. Each object in the list must contain `id` and `scale` fields. For example: `[{"id": 0, "scale": 0.5}, {"id": 1, "scale": 1.1}]`. If a LoRA adapter is not specified in the list, its scale will default to `0.0`. Please note that requests with different LoRA configurations will not be batched together, which may result in performance degradation.
 481
 482**Response format**
 483
 484- Note: In streaming mode (`stream`), only `content`, `tokens` and `stop` will be returned until end of completion. Responses are sent using the [Server-sent events](https://html.spec.whatwg.org/multipage/server-sent-events.html) standard. Note: the browser's `EventSource` interface cannot be used due to its lack of `POST` request support.
 485
 486- `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has a nested array `top_logprobs`. It contains at **maximum** `n_probs` elements:
 487  ```
 488  {
 489    "content": "<the generated completion text>",
 490    "tokens": [ generated token ids if requested ],
 491    ...
 492    "probs": [
 493      {
 494        "id": <token id>,
 495        "logprob": float,
 496        "token": "<most likely token>",
 497        "bytes": [int, int, ...],
 498        "top_logprobs": [
 499          {
 500            "id": <token id>,
 501            "logprob": float,
 502            "token": "<token text>",
 503            "bytes": [int, int, ...],
 504          },
 505          {
 506            "id": <token id>,
 507            "logprob": float,
 508            "token": "<token text>",
 509            "bytes": [int, int, ...],
 510          },
 511          ...
 512        ]
 513      },
 514      {
 515        "id": <token id>,
 516        "logprob": float,
 517        "token": "<most likely token>",
 518        "bytes": [int, int, ...],
 519        "top_logprobs": [
 520          ...
 521        ]
 522      },
 523      ...
 524    ]
 525  },
 526  ```
 527  Please note that if `post_sampling_probs` is set to `true`:
 528    - `logprob` will be replaced with `prob`, with the value between 0.0 and 1.0
 529    - `top_logprobs` will be replaced with `top_probs`. Each element contains:
 530      - `id`: token ID
 531      - `token`: token in string
 532      - `bytes`: token in bytes
 533      - `prob`: token probability, with the value between 0.0 and 1.0
 534    - Number of elements in `top_probs` may be less than `n_probs`
 535
 536- `content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.
 537- `tokens`: Same as `content` but represented as raw token ids. Only populated if `"return_tokens": true` or `"stream": true` in the request.
 538- `stop`: Boolean for use with `stream` to check whether the generation has stopped (Note: This is not related to stopping words array `stop` from input options)
 539- `generation_settings`: The provided options above excluding `prompt` but including `n_ctx`, `model`. These options may differ from the original ones in some way (e.g. bad values filtered out, strings converted to tokens, etc.).
 540- `model`: The model alias (for model path, please use `/props` endpoint)
 541- `prompt`: The processed `prompt` (special tokens may be added)
 542- `stop_type`: Indicating whether the completion has stopped. Possible values are:
 543  - `none`: Generating (not stopped)
 544  - `eos`: Stopped because it encountered the EOS token
 545  - `limit`: Stopped because `n_predict` tokens were generated before stop words or EOS was encountered
 546  - `word`: Stopped due to encountering a stopping word from `stop` JSON array provided
 547- `stopping_word`: The stopping word encountered which stopped the generation (or "" if not stopped due to a stopping word)
 548- `timings`: Hash of timing information about the completion such as the number of tokens `predicted_per_second`
 549- `tokens_cached`: Number of tokens from the prompt which could be re-used from previous completion
 550- `tokens_evaluated`: Number of tokens evaluated in total from the prompt
`truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens_predicted`) exceeded the context size (`n_ctx`)
 552
 553
 554### POST `/tokenize`: Tokenize a given text
 555
 556*Options:*
 557
 558`content`: (Required) The text to tokenize.
 559
 560`add_special`: (Optional) Boolean indicating if special tokens, i.e. `BOS`, should be inserted.  Default: `false`
 561
 562`parse_special`: (Optional) Boolean indicating if special tokens should be tokenized. When `false` special tokens are treated as plaintext.  Default: `true`
 563
 564`with_pieces`: (Optional) Boolean indicating whether to return token pieces along with IDs.  Default: `false`
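
Example request (input text is illustrative):

```bash
# tokenize a string and also return the text piece for each token id
curl --request POST \
    --url http://localhost:8080/tokenize \
    --header "Content-Type: application/json" \
    --data '{"content": "Hello world!", "with_pieces": true}'
```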
 565
 566**Response:**
 567
 568Returns a JSON object with a `tokens` field containing the tokenization result. The `tokens` array contains either just token IDs or objects with `id` and `piece` fields, depending on the `with_pieces` parameter. The piece field is a string if the piece is valid unicode or a list of bytes otherwise.
 569
 570
 571If `with_pieces` is `false`:
 572```json
 573{
 574  "tokens": [123, 456, 789]
 575}
 576```
 577
 578If `with_pieces` is `true`:
 579```json
 580{
 581  "tokens": [
 582    {"id": 123, "piece": "Hello"},
 583    {"id": 456, "piece": " world"},
 584    {"id": 789, "piece": "!"}
 585  ]
 586}
 587```
 588
With input 'á' (utf8 hex: C3 A1) on tinyllama/stories260k:
 590```
 591{
 592  "tokens": [
 593    {"id": 198, "piece": [195]}, // hex C3
 594    {"id": 164, "piece": [161]} // hex A1
 595  ]
 596}
 597```
 598
 599### POST `/detokenize`: Convert tokens to text
 600
 601*Options:*
 602
 603`tokens`: Set the tokens to detokenize.
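
Example request, using the token ids from the `/tokenize` example above (values are illustrative):

```bash
# returns the text reconstructed from the given token ids
curl --request POST \
    --url http://localhost:8080/detokenize \
    --header "Content-Type: application/json" \
    --data '{"tokens": [123, 456, 789]}'
```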
 604
 605### POST `/apply-template`: Apply chat template to a conversation
 606
 607Uses the server's prompt template formatting functionality to convert chat messages to a single string expected by a chat model as input, but does not perform inference. Instead, the prompt string is returned in the `prompt` field of the JSON response. The prompt can then be modified as desired (for example, to insert "Sure!" at the beginning of the model's response) before sending to `/completion` to generate the chat response.
 608
 609*Options:*
 610
 611`messages`: (Required) Chat turns in the same format as `/v1/chat/completions`.
 612
 613**Response format**
 614
 615Returns a JSON object with a field `prompt` containing a string of the input messages formatted according to the model's chat template format.
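
A minimal example (message content is illustrative):

```bash
# returns {"prompt": "<messages formatted with the model's chat template>"}
curl --request POST \
    --url http://localhost:8080/apply-template \
    --header "Content-Type: application/json" \
    --data '{"messages": [{"role": "user", "content": "Write a haiku about autumn."}]}'
```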
 616
 617### POST `/embedding`: Generate embedding of a given text
 618
 619> [!IMPORTANT]
 620>
 621> This endpoint is **not** OAI-compatible. For OAI-compatible client, use `/v1/embeddings` instead.
 622
Works the same as [the embedding example](../embedding).
 624
 625This endpoint also supports multimodal embeddings. See the documentation for the `/completions` endpoint for details on how to send a multimodal prompt.
 626
 627*Options:*
 628
 629`content`: Set the text to process.
 630
 631`embd_normalize`: Normalization for pooled embeddings. Can be one of the following values:
 632```
 633  -1: No normalization
 634   0: Max absolute
 635   1: Taxicab
 636   2: Euclidean/L2
 637  >2: P-Norm
 638```
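
Example request (input text is illustrative):

```bash
# returns the pooled embedding vector for the given text
curl --request POST \
    --url http://localhost:8080/embedding \
    --header "Content-Type: application/json" \
    --data '{"content": "What is the capital of France?"}'
```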
 639
 640### POST `/reranking`: Rerank documents according to a given query
 641
 642Similar to https://jina.ai/reranker/ but might change in the future.
 643Requires a reranker model (such as [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)) and the `--embedding --pooling rank` options.
 644
 645*Options:*
 646
 647`query`: The query against which the documents will be ranked.
 648
`documents`: An array of strings representing the documents to be ranked.
 650
 651*Aliases:*
 652  - `/rerank`
 653  - `/v1/rerank`
 654  - `/v1/reranking`
 655
 656*Examples:*
 657
 658```shell
curl http://127.0.0.1:8012/v1/rerank \
    -H "Content-Type: application/json" \
    -d '{
        "model": "some-model",
        "query": "What is panda?",
        "top_n": 3,
        "documents": [
            "hi",
            "it is a bear",
            "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China."
        ]
    }' | jq
 671```
 672
 673### POST `/infill`: For code infilling.
 674
Takes a prefix and a suffix and returns the predicted completion as a stream.
 676
 677*Options:*
 678
 679- `input_prefix`: Set the prefix of the code to infill.
 680- `input_suffix`: Set the suffix of the code to infill.
 681- `input_extra`:  Additional context inserted before the FIM prefix.
 682- `prompt`:       Added after the `FIM_MID` token
 683
`input_extra` is an array of `{"filename": string, "text": string}` objects.
 685
 686The endpoint also accepts all the options of `/completion`.
 687
 688If the model has `FIM_REPO` and `FIM_FILE_SEP` tokens, the [repo-level pattern](https://arxiv.org/pdf/2409.12186) is used:
 689
 690```txt
 691<FIM_REP>myproject
 692<FIM_SEP>{chunk 0 filename}
 693{chunk 0 text}
 694<FIM_SEP>{chunk 1 filename}
 695{chunk 1 text}
 696...
 697<FIM_SEP>filename
 698<FIM_PRE>[input_prefix]<FIM_SUF>[input_suffix]<FIM_MID>[prompt]
 699```
 700
 701If the tokens are missing, then the extra context is simply prefixed at the start:
 702
 703```txt
 704[input_extra]<FIM_PRE>[input_prefix]<FIM_SUF>[input_suffix]<FIM_MID>[prompt]
 705```
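
A minimal sketch of an infill request (code snippets and `n_predict` are illustrative):

```bash
# the server assembles the FIM prompt from the pieces and returns the completion as a stream
curl --request POST \
    --url http://localhost:8080/infill \
    --header "Content-Type: application/json" \
    --data '{
        "input_prefix": "def add(a, b):\n    ",
        "input_suffix": "\n    return result\n",
        "input_extra": [{"filename": "utils.py", "text": "# math helpers\n"}],
        "n_predict": 32
    }'
```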
 706
 707### **GET** `/props`: Get server global properties.
 708
By default, this endpoint is read-only. To change global properties via a POST request, you need to start the server with `--props`.
 710
 711**Response format**
 712
 713```json
 714{
 715  "default_generation_settings": {
 716    "id": 0,
 717    "id_task": -1,
 718    "n_ctx": 1024,
 719    "speculative": false,
 720    "is_processing": false,
 721    "params": {
 722      "n_predict": -1,
 723      "seed": 4294967295,
 724      "temperature": 0.800000011920929,
 725      "dynatemp_range": 0.0,
 726      "dynatemp_exponent": 1.0,
 727      "top_k": 40,
 728      "top_p": 0.949999988079071,
 729      "min_p": 0.05000000074505806,
 730      "xtc_probability": 0.0,
 731      "xtc_threshold": 0.10000000149011612,
 732      "typical_p": 1.0,
 733      "repeat_last_n": 64,
 734      "repeat_penalty": 1.0,
 735      "presence_penalty": 0.0,
 736      "frequency_penalty": 0.0,
 737      "dry_multiplier": 0.0,
 738      "dry_base": 1.75,
 739      "dry_allowed_length": 2,
 740      "dry_penalty_last_n": -1,
 741      "dry_sequence_breakers": [
 742        "\n",
 743        ":",
 744        "\"",
 745        "*"
 746      ],
 747      "mirostat": 0,
 748      "mirostat_tau": 5.0,
 749      "mirostat_eta": 0.10000000149011612,
 750      "stop": [],
 751      "max_tokens": -1,
 752      "n_keep": 0,
 753      "n_discard": 0,
 754      "ignore_eos": false,
 755      "stream": true,
 756      "n_probs": 0,
 757      "min_keep": 0,
 758      "grammar": "",
 759      "samplers": [
 760        "dry",
 761        "top_k",
 762        "typ_p",
 763        "top_p",
 764        "min_p",
 765        "xtc",
 766        "temperature"
 767      ],
 768      "speculative.n_max": 16,
 769      "speculative.n_min": 5,
 770      "speculative.p_min": 0.8999999761581421,
 771      "timings_per_token": false
 772    },
 773    "prompt": "",
 774    "next_token": {
 775      "has_next_token": true,
 776      "has_new_line": false,
 777      "n_remain": -1,
 778      "n_decoded": 0,
 779      "stopping_word": ""
 780    }
 781  },
 782  "total_slots": 1,
 783  "model_path": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
 784  "chat_template": "...",
 785  "chat_template_caps": {},
 786  "modalities": {
 787    "vision": false
 788  },
 789  "build_info": "b(build number)-(build commit hash)",
 790  "is_sleeping": false
 791}
 792```
 793
 794- `default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
- `total_slots` - the total number of slots for processing requests (defined by the `--parallel` option)
- `model_path` - the path to the model file (same as the `-m` argument)
 797- `chat_template` - the model's original Jinja2 prompt template
 798- `chat_template_caps` - capabilities of the chat template (see `common/jinja/caps.h` for more info)
 799- `modalities` - the list of supported modalities
 800- `is_sleeping` - sleeping status, see [Sleeping on idle](#sleeping-on-idle)
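
*Example:*

A minimal query sketch (assumes the server is reachable at the default `localhost:8080`):

```shell
curl http://localhost:8080/props | jq
```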
 801
 802### POST `/props`: Change server global properties.
 803
To use this endpoint with the POST method, you need to start the server with `--props`.
 805
 806*Options:*
 807
 808- None yet
 809
 810### POST `/embeddings`: non-OpenAI-compatible embeddings API
 811
This endpoint supports all pooling types, including `--pooling none`. When the pooling is `none`, the response will contain the *unnormalized* embeddings for *all* input tokens. For all other pooling types, only the pooled embeddings are returned, normalized using the Euclidean norm.
 813
 814Note that the response format of this endpoint is different from `/v1/embeddings`.
 815
 816*Options:*
 817
 818Same as the `/v1/embeddings` endpoint.
 819
 820*Examples:*
 821
 822Same as the `/v1/embeddings` endpoint.
 823
 824**Response format**
 825
 826```
 827[
 828  {
 829    "index": 0,
 830    "embedding": [
 831      [ ... embeddings for token 0   ... ],
 832      [ ... embeddings for token 1   ... ],
      [ ... ],
      [ ... embeddings for token N-1 ... ]
 835    ]
 836  },
 837  ...
 838  {
 839    "index": P,
 840    "embedding": [
 841      [ ... embeddings for token 0   ... ],
 842      [ ... embeddings for token 1   ... ],
      [ ... ],
      [ ... embeddings for token N-1 ... ]
 845    ]
 846  }
 847]
 848```
 849
 850### GET `/slots`: Returns the current slots processing state
 851
 852This endpoint is enabled by default and can be disabled with `--no-slots`. It can be used to query various per-slot metrics, such as speed, processed tokens, sampling parameters, etc.
 853
If the query param `?fail_on_no_slot=1` is set, this endpoint will respond with status code 503 if there are no available slots.
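
*Example:*

A sketch of querying the slot state while requesting a 503 when no slot is free (assumes the default `localhost:8080`):

```shell
curl 'http://localhost:8080/slots?fail_on_no_slot=1' | jq
```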
 855
 856**Response format**
 857
 858<details>
 859<summary>Example with 2 slots</summary>
 860
 861```json
 862[
 863  {
 864    "id": 0,
 865    "id_task": 135,
 866    "n_ctx": 65536,
 867    "speculative": false,
 868    "is_processing": true,
 869    "params": {
 870      "n_predict": -1,
 871      "seed": 4294967295,
 872      "temperature": 0.800000011920929,
 873      "dynatemp_range": 0.0,
 874      "dynatemp_exponent": 1.0,
 875      "top_k": 40,
 876      "top_p": 0.949999988079071,
 877      "min_p": 0.05000000074505806,
 878      "top_n_sigma": -1.0,
 879      "xtc_probability": 0.0,
 880      "xtc_threshold": 0.10000000149011612,
 881      "typical_p": 1.0,
 882      "repeat_last_n": 64,
 883      "repeat_penalty": 1.0,
 884      "presence_penalty": 0.0,
 885      "frequency_penalty": 0.0,
 886      "dry_multiplier": 0.0,
 887      "dry_base": 1.75,
 888      "dry_allowed_length": 2,
 889      "dry_penalty_last_n": 131072,
 890      "mirostat": 0,
 891      "mirostat_tau": 5.0,
 892      "mirostat_eta": 0.10000000149011612,
 893      "max_tokens": -1,
 894      "n_keep": 0,
 895      "n_discard": 0,
 896      "ignore_eos": false,
 897      "stream": true,
 898      "n_probs": 0,
 899      "min_keep": 0,
 900      "chat_format": "GPT-OSS",
 901      "reasoning_format": "none",
 902      "reasoning_in_content": false,
 903      "thinking_forced_open": false,
 904      "samplers": [
 905        "penalties",
 906        "dry",
 907        "top_k",
 908        "typ_p",
 909        "top_p",
 910        "min_p",
 911        "xtc",
 912        "temperature"
 913      ],
 914      "speculative.n_max": 16,
 915      "speculative.n_min": 0,
 916      "speculative.p_min": 0.75,
 917      "timings_per_token": false,
 918      "post_sampling_probs": false,
 919      "lora": []
 920    },
 921    "next_token": {
 922      "has_next_token": true,
 923      "has_new_line": false,
 924      "n_remain": -1,
 925      "n_decoded": 0
 926    }
 927  },
 928  {
 929    "id": 1,
 930    "id_task": 0,
 931    "n_ctx": 65536,
 932    "speculative": false,
 933    "is_processing": true,
 934    "params": {
 935      "n_predict": -1,
 936      "seed": 4294967295,
 937      "temperature": 0.800000011920929,
 938      "dynatemp_range": 0.0,
 939      "dynatemp_exponent": 1.0,
 940      "top_k": 40,
 941      "top_p": 0.949999988079071,
 942      "min_p": 0.05000000074505806,
 943      "top_n_sigma": -1.0,
 944      "xtc_probability": 0.0,
 945      "xtc_threshold": 0.10000000149011612,
 946      "typical_p": 1.0,
 947      "repeat_last_n": 64,
 948      "repeat_penalty": 1.0,
 949      "presence_penalty": 0.0,
 950      "frequency_penalty": 0.0,
 951      "dry_multiplier": 0.0,
 952      "dry_base": 1.75,
 953      "dry_allowed_length": 2,
 954      "dry_penalty_last_n": 131072,
 955      "mirostat": 0,
 956      "mirostat_tau": 5.0,
 957      "mirostat_eta": 0.10000000149011612,
 958      "max_tokens": -1,
 959      "n_keep": 0,
 960      "n_discard": 0,
 961      "ignore_eos": false,
 962      "stream": true,
 963      "n_probs": 0,
 964      "min_keep": 0,
 965      "chat_format": "GPT-OSS",
 966      "reasoning_format": "none",
 967      "reasoning_in_content": false,
 968      "thinking_forced_open": false,
 969      "samplers": [
 970        "penalties",
 971        "dry",
 972        "top_k",
 973        "typ_p",
 974        "top_p",
 975        "min_p",
 976        "xtc",
 977        "temperature"
 978      ],
 979      "speculative.n_max": 16,
 980      "speculative.n_min": 0,
 981      "speculative.p_min": 0.75,
 982      "timings_per_token": false,
 983      "post_sampling_probs": false,
 984      "lora": []
 985    },
 986    "next_token": {
 987      "has_next_token": true,
 988      "has_new_line": true,
 989      "n_remain": -1,
 990      "n_decoded": 136
 991    }
 992  }
 993]
 994```
 995
 996</details>
 997
 998### GET `/metrics`: Prometheus compatible metrics exporter
 999
1000This endpoint is only accessible if `--metrics` is set.
1001
1002Available metrics:
1003- `llamacpp:prompt_tokens_total`: Number of prompt tokens processed.
1004- `llamacpp:tokens_predicted_total`: Number of generation tokens processed.
1005- `llamacpp:prompt_tokens_seconds`: Average prompt throughput in tokens/s.
1006- `llamacpp:predicted_tokens_seconds`: Average generation throughput in tokens/s.
1007- `llamacpp:kv_cache_usage_ratio`: KV-cache usage. `1` means 100 percent usage.
1008- `llamacpp:kv_cache_tokens`: KV-cache tokens.
1009- `llamacpp:requests_processing`: Number of requests processing.
1010- `llamacpp:requests_deferred`: Number of requests deferred.
1011- `llamacpp:n_tokens_max`: High watermark of the context size observed.
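
For example, a Prometheus-style scrape can be reproduced manually with a plain HTTP GET (a sketch; assumes the server was started with `--metrics` on the default `localhost:8080`):

```shell
curl http://localhost:8080/metrics
```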
1012
1013### POST `/slots/{id_slot}?action=save`: Save the prompt cache of the specified slot to a file.
1014
1015*Options:*
1016
1017`filename`: Name of the file to save the slot's prompt cache. The file will be saved in the directory specified by the `--slot-save-path` server parameter.
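
*Example:*

A sketch of saving slot 0 (the slot id and filename are illustrative; requires the server to be started with `--slot-save-path`):

```shell
curl -X POST 'http://localhost:8080/slots/0?action=save' \
    -H "Content-Type: application/json" \
    -d '{"filename": "slot_save_file.bin"}'
```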
1018
1019**Response format**
1020
1021```json
1022{
1023    "id_slot": 0,
1024    "filename": "slot_save_file.bin",
1025    "n_saved": 1745,
1026    "n_written": 14309796,
1027    "timings": {
1028        "save_ms": 49.865
1029    }
1030}
1031```
1032
1033### POST `/slots/{id_slot}?action=restore`: Restore the prompt cache of the specified slot from a file.
1034
1035*Options:*
1036
1037`filename`: Name of the file to restore the slot's prompt cache from. The file should be located in the directory specified by the `--slot-save-path` server parameter.
1038
1039**Response format**
1040
1041```json
1042{
1043    "id_slot": 0,
1044    "filename": "slot_save_file.bin",
1045    "n_restored": 1745,
1046    "n_read": 14309796,
1047    "timings": {
1048        "restore_ms": 42.937
1049    }
1050}
1051```
1052
1053### POST `/slots/{id_slot}?action=erase`: Erase the prompt cache of the specified slot.
1054
1055**Response format**
1056
1057```json
1058{
1059    "id_slot": 0,
1060    "n_erased": 1745
1061}
1062```
1063
1064### GET `/lora-adapters`: Get list of all LoRA adapters
1065
1066This endpoint returns the loaded LoRA adapters. You can add adapters using `--lora` when starting the server, for example: `--lora my_adapter_1.gguf --lora my_adapter_2.gguf ...`
1067
By default, all adapters will be loaded with scale set to 1. To initialize all adapter scales to 0 instead, add `--lora-init-without-apply`.
1069
1070Please note that this value will be overwritten by the `lora` field for each request.
1071
1072If an adapter is disabled, the scale will be set to 0.
1073
1074**Response format**
1075
1076```json
1077[
1078    {
1079        "id": 0,
1080        "path": "my_adapter_1.gguf",
1081        "scale": 0.0
1082    },
1083    {
1084        "id": 1,
1085        "path": "my_adapter_2.gguf",
1086        "scale": 0.0
1087    }
1088]
1089```
1090
1091### POST `/lora-adapters`: Set list of LoRA adapters
1092
1093This sets the global scale for LoRA adapters. Please note that this value will be overwritten by the `lora` field for each request.
1094
1095To disable an adapter, either remove it from the list below, or set scale to 0.
1096
1097**Request format**
1098
1099To know the `id` of the adapter, use GET `/lora-adapters`
1100
1101```json
1102[
1103  {"id": 0, "scale": 0.2},
1104  {"id": 1, "scale": 0.8}
1105]
1106```
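
For example, the request above could be sent as follows (a sketch; assumes the default `localhost:8080`):

```shell
curl -X POST http://localhost:8080/lora-adapters \
    -H "Content-Type: application/json" \
    -d '[{"id": 0, "scale": 0.2}, {"id": 1, "scale": 0.8}]'
```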
1107
1108## OpenAI-compatible API Endpoints
1109
1110### GET `/v1/models`: OpenAI-compatible Model Info API
1111
1112Returns information about the loaded model. See [OpenAI Models API documentation](https://platform.openai.com/docs/api-reference/models).
1113
The returned list always has a single element. The `meta` field can be `null` (for example, while the model is still loading).

By default, the model `id` field is the path to the model file, as specified via `-m`. You can set a custom value for the `id` field via the `--alias` argument. For example, `--alias gpt-4o-mini`.
1117
1118Example:
1119
1120```json
1121{
1122    "object": "list",
1123    "data": [
1124        {
1125            "id": "../models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
1126            "object": "model",
1127            "created": 1735142223,
1128            "owned_by": "llamacpp",
1129            "meta": {
1130                "vocab_type": 2,
1131                "n_vocab": 128256,
1132                "n_ctx_train": 131072,
1133                "n_embd": 4096,
1134                "n_params": 8030261312,
1135                "size": 4912898304
1136            }
1137        }
1138    ]
1139}
1140```
1141
1142### POST `/v1/completions`: OpenAI-compatible Completions API
1143
Given an input `prompt`, it returns the predicted completion. Streaming mode is also supported. While no strong claims of compatibility with the OpenAI API spec are made, in our experience it suffices to support many apps.
1145
1146*Options:*
1147
1148See [OpenAI Completions API documentation](https://platform.openai.com/docs/api-reference/completions).
1149
1150llama.cpp `/completion`-specific features such as `mirostat` are supported.
1151
1152*Examples:*
1153
1154Example usage with `openai` python library:
1155
1156```python
1157import openai
1158
1159client = openai.OpenAI(
1160    base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
1161    api_key = "sk-no-key-required"
1162)
1163
1164completion = client.completions.create(
1165  model="davinci-002",
1166  prompt="I believe the meaning of life is",
1167  max_tokens=8
1168)
1169
1170print(completion.choices[0].text)
1171```
1172
1173### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API
1174
Given a ChatML-formatted JSON description in `messages`, it returns the predicted completion. Both synchronous and streaming modes are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with the OpenAI API spec are made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
1176
If the model supports multimodal input, you can pass media files via the `image_url` content part. Both base64 and remote URLs are supported as input. See the OpenAI documentation for more details.
1178
1179*Options:*
1180
1181See [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). llama.cpp `/completion`-specific features such as `mirostat` are also supported.
1182
The `response_format` parameter supports both plain JSON output (e.g. `{"type": "json_object"}`) and schema-constrained JSON (e.g. `{"type": "json_object", "schema": {"type": "string", "minLength": 10, "maxLength": 100}}` or `{"type": "json_schema", "schema": {"properties": { "name": { "title": "Name",  "type": "string" }, "date": { "title": "Date",  "type": "string" }, "participants": { "items": {"type": "string" }, "title": "Participants",  "type": "array" } } } }`), similar to other OpenAI-inspired API providers.
1184
`chat_template_kwargs`: Allows sending additional parameters to the Jinja templating system. For example: `{"enable_thinking": false}`
1186
1187`reasoning_format`: The reasoning format to be parsed. If set to `none`, it will output the raw generated text.
1188
1189`thinking_forced_open`: Force a reasoning model to always output the reasoning. Only works on certain models.
1190
`parse_tool_calls`: Whether to parse the generated tool calls.

`parallel_tool_calls`: Whether to enable parallel/multiple tool calls (only supported on some models; verification is based on the Jinja template).
1194
1195*Examples:*
1196
You can use either the Python `openai` library with appropriate checkpoints:
1198
1199```python
1200import openai
1201
1202client = openai.OpenAI(
1203    base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
1204    api_key = "sk-no-key-required"
1205)
1206
1207completion = client.chat.completions.create(
1208  model="gpt-3.5-turbo",
1209  messages=[
1210    {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
1211    {"role": "user", "content": "Write a limerick about python exceptions"}
1212  ]
1213)
1214
1215print(completion.choices[0].message)
1216```
1217
1218... or raw HTTP requests:
1219
1220```shell
1221curl http://localhost:8080/v1/chat/completions \
1222-H "Content-Type: application/json" \
1223-H "Authorization: Bearer no-key" \
1224-d '{
1225"model": "gpt-3.5-turbo",
1226"messages": [
1227{
1228    "role": "system",
1229    "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
1230},
1231{
1232    "role": "user",
1233    "content": "Write a limerick about python exceptions"
1234}
1235]
1236}'
1237```
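
For instance, a sketch of a request that constrains the output using the `schema` form of `response_format` described above (the model name and schema are illustrative):

```shell
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
    {"role": "user", "content": "Extract the name and date from: the meeting with Alice is on Friday."}
],
"response_format": {
    "type": "json_object",
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "date": {"type": "string"}
        },
        "required": ["name", "date"]
    }
}
}'
```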
1238
1239*Tool call support*
1240
1241[OpenAI-style function calling](https://platform.openai.com/docs/guides/function-calling) is supported with the `--jinja` flag (and may require a `--chat-template-file` override to get the right tool-use compatible Jinja template; worst case, `--chat-template chatml` may also work).
1242
1243**See our [Function calling](../../docs/function-calling.md) docs** for more details, supported native tool call styles (generic tool call style is used as fallback) / examples of use.
1244
1245*Timings and context usage*
1246
1247The response contains a `timings` object, for example:
1248
1249```js
1250{
1251  "choices": [],
1252  "created": 1757141666,
1253  "id": "chatcmpl-ecQULm0WqPrftUqjPZO1CFYeDjGZNbDu",
1254  // ...
1255  "timings": {
1256    "cache_n": 236, // number of prompt tokens reused from cache
1257    "prompt_n": 1, // number of prompt tokens being processed
1258    "prompt_ms": 30.958,
1259    "prompt_per_token_ms": 30.958,
1260    "prompt_per_second": 32.301828283480845,
1261    "predicted_n": 35, // number of predicted tokens
1262    "predicted_ms": 661.064,
1263    "predicted_per_token_ms": 18.887542857142858,
1264    "predicted_per_second": 52.94494935437416
1265  }
1266}
1267```
1268
1269This provides information on the performance of the server. It also allows calculating the current context usage.
1270
1271The total number of tokens in context is equal to `prompt_n + cache_n + predicted_n`
1272
1273*Reasoning support*
1274
The server supports parsing and returning reasoning via the `reasoning_content` field, similar to the DeepSeek API.
1276
1277Reasoning input (preserve reasoning in history) is also supported by some specific templates. For more details, please refer to [PR#18994](https://github.com/ggml-org/llama.cpp/pull/18994).
1278
1279### POST `/v1/responses`: OpenAI-compatible Responses API
1280
1281*Options:*
1282
1283See [OpenAI Responses API documentation](https://platform.openai.com/docs/api-reference/responses).
1284
1285*Examples:*
1286
You can use either the Python `openai` library with appropriate checkpoints:
1288
1289```python
1290import openai
1291
1292client = openai.OpenAI(
1293    base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
1294    api_key = "sk-no-key-required"
1295)
1296
1297response = client.responses.create(
1298  model="gpt-4.1",
1299  instructions="You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests.",
1300  input="Write a limerick about python exceptions"
1301)
1302
1303print(response.output_text)
1304```
1305
1306... or raw HTTP requests:
1307
1308```shell
1309curl http://localhost:8080/v1/responses \
1310-H "Content-Type: application/json" \
1311-H "Authorization: Bearer no-key" \
1312-d '{
1313"model": "gpt-4.1",
1314"instructions": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests.",
1315"input": "Write a limerick about python exceptions"
1316}'
1317```
1318
This endpoint works by converting the Responses API request into a Chat Completions request.

1322### POST `/v1/embeddings`: OpenAI-compatible embeddings API
1323
This endpoint requires the model to use a pooling type other than `none`. The embeddings are normalized using the Euclidean norm.
1325
1326*Options:*
1327
1328See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-reference/embeddings).
1329
1330*Examples:*
1331
- `input` as string
1333
1334  ```shell
1335  curl http://localhost:8080/v1/embeddings \
1336  -H "Content-Type: application/json" \
1337  -H "Authorization: Bearer no-key" \
1338  -d '{
1339          "input": "hello",
1340          "model":"GPT-4",
1341          "encoding_format": "float"
1342  }'
1343  ```
1344
1345- `input` as string array
1346
1347  ```shell
1348  curl http://localhost:8080/v1/embeddings \
1349  -H "Content-Type: application/json" \
1350  -H "Authorization: Bearer no-key" \
1351  -d '{
1352          "input": ["hello", "world"],
1353          "model":"GPT-4",
1354          "encoding_format": "float"
1355  }'
1356  ```
1357
1358### POST `/v1/messages`: Anthropic-compatible Messages API
1359
1360Given a list of `messages`, returns the assistant's response. Streaming is supported via Server-Sent Events. While no strong claims of compatibility with the Anthropic API spec are made, in our experience it suffices to support many apps.
1361
1362*Options:*
1363
See [Anthropic Messages API documentation](https://docs.anthropic.com/en/api/messages). Tool use requires the `--jinja` flag.
1365
1366`model`: Model identifier (required)
1367
1368`messages`: Array of message objects with `role` and `content` (required)
1369
1370`max_tokens`: Maximum tokens to generate (default: 4096)
1371
1372`system`: System prompt as string or array of content blocks
1373
1374`temperature`: Sampling temperature 0-1 (default: 1.0)
1375
1376`top_p`: Nucleus sampling (default: 1.0)
1377
1378`top_k`: Top-k sampling
1379
1380`stop_sequences`: Array of stop sequences
1381
1382`stream`: Enable streaming (default: false)
1383
1384`tools`: Array of tool definitions (requires `--jinja`)
1385
1386`tool_choice`: Tool selection mode (`{"type": "auto"}`, `{"type": "any"}`, or `{"type": "tool", "name": "..."}`)
1387
1388*Examples:*
1389
1390```shell
1391curl http://localhost:8080/v1/messages \
1392  -H "Content-Type: application/json" \
1393  -H "x-api-key: your-api-key" \
1394  -d '{
1395    "model": "gpt-4",
1396    "max_tokens": 1024,
1397    "system": "You are a helpful assistant.",
1398    "messages": [
1399      {"role": "user", "content": "Hello!"}
1400    ]
1401  }'
1402```
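
A request using tool definitions might look like the following sketch (requires starting the server with `--jinja`; the tool name and schema are illustrative):

```shell
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: your-api-key" \
  -d '{
    "model": "gpt-4",
    "max_tokens": 1024,
    "tools": [
      {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
          "type": "object",
          "properties": {
            "city": {"type": "string"}
          },
          "required": ["city"]
        }
      }
    ],
    "tool_choice": {"type": "auto"},
    "messages": [
      {"role": "user", "content": "What is the weather in Paris?"}
    ]
  }'
```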
1403
1404### POST `/v1/messages/count_tokens`: Token Counting
1405
1406Counts the number of tokens in a request without generating a response.
1407
1408Accepts the same parameters as `/v1/messages`. The `max_tokens` parameter is not required.
1409
1410*Example:*
1411
1412```shell
1413curl http://localhost:8080/v1/messages/count_tokens \
1414  -H "Content-Type: application/json" \
1415  -d '{
1416    "model": "gpt-4",
1417    "messages": [
1418      {"role": "user", "content": "Hello!"}
1419    ]
1420  }'
1421```
1422
1423*Response:*
1424
1425```json
1426{"input_tokens": 10}
1427```
1428
1429## Using multiple models
1430
1431`llama-server` can be launched in a **router mode** that exposes an API for dynamically loading and unloading models. The main process (the "router") automatically forwards each request to the appropriate model instance.
1432
1433To start in router mode, launch `llama-server` **without specifying any model**:
1434
1435```sh
1436llama-server
1437```
1438
1439### Model sources
1440
1441There are 3 possible sources for model files:
14421. Cached models (controlled by the `LLAMA_CACHE` environment variable)
14432. Custom model directory (set via the `--models-dir` argument)
14443. Custom preset (set via the `--models-preset` argument)
1445
1446By default, the router looks for models in the cache. You can add Hugging Face models to the cache with:
1447
1448```sh
1449llama-server -hf <user>/<model>:<tag>
1450```
1451
1452*The server must be restarted after adding a new model.*
1453
1454Alternatively, you can point the router to a local directory containing your GGUF files using `--models-dir`. Example command:
1455
1456```sh
1457llama-server --models-dir ./models_directory
1458```
1459
If the model consists of multiple GGUF files (for multimodal or multi-shard models), the files should be placed in a subdirectory. The directory structure should look like this:
1461
1462```sh
1463models_directory
 │
 │  # single file
 ├─ llama-3.2-1b-Q4_K_M.gguf
 ├─ Qwen3-8B-Q4_K_M.gguf
 │
 │  # multimodal
 ├─ gemma-3-4b-it-Q8_0
 │    ├─ gemma-3-4b-it-Q8_0.gguf
 │    └─ mmproj-F16.gguf   # file name must start with "mmproj"
 │
 │  # multi-shard
 ├─ Kimi-K2-Thinking-UD-IQ1_S
 │    ├─ Kimi-K2-Thinking-UD-IQ1_S-00001-of-00006.gguf
 │    ├─ Kimi-K2-Thinking-UD-IQ1_S-00002-of-00006.gguf
 │    ├─ ...
 │    └─ Kimi-K2-Thinking-UD-IQ1_S-00006-of-00006.gguf
1480```
1481
1482You may also specify default arguments that will be passed to every model instance:
1483
1484```sh
llama-server -c 8192 -n 1024 -np 2
1486```
1487
1488Note: model instances inherit both command line arguments and environment variables from the router server.
1489
Alternatively, you can also add a GGUF-based preset (see the next section).
1491
1492### Model presets
1493
1494Model presets allow advanced users to define custom configurations using an `.ini` file:
1495
1496```sh
1497llama-server --models-preset ./my-models.ini
1498```
1499
1500Each section in the file defines a new preset. Keys within a section correspond to command-line arguments (without leading dashes). For example, the argument `--n-gpu-layers 123` is written as `n-gpu-layers = 123`.
1501
1502Short argument forms (e.g., `c`, `ngl`) and environment variable names (e.g., `LLAMA_ARG_N_GPU_LAYERS`) are also supported as keys.
1503
1504Example:
1505
1506```ini
1507version = 1
1508
1509; (Optional) This section provides global settings shared across all presets.
1510; If the same key is defined in a specific preset, it will override the value in this global section.
1511[*]
1512c = 8192
n-gpu-layers = 8
1514
1515; If the key corresponds to an existing model on the server,
1516; this will be used as the default config for that model
1517[ggml-org/MY-MODEL-GGUF:Q8_0]
1518; string value
1519chat-template = chatml
1520; numeric value
1521n-gpu-layers = 123
1522; flag value (for certain flags, you need to use the "no-" prefix for negation)
1523jinja = true
1524; shorthand argument (for example, context size)
1525c = 4096
1526; environment variable name
1527LLAMA_ARG_CACHE_RAM = 0
1528; file paths are relative to server's CWD
1529model-draft = ./my-models/draft.gguf
1530; but it's RECOMMENDED to use absolute path
1531model-draft = /Users/abc/my-models/draft.gguf
1532
1533; If the key does NOT correspond to an existing model,
1534; you need to specify at least the model path or HF repo
1535[custom_model]
1536model = /Users/abc/my-awesome-model-Q4_K_M.gguf
1537```
1538
Note: some arguments are controlled by the router (e.g., host, port, API key, HF repo, model alias). They will be removed or overwritten when the model is loaded.
1540
1541The precedence rule for preset options is as follows:
15421. **Command-line arguments** passed to `llama-server` (highest priority)
15432. **Model-specific options** defined in the preset file (e.g. `[ggml-org/MY-MODEL...]`)
15443. **Global options** defined in the preset file (`[*]`)
1545
1546We also offer additional options that are exclusive to presets (these aren't treated as command-line arguments):
1547- `load-on-startup` (boolean): Controls whether the model loads automatically when the server starts
- `stop-timeout` (int, seconds): After an unload is requested, wait this many seconds before forcing termination (default: 10)
1549
1550### Routing requests
1551
1552Requests are routed according to the requested model name.
1553
For **POST** endpoints (`/v1/chat/completions`, `/v1/completions`, `/infill`, etc.), the router uses the `"model"` field in the JSON body:
1555
1556```json
1557{
1558  "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
1559  "messages": [
1560    {
1561      "role": "user",
1562      "content": "hello"
1563    }
1564  ]
1565}
1566```
1567
For **GET** endpoints (`/props`, `/metrics`, etc.), the router uses the `model` query parameter (URL-encoded):
1569
1570```
1571GET /props?model=ggml-org%2Fgemma-3-4b-it-GGUF%3AQ4_K_M
1572```
1573
By default, the model will be loaded automatically if it is not already loaded. To disable this behavior, add `--no-models-autoload` when starting the server. Additionally, you can include an `autoload=true|false` query parameter to control this behavior per request.
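
For example, to query a model's properties without triggering a load if it is not already running (a sketch; the model name is illustrative):

```shell
curl 'http://localhost:8080/props?model=ggml-org%2Fgemma-3-4b-it-GGUF%3AQ4_K_M&autoload=false'
```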
1575
1576### GET `/models`: List available models
1577
Lists all models in the cache. The model metadata also includes a field indicating the status of the model:
1579
1580```json
1581{
1582  "data": [{
1583    "id": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
1584    "in_cache": true,
1585    "path": "/Users/REDACTED/Library/Caches/llama.cpp/ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf",
1586    "status": {
1587      "value": "loaded",
1588      "args": ["llama-server", "-ctx", "4096"]
1589    },
1590    ...
1591  }]
1592}
1593```
1594
1595Note: For a local GGUF (stored offline in a custom directory), the model object will have `"in_cache": false`.
1596
1597The `status` object can be:
1598
1599```json
1600"status": {
1601  "value": "unloaded"
1602}
1603```
1604
1605```json
1606"status": {
1607  "value": "loading",
1608  "args": ["llama-server", "-ctx", "4096"]
1609}
1610```
1611
1612```json
1613"status": {
1614  "value": "unloaded",
1615  "args": ["llama-server", "-ctx", "4096"],
1616  "failed": true,
1617  "exit_code": 1
1618}
1619```
1620
1621```json
1622"status": {
1623  "value": "loaded",
1624  "args": ["llama-server", "-ctx", "4096"]
1625}
1626```
1627
1628### POST `/models/load`: Load a model
1629
1630Load a model
1631
1632Payload:
1633- `model`: name of the model to be loaded.
1634
1635```json
1636{
1637  "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"
1638}
1639```
1640
1641Response:
1642
1643```json
1644{
1645  "success": true
1646}
1647```
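
*Example:*

A request sketch (the model name is illustrative; assumes the default `localhost:8080`):

```shell
curl -X POST http://localhost:8080/models/load \
    -H "Content-Type: application/json" \
    -d '{"model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"}'
```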
1648
1649
1650### POST `/models/unload`: Unload a model
1651
1652Unload a model
1653
1654Payload:
1655
1656```json
1657{
  "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"
1659}
1660```
1661
1662Response:
1663
1664```json
1665{
1666  "success": true
1667}
1668```
1669
1670## API errors
1671
1672`llama-server` returns errors in the same format as OAI: https://github.com/openai/openai-openapi
1673
1674Example of an error:
1675
1676```json
1677{
1678    "error": {
1679        "code": 401,
1680        "message": "Invalid API Key",
1681        "type": "authentication_error"
1682    }
1683}
1684```

Apart from error types supported by OAI, we also have custom types that are specific to functionalities of llama.cpp:

**When /metrics or /slots endpoint is disabled**

```json
{
    "error": {
        "code": 501,
        "message": "This server does not support metrics endpoint.",
        "type": "not_supported_error"
    }
}
```

**When the server receives invalid grammar via */completions endpoint**

```json
{
    "error": {
        "code": 400,
        "message": "Failed to parse grammar",
        "type": "invalid_request_error"
    }
}
```
1685
1686## Sleeping on Idle
1687
1688The server supports an automatic sleep mode that activates after a specified period of inactivity (no incoming tasks). This feature, introduced in [PR #18228](https://github.com/ggml-org/llama.cpp/pull/18228), can be enabled using the `--sleep-idle-seconds` command-line argument. It works seamlessly in both single-model and multi-model configurations.
1689
1690When the server enters sleep mode, the model and its associated memory (including the KV cache) are unloaded from RAM to conserve resources. Any new incoming task will automatically trigger the model to reload.
1691
1692The sleeping status can be retrieved from the `GET /props` endpoint (or `/props?model=(model_name)` in router mode).
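
For example, to make the server sleep after 5 minutes of inactivity (a sketch; the model path and timeout value are illustrative):

```sh
llama-server -m my_model.gguf --sleep-idle-seconds 300
```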
1693
1694Note that the following endpoints are exempt from being considered as incoming tasks. They do not trigger model reloading and do not reset the idle timer:
1695- `GET /health`
1696- `GET /props`
1697- `GET /models`
1698
1699## More examples
1700
1701### Interactive mode
1702
1703Check the sample in [chat.mjs](chat.mjs).
1704Run with NodeJS version 16 or later:
1705
1706```sh
1707node chat.mjs
1708```
1709
1710Another sample in [chat.sh](chat.sh).
1711Requires [bash](https://www.gnu.org/software/bash/), [curl](https://curl.se) and [jq](https://jqlang.github.io/jq/).
1712Run with bash:
1713
1714```sh
1715bash chat.sh
1716```
1717
1744### Legacy completion web UI
1745
A new chat-based UI has replaced the old completion-based UI since [this PR](https://github.com/ggml-org/llama.cpp/pull/10175). If you want to use the old completion UI, start the server with `--path ./tools/server/public_legacy`.
1747
1748For example:
1749
1750```sh
1751./llama-server -m my_model.gguf -c 8192 --path ./tools/server/public_legacy
1752```
1753
1754### Extending or building alternative Web Front End
1755
You can extend the front end by running the server binary with `--path` set to `./your-directory` and importing `/completion.js` to get access to the `llama()` helper, as shown in the example below.
1757
1758Read the documentation in `/completion.js` to see convenient ways to access llama.
1759
1760A simple example is below:
1761
1762```html
1763<html>
1764  <body>
1765    <pre>
1766      <script type="module">
1767        import { llama } from '/completion.js'
1768
1769        const prompt = `### Instruction:
1770Write dad jokes, each one paragraph.
1771You can use html formatting if needed.
1772
1773### Response:`
1774
1775        for await (const chunk of llama(prompt)) {
1776          document.write(chunk.data.content)
1777        }
1778      </script>
1779    </pre>
1780  </body>
1781</html>
1782```