## Overview

> [!IMPORTANT]
> This example and the RPC backend are currently in a proof-of-concept development stage. As such, the functionality is fragile and
> insecure. **Never run the RPC server on an open network or in a sensitive environment!**

The `rpc-server` allows exposing `ggml` devices on a remote host.
The RPC backend communicates with one or more instances of `rpc-server` and offloads computations to them.
This can be used for distributed LLM inference with `llama.cpp` in the following way:

```mermaid
flowchart TD
    rpcb<-->|TCP|srva
    rpcb<-->|TCP|srvb
    rpcb<-.->|TCP|srvn
    subgraph hostn[Host N]
    srvn[rpc-server]<-.->dev4["CUDA0"]
    srvn[rpc-server]<-.->dev5["CPU"]
    end
    subgraph hostb[Host B]
    srvb[rpc-server]<-->dev3["Metal"]
    end
    subgraph hosta[Host A]
    srva[rpc-server]<-->dev["CUDA0"]
    srva[rpc-server]<-->dev2["CUDA1"]
    end
    subgraph host[Main Host]
    local["Local devices"]<-->ggml[llama-cli]
    ggml[llama-cli]<-->rpcb[RPC backend]
    end
    style hostn stroke:#66,stroke-width:2px,stroke-dasharray: 5 5
    classDef devcls fill:#5B9BD5
    class local,dev,dev2,dev3,dev4,dev5 devcls
```

By default, `rpc-server` exposes all available accelerator devices on the host.
If there are no accelerators, it exposes a single `CPU` device.

## Usage

### Remote hosts

On each remote host, build the backends for each accelerator by adding `-DGGML_RPC=ON` to the build options.
For example, to build the `rpc-server` with support for CUDA accelerators:

```bash
mkdir build-rpc-cuda
cd build-rpc-cuda
cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build . --config Release
```

When started, the `rpc-server` will detect and expose all available `CUDA` devices:

```bash
$ bin/rpc-server
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Starting RPC server v3.0.0
  endpoint       : 127.0.0.1:50052
  local cache    : n/a
Devices:
  CUDA0: NVIDIA GeForce RTX 5090 (32109 MiB, 31588 MiB free)
```

You can control the set of exposed CUDA devices with the `CUDA_VISIBLE_DEVICES` environment variable or the `--device` command line option. The following two commands have the same effect:
```bash
$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052
$ bin/rpc-server --device CUDA0 -p 50052
```
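
The same approach works for other backends. For example, a macOS host could expose a Metal device; a minimal sketch of such a build (Metal is typically enabled by default on Apple hardware, so `-DGGML_METAL=ON` is shown only for explicitness):

```bash
mkdir build-rpc-metal
cd build-rpc-metal
cmake .. -DGGML_METAL=ON -DGGML_RPC=ON
cmake --build . --config Release
```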

### Main host

On the main host, build `llama.cpp` with the backends for the local devices and add `-DGGML_RPC=ON` to the build options.
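
For example, if the main host also has CUDA devices, the build could mirror the remote-host build above (adjust the backend flags to match your local hardware):

```bash
mkdir build-rpc
cd build-rpc
cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build . --config Release
```
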
Finally, when running `llama-cli` or `llama-server`, use the `--rpc` option to specify the host and port of each `rpc-server`:

```bash
$ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 --rpc 192.168.88.10:50052,192.168.88.11:50052
```
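
The same `--rpc` option works with `llama-server`, for example:

```bash
$ llama-server -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 --rpc 192.168.88.10:50052,192.168.88.11:50052
```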

By default, llama.cpp distributes model weights and the KV cache across all available devices -- both local and remote -- in proportion to each device's available memory.
You can override this behavior with the `--tensor-split` option, which sets custom proportions for splitting tensor data across devices.
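
For example, the following (illustrative) split assigns tensor data in a 2:1 ratio across two devices; the values are matched to devices in the order they are enumerated:

```bash
$ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 --rpc 192.168.88.10:50052 --tensor-split 2,1
```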

### Local cache

The RPC server can use a local cache to store large tensors and avoid transferring them over the network.
This can speed up model loading significantly, especially when using large models.
To enable the cache, use the `-c` option:

```bash
$ bin/rpc-server -c
```

By default, the cache is stored in the `$HOME/.cache/llama.cpp/rpc` directory; the location can be changed with the `LLAMA_CACHE` environment variable.
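
For example, to place the cache on a different drive (the path below is only an illustration):

```bash
$ LLAMA_CACHE=/mnt/data/llama-rpc-cache bin/rpc-server -c
```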

### Troubleshooting

Use the `GGML_RPC_DEBUG` environment variable to enable debug messages from `rpc-server`:
```bash
$ GGML_RPC_DEBUG=1 bin/rpc-server
```