## Overview

> [!IMPORTANT]
> This example and the RPC backend are currently in a proof-of-concept development stage. As such, the functionality is fragile and
> insecure. **Never run the RPC server on an open network or in a sensitive environment!**

The `rpc-server` exposes the `ggml` devices of a remote host over the network.
The RPC backend communicates with one or more instances of `rpc-server` and offloads computations to them.
This can be used for distributed LLM inference with `llama.cpp` in the following way:

```mermaid
flowchart TD
    rpcb<-->|TCP|srva
    rpcb<-->|TCP|srvb
    rpcb<-.->|TCP|srvn
    subgraph hostn[Host N]
        srvn[rpc-server]<-.->dev4["CUDA0"]
        srvn[rpc-server]<-.->dev5["CPU"]
    end
    subgraph hostb[Host B]
        srvb[rpc-server]<-->dev3["Metal"]
    end
    subgraph hosta[Host A]
        srva[rpc-server]<-->dev["CUDA0"]
        srva[rpc-server]<-->dev2["CUDA1"]
    end
    subgraph host[Main Host]
        local["Local devices"]<-->ggml[llama-cli]
        ggml[llama-cli]<-->rpcb[RPC backend]
    end
    style hostn stroke:#666,stroke-width:2px,stroke-dasharray: 5 5
    classDef devcls fill:#5B9BD5
    class local,dev,dev2,dev3,dev4,dev5 devcls
```

By default, `rpc-server` exposes all available accelerator devices on the host.
If there are no accelerators, it exposes a single `CPU` device.

## Usage

### Remote hosts

On each remote host, build `rpc-server` with the backends for that host's accelerators and add `-DGGML_RPC=ON` to the build options.
For example, to build the `rpc-server` with CUDA support:

```bash
mkdir build-rpc-cuda
cd build-rpc-cuda
cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build . --config Release
```

When started, the `rpc-server` will detect and expose all available `CUDA` devices:

```bash
$ bin/rpc-server
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Starting RPC server v3.0.0
  endpoint    : 127.0.0.1:50052
  local cache : n/a
Devices:
  CUDA0: NVIDIA GeForce RTX 5090 (32109 MiB, 31588 MiB free)
```
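
Builds for other accelerators follow the same pattern. For example, a Metal build on an Apple Silicon host might look like this (`GGML_METAL` is typically enabled by default on macOS, so the explicit flag is mostly illustrative):

```bash
mkdir build-rpc-metal
cd build-rpc-metal
cmake .. -DGGML_METAL=ON -DGGML_RPC=ON
cmake --build . --config Release
```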

You can control the set of exposed CUDA devices with the `CUDA_VISIBLE_DEVICES` environment variable or the `--device` command line option. The following two commands have the same effect:
```bash
$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052
$ bin/rpc-server --device CUDA0 -p 50052
```
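
You can also run more than one `rpc-server` instance on the same machine, pinning each one to a different device and port (the port numbers below are only an illustration); the main host can then list all of these endpoints in its `--rpc` option:
```bash
# run each command in its own shell: GPU 0 on port 50052, GPU 1 on port 50053
$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052
$ CUDA_VISIBLE_DEVICES=1 bin/rpc-server -p 50053
```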

### Main host

On the main host, build `llama.cpp` with the backends for the local devices and add `-DGGML_RPC=ON` to the build options.
Finally, when running `llama-cli` or `llama-server`, use the `--rpc` option to specify the host and port of each `rpc-server`:

```bash
$ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 --rpc 192.168.88.10:50052,192.168.88.11:50052
```
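
To check which devices (local and remote) are visible to `llama.cpp`, you can list them before running inference; the `--list-devices` option and its interaction with `--rpc` are assumptions here, so check `llama-cli --help` for your build:
```bash
$ llama-cli --rpc 192.168.88.10:50052,192.168.88.11:50052 --list-devices
```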

By default, llama.cpp distributes model weights and the KV cache across all available devices -- both local and remote -- in proportion to each device's available memory.
You can override this behavior with the `--tensor-split` option, which lets you set custom proportions for splitting tensor data across devices.
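
For example, to give the first device three times the share of the other two (how many values you need and which device each value maps to depend on how your local and RPC devices are enumerated, so treat this as a sketch):
```bash
$ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 --rpc 192.168.88.10:50052,192.168.88.11:50052 --tensor-split 3,1,1
```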

### Local cache

The RPC server can use a local cache to store large tensors and avoid transferring them over the network.
This can speed up model loading significantly, especially when using large models.
To enable the cache, use the `-c` option:

```bash
$ bin/rpc-server -c
```

By default, the cache is stored in the `$HOME/.cache/llama.cpp/rpc` directory; the location can be changed with the `LLAMA_CACHE` environment variable.
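
For example, to keep the cache on a different drive (the path is only a placeholder):
```bash
$ LLAMA_CACHE=/mnt/ssd/llama-rpc-cache bin/rpc-server -c
```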

### Troubleshooting

Use the `GGML_RPC_DEBUG` environment variable to enable debug messages from `rpc-server`:
```bash
$ GGML_RPC_DEBUG=1 bin/rpc-server
```
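
If the main host cannot reach a server, it can help to first verify plain TCP connectivity to the endpoint, for example with netcat (flag support varies between netcat variants; the address matches the earlier examples):
```bash
$ nc -zv 192.168.88.10 50052
```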