# Snapdragon-based devices

## Setup

### Android

The easiest way to build llama.cpp for a Snapdragon-based Android device is to use the toolchain Docker image (see github.com/snapdragon-toolchain).
This image includes the Android NDK, OpenCL SDK, Hexagon SDK, CMake, etc.

This method works on Linux, macOS, and Windows. macOS and Windows users should install Docker Desktop.

```
~/src/llama.cpp$ docker run -it -u $(id -u):$(id -g) --volume $(pwd):/workspace --platform linux/amd64 ghcr.io/snapdragon-toolchain/arm64-android:v0.3
[d]/> cd /workspace
```

Note: The rest of the **Android** build process assumes that you're running inside the toolchain container.

### Windows on Snapdragon

Native Windows 11 arm64 builds have the following tool dependencies:
- MS Visual Studio 2026 (Community Edition or Pro)
  - MSVC arm64 standard and runtime libraries
  - UCRT and Driver Kit
- LLVM core libraries and Clang compiler (winget)
- CMake, Git, Python (winget)
- Hexagon SDK Community Edition 6.4 or later (see windows.md)
- OpenCL SDK 2.3 or later (see windows.md)

Note: The rest of the **Windows** build process assumes that you're running natively in PowerShell.
Adapt the build commands below accordingly.
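
For the winget-installable dependencies, something like the following should work (the exact package IDs are assumptions; verify them with `winget search <name>` before installing):

```shell
# Install LLVM/Clang, CMake, Git, and Python via winget.
# Package IDs below are assumptions -- confirm with `winget search <name>`.
winget install LLVM.LLVM
winget install Kitware.CMake
winget install Git.Git
winget install Python.Python.3.12
```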

## How to Build

Let's build llama.cpp with CPU, OpenCL, and Hexagon backends via CMake presets:

```
[d]/workspace> cp docs/backend/hexagon/CMakeUserPresets.json .

[d]/workspace> cmake --preset arm64-android-snapdragon-release -B build-snapdragon
Preset CMake variables:
  ANDROID_ABI="arm64-v8a"
  ...
  CMAKE_TOOLCHAIN_FILE="/opt/android-ndk-r28b/build/cmake/android.toolchain.cmake"
  GGML_HEXAGON="ON"
  GGML_OPENCL="ON"
  GGML_OPENMP="OFF"
  HEXAGON_SDK_ROOT="/opt/hexagon/6.4.0.2"
...
-- Including OpenCL backend
-- Including Hexagon backend
...
-- Build files have been written to: /workspace/build-snapdragon

[d]/workspace> cmake --build build-snapdragon
...
[144/356] Performing build step for 'htp-v73'
[1/16] Generating htp_iface_skel.c, htp_iface_stub.c, htp_iface.h
[2/16] Building C object CMakeFiles/ggml-htp-v73.dir/hvx-sigmoid.c.obj
[3/16] Building C object CMakeFiles/ggml-htp-v73.dir/htp-dma.c.obj
[4/16] Building C object CMakeFiles/ggml-htp-v73.dir/worker-pool.c.obj
...
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v73.so
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v75.so
...
```

To generate an installable "package", simply use `cmake --install`:

```
[d]/workspace> cmake --install build-snapdragon --prefix pkg-snapdragon/llama.cpp
-- Install configuration: "Release"
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-cpu.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-opencl.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-hexagon.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v73.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v75.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v79.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v81.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml.so
...
-- Installing: /workspace/pkg-snapdragon/llama.cpp/bin/llama-bench
-- Installing: /workspace/pkg-snapdragon/llama.cpp/bin/llama-cli
...
```

## How to Install

### Android

For this step, your device needs to be configured for on-device development.
Please see https://developer.android.com/studio/debug/dev-options for details.

Once ADB is enabled, use `adb push` to install `pkg-snapdragon` on the device.
**Note that the toolchain Docker image doesn't have ADB and doesn't set up the ADB bridge. Please use native ADB on the host.**

```
~/src/llama.cpp$ adb push pkg-snapdragon/llama.cpp /data/local/tmp/
pkg-snapdragon/llama.cpp/bin/: 67 files pushed, 0 skipped. 190.2 MB/s (919095042 bytes in 4.607s)
pkg-snapdragon/llama.cpp/include/: 19 files pushed, 0 skipped. 20.5 MB/s (255173 bytes in 0.012s)
pkg-snapdragon/llama.cpp/lib/: 16 files pushed, 0 skipped. 144.4 MB/s (43801382 bytes in 0.289s)
102 files pushed, 0 skipped. 186.9 MB/s (963151597 bytes in 4.914s)
```

At this point, you should also install some models:

```
~/src/llama.cpp$ wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
...
2025-10-11 12:04:52 (10.7 MB/s) - ‘Llama-3.2-1B-Instruct-Q4_0.gguf’ saved [773025920/773025920]

~/src/llama.cpp$ adb push Llama-3.2-1B-Instruct-Q4_0.gguf /data/local/tmp/gguf
Llama-3.2-1B-Instruct-Q4_0.gguf: 1 file pushed, 0 skipped. 38.3 MB/s (773025920 bytes in 19.250s)
```

### Windows

All artifacts are already installed in the `pkg-snapdragon` folder.
To run, adapt the instructions below to use the PowerShell scripts in `scripts/snapdragon/windows`.

## How to Run

The easiest way to run the llama.cpp CLI tools is via the provided wrapper scripts, which set up all required environment variables.

llama.cpp supports three backends on Snapdragon-based devices: CPU, Adreno GPU (OpenCL), and Hexagon NPU (HTP0-4).
You can select which backend to run the model on using the `D=` variable, which maps to the `--device` option.

The Hexagon NPU behaves as a "GPU" device when it comes to `-ngl` and other offload-related options.

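As a minimal sketch (illustrative only, not the wrapper scripts' actual code), the `D=` to `--device` mapping amounts to:

```shell
# Illustrative sketch: assemble the device arguments the way a wrapper might.
# Because the Hexagon NPU counts as a "GPU" for offload options, -ngl 99
# offloads all layers to the selected device(s).
device_args() {
    printf '%s' "--device ${D:-HTP0} -ngl 99"
}

D=HTP0,HTP1 device_args   # prints: --device HTP0,HTP1 -ngl 99
```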
Here are some examples of running various llama.cpp tools via ADB.

Simple question for Llama-3.2-1B:

```
~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-completion.sh -p "what is the most popular cookie in the world?"
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb4000072c7955e50
...
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors:          CPU model buffer size =   225.49 MiB
load_tensors:         HTP0 model buffer size =     0.26 MiB
load_tensors:  HTP0-REPACK model buffer size =   504.00 MiB
...
I hope this helps you understand the world's most popular cookies! [end of text]
...
llama_perf_sampler_print:    sampling time =      30.08 ms /   487 runs   (    0.06 ms per token, 16191.77 tokens per second)
llama_perf_context_print:        load time =     617.94 ms
llama_perf_context_print: prompt eval time =      80.76 ms /    11 tokens (    7.34 ms per token,   136.21 tokens per second)
llama_perf_context_print:        eval time =    9210.59 ms /   475 runs   (   19.39 ms per token,    51.57 tokens per second)
llama_perf_context_print:       total time =    9454.92 ms /   486 tokens
llama_perf_context_print:    graphs reused =        473
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                  439 =   225 +     136 +      77                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                  504 =   504 +       0 +       0                |
```

Summary request for OLMoE-1B-7B. This is a large model that requires two HTP sessions/devices:

```
~/src/llama.cpp$ M=OLMoE-1B-7B-0125-Instruct-Q4_0.gguf NDEV=2 D=HTP0,HTP1 ./scripts/snapdragon/adb/run-completion.sh -f surfing.txt
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v81
ggml-hex: allocating new session: HTP0
ggml-hex: allocating new session: HTP1
...
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors:          CPU model buffer size =   143.86 MiB
load_tensors:         HTP1 model buffer size =     0.23 MiB
load_tensors:  HTP1-REPACK model buffer size =  1575.00 MiB
load_tensors:         HTP0 model buffer size =     0.28 MiB
load_tensors:  HTP0-REPACK model buffer size =  2025.00 MiB
...
llama_context:        CPU  output buffer size =     0.19 MiB
llama_kv_cache:       HTP1 KV buffer size =   238.00 MiB
llama_kv_cache:       HTP0 KV buffer size =   306.00 MiB
llama_kv_cache: size =  544.00 MiB (  8192 cells,  16 layers,  1/1 seqs), K (q8_0):  272.00 MiB, V (q8_0):  272.00 MiB
llama_context:       HTP0 compute buffer size =    15.00 MiB
llama_context:       HTP1 compute buffer size =    15.00 MiB
llama_context:        CPU compute buffer size =    24.56 MiB
...
llama_perf_context_print: prompt eval time =    1730.57 ms /   212 tokens (    8.16 ms per token,   122.50 tokens per second)
llama_perf_context_print:        eval time =    5624.75 ms /   257 runs   (   21.89 ms per token,    45.69 tokens per second)
llama_perf_context_print:       total time =    7377.33 ms /   469 tokens
llama_perf_context_print:    graphs reused =        255
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - HTP0 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - HTP1 (Hexagon)     |  2048 = 2048 + (   0 =     0 +       0 +       0) +           0 |
llama_memory_breakdown_print: |   - Host               |                  742 =   144 +     544 +      54                |
llama_memory_breakdown_print: |   - HTP1-REPACK        |                 1575 =  1575 +       0 +       0                |
llama_memory_breakdown_print: |   - HTP0-REPACK        |                 2025 =  2025 +       0 +       0                |
```

Op test for MUL_MAT:

```
~/src/llama.cpp$ HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT
...
Backend 2/3: HTP0
Device description: Hexagon
Device memory: 2048 MB (2048 MB free)
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK

~/src/llama.cpp-hexagon$ M=Llama-3.2-1B-Instruct-Q4_0.gguf ./scripts/snapdragon/adb/run-bench.sh -p 128 -n 64
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007d4b231090
| model          |       size | params | backend    | ngl | threads | n_batch | mmap |  test |           t/s |
| ---------------| ---------: | -----: | ---------- | --: | ------: | ------: | ---: | ----: | ------------: |
| llama 1B Q4_0  | 729.75 MiB | 1.24 B | HTP        |  99 |       4 |     128 |    0 | pp128 | 169.42 ± 1.75 |
| llama 1B Q4_0  | 729.75 MiB | 1.24 B | HTP        |  99 |       4 |     128 |    0 |  tg64 |  51.54 ± 1.13 |

build: 6a8cf8914 (6733)
```

## Environment variables

- `GGML_HEXAGON_NDEV=1`
  Controls the number of devices/sessions to allocate. The default is 1.
  Most quantized models under 4B fit into a single session; an 8B model needs two, and a 20B model needs four.

- `GGML_HEXAGON_NHVX=0`
  Controls the number of HVX hardware threads to use. The default is all of them (the actual number varies with the hardware version).

- `GGML_HEXAGON_HOSTBUF=1`
  Controls whether the Hexagon backend allocates host buffers. By default, all buffers except for REPACK are host buffers.
  This option is required for testing Ops that require REPACK buffers (MUL_MAT and MUL_MAT_ID).

- `GGML_HEXAGON_EXPERIMENTAL=1`
  Controls whether the Hexagon backend enables experimental features.
  This option is required for enabling/testing experimental Ops (FLASH_ATTN_EXT).

- `GGML_HEXAGON_VERBOSE=1`
  Enables verbose logging of Ops from the backend. Example output:

  ```
  ggml-hex: HTP0 graph-compute n_nodes 2
  ggml-hex: HTP0 matmul : blk.27.ffn_up.weight x ffn_norm-27 -> ffn_up-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x1
  ggml-hex: HTP0 matmul : blk.27.ffn_gate.weight x ffn_norm-27 -> ffn_gate-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x3
  ggml-hex: HTP0 graph-compute n_nodes 1
  ggml-hex: HTP0 matmul : blk.27.ffn_down.weight x ffn_gate_par-27 -> ffn_out-27 : 8192:3072 x 8192:1 -> 3072:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x0
  ggml-hex: HTP0 get-tensor result_output : data 0x7592487000 offset 0 size 513024
  ```

- `GGML_HEXAGON_PROFILE=1`
  Generates a host-side profile for the ggml-hexagon Ops.

- `GGML_HEXAGON_OPMASK=0x0`
  Allows enabling specific stages of the processing pipeline:

  - `0x1` Enable Op Queue (i.e., queuing Ops into the NPU)
  - `0x2` Enable Dynamic Quantizer (if needed for the Op)
  - `0x4` Enable Op Compute (MUL_MAT, etc.)

  Examples:

  - `GGML_HEXAGON_OPMASK=0x1 llama-completion ...` - Ops are enqueued, but NPU-side processing is stubbed out
  - `GGML_HEXAGON_OPMASK=0x3 llama-completion ...` - the NPU performs dynamic quantization and skips the rest
  - `GGML_HEXAGON_OPMASK=0x7 llama-completion ...` - full queuing and processing of Ops (the default)
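
The `GGML_HEXAGON_NDEV` sizing rule above can be sketched as a small helper (the function name and exact thresholds are illustrative, interpolated from the <4B/8B/20B guidance):

```shell
# Pick a GGML_HEXAGON_NDEV value from the model's parameter count in billions,
# per the rule of thumb: <4B fits one session, 8B needs two, 20B needs four.
ndev_for_params() {
    b=$1
    if   [ "$b" -lt 4 ]; then echo 1
    elif [ "$b" -le 8 ]; then echo 2
    else                      echo 4
    fi
}

GGML_HEXAGON_NDEV=$(ndev_for_params 8)   # 8B model -> 2 sessions
echo "$GGML_HEXAGON_NDEV"                # prints 2
```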