# Snapdragon-based devices

## Setup

### Android

The easiest way to build llama.cpp for a Snapdragon-based Android device is to use the toolchain Docker image (see github.com/snapdragon-toolchain).
This image includes the Android NDK, OpenCL SDK, Hexagon SDK, CMake, etc.

This method works on Linux, macOS, and Windows. macOS and Windows users should install Docker Desktop.

```
~/src/llama.cpp$ docker run -it -u $(id -u):$(id -g) --volume $(pwd):/workspace --platform linux/amd64 ghcr.io/snapdragon-toolchain/arm64-android:v0.3
[d]/> cd /workspace
```

Note: The rest of the **Android** build process assumes that you're running inside the toolchain container.

### Windows on Snapdragon

Native Windows 11 arm64 builds have the following tool dependencies:
- MS Visual Studio 2026 (Community Edition or Pro)
  - MSVC arm64 standard and runtime libraries
  - UCRT and Driver Kit
- LLVM core libraries and Clang compiler (winget)
- CMake, Git, Python (winget)
- Hexagon SDK Community Edition 6.4 or later (see windows.md)
- OpenCL SDK 2.3 or later (see windows.md)

Note: The rest of the **Windows** build process assumes that you're running natively in PowerShell.
Adapt the build commands below accordingly.
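The winget-based dependencies above can be installed from a PowerShell prompt. A minimal sketch; the package IDs below are assumptions, not taken from the official instructions, so verify them with `winget search` first:

```shell
# Assumed winget package IDs -- verify with `winget search <name>` before running.
winget install LLVM.LLVM           # LLVM core libraries and Clang compiler
winget install Kitware.CMake       # CMake
winget install Git.Git             # Git
winget install Python.Python.3.12  # Python
```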

## How to Build

Let's build llama.cpp with the CPU, OpenCL, and Hexagon backends via CMake presets:

```
[d]/workspace> cp docs/backend/hexagon/CMakeUserPresets.json .

[d]/workspace> cmake --preset arm64-android-snapdragon-release -B build-snapdragon
Preset CMake variables:
  ANDROID_ABI="arm64-v8a"
  ...
  CMAKE_TOOLCHAIN_FILE="/opt/android-ndk-r28b/build/cmake/android.toolchain.cmake"
  GGML_HEXAGON="ON"
  GGML_OPENCL="ON"
  GGML_OPENMP="OFF"
  HEXAGON_SDK_ROOT="/opt/hexagon/6.4.0.2"
...
-- Including OpenCL backend
-- Including Hexagon backend
...
-- Build files have been written to: /workspace/build-snapdragon

[d]/workspace> cmake --build build-snapdragon
...
[144/356] Performing build step for 'htp-v73'
[1/16] Generating htp_iface_skel.c, htp_iface_stub.c, htp_iface.h
[2/16] Building C object CMakeFiles/ggml-htp-v73.dir/hvx-sigmoid.c.obj
[3/16] Building C object CMakeFiles/ggml-htp-v73.dir/htp-dma.c.obj
[4/16] Building C object CMakeFiles/ggml-htp-v73.dir/worker-pool.c.obj
...
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v73.so
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v75.so
...
```

To generate an installable "package", simply use `cmake --install`:

```
[d]/workspace> cmake --install build-snapdragon --prefix pkg-snapdragon/llama.cpp
-- Install configuration: "Release"
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-cpu.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-opencl.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-hexagon.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v73.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v75.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v79.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v81.so
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml.so
...
-- Installing: /workspace/pkg-snapdragon/llama.cpp/bin/llama-bench
-- Installing: /workspace/pkg-snapdragon/llama.cpp/bin/llama-cli
...
```

## How to Install

### Android

For this step, your device needs to be configured for on-device development.
Please see https://developer.android.com/studio/debug/dev-options for details.

Once ADB is enabled, use `adb push` to install `pkg-snapdragon` on the device.
**Note that the toolchain Docker image doesn't have ADB and doesn't set up the ADB bridge. Please use native ADB on the host.**

```
~/src/llama.cpp$ adb push pkg-snapdragon/llama.cpp /data/local/tmp/
pkg-snapdragon/llama.cpp/bin/: 67 files pushed, 0 skipped. 190.2 MB/s (919095042 bytes in 4.607s)
pkg-snapdragon/llama.cpp/include/: 19 files pushed, 0 skipped. 20.5 MB/s (255173 bytes in 0.012s)
pkg-snapdragon/llama.cpp/lib/: 16 files pushed, 0 skipped. 144.4 MB/s (43801382 bytes in 0.289s)
102 files pushed, 0 skipped. 186.9 MB/s (963151597 bytes in 4.914s)
```
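To confirm the install, you can list the pushed binaries on the device; the destination path follows from the `adb push` command above:

```shell
# List the llama.cpp binaries installed under /data/local/tmp on the device.
adb shell ls /data/local/tmp/llama.cpp/bin
```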

At this point, you should also install some models:

```
~/src/llama.cpp$ wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
...
2025-10-11 12:04:52 (10.7 MB/s) - ‘Llama-3.2-1B-Instruct-Q4_0.gguf’ saved [773025920/773025920]

~/src/llama.cpp$ adb push Llama-3.2-1B-Instruct-Q4_0.gguf /data/local/tmp/gguf
Llama-3.2-1B-Instruct-Q4_0.gguf: 1 file pushed, 0 skipped. 38.3 MB/s (773025920 bytes in 19.250s)
```

### Windows

All artifacts are already installed in the `pkg-snapdragon` folder.
To run, adapt the instructions below to use the PowerShell scripts in `scripts/snapdragon/windows`.

## How to Run

The easiest way to run the llama.cpp CLI tools is via the provided wrapper scripts, which set up all required environment variables.

llama.cpp supports three backends on Snapdragon-based devices: CPU, Adreno GPU (GPUOpenCL), and Hexagon NPU (HTP0-4).
You can select which backend to run the model on using the `D=` variable, which maps to the `--device` option.

The Hexagon NPU behaves as a "GPU" device when it comes to `-ngl` and other offload-related options.

Here are some examples of running various llama.cpp tools via ADB.
A simple question for Llama-3.2-1B:

```
~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-completion.sh -p "what is the most popular cookie in the world?"
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb4000072c7955e50
...
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU model buffer size = 225.49 MiB
load_tensors: HTP0 model buffer size = 0.26 MiB
load_tensors: HTP0-REPACK model buffer size = 504.00 MiB
...
I hope this helps you understand the world's most popular cookies! [end of text]
...
llama_perf_sampler_print: sampling time = 30.08 ms / 487 runs ( 0.06 ms per token, 16191.77 tokens per second)
llama_perf_context_print: load time = 617.94 ms
llama_perf_context_print: prompt eval time = 80.76 ms / 11 tokens ( 7.34 ms per token, 136.21 tokens per second)
llama_perf_context_print: eval time = 9210.59 ms / 475 runs ( 19.39 ms per token, 51.57 tokens per second)
llama_perf_context_print: total time = 9454.92 ms / 486 tokens
llama_perf_context_print: graphs reused = 473
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - Host | 439 = 225 + 136 + 77 |
llama_memory_breakdown_print: | - HTP0-REPACK | 504 = 504 + 0 + 0 |
```

A summary request for OLMoE-1B-7B, a larger model that requires two HTP sessions/devices:

```
~/src/llama.cpp$ M=OLMoE-1B-7B-0125-Instruct-Q4_0.gguf NDEV=2 D=HTP0,HTP1 ./scripts/snapdragon/adb/run-completion.sh -f surfing.txt
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v81
ggml-hex: allocating new session: HTP0
ggml-hex: allocating new session: HTP1
...
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU model buffer size = 143.86 MiB
load_tensors: HTP1 model buffer size = 0.23 MiB
load_tensors: HTP1-REPACK model buffer size = 1575.00 MiB
load_tensors: HTP0 model buffer size = 0.28 MiB
load_tensors: HTP0-REPACK model buffer size = 2025.00 MiB
...
llama_context: CPU output buffer size = 0.19 MiB
llama_kv_cache: HTP1 KV buffer size = 238.00 MiB
llama_kv_cache: HTP0 KV buffer size = 306.00 MiB
llama_kv_cache: size = 544.00 MiB ( 8192 cells, 16 layers, 1/1 seqs), K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
llama_context: HTP0 compute buffer size = 15.00 MiB
llama_context: HTP1 compute buffer size = 15.00 MiB
llama_context: CPU compute buffer size = 24.56 MiB
...
llama_perf_context_print: prompt eval time = 1730.57 ms / 212 tokens ( 8.16 ms per token, 122.50 tokens per second)
llama_perf_context_print: eval time = 5624.75 ms / 257 runs ( 21.89 ms per token, 45.69 tokens per second)
llama_perf_context_print: total time = 7377.33 ms / 469 tokens
llama_perf_context_print: graphs reused = 255
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - HTP1 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - Host | 742 = 144 + 544 + 54 |
llama_memory_breakdown_print: | - HTP1-REPACK | 1575 = 1575 + 0 + 0 |
llama_memory_breakdown_print: | - HTP0-REPACK | 2025 = 2025 + 0 + 0 |
```

An Op test for MUL_MAT:

```
~/src/llama.cpp$ HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT
...
Backend 2/3: HTP0
Device description: Hexagon
Device memory: 2048 MB (2048 MB free)
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK

~/src/llama.cpp-hexagon$ M=Llama-3.2-1B-Instruct-Q4_0.gguf ./scripts/snapdragon/adb/run-bench.sh -p 128 -n 64
...
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
ggml-hex: Hexagon Arch version v79
ggml-hex: allocating new session: HTP0
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007d4b231090
| model         | size       | params | backend    | ngl | threads | n_batch | mmap | test  | t/s           |
| ------------- | ---------: | -----: | ---------- | --: | ------: | ------: | ---: | ----: | ------------: |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | HTP        |  99 |       4 |     128 |    0 | pp128 | 169.42 ± 1.75 |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | HTP        |  99 |       4 |     128 |    0 | tg64  |  51.54 ± 1.13 |

build: 6a8cf8914 (6733)
```

## Environment variables

- `GGML_HEXAGON_NDEV=1`
  Controls the number of devices/sessions to allocate. The default is 1.
  Most quantized models under 4B fit into a single session; an 8B model needs two, and a 20B model needs four.

- `GGML_HEXAGON_NHVX=0`
  Controls the number of HVX hardware threads to use. The default is all (the actual number varies depending on the hardware version).

- `GGML_HEXAGON_HOSTBUF=1`
  Controls whether the Hexagon backend allocates host buffers. By default, all buffers except for REPACK are host buffers.
  This option is required for testing Ops that require REPACK buffers (MUL_MAT and MUL_MAT_ID).

- `GGML_HEXAGON_EXPERIMENTAL=1`
  Controls whether the Hexagon backend enables experimental features.
  This option is required for enabling/testing experimental Ops (FLASH_ATTN_EXT).

- `GGML_HEXAGON_VERBOSE=1`
  Enables verbose logging of Ops from the backend. Example output:

  ```
  ggml-hex: HTP0 graph-compute n_nodes 2
  ggml-hex: HTP0 matmul : blk.27.ffn_up.weight x ffn_norm-27 -> ffn_up-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x1
  ggml-hex: HTP0 matmul : blk.27.ffn_gate.weight x ffn_norm-27 -> ffn_gate-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x3
  ggml-hex: HTP0 graph-compute n_nodes 1
  ggml-hex: HTP0 matmul : blk.27.ffn_down.weight x ffn_gate_par-27 -> ffn_out-27 : 8192:3072 x 8192:1 -> 3072:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x0
  ggml-hex: HTP0 get-tensor result_output : data 0x7592487000 offset 0 size 513024
  ```

- `GGML_HEXAGON_PROFILE=1`
  Generates a host-side profile for the ggml-hexagon Ops.

- `GGML_HEXAGON_OPMASK=0x0`
  Allows enabling specific stages of the processing pipeline:

  - `0x1` Enable Op Queue (i.e., queuing Ops into the NPU)
  - `0x2` Enable Dynamic Quantizer (if needed for the Op)
  - `0x4` Enable Op Compute (MUL_MAT, etc.)

  Examples:

  `GGML_HEXAGON_OPMASK=0x1 llama-completion ...` - Ops are enqueued but NPU-side processing is stubbed out
  `GGML_HEXAGON_OPMASK=0x3 llama-completion ...` - NPU performs dynamic quantization and skips the rest
  `GGML_HEXAGON_OPMASK=0x7 llama-completion ...` - Full queuing and processing of Ops (default)
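These variables can be combined on one command line. A sketch following the `llama-completion` pattern above; the model path and flags are illustrative:

```shell
# Illustrative: verbose op logging plus a reduced pipeline (enqueue +
# dynamic quantization only) for debugging a single completion run.
GGML_HEXAGON_VERBOSE=1 GGML_HEXAGON_OPMASK=0x3 \
  llama-completion -m Llama-3.2-1B-Instruct-Q4_0.gguf --device HTP0 -ngl 99 -p "hello"
```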