# Hexagon backend developer details

## Backend libraries

The Hexagon backend consists of two parts:

 - `libggml-hexagon`
   This is the regular CPU-side GGML backend library, either shared or statically linked.

 - `libggml-htp-vNN`
   This is the NPU-side shared library (HTP stands for Hexagon Tensor Processor) that contains the Op dispatcher and kernels.
   The correct library is selected automatically at runtime based on the HW version.

Here is an example of the build artifacts:
```
~/src/llama.cpp$ ls -l pkg-adb/llama.cpp/lib/libggml*
pkg-adb/llama.cpp/lib/libggml-base.so
pkg-adb/llama.cpp/lib/libggml-cpu.so
pkg-adb/llama.cpp/lib/libggml-hexagon.so <<< CPU library
pkg-adb/llama.cpp/lib/libggml-htp-v73.so <<< HTP op/kernels for Hexagon v73
pkg-adb/llama.cpp/lib/libggml-htp-v75.so
pkg-adb/llama.cpp/lib/libggml-htp-v79.so
pkg-adb/llama.cpp/lib/libggml-htp-v81.so
```
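
The `vNN` suffix corresponds to the Hexagon architecture version of the device. As a minimal sketch of the runtime selection, assuming the version number has already been queried through the Hexagon SDK (the hardcoded value below is a stand-in for that query):

```cpp
#include <cstdio>
#include <string>

// Sketch only: derive the HTP library name from the Hexagon arch version.
static std::string htp_library_name(int arch_version) {
    char name[64];
    std::snprintf(name, sizeof(name), "libggml-htp-v%d.so", arch_version);
    return name; // e.g. "libggml-htp-v79.so"
}

int main() {
    int arch_version = 79; // placeholder for the actual HW version query
    std::printf("%s\n", htp_library_name(arch_version).c_str());
    return 0;
}
```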

## Memory buffers

The Hexagon NPU backend takes advantage of Snapdragon's unified memory model, where all buffers are fully accessible by the CPU and the NPU.
The NPU does have a dedicated tightly-coupled memory called VTCM, but that memory is used only for intermediate data (e.g. dynamically
quantized tensors) or temporary data (chunks of the weight tensors fetched via DMA).

Please note that the Hexagon backend currently does not implement the SET/GET_ROWS Ops, because there is no advantage in offloading those
to the NPU at this point.
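
In GGML terms, a backend opts out of an Op by rejecting it in its `supports_op` device callback, which keeps that Op on the CPU. A hedged sketch of how this could look (the function name is illustrative, not the actual Hexagon implementation):

```cpp
// Illustrative sketch, not the actual Hexagon implementation: the
// supports_op device callback rejects SET_ROWS/GET_ROWS so the scheduler
// keeps those Ops on the CPU backend.
#include "ggml.h"
#include "ggml-backend.h"

static bool hexagon_device_supports_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
    GGML_UNUSED(dev);
    switch (op->op) {
        case GGML_OP_SET_ROWS:
        case GGML_OP_GET_ROWS:
            return false; // no advantage in offloading these to the NPU
        default:
            break;
    }
    // ... checks for the Ops and datatypes the backend does support ...
    return true;
}
```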

The backend does allocate non-host buffers for tensors with datatypes that require repacking: Q4_0, Q8_0, MXFP4.
From the MMU perspective these buffers are still regular buffers (normal access by the CPU); they are marked as non-host simply to force
the repacking.
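
A hedged sketch of that mechanism, with made-up function names (the callback interface is the one declared in `ggml-backend-impl.h`): the buffer type reports `is_host = false`, which routes weight uploads through the buffer's `set_tensor` callback, and that is where the repack can happen.

```cpp
// Illustrative sketch with made-up names: reporting is_host = false makes
// ggml route tensor uploads through set_tensor instead of writing into the
// mapping directly, giving the backend a chance to repack the blocks.
#include "ggml-backend-impl.h"

static bool hexagon_repack_buft_is_host(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return false; // the CPU can still access the memory; this only forces set_tensor
}

static void hexagon_repack_buffer_set_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
                                             const void * data, size_t offset, size_t size) {
    GGML_UNUSED(buffer);
    GGML_UNUSED(offset);
    GGML_UNUSED(size);
    // repack Q4_0 / Q8_0 / MXFP4 blocks from `data` into tensor->data here
    GGML_UNUSED(tensor);
    GGML_UNUSED(data);
}
```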

## Large model handling

A Hexagon NPU session (aka Process Domain (PD) in the Hexagon docs) is limited to a memory mapping of around 3.5GB.
In llama.cpp/GGML each Hexagon session is mapped to a single GGML backend device (HTP0, HTP1, etc).

In order to map models larger than 3.5GB we need to allocate multiple devices and split the model across them.
For this we take advantage of the llama.cpp/GGML multi-GPU layer-splitting support; see the sketch below.
Each Hexagon device behaves like a GPU from the offload and model-splitting perspective.
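
The same split can be requested programmatically. Below is a hedged sketch against the public llama.cpp API, assuming the HTP devices have been registered (e.g. via `GGML_HEXAGON_NDEV=4` as in the run below); the model path and device names simply mirror that run:

```cpp
#include "llama.h"
#include "ggml-backend.h"
#include <vector>

int main() {
    llama_backend_init();

    // Look up the Hexagon devices by name, as --device does in llama-cli.
    std::vector<ggml_backend_dev_t> devices;
    for (const char * name : {"HTP0", "HTP1", "HTP2", "HTP3"}) {
        if (ggml_backend_dev_t dev = ggml_backend_dev_by_name(name)) {
            devices.push_back(dev);
        }
    }
    devices.push_back(nullptr); // the device list must be NULL-terminated

    llama_model_params mparams = llama_model_default_params();
    mparams.devices      = devices.data();
    mparams.split_mode   = LLAMA_SPLIT_MODE_LAYER; // split by layers, as with multi-GPU
    mparams.n_gpu_layers = 99;                     // offload all layers (like -ngl 99)

    llama_model * model = llama_model_load_from_file("gpt-oss-20b-Q4_0.gguf", mparams);
    if (!model) {
        llama_backend_free();
        return 1;
    }

    // ... create a context and run inference as usual ...

    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```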

Here is an example of running the GPT-OSS-20B model on a newer Snapdragon device with 16GB of DDR.

```
M=gpt-oss-20b-Q4_0.gguf NDEV=4 D=HTP0,HTP1,HTP2,HTP3 P=surfing.txt scripts/snapdragon/adb/run-completion.sh -f surfing.txt -n 32
...
LD_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib
ADSP_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib
GGML_HEXAGON_NDEV=4 ./bin/llama-cli --no-mmap -m /data/local/tmp/llama.cpp/../gguf/gpt-oss-20b-Q4_0.gguf
 -t 4 --ctx-size 8192 --batch-size 128 -ctk q8_0 -ctv q8_0 -fa on -ngl 99 --device HTP0,HTP1,HTP2,HTP3 -no-cnv -f surfing.txt
...
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q4_0: 96 tensors
llama_model_loader: - type q8_0: 2 tensors
llama_model_loader: - type mxfp4: 72 tensors
...
load_tensors: offloaded 25/25 layers to GPU
load_tensors: CPU model buffer size = 1182.09 MiB
load_tensors: HTP1 model buffer size = 6.64 MiB
load_tensors: HTP1-REPACK model buffer size = 2505.94 MiB
load_tensors: HTP3 model buffer size = 5.55 MiB
load_tensors: HTP3-REPACK model buffer size = 2088.28 MiB
load_tensors: HTP0 model buffer size = 7.75 MiB
load_tensors: HTP0-REPACK model buffer size = 2923.59 MiB
load_tensors: HTP2 model buffer size = 6.64 MiB
load_tensors: HTP2-REPACK model buffer size = 2505.94 MiB
...
llama_context: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 8192 cells
llama_kv_cache: HTP1 KV buffer size = 25.50 MiB
llama_kv_cache: HTP3 KV buffer size = 25.50 MiB
llama_kv_cache: HTP0 KV buffer size = 25.50 MiB
llama_kv_cache: HTP2 KV buffer size = 25.50 MiB
llama_kv_cache: size = 102.00 MiB ( 8192 cells, 12 layers, 1/1 seqs), K (q8_0): 51.00 MiB, V (q8_0): 51.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 256 cells
llama_kv_cache: HTP1 KV buffer size = 0.80 MiB
llama_kv_cache: HTP3 KV buffer size = 0.53 MiB
llama_kv_cache: HTP0 KV buffer size = 1.06 MiB
llama_kv_cache: HTP2 KV buffer size = 0.80 MiB
llama_kv_cache: size = 3.19 MiB ( 256 cells, 12 layers, 1/1 seqs), K (q8_0): 1.59 MiB, V (q8_0): 1.59 MiB
llama_context: HTP0 compute buffer size = 16.06 MiB
llama_context: HTP1 compute buffer size = 16.06 MiB
llama_context: HTP2 compute buffer size = 16.06 MiB
llama_context: HTP3 compute buffer size = 16.06 MiB
llama_context: CPU compute buffer size = 98.19 MiB
...
llama_perf_context_print: prompt eval time = 3843.67 ms / 197 tokens ( 19.51 ms per token, 51.25 tokens per second)
llama_perf_context_print: eval time = 1686.13 ms / 31 runs ( 54.39 ms per token, 18.39 tokens per second)
llama_perf_context_print: total time = 6266.30 ms / 228 tokens
llama_perf_context_print: graphs reused = 30
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - HTP1 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - HTP2 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - HTP3 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
llama_memory_breakdown_print: | - Host | 1476 = 1208 + 105 + 162 |
llama_memory_breakdown_print: | - HTP1-REPACK | 2505 = 2505 + 0 + 0 |
llama_memory_breakdown_print: | - HTP3-REPACK | 2088 = 2088 + 0 + 0 |
llama_memory_breakdown_print: | - HTP0-REPACK | 2923 = 2923 + 0 + 0 |
llama_memory_breakdown_print: | - HTP2-REPACK | 2505 = 2505 + 0 + 0 |
```