# llama.cpp for CANN

 - [Background](#background)
 - [News](#news)
 - [OS](#os)
 - [Hardware](#hardware)
 - [Model Supports](#model-supports)
 - [DataType Supports](#datatype-supports)
 - [Docker](#docker)
 - [Linux](#linux)
 - [Updates](#updates)
 - [Environment variable setup](#environment-variable-setup)

## Background

**Ascend NPU** is a range of AI processors built around a Neural Processing Unit. It efficiently handles matrix-matrix multiplication, dot products, and scalar operations.

**CANN** (Compute Architecture for Neural Networks) is a heterogeneous computing architecture for AI scenarios, providing support for multiple AI frameworks on the top and serving AI processors and programming at the bottom. It plays a crucial role in bridging the gap between upper and lower layers, and is a key platform for improving the computing efficiency of Ascend AI processors. Meanwhile, it offers a highly efficient and easy-to-use programming interface for diverse application scenarios, allowing users to rapidly build AI applications and services based on the Ascend platform.

**Llama.cpp + CANN**

The llama.cpp CANN backend is designed to support Ascend NPUs. It relies on AscendC and ACLNN, which are integrated into the CANN Toolkit and kernels, to use the Ascend NPU directly.

## News

- 2024.11
  - Support F16 and F32 data type models for Ascend 310P NPU.
- 2024.8
  - Support `Q4_0` and `Q8_0` data types for Ascend NPU.
- 2024.7
  - Create CANN backend for Ascend NPU.

## OS

| OS      | Status  | Verified                     |
|:-------:|:-------:|:----------------------------:|
| Linux   | Support | Ubuntu 22.04, OpenEuler22.03 |

## Hardware

### Ascend NPU

**Verified devices**

| Ascend NPU     | Status  |
|:--------------:|:-------:|
| Atlas 300T A2  | Support |
| Atlas 300I Duo | Support |

*Notes:*

- If you have trouble with your Ascend NPU device, please create an issue with the **[CANN]** prefix/tag.
- If you run llama.cpp successfully on your Ascend NPU device, please help update the table above.

## Model Supports

| Model Name                  | FP16  | Q4_0 | Q8_0 |
|:----------------------------|:-----:|:----:|:----:|
| Llama-2                     | √     | √    | √    |
| Llama-3                     | √     | √    | √    |
| Mistral-7B                  | √     | √    | √    |
| Mistral MoE                 | √     | √    | √    |
| DBRX                        | -     | -    | -    |
| Falcon                      | √     | √    | √    |
| Chinese LLaMA/Alpaca        | √     | √    | √    |
| Vigogne (French)            | √     | √    | √    |
| BERT                        | x     | x    | x    |
| Koala                       | √     | √    | √    |
| Baichuan                    | √     | √    | √    |
| Aquila 1 & 2                | √     | √    | √    |
| Starcoder models            | √     | √    | √    |
| Refact                      | √     | √    | √    |
| MPT                         | √     | √    | √    |
| Bloom                       | √     | √    | √    |
| Yi models                   | √     | √    | √    |
| stablelm models             | √     | √    | √    |
| DeepSeek models             | x     | x    | x    |
| Qwen models                 | √     | √    | √    |
| PLaMo-13B                   | √     | √    | √    |
| Phi models                  | √     | √    | √    |
| PhiMoE                      | √     | √    | √    |
| GPT-2                       | √     | √    | √    |
| Orion                       | √     | √    | √    |
| InternLM2                   | √     | √    | √    |
| CodeShell                   | √     | √    | √    |
| Gemma                       | √     | √    | √    |
| Mamba                       | √     | √    | √    |
| Xverse                      | √     | √    | √    |
| command-r models            | √     | √    | √    |
| Grok-1                      | -     | -    | -    |
| SEA-LION                    | √     | √    | √    |
| GritLM-7B                   | √     | √    | √    |
| OLMo                        | √     | √    | √    |
| OLMo 2                      | √     | √    | √    |
| OLMoE                       | √     | √    | √    |
| Granite models              | √     | √    | √    |
| GPT-NeoX                    | √     | √    | √    |
| Pythia                      | √     | √    | √    |
| Snowflake-Arctic MoE        | -     | -    | -    |
| Smaug                       | √     | √    | √    |
| Poro 34B                    | √     | √    | √    |
| Bitnet b1.58 models         | √     | x    | x    |
| Flan-T5                     | √     | √    | √    |
| OpenELM models              | x     | √    | √    |
| ChatGLM3-6B + ChatGLM4-9B + GLMEdge-1.5B + GLMEdge-4B | √ | √ | √ |
| GLM-4-0414                  | √     | √    | √    |
| SmolLM                      | √     | √    | √    |
| EXAONE-3.0-7.8B-Instruct    | √     | √    | √    |
| FalconMamba models          | √     | √    | √    |
| Jais models                 | -     | x    | x    |
| Bielik-11B-v2.3             | √     | √    | √    |
| RWKV-6                      | -     | √    | √    |
| QRWKV-6                     | √     | √    | √    |
| GigaChat-20B-A3B            | x     | x    | x    |
| Trillion-7B-preview         | √     | √    | √    |
| Ling models                 | √     | √    | √    |

**Multimodal**

| Model Name                         | FP16  | Q4_0 | Q8_0 |
|:-----------------------------------|:-----:|:----:|:----:|
| LLaVA 1.5 models, LLaVA 1.6 models | x     | x    | x    |
| BakLLaVA                           | √     | √    | √    |
| Obsidian                           | √     | -    | -    |
| ShareGPT4V                         | x     | -    | -    |
| MobileVLM 1.7B/3B models           | -     | -    | -    |
| Yi-VL                              | -     | -    | -    |
| MiniCPM                            | √     | √    | √    |
| Moondream                          | √     | √    | √    |
| Bunny                              | √     | -    | -    |
| GLM-EDGE                           | √     | √    | √    |
| Qwen2-VL                           | √     | √    | √    |

## DataType Supports

| DataType               | Status  |
|:----------------------:|:-------:|
| FP16                   | Support |
| Q8_0                   | Support |
| Q4_0                   | Support |

## Docker

### Build Images

You can get an image with llama.cpp in one command.

```sh
docker build -t llama-cpp-cann -f .devops/llama-cli-cann.Dockerfile .
```

### Run container

```sh
# Find all cards.
npu-smi info

# Select the cards that you want to use; make sure these cards are not in use by other processes.
# The following example uses the card of device 0.
docker run --name llamacpp --device /dev/davinci0 --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info -v /PATH_TO_YOUR_MODELS/:/app/models -it llama-cpp-cann -m /app/models/MODEL_PATH -ngl 32 -p "Building a website can be done in 10 simple steps:"
```

*Notes:*

- You may need to install the Ascend Driver and firmware on the **host** machine *(please refer to the [Linux configuration](#linux) for details)*.

## Linux

### I. Setup Environment

1. **Install Ascend Driver**

   ```sh
   # Create the driver running user.
   sudo groupadd HwHiAiUser
   sudo useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash
   sudo usermod -aG HwHiAiUser $USER

   # Download the driver from https://www.hiascend.com/hardware/firmware-drivers/community according to your system
   # and install it.
   sudo sh Ascend-hdk-910b-npu-driver_x.x.x_linux-{arch}.run --full --install-for-all
   ```

   Once installed, run `npu-smi info` to check whether the driver is installed successfully.

   ```sh
   +-------------------------------------------------------------------------------------------+
   | npu-smi 24.1.rc2                 Version: 24.1.rc2                                         |
   +----------------------+---------------+----------------------------------------------------+
   | NPU   Name           | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
   | Chip                 | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
   +======================+===============+====================================================+
   | 2     xxx            | OK            | 64.4        51                15   / 15            |
   | 0                    | 0000:01:00.0  | 0           1873 / 15077      0    / 32768         |
   +======================+===============+====================================================+
   | 5     xxx            | OK            | 64.0        52                15   / 15            |
   | 0                    | 0000:81:00.0  | 0           1874 / 15077      0    / 32768         |
   +======================+===============+====================================================+
   | No running processes found in NPU 2                                                       |
   +======================+===============+====================================================+
   | No running processes found in NPU 5                                                       |
   +======================+===============+====================================================+
   ```

2. **Install Ascend Firmware**

   ```sh
   # Download the firmware from https://www.hiascend.com/hardware/firmware-drivers/community according to your system
   # and install it.
   sudo sh Ascend-hdk-910b-npu-firmware_x.x.x.x.X.run --full
   ```

   If the following message appears, the firmware is installed successfully.

   ```sh
   Firmware package installed successfully!
   ```

3. **Install CANN toolkit and kernels**

   The CANN toolkit and kernels can be obtained from the official [CANN Toolkit](https://www.hiascend.com/zh/developer/download/community/result?module=cann) page.

   Please download the version that matches your system. The minimum version required is 8.0.RC2.alpha002; the install commands are shown below.

   ```sh
   pip3 install attrs numpy decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions
   sh Ascend-cann-toolkit_8.0.RC2.alpha002_linux-aarch64.run --install
   sh Ascend-cann-kernels-910b_8.0.RC2.alpha002_linux.run --install
   ```

   Set the Ascend environment variables:

   ```sh
   echo "source ~/Ascend/ascend-toolkit/set_env.sh" >> ~/.bashrc
   source ~/.bashrc
   ```

Upon a successful installation, CANN is enabled for the available Ascend devices.

### II. Build llama.cpp

```sh
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release
cmake --build build --config release
```

### III. Run the inference

1. **Retrieve and prepare model**

   You can refer to the general [*Prepare and Quantize*](../../README.md#prepare-and-quantize) guide for model preparation.

   **Notes**:

   - The CANN backend currently only supports FP16/Q4_0/Q8_0 models.

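   For example, a hedged sketch of producing a Q4_0 model from an FP16 GGUF file with the `llama-quantize` tool built in the previous step (file names are placeholders):

   ```sh
   # Quantize an FP16 GGUF model to Q4_0, one of the types the CANN backend supports.
   ./build/bin/llama-quantize path_to_model_f16.gguf path_to_model_q4_0.gguf Q4_0
   ```
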
2. **Launch inference**

   There are two device selection modes:

   - Single device: Use one device target specified by the user.
   - Multiple devices: Automatically choose the devices with the same backend.

   | Device selection | Parameter                              |
   |:----------------:|:--------------------------------------:|
   | Single device    | --split-mode none --main-gpu DEVICE_ID |
   | Multiple devices | --split-mode layer (default)           |

   Examples:

   - Use device 0:

     ```sh
     ./build/bin/llama-cli -m path_to_model -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
     ```

   - Use multiple devices:

     ```sh
     ./build/bin/llama-cli -m path_to_model -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
     ```

### GitHub contribution

Please add the **[CANN]** prefix/tag in issue/PR titles to help the CANN team check and address them without delay.

## Updates

### Basic Flash Attention Support

A basic flash attention (FA) kernel built on aclnn operators has been added in `aclnn_ops.cpp`. Currently, FA only supports the cases with FP16 KV tensors and no logit softcap. Since the aclnn interface for flash attention cannot support logit softcap, we will only update the quantized version in the future.

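As a sketch, FA can be requested from the llama-cli command line with the `-fa` flag (its exact form may vary across llama.cpp versions):

```sh
# Run with flash attention enabled; the CANN FA path requires FP16 KV tensors.
./build/bin/llama-cli -m path_to_model -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -fa
```
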
Authors from Peking University: Bizhao Shi (bshi@pku.edu.cn), Yuxin Yang (yxyang@pku.edu.cn), Ruiyang Ma (ruiyang@stu.pku.edu.cn), and Guojie Luo (gluo@pku.edu.cn).

We would like to thank Tuo Dai, Shanni Li, and all of the project maintainers from Huawei Technologies Co., Ltd for their help during the code development and pull request.

## Environment variable setup

### GGML_CANN_MEM_POOL

Specifies the memory pool management strategy. Default is vmm.

- vmm: Utilizes a virtual memory manager pool. If hardware support for VMM is unavailable, falls back to the legacy (leg) memory pool.

- prio: Employs a priority queue-based memory pool management.

- leg: Uses a fixed-size buffer pool.

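For example, a one-off run that selects the priority queue-based pool:

```sh
# Use the prio pool for this invocation only.
GGML_CANN_MEM_POOL=prio ./build/bin/llama-cli -m path_to_model -p "Building a website can be done in 10 simple steps:" -ngl 33
```
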
### GGML_CANN_DISABLE_BUF_POOL_CLEAN

Controls automatic cleanup of the memory pool. This option is only effective when using the prio or leg memory pool strategies.

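A minimal sketch, assuming a non-zero value disables the cleanup:

```sh
# Assumed usage: disable automatic buffer pool cleanup.
export GGML_CANN_DISABLE_BUF_POOL_CLEAN=1
```
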
### GGML_CANN_WEIGHT_NZ

Converts the matmul weight format from ND to NZ to improve performance. Enabled by default.

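Since it is enabled by default, a hedged sketch of turning it off (assuming 0 disables the conversion):

```sh
# Assumed usage: keep matmul weights in ND format instead of converting to NZ.
export GGML_CANN_WEIGHT_NZ=0
```
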
### GGML_CANN_ACL_GRAPH

Operators are executed using ACL graph execution, rather than in op-by-op (eager) mode. Enabled by default. This option is only effective if `USE_ACL_GRAPH` was enabled at compilation time. To enable it, recompile using:

```sh
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release -DUSE_ACL_GRAPH=ON
cmake --build build --config release
```

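A hedged sketch of switching back to eager mode at runtime (assuming 0 disables graph execution):

```sh
# Assumed usage: run op-by-op (eager) instead of ACL graph execution.
export GGML_CANN_ACL_GRAPH=0
```
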
### GGML_CANN_GRAPH_CACHE_CAPACITY

Maximum number of compiled CANN graphs kept in the LRU cache; the default is 12. When the number of cached graphs exceeds this capacity, the least recently used graph will be evicted.

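For example, to allow a larger cache:

```sh
# Keep up to 24 compiled graphs before evicting the least recently used one.
export GGML_CANN_GRAPH_CACHE_CAPACITY=24
```
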
### GGML_CANN_PREFILL_USE_GRAPH

Enables ACL graph execution during the prefill stage; the default is false. This option is only effective when flash attention (FA) is enabled.

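A minimal sketch of enabling it (assuming a boolean-style value):

```sh
# Assumed usage: use ACL graph execution during prefill as well (requires FA).
export GGML_CANN_PREFILL_USE_GRAPH=1
```
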
### GGML_CANN_OPERATOR_FUSION

Enables operator fusion during computation; the default is false. This option fuses compatible operators (e.g., ADD + RMS_NORM) to reduce overhead and improve performance.
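
A minimal sketch of enabling it (assuming a boolean-style value):

```sh
# Assumed usage: fuse compatible operators such as ADD + RMS_NORM.
export GGML_CANN_OPERATOR_FUSION=1
```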