# llama.cpp for OpenCL

- [Background](#background)
- [OS](#os)
- [Hardware](#hardware)
- [Supported Data Types](#supported-data-types)
- [Model Preparation](#model-preparation)
- [CMake Options](#cmake-options)
- [Android](#android)
- [Windows 11 Arm64](#windows-11-arm64)
- [Linux](#linux)
- [Known Issues](#known-issues)
- [TODO](#todo)

## Background

OpenCL (Open Computing Language) is an open, royalty-free standard for cross-platform, parallel programming of diverse accelerators found in supercomputers, cloud servers, personal computers, mobile devices and embedded platforms. OpenCL specifies a programming language (based on C99) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. Similar to CUDA, OpenCL has been widely used to program GPUs and is supported by most GPU vendors.

### Llama.cpp + OpenCL

The llama.cpp OpenCL backend is designed primarily to enable llama.cpp on **Qualcomm Adreno GPUs** via OpenCL. Thanks to the portability of OpenCL, the backend can also run on certain Intel GPUs, such as those without [SYCL](/docs/backend/SYCL.md) support, although the performance is not optimal.

## OS

| OS      | Status  | Verified                                 |
|---------|---------|------------------------------------------|
| Android | Support | Snapdragon 8 Gen 3, Snapdragon 8 Elite   |
| Windows | Support | Windows 11 Arm64 with Snapdragon X Elite |
| Linux   | Support | Ubuntu 22.04 WSL2 with Intel 12700H      |

## Hardware

### Adreno GPU

**Verified devices**

| Adreno GPU                       | Status  |
|:--------------------------------:|:-------:|
| Adreno 750 (Snapdragon 8 Gen 3)  | Support |
| Adreno 830 (Snapdragon 8 Elite)  | Support |
| Adreno X85 (Snapdragon X Elite)  | Support |

> A6x GPUs with a recent driver and compiler are supported; they are usually found in IoT platforms.
> However, A6x GPUs in phones are likely not supported due to their outdated drivers and compilers.

## Supported Data Types

| DataType | Status                     |
|:--------:|:--------------------------:|
| Q4_0     | Support                    |
| Q6_K     | Support, but not optimized |
| Q8_0     | Support                    |
| MXFP4    | Support                    |

## Model Preparation

You can refer to the general [llama-quantize tool](/tools/quantize/README.md) for the steps to convert a model in Hugging Face safetensors format to GGUF with quantization.

Currently, `Q4_0` quantization is supported and optimized. To achieve the best performance on Adreno GPUs, add `--pure` to `llama-quantize` (i.e., make all weights `Q4_0`). For example,

```sh
./llama-quantize --pure ggml-model-qwen2.5-3b-f16.gguf ggml-model-qwen-3b-Q4_0.gguf Q4_0
```

Since `Q6_K` is also supported, `Q4_0` quantization without `--pure` will also work. However, the performance will be worse than with pure `Q4_0` quantization.
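
For reference, a minimal sketch of the same conversion without `--pure` (reusing the file names from the example above) looks like this; it produces a model in which some tensors fall back to other types such as `Q6_K`:

```sh
# Without --pure, llama-quantize may keep some tensors (e.g. the output tensor) in
# other types such as Q6_K; the model still runs on the OpenCL backend but is
# slower on Adreno than a pure Q4_0 model.
./llama-quantize ggml-model-qwen2.5-3b-f16.gguf ggml-model-qwen-3b-Q4_0.gguf Q4_0
```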

### `MXFP4` MoE Models

OpenAI gpt-oss models are MoE models in `MXFP4`. The quantized model will be in `MXFP4_MOE`, a mixture of `MXFP4` and `Q8_0`.
For this quantization, there is no need to specify `--pure`.
For the gpt-oss-20b model, you can directly [download](https://huggingface.co/ggml-org/gpt-oss-20b-GGUF) the quantized GGUF file in `MXFP4_MOE` from Hugging Face.

Although it is possible to quantize the gpt-oss-20b model in pure `Q4_0` (all weights in `Q4_0`), this is not recommended, since `MXFP4` has been optimized for MoE while `Q4_0` has not. In addition, accuracy is expected to degrade with such pure `Q4_0` quantization.
Hence, the default `MXFP4_MOE` quantization (see the link above) is recommended for this model.

> Note that the `Q4_0` model found [here](https://huggingface.co/unsloth/gpt-oss-20b-GGUF/blob/main/gpt-oss-20b-Q4_0.gguf) is a mixture of `Q4_0`, `Q8_0` and `MXFP4`, and gives better performance than the `MXFP4_MOE` quantization.
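
As one possible way to fetch the prequantized `MXFP4_MOE` file linked above, the Hugging Face CLI can be used; the exact GGUF file name below is an assumption, so check the repository page for the current name:

```sh
# Download the MXFP4_MOE GGUF from the ggml-org repository into ./models.
# The file name is a placeholder; use the name listed on the Hugging Face page.
huggingface-cli download ggml-org/gpt-oss-20b-GGUF gpt-oss-20b-mxfp4.gguf --local-dir ./models
```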

## CMake Options

The OpenCL backend has the following CMake options that control the behavior of the backend.

| CMake options                     | Default value | Description                                |
|:---------------------------------:|:-------------:|:-------------------------------------------|
| `GGML_OPENCL_EMBED_KERNELS`       | `ON`          | Embed OpenCL kernels into the executable.  |
| `GGML_OPENCL_USE_ADRENO_KERNELS`  | `ON`          | Use kernels optimized for Adreno.          |
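
Both options default to `ON` and usually need no changes. As an illustrative sketch (not a required configuration), a build targeting a non-Adreno GPU, such as the Intel case mentioned in the background section, could disable the Adreno-specific kernels:

```sh
# Hypothetical configure step: enable the OpenCL backend, keep the kernels embedded,
# and turn off the Adreno-tuned kernels for a non-Adreno GPU.
cmake .. -G Ninja \
  -DGGML_OPENCL=ON \
  -DGGML_OPENCL_EMBED_KERNELS=ON \
  -DGGML_OPENCL_USE_ADRENO_KERNELS=OFF
```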

## Android

Ubuntu 22.04 is used for targeting Android. Make sure the following tools are accessible from the command line:

* Git
* CMake 3.29
* Ninja
* Python3
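
A quick, optional sanity check (a sketch, assuming the tools were installed through your package manager of choice) that everything is reachable from the shell:

```sh
# Print the versions of the required tools; any of these failing means the tool
# is missing from PATH.
git --version && cmake --version && ninja --version && python3 --version
```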

### I. Setup Environment

1. **Install NDK**

```sh
cd ~
wget https://dl.google.com/android/repository/commandlinetools-linux-8512546_latest.zip && \
unzip commandlinetools-linux-8512546_latest.zip && \
mkdir -p ~/android-sdk/cmdline-tools && \
mv cmdline-tools latest && \
mv latest ~/android-sdk/cmdline-tools/ && \
rm -rf commandlinetools-linux-8512546_latest.zip

yes | ~/android-sdk/cmdline-tools/latest/bin/sdkmanager "ndk;26.3.11579264"
```

2. **Install OpenCL Headers and Library**

```sh
mkdir -p ~/dev/llm
cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-Headers && \
cd OpenCL-Headers && \
cp -r CL ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include

cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && \
cd OpenCL-ICD-Loader && \
mkdir build_ndk26 && cd build_ndk26 && \
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
  -DOPENCL_ICD_LOADER_HEADERS_DIR=$HOME/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=24 \
  -DANDROID_STL=c++_shared && \
ninja && \
cp libOpenCL.so ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
```

### II. Build llama.cpp

```sh
cd ~/dev/llm

git clone https://github.com/ggml-org/llama.cpp && \
cd llama.cpp && \
mkdir build-android && cd build-android

cmake .. -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_OPENCL=ON

ninja
```
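
After the build, a minimal sketch of trying the result on a connected device over `adb` (the device directory, model file, and prompt below are assumptions, not part of the build itself):

```sh
# Push llama-cli and a quantized model to the device, then run with all layers
# offloaded to the GPU. Paths and the model file name are placeholders.
adb push bin/llama-cli /data/local/tmp/
adb push ~/models/ggml-model-qwen-3b-Q4_0.gguf /data/local/tmp/
# Depending on the device, LD_LIBRARY_PATH may need to point at the directory
# containing the vendor's OpenCL driver library (e.g. /vendor/lib64).
adb shell 'cd /data/local/tmp && LD_LIBRARY_PATH=/vendor/lib64 ./llama-cli -m ggml-model-qwen-3b-Q4_0.gguf -ngl 99 -p "Hello"'
```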

## Windows 11 Arm64

A Snapdragon X Elite device with Windows 11 Arm64 is used. Make sure the following tools are accessible from the command line:

* Git
* CMake 3.29
* Clang 19
* Ninja
* Visual Studio 2022
* PowerShell 7
* Python

Visual Studio provides the necessary headers and libraries, although it is not directly used for building.
Alternatively, Visual Studio Build Tools can be installed instead of the full Visual Studio.

> Note that building with Visual Studio's cl compiler is not supported; Clang must be used. Clang depends on libraries provided by Visual Studio (or the Build Tools), which is why one of them must be installed.

PowerShell 7 is used for the following commands.
If an older version of PowerShell is used, these commands may not work as they are.

### I. Setup Environment

1. **Install OpenCL Headers and Library**

```powershell
mkdir -p ~/dev/llm

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
mkdir build && cd build
cmake .. -G Ninja `
  -DBUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
mkdir build && cd build
cmake .. -G Ninja `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install
```

### II. Build llama.cpp

```powershell
mkdir -p ~/dev/llm
cd ~/dev/llm

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
mkdir build && cd build

cmake .. -G Ninja `
  -DCMAKE_TOOLCHAIN_FILE="$HOME/dev/llm/llama.cpp/cmake/arm64-windows-llvm.cmake" `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
  -DBUILD_SHARED_LIBS=OFF `
  -DGGML_OPENCL=ON
ninja
```

## Linux

The same two steps apply to Linux. The commands largely mirror the Windows PowerShell commands, except that the second step omits the `-DCMAKE_TOOLCHAIN_FILE` option and line continuations use backslashes instead of backticks.

If not installed already, install Git, CMake, Clang, Ninja and Python, then run the following in the terminal:

### I. Setup Environment

1. **Install OpenCL Headers and Library**

```bash
mkdir -p ~/dev/llm

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
mkdir build && cd build
cmake .. -G Ninja \
  -DBUILD_TESTING=OFF \
  -DOPENCL_HEADERS_BUILD_TESTING=OFF \
  -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
mkdir build && cd build
cmake .. -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install
```

### II. Build llama.cpp

```bash
mkdir -p ~/dev/llm
cd ~/dev/llm

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
mkdir build && cd build

cmake .. -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_OPENCL=ON
ninja
```
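
After the build, a quick sketch for checking that the OpenCL backend is picked up (the model path is a placeholder; the selected OpenCL platform and device should appear in the startup log):

```sh
# Run a model with all layers offloaded to the GPU; the startup log should
# report the OpenCL platform and device in use. The model path is a placeholder.
./bin/llama-cli -m ~/models/ggml-model-qwen-3b-Q4_0.gguf -ngl 99 -p "Hello"
```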

## Known Issues

- Flash attention does not always improve performance.
- Currently, the OpenCL backend works on A6x GPUs with recent drivers and compilers (usually found in IoT platforms).
  However, it does not work on A6x GPUs found in phones with old drivers and compilers.

## TODO

- Optimization for Q6_K
- Support and optimization for Q4_K
- Improve flash attention