# Build llama.cpp locally

The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](../include/llama.h).

The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server.

**To get the Code:**

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

The following sections describe how to build with different backends and options.

## CPU Build

Build llama.cpp using `CMake`:

```bash
cmake -B build
cmake --build build --config Release
```
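
Once the build finishes, the resulting binaries land in `build/bin`. As a quick sanity check, a minimal sketch (it assumes a current build of `llama-cli` where the `--version` flag is available):

```bash
# Print version/build info to confirm the build produced a working binary
./build/bin/llama-cli --version
```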

**Notes**:

- For faster compilation, add the `-j` argument to run multiple jobs in parallel, or use a generator that does this automatically, such as Ninja. For example, `cmake --build build --config Release -j 8` will run 8 jobs in parallel (see the combined example after this list).
- For faster repeated compilation, install [ccache](https://ccache.dev/).
- For debug builds, there are two cases:

    1. Single-config generators (e.g. default = `Unix Makefiles`; note that they just ignore the `--config` flag):

       ```bash
       cmake -B build -DCMAKE_BUILD_TYPE=Debug
       cmake --build build
       ```

    2. Multi-config generators (`-G` param set to Visual Studio, Xcode...):

       ```bash
       cmake -B build -G "Xcode"
       cmake --build build --config Debug
       ```

    For more details and a list of supported generators, see the [CMake documentation](https://cmake.org/cmake/help/latest/manual/cmake-generators.7.html).
- For static builds, add `-DBUILD_SHARED_LIBS=OFF`:
  ```bash
  cmake -B build -DBUILD_SHARED_LIBS=OFF
  cmake --build build --config Release
  ```

- Building for Windows (x86, x64 and arm64) with MSVC or clang as compilers:
    - Install Visual Studio 2022, e.g. via the [Community Edition](https://visualstudio.microsoft.com/vs/community/). In the installer, select at least the following options (this also automatically installs the required additional tools like CMake):
      - Tab Workloads: Desktop development with C++
      - Tab Individual Components (find them quickly via search): C++ _CMake_ Tools for Windows, _Git_ for Windows, C++ _Clang_ Compiler for Windows, MS-Build Support for LLVM-Toolset (clang)
    - Remember to always use a Developer Command Prompt / PowerShell for VS2022 for git, build and test.
    - For Windows on ARM (arm64, WoA) build with:
    ```bash
    cmake --preset arm64-windows-llvm-release -D GGML_OPENMP=OFF
    cmake --build build-arm64-windows-llvm-release
    ```
    For building with the Ninja generator and clang as the default compiler, first set the `LIB` environment variable, e.g.:

      `set LIB=C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\um\x64;C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\lib\x64\uwp;C:\Program Files (x86)\Windows Kits\10\Lib\10.0.22621.0\ucrt\x64`

      ```bash
      cmake --preset x64-windows-llvm-release
      cmake --build build-x64-windows-llvm-release
      ```
- If you want HTTPS/TLS features, you may install OpenSSL development libraries. If they are not installed, the project will build and run without SSL support.
  - **Debian / Ubuntu:** `sudo apt-get install libssl-dev`
  - **Fedora / RHEL / Rocky / Alma:** `sudo dnf install openssl-devel`
  - **Arch / Manjaro:** `sudo pacman -S openssl`
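
As mentioned in the notes above, the parallelism and `ccache` tips can be combined. A hedged example using the Ninja generator and CMake's standard compiler-launcher variables (it assumes `ninja` and `ccache` are installed and on your `PATH`):

```bash
# Configure with Ninja (parallel by default) and route compiler calls through ccache
cmake -B build -G Ninja \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
cmake --build build --config Release
```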

## BLAS Build

Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Using BLAS doesn't affect the generation performance. There are currently several different BLAS implementations available for build and use:
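
If you want to check whether BLAS actually helps on your hardware, compare prompt-processing throughput with and without it. A rough sketch using the bundled `llama-bench` tool (the model path is a placeholder, and flag defaults may differ between versions):

```bash
# -p measures prompt processing (where BLAS can help), -n measures generation
./build/bin/llama-bench -m /path/to/model.gguf -p 512 -n 64
```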

### Accelerate Framework

This is only available on macOS, and it's enabled by default. You can just build using the normal instructions.

### OpenBLAS

This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.

- Using `CMake` on Linux:

    ```bash
    cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
    cmake --build build --config Release
    ```

### BLIS

Check [BLIS.md](./backend/BLIS.md) for more information.

### Intel oneMKL

Building through oneAPI compilers makes the `avx_vnni` instruction set available for Intel processors that do not support `avx512` and `avx512_vnni`. Please note that this build config **does not support Intel GPU**. For Intel GPU support, please refer to [llama.cpp for SYCL](./backend/SYCL.md).

- Using manual oneAPI installation:
  By default, `GGML_BLAS_VENDOR` is set to `Generic`, so if you have already sourced the Intel environment script and pass `-DGGML_BLAS=ON` to CMake, the MKL version of BLAS will be selected automatically. Otherwise, please install oneAPI and follow the steps below:
    ```bash
    source /opt/intel/oneapi/setvars.sh # You can skip this step in the oneapi-basekit docker image; it is only required for manual installations
    cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=ON
    cmake --build build --config Release
    ```

- Using oneAPI docker image:
  If you do not want to source the environment variables and install oneAPI manually, you can also build the code using the Intel docker container: [oneAPI-basekit](https://hub.docker.com/r/intel/oneapi-basekit). Then, you can use the commands given above.

Check [Optimizing and Running LLaMA2 on Intel® CPU](https://www.intel.com/content/www/us/en/content-details/791610/optimizing-and-running-llama2-on-intel-cpu.html) for more information.

### Other BLAS libraries

Any other BLAS library can be used by setting the `GGML_BLAS_VENDOR` option. See the [CMake documentation](https://cmake.org/cmake/help/latest/module/FindBLAS.html#blas-lapack-vendors) for a list of supported vendors.

## Metal Build

On macOS, Metal is enabled by default. Using Metal makes the computation run on the GPU.
To disable the Metal build at compile time, use the `-DGGML_METAL=OFF` cmake option.

When built with Metal support, you can explicitly disable GPU inference with the `--n-gpu-layers 0` command-line argument.
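
For example, to keep the Metal build but run inference entirely on the CPU (the model path is a placeholder):

```bash
# Offload zero layers to the GPU, forcing CPU-only inference
./build/bin/llama-cli -m /path/to/model.gguf -p "Hello" --n-gpu-layers 0
```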

## SYCL

SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.

The SYCL backend of llama.cpp is used to **support Intel GPUs** (Data Center Max series, Flex series, Arc series, built-in GPUs and iGPUs).

For detailed info, please refer to [llama.cpp for SYCL](./backend/SYCL.md).

## CUDA

This provides GPU acceleration using an NVIDIA GPU. Make sure to have the [CUDA toolkit](https://developer.nvidia.com/cuda-toolkit) installed.

#### Download directly from NVIDIA
You may find the official downloads here: [NVIDIA developer site](https://developer.nvidia.com/cuda-downloads).

#### Compile and run inside a Fedora Toolbox Container
We also have a [guide](./backend/CUDA-FEDORA.md) for setting up the CUDA toolkit in a Fedora [toolbox container](https://containertoolbx.org/).

**Recommended for:**
- ***Necessary*** for users of [Atomic Desktops for Fedora](https://fedoraproject.org/atomic-desktops/), such as [Silverblue](https://fedoraproject.org/atomic-desktops/silverblue/) and [Kinoite](https://fedoraproject.org/atomic-desktops/kinoite/)
  (there are no supported CUDA packages for these systems).
- ***Necessary*** for users whose host is not a [supported NVIDIA CUDA release platform](https://developer.nvidia.com/cuda-downloads)
  (for example, you may have [Fedora 42 Beta](https://fedoramagazine.org/announcing-fedora-linux-42-beta/) as your host operating system).
- ***Convenient*** for those running [Fedora Workstation](https://fedoraproject.org/workstation/) or [Fedora KDE Plasma Desktop](https://fedoraproject.org/spins/kde) who want to keep their host system clean.
- *Optional* for other systems: toolbox packages are also available for [Arch Linux](https://archlinux.org/), [Red Hat Enterprise Linux >= 8.5](https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux), and [Ubuntu](https://ubuntu.com/download).

### Compilation

Make sure to read the notes about the CPU build for general instructions, e.g. on speeding up the compilation.

```bash
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```

### Non-Native Builds

By default llama.cpp will be built for the hardware that is connected to the system at build time.
For a build covering all CUDA GPUs, disable `GGML_NATIVE`:

```bash
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF
```

The resulting binary should run on all CUDA GPUs with optimal performance, though some just-in-time compilation may be required.

### Override Compute Capability Specifications

If `nvcc` cannot detect your GPU, you may get compile warnings such as:

```text
nvcc warning : Cannot find valid GPU for '-arch=native', default arch is used
```

One option is to do a non-native build as described above.
However, this will result in a large binary that takes a long time to compile.
Alternatively, it is also possible to explicitly specify CUDA architectures.
This may also make sense for a non-native build; for that, one should look at the logic in `ggml/src/ggml-cuda/CMakeLists.txt` as a starting point.

To override the default CUDA architectures:

#### 1. Take note of the `Compute Capability` of your NVIDIA devices: ["CUDA: Your GPU Compute Capability"](https://developer.nvidia.com/cuda-gpus).

```text
GeForce RTX 4090      8.9
GeForce RTX 3080 Ti   8.6
GeForce RTX 3070      8.6
```

#### 2. Manually list each distinct `Compute Capability` in the `CMAKE_CUDA_ARCHITECTURES` list.

```bash
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
```

### Overriding the CUDA Version

If you have multiple CUDA installations on your system and want to compile llama.cpp for a specific one, e.g. for CUDA 11.7 installed under `/opt/cuda-11.7`:

```bash
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/opt/cuda-11.7/bin/nvcc -DCMAKE_INSTALL_RPATH="/opt/cuda-11.7/lib64;\$ORIGIN" -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON
```

#### Fixing Compatibility Issues with Old CUDA and New glibc

If you try to use an old CUDA version (e.g. v11.7) with a new glibc version you can get errors like this:

```
/usr/include/bits/mathcalls.h(83): error: exception specification is
  incompatible with that of previous function "cospi"


  /opt/cuda-11.7/bin/../targets/x86_64-linux/include/crt/math_functions.h(5545):
  here
```

It seems the least bad solution is to patch the CUDA installation to declare the correct signatures.
Replace the following lines in `/path/to/your/cuda/installation/targets/x86_64-linux/include/crt/math_functions.h`:

```C++
// original lines
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 cospi(double x);
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  cospif(float x);
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 sinpi(double x);
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  sinpif(float x);
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 rsqrt(double x);
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  rsqrtf(float x);

// edited lines
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 cospi(double x) noexcept (true);
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  cospif(float x) noexcept (true);
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 sinpi(double x) noexcept (true);
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  sinpif(float x) noexcept (true);
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ double                 rsqrt(double x) noexcept (true);
extern __DEVICE_FUNCTIONS_DECL__ __device_builtin__ float                  rsqrtf(float x) noexcept (true);
```

### Runtime CUDA environment variables

You may set the [CUDA environment variables](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) at runtime.

```bash
# Use `CUDA_VISIBLE_DEVICES` to hide the first compute device.
CUDA_VISIBLE_DEVICES="-0" ./build/bin/llama-server --model /srv/models/llama.gguf
```

#### CUDA_SCALE_LAUNCH_QUEUES

The environment variable [`CUDA_SCALE_LAUNCH_QUEUES`](https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/environment-variables.html#cuda-scale-launch-queues) controls the size of CUDA's command buffer, which determines how many GPU operations can be queued before the CPU must wait for the GPU to catch up. A larger buffer reduces CPU-side stalls and allows more work to be queued on a GPU.

Consider setting `CUDA_SCALE_LAUNCH_QUEUES=4x`, which increases the CUDA command buffer to 4 times its default size. This optimization is particularly beneficial for **multi-GPU setups with pipeline parallelism**, where it significantly improves prompt processing throughput by allowing more operations to be enqueued across GPUs.
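
For example (a sketch; the model path is a placeholder and the benefit depends on your setup):

```bash
# Enlarge the CUDA command buffer to 4x the default for this server run
CUDA_SCALE_LAUNCH_QUEUES=4x ./build/bin/llama-server --model /srv/models/llama.gguf
```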

### Unified Memory

The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory on Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. On Windows, this setting is available in the NVIDIA control panel as `System Memory Fallback`.
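
For example, on Linux (a sketch; the model path is a placeholder):

```bash
# Allow spill-over to system RAM instead of aborting when VRAM runs out
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99
```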

### Performance Tuning

The following compilation options are also available to tweak performance:

| Option                        | Legal values     | Default | Description |
|-------------------------------|------------------|---------|-------------|
| GGML_CUDA_FORCE_MMQ           | Boolean          | false   | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, CDNA and RDNA3+). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
| GGML_CUDA_FORCE_CUBLAS        | Boolean          | false   | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models. There may be issues with numerical overflows (except for CDNA and RDNA4) and memory use will be higher. Prompt processing may become faster on recent datacenter GPUs (the custom kernels were tuned primarily for RTX 3000/4000). |
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128     | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink, enabling peer access for larger batch sizes is potentially beneficial. |
| GGML_CUDA_FA_ALL_QUANTS       | Boolean          | false   | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
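
These are CMake options, so they are set at configure time. Which combination makes sense depends on your GPU and model, so treat the following as an illustrative sketch rather than a recommendation:

```bash
# Example: force MMQ kernels and raise the peer-access batch-size limit
cmake -B build -DGGML_CUDA=ON \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DGGML_CUDA_PEER_MAX_BATCH_SIZE=256 \
  -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release
```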

## MUSA

This provides GPU acceleration using a Moore Threads GPU. Make sure to have the [MUSA SDK](https://developer.mthreads.com/musa/musa-sdk) installed.

#### Download directly from Moore Threads

You may find the official downloads here: [Moore Threads developer site](https://developer.mthreads.com/sdk/download/musa).

### Compilation

```bash
cmake -B build -DGGML_MUSA=ON
cmake --build build --config Release
```

#### Override Compute Capability Specifications

By default, all supported compute capabilities are enabled. To customize this behavior, you can specify the `MUSA_ARCHITECTURES` option in the CMake command:

```bash
cmake -B build -DGGML_MUSA=ON -DMUSA_ARCHITECTURES="21"
cmake --build build --config Release
```

This configuration enables only compute capability `2.1` (MTT S80) during compilation, which can help reduce compilation time.

#### Compilation options

Most of the compilation options available for CUDA should also be available for MUSA, though they haven't been thoroughly tested yet.

- For static builds, add `-DBUILD_SHARED_LIBS=OFF` and `-DCMAKE_POSITION_INDEPENDENT_CODE=ON`:
  ```bash
  cmake -B build -DGGML_MUSA=ON \
    -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
  cmake --build build --config Release
  ```

### Runtime MUSA environment variables

You may set the [MUSA environment variables](https://docs.mthreads.com/musa-sdk/musa-sdk-doc-online/programming_guide/Z%E9%99%84%E5%BD%95/) at runtime.

```bash
# Use `MUSA_VISIBLE_DEVICES` to hide the first compute device.
MUSA_VISIBLE_DEVICES="-0" ./build/bin/llama-server --model /srv/models/llama.gguf
```

### Unified Memory

The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory on Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted.

## HIP

This provides GPU acceleration on HIP-supported AMD GPUs.
Make sure to have ROCm installed.
You can download it from your Linux distro's package manager or from here: [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html#rocm-install-quick).

- Using `CMake` for Linux (assuming a gfx1030-compatible AMD GPU):
  ```bash
  HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
      cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
      && cmake --build build --config Release -- -j 16
  ```

  Note: `GPU_TARGETS` is optional; omitting it will build the code for all GPUs in the current system.

  To enhance flash attention performance on RDNA3+ or CDNA architectures, you can utilize the rocWMMA library by enabling the `-DGGML_HIP_ROCWMMA_FATTN=ON` option. This requires rocWMMA headers to be installed on the build system.

  The rocWMMA library is included by default when installing the ROCm SDK using the `rocm` meta package provided by AMD. Alternatively, if you are not using the meta package, you can install the library using the `rocwmma-dev` or `rocwmma-devel` package, depending on your system's package manager.

  As an alternative, you can manually install the library by cloning it from the official [GitHub repository](https://github.com/ROCm/rocWMMA), checking out the corresponding version tag (e.g. `rocm-6.2.4`) and setting `-DCMAKE_CXX_FLAGS="-I<path/to/rocwmma>/library/include/"` in CMake. This also works under Windows despite not being officially supported by AMD.

  Note that if you get the following error:
  ```
  clang: error: cannot find ROCm device library; provide its path via '--rocm-path' or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
  ```
  Try searching for a directory under `HIP_PATH` that contains the file
  `oclc_abi_version_400.bc`. Then, add the following to the start of the
  command: `HIP_DEVICE_LIB_PATH=<directory-you-just-found>`, so something
  like:
  ```bash
  HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -p)" \
  HIP_DEVICE_LIB_PATH=<directory-you-just-found> \
      cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
      && cmake --build build -- -j 16
  ```

- Using `CMake` for Windows (using x64 Native Tools Command Prompt for VS, and assuming a gfx1100-compatible AMD GPU):
  ```bash
  set PATH=%HIP_PATH%\bin;%PATH%
  cmake -S . -B build -G Ninja -DGPU_TARGETS=gfx1100 -DGGML_HIP=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release
  cmake --build build
  ```
  If necessary, adapt `GPU_TARGETS` to the GPU arch you want to compile for. The above example uses `gfx1100`, which corresponds to the Radeon RX 7900 XTX/XT/GRE. You can find a list of targets [here](https://llvm.org/docs/AMDGPUUsage.html#processors).
  Find your GPU version string by matching the most significant version information from `rocminfo | grep gfx | head -1 | awk '{print $2}'` with the list of processors, e.g. `gfx1035` maps to `gfx1030`.

The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
If your GPU is not officially supported, you can set the environment variable `HSA_OVERRIDE_GFX_VERSION` to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3.
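
For example, to restrict llama.cpp to the first GPU while overriding the reported architecture of an otherwise unsupported RDNA2 card (the values and model path are illustrative):

```bash
HIP_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION=10.3.0 \
  ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99
```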

### Unified Memory

On Linux it is possible to use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU by setting the environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`. However, this hurts performance for non-integrated GPUs (but enables working with integrated GPUs).
## Vulkan

### For Windows Users:
**w64devkit**

Download and extract [`w64devkit`](https://github.com/skeeto/w64devkit/releases).

Download and install the [`Vulkan SDK`](https://vulkan.lunarg.com/sdk/home#windows) with the default settings.

Launch `w64devkit.exe` and run the following commands to copy the Vulkan dependencies:
```sh
SDK_VERSION=1.3.283.0
cp /VulkanSDK/$SDK_VERSION/Bin/glslc.exe $W64DEVKIT_HOME/bin/
cp /VulkanSDK/$SDK_VERSION/Lib/vulkan-1.lib $W64DEVKIT_HOME/x86_64-w64-mingw32/lib/
cp -r /VulkanSDK/$SDK_VERSION/Include/* $W64DEVKIT_HOME/x86_64-w64-mingw32/include/
cat > $W64DEVKIT_HOME/x86_64-w64-mingw32/lib/pkgconfig/vulkan.pc <<EOF
Name: Vulkan-Loader
Description: Vulkan Loader
Version: $SDK_VERSION
Libs: -lvulkan-1
EOF
```

Switch into the `llama.cpp` directory and build using CMake.
```sh
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
```

**Git Bash MINGW64**

Download and install [`Git-SCM`](https://git-scm.com/downloads/win) with the default settings.

Download and install [`Visual Studio Community Edition`](https://visualstudio.microsoft.com/) and make sure you select `C++`.

Download and install [`CMake`](https://cmake.org/download/) with the default settings.

Download and install the [`Vulkan SDK`](https://vulkan.lunarg.com/sdk/home#windows) with the default settings.

Go into your `llama.cpp` directory, right click, select `Open Git Bash Here` and then run the following commands:

```
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
```

Now you can load the model in conversation mode using `Vulkan`:

```sh
build/bin/Release/llama-cli -m "[PATH TO MODEL]" -ngl 100 -c 16384 -t 10 -n -2 -cnv
```

**MSYS2**

Install [MSYS2](https://www.msys2.org/) and then run the following commands in a UCRT terminal to install dependencies.
```sh
pacman -S git \
    mingw-w64-ucrt-x86_64-gcc \
    mingw-w64-ucrt-x86_64-cmake \
    mingw-w64-ucrt-x86_64-vulkan-devel \
    mingw-w64-ucrt-x86_64-shaderc
```

Switch into the `llama.cpp` directory and build using CMake.
```sh
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
```

### For Docker users:

You don't need to install the Vulkan SDK. It will be installed inside the container.

```sh
# Build the image
docker build -t llama-cpp-vulkan --target light -f .devops/vulkan.Dockerfile .

# Then, use it:
docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-vulkan -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
```

### For Linux users:

#### Using the LunarG Vulkan SDK

First, follow the official LunarG instructions for the installation and setup of the Vulkan SDK in the [Getting Started with the Linux Tarball Vulkan SDK](https://vulkan.lunarg.com/doc/sdk/latest/linux/getting_started.html) guide.

> [!IMPORTANT]
> After completing the first step, ensure that you have used the `source` command on the `setup-env.sh` file inside of the Vulkan SDK in your current terminal session. Otherwise, the build won't work. Additionally, if you close your terminal, you must perform this step again if you intend to do another build. However, there are ways to make this persistent. Refer to the Vulkan SDK guide linked in the first step for more information about any of this.

#### Using system packages

On Debian / Ubuntu, you can install the required dependencies using:
```sh
sudo apt-get install libvulkan-dev glslc
```

#### Common steps

Second, after verifying that you have followed all of the SDK installation/setup steps, run this command to make sure the SDK is working before proceeding:
```bash
vulkaninfo
```

Then, assuming you have `cd`'d into your llama.cpp folder and there are no errors when running `vulkaninfo`, you can proceed to build llama.cpp using the CMake commands below:
```bash
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
```

Finally, after finishing your build, you should be able to do something like this:
```bash
# Test the output binary
# "-ngl 99" should offload all of the layers to GPU for most (if not all) models.
./build/bin/llama-cli -m "PATH_TO_MODEL" -p "Hi you how are you" -ngl 99

# You should see in the output that ggml_vulkan detected your GPU. For example:
# ggml_vulkan: Using Intel(R) Graphics (ADL GT2) | uma: 1 | fp16: 1 | warp size: 32
```

### For Mac users:

Generally, follow LunarG's [Getting Started with the MacOS Vulkan SDK](https://vulkan.lunarg.com/doc/sdk/latest/mac/getting_started.html) guide for installation and setup of the Vulkan SDK. There are two Vulkan driver options on macOS, both of which implement translation layers to map Vulkan to Metal. They can be hot-swapped by setting the `VK_ICD_FILENAMES` environment variable to point to the respective ICD JSON file.

Check the box for "KosmicKrisp" during the LunarG Vulkan SDK installation.

Set the environment variables for the LunarG Vulkan SDK after installation (and optionally add this to your shell profile for persistence):
```bash
source /path/to/vulkan-sdk/setup-env.sh
```

#### Using MoltenVK

MoltenVK is the default Vulkan driver installed with the LunarG Vulkan SDK on macOS, so you can use the above environment variable settings as is.

#### Using KosmicKrisp

Override the environment variables for KosmicKrisp:
```bash
export VK_ICD_FILENAMES=$VULKAN_SDK/share/vulkan/icd.d/libkosmickrisp_icd.json
export VK_DRIVER_FILES=$VULKAN_SDK/share/vulkan/icd.d/libkosmickrisp_icd.json
```

#### Build

This is the only step that differs from the [common steps](#common-steps) above.
```bash
cmake -B build -DGGML_VULKAN=1 -DGGML_METAL=OFF
cmake --build build --config Release
```

## CANN

This provides NPU acceleration using the AI cores of your Ascend NPU. [CANN](https://www.hiascend.com/en/software/cann) is a hierarchical set of APIs that helps you quickly build AI applications and services based on Ascend NPUs.

For more information about Ascend NPUs, visit the [Ascend Community](https://www.hiascend.com/en/).

Make sure to have the CANN toolkit installed. You can download it from here: [CANN Toolkit](https://www.hiascend.com/developer/download/community/result?module=cann).

Go to the `llama.cpp` directory and build using CMake.
```bash
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release
cmake --build build --config release
```

You can test with:

```bash
./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -ngl 32
```

If output like the following appears, you are using `llama.cpp` with the CANN backend:
```bash
llm_load_tensors:       CANN model buffer size = 13313.00 MiB
llama_new_context_with_model:       CANN compute buffer size =  1260.81 MiB
```

For detailed info, such as model/device support and CANN installation, please refer to [llama.cpp for CANN](./backend/CANN.md).

## ZenDNN

ZenDNN provides optimized deep learning primitives for AMD EPYC™ CPUs. It accelerates matrix multiplication operations for inference workloads.

### Compilation

- Using `CMake` on Linux (automatic build):

    ```bash
    cmake -B build -DGGML_ZENDNN=ON
    cmake --build build --config Release
    ```

    The first build will automatically download and build ZenDNN, which may take 5-10 minutes. Subsequent builds will be much faster.

- Using `CMake` with custom ZenDNN installation:

    ```bash
    cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/zendnn/install
    cmake --build build --config Release
    ```

### Testing

You can test with:

```bash
./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -n 50
```

For detailed information about hardware support, setup instructions, and performance optimization, refer to [llama.cpp for ZenDNN](./backend/ZenDNN.md).

## Arm® KleidiAI™

KleidiAI is a library of optimized microkernels for AI workloads, specifically designed for Arm CPUs. These microkernels enhance performance and can be enabled for use by the CPU backend.

To enable KleidiAI, go to the llama.cpp directory and build using CMake:
```bash
cmake -B build -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release
```
You can verify that KleidiAI is being used by running:
```bash
./build/bin/llama-cli -m PATH_TO_MODEL -p "What is a car?"
```
If KleidiAI is enabled, the output will contain a line similar to:
```
load_tensors: CPU_KLEIDIAI model buffer size =  3474.00 MiB
```
KleidiAI's microkernels implement optimized tensor operations using Arm CPU features such as dotprod, int8mm and SME. llama.cpp selects the most efficient kernel based on runtime CPU feature detection. However, on platforms that support SME, you must manually enable the SME microkernels by setting the environment variable `GGML_KLEIDIAI_SME=1`.
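
For example, on an SME-capable platform (the model path is a placeholder):

```bash
# Opt in to the SME microkernels at run time
GGML_KLEIDIAI_SME=1 ./build/bin/llama-cli -m PATH_TO_MODEL -p "What is a car?"
```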

Depending on your build target, other higher-priority backends may be enabled by default. To ensure the CPU backend is used, you must disable the higher-priority backends either at compile time, e.g. `-DGGML_METAL=OFF`, or at run time using the command line option `--device none`.
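
A hedged sketch of both approaches on macOS, where the Metal backend would otherwise take priority:

```bash
# Compile time: build without the Metal backend
cmake -B build -DGGML_CPU_KLEIDIAI=ON -DGGML_METAL=OFF
cmake --build build --config Release

# Run time: keep Metal built, but route inference to the CPU backend
./build/bin/llama-cli -m PATH_TO_MODEL -p "What is a car?" --device none
```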

## OpenCL

This provides GPU acceleration through OpenCL on recent Adreno GPUs.
More information about the OpenCL backend can be found in [OPENCL.md](./backend/OPENCL.md).

### Android

Assume the NDK is available in `$ANDROID_NDK`. First, install the OpenCL headers and ICD loader library if they are not already available:

```sh
mkdir -p ~/dev/llm
cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-Headers && \
cd OpenCL-Headers && \
cp -r CL $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include

cd ~/dev/llm

git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && \
cd OpenCL-ICD-Loader && \
mkdir build_ndk && cd build_ndk && \
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DOPENCL_ICD_LOADER_HEADERS_DIR=$ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=24 \
  -DANDROID_STL=c++_shared && \
ninja && \
cp libOpenCL.so $ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
```

Then build llama.cpp with OpenCL enabled:

```sh
cd ~/dev/llm

git clone https://github.com/ggml-org/llama.cpp && \
cd llama.cpp && \
mkdir build-android && cd build-android

cmake .. -G Ninja \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_OPENCL=ON

ninja
```

### Windows Arm64

First, install the OpenCL headers and ICD loader library if they are not already available:

```powershell
mkdir -p ~/dev/llm

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
mkdir build && cd build
cmake .. -G Ninja `
  -DBUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_TESTING=OFF `
  -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
mkdir build && cd build
cmake .. -G Ninja `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
  -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install
```

Then build llama.cpp with OpenCL enabled:

```powershell
cmake .. -G Ninja `
  -DCMAKE_TOOLCHAIN_FILE="$HOME/dev/llm/llama.cpp/cmake/arm64-windows-llvm.cmake" `
  -DCMAKE_BUILD_TYPE=Release `
  -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
  -DBUILD_SHARED_LIBS=OFF `
  -DGGML_OPENCL=ON
ninja
```

## Android

To read documentation for how to build on Android, [click here](./android.md).

## WebGPU [In Progress]

The WebGPU backend relies on [Dawn](https://dawn.googlesource.com/dawn). Follow the instructions [here](https://dawn.googlesource.com/dawn/+/refs/heads/main/docs/quickstart-cmake.md) to install Dawn locally so that llama.cpp can find it using CMake. The current implementation is up-to-date with Dawn commit `bed1a61`.

In the llama.cpp directory, build with CMake:

```
cmake -B build -DGGML_WEBGPU=ON
cmake --build build --config Release
```

### Browser Support

WebGPU allows cross-platform access to the GPU from supported browsers. We utilize [Emscripten](https://emscripten.org/) to compile ggml's WebGPU backend to WebAssembly. Emscripten does not officially support WebGPU bindings yet, but Dawn currently maintains its own WebGPU bindings called emdawnwebgpu.

Follow the instructions [here](https://dawn.googlesource.com/dawn/+/refs/heads/main/src/emdawnwebgpu/) to download or build the emdawnwebgpu package (note that it might be safer to build the emdawnwebgpu package locally, so that it stays in sync with the version of Dawn you have installed above). When building with CMake, the path to the emdawnwebgpu port file needs to be set with the flag `EMDAWNWEBGPU_DIR`.

## IBM Z & LinuxONE

To read documentation for how to build on IBM Z & LinuxONE, [click here](./build-s390x.md).

## Notes about GPU-accelerated backends

The GPU may still be used to accelerate some parts of the computation even when using the `-ngl 0` option. You can fully disable GPU acceleration by using `--device none`.

In most cases, it is possible to build and use multiple backends at the same time. For example, you can build llama.cpp with both CUDA and Vulkan support by using the `-DGGML_CUDA=ON -DGGML_VULKAN=ON` options with CMake. At runtime, you can specify which backend devices to use with the `--device` option. To see a list of available devices, use the `--list-devices` option.
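
A hedged end-to-end sketch of a multi-backend build and runtime device selection (the device name shown is an example; use whatever `--list-devices` reports on your system, and the model path is a placeholder):

```bash
# Build with both CUDA and Vulkan backends enabled
cmake -B build -DGGML_CUDA=ON -DGGML_VULKAN=ON
cmake --build build --config Release

# Inspect the available devices, then pin inference to one of them
./build/bin/llama-server --list-devices
./build/bin/llama-server -m /path/to/model.gguf --device CUDA0
```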

Backends can be built as dynamic libraries that can be loaded dynamically at runtime. This allows you to use the same llama.cpp binary on different machines with different GPUs. To enable this feature, use the `GGML_BACKEND_DL` option when building.
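
A minimal configure sketch for a build with dynamically loadable backends (the exact set of backends is up to you):

```bash
# Backends are built as loadable modules and discovered at runtime
cmake -B build -DGGML_BACKEND_DL=ON -DGGML_CUDA=ON -DGGML_VULKAN=ON
cmake --build build --config Release
```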