Diffstat (limited to 'llama.cpp/docs/backend')
| Mode | Path | Insertions |
|:-----------|:--------------------------------------------------------|-----------:|
| -rw-r--r-- | llama.cpp/docs/backend/BLIS.md | 60 |
| -rwxr-xr-x | llama.cpp/docs/backend/CANN.md | 333 |
| -rw-r--r-- | llama.cpp/docs/backend/CUDA-FEDORA.md | 283 |
| -rw-r--r-- | llama.cpp/docs/backend/OPENCL.md | 282 |
| -rw-r--r-- | llama.cpp/docs/backend/SYCL.md | 753 |
| -rw-r--r-- | llama.cpp/docs/backend/VirtGPU.md | 180 |
| -rw-r--r-- | llama.cpp/docs/backend/VirtGPU/configuration.md | 174 |
| -rw-r--r-- | llama.cpp/docs/backend/VirtGPU/development.md | 220 |
| -rw-r--r-- | llama.cpp/docs/backend/ZenDNN.md | 258 |
| -rw-r--r-- | llama.cpp/docs/backend/snapdragon/CMakeUserPresets.json | 61 |
| -rw-r--r-- | llama.cpp/docs/backend/snapdragon/README.md | 269 |
| -rw-r--r-- | llama.cpp/docs/backend/snapdragon/developer.md | 109 |
| -rw-r--r-- | llama.cpp/docs/backend/snapdragon/windows.md | 161 |
| -rw-r--r-- | llama.cpp/docs/backend/zDNN.md | 66 |
14 files changed, 3209 insertions, 0 deletions
diff --git a/llama.cpp/docs/backend/BLIS.md b/llama.cpp/docs/backend/BLIS.md new file mode 100644 index 0000000..9045485 --- /dev/null +++ b/llama.cpp/docs/backend/BLIS.md @@ -0,0 +1,60 @@ +BLIS Installation Manual +------------------------ + +BLIS is a portable software framework for high-performance BLAS-like dense linear algebra libraries. It has received awards and recognition, including the 2023 James H. Wilkinson Prize for Numerical Software and the 2020 SIAM Activity Group on Supercomputing Best Paper Prize. BLIS provides a new BLAS-like API and a compatibility layer for traditional BLAS routine calls. It offers features such as object-based API, typed API, BLAS and CBLAS compatibility layers. + +Project URL: https://github.com/flame/blis + +### Prepare: + +Compile BLIS: + +```bash +git clone https://github.com/flame/blis +cd blis +./configure --enable-cblas -t openmp,pthreads auto +# will install to /usr/local/ by default. +make -j +``` + +Install BLIS: + +```bash +sudo make install +``` + +We recommend using openmp since it's easier to modify the cores being used. + +### llama.cpp compilation + +CMake: + +```bash +mkdir build +cd build +cmake -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=FLAME .. +make -j +``` + +### llama.cpp execution + +According to the BLIS documentation, we could set the following +environment variables to modify the behavior of openmp: + +```bash +export GOMP_CPU_AFFINITY="0-19" +export BLIS_NUM_THREADS=14 +``` + +And then run the binaries as normal. + + +### Intel specific issue + +Some might get the error message saying that `libimf.so` cannot be found. +Please follow this [stackoverflow page](https://stackoverflow.com/questions/70687930/intel-oneapi-2022-libimf-so-no-such-file-or-directory-during-openmpi-compila). + +### Reference: + +1. https://github.com/flame/blis#getting-started +2. https://github.com/flame/blis/blob/master/docs/Multithreading.md diff --git a/llama.cpp/docs/backend/CANN.md b/llama.cpp/docs/backend/CANN.md new file mode 100755 index 0000000..b03c2a1 --- /dev/null +++ b/llama.cpp/docs/backend/CANN.md @@ -0,0 +1,333 @@ +# llama.cpp for CANN + + - [Background](#background) + - [News](#news) + - [OS](#os) + - [Hardware](#hardware) + - [Model Supports](#model-supports) + - [DataType Supports](#datatype-supports) + - [Docker](#docker) + - [Linux](#linux) + - [Environment variable setup](#environment-variable-setup) + - [TODO](#todo) + + +## Background + +**Ascend NPU** is a range of AI processors using Neural Processing Unit. It will efficiently handle matrix-matrix multiplication, dot-product and scalars. + +**CANN** (Compute Architecture for Neural Networks) is a heterogeneous computing architecture for AI scenarios, providing support for multiple AI frameworks on the top and serving AI processors and programming at the bottom. It plays a crucial role in bridging the gap between upper and lower layers, and is a key platform for improving the computing efficiency of Ascend AI processors. Meanwhile, it offers a highly efficient and easy-to-use programming interface for diverse application scenarios, allowing users to rapidly build AI applications and services based on the Ascend platform. + +**Llama.cpp + CANN** + +The llama.cpp CANN backend is designed to support Ascend NPU. It utilize the ability of AscendC and ACLNN which are intergrated to CANN Toolkit and kernels to using Ascend NPU directly. + +## News + +- 2024.11 + - Support F16 and F32 data type model for Ascend 310P NPU. +- 2024.8 + - Support `Q4_0` and `Q8_0` data type for Ascend NPU. 
+- 2024.7 + - Create CANN backend for Ascend NPU. + +## OS + +| OS | Status | Verified | +|:-------:|:-------:|:----------------------------------------------:| +| Linux | Support | Ubuntu 22.04, OpenEuler22.03 | + + +## Hardware + +### Ascend NPU + +**Verified devices** + +| Ascend NPU | Status | +|:-----------------------------:|:-------:| +| Atlas 300T A2 | Support | +| Atlas 300I Duo | Support | + +*Notes:* + +- If you have trouble with Ascend NPU device, please create a issue with **[CANN]** prefix/tag. +- If you run successfully with your Ascend NPU device, please help update the upper table. + + +## Model Supports + +| Model Name | FP16 | Q4_0 | Q8_0 | +|:----------------------------|:-----:|:----:|:----:| +| Llama-2 | √ | √ | √ | +| Llama-3 | √ | √ | √ | +| Mistral-7B | √ | √ | √ | +| Mistral MOE | √ | √ | √ | +| DBRX | - | - | - | +| Falcon | √ | √ | √ | +| Chinese LLaMA/Alpaca | √ | √ | √ | +| Vigogne(French) | √ | √ | √ | +| BERT | x | x | x | +| Koala | √ | √ | √ | +| Baichuan | √ | √ | √ | +| Aquila 1 & 2 | √ | √ | √ | +| Starcoder models | √ | √ | √ | +| Refact | √ | √ | √ | +| MPT | √ | √ | √ | +| Bloom | √ | √ | √ | +| Yi models | √ | √ | √ | +| stablelm models | √ | √ | √ | +| DeepSeek models | x | x | x | +| Qwen models | √ | √ | √ | +| PLaMo-13B | √ | √ | √ | +| Phi models | √ | √ | √ | +| PhiMoE | √ | √ | √ | +| GPT-2 | √ | √ | √ | +| Orion | √ | √ | √ | +| InternlLM2 | √ | √ | √ | +| CodeShell | √ | √ | √ | +| Gemma | √ | √ | √ | +| Mamba | √ | √ | √ | +| Xverse | √ | √ | √ | +| command-r models | √ | √ | √ | +| Grok-1 | - | - | - | +| SEA-LION | √ | √ | √ | +| GritLM-7B | √ | √ | √ | +| OLMo | √ | √ | √ | +| OLMo 2 | √ | √ | √ | +| OLMoE | √ | √ | √ | +| Granite models | √ | √ | √ | +| GPT-NeoX | √ | √ | √ | +| Pythia | √ | √ | √ | +| Snowflake-Arctic MoE | - | - | - | +| Smaug | √ | √ | √ | +| Poro 34B | √ | √ | √ | +| Bitnet b1.58 models | √ | x | x | +| Flan-T5 | √ | √ | √ | +| Open Elm models | x | √ | √ | +| chatGLM3-6B + ChatGLM4-9b + GLMEdge-1.5b + GLMEdge-4b | √ | √ | √ | +| GLM-4-0414 | √ | √ | √ | +| SmolLM | √ | √ | √ | +| EXAONE-3.0-7.8B-Instruct | √ | √ | √ | +| FalconMamba Models | √ | √ | √ | +| Jais Models | - | x | x | +| Bielik-11B-v2.3 | √ | √ | √ | +| RWKV-6 | - | √ | √ | +| QRWKV-6 | √ | √ | √ | +| GigaChat-20B-A3B | x | x | x | +| Trillion-7B-preview | √ | √ | √ | +| Ling models | √ | √ | √ | + + +**Multimodal** +| Model Name | FP16 | Q4_0 | Q8_0 | +|:----------------------------|:-----:|:----:|:----:| +| LLaVA 1.5 models, LLaVA 1.6 models | x | x | x | +| BakLLaVA | √ | √ | √ | +| Obsidian | √ | - | - | +| ShareGPT4V | x | - | - | +| MobileVLM 1.7B/3B models | - | - | - | +| Yi-VL | - | - | - | +| Mini CPM | √ | √ | √ | +| Moondream | √ | √ | √ | +| Bunny | √ | - | - | +| GLM-EDGE | √ | √ | √ | +| Qwen2-VL | √ | √ | √ | + + + +## DataType Supports + +| DataType | Status | +|:----------------------:|:-------:| +| FP16 | Support | +| Q8_0 | Support | +| Q4_0 | Support | + +## Docker + +### Build Images +You can get a image with llama.cpp in one command. +```sh +docker build -t llama-cpp-cann -f .devops/llama-cli-cann.Dockerfile . +``` + +### Run container + +```sh +# Find all cards. +npu-smi info + +# Select the cards that you want to use, make sure these cards are not used by someone. +# Following using cards of device0. 
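# Note: the command below passes the selected NPU (davinci0) plus the Ascend management
# device nodes via --device, and mounts the host driver libraries, the npu-smi tool and
# your model directory via -v, so the container can reach both the hardware and the model.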
+docker run --name llamacpp --device /dev/davinci0 --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info -v /PATH_TO_YOUR_MODELS/:/app/models -it llama-cpp-cann -m /app/models/MODEL_PATH -ngl 32 -p "Building a website can be done in 10 simple steps:" +``` + +*Notes:* + +- You may need to install Ascend Driver and firmware on the **host** machine *(Please refer to the [Linux configuration](#linux) for details)*. + +## Linux + +### I. Setup Environment + +1. **Install Ascend Driver and firmware** + + ```sh + # create driver running user. + sudo groupadd -g HwHiAiUser + sudo useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash + sudo usermod -aG HwHiAiUser $USER + + # download driver from https://www.hiascend.com/hardware/firmware-drivers/community according to your system + # and install driver. + sudo sh Ascend-hdk-910b-npu-driver_x.x.x_linux-{arch}.run --full --install-for-all + ``` + + Once installed, run `npu-smi info` to check whether driver is installed successfully. + ```sh + +-------------------------------------------------------------------------------------------+ + | npu-smi 24.1.rc2 Version: 24.1.rc2 | + +----------------------+---------------+----------------------------------------------------+ + | NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)| + | Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) | + +======================+===============+====================================================+ + | 2 xxx | OK | 64.4 51 15 / 15 | + | 0 | 0000:01:00.0 | 0 1873 / 15077 0 / 32768 | + +======================+===============+====================================================+ + | 5 xxx | OK | 64.0 52 15 / 15 | + | 0 | 0000:81:00.0 | 0 1874 / 15077 0 / 32768 | + +======================+===============+====================================================+ + | No running processes found in NPU 2 | + +======================+===============+====================================================+ + | No running processes found in NPU 5 | + +======================+===============+====================================================+ + ``` + +2. **Install Ascend Firmware** + ```sh + # download driver from https://www.hiascend.com/hardware/firmware-drivers/community according to your system + # and install driver. + sudo sh Ascend-hdk-910b-npu-firmware_x.x.x.x.X.run --full + ``` + If the following messaage appers, firmware is installed successfully. + ```sh + Firmware package installed successfully! + ``` + + +3. **Install CANN toolkit and kernels** + + CANN toolkit and kernels can be obtained from the official [CANN Toolkit](https://www.hiascend.com/zh/developer/download/community/result?module=cann) page. + + Please download the corresponding version that satified your system. The minimum version required is 8.0.RC2.alpha002 and here is the install command. 
+ ```sh + pip3 install attrs numpy decorator sympy cffi pyyaml pathlib2 psutil protobuf scipy requests absl-py wheel typing_extensions + sh Ascend-cann-toolkit_8.0.RC2.alpha002_linux-aarch64.run --install + sh Ascend-cann-kernels-910b_8.0.RC2.alpha002_linux.run --install + ``` + + Set Ascend Variables: + ```sh + echo "source ~/Ascend/ascend-toolkit/set_env.sh" >> ~/.bashrc + source ~/.bashrc + ``` + +Upon a successful installation, CANN is enabled for the available ascend devices. + +### II. Build llama.cpp + +```sh +cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release +cmake --build build --config release +``` + +### III. Run the inference + +1. **Retrieve and prepare model** + + You can refer to the general [*Prepare and Quantize*](../../README.md#prepare-and-quantize) guide for model prepration. + + **Notes**: + + - CANN backend only supports FP16/Q4_0/Q8_0 models currently. + +2. **Launch inference** + + There are two device selection modes: + + - Single device: Use one device target specified by the user. + - Multiple devices: Automatically choose the devices with the same backend. + + | Device selection | Parameter | + |:----------------:|:--------------------------------------:| + | Single device | --split-mode none --main-gpu DEVICE_ID | + | Multiple devices | --split-mode layer (default) | + + Examples: + + - Use device 0: + + ```sh + ./build/bin/llama-cli -m path_to_model -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0 + ``` + + - Use multiple devices: + + ```sh + ./build/bin/llama-cli -m path_to_model -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer + ``` + +### **GitHub contribution**: +Please add the **[CANN]** prefix/tag in issues/PRs titles to help the CANN-team check/address them without delay. + +## Updates +### Basic Flash Attention Support +The basic FA kernel with aclnnops has been added in aclnn_ops.cpp. +Currently, the FA only supports the cases with FP16 KV tensors and NO logit softcap. +Since the aclnn interface for flash attention cannot support the logit softcap, we will only update the quantized version in the future. + +Authors from Peking University: Bizhao Shi (bshi@pku.edu.cn), Yuxin Yang (yxyang@pku.edu.cn), Ruiyang Ma (ruiyang@stu.pku.edu.cn), and Guojie Luo (gluo@pku.edu.cn). + +We would like to thank Tuo Dai, Shanni Li, and all of the project maintainers from Huawei Technologies Co., Ltd for their help during the code development and pull request. + +## Environment variable setup + +### GGML_CANN_MEM_POOL + +Specifies the memory pool management strategy, Default is vmm. + +- vmm: Utilizes a virtual memory manager pool. If hardware support for VMM is unavailable, falls back to the legacy (leg) memory pool. + +- prio: Employs a priority queue-based memory pool management. + +- leg: Uses a fixed-size buffer pool. + +### GGML_CANN_DISABLE_BUF_POOL_CLEAN + +Controls automatic cleanup of the memory pool. This option is only effective when using the prio or leg memory pool strategies. + +### GGML_CANN_WEIGHT_NZ + +Converting the matmul weight format from ND to NZ to improve performance. Enabled by default. + +### GGML_CANN_ACL_GRAPH + +Operators are executed using ACL graph execution, rather than in op-by-op (eager) mode. Enabled by default. This option is only effective if `USE_ACL_GRAPH` was enabled at compilation time. 
To enable it, recompile using: + +```sh +cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release -DUSE_ACL_GRAPH=ON +cmake --build build --config release +``` + +### GGML_CANN_GRAPH_CACHE_CAPACITY + +Maximum number of compiled CANN graphs kept in the LRU cache, default is 12. When the number of cached graphs exceeds this capacity, the least recently used graph will be evicted. + +### GGML_CANN_PREFILL_USE_GRAPH + +Enable ACL graph execution during the prefill stage, default is false. This option is only effective when FA is enabled. + +### GGML_CANN_OPERATOR_FUSION + +Enable operator fusion during computation, default is false. This option fuses compatible operators (e.g., ADD + RMS_NORM) to reduce overhead and improve performance. diff --git a/llama.cpp/docs/backend/CUDA-FEDORA.md b/llama.cpp/docs/backend/CUDA-FEDORA.md new file mode 100644 index 0000000..1508faf --- /dev/null +++ b/llama.cpp/docs/backend/CUDA-FEDORA.md @@ -0,0 +1,283 @@ +# Setting Up CUDA on Fedora + +In this guide we setup [Nvidia CUDA](https://docs.nvidia.com/cuda/) in a toolbox container. This guide is applicable for: + +- [Fedora Workstation](https://fedoraproject.org/workstation/) +- [Atomic Desktops for Fedora](https://fedoraproject.org/atomic-desktops/) +- [Fedora Spins](https://fedoraproject.org/spins) +- [Other Distributions](https://containertoolbx.org/distros/), including `Red Hat Enterprise Linux >= 8.5`, `Arch Linux`, and `Ubuntu`. + +## Table of Contents + +- [Prerequisites](#prerequisites) +- [Using the Fedora 41 CUDA Repository](#using-the-fedora-41-cuda-repository) +- [Creating a Fedora Toolbox Environment](#creating-a-fedora-toolbox-environment) +- [Installing Essential Development Tools](#installing-essential-development-tools) +- [Adding the CUDA Repository](#adding-the-cuda-repository) +- [Installing Nvidia Driver Libraries](#installing-nvidia-driver-libraries) +- [Installing the CUDA Meta-Package](#installing-the-cuda-meta-package) +- [Configuring the Environment](#configuring-the-environment) +- [Verifying the Installation](#verifying-the-installation) +- [Conclusion](#conclusion) +- [Troubleshooting](#troubleshooting) +- [Additional Notes](#additional-notes) +- [References](#references) + +## Prerequisites + +- **Toolbox Installed on the Host System** `Fedora Silverblue` and `Fedora Workstation` both have toolbox by default, other distributions may need to install the [toolbox package](https://containertoolbx.org/install/). +- **NVIDIA Drivers and Graphics Card installed on Host System (recommended)** To run CUDA program, such as `llama.cpp`, the host should be setup to access your NVIDIA hardware. Fedora Hosts can use the [RPM Fusion Repository](https://rpmfusion.org/Howto/NVIDIA). +- **Internet connectivity** to download packages. + +### Using the Fedora 41 CUDA Repository + +The latest release is 41. + +- [Fedora 41 CUDA Repository](https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/) + +**Note:** We recommend using a toolbox environment to prevent system conflicts. + +## Creating a Fedora Toolbox Environment + +This guide focuses on Fedora hosts, but with small adjustments, it can work for other hosts. Using the Fedora Toolbox allows us to install the necessary packages without affecting the host system. + +**Note:** Toolbox is available for other systems, and even without Toolbox, it is possible to use Podman or Docker. + +1. 
**Create a Fedora 41 Toolbox:** + + ```bash + toolbox create --image registry.fedoraproject.org/fedora-toolbox:41 --container fedora-toolbox-41-cuda + ``` + +2. **Enter the Toolbox:** + + ```bash + toolbox enter --container fedora-toolbox-41-cuda + ``` + + Inside the toolbox, you have root privileges and can install packages without affecting the host system. + +## Installing Essential Development Tools + +1. **Synchronize the DNF Package Manager:** + + ```bash + sudo dnf distro-sync + ``` + +2. **Install **Vim** the default text editor (Optional):** + + ```bash + sudo dnf install vim-default-editor --allowerasing + ``` + + The `--allowerasing` flag will allow the removal of the conflicting `nano-default-editor` package. + +3. **Install Development Tools and Libraries:** + + ```bash + sudo dnf install @c-development @development-tools cmake + ``` + + This installs essential packages for compiling software, including `gcc`, `make`, and other development headers. + +## Adding the CUDA Repository + +Add the NVIDIA CUDA repository to your DNF configuration: + +```bash +sudo dnf config-manager addrepo --from-repofile=https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/cuda-fedora41.repo +``` + +After adding the repository, synchronize the package manager again: + +```bash +sudo dnf distro-sync +``` + +## Installing Nvidia Driver Libraries + +First, we need to detect if the host is supplying the [NVIDIA driver libraries into the toolbox](https://github.com/containers/toolbox/blob/main/src/pkg/nvidia/nvidia.go): + +```bash +ls -la /usr/lib64/libcuda.so.1 +``` + +### If *`libcuda.so.1`* is missing: + +``` +ls: cannot access '/usr/lib64/libcuda.so.1': No such file or directory +``` + +**Explanation:** +The host dose not supply the CUDA drivers, **install them now:** + +#### Install the Nvidia Driver Libraries on Guest: + +```bash +sudo dnf install nvidia-driver-cuda nvidia-driver-libs nvidia-driver-cuda-libs nvidia-persistenced +``` + +### If *`libcuda.so.1`* exists: +``` +lrwxrwxrwx. 1 root root 21 Mar 24 11:26 /usr/lib64/libcuda.so.1 -> libcuda.so.570.133.07 +``` + +**Explanation:** +The host is supply the CUDA drivers, **we need to update the guest RPM Database accordingly:** + +#### Update the Toolbox RPM Database to include the Host-Supplied Libraries: + +Note: we do not actually install the libraries, we just update the DB so that the guest system knows they are supplied by the host. + +##### 1. Download `nvidia-` parts that are supplied by the host RPM's (with dependencies) + +```bash +sudo dnf download --destdir=/tmp/nvidia-driver-libs --resolve --arch x86_64 nvidia-driver-cuda nvidia-driver-libs nvidia-driver-cuda-libs nvidia-persistenced +``` + +##### 2. Update the RPM database to assume the installation of these packages. + +```bash +sudo rpm --install --verbose --hash --justdb /tmp/nvidia-driver-libs/* +``` + +**Note:** + +- The `--justdb` option only updates the RPM database, without touching the filesystem elsewhere. + +##### Check that the RPM Database has been correctly updated: + +**Note:** This is the same command as in the *"Install the Nvidia Driver Libraries on Guest"* for if *`libcuda.so.1`* was missing. + + +```bash +sudo dnf install nvidia-driver-cuda nvidia-driver-libs nvidia-driver-cuda-libs nvidia-persistenced +``` + +*(this time it will not install anything, as the database things that these packages are already installed)* + +``` +Updating and loading repositories: +Repositories loaded. 
+Package "nvidia-driver-cuda-3:570.124.06-1.fc41.x86_64" is already installed. +Package "nvidia-driver-libs-3:570.124.06-1.fc41.x86_64" is already installed. +Package "nvidia-driver-cuda-libs-3:570.124.06-1.fc41.x86_64" is already installed. +Package "nvidia-persistenced-3:570.124.06-1.fc41.x86_64" is already installed. + +Nothing to do. +``` + +## Installing the CUDA Meta-Package + +Now that the driver libraries are installed, proceed to install CUDA: + +```bash +sudo dnf install cuda +``` + +This installs the CUDA toolkit and associated packages. + +## Configuring the Environment + +To use CUDA, add its binary directory to your system's `PATH`. + +1. **Create a Profile Script:** + + ```bash + sudo sh -c 'echo "export PATH=\$PATH:/usr/local/cuda/bin" >> /etc/profile.d/cuda.sh' + ``` + + **Explanation:** + + - We add to `/etc/profile.d/` as the `/etc/` folder is unique to this particular container, and is not shared with other containers or the host system. + - The backslash `\` before `$PATH` ensures the variable is correctly written into the script. + +2. **Make the Script Executable:** + + ```bash + sudo chmod +x /etc/profile.d/cuda.sh + ``` + +3. **Source the Script to Update Your Environment:** + + ```bash + source /etc/profile.d/cuda.sh + ``` + + **Note:** This command updates your current shell session with the new `PATH`. The `/etc/profile.d/cuda.sh` script ensures that the CUDA binaries are available in your `PATH` for all future sessions. + +## Verifying the Installation + +To confirm that CUDA is correctly installed and configured, check the version of the NVIDIA CUDA Compiler (`nvcc`): + +```bash +nvcc --version +``` + +You should see output similar to: + +``` +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on Fri_Feb_21_20:23:50_PST_2025 +Cuda compilation tools, release 12.8, V12.8.93 +Build cuda_12.8.r12.8/compiler.35583870_0 +``` + +This output confirms that the CUDA compiler is accessible and indicates the installed version. + +## Conclusion + +You have successfully set up CUDA on Fedora within a toolbox environment using the Fedora 41 CUDA repository. By manually updating the RPM db and configuring the environment, you can develop CUDA applications without affecting your host system. + +## Troubleshooting + +- **Installation Failures:** + + - If you encounter errors during installation, carefully read the error messages. They often indicate conflicting files or missing dependencies. + - You may use the `--excludepath` option with `rpm` to exclude conflicting files during manual RPM installations. + +- **Rebooting the Container:** + + - Sometimes there may be a bug in the NVIDIA driver host passthrough (such as missing a shared library). Rebooting the container may solve this issue: + + ```bash + # on the host system + podman container restart --all + ``` + +- **Environment Variables Not Set:** + - If `nvcc` is not found after installation, ensure that `/usr/local/cuda/bin` is in your `PATH`. + - Run `echo $PATH` to check if the path is included. + - Re-source the profile script or open a new terminal session. + +## Additional Notes + +- **Updating CUDA in the Future:** + + - Keep an eye on the official NVIDIA repositories for updates to your Fedora version. + - When an updated repository becomes available, adjust your `dnf` configuration accordingly. 
+ +- **Building `llama.cpp`:** + + - With CUDA installed, you can follow these [build instructions for `llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) to compile it with CUDA support. + - Ensure that any CUDA-specific build flags or paths are correctly set in your build configuration. + +- **Using the Toolbox Environment:** + - The toolbox environment is isolated from your host system, which helps prevent conflicts. + - Remember that system files and configurations inside the toolbox are separate from the host. By default the home directory of the user is shared between the host and the toolbox. + +--- + +**Disclaimer:** Manually installing and modifying system packages can lead to instability of the container. The above steps are provided as a guideline and may need adjustments based on your specific system configuration. Always back up important data before making significant system changes, especially as your home folder is writable and shared with he toolbox. + +**Acknowledgments:** Special thanks to the Fedora community and NVIDIA documentation for providing resources that assisted in creating this guide. + +## References + +- [Fedora Toolbox Documentation](https://docs.fedoraproject.org/en-US/fedora-silverblue/toolbox/) +- [NVIDIA CUDA Installation Guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html) +- [Podman Documentation](https://podman.io/get-started) + +--- diff --git a/llama.cpp/docs/backend/OPENCL.md b/llama.cpp/docs/backend/OPENCL.md new file mode 100644 index 0000000..044ac60 --- /dev/null +++ b/llama.cpp/docs/backend/OPENCL.md @@ -0,0 +1,282 @@ +# llama.cpp for OpenCL + +- [Background](#background) +- [OS](#os) +- [Hardware](#hardware) +- [DataType Supports](#datatype-supports) +- [Model Preparation](#model-preparation) +- [CMake Options](#cmake-options) +- [Android](#android) +- [Windows 11 Arm64](#windows-11-arm64) +- [Linux](#Linux) +- [Known Issue](#known-issues) +- [TODO](#todo) + +## Background + +OpenCL (Open Computing Language) is an open, royalty-free standard for cross-platform, parallel programming of diverse accelerators found in supercomputers, cloud servers, personal computers, mobile devices and embedded platforms. OpenCL specifies a programming language (based on C99) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. Similar to CUDA, OpenCL has been widely used to program GPUs and is supported by most GPU vendors. + +### Llama.cpp + OpenCL + +The llama.cpp OpenCL backend is designed to enable llama.cpp on **Qualcomm Adreno GPU** firstly via OpenCL. Thanks to the portabilty of OpenCL, the OpenCL backend can also run on certain Intel GPUs such as those that do not have [SYCL](/docs/backend/SYCL.md) support although the performance is not optimal. 
+ +## OS + +| OS | Status | Verified | +|---------|---------|------------------------------------------------| +| Android | Support | Snapdragon 8 Gen 3, Snapdragon 8 Elite | +| Windows | Support | Windows 11 Arm64 with Snapdragon X Elite | +| Linux | Support | Ubuntu 22.04 WSL2 with Intel 12700H | + +## Hardware + +### Adreno GPU + +**Verified devices** + +| Adreno GPU | Status | +|:------------------------------------:|:-------:| +| Adreno 750 (Snapdragon 8 Gen 3) | Support | +| Adreno 830 (Snapdragon 8 Elite) | Support | +| Adreno X85 (Snapdragon X Elite) | Support | + +> A6x GPUs with a recent driver and compiler are supported; they are usually found in IoT platforms. +However, A6x GPUs in phones are likely not supported due to the outdated driver and compiler. + +## DataType Supports + +| DataType | Status | +|:----------------------:|:--------------------------:| +| Q4_0 | Support | +| Q6_K | Support, but not optimized | +| Q8_0 | Support | +| MXFP4 | Support | + +## Model Preparation + +You can refer to the general [llama-quantize tool](/tools/quantize/README.md) for steps to convert a model in Hugging Face safetensor format to GGUF with quantization. + +Currently we support `Q4_0` quantization and have optimized for it. To achieve best performance on Adreno GPU, add `--pure` to `llama-quantize` (i.e., make all weights in `Q4_0`). For example, + +```sh +./llama-quantize --pure ggml-model-qwen2.5-3b-f16.gguf ggml-model-qwen-3b-Q4_0.gguf Q4_0 +``` + +Since `Q6_K` is also supported, `Q4_0` quantization without `--pure` will also work. However, the performance will be worse compared to pure `Q4_0` quantization. + +### `MXFP4` MoE Models + +OpenAI gpt-oss models are MoE models in `MXFP4`. The quantized model will be in `MXFP4_MOE`, a mixture of `MXFP4` and `Q8_0`. +For this quantization, there is no need to specify `--pure`. +For gpt-oss-20b model, you can directly [download](https://huggingface.co/ggml-org/gpt-oss-20b-GGUF) the quantized GGUF file in `MXFP4_MOE` from Hugging Face. + +Although it is possible to quantize gpt-oss-20b model in pure `Q4_0` (all weights in `Q4_0`), it is not recommended since `MXFP4` has been optimized for MoE while `Q4_0` is not. In addition, accuracy should degrade with such pure `Q4_0` quantization. +Hence, using the default `MXFP4_MOE` quantization (see the link above) is recommended for this model. + +> Note that the `Q4_0` model found [here](https://huggingface.co/unsloth/gpt-oss-20b-GGUF/blob/main/gpt-oss-20b-Q4_0.gguf) is a mixture of `Q4_0`, `Q8_0` and `MXFP4` and gives better performance than `MXFP4_MOE` quantization. + +## CMake Options + +The OpenCL backend has the following CMake options that control the behavior of the backend. + +| CMake options | Default value | Description | +|:---------------------------------:|:--------------:|:------------------------------------------| +| `GGML_OPENCL_EMBED_KERNELS` | `ON` | Embed OpenCL kernels into the executable. | +| `GGML_OPENCL_USE_ADRENO_KERNELS` | `ON` | Use kernels optimized for Adreno. | + +## Android + +Ubuntu 22.04 is used for targeting Android. Make sure the following tools are accessible from command line, + +* Git +* CMake 3.29 +* Ninja +* Python3 + +### I. Setup Environment + +1. 
**Install NDK** + +```sh +cd ~ +wget https://dl.google.com/android/repository/commandlinetools-linux-8512546_latest.zip && \ +unzip commandlinetools-linux-8512546_latest.zip && \ +mkdir -p ~/android-sdk/cmdline-tools && \ +mv cmdline-tools latest && \ +mv latest ~/android-sdk/cmdline-tools/ && \ +rm -rf commandlinetools-linux-8512546_latest.zip + +yes | ~/android-sdk/cmdline-tools/latest/bin/sdkmanager "ndk;26.3.11579264" +``` + +2. **Install OpenCL Headers and Library** + +```sh +mkdir -p ~/dev/llm +cd ~/dev/llm + +git clone https://github.com/KhronosGroup/OpenCL-Headers && \ +cd OpenCL-Headers && \ +cp -r CL ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include + +cd ~/dev/llm + +git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && \ +cd OpenCL-ICD-Loader && \ +mkdir build_ndk26 && cd build_ndk26 && \ +cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \ + -DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \ + -DOPENCL_ICD_LOADER_HEADERS_DIR=$HOME/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \ + -DANDROID_ABI=arm64-v8a \ + -DANDROID_PLATFORM=24 \ + -DANDROID_STL=c++_shared && \ +ninja && \ +cp libOpenCL.so ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android +``` + +### II. Build llama.cpp + +```sh +cd ~/dev/llm + +git clone https://github.com/ggml-org/llama.cpp && \ +cd llama.cpp && \ +mkdir build-android && cd build-android + +cmake .. -G Ninja \ + -DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \ + -DANDROID_ABI=arm64-v8a \ + -DANDROID_PLATFORM=android-28 \ + -DBUILD_SHARED_LIBS=OFF \ + -DGGML_OPENCL=ON + +ninja +``` + +## Windows 11 Arm64 + +A Snapdragon X Elite device with Windows 11 Arm64 is used. Make sure the following tools are accessible from command line, + +* Git +* CMake 3.29 +* Clang 19 +* Ninja +* Visual Studio 2022 +* Powershell 7 +* Python + +Visual Studio provides necessary headers and libraries although it is not directly used for building. +Alternatively, Visual Studio Build Tools can be installed instead of the full Visual Studio. + +> Note that building using Visual Studio's cl compiler is not supported. Clang must be used. Clang depends on libraries provided by Visual Studio to work. Therefore, Visual Studio must be installed. Alternatively, Visual Studio Build Tools can be installed instead of the full Visual Studio. + +Powershell 7 is used for the following commands. +If an older version of Powershell is used, these commands may not work as they are. + +### I. Setup Environment + +1. **Install OpenCL Headers and Library** + +```powershell +mkdir -p ~/dev/llm + +cd ~/dev/llm +git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers +mkdir build && cd build +cmake .. -G Ninja ` + -DBUILD_TESTING=OFF ` + -DOPENCL_HEADERS_BUILD_TESTING=OFF ` + -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF ` + -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl" +cmake --build . --target install + +cd ~/dev/llm +git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader +mkdir build && cd build +cmake .. -G Ninja ` + -DCMAKE_BUILD_TYPE=Release ` + -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" ` + -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl" +cmake --build . --target install +``` + +### II. 
Build llama.cpp + +```powershell + +mkdir -p ~/dev/llm +cd ~/dev/llm + +git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp +mkdir build && cd build + +cmake .. -G Ninja ` + -DCMAKE_TOOLCHAIN_FILE="$HOME/dev/llm/llama.cpp/cmake/arm64-windows-llvm.cmake" ` + -DCMAKE_BUILD_TYPE=Release ` + -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" ` + -DBUILD_SHARED_LIBS=OFF ` + -DGGML_OPENCL=ON +ninja +``` + +## Linux + +The two steps just above also apply to Linux. When building for linux, the commands are mostly the same as those for PowerShell on Windows, but in the second step they do not have the `-DCMAKE_TOOLCHAIN_FILE` parameter, and then in both steps the backticks are replaced with back slashes. + +If not installed already, install Git, CMake, Clang, Ninja and Python, then run in the terminal the following: + +### I. Setup Environment + +1. **Install OpenCL Headers and Library** + +```bash +mkdir -p ~/dev/llm + +cd ~/dev/llm +git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers +mkdir build && cd build +cmake .. -G Ninja \ + -DBUILD_TESTING=OFF \ + -DOPENCL_HEADERS_BUILD_TESTING=OFF \ + -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \ + -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl" +cmake --build . --target install + +cd ~/dev/llm +git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader +mkdir build && cd build +cmake .. -G Ninja \ + -DCMAKE_BUILD_TYPE=Release \ + -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \ + -DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl" +cmake --build . --target install +``` + +### II. Build llama.cpp + +```bash +mkdir -p ~/dev/llm +cd ~/dev/llm + +git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp +mkdir build && cd build + +cmake .. -G Ninja \ + -DCMAKE_BUILD_TYPE=Release \ + -DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \ + -DBUILD_SHARED_LIBS=OFF \ + -DGGML_OPENCL=ON +ninja +``` + +## Known Issues + +- Flash attention does not always improve performance. +- Currently OpenCL backend works on A6xx GPUs with recent drivers and compilers (usually found in IoT platforms). + However, it does not work on A6xx GPUs found in phones with old drivers and compilers. + +## TODO + +- Optimization for Q6_K +- Support and optimization for Q4_K +- Improve flash attention diff --git a/llama.cpp/docs/backend/SYCL.md b/llama.cpp/docs/backend/SYCL.md new file mode 100644 index 0000000..b3cff96 --- /dev/null +++ b/llama.cpp/docs/backend/SYCL.md @@ -0,0 +1,753 @@ +# llama.cpp for SYCL + +- [Background](#background) +- [Recommended Release](#recommended-release) +- [News](#news) +- [OS](#os) +- [Hardware](#hardware) +- [Docker](#docker) +- [Linux](#linux) +- [Windows](#windows) +- [Environment Variable](#environment-variable) +- [Known Issue](#known-issues) +- [Q&A](#qa) +- [TODO](#todo) + +## Background + +**SYCL** is a high-level parallel programming model designed to improve developers productivity writing code across various hardware accelerators such as CPUs, GPUs, and FPGAs. It is a single-source language designed for heterogeneous computing and based on standard C++17. + +**oneAPI** is an open ecosystem and a standard-based specification, supporting multiple architectures including but not limited to Intel CPUs, GPUs and FPGAs. The key components of the oneAPI ecosystem include: + +- **DPCPP** *(Data Parallel C++)*: The primary oneAPI SYCL implementation, which includes the icpx/icx Compilers. +- **oneAPI Libraries**: A set of highly optimized libraries targeting multiple domains *(e.g. Intel oneMKL, oneMath and oneDNN)*. 
+- **oneAPI LevelZero**: A high performance low level interface for fine-grained control over Intel iGPUs and dGPUs. + +### Llama.cpp + SYCL + +The llama.cpp SYCL backend is primarily designed for **Intel GPUs**. +SYCL cross-platform capabilities enable support for other vendor GPUs as well. + +## Recommended Release + +The following releases are verified and recommended: + +|Commit ID|Tag|Release|Verified Platform| Update date| +|-|-|-|-|-| +|24e86cae7219b0f3ede1d5abdf5bf3ad515cccb8|b5377 |[llama-b5377-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b5377/llama-b5377-bin-win-sycl-x64.zip) |Arc B580/Linux/oneAPI 2025.1<br>LNL Arc GPU/Windows 11/oneAPI 2025.1.1|2025-05-15| +|3bcd40b3c593d14261fb2abfabad3c0fb5b9e318|b4040 |[llama-b4040-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b4040/llama-b4040-bin-win-sycl-x64.zip) |Arc A770/Linux/oneAPI 2024.1<br>MTL Arc GPU/Windows 11/oneAPI 2024.1| 2024-11-19| +|fb76ec31a9914b7761c1727303ab30380fd4f05c|b3038 |[llama-b3038-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b3038/llama-b3038-bin-win-sycl-x64.zip) |Arc A770/Linux/oneAPI 2024.1<br>MTL Arc GPU/Windows 11/oneAPI 2024.1|| + + +## News + +- 2026.02 + - Remove support for Nvidia & AMD GPU, because the oneAPI plugin for Nvidia & AMD GPU is unavailable: download/installation channels are out of work. User can't build up the software for Nvidia & AMD GPU. + +- 2025.11 + - Support malloc memory on device more than 4GB. + +- 2025.2 + - Optimize MUL_MAT Q4_0 on Intel GPU for all dGPUs and built-in GPUs since MTL. Increase the performance of LLM (llama-2-7b.Q4_0.gguf) 21%-87% on Intel GPUs (MTL, ARL-H, Arc, Flex, PVC). + |GPU|Base tokens/s|Increased tokens/s|Percent| + |-|-|-|-| + |PVC 1550|39|73|+87%| + |Flex 170|39|50|+28%| + |Arc A770|42|55|+30%| + |MTL|13|16|+23%| + |ARL-H|14|17|+21%| + +- 2024.11 + - Use syclcompat to improve the performance on some platforms. This requires to use oneAPI 2025.0 or newer. + +- 2024.8 + - Use oneDNN as the default GEMM library, improve the compatibility for new Intel GPUs. + +- 2024.5 + - Performance is increased: 34 -> 37 tokens/s of llama-2-7b.Q4_0 on Arc A770. + - Arch Linux is verified successfully. + +- 2024.4 + - Support data types: GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M. + +- 2024.3 + - Release binary files of Windows. + - A blog is published: **Run LLM on all Intel GPUs Using llama.cpp**: [intel.com](https://www.intel.com/content/www/us/en/developer/articles/technical/run-llm-on-all-gpus-using-llama-cpp-artical.html) or [medium.com](https://medium.com/@jianyu_neo/run-llm-on-all-intel-gpus-using-llama-cpp-fd2e2dcbd9bd). + - New base line is ready: [tag b2437](https://github.com/ggml-org/llama.cpp/tree/b2437). + - Support multiple cards: **--split-mode**: [none|layer]; not support [row], it's on developing. + - Support to assign main GPU by **--main-gpu**, replace $GGML_SYCL_DEVICE. + - Support detecting all GPUs with level-zero and same top **Max compute units**. + - Support OPs + - hardsigmoid + - hardswish + - pool2d + +- 2024.1 + - Create SYCL backend for Intel GPU. 
+ - Support Windows build + +## OS + +| OS | Status | Verified | +|---------|---------|------------------------------------------------| +| Linux | Support | Ubuntu 22.04, Fedora Silverblue 39, Arch Linux | +| Windows | Support | Windows 11 | + + +## Hardware + +### Intel GPU + +SYCL backend supports Intel GPU Family: + +- Intel Data Center Max Series +- Intel Flex Series, Arc Series +- Intel Built-in Arc GPU +- Intel iGPU in Core CPU (11th Generation Core CPU and newer, refer to [oneAPI supported GPU](https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html#inpage-nav-1-1)). + +On older Intel GPUs, you may try [OpenCL](/docs/backend/OPENCL.md) although the performance is not optimal, and some GPUs may not support OpenCL nor have any GPGPU capabilities. + +#### Verified devices + +| Intel GPU | Status | Verified Model | +|-------------------------------|---------|---------------------------------------| +| Intel Data Center Max Series | Support | Max 1550, 1100 | +| Intel Data Center Flex Series | Support | Flex 170 | +| Intel Arc A-Series | Support | Arc A770, Arc A730M, Arc A750 | +| Intel Arc B-Series | Support | Arc B580 | +| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake, Arrow Lake, Lunar Lake | +| Intel iGPU | Support | iGPU in 13700k, 13400, i5-1250P, i7-1260P, i7-1165G7 | + +*Notes:* + +- **Memory** + - The device memory is a limitation when running a large model. The loaded model size, *`llm_load_tensors: buffer_size`*, is displayed in the log when running `./bin/llama-completion`. + - Please make sure the GPU shared memory from the host is large enough to account for the model's size. For e.g. the *llama-2-7b.Q4_0* requires at least 8.0GB for integrated GPU and 4.0GB for discrete GPU. + +- **Execution Unit (EU)** + - If the iGPU has less than 80 EUs, the inference speed will likely be too slow for practical use. + +### Other Vendor GPU + +NA + +## Docker + +The docker build option is currently limited to *Intel GPU* targets. + +### Build image + +```sh +# Using FP32 +docker build -t llama-cpp-sycl --build-arg="GGML_SYCL_F16=OFF" --target light -f .devops/intel.Dockerfile . + +# Using FP16 +docker build -t llama-cpp-sycl --build-arg="GGML_SYCL_F16=ON" --target light -f .devops/intel.Dockerfile . +``` + +*Notes*: + +You can also use the `.devops/llama-server-intel.Dockerfile`, which builds the *"server"* alternative. +Check the [documentation for Docker](../docker.md) to see the available images. + +### Run container + +```sh +# First, find all the DRI cards +ls -la /dev/dri +# Then, pick the card that you want to use (here for e.g. /dev/dri/card1). +docker run -it --rm -v "/path/to/models:/models" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 llama-cpp-sycl -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -c 4096 -s 0 +``` + +*Notes:* +- Docker has been tested successfully on native Linux. WSL support has not been verified yet. +- You may need to install Intel GPU driver on the **host** machine *(Please refer to the [Linux configuration](#linux) for details)*. + +## Linux + +### I. Setup Environment + +1. **Install GPU drivers** + + - **Intel GPU** + +Intel data center GPUs drivers installation guide and download page can be found here: [Get intel dGPU Drivers](https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps). 
+ +*Note*: for client GPUs *(iGPU & Arc A-Series)*, please refer to the [client iGPU driver installation](https://dgpu-docs.intel.com/driver/client/overview.html). + +Once installed, add the user(s) to the `video` and `render` groups. + +```sh +sudo usermod -aG render $USER +sudo usermod -aG video $USER +``` + +*Note*: logout/re-login for the changes to take effect. + +Verify installation through `clinfo`: + +```sh +sudo apt install clinfo +sudo clinfo -l +``` + +Sample output: + +```sh +Platform #0: Intel(R) OpenCL Graphics + `-- Device #0: Intel(R) Arc(TM) A770 Graphics + +Platform #0: Intel(R) OpenCL HD Graphics + `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49] +``` + +2. **Install Intel® oneAPI Base toolkit** + +SYCL backend depends on: + - Intel® oneAPI DPC++/C++ compiler/running-time. + - Intel® oneAPI DPC++/C++ library (oneDPL). + - Intel® oneAPI Deep Neural Network Library (oneDNN). + - Intel® oneAPI Math Kernel Library (oneMKL). + +- **For Intel GPU** + +All above are included in both **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** packages. + +It's recommended to install **Intel® Deep Learning Essentials** which only provides the necessary libraries with less size. + +The **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page. + +Please follow the instructions for downloading and installing the Toolkit for Linux, and preferably keep the default installation values unchanged, notably the installation path *(`/opt/intel/oneapi` by default)*. + +Following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable. + +Upon a successful installation, SYCL is enabled for the available intel devices, along with relevant libraries such as oneAPI oneDNN for Intel GPUs. + +|Verified release| +|-| +|2025.2.1| +|2025.1| +|2024.1| + +3. **Verify installation and environment** + +In order to check the available SYCL devices on the machine, please use the `sycl-ls` command. +```sh +source /opt/intel/oneapi/setvars.sh +sycl-ls +``` + +- **Intel GPU** + +When targeting an intel GPU, the user should expect one or more devices among the available SYCL devices. Please make sure that at least one GPU is present via `sycl-ls`, for instance `[level_zero:gpu]` in the sample output below: + +``` +[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A770 Graphics 12.55.8 [1.3.29735+27] +[level_zero:gpu][level_zero:1] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) UHD Graphics 730 12.2.0 [1.3.29735+27] +[opencl:cpu][opencl:0] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i5-13400 OpenCL 3.0 (Build 0) [2025.20.8.0.06_160000] +[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.39.31294] +[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 730 OpenCL 3.0 NEO [24.39.31294] +``` + +### II. 
Build llama.cpp + +#### Intel GPU + +```sh +./examples/sycl/build.sh +``` + +or + +```sh +# Export relevant ENV variables +source /opt/intel/oneapi/setvars.sh + +# Option 1: Use FP32 (recommended for better performance in most cases) +cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx + +# Option 2: Use FP16 +cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON + +# build all binary +cmake --build build --config Release -j -v +``` + +It is possible to come across some precision issues when running tests that stem from using faster +instructions, which can be circumvented by setting the environment variable `SYCL_PROGRAM_COMPILE_OPTIONS` +as `-cl-fp32-correctly-rounded-divide-sqrt` + +### III. Run the inference + +#### Retrieve and prepare model + +You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation, or download an already quantized model like [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf?download=true) or [Meta-Llama-3-8B-Instruct-Q4_0.gguf](https://huggingface.co/aptha/Meta-Llama-3-8B-Instruct-Q4_0-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf). + +##### Check device + +1. Enable oneAPI running environment + +```sh +source /opt/intel/oneapi/setvars.sh +``` + +2. List devices information + +Similar to the native `sycl-ls`, available SYCL devices can be queried as follow: + +```sh +./build/bin/llama-ls-sycl-device +``` + +This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *intel GPU* it would look like the following: +``` +found 2 SYCL devices: + +| | | |Compute |Max compute|Max work|Max sub| | +|ID| Device Type| Name|capability|units |group |group |Global mem size| +|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------| +| 0|[level_zero:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136| +| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216| +``` + +#### Choose level-zero devices + +|Chosen Device ID|Setting| +|-|-| +|0|`export ONEAPI_DEVICE_SELECTOR="level_zero:0"` or no action| +|1|`export ONEAPI_DEVICE_SELECTOR="level_zero:1"`| +|0 & 1|`export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"`| + +#### Execute + +Choose one of following methods to run. + +1. Script + +- Use device 0: + +```sh +./examples/sycl/test.sh -mg 0 +``` +- Use multiple devices: + +```sh +./examples/sycl/test.sh +``` + +2. Command line +Launch inference + +There are two device selection modes: + +- Single device: Use one device assigned by user. Default device id is 0. +- Multiple devices: Automatically choose the devices with the same backend. + +In two device selection modes, the default SYCL backend is level_zero, you can choose other backend supported by SYCL by setting environment variable ONEAPI_DEVICE_SELECTOR. 
+ +| Device selection | Parameter | +|------------------|----------------------------------------| +| Single device | --split-mode none --main-gpu DEVICE_ID | +| Multiple devices | --split-mode layer (default) | + +Examples: + +- Use device 0: + +```sh +ZES_ENABLE_SYSMAN=1 ./build/bin/llama-completion -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 99 -sm none -mg 0 --mmap +``` + +- Use multiple devices: + +```sh +ZES_ENABLE_SYSMAN=1 ./build/bin/llama-completion -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 99 -sm layer --mmap +``` + +*Notes:* + +- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow: + +```sh +detect 1 SYCL GPUs: [0] with top Max compute units:512 +``` +Or +```sh +use 1 SYCL GPUs: [0] with Max compute units:512 +``` + +## Windows + +### I. Setup Environment + +1. Install GPU driver + +Intel GPU drivers instructions guide and download page can be found here: [Get Intel GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html). + +2. Install Visual Studio + +If you already have a recent version of Microsoft Visual Studio, you can skip this step. Otherwise, please refer to the official download page for [Microsoft Visual Studio](https://visualstudio.microsoft.com/). + +3. Install Intel® oneAPI Base toolkit + +SYCL backend depends on: + - Intel® oneAPI DPC++/C++ compiler/running-time. + - Intel® oneAPI DPC++/C++ library (oneDPL). + - Intel® oneAPI Deep Neural Network Library (oneDNN). + - Intel® oneAPI Math Kernel Library (oneMKL). + +All above are included in both **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** packages. + +It's recommended to install **Intel® Deep Learning Essentials** which only provides the necessary libraries with less size. + +The **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page. + +Please follow the instructions for downloading and installing the Toolkit for Windows, and preferably keep the default installation values unchanged, notably the installation path *(`C:\Program Files (x86)\Intel\oneAPI` by default)*. + +Following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable. + +b. Enable oneAPI running environment: + +- Type "oneAPI" in the search bar, then open the `Intel oneAPI command prompt for Intel 64 for Visual Studio 2022` App. + +- On the command prompt, enable the runtime environment with the following: +``` +"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 +``` + +- if you are using Powershell, enable the runtime environment with the following: + +``` +cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell' +``` + +c. Verify installation + +In the oneAPI command line, run the following to print the available SYCL devices: + +``` +sycl-ls.exe +``` + +There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. 
Below is example of such output detecting an *intel Iris Xe* GPU as a Level-zero SYCL device: + +Output (example): +``` +[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000] +[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000] +[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO [31.0.101.5186] +[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044] +``` + +4. Install build tools + +a. Download & install cmake for Windows: https://cmake.org/download/ (CMake can also be installed from Visual Studio Installer) +b. The new Visual Studio will install Ninja as default. (If not, please install it manually: https://ninja-build.org/) + + +### II. Build llama.cpp + +You could download the release package for Windows directly, which including binary files and depended oneAPI dll files. + +Choose one of following methods to build from source code. + +#### 1. Script + +```sh +.\examples\sycl\win-build-sycl.bat +``` + +#### 2. CMake + +On the oneAPI command line window, step into the llama.cpp main directory and run the following: + +``` +@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force + +# Option 1: Use FP32 (recommended for better performance in most cases) +cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release + +# Option 2: Or FP16 +cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL_F16=ON + +cmake --build build --config Release -j +``` + +Or, use CMake presets to build: + +```sh +cmake --preset x64-windows-sycl-release +cmake --build build-x64-windows-sycl-release -j --target llama-completion + +cmake -DGGML_SYCL_F16=ON --preset x64-windows-sycl-release +cmake --build build-x64-windows-sycl-release -j --target llama-completion + +cmake --preset x64-windows-sycl-debug +cmake --build build-x64-windows-sycl-debug -j --target llama-completion +``` + +#### 3. Visual Studio + +You have two options to use Visual Studio to build llama.cpp: +- As CMake Project using CMake presets. +- Creating a Visual Studio solution to handle the project. + +**Note**: + +All following commands are executed in PowerShell. + +##### - Open as a CMake Project + +You can use Visual Studio to open the `llama.cpp` folder directly as a CMake project. Before compiling, select one of the SYCL CMake presets: + +- `x64-windows-sycl-release` + +- `x64-windows-sycl-debug` + +*Notes:* +- For a minimal experimental setup, you can build only the inference executable using: + + ```Powershell + cmake --build build --config Release -j --target llama-completion + ``` + +##### - Generating a Visual Studio Solution + +You can use Visual Studio solution to build and work on llama.cpp on Windows. You need to convert the CMake Project into a `.sln` file. + +If you want to use the Intel C++ Compiler for the entire `llama.cpp` project, run the following command: + +```Powershell +cmake -B build -G "Visual Studio 17 2022" -T "Intel C++ Compiler 2025" -A x64 -DGGML_SYCL=ON -DCMAKE_BUILD_TYPE=Release +``` + +If you prefer to use the Intel C++ Compiler only for `ggml-sycl`, ensure that `ggml` and its backend libraries are built as shared libraries ( i.e. 
`-DBUILD_SHARED_LIBRARIES=ON`, this is default behaviour): + +```Powershell +cmake -B build -G "Visual Studio 17 2022" -A x64 -DGGML_SYCL=ON -DCMAKE_BUILD_TYPE=Release \ + -DSYCL_INCLUDE_DIR="C:\Program Files (x86)\Intel\oneAPI\compiler\latest\include" \ + -DSYCL_LIBRARY_DIR="C:\Program Files (x86)\Intel\oneAPI\compiler\latest\lib" +``` + +If successful the build files have been written to: *path/to/llama.cpp/build* +Open the project file **build/llama.cpp.sln** with Visual Studio. + +Once the Visual Studio solution is created, follow these steps: + +1. Open the solution in Visual Studio. + +2. Right-click on `ggml-sycl` and select **Properties**. + +3. In the left column, expand **C/C++** and select **DPC++**. + +4. In the right panel, find **Enable SYCL Offload** and set it to `Yes`. + +5. Apply the changes and save. + + +*Navigation Path:* + +``` +Properties -> C/C++ -> DPC++ -> Enable SYCL Offload (Yes) +``` + +Now, you can build `llama.cpp` with the SYCL backend as a Visual Studio project. +To do it from menu: `Build -> Build Solution`. +Once it is completed, final results will be in **build/Release/bin** + +*Additional Note* + +- You can avoid specifying `SYCL_INCLUDE_DIR` and `SYCL_LIBRARY_DIR` in the CMake command by setting the environment variables: + + - `SYCL_INCLUDE_DIR_HINT` + + - `SYCL_LIBRARY_DIR_HINT` + +- Above instruction has been tested with Visual Studio 17 Community edition and oneAPI 2025.0. We expect them to work also with future version if the instructions are adapted accordingly. + +### III. Run the inference + +#### Retrieve and prepare model + +You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation, or download an already quantized model like [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) or [Meta-Llama-3-8B-Instruct-Q4_0.gguf](https://huggingface.co/aptha/Meta-Llama-3-8B-Instruct-Q4_0-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf). + +##### Check device + +1. Enable oneAPI running environment + +On the oneAPI command line window, run the following and step into the llama.cpp directory: +``` +"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 +``` + +2. List devices information + +Similar to the native `sycl-ls`, available SYCL devices can be queried as follow: + +``` +build\bin\llama-ls-sycl-device.exe +``` + +This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *Intel GPU* it would look like the following: +``` +found 2 SYCL devices: +| | | |Compute |Max compute|Max work|Max sub| | +|ID| Device Type| Name|capability|units |group |group |Global mem size| +|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------| +| 0|[level_zero:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136| +| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216| + +``` + +#### Choose level-zero devices + +|Chosen Device ID|Setting| +|-|-| +|0|Default option. You may also want to `set ONEAPI_DEVICE_SELECTOR="level_zero:0"`| +|1|`set ONEAPI_DEVICE_SELECTOR="level_zero:1"`| +|0 & 1|`set ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"` or `set ONEAPI_DEVICE_SELECTOR="level_zero:*"`| + +#### Execute + +Choose one of following methods to run. + +1. Script + +``` +examples\sycl\win-test.bat +``` + +2. 
Command line + +Launch inference + +There are two device selection modes: + +- Single device: Use one device assigned by user. Default device id is 0. +- Multiple devices: Automatically choose the devices with the same backend. + +In two device selection modes, the default SYCL backend is level_zero, you can choose other backend supported by SYCL by setting environment variable ONEAPI_DEVICE_SELECTOR. + +| Device selection | Parameter | +|------------------|----------------------------------------| +| Single device | --split-mode none --main-gpu DEVICE_ID | +| Multiple devices | --split-mode layer (default) | + +Examples: + +- Use device 0: + +``` +build\bin\llama-completion.exe -no-cnv -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 99 -sm none -mg 0 --mmap +``` + +- Use multiple devices: + +``` +build\bin\llama-completion.exe -no-cnv -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 99 -sm layer --mmap +``` + + +Note: + +- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow: + +```sh +detect 1 SYCL GPUs: [0] with top Max compute units:512 +``` + +Or + +```sh +use 1 SYCL GPUs: [0] with Max compute units:512 +``` + + +## Environment Variable + +#### Build + +| Name | Value | Function | +|--------------------|---------------------------------------|---------------------------------------------| +| GGML_SYCL | ON (mandatory) | Enable build with SYCL code path. | +| GGML_SYCL_TARGET | INTEL *(default)* | Set the SYCL target device type. | +| GGML_SYCL_DEVICE_ARCH | Optional | Set the SYCL device architecture. Setting the device architecture can improve the performance. See the table [--offload-arch](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OffloadDesign.md#--offload-arch) for a list of valid architectures. | +| GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. (1.) | +| GGML_SYCL_GRAPH | OFF *(default)* \|ON *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). | +| GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. | +| CMAKE_C_COMPILER | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path. | +| CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)* | Set `icpx/icx` compiler for SYCL code path. | + +1. FP32 or FP16 have different performance impact to LLM. Recommended to test them for better prompt processing performance on your models. You need to rebuild the code after change `GGML_SYCL_F16=OFF/ON`. + +#### Runtime + +| Name | Value | Function | +|-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------| +| GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG | +| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features for Intel GPUs. (Recommended to 1 for intel devices older than Gen 10) | +| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because SYCL Graph is still on development, no better performance. | +| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. 
|
+| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support querying the free GPU memory via sycl::aspect::ext_intel_free_memory.<br>Recommended when --split-mode = layer |
+| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Support allocating more than 4GB of device memory in a single allocation.|
+
+
+## Known Issues
+
+- `Split-mode:[row]` is not supported.
+
+## Q&A
+
+- Error: `error while loading shared libraries: libsycl.so: cannot open shared object file: No such file or directory`.
+
+  - Potential cause: the oneAPI installation is unavailable or its environment variables are not set.
+  - Solution: Install the *oneAPI base toolkit* and enable its environment with `source /opt/intel/oneapi/setvars.sh`.
+
+- General compiler error:
+
+  - Remove the **build** folder or try a clean build.
+
+- I can **not** see `[ext_oneapi_level_zero:gpu]` after installing the GPU driver on Linux.
+
+  Please double-check with `sudo sycl-ls`.
+
+  If it's present in the list, please add the video/render groups to your user, then **logout/login** or restart your system:
+
+  ```
+  sudo usermod -aG render $USER
+  sudo usermod -aG video $USER
+  ```
+  Otherwise, please double-check the GPU driver installation steps.
+
+- Can I report an Ollama issue on Intel GPU to the llama.cpp SYCL backend?
+
+  No. We can't support Ollama issues directly, because we aren't familiar with Ollama.
+
+  Please reproduce the issue on llama.cpp and report it against llama.cpp; we will support that.
+
+  The same applies to other projects built on top of the llama.cpp SYCL backend.
+
+- `Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)`, `ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 3503030272 Bytes of memory on device`, or `failed to allocate SYCL0 buffer`
+
+  You are running out of device memory.
+
+  |Reason|Solution|
+  |-|-|
+  | The default context is too big. It leads to excessive memory usage.|Set `-c 8192` or a smaller value.|
+  | The model is too big and requires more memory than what is available.|Choose a smaller model or change to a smaller quantization, like Q5 -> Q4;<br>Alternatively, use more than one device to load the model.|
+
+- `ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 5000000000 Bytes of memory on device`
+
+  You need to enable support for allocations larger than 4GB by setting:
+  ```
+  export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
+  set UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
+  ```
+
+### **GitHub contribution**:
+Please add the `SYCL :` prefix/tag to issue/PR titles to help the SYCL contributors check and address them without delay.
+
+## TODO
+
+- Review ZES_ENABLE_SYSMAN: https://github.com/intel/compute-runtime/blob/master/programmers-guide/SYSMAN.md#support-and-limitations
diff --git a/llama.cpp/docs/backend/VirtGPU.md b/llama.cpp/docs/backend/VirtGPU.md
new file mode 100644
index 0000000..c81468d
--- /dev/null
+++ b/llama.cpp/docs/backend/VirtGPU.md
@@ -0,0 +1,180 @@
+# GGML-VirtGPU Backend
+
+The GGML-VirtGPU backend enables GGML applications to run machine
+learning computations on host hardware while the application itself
+runs inside a virtual machine. It uses host-guest shared memory to
+efficiently share data buffers between the two sides.
+
+This backend relies on virtio-gpu and the VirglRenderer API Remoting
+(APIR) component. The backend is split into two libraries:
+- a GGML implementation (the "remoting frontend"), running in the
+  guest and interacting with the virtgpu device
+- a VirglRenderer APIR-compatible library (the "remoting backend"),
+  running in the host and interacting with VirglRenderer and an actual
+  GGML device backend.
+ +## OS support + +| OS | Status | Backend | CI testing | Notes +| -------- | ----------------- | ----------- | ----------- | ----- +| MacOS 14 | Supported | ggml-metal | X | Working when compiled on MacOS 14 +| MacOS 15 | Supported | ggml-metal | X | Working when compiled on MacOS 14 or MacOS 15 +| MacOS 26 | Not tested | | | +| Linux | Under development | ggml-vulkan | not working | Working locally, CI running into deadlocks + + +## Architecture Overview + +The GGML-VirtGPU backend consists of three main components: + +```mermaid +graph TD + %% Nodes + + subgraph GuestVM ["Guest VM - Frontend"] + App([GGML Application<br/>llama.cpp, etc.]) + + direction TB + Interface[GGML Backend Interface] + Comm["GGML-VirtGPU<br/>(hypercalls + shared mem)"] + + App --> Interface + Interface --> Comm + end + + API[virtio-gpu / virglrenderer API] + + subgraph HostSystem [Host System - Backend] + direction TB + Dispatcher[GGML-VirtGPU-Backend] + BackendLib[GGML Backend library<br/>Metal / Vulkan / CPU / ...] + + Dispatcher --> BackendLib + end + + %% Connections + Comm --> API + API --> HostSystem +``` + +### Key Components + +1. **Guest-side Frontend** (`ggml-virtgpu/`): Implements the GGML backend interface and forwards operations to the host +2. **Host-side Backend** (`ggml-virtgpu/backend/`): Receives forwarded operations and executes them on actual hardware backends +3. **Communication Layer**: Uses virtio-gpu hypercalls and shared memory for efficient data transfer + +## Features + +- **Dynamic backend loading** on the host side (CPU, CUDA, Metal, etc.) +- **Zero-copy data transfer** via host-guest shared memory pages + +## Communication Protocol + +### Hypercalls and Shared Memory + +The backend uses two primary communication mechanisms: + +1. **Hypercalls (`DRM_IOCTL_VIRTGPU_EXECBUFFER`)**: Trigger remote execution from guest to host +2. **Shared Memory Pages**: Zero-copy data transfer for tensors and parameters + +#### Shared Memory Layout + +Each connection uses two shared memory buffers: + +- **Data Buffer** (24 MiB): For command/response data and tensor transfers +- **Reply Buffer** (16 KiB): For command replies and status information +- **Data Buffers**: Dynamically allocated host-guest shared buffers + served as GGML buffers. + +### APIR Protocol + +The Virglrender API Remoting protocol defines three command types: + +- `HANDSHAKE`: Protocol version negotiation and capability discovery +- `LOADLIBRARY`: Dynamic loading of backend libraries on the host +- `FORWARD`: API function call forwarding + +### Binary Serialization + +Commands and data are serialized using a custom binary protocol with: + +- Fixed-size encoding for basic types +- Variable-length arrays with size prefixes +- Buffer bounds checking +- Error recovery mechanisms + +## Supported Operations + +### Device Operations +- Device enumeration and capability queries +- Memory information (total/free) +- Backend type detection + +### Buffer Operations +- Buffer allocation and deallocation +- Tensor data transfer (host ↔ guest) +- Memory copying and clearing + +### Computation Operations +- Graph execution forwarding + +## Build Requirements + +### Guest-side Dependencies +- `libdrm` for DRM/virtio-gpu communication +- C++20 compatible compiler +- CMake 3.14+ + +### Host-side Dependencies +- virglrenderer with APIR support (pending upstream review) +- Target backend libraries (libggml-metal, libggml-vulkan, etc.) 
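+
+As a rough orientation only, the two halves are built separately; this is a minimal sketch assuming the `GGML_VIRTGPU` and `GGML_VIRTGPU_BACKEND` CMake options listed under Build Options below (see the development guide for the exact, tested invocations):
+
+```sh
+# Guest side (inside the VM): build llama.cpp with the remoting frontend
+cmake -S . -B build-guest -DGGML_VIRTGPU=ON
+cmake --build build-guest --parallel
+
+# Host side: build only the remoting backend library that VirglRenderer loads
+cmake -S . -B build-host -DGGML_VIRTGPU_BACKEND=ONLY
+cmake --build build-host --parallel
+```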
+ +## Configuration + +### Environment Variables + +- `GGML_VIRTGPU_BACKEND_LIBRARY`: Path to the host-side backend library +- `GGML_VIRTGPU_DEBUG`: Enable debug logging + +### Build Options + +- `GGML_VIRTGPU`: Enable the VirtGPU backend (`ON` or `OFF`, default: `OFF`) +- `GGML_VIRTGPU_BACKEND`: Build the host-side backend component (`ON`, `OFF` or `ONLY`, default: `OFF`) + +### System Requirements + +- VM with virtio-gpu support +- VirglRenderer with APIR patches +- Compatible backend libraries on host + +## Limitations + +- **VM-specific**: Only works in virtual machines with virtio-gpu support +- **Host dependency**: Requires properly configured host-side backend +- **Latency**: Small overhead from VM escaping for each operation + + +* This work is pending upstream changes in the VirglRenderer + project. + * The backend can be tested with Virglrenderer compiled from source + using this PR: + https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1590 +* This work is pending changes in the VMM/hypervisor running the + virtual machine, which need to know how to route the newly + introduced APIR capset. + * The environment variable `VIRGL_ROUTE_VENUS_TO_APIR=1` allows + using the Venus capset, until the relevant hypervisors have been + patched. However, setting this flag breaks the Vulkan/Venus normal + behavior. + * The environment variable `GGML_REMOTING_USE_APIR_CAPSET` tells the + `ggml-virtgpu` backend to use the APIR capset. This will become + the default when the relevant hypervisors have been patched. + +* This work focused on improving the performance of llama.cpp running + on MacOS containers, and is mainly tested on this platform. The + linux support (via `krun`) is in progress. + +## See Also + +- [Development and Testing](VirtGPU/development.md) +- [Backend configuration](VirtGPU/configuration.md) diff --git a/llama.cpp/docs/backend/VirtGPU/configuration.md b/llama.cpp/docs/backend/VirtGPU/configuration.md new file mode 100644 index 0000000..597862d --- /dev/null +++ b/llama.cpp/docs/backend/VirtGPU/configuration.md @@ -0,0 +1,174 @@ +# GGML-VirtGPU Backend Configuration + +This document describes the environment variables used by the ggml-virtgpu backend system, covering both the frontend (guest-side) and backend (host-side) components. + +## Environment Variables Overview + +The ggml-virtgpu backend uses environment variables for configuration across three main components: +- **Frontend (Guest)**: GGML applications running in VMs +- **Hypervisor**: Virglrenderer/APIR system +- **Backend (Host)**: Host-side GGML backend integration + +## Frontend (Guest-side) Configuration + +### GGML_REMOTING_USE_APIR_CAPSET +- **Location**: `ggml/src/ggml-virtgpu/virtgpu.cpp` +- **Type**: Boolean flag (presence-based) +- **Purpose**: Controls which virtio-gpu capability set to use for communication +- **Values**: + - Set (any value): Use the APIR capset (long-term setup) + - Unset: Use the Venus capset (easier for testing with an unmodified hypervisor) +- **Default**: Unset (Venus capset) +- **Usage**: + ```bash + export GGML_REMOTING_USE_APIR_CAPSET=1 # Use APIR capset + # or leave unset for Venus capset + ``` + +## Hypervisor (Virglrenderer/APIR) Configuration + +These environment variables are used during the transition phase for +running with an unmodified hypervisor (not supporting the +VirglRenderer APIR component). They will be removed in the future, and +the hypervisor will instead configure VirglRenderer with the APIR +_Configuration Key_. 
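+
+For orientation, these variables have to be present in the environment of the process that embeds VirglRenderer (the VMM). A minimal sketch, with placeholder paths, assuming a libkrun-based podman machine as in the development guide:
+
+```sh
+export VIRGL_APIR_BACKEND_LIBRARY="/path/to/libggml-virtgpu-backend.dylib"
+export VIRGL_ROUTE_VENUS_TO_APIR=1   # only while the hypervisor lacks native APIR routing
+export CONTAINERS_MACHINE_PROVIDER=libkrun
+podman machine start
+```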
+ +### VIRGL_APIR_BACKEND_LIBRARY +- **Location**: `virglrenderer/src/apir/apir-context.c` +- **Configuration Key**: `apir.load_library.path` +- **Type**: File path string +- **Purpose**: Path to the APIR backend library that virglrenderer should dynamically load +- **Required**: Yes +- **Example**: + ```bash + export VIRGL_APIR_BACKEND_LIBRARY="/path/to/libggml-remotingbackend.so" + ``` + +### VIRGL_ROUTE_VENUS_TO_APIR +- **Location**: `virglrenderer/src/apir/apir-renderer.h` +- **Type**: Boolean flag (presence-based) +- **Purpose**: Temporary workaround to route Venus capset calls to APIR during hypervisor transition period +- **Status**: will be removed once hypervisors support APIR natively +- **Warning**: Breaks normal Vulkan/Venus functionality +- **Usage**: + ```bash + export VIRGL_ROUTE_VENUS_TO_APIR=1 # For testing with an unmodified hypervisor + ``` + +### VIRGL_APIR_LOG_TO_FILE +- **Location**: `virglrenderer/src/apir/apir-renderer.c` +- **Environment Variable**: `VIRGL_APIR_LOG_TO_FILE` +- **Type**: File path string +- **Purpose**: Enable debug logging from the VirglRenderer APIR component to specified file +- **Required**: No (optional debugging) +- **Default**: Logging to `stderr` +- **Usage**: + ```bash + export VIRGL_APIR_LOG_TO_FILE="/tmp/apir-debug.log" + ``` + +## Backend (Host-side) Configuration + +These environment variables are used during the transition phase for +running with an unmodified hypervisor (not supporting the +VirglRenderer APIR component). They will be removed in the future, and +the hypervisor will instead configure VirglRenderer with the APIR +_Configuration Key_. + +### APIR_LLAMA_CPP_GGML_LIBRARY_PATH +- **Location**: `ggml/src/ggml-virtgpu/backend/backend.cpp` +- **Environment Variable**: `APIR_LLAMA_CPP_GGML_LIBRARY_PATH` +- **Configuration Key**: `ggml.library.path` +- **Type**: File path string +- **Purpose**: Path to the actual GGML backend library (Metal, CUDA, Vulkan, etc.) 
+- **Required**: **Yes** - backend initialization fails without this +- **Examples**: + ```bash + # macOS with Metal backend + export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="/opt/llama.cpp/lib/libggml-metal.dylib" + + # Linux with CUDA backend + export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="/opt/llama.cpp/lib/libggml-cuda.so" + + # macOS or Linux with Vulkan backend + export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="/opt/llama.cpp/lib/libggml-vulkan.so" + ``` + +### APIR_LLAMA_CPP_GGML_LIBRARY_REG +- **Location**: `ggml/src/ggml-virtgpu/backend/backend.cpp` +- **Environment Variable**: `APIR_LLAMA_CPP_GGML_LIBRARY_REG` +- **Configuration Key**: `ggml.library.reg` +- **Type**: Function symbol name string +- **Purpose**: Name of the backend registration function to call after loading the library +- **Required**: No (defaults to `ggml_backend_init`) +- **Default**: `ggml_backend_init` +- **Examples**: + ```bash + # Metal backend + export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_metal_reg" + + # CUDA backend + export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_cuda_reg" + + # Vulkan backend + export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_vulkan_reg" + + # Generic fallback (default) + # export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_init" + ``` + +### APIR_LLAMA_CPP_LOG_TO_FILE +- **Location**: `ggml/src/ggml-virtgpu/backend/backend.cpp:62` +- **Environment Variable**: `APIR_LLAMA_CPP_LOG_TO_FILE` +- **Type**: File path string +- **Purpose**: Enable debug logging from the GGML backend to specified file +- **Required**: No (optional debugging) +- **Usage**: + ```bash + export APIR_LLAMA_CPP_LOG_TO_FILE="/tmp/ggml-backend-debug.log" + ``` + +## Configuration Flow + +The configuration system works as follows: + +1. **Hypervisor Setup**: Virglrenderer loads the APIR backend library specified by `VIRGL_APIR_BACKEND_LIBRARY` + +2. **Context Creation**: When an APIR context is created, it populates a configuration table with environment variables: + - `apir.load_library.path` ← `VIRGL_APIR_BACKEND_LIBRARY` + - `ggml.library.path` ← `APIR_LLAMA_CPP_GGML_LIBRARY_PATH` + - `ggml.library.reg` ← `APIR_LLAMA_CPP_GGML_LIBRARY_REG` + - this step will eventually be performed by the hypervisor itself, with command-line arguments instead of environment variables. + +3. **Backend Initialization**: The backend queries the configuration via callbacks: + - `virgl_cbs->get_config(ctx_id, "ggml.library.path")` returns the library path + - `virgl_cbs->get_config(ctx_id, "ggml.library.reg")` returns the registration function + +4. 
**Library Loading**: The backend dynamically loads and initializes the specified GGML library + +## Error Messages + +Common error scenarios and their messages: + +- **Missing library path**: `"cannot open the GGML library: env var 'APIR_LLAMA_CPP_GGML_LIBRARY_PATH' not defined"` +- **Missing registration function**: `"cannot register the GGML library: env var 'APIR_LLAMA_CPP_GGML_LIBRARY_REG' not defined"` + +## Example Complete Configuration + +Here's an example configuration for a macOS host with Metal backend: + +```bash +# Hypervisor environment +export VIRGL_APIR_BACKEND_LIBRARY="/opt/llama.cpp/lib/libggml-virtgpu-backend.dylib" + +# Backend configuration +export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="/opt/llama.cpp/lib/libggml-metal.dylib" +export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_metal_reg" + +# Optional logging +export VIRGL_APIR_LOG_TO_FILE="/tmp/apir.log" +export APIR_LLAMA_CPP_LOG_TO_FILE="/tmp/ggml.log" + +# Guest configuration +export GGML_REMOTING_USE_APIR_CAPSET=1 +``` diff --git a/llama.cpp/docs/backend/VirtGPU/development.md b/llama.cpp/docs/backend/VirtGPU/development.md new file mode 100644 index 0000000..ca2e477 --- /dev/null +++ b/llama.cpp/docs/backend/VirtGPU/development.md @@ -0,0 +1,220 @@ +# Development and Testing + +## Development + +### Code Generation + +The backend uses code generation from YAML configuration: + +```bash +# Regenerate protocol code +cd ggml-virtgpu/ +python regenerate_remoting.py +``` + +### Adding New Operations + +1. Add function definition to `ggmlremoting_functions.yaml` +2. Regenerate code with `regenerate_remoting.py` +3. Implement guest-side forwarding in `virtgpu-forward-*.cpp` +4. Implement host-side handling in `backend-dispatched-*.cpp` + +## Testing + +This document provides instructions for building and testing the GGML-VirtGPU backend on macOS with containers. + +### Prerequisites + +The testing setup requires: + +- macOS host system +- Container runtime with `libkrun` provider (podman machine) +- Access to development patchset for VirglRenderer + +### Required Patchsets + +The backend requires patches that are currently under review: + +- **Virglrenderer APIR upstream PR**: https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1590 (for reference) +- **MacOS Virglrenderer (for krunkit)**: https://gitlab.freedesktop.org/kpouget/virglrenderer/-/tree/main-macos +- **Linux Virglrenderer (for krun)**: https://gitlab.freedesktop.org/kpouget/virglrenderer/-/tree/main-linux + +### Build Instructions + +#### 1. Build ggml-virtgpu-backend (Host-side, macOS) + +```bash +# Build the backend that runs natively on macOS +mkdir llama.cpp +cd llama.cpp +git clone https://github.com/ggml-org/llama.cpp.git src +cd src + +LLAMA_MAC_BUILD=$PWD/build/ggml-virtgpu-backend + +cmake -S . -B $LLAMA_MAC_BUILD \ + -DGGML_NATIVE=OFF \ + -DLLAMA_CURL=ON \ + -DGGML_REMOTINGBACKEND=ONLY \ + -DGGML_METAL=ON + +TARGETS="ggml-metal" +cmake --build $LLAMA_MAC_BUILD --parallel 8 --target $TARGETS + +# Build additional tools for native benchmarking +EXTRA_TARGETS="llama-run llama-bench" +cmake --build $LLAMA_MAC_BUILD --parallel 8 --target $EXTRA_TARGETS +``` + +#### 2. 
Build virglrenderer (Host-side, macOS) + +```bash +# Build virglrenderer with APIR support +mkdir virglrenderer +git clone https://gitlab.freedesktop.org/kpouget/virglrenderer -b main-macos src +cd src + +VIRGL_BUILD_DIR=$PWD/build + +# -Dvenus=true and VIRGL_ROUTE_VENUS_TO_APIR=1 route the APIR requests via the Venus backend, for easier testing without a patched hypervisor + +meson setup $VIRGL_BUILD_DIR \ + -Dvenus=true \ + -Dapir=true + +ninja -C $VIRGL_BUILD_DIR +``` + +#### 3. Build ggml-virtgpu (Guest-side, Linux) + +Option A: Build from a script: + +```bash +# Inside a Linux container +mkdir llama.cpp +git clone https://github.com/ggml-org/llama.cpp.git src +cd src + +LLAMA_LINUX_BUILD=$PWD//build-virtgpu + +cmake -S . -B $LLAMA_LINUX_BUILD \ + -DGGML_VIRTGPU=ON + +ninja -C $LLAMA_LINUX_BUILD +``` + +Option B: Build container image with frontend: + +```bash +cat << EOF > remoting.containerfile +FROM quay.io/fedora/fedora:43 +USER 0 + +WORKDIR /app/remoting + +ARG LLAMA_CPP_REPO="https://github.com/ggml-org/llama.cpp.git" +ARG LLAMA_CPP_VERSION="master" +ARG LLAMA_CPP_CMAKE_FLAGS="-DGGML_VIRTGPU=ON" +ARG LLAMA_CPP_CMAKE_BUILD_FLAGS="--parallel 4" + +RUN dnf install -y git cmake gcc gcc-c++ libcurl-devel libdrm-devel + +RUN git clone "\${LLAMA_CPP_REPO}" src \\ + && git -C src fetch origin \${LLAMA_CPP_VERSION} \\ + && git -C src reset --hard FETCH_HEAD + +RUN mkdir -p build \\ + && cd src \\ + && set -o pipefail \\ + && cmake -S . -B ../build \${LLAMA_CPP_CMAKE_FLAGS} \\ + && cmake --build ../build/ \${LLAMA_CPP_CMAKE_BUILD_FLAGS} + +ENTRYPOINT ["/app/remoting/src/build/bin/llama-server"] +EOF + +mkdir -p empty_dir +podman build -f remoting.containerfile ./empty_dir -t localhost/llama-cpp.virtgpu +``` + +### Environment Setup + +#### Set krunkit Environment Variables + +```bash +# Define the base directories (adapt these paths to your system) +VIRGL_BUILD_DIR=$HOME/remoting/virglrenderer/build +LLAMA_MAC_BUILD=$HOME/remoting/llama.cpp/build-backend + +# For krunkit to load the custom virglrenderer library +export DYLD_LIBRARY_PATH=$VIRGL_BUILD_DIR/src + +# For Virglrenderer to load the ggml-remotingbackend library +export VIRGL_APIR_BACKEND_LIBRARY="$LLAMA_MAC_BUILD/bin/libggml-virtgpu-backend.dylib" + +# For llama.cpp remotingbackend to load the ggml-metal backend +export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="$LLAMA_MAC_BUILD/bin/libggml-metal.dylib" +export APIR_LLAMA_CPP_GGML_LIBRARY_REG=ggml_backend_metal_reg +``` + +#### Launch Container Environment + +```bash +# Set container provider to libkrun +export CONTAINERS_MACHINE_PROVIDER=libkrun +podman machine start +``` + +#### Verify Environment + +Confirm that krunkit is using the correct virglrenderer library: + +```bash +lsof -c krunkit | grep virglrenderer +# Expected output: +# krunkit 50574 user txt REG 1,14 2273912 10849442 ($VIRGL_BUILD_DIR/src)/libvirglrenderer.1.dylib +``` + +### Running Tests + +#### Launch Test Container + +```bash +# Optional model caching +mkdir -p models +PODMAN_CACHE_ARGS="-v models:/models --user root:root --cgroupns host --security-opt label=disable -w /models" + +podman run $PODMAN_CACHE_ARGS -it --rm --device /dev/dri localhost/llama-cpp.virtgpu +``` + +#### Test llama.cpp in Container + +```bash + +# Run performance benchmark +/app/remoting/build/bin/llama-bench -m ./llama3.2 +``` + +Expected output (performance may vary): +``` +| model | size | params | backend | ngl | test | t/s | +| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: 
| +| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | ggml-virtgpu | 99 | pp512 | 991.30 ± 0.66 | +| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | ggml-virtgpu | 99 | tg128 | 85.71 ± 0.11 | +``` + +### Troubleshooting + +#### SSH Environment Variable Issues + +⚠️ **Warning**: Setting `DYLD_LIBRARY_PATH` from SSH doesn't work on macOS. Here is a workaround: + +**Workaround 1: Replace system library** +```bash +VIRGL_BUILD_DIR=$HOME/remoting/virglrenderer/build # ⚠️ adapt to your system +BREW_VIRGL_DIR=/opt/homebrew/Cellar/virglrenderer/0.10.4d/lib +VIRGL_LIB=libvirglrenderer.1.dylib + +cd $BREW_VIRGL_DIR +mv $VIRGL_LIB ${VIRGL_LIB}.orig +ln -s $VIRGL_BUILD_DIR/src/$VIRGL_LIB +``` diff --git a/llama.cpp/docs/backend/ZenDNN.md b/llama.cpp/docs/backend/ZenDNN.md new file mode 100644 index 0000000..b57fd97 --- /dev/null +++ b/llama.cpp/docs/backend/ZenDNN.md @@ -0,0 +1,258 @@ +# llama.cpp for AMD ZenDNN + +> [!WARNING] +> **Note:** ZenDNN is **not** the same as zDNN. +> - **ZenDNN** (this page): AMD's deep learning library for AMD EPYC CPUs +> - **zDNN**: IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes ([see zDNN documentation](zDNN.md)) + +- [Background](#background) +- [OS](#os) +- [Hardware](#hardware) +- [Supported Operations](#supported-operations) +- [DataType Supports](#datatype-supports) +- [Linux](#linux) +- [Environment Variable](#environment-variable) +- [Performance Optimization](#performance-optimization) +- [Known Issues](#known-issues) +- [TODO](#todo) + +## Background + +**ZenDNN** (Zen Deep Neural Network Library) is AMD's high-performance deep learning inference library optimized for AMD EPYC™ CPUs. It provides optimized implementations of key deep learning primitives and operations, delivering significant performance improvements for neural network workloads on AMD Zen-based processor architectures. + +**Llama.cpp + ZenDNN** + +The llama.cpp ZenDNN backend leverages AMD's optimized matrix multiplication primitives to accelerate inference on AMD CPUs. It utilizes ZenDNN's **LowOHA (Low Overhead Hardware Accelerated)** MatMul operator for efficient GEMM operations with minimal execution overhead, built-in weight caching, and direct access to backend libraries (AOCL BLIS, LibXSMM, OneDNN). + +For more information about ZenDNN, visit: https://www.amd.com/en/developer/zendnn.html + +## OS + +| OS | Status | Verified | +|:-------:|:-------:|:----------------------------------------------:| +| Linux | Support | Ubuntu 20.04, 22.04, 24.04 | + +For the latest list of supported operating systems, see the [ZenDNN Supported OS](https://github.com/amd/ZenDNN/blob/zendnnl/README.md#15-supported-os). + +## Hardware + +### AMD CPUs + +**Recommended Processors** + +ZenDNN is optimized for AMD EPYC™ processors and AMD Ryzen™ processors based on "Zen" microarchitecture and newer. + +| CPU Family | Status | Notes | +|:-----------------------------:|:-------:|:----------------------------------:| +| AMD EPYC™ 9005 Series (Turin)| Support | 5th Gen - Zen 5 architecture | +| AMD EPYC™ 9004 Series (Genoa)| Support | 4th Gen - Zen 4 architecture | +| AMD EPYC™ 7003 Series (Milan)| Support | 3rd Gen - Zen 3 architecture | +| AMD Ryzen™ AI MAX (Strix Halo)| Support | High-performance mobile processors | + +*Notes:* + +- Best performance is achieved on AMD EPYC™ processors with high core counts (e.g., EPYC 9005 series). +- ZenDNN leverages AMD's advanced CPU features including AVX2 and AVX-512 instruction sets. 
+- For optimal performance, ensure your system has sufficient memory bandwidth. + +## Supported Operations + +The ZenDNN backend currently accelerates **matrix multiplication (MUL_MAT)** operations only. Other operations are handled by the standard CPU backend. + +| Operation | Status | Notes | +|:-------------|:-------:|:----------------------------------------------:| +| MUL_MAT | ✓ | Accelerated via ZenDNN LowOHA MatMul | + +*Note:* Since only MUL_MAT is accelerated, models will benefit most from ZenDNN when matrix multiplications dominate the computational workload (which is typical for transformer-based LLMs). + +## DataType Supports + +| DataType | Status | Notes | +|:----------------------:|:-------:|:---------------------------------------------:| +| FP32 | Support | Full precision floating point | +| BF16 | Support | BFloat16 (best performance on Zen 4/Zen 5) | + +*Notes:* + +- **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin). + +## Linux + +### I. Setup Environment + +You have two options to set up ZenDNN: + +#### Option 1: Automatic Download and Build (Recommended) + +CMake will automatically download and build ZenDNN for you: + +```sh +# Build llama.cpp - ZenDNN will be automatically downloaded and built +cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release +cmake --build build --config Release -j $(nproc) +``` + +No manual ZenDNN installation required. CMake will handle everything automatically. + +#### Option 2: Use Custom ZenDNN Installation + +If you want to build ZenDNN yourself or use a specific version: + +**Step 1: Build ZenDNN from source** + +```sh +# Clone ZenDNN repository +git clone https://github.com/amd/ZenDNN.git +cd ZenDNN +git checkout zendnnl + +# Build and install (requires CMake >= 3.25) +mkdir build && cd build +cmake .. +cmake --build . --target all +``` + +Default installation path: `ZenDNN/build/install` + +**For detailed build instructions**, refer to the [ZenDNN README](https://github.com/amd/ZenDNN/blob/zendnnl/README.md). + +**Step 2: Build llama.cpp with custom ZenDNN path** + +```sh +# Using environment variable +export ZENDNN_ROOT=/path/to/ZenDNN/build/install +cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release +cmake --build build --config Release -j $(nproc) + +# OR specify path directly in CMake +cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/ZenDNN/build/install -DCMAKE_BUILD_TYPE=Release +cmake --build build --config Release -j $(nproc) +``` + +### II. Run the Server + +#### 1. Download Model + +Download LLaMA 3.1 8B Instruct BF16 model: + +```sh +# Download from Hugging Face +huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/ +``` + +#### 2. Start Server + +Run llama.cpp server with ZenDNN acceleration: + +```sh +# Set optimal configuration +export OMP_NUM_THREADS=64 # Adjust to your CPU core count +export ZENDNNL_MATMUL_ALGO=2 # Blocked AOCL BLIS for best performance + +# Start server +./build/bin/llama-server \ + -m models/Llama-3.1-8B-Instruct.BF16.gguf \ + --host 0.0.0.0 \ + --port 8080 \ + -t 64 +``` + +Access the server at `http://localhost:8080`. 
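+
+Once the server is up, a quick way to confirm it responds is to query its OpenAI-compatible endpoint (a minimal sketch; the prompt and token limit are illustrative):
+
+```sh
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
+        "max_tokens": 32
+      }'
+```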
+ +**Performance tips**: +- Set `OMP_NUM_THREADS` to match your physical core count +- Use `ZENDNNL_MATMUL_ALGO=2` for optimal performance +- For NUMA systems: `numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server ...` + +## Environment Variable + +### Build Time + +| Name | Value | Function | +|--------------------|---------------------------------------|---------------------------------------------| +| GGML_ZENDNN | ON/OFF | Enable ZenDNN backend support | +| ZENDNN_ROOT | Path to ZenDNN installation | Set ZenDNN installation directory | +| GGML_OPENMP | ON/OFF (recommended: ON) | Enable OpenMP for multi-threading | + +### Runtime + +| Name | Value | Function | +|-------------------------|--------------------------|-------------------------------------------------------------------| +| OMP_NUM_THREADS | Number (e.g., 64) | Set number of OpenMP threads (recommended: physical core count) | +| ZENDNNL_MATMUL_ALGO | 0-5 | Select MatMul backend algorithm (see Performance Optimization) | +| ZENDNNL_PROFILE_LOG_LEVEL | 0-4 | Profiling log level (0=disabled, 4=verbose) | +| ZENDNNL_ENABLE_PROFILER | 0 or 1 | Enable detailed profiling (1=enabled) | +| ZENDNNL_API_LOG_LEVEL | 0-4 | API log level (0=disabled, 4=verbose) | + +**Example**: + +```sh +export OMP_NUM_THREADS=64 +export ZENDNNL_MATMUL_ALGO=2 # Use Blocked AOCL BLIS for best performance +./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Test" -n 100 +``` + +## Performance Optimization + +### MatMul Algorithm Selection + +ZenDNN's LowOHA MatMul supports multiple backend algorithms. For **best performance**, use the **Blocked AOCL BLIS** algorithm: + +```sh +export ZENDNNL_MATMUL_ALGO=2 # Blocked AOCL BLIS (recommended) +``` + +**Available algorithms**: + +| Value | Algorithm | Description | +|:-----:|:-----------------------|:----------------------------------------------| +| 0 | Dynamic Dispatch | Automatic backend selection (default) | +| 1 | AOCL BLIS | AOCL BLIS backend | +| 2 | AOCL BLIS Blocked | **Blocked AOCL BLIS (recommended)** | +| 3 | OneDNN | OneDNN backend | +| 4 | OneDNN Blocked | Blocked OneDNN | +| 5 | LibXSMM | LibXSMM backend | + +### Profiling and Debugging + +For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/zendnnl/docs/logging.md). + +## Known Issues + +- **Limited operation support**: Currently only matrix multiplication (MUL_MAT) is accelerated via ZenDNN. Other operations fall back to the standard CPU backend. +- **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32. +- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance. + +## Q&A + +**Q: How do I verify that ZenDNN backend is being used?** + +A: Check the log output when running llama.cpp. You should see messages indicating the ZenDNN backend is initialized. You can also check the backend name in the output. + +**Q: What performance improvement can I expect?** + +A: Performance gains vary depending on the model size, batch size, and CPU architecture. On AMD EPYC processors, you can typically expect 1.1x-2x speedup compared to standard CPU inference for matrix multiplication operations. + +**Q: Can I use ZenDNN on non-AMD processors?** + +A: ZenDNN is optimized specifically for AMD processors. While it may work on other x86-64 CPUs, performance benefits are only guaranteed on AMD Zen-based architectures. 
+ +**Q: Does ZenDNN support quantized models?** + +A: Currently, ZenDNN primarily supports FP32 and BF16 data types. Quantized model support is not available at this time. + +**Q: Why is my inference not faster with ZenDNN?** + +A: Ensure: +1. You're using an AMD EPYC or Ryzen processor (Zen 2 or newer) +2. `OMP_NUM_THREADS` is set appropriately (physical core count) +3. `ZENDNNL_MATMUL_ALGO=2` is set for best performance (Blocked AOCL BLIS) +4. You're using a sufficiently large model (small models may not benefit as much) +5. Enable profiling to verify ZenDNN MatMul is being called + +### **GitHub Contribution**: +Please add the **[ZenDNN]** prefix/tag in issues/PRs titles to help the ZenDNN-team check/address them without delay. + +## TODO + +- Expand operation support beyond MUL_MAT (attention operations, activations, etc.) diff --git a/llama.cpp/docs/backend/snapdragon/CMakeUserPresets.json b/llama.cpp/docs/backend/snapdragon/CMakeUserPresets.json new file mode 100644 index 0000000..1faae2f --- /dev/null +++ b/llama.cpp/docs/backend/snapdragon/CMakeUserPresets.json @@ -0,0 +1,61 @@ +{ + "version": 5, + "configurePresets": [ + { + "name": "arm64-android-snapdragon", + "hidden": true, + "architecture": { "value": "arm64", "strategy": "external" }, + "toolset": { "value": "host=x86_64", "strategy": "external" }, + "cacheVariables": { + "ANDROID_ABI": "arm64-v8a", + "ANDROID_PLATFORM": "android-31", + "CMAKE_TOOLCHAIN_FILE": "$env{ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake", + "CMAKE_C_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE", + "CMAKE_CXX_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE", + "CMAKE_C_FLAGS_RELEASE": "-O3 -DNDEBUG", + "CMAKE_CXX_FLAGS_RELEASE": "-O3 -DNDEBUG", + "CMAKE_C_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g", + "CMAKE_CXX_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g", + "CMAKE_PREFIX_PATH": "$env{OPENCL_SDK_ROOT}", + "HEXAGON_SDK_ROOT": "$env{HEXAGON_SDK_ROOT}", + "HEXAGON_TOOLS_ROOT": "$env{HEXAGON_TOOLS_ROOT}", + "PREBUILT_LIB_DIR": "android_aarch64", + "GGML_OPENMP": "OFF", + "GGML_LLAMAFILE": "OFF", + "GGML_OPENCL": "ON", + "GGML_HEXAGON": "ON", + "GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE": "128", + "LLAMA_OPENSSL": "OFF" + } + }, + + { + "name": "arm64-windows-snapdragon", + "inherits": [ "base", "arm64-windows-llvm" ], + "cacheVariables": { + "CMAKE_C_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -flto -D_GNU_SOURCE", + "CMAKE_CXX_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -flto -D_GNU_SOURCE", + "CMAKE_C_FLAGS_RELEASE": "-O3 -DNDEBUG", + "CMAKE_CXX_FLAGS_RELEASE": "-O3 -DNDEBUG", + "CMAKE_C_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g", + "CMAKE_CXX_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g", + "CMAKE_PREFIX_PATH": "$env{OPENCL_SDK_ROOT}", + "HEXAGON_SDK_ROOT": "$env{HEXAGON_SDK_ROOT}", + "HEXAGON_TOOLS_ROOT": "$env{HEXAGON_TOOLS_ROOT}", + "PREBUILT_LIB_DIR": "windows_aarch64", + "GGML_OPENMP": "OFF", + "GGML_LLAMAFILE": "OFF", + "GGML_OPENCL": "ON", + "GGML_HEXAGON": "ON", + "GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE": "128", + "LLAMA_OPENSSL": "OFF" + } + }, + + { "name": "arm64-android-snapdragon-debug" , "inherits": [ "base", "arm64-android-snapdragon", "debug" ] }, + { "name": "arm64-android-snapdragon-release", "inherits": [ "base", "arm64-android-snapdragon", "release" ] }, + + { "name": "arm64-windows-snapdragon-debug" , "inherits": [ "base", "arm64-windows-snapdragon", "debug" ] }, + { "name": 
"arm64-windows-snapdragon-release", "inherits": [ "base", "arm64-windows-snapdragon", "release" ] } + ] +} diff --git a/llama.cpp/docs/backend/snapdragon/README.md b/llama.cpp/docs/backend/snapdragon/README.md new file mode 100644 index 0000000..8e1f37b --- /dev/null +++ b/llama.cpp/docs/backend/snapdragon/README.md @@ -0,0 +1,269 @@ +# Snapdragon-based devices + +## Setup + +### Android + +The easiest way to build llama.cpp for a Snapdragon-based Android device is using the toolchain Docker image (see github.com/snapdragon-toolchain). +This image includes Android NDK, OpenCL SDK, Hexagon SDK, CMake, etc. + +This method works on Linux, macOS, and Windows. macOS and Windows users should install Docker Desktop. + +``` +~/src/llama.cpp$ docker run -it -u $(id -u):$(id -g) --volume $(pwd):/workspace --platform linux/amd64 ghcr.io/snapdragon-toolchain/arm64-android:v0.3 +[d]/> cd /workspace +``` + +Note: The rest of the **Android** build process assumes that you're running inside the toolchain container. + +### Windows On Snapdragon + +Native Windows 11 arm64 builds has the following tools dependencies: +- MS Visual Studio 2026 (Community Edition or Pro) + - MSVC arm64 standard and runtime libraries + - UCRT and Driver Kit +- LLVM core libraries and Clang compiler (winget) +- CMake, Git, Python (winget) +- Hexagon SDK Community Edition 6.4 or later (see windows.md) +- OpenCL SDK 2.3 or later (see windows.md) + +Note: The rest of the **Windows** build process assumes that you're running natively in Powershell. +Adapt below build commands accordingly. + +## How to Build + +Let's build llama.cpp with CPU, OpenCL, and Hexagon backends via CMake presets: + +``` +[d]/workspace> cp docs/backend/hexagon/CMakeUserPresets.json . + +[d]/workspace> cmake --preset arm64-android-snapdragon-release -B build-snapdragon +Preset CMake variables: + ANDROID_ABI="arm64-v8a" + ... + CMAKE_TOOLCHAIN_FILE="/opt/android-ndk-r28b/build/cmake/android.toolchain.cmake" + GGML_HEXAGON="ON" + GGML_OPENCL="ON" + GGML_OPENMP="OFF" + HEXAGON_SDK_ROOT="/opt/hexagon/6.4.0.2" +... +-- Including OpenCL backend +-- Including Hexagon backend +... +-- Build files have been written to: /workspace/build-snapdragon + +[d]/workspace> cmake --build build-snapdragon +... +[144/356] Performing build step for 'htp-v73' +[1/16] Generating htp_iface_skel.c, htp_iface_stub.c, htp_iface.h +[2/16] Building C object CMakeFiles/ggml-htp-v73.dir/hvx-sigmoid.c.obj +[3/16] Building C object CMakeFiles/ggml-htp-v73.dir/htp-dma.c.obj +[4/16] Building C object CMakeFiles/ggml-htp-v73.dir/worker-pool.c.obj +... +-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v73.so +-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v75.so +... 
+```
+
+To generate an installable "package", simply use cmake --install:
+
+```
+[d]/workspace> cmake --install build-snapdragon --prefix pkg-snapdragon/llama.cpp
+-- Install configuration: "Release"
+-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-cpu.so
+-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-opencl.so
+-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-hexagon.so
+-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v73.so
+-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v75.so
+-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v79.so
+-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v81.so
+-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml.so
+...
+-- Installing: /workspace/pkg-snapdragon/llama.cpp/bin/llama-bench
+-- Installing: /workspace/pkg-snapdragon/llama.cpp/bin/llama-cli
+...
+```
+
+## How to Install
+
+### Android
+
+For this step, your device needs to be configured for on-device development.
+Please see https://developer.android.com/studio/debug/dev-options for details.
+
+Once ADB is enabled, use `adb push` to install `pkg-snapdragon` on the device.
+**Note that the toolchain Docker image doesn't have ADB and doesn't set up the ADB bridge. Please use native ADB on the host.**
+
+```
+~/src/llama.cpp$ adb push pkg-snapdragon/llama.cpp /data/local/tmp/
+pkg-snapdragon/llama.cpp/bin/: 67 files pushed, 0 skipped. 190.2 MB/s (919095042 bytes in 4.607s)
+pkg-snapdragon/llama.cpp/include/: 19 files pushed, 0 skipped. 20.5 MB/s (255173 bytes in 0.012s)
+pkg-snapdragon/llama.cpp/lib/: 16 files pushed, 0 skipped. 144.4 MB/s (43801382 bytes in 0.289s)
+102 files pushed, 0 skipped. 186.9 MB/s (963151597 bytes in 4.914s)
+```
+
+At this point, you should also install some models:
+
+```
+~/src/llama.cpp$ wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
+...
+2025-10-11 12:04:52 (10.7 MB/s) - ‘Llama-3.2-1B-Instruct-Q4_0.gguf’ saved [773025920/773025920]
+
+~/src/llama.cpp$ adb push Llama-3.2-1B-Instruct-Q4_0.gguf /data/local/tmp/gguf
+Llama-3.2-1B-Instruct-Q4_0.gguf: 1 file pushed, 0 skipped. 38.3 MB/s (773025920 bytes in 19.250s)
+```
+
+### Windows
+
+All artifacts are already installed in the `pkg-snapdragon` folder.
+To run, adapt the instructions below to use the PowerShell scripts in `scripts/snapdragon/windows`.
+
+## How to Run
+
+The easiest way to run the llama.cpp CLI tools is via the provided wrapper scripts, which properly set up all required environment variables.
+
+llama.cpp supports three backends on Snapdragon-based devices: CPU, Adreno GPU (GPUOpenCL), and Hexagon NPU (HTP0-4).
+You can select which backend to run the model on using the `D=` variable, which maps to the `--device` option.
+
+The Hexagon NPU behaves as a "GPU" device when it comes to `-ngl` and other offload-related options.
+
+Here are some examples of running various llama.cpp tools via ADB.
+
+Simple question for Llama-3.2-1B:
+
+```
+~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-completion.sh -p "what is the most popular cookie in the world?"
+...
+ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
+ggml-hex: Hexagon Arch version v79
+ggml-hex: allocating new session: HTP0
+ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb4000072c7955e50
+...
+load_tensors: offloading output layer to GPU +load_tensors: offloaded 17/17 layers to GPU +load_tensors: CPU model buffer size = 225.49 MiB +load_tensors: HTP0 model buffer size = 0.26 MiB +load_tensors: HTP0-REPACK model buffer size = 504.00 MiB +... +I hope this helps you understand the world's most popular cookies! [end of text] +... +llama_perf_sampler_print: sampling time = 30.08 ms / 487 runs ( 0.06 ms per token, 16191.77 tokens per second) +llama_perf_context_print: load time = 617.94 ms +llama_perf_context_print: prompt eval time = 80.76 ms / 11 tokens ( 7.34 ms per token, 136.21 tokens per second) +llama_perf_context_print: eval time = 9210.59 ms / 475 runs ( 19.39 ms per token, 51.57 tokens per second) +llama_perf_context_print: total time = 9454.92 ms / 486 tokens +llama_perf_context_print: graphs reused = 473 +llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted | +llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 | +llama_memory_breakdown_print: | - Host | 439 = 225 + 136 + 77 | +llama_memory_breakdown_print: | - HTP0-REPACK | 504 = 504 + 0 + 0 | +``` + +Summary request for OLMoE-1B-7B. This is a large model that requires two HTP sessions/devices + +``` +~/src/llama.cpp$ M=OLMoE-1B-7B-0125-Instruct-Q4_0.gguf NDEV=2 D=HTP0,HTP1 ./scripts/snapdragon/adb/run-completion.sh -f surfing.txt +... +ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1 +ggml-hex: Hexagon Arch version v81 +ggml-hex: allocating new session: HTP0 +ggml-hex: allocating new session: HTP1 +... +load_tensors: offloading output layer to GPU +load_tensors: offloaded 17/17 layers to GPU +load_tensors: CPU model buffer size = 143.86 MiB +load_tensors: HTP1 model buffer size = 0.23 MiB +load_tensors: HTP1-REPACK model buffer size = 1575.00 MiB +load_tensors: HTP0 model buffer size = 0.28 MiB +load_tensors: HTP0-REPACK model buffer size = 2025.00 MiB +... +llama_context: CPU output buffer size = 0.19 MiB +llama_kv_cache: HTP1 KV buffer size = 238.00 MiB +llama_kv_cache: HTP0 KV buffer size = 306.00 MiB +llama_kv_cache: size = 544.00 MiB ( 8192 cells, 16 layers, 1/1 seqs), K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB +llama_context: HTP0 compute buffer size = 15.00 MiB +llama_context: HTP1 compute buffer size = 15.00 MiB +llama_context: CPU compute buffer size = 24.56 MiB +... +llama_perf_context_print: prompt eval time = 1730.57 ms / 212 tokens ( 8.16 ms per token, 122.50 tokens per second) +llama_perf_context_print: eval time = 5624.75 ms / 257 runs ( 21.89 ms per token, 45.69 tokens per second) +llama_perf_context_print: total time = 7377.33 ms / 469 tokens +llama_perf_context_print: graphs reused = 255 +llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted | +llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 | +llama_memory_breakdown_print: | - HTP1 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 | +llama_memory_breakdown_print: | - Host | 742 = 144 + 544 + 54 | +llama_memory_breakdown_print: | - HTP1-REPACK | 1575 = 1575 + 0 + 0 | +llama_memory_breakdown_print: | - HTP0-REPACK | 2025 = 2025 + 0 + 0 | +``` + +Op test for MUL_MAT + +``` +~/src/llama.cpp$ HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT +... 
+Backend 2/3: HTP0 +Device description: Hexagon +Device memory: 2048 MB (2048 MB free) +MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK +MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK +MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK + +~/src/llama.cpp-hexagon$ M=Llama-3.2-1B-Instruct-Q4_0.gguf ./scripts/snapdragon/adb/run-bench.sh -p 128 -n 64 +... +ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1 +ggml-hex: Hexagon Arch version v79 +ggml-hex: allocating new session: HTP0 +ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007d4b231090 +| model | size | params | backend | ngl | threads | n_batch | mmap | test | t/s | +| ---------------| ---------: | -----: | ---------- | --: | ------: | ------: | ---: | ----: | ------------: | +| llama 1B Q4_0 | 729.75 MiB | 1.24 B | HTP | 99 | 4 | 128 | 0 | pp128 | 169.42 ± 1.75 | +| llama 1B Q4_0 | 729.75 MiB | 1.24 B | HTP | 99 | 4 | 128 | 0 | tg64 | 51.54 ± 1.13 | + +build: 6a8cf8914 (6733) +``` + +## Environment variables + +- `GGML_HEXAGON_NDEV=1` + Controls the number of devices/sessions to allocate. The default is 1. + Most quantized models under 4B fit into a single session; an 8B model needs two, and a 20B model needs four. + +- `GGML_HEXAGON_NHVX=0` + Controls the number of HVX hardware threads to use. The default is all (actual number varies depending on the hardware version). + +- `GGML_HEXAGON_HOSTBUF=1` + Controls whether the Hexagon backend allocates host buffers. By default, all buffers except for REPACK are host buffers. + This option is required for testing Ops that require REPACK buffers (MUL_MAT and MUL_MAT_ID). + +- `GGML_HEXAGON_EXPERIMENTAL=1` + Controls whether the Hexagon backend enables experimental features. + This option is required for enabling/testing experimental Ops (FLASH_ATTN_EXT). + +- `GGML_HEXAGON_VERBOSE=1` + Enables verbose logging of Ops from the backend. Example output: + + ``` + ggml-hex: HTP0 graph-compute n_nodes 2 + ggml-hex: HTP0 matmul : blk.27.ffn_up.weight x ffn_norm-27 -> ffn_up-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x1 + ggml-hex: HTP0 matmul : blk.27.ffn_gate.weight x ffn_norm-27 -> ffn_gate-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x3 + ggml-hex: HTP0 graph-compute n_nodes 1 + ggml-hex: HTP0 matmul : blk.27.ffn_down.weight x ffn_gate_par-27 -> ffn_out-27 : 8192:3072 x 8192:1 -> 3072:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x0 + ggml-hex: HTP0 get-tensor result_output : data 0x7592487000 offset 0 size 513024 + ``` + +- `GGML_HEXAGON_PROFILE=1` + Generates a host-side profile for the ggml-hexagon Ops. + +- `GGML_HEXAGON_OPMASK=0x0` + Allows enabling specific stages of the processing pipeline: + + - `0x1` Enable Op Queue (i.e., queuing Ops into NPU) + - `0x2` Enable Dynamic Quantizer (if needed for the Op) + - `0x4` Enable Op Compute (MUL_MAT, etc.) 
+
+  Examples:
+
+  `GGML_HEXAGON_OPMASK=0x1 llama-completion ...` - Ops are enqueued but NPU-side processing is stubbed out
+  `GGML_HEXAGON_OPMASK=0x3 llama-completion ...` - NPU performs dynamic quantization and skips the rest
+  `GGML_HEXAGON_OPMASK=0x7 llama-completion ...` - Full queuing and processing of Ops (default)
diff --git a/llama.cpp/docs/backend/snapdragon/developer.md b/llama.cpp/docs/backend/snapdragon/developer.md
new file mode 100644
index 0000000..fc4d160
--- /dev/null
+++ b/llama.cpp/docs/backend/snapdragon/developer.md
@@ -0,0 +1,109 @@
+# Hexagon backend developer details
+
+## Backend libraries
+
+The Hexagon backend consists of two parts:
+
+  - `libggml-hexagon`
+    This is the regular CPU-side GGML backend library, either shared or statically linked.
+
+  - `libggml-htp-vNN`
+    This is the NPU-side (HTP stands for Hexagon Tensor Processor) shared library that contains the Op dispatcher and kernels.
+    The correct library is selected automatically at runtime based on the hardware version.
+
+Here is an example of the build artifacts:
+
+```
+~/src/llama.cpp$ ls -l pkg-adb/llama.cpp/lib/libggml*
+pkg-adb/llama.cpp/lib/libggml-base.so
+pkg-adb/llama.cpp/lib/libggml-cpu.so
+pkg-adb/llama.cpp/lib/libggml-hexagon.so   <<< CPU library
+pkg-adb/llama.cpp/lib/libggml-htp-v73.so   <<< HTP op/kernels for Hexagon v73
+pkg-adb/llama.cpp/lib/libggml-htp-v75.so
+pkg-adb/llama.cpp/lib/libggml-htp-v79.so
+pkg-adb/llama.cpp/lib/libggml-htp-v81.so
+```
+
+## Memory buffers
+
+The Hexagon NPU backend takes advantage of the Snapdragon's unified memory model, where all buffers are fully accessible by the CPU and the NPU.
+The NPU does have a dedicated tightly-coupled memory called VTCM, but that memory is used only for intermediate data (e.g. dynamically
+quantized tensors) or temporary data (chunks of the weight tensors fetched via DMA).
+
+Please note that currently the Hexagon backend does not implement the SET/GET_ROWS Ops because there is no advantage in offloading those
+to the NPU at this point.
+
+The backend does allocate non-host buffers for the tensors with data types that require repacking: Q4_0, Q8_0, MXFP4.
+From the MMU perspective these are still regular buffers (normally accessible by the CPU); they are marked as non-host simply to force
+the repacking.
+
+## Large model handling
+
+A Hexagon NPU session (aka Process Domain (PD) in the Hexagon docs) is limited to a memory mapping of around 3.5GB.
+In llama.cpp/GGML each Hexagon session is mapped to a single GGML backend device (HTP0, HTP1, etc).
+
+In order to map models larger than 3.5GB, we need to allocate multiple devices and split the model.
+For this we're taking advantage of the llama.cpp/GGML multi-GPU layer-splitting support.
+Each Hexagon device behaves like a GPU from the offload and model-splitting perspective.
+
+Here is an example of running the GPT-OSS-20B model on a newer Snapdragon device with 16GB of DDR.
+
+```
+M=gpt-oss-20b-Q4_0.gguf NDEV=4 D=HTP0,HTP1,HTP2,HTP3 P=surfing.txt scripts/snapdragon/adb/run-completion.sh -f surfing.txt -n 32
+...
+LD_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib
+ADSP_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib
+GGML_HEXAGON_NDEV=4 ./bin/llama-cli --no-mmap -m /data/local/tmp/llama.cpp/../gguf/gpt-oss-20b-Q4_0.gguf
+   -t 4 --ctx-size 8192 --batch-size 128 -ctk q8_0 -ctv q8_0 -fa on -ngl 99 --device HTP0,HTP1,HTP2,HTP3 -no-cnv -f surfing.txt
+...
+llama_model_loader: - type f32: 289 tensors +llama_model_loader: - type q4_0: 96 tensors +llama_model_loader: - type q8_0: 2 tensors +llama_model_loader: - type mxfp4: 72 tensors +... +load_tensors: offloaded 25/25 layers to GPU +load_tensors: CPU model buffer size = 1182.09 MiB +load_tensors: HTP1 model buffer size = 6.64 MiB +load_tensors: HTP1-REPACK model buffer size = 2505.94 MiB +load_tensors: HTP3 model buffer size = 5.55 MiB +load_tensors: HTP3-REPACK model buffer size = 2088.28 MiB +load_tensors: HTP0 model buffer size = 7.75 MiB +load_tensors: HTP0-REPACK model buffer size = 2923.59 MiB +load_tensors: HTP2 model buffer size = 6.64 MiB +load_tensors: HTP2-REPACK model buffer size = 2505.94 MiB +... +llama_context: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized +llama_context: CPU output buffer size = 0.77 MiB +llama_kv_cache_iswa: creating non-SWA KV cache, size = 8192 cells +llama_kv_cache: HTP1 KV buffer size = 25.50 MiB +llama_kv_cache: HTP3 KV buffer size = 25.50 MiB +llama_kv_cache: HTP0 KV buffer size = 25.50 MiB +llama_kv_cache: HTP2 KV buffer size = 25.50 MiB +llama_kv_cache: size = 102.00 MiB ( 8192 cells, 12 layers, 1/1 seqs), K (q8_0): 51.00 MiB, V (q8_0): 51.00 MiB +llama_kv_cache_iswa: creating SWA KV cache, size = 256 cells +llama_kv_cache: HTP1 KV buffer size = 0.80 MiB +llama_kv_cache: HTP3 KV buffer size = 0.53 MiB +llama_kv_cache: HTP0 KV buffer size = 1.06 MiB +llama_kv_cache: HTP2 KV buffer size = 0.80 MiB +llama_kv_cache: size = 3.19 MiB ( 256 cells, 12 layers, 1/1 seqs), K (q8_0): 1.59 MiB, V (q8_0): 1.59 MiB +llama_context: HTP0 compute buffer size = 16.06 MiB +llama_context: HTP1 compute buffer size = 16.06 MiB +llama_context: HTP2 compute buffer size = 16.06 MiB +llama_context: HTP3 compute buffer size = 16.06 MiB +llama_context: CPU compute buffer size = 98.19 MiB +... +llama_perf_context_print: prompt eval time = 3843.67 ms / 197 tokens ( 19.51 ms per token, 51.25 tokens per second) +llama_perf_context_print: eval time = 1686.13 ms / 31 runs ( 54.39 ms per token, 18.39 tokens per second) +llama_perf_context_print: total time = 6266.30 ms / 228 tokens +llama_perf_context_print: graphs reused = 30 +llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted | +llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 | +llama_memory_breakdown_print: | - HTP1 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 | +llama_memory_breakdown_print: | - HTP2 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 | +llama_memory_breakdown_print: | - HTP3 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 | +llama_memory_breakdown_print: | - Host | 1476 = 1208 + 105 + 162 | +llama_memory_breakdown_print: | - HTP1-REPACK | 2505 = 2505 + 0 + 0 | +llama_memory_breakdown_print: | - HTP3-REPACK | 2088 = 2088 + 0 + 0 | +llama_memory_breakdown_print: | - HTP0-REPACK | 2923 = 2923 + 0 + 0 | +llama_memory_breakdown_print: | - HTP2-REPACK | 2505 = 2505 + 0 + 0 | +``` diff --git a/llama.cpp/docs/backend/snapdragon/windows.md b/llama.cpp/docs/backend/snapdragon/windows.md new file mode 100644 index 0000000..e9346cc --- /dev/null +++ b/llama.cpp/docs/backend/snapdragon/windows.md @@ -0,0 +1,161 @@ +## Overview + +The document covers procedures for installing the latest GPU and NPU drivers, and OpenCL and Hexagon SDKs. 
+
+
+In order to use the Hexagon NPU on Snapdragon Windows devices, the underlying HTP Ops libraries (e.g. libggml-htp-v73.so)
+must be included in a .cat file digitally signed with a trusted certificate.
+
+This document covers details on how to generate personal certificate files (.pfx) and how to configure the system
+to allow for test signatures (aka test-signing).
+
+## Install the latest Adreno OpenCL SDK
+
+Either use the trimmed-down version (optimized for CI) from
+
+    https://github.com/snapdragon-toolchain/opencl-sdk/releases/download/v2.3.2/adreno-opencl-sdk-v2.3.2-arm64-wos.tar.xz
+
+or download the complete official version from
+
+    https://softwarecenter.qualcomm.com/catalog/item/Adreno_OpenCL_SDK?version=2.3.2
+
+Unzip/untar the archive into
+```
+c:\Qualcomm\OpenCL_SDK\2.3.2
+```
+
+## Install the latest Hexagon SDK Community Edition
+
+Either use the trimmed-down version (optimized for CI) from
+
+    https://github.com/snapdragon-toolchain/hexagon-sdk/releases/download/v6.4.0.2/hexagon-sdk-v6.4.0.2-arm64-wos.tar.xz
+
+or download the complete official version from
+
+    https://softwarecenter.qualcomm.com/catalog/item/Hexagon_SDK?version=6.4.0.2
+
+Unzip/untar the archive into
+```
+c:\Qualcomm\Hexagon_SDK\6.4.0.2
+```
+
+## Install the latest Adreno GPU driver
+
+Download the driver from
+
+    https://softwarecenter.qualcomm.com/catalog/item/Windows_Graphics_Driver
+
+After the automated installation and reboot, please make sure that the GPU device shows up in the `Device Manager` (under `Display Adapters`).
+
+## Install the latest Qualcomm NPU driver
+
+Download the driver from
+
+    https://softwarecenter.qualcomm.com/catalog/item/Qualcomm_HND
+
+After the automated installation and reboot, please make sure that the Hexagon NPU device shows up in the `Device Manager` (under `Neural Processors`).
+
+If the device is not available, you can try installing all components (`qcnspmcdm8380`, `qcnspmcdm8380_ext`) manually.
+The components are extracted into
+```
+c:\QCDrivers\qcnspmcdm...
+```
+
+## Enable NPU driver test signatures
+
+Please note that the following steps are required only for the Hexagon NPU.
+The Adreno GPU backend does not require test signatures.
+
+### Enable testsigning
+
+Use `bcdedit` to enable test-signing:
+```
+> bcdedit /set TESTSIGNING ON
+```
+(Secure Boot may need to be disabled for this to work.)
+
+Make sure test-signing is enabled after the reboot:
+```
+> bcdedit /enum
+...
+testsigning Yes
+...
+```
+For additional details, see the Microsoft guide at
+
+    https://learn.microsoft.com/en-us/windows-hardware/drivers/install/the-testsigning-boot-configuration-option
+
+### Create personal certificate
+
+The tools required for this procedure are available as part of the Windows SDK and the Windows Driver Kit, which should be
+installed as part of MS Visual Studio.
+They are typically located at
+```
+c:\Program Files (x86)\Windows Kits\10\bin\10.0.26100.0
+```
+(replace 10.0.26100.0 with the correct version).
+
+To create a personal self-signed certificate, run the following commands (either from cmd or PowerShell):
+```
+> cd c:\Users\MyUser
+> mkdir Certs
+> cd Certs
+> makecert -r -pe -ss PrivateCertStore -n CN=GGML.HTP.v1 -eku 1.3.6.1.5.5.7.3.3 -sv ggml-htp-v1.pvk ggml-htp-v1.cer
+> pvk2pfx.exe -pvk ggml-htp-v1.pvk -spc ggml-htp-v1.cer -pfx ggml-htp-v1.pfx
+```
+(replace `MyUser` with your username).
+
+Add this certificate to the `Trusted Root Certification Authorities` and `Trusted Publishers` stores.
+This can be done either from an elevated PowerShell prompt (see the sketch below) or with the `certlm` Certificate Manager tool.
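+
+A minimal PowerShell sketch for the scripted import (assumptions, not part of the official procedure: the PFX was created without a password and lives in the `Certs` directory from the previous step):
+```
+# Import the self-signed certificate into the local machine's
+# Trusted Root Certification Authorities and Trusted Publishers stores.
+Import-PfxCertificate -FilePath C:\Users\MyUser\Certs\ggml-htp-v1.pfx -CertStoreLocation Cert:\LocalMachine\Root
+Import-PfxCertificate -FilePath C:\Users\MyUser\Certs\ggml-htp-v1.pfx -CertStoreLocation Cert:\LocalMachine\TrustedPublisher
+```
+(If the PFX is password-protected, pass it via `-Password (Read-Host -AsSecureString)`.)
+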
+When using `certlm`, right click on the certificate store, select `All Tasks -> Import`, and follow the prompts to import the certificate from the
+PFX file you created above.
+
+For additional details, see the Microsoft guide at
+
+    https://learn.microsoft.com/en-us/windows-hardware/drivers/install/introduction-to-test-signing
+
+Make sure to save the PFX file; you will need it for the build procedures.
+Please note that the same certificate can be used for signing any number of builds.
+
+## Build Hexagon backend with signed HTP ops libraries
+
+The overall Hexagon backend build procedure for Windows on Snapdragon is the same as for other platforms.
+However, additional settings are required for generating and signing the HTP Ops libraries:
+```
+> $env:OPENCL_SDK_ROOT="C:\Qualcomm\OpenCL_SDK\2.3.2"
+> $env:HEXAGON_SDK_ROOT="C:\Qualcomm\Hexagon_SDK\6.4.0.2"
+> $env:HEXAGON_TOOLS_ROOT="C:\Qualcomm\Hexagon_SDK\6.4.0.2\tools\HEXAGON_Tools\19.0.04"
+> $env:HEXAGON_HTP_CERT="c:\Users\MyUser\Certs\ggml-htp-v1.pfx"
+> $env:WINDOWS_SDK_BIN="C:\Program Files (x86)\Windows Kits\10\bin\10.0.26100.0\arm64"
+
+> cmake --preset arm64-windows-snapdragon-release -B build-wos
+...
+> cmake --install build-wos --prefix pkg-snapdragon
+```
+
+Once the build is complete, the HTP ops libraries will be installed like this:
+```
+> dir pkg-snapdragon/lib
+...
+-a---- 1/22/2026 6:01 PM 187656 libggml-htp-v73.so
+-a---- 1/22/2026 6:01 PM 191752 libggml-htp-v75.so
+-a---- 1/22/2026 6:01 PM 187656 libggml-htp-v79.so
+-a---- 1/22/2026 6:01 PM 187656 libggml-htp-v81.so
+-a---- 1/22/2026 6:01 PM 4139 libggml-htp.cat
+```
+
+The .cat file, the signature, and proper certificate installation can be verified with:
+
+```
+> signtool.exe verify /v /pa .\pkg-snapdragon\lib\libggml-htp.cat
+Verifying: .\pkg-snapdragon\lib\libggml-htp.cat
+
+Signature Index: 0 (Primary Signature)
+Hash of file (sha256): 9820C664DA59D5EAE31DBB664127FCDAEF59CDC31502496BC567544EC2F401CF
+
+Signing Certificate Chain:
+    Issued to: GGML.HTP.v1
+...
+Successfully verified: .\pkg-snapdragon\lib\libggml-htp.cat
+...
+```
diff --git a/llama.cpp/docs/backend/zDNN.md b/llama.cpp/docs/backend/zDNN.md
new file mode 100644
index 0000000..94108f5
--- /dev/null
+++ b/llama.cpp/docs/backend/zDNN.md
@@ -0,0 +1,66 @@
+# llama.cpp for IBM zDNN Accelerator
+
+> [!WARNING]
+> **Note:** zDNN is **not** the same as ZenDNN.
+> - **zDNN** (this page): IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes
+> - **ZenDNN**: AMD's deep learning library for AMD EPYC CPUs ([see ZenDNN documentation](ZenDNN.md))
+
+## Background
+
+IBM zDNN (Z Deep Neural Network) is a hardware acceleration library designed specifically to leverage the IBM NNPA (Neural Network Processor Assist) accelerator located within IBM Telum I and II processors. It provides significant performance improvements for neural network inference operations.
+
+### Llama.cpp + IBM zDNN
+
+The llama.cpp zDNN backend is designed to enable llama.cpp on IBM z17 and later systems via the IBM zDNN hardware acceleration library.
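+
+Whether a given host actually exposes the NNPA accelerator can be checked from Linux before building; a minimal sketch, assuming the facility is reported as `nnpa` on the `features` line of `/proc/cpuinfo` (naming may vary with kernel version):
+
+```sh
+# No output means the NNPA facility is not visible to this kernel/LPAR.
+grep -w nnpa /proc/cpuinfo
+```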
+ +## Software & Hardware Support + +| Hardware Level | Status | Verified | +| -------------------- | ------------- | -------------------------- | +| IBM z17 / LinuxONE 5 | Supported | RHEL 9.6, IBM z17, 40 IFLs | +| IBM z16 / LinuxONE 4 | Not Supported | | + +## Data Types Supported + +| Data Type | Status | +| --------- | --------- | +| F32 | Supported | +| F16 | Supported | +| BF16 | Supported | + +## CMake Options + +The IBM zDNN backend has the following CMake options that control the behaviour of the backend. + +| CMake Option | Default Value | Description | +| ------------ | ------------- | ----------------------------------- | +| `GGML_ZDNN` | `OFF` | Compile llama.cpp with zDNN support | +| `ZDNN_ROOT` | `""` | Override zDNN library lookup | + +## 1. Install zDNN Library + +Note: Using the zDNN library provided via `apt` or `yum` may not work correctly as reported in [#15772](https://github.com/ggml-org/llama.cpp/issues/15772). It is preferred that you compile from source. + +```sh +git clone --recurse-submodules https://github.com/IBM/zDNN +cd zDNN + +autoreconf . +./configure --prefix=/opt/zdnn-libs + +make build +sudo make install +``` + +## 2. Build llama.cpp + +```sh +git clone https://github.com/ggml-org/llama.cpp +cd llama.cpp + +cmake -S . -G Ninja -B build \ + -DCMAKE_BUILD_TYPE=Release \ + -DGGML_ZDNN=ON \ + -DZDNN_ROOT=/opt/zdnn-libs +cmake --build build --config Release -j$(nproc) +``` |
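+
+After the build finishes, the resulting binaries are used exactly like a regular CPU build; the zDNN backend should be picked up automatically for supported operations. A minimal smoke test might look like the sketch below (the model path and prompt are placeholders):
+
+```sh
+# Short generation run; replace the model path with a local GGUF file.
+./build/bin/llama-cli -m /path/to/model.gguf -p "Hello from IBM Z" -n 32
+```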