# llama.cpp for AMD ZenDNN

> [!WARNING]
> **Note:** ZenDNN is **not** the same as zDNN.
> - **ZenDNN** (this page): AMD's deep learning library for AMD EPYC CPUs
> - **zDNN**: IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE mainframes ([see zDNN documentation](zDNN.md))

- [Background](#background)
- [OS](#os)
- [Hardware](#hardware)
- [Supported Operations](#supported-operations)
- [Supported Data Types](#supported-data-types)
- [Linux](#linux)
- [Environment Variables](#environment-variables)
- [Performance Optimization](#performance-optimization)
- [Known Issues](#known-issues)
- [Q&A](#qa)
- [TODO](#todo)

## Background

**ZenDNN** (Zen Deep Neural Network Library) is AMD's high-performance deep learning inference library optimized for AMD EPYC™ CPUs. It provides optimized implementations of key deep learning primitives and operations, delivering significant performance improvements for neural network workloads on AMD Zen-based processor architectures.

**Llama.cpp + ZenDNN**

The llama.cpp ZenDNN backend leverages AMD's optimized matrix multiplication primitives to accelerate inference on AMD CPUs. It uses ZenDNN's **LowOHA (Low Overhead Hardware Accelerated)** MatMul operator for efficient GEMM operations with minimal execution overhead, built-in weight caching, and direct access to backend libraries (AOCL BLIS, LibXSMM, OneDNN).

For more information about ZenDNN, visit: https://www.amd.com/en/developer/zendnn.html

## OS

| OS    | Status  | Verified                   |
|:-----:|:-------:|:--------------------------:|
| Linux | Support | Ubuntu 20.04, 22.04, 24.04 |

For the latest list of supported operating systems, see the [ZenDNN Supported OS](https://github.com/amd/ZenDNN/blob/zendnnl/README.md#15-supported-os).

## Hardware

### AMD CPUs

**Recommended Processors**

ZenDNN is optimized for AMD EPYC™ and AMD Ryzen™ processors based on the "Zen" microarchitecture and newer.

| CPU Family                     | Status  | Notes                              |
|:------------------------------:|:-------:|:----------------------------------:|
| AMD EPYC™ 9005 Series (Turin)  | Support | 5th Gen - Zen 5 architecture       |
| AMD EPYC™ 9004 Series (Genoa)  | Support | 4th Gen - Zen 4 architecture       |
| AMD EPYC™ 7003 Series (Milan)  | Support | 3rd Gen - Zen 3 architecture       |
| AMD Ryzen™ AI MAX (Strix Halo) | Support | High-performance mobile processors |

*Notes:*

- Best performance is achieved on AMD EPYC™ processors with high core counts (e.g., the EPYC 9005 series).
- ZenDNN leverages advanced AMD CPU features, including the AVX2 and AVX-512 instruction sets.
- For optimal performance, ensure your system has sufficient memory bandwidth.
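
You can check which of these instruction sets your CPU exposes by inspecting the kernel-reported CPU flags; on Zen 4/Zen 5 parts, the `avx512_bf16` flag indicates native BF16 support:

```sh
# List the SIMD features ZenDNN can exploit; avx512_bf16 is reported
# on Zen 4 / Zen 5 parts and indicates native BF16 support.
lscpu | grep -oE 'avx2|avx512f|avx512_bf16' | sort -u
```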

## Supported Operations

The ZenDNN backend currently accelerates **matrix multiplication (MUL_MAT)** operations only. Other operations are handled by the standard CPU backend.

| Operation | Status | Notes                                |
|:----------|:------:|:------------------------------------:|
| MUL_MAT   | ✓      | Accelerated via ZenDNN LowOHA MatMul |

*Note:* Since only MUL_MAT is accelerated, models will benefit most from ZenDNN when matrix multiplications dominate the computational workload (which is typical for transformer-based LLMs).
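
One way to sanity-check the accelerated operator is llama.cpp's `test-backend-ops` tool; the invocation below is a sketch (check `--help` in your build for the exact flags and binary location):

```sh
# Exercise only the MUL_MAT operator across the available backends.
./build/bin/test-backend-ops test -o MUL_MAT
```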

## Supported Data Types

| DataType | Status  | Notes                         |
|:--------:|:-------:|:-----------------------------:|
| FP32     | Support | Full-precision floating point |
| BF16     | Support | BFloat16 half precision       |

*Notes:*

- **BF16** delivers the best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
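
Because only FP32 and BF16 are accelerated, a BF16 GGUF is usually the best fit. If you only have the original Hugging Face weights, llama.cpp's converter can produce one (the paths below are illustrative):

```sh
# Convert Hugging Face weights to a BF16 GGUF (paths are illustrative).
python convert_hf_to_gguf.py /path/to/hf-model \
    --outtype bf16 \
    --outfile models/my-model-BF16.gguf
```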

## Linux

### I. Setup Environment

You have two options to set up ZenDNN:

#### Option 1: Automatic Download and Build (Recommended)

CMake can download and build ZenDNN for you automatically:

```sh
# Build llama.cpp - ZenDNN will be automatically downloaded and built
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```

No manual ZenDNN installation is required.
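
To confirm the build works end to end, you can run a short generation; the model path below is illustrative, so substitute any FP32/BF16 GGUF you have locally:

```sh
# Quick smoke test: generate a few tokens with the freshly built binary.
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Hello" -n 16
```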

#### Option 2: Use a Custom ZenDNN Installation

If you want to build ZenDNN yourself or use a specific version:

**Step 1: Build ZenDNN from source**

```sh
# Clone ZenDNN repository
git clone https://github.com/amd/ZenDNN.git
cd ZenDNN
git checkout zendnnl

# Build and install (requires CMake >= 3.25)
mkdir build && cd build
cmake ..
cmake --build . --target all
```

Default installation path: `ZenDNN/build/install`

**For detailed build instructions**, refer to the [ZenDNN README](https://github.com/amd/ZenDNN/blob/zendnnl/README.md).

**Step 2: Build llama.cpp with the custom ZenDNN path**

```sh
# Using environment variable
export ZENDNN_ROOT=/path/to/ZenDNN/build/install
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)

# OR specify path directly in CMake
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/ZenDNN/build/install -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```
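
If ZenDNN was built as a shared library, you can check that your custom copy is the one that actually got linked; treat this as a sanity check only, since library names vary by version and ZenDNN may also be linked statically:

```sh
# Check which ZenDNN shared library llama.cpp resolved at link time.
ldd build/bin/llama-server | grep -i zendnn
```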

### II. Run the Server

#### 1. Download Model

Download the Llama 3.1 8B Instruct BF16 model:

```sh
# Download from Hugging Face
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
```

#### 2. Start Server

Run the llama.cpp server with ZenDNN acceleration:

```sh
# Set optimal configuration
export OMP_NUM_THREADS=64       # Adjust to your CPU core count
export ZENDNNL_MATMUL_ALGO=2    # Blocked AOCL BLIS for best performance

# Start server
./build/bin/llama-server \
    -m models/Llama-3.1-8B-Instruct.BF16.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -t 64
```

Access the server at `http://localhost:8080`.
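
Once the server is up, you can exercise its OpenAI-compatible API with a quick request:

```sh
# Send a chat request to the OpenAI-compatible endpoint.
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
    }'
```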

**Performance tips**:
- Set `OMP_NUM_THREADS` to match your physical core count
- Use `ZENDNNL_MATMUL_ALGO=2` for optimal performance
- On NUMA systems, pin the server to one node with `numactl` (see the combined example below)
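
Putting these tips together, here is a launch sketch pinned to NUMA node 0, assuming 64 physical cores on that node:

```sh
# Launch pinned to NUMA node 0 (assumes 64 physical cores on that node).
export OMP_NUM_THREADS=64
export ZENDNNL_MATMUL_ALGO=2
numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server \
    -m models/Llama-3.1-8B-Instruct.BF16.gguf \
    --host 0.0.0.0 --port 8080 -t 64
```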

## Environment Variables

### Build Time

| Name        | Value                       | Function                          |
|-------------|-----------------------------|-----------------------------------|
| GGML_ZENDNN | ON/OFF                      | Enable ZenDNN backend support     |
| ZENDNN_ROOT | Path to ZenDNN installation | Set ZenDNN installation directory |
| GGML_OPENMP | ON/OFF (recommended: ON)    | Enable OpenMP for multi-threading |

### Runtime

| Name                      | Value             | Function                                                        |
|---------------------------|-------------------|-----------------------------------------------------------------|
| OMP_NUM_THREADS           | Number (e.g., 64) | Set number of OpenMP threads (recommended: physical core count) |
| ZENDNNL_MATMUL_ALGO       | 0-5               | Select MatMul backend algorithm (see Performance Optimization)  |
| ZENDNNL_PROFILE_LOG_LEVEL | 0-4               | Profiling log level (0 = disabled, 4 = verbose)                 |
| ZENDNNL_ENABLE_PROFILER   | 0 or 1            | Enable detailed profiling (1 = enabled)                         |
| ZENDNNL_API_LOG_LEVEL     | 0-4               | API log level (0 = disabled, 4 = verbose)                       |

**Example**:

```sh
export OMP_NUM_THREADS=64
export ZENDNNL_MATMUL_ALGO=2    # Use Blocked AOCL BLIS for best performance
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Test" -n 100
```

## Performance Optimization

### MatMul Algorithm Selection

ZenDNN's LowOHA MatMul supports multiple backend algorithms. For **best performance**, use the **Blocked AOCL BLIS** algorithm:

```sh
export ZENDNNL_MATMUL_ALGO=2    # Blocked AOCL BLIS (recommended)
```

**Available algorithms**:

| Value | Algorithm         | Description                           |
|:-----:|:------------------|:--------------------------------------|
| 0     | Dynamic Dispatch  | Automatic backend selection (default) |
| 1     | AOCL BLIS         | AOCL BLIS backend                     |
| 2     | AOCL BLIS Blocked | **Blocked AOCL BLIS (recommended)**   |
| 3     | OneDNN            | OneDNN backend                        |
| 4     | OneDNN Blocked    | Blocked OneDNN                        |
| 5     | LibXSMM           | LibXSMM backend                       |

### Profiling and Debugging

For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/zendnnl/docs/logging.md).
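
As a starting point, the runtime variables from the table above can be combined to capture a profile of a short run (the model path is illustrative):

```sh
# Capture a ZenDNN profile of a short generation run.
export ZENDNNL_ENABLE_PROFILER=1
export ZENDNNL_PROFILE_LOG_LEVEL=3
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Profile test" -n 32
```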

## Known Issues

- **Limited operation support**: Currently only matrix multiplication (MUL_MAT) is accelerated via ZenDNN. Other operations fall back to the standard CPU backend.
- **BF16 support**: BF16 operations require the AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations fall back to FP32.
- **NUMA awareness**: On multi-socket systems, manual NUMA binding may be required for optimal performance.

## Q&A

**Q: How do I verify that the ZenDNN backend is being used?**

A: Check the log output when running llama.cpp. You should see messages indicating that the ZenDNN backend is initialized. You can also check the backend name in the output.
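
For example, you can filter the startup log for ZenDNN mentions; the exact log strings vary between versions, so treat an empty result as a prompt to inspect the full log:

```sh
# Filter the startup log for ZenDNN mentions (exact strings vary by version).
./build/bin/llama-cli -m models/Llama-3.1-8B-Instruct.BF16.gguf -p "Hi" -n 8 2>&1 \
    | grep -i zendnn
```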

**Q: What performance improvement can I expect?**

A: Performance gains vary with model size, batch size, and CPU architecture. On AMD EPYC processors, you can typically expect a 1.1x-2x speedup over standard CPU inference for matrix multiplication operations.

**Q: Can I use ZenDNN on non-AMD processors?**

A: ZenDNN is optimized specifically for AMD processors. While it may work on other x86-64 CPUs, performance benefits are only guaranteed on AMD Zen-based architectures.

**Q: Does ZenDNN support quantized models?**

A: Currently, ZenDNN supports only the FP32 and BF16 data types. Quantized model support is not available at this time.

**Q: Why is my inference not faster with ZenDNN?**

A: Ensure that:
1. You are using an AMD EPYC or Ryzen processor (Zen 2 or newer)
2. `OMP_NUM_THREADS` is set appropriately (physical core count)
3. `ZENDNNL_MATMUL_ALGO=2` is set for best performance (Blocked AOCL BLIS)
4. You are using a sufficiently large model (small models may not benefit as much)
5. Profiling is enabled, so you can verify that the ZenDNN MatMul is being called

### GitHub Contribution

Please add the **[ZenDNN]** prefix/tag to issue and PR titles so the ZenDNN team can triage and address them without delay.

## TODO

- Expand operation support beyond MUL_MAT (attention operations, activations, etc.)