> [!IMPORTANT]
> This build documentation is specific only to IBM Z & LinuxONE mainframes (s390x). You can find the build documentation for other architectures at [build.md](build.md).

# Build llama.cpp locally (for s390x)

The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](../include/llama.h).

The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server.

**To get the code:**

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

## CPU Build with BLAS

Building llama.cpp with BLAS support is highly recommended, as it has been shown to provide performance improvements. Make sure to have OpenBLAS installed in your environment.

```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS

cmake --build build --config Release -j $(nproc)
```
29
30**Notes**:
31
32- For faster repeated compilation, install [ccache](https://ccache.dev/)
33- By default, VXE/VXE2 is enabled. To disable it (not recommended):
34
35 ```bash
36 cmake -S . -B build \
37 -DCMAKE_BUILD_TYPE=Release \
38 -DGGML_BLAS=ON \
39 -DGGML_BLAS_VENDOR=OpenBLAS \
40 -DGGML_VXE=OFF
41
42 cmake --build build --config Release -j $(nproc)
43 ```
44
45- For debug builds:
46
47 ```bash
48 cmake -S . -B build \
49 -DCMAKE_BUILD_TYPE=Debug \
50 -DGGML_BLAS=ON \
51 -DGGML_BLAS_VENDOR=OpenBLAS
52 cmake --build build --config Debug -j $(nproc)
53 ```
54
55- For static builds, add `-DBUILD_SHARED_LIBS=OFF`:
56
57 ```bash
58 cmake -S . -B build \
59 -DCMAKE_BUILD_TYPE=Release \
60 -DGGML_BLAS=ON \
61 -DGGML_BLAS_VENDOR=OpenBLAS \
62 -DBUILD_SHARED_LIBS=OFF
63
64 cmake --build build --config Release -j $(nproc)
65 ```
66
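If you have ccache installed, you can point CMake at it explicitly with the standard `CMAKE_C_COMPILER_LAUNCHER`/`CMAKE_CXX_COMPILER_LAUNCHER` variables. A sketch of the configure step, to be combined with any other flags you need:

```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache

cmake --build build --config Release -j $(nproc)
```

With the launchers set, subsequent rebuilds after `rm -rf build` reuse cached object files instead of recompiling from scratch.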
## IBM zDNN Accelerator

This provides acceleration using the IBM zAIU co-processor located in the Telum I and Telum II processors. Make sure to have the [IBM zDNN library](https://github.com/IBM/zDNN) installed.

### Compile from source from IBM

You may find the official build instructions here: [Building and Installing zDNN](https://github.com/IBM/zDNN?tab=readme-ov-file#building-and-installing-zdnn)

### Compilation

```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_ZDNN=ON

cmake --build build --config Release -j $(nproc)
```

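Before configuring with `-DGGML_ZDNN=ON`, you can quickly check whether the zDNN shared library is discoverable on your system. An illustrative check using only Python's standard library; it prints `None` when the library is not installed:

```python
# Look up the zDNN shared library the same way the dynamic linker would.
import ctypes.util

print(ctypes.util.find_library("zdnn"))  # a soname such as "libzdnn.so.0" if installed, None otherwise
```

If this prints `None` even after installing zDNN, make sure the install location is on the linker search path (e.g. run `ldconfig` after installing to `/usr/local/lib`).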
## Getting GGUF Models

All models need to be converted to Big-Endian. You can achieve this in three ways:

1. **Use pre-converted models verified for use on IBM Z & LinuxONE (easiest)**

   You can find popular models pre-converted and verified at [s390x Verified Models](https://huggingface.co/collections/taronaeo/s390x-verified-models-672765393af438d0ccb72a08) or [s390x Runnable Models](https://huggingface.co/collections/taronaeo/s390x-runnable-models-686e951824198df12416017e).

   These models have already been converted from `safetensors` to `GGUF` Big-Endian and their respective tokenizers verified to run correctly on IBM z15 and later systems.

2. **Convert a safetensors model to GGUF Big-Endian directly (recommended)**

   The model you are trying to convert must be in `safetensors` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case.

   Ensure that you have installed the required packages in advance:

   ```bash
   pip3 install -r requirements.txt
   ```

   Convert the `safetensors` model to `GGUF`:

   ```bash
   python3 convert_hf_to_gguf.py \
       --outfile model-name-be.f16.gguf \
       --outtype f16 \
       --bigendian \
       model-directory/
   ```

   For example:

   ```bash
   python3 convert_hf_to_gguf.py \
       --outfile granite-3.3-2b-instruct-be.f16.gguf \
       --outtype f16 \
       --bigendian \
       granite-3.3-2b-instruct/
   ```

3. **Convert an existing GGUF Little-Endian model to Big-Endian**

   The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B GGUF](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.

   ```bash
   python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
   ```

   For example:

   ```bash
   python3 gguf-py/gguf/scripts/gguf_convert_endian.py granite-3.3-2b-instruct-le.f16.gguf BIG
   mv granite-3.3-2b-instruct-le.f16.gguf granite-3.3-2b-instruct-be.f16.gguf
   ```

   **Notes:**

   - The GGUF endian conversion script may not support all data types at the moment and may fail for some models/quantizations. When that happens, please try manually converting the safetensors model to GGUF Big-Endian via Step 2.

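If you are unsure which byte order a converted file ended up with, the GGUF header itself tells you: it starts with the 4-byte magic `GGUF` followed by a 4-byte version integer, and a byte-swapped version reads as an implausibly large number. A minimal sketch; the `gguf_header_endianness` helper is hypothetical, not part of llama.cpp:

```python
import struct

def gguf_header_endianness(header: bytes) -> str:
    """Guess the byte order of a GGUF file from its first 8 bytes."""
    if header[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    # Read the version as little-endian; real GGUF versions are small integers,
    # so a huge value means the file was written with the opposite byte order.
    (version,) = struct.unpack("<I", header[4:8])
    return "little" if version < 0x10000 else "big"

# Big-endian v3 header, as written by convert_hf_to_gguf.py --bigendian
print(gguf_header_endianness(b"GGUF" + (3).to_bytes(4, "big")))     # big
print(gguf_header_endianness(b"GGUF" + (3).to_bytes(4, "little")))  # little
```

To check a real file, pass its first 8 bytes, e.g. `gguf_header_endianness(open(path, "rb").read(8))`.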
## IBM Accelerators

### 1. SIMD Acceleration

Only available on IBM z15/LinuxONE 3 or later systems with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z14/arch12. On such systems, the APIs can still run but will use a scalar implementation.

### 2. zDNN Accelerator (WIP)

Only available on IBM z17/LinuxONE 5 or later systems with the `-DGGML_ZDNN=ON` compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs will fall back to CPU routines.

### 3. Spyre Accelerator

_Only available with IBM z17 / LinuxONE 5 or later systems. No support currently available._

## Performance Tuning

### 1. Virtualization Setup

It is strongly recommended to use only LPAR (Type-1) virtualization to get the most performance.

Note: Type-2 virtualization is not supported at the moment. While you can get it running, the performance will not be the best.

### 2. IFL (Core) Count

It is recommended to allocate a minimum of 8 shared IFLs to the LPAR. Increasing the IFL count past 8 shared IFLs will only improve prompt processing performance, not token generation.

Note: IFL count does not equate to vCPU count.

### 3. SMT vs NOSMT (Simultaneous Multithreading)

It is strongly recommended to disable SMT via the kernel boot parameters, as it negatively affects performance. Please refer to your Linux distribution's guide on disabling SMT via kernel boot parameters.

### 4. BLAS vs NOBLAS

IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongly recommended to use BLAS.

## Frequently Asked Questions (FAQ)

1. I'm getting the following error message while trying to load a model: `gguf_init_from_file_impl: failed to load model: this GGUF file version 50331648 is extremely large, is there a mismatch between the host and model endianness?`

   Answer: Please ensure that the model you have downloaded/converted is GGUFv3 Big-Endian. These models are usually denoted with the `-be` suffix, i.e., `granite-3.3-2b-instruct-be.F16.gguf`.

   You may refer to the [Getting GGUF Models](#getting-gguf-models) section to manually convert a `safetensors` model to `GGUF` Big-Endian.

2. I'm getting extremely poor performance when running inference on a model.

   Answer: Please refer to the [Appendix B: SIMD Support Matrix](#appendix-b-simd-support-matrix) to check if your model quantization is supported by SIMD acceleration.

3. I'm building on IBM z17 and getting the following error message: `invalid switch -march=z17`.

   Answer: Please ensure that your GCC compiler is at minimum version 15.1.0 and that `binutils` is updated to the latest version. If this does not fix the problem, kindly open an issue.

4. Failing to install the `sentencepiece` package using GCC 15+.

   Answer: The `sentencepiece` team is aware of this, as seen in [this issue](https://github.com/google/sentencepiece/issues/1108).

   As a temporary workaround, please run the installation command with the following environment variable:

   ```bash
   export CXXFLAGS="-include cstdint"
   ```

   For example:

   ```bash
   CXXFLAGS="-include cstdint" pip3 install -r requirements.txt
   ```

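The odd version number in question 1 is simply GGUF version 3 with its bytes swapped, which a quick check confirms (illustrative snippet):

```python
# GGUF version 3 written in one byte order but read in the other:
swapped = int.from_bytes((3).to_bytes(4, "little"), "big")
print(swapped)  # 50331648
```

Seeing exactly this value in the error message is a strong sign that the model file is Little-Endian and needs to be converted as described above.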
## Getting Help on IBM Z & LinuxONE

1. **Bugs, Feature Requests**

   Please file an issue in llama.cpp and ensure that the title contains "s390x".

2. **Other Questions**

   Please reach out directly to [aionz@us.ibm.com](mailto:aionz@us.ibm.com).

## Appendix A: Hardware Support Matrix

|          | Support | Minimum Compiler Version |
| -------- | ------- | ------------------------ |
| IBM z15  | ✅      |                          |
| IBM z16  | ✅      |                          |
| IBM z17  | ✅      | GCC 15.1.0               |
| IBM zDNN | ✅      |                          |

- ✅ - supported and verified to run as intended
- 🚫 - unsupported, we are unlikely able to provide support

## Appendix B: SIMD Support Matrix

|            | VX/VXE/VXE2 | zDNN | Spyre |
| ---------- | ----------- | ---- | ----- |
| FP32       | ✅          | ✅   | ❓    |
| FP16       | ✅          | ✅   | ❓    |
| BF16       | 🚫          | ✅   | ❓    |
| Q4_0       | ✅          | ❓   | ❓    |
| Q4_1       | ✅          | ❓   | ❓    |
| MXFP4      | 🚫          | ❓   | ❓    |
| Q5_0       | ✅          | ❓   | ❓    |
| Q5_1       | ✅          | ❓   | ❓    |
| Q8_0       | ✅          | ❓   | ❓    |
| Q2_K       | 🚫          | ❓   | ❓    |
| Q3_K       | ✅          | ❓   | ❓    |
| Q4_K       | ✅          | ❓   | ❓    |
| Q5_K       | ✅          | ❓   | ❓    |
| Q6_K       | ✅          | ❓   | ❓    |
| TQ1_0      | 🚫          | ❓   | ❓    |
| TQ2_0      | 🚫          | ❓   | ❓    |
| IQ2_XXS    | 🚫          | ❓   | ❓    |
| IQ2_XS     | 🚫          | ❓   | ❓    |
| IQ2_S      | 🚫          | ❓   | ❓    |
| IQ3_XXS    | 🚫          | ❓   | ❓    |
| IQ3_S      | 🚫          | ❓   | ❓    |
| IQ1_S      | 🚫          | ❓   | ❓    |
| IQ1_M      | 🚫          | ❓   | ❓    |
| IQ4_NL     | ✅          | ❓   | ❓    |
| IQ4_XS     | ✅          | ❓   | ❓    |
| FP32->FP16 | 🚫          | ❓   | ❓    |
| FP16->FP32 | 🚫          | ❓   | ❓    |

- ✅ - acceleration available
- 🚫 - acceleration unavailable, will still run using scalar implementation
- ❓ - acceleration unknown, please contribute if you can test it yourself

Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on Sep 7, 2025.