> [!IMPORTANT]
> This build documentation is specific only to IBM Z & LinuxONE mainframes (s390x). You can find the build documentation for other architectures at [build.md](build.md).

# Build llama.cpp locally (for s390x)

The main product of this project is the `llama` library. Its C-style interface can be found in [include/llama.h](../include/llama.h).

The project also includes many example programs and tools using the `llama` library. The examples range from simple, minimal code snippets to sophisticated sub-projects such as an OpenAI-compatible HTTP server.

**To get the code:**

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

## CPU Build with BLAS

Building llama.cpp with BLAS support is highly recommended, as it has been shown to provide performance improvements. Make sure to have OpenBLAS installed in your environment.

```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS

cmake --build build --config Release -j $(nproc)
```
29
30**Notes**:
31
32- For faster repeated compilation, install [ccache](https://ccache.dev/)
33- By default, VXE/VXE2 is enabled. To disable it (not recommended):
34
35 ```bash
36 cmake -S . -B build \
37 -DCMAKE_BUILD_TYPE=Release \
38 -DGGML_BLAS=ON \
39 -DGGML_BLAS_VENDOR=OpenBLAS \
40 -DGGML_VXE=OFF
41
42 cmake --build build --config Release -j $(nproc)
43 ```
44
45- For debug builds:
46
47 ```bash
48 cmake -S . -B build \
49 -DCMAKE_BUILD_TYPE=Debug \
50 -DGGML_BLAS=ON \
51 -DGGML_BLAS_VENDOR=OpenBLAS
52 cmake --build build --config Debug -j $(nproc)
53 ```
54
55- For static builds, add `-DBUILD_SHARED_LIBS=OFF`:
56
57 ```bash
58 cmake -S . -B build \
59 -DCMAKE_BUILD_TYPE=Release \
60 -DGGML_BLAS=ON \
61 -DGGML_BLAS_VENDOR=OpenBLAS \
62 -DBUILD_SHARED_LIBS=OFF
63
64 cmake --build build --config Release -j $(nproc)
65 ```
66
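If you have ccache installed, you can point CMake at it explicitly with the standard `CMAKE_C_COMPILER_LAUNCHER`/`CMAKE_CXX_COMPILER_LAUNCHER` variables. A sketch of the configure step, to be combined with any other flags you need:

```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache

cmake --build build --config Release -j $(nproc)
```

With the launchers set, subsequent rebuilds after `rm -rf build` reuse cached object files instead of recompiling from scratch.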
## IBM zDNN Accelerator

This provides acceleration using the IBM zAIU co-processor located in the Telum I and Telum II processors. Make sure to have the [IBM zDNN library](https://github.com/IBM/zDNN) installed.

### Compile from source from IBM

You may find the official build instructions here: [Building and Installing zDNN](https://github.com/IBM/zDNN?tab=readme-ov-file#building-and-installing-zdnn)

### Compilation

```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_ZDNN=ON

cmake --build build --config Release -j $(nproc)
```

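Before configuring with `-DGGML_ZDNN=ON`, you can quickly check whether the zDNN shared library is discoverable on your system. An illustrative check using only Python's standard library; it prints `None` when the library is not installed:

```python
# Look up the zDNN shared library the same way the dynamic linker would.
import ctypes.util

print(ctypes.util.find_library("zdnn"))  # a soname such as "libzdnn.so.0" if installed, None otherwise
```

If this prints `None` even after installing zDNN, make sure the install location is on the linker search path (e.g. run `ldconfig` after installing to `/usr/local/lib`).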
## Getting GGUF Models

All models need to be converted to Big-Endian. You can achieve this in three ways:

1. **Use pre-converted models verified for use on IBM Z & LinuxONE (easiest)**

   You can find popular models pre-converted and verified at [s390x Verified Models](https://huggingface.co/collections/taronaeo/s390x-verified-models-672765393af438d0ccb72a08) or [s390x Runnable Models](https://huggingface.co/collections/taronaeo/s390x-runnable-models-686e951824198df12416017e).

   These models have already been converted from `safetensors` to `GGUF` Big-Endian and their respective tokenizers verified to run correctly on IBM z15 and later systems.

2. **Convert a safetensors model to GGUF Big-Endian directly (recommended)**

   The model you are trying to convert must be in `safetensors` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case.

   Ensure that you have installed the required packages in advance:

   ```bash
   pip3 install -r requirements.txt
   ```

   Convert the `safetensors` model to `GGUF`:

   ```bash
   python3 convert_hf_to_gguf.py \
       --outfile model-name-be.f16.gguf \
       --outtype f16 \
       --bigendian \
       model-directory/
   ```

   For example:

   ```bash
   python3 convert_hf_to_gguf.py \
       --outfile granite-3.3-2b-instruct-be.f16.gguf \
       --outtype f16 \
       --bigendian \
       granite-3.3-2b-instruct/
   ```

3. **Convert an existing GGUF Little-Endian model to Big-Endian**

   The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B GGUF](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.

   ```bash
   python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
   ```

   For example:

   ```bash
   python3 gguf-py/gguf/scripts/gguf_convert_endian.py granite-3.3-2b-instruct-le.f16.gguf BIG
   mv granite-3.3-2b-instruct-le.f16.gguf granite-3.3-2b-instruct-be.f16.gguf
   ```

   **Notes:**

   - The GGUF endian conversion script may not support all data types at the moment and may fail for some models/quantizations. When that happens, please try manually converting the safetensors model to GGUF Big-Endian via Step 2.

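If you are unsure which byte order a converted file ended up with, the GGUF header itself tells you: it starts with the 4-byte magic `GGUF` followed by a 4-byte version integer, and a byte-swapped version reads as an implausibly large number. A minimal sketch; the `gguf_header_endianness` helper is hypothetical, not part of llama.cpp:

```python
import struct

def gguf_header_endianness(header: bytes) -> str:
    """Guess the byte order of a GGUF file from its first 8 bytes."""
    if header[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    # Read the version as little-endian; real GGUF versions are small integers,
    # so a huge value means the file was written with the opposite byte order.
    (version,) = struct.unpack("<I", header[4:8])
    return "little" if version < 0x10000 else "big"

# Big-endian v3 header, as written by convert_hf_to_gguf.py --bigendian
print(gguf_header_endianness(b"GGUF" + (3).to_bytes(4, "big")))     # big
print(gguf_header_endianness(b"GGUF" + (3).to_bytes(4, "little")))  # little
```

To check a real file, pass its first 8 bytes, e.g. `gguf_header_endianness(open(path, "rb").read(8))`.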
## IBM Accelerators

### 1. SIMD Acceleration

Only available on IBM z15/LinuxONE 3 or later systems with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z14/arch12. On such systems, the APIs can still run but will use a scalar implementation.

### 2. zDNN Accelerator (WIP)

Only available on IBM z17/LinuxONE 5 or later systems with the `-DGGML_ZDNN=ON` compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs will fall back to CPU routines.

### 3. Spyre Accelerator

_Only available with IBM z17 / LinuxONE 5 or later systems. No support currently available._

## Performance Tuning

### 1. Virtualization Setup

It is strongly recommended to use only LPAR (Type-1) virtualization to get the most performance.

Note: Type-2 virtualization is not supported at the moment. While you can get it running, the performance will not be the best.

### 2. IFL (Core) Count

It is recommended to allocate a minimum of 8 shared IFLs to the LPAR. Increasing the IFL count past 8 shared IFLs will only improve prompt processing performance, not token generation.

Note: IFL count does not equate to vCPU count.

### 3. SMT vs NOSMT (Simultaneous Multithreading)

It is strongly recommended to disable SMT via the kernel boot parameters, as it negatively affects performance. Please refer to your Linux distribution's guide on disabling SMT via kernel boot parameters.

### 4. BLAS vs NOBLAS

IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongly recommended to use BLAS.

## Frequently Asked Questions (FAQ)

1. I'm getting the following error message while trying to load a model: `gguf_init_from_file_impl: failed to load model: this GGUF file version 50331648 is extremely large, is there a mismatch between the host and model endianness?`

   Answer: Please ensure that the model you have downloaded/converted is GGUFv3 Big-Endian. These models are usually denoted with the `-be` suffix, i.e., `granite-3.3-2b-instruct-be.F16.gguf`.

   You may refer to the [Getting GGUF Models](#getting-gguf-models) section to manually convert a `safetensors` model to `GGUF` Big-Endian.

2. I'm getting extremely poor performance when running inference on a model.

   Answer: Please refer to the [Appendix B: SIMD Support Matrix](#appendix-b-simd-support-matrix) to check if your model quantization is supported by SIMD acceleration.

3. I'm building on IBM z17 and getting the following error message: `invalid switch -march=z17`.

   Answer: Please ensure that your GCC compiler is at minimum version 15.1.0 and that `binutils` is updated to the latest version. If this does not fix the problem, kindly open an issue.

4. Failing to install the `sentencepiece` package using GCC 15+.

   Answer: The `sentencepiece` team is aware of this, as seen in [this issue](https://github.com/google/sentencepiece/issues/1108).

   As a temporary workaround, please run the installation command with the following environment variable:

   ```bash
   export CXXFLAGS="-include cstdint"
   ```

   For example:

   ```bash
   CXXFLAGS="-include cstdint" pip3 install -r requirements.txt
   ```

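The odd version number in question 1 is simply GGUF version 3 with its bytes swapped, which a quick check confirms (illustrative snippet):

```python
# GGUF version 3 written in one byte order but read in the other:
swapped = int.from_bytes((3).to_bytes(4, "little"), "big")
print(swapped)  # 50331648
```

Seeing exactly this value in the error message is a strong sign that the model file is Little-Endian and needs to be converted as described above.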
## Getting Help on IBM Z & LinuxONE

1. **Bugs, Feature Requests**

   Please file an issue in llama.cpp and ensure that the title contains "s390x".

2. **Other Questions**

   Please reach out directly to [aionz@us.ibm.com](mailto:aionz@us.ibm.com).

## Appendix A: Hardware Support Matrix

|          | Support | Minimum Compiler Version |
| -------- | ------- | ------------------------ |
| IBM z15  | ✅      |                          |
| IBM z16  | ✅      |                          |
| IBM z17  | ✅      | GCC 15.1.0               |
| IBM zDNN | ✅      |                          |

- ✅ - supported and verified to run as intended
- 🚫 - unsupported, we are unlikely able to provide support

## Appendix B: SIMD Support Matrix

|            | VX/VXE/VXE2 | zDNN | Spyre |
| ---------- | ----------- | ---- | ----- |
| FP32       | ✅          | ✅   | ❓    |
| FP16       | ✅          | ✅   | ❓    |
| BF16       | 🚫          | ✅   | ❓    |
| Q4_0       | ✅          | ❓   | ❓    |
| Q4_1       | ✅          | ❓   | ❓    |
| MXFP4      | 🚫          | ❓   | ❓    |
| Q5_0       | ✅          | ❓   | ❓    |
| Q5_1       | ✅          | ❓   | ❓    |
| Q8_0       | ✅          | ❓   | ❓    |
| Q2_K       | 🚫          | ❓   | ❓    |
| Q3_K       | ✅          | ❓   | ❓    |
| Q4_K       | ✅          | ❓   | ❓    |
| Q5_K       | ✅          | ❓   | ❓    |
| Q6_K       | ✅          | ❓   | ❓    |
| TQ1_0      | 🚫          | ❓   | ❓    |
| TQ2_0      | 🚫          | ❓   | ❓    |
| IQ2_XXS    | 🚫          | ❓   | ❓    |
| IQ2_XS     | 🚫          | ❓   | ❓    |
| IQ2_S      | 🚫          | ❓   | ❓    |
| IQ3_XXS    | 🚫          | ❓   | ❓    |
| IQ3_S      | 🚫          | ❓   | ❓    |
| IQ1_S      | 🚫          | ❓   | ❓    |
| IQ1_M      | 🚫          | ❓   | ❓    |
| IQ4_NL     | ✅          | ❓   | ❓    |
| IQ4_XS     | ✅          | ❓   | ❓    |
| FP32->FP16 | 🚫          | ❓   | ❓    |
| FP16->FP32 | 🚫          | ❓   | ❓    |

- ✅ - acceleration available
- 🚫 - acceleration unavailable, will still run using scalar implementation
- ❓ - acceleration unknown, please contribute if you can test it yourself

Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on Sep 7, 2025.