# llama.cpp for SYCL

- [Background](#background)
- [Recommended Release](#recommended-release)
- [News](#news)
- [OS](#os)
- [Hardware](#hardware)
- [Docker](#docker)
- [Linux](#linux)
- [Windows](#windows)
- [Environment Variable](#environment-variable)
- [Known Issues](#known-issues)
- [Q&A](#qa)
- [TODO](#todo)

## Background

**SYCL** is a high-level parallel programming model designed to improve developers' productivity when writing code across various hardware accelerators such as CPUs, GPUs, and FPGAs. It is a single-source language designed for heterogeneous computing, based on standard C++17.

**oneAPI** is an open ecosystem and a standards-based specification, supporting multiple architectures including but not limited to Intel CPUs, GPUs and FPGAs. The key components of the oneAPI ecosystem include:

- **DPCPP** *(Data Parallel C++)*: The primary oneAPI SYCL implementation, which includes the icpx/icx compilers.
- **oneAPI Libraries**: A set of highly optimized libraries targeting multiple domains *(e.g. Intel oneMKL, oneMath and oneDNN)*.
- **oneAPI LevelZero**: A high-performance, low-level interface for fine-grained control over Intel iGPUs and dGPUs.

### Llama.cpp + SYCL

The llama.cpp SYCL backend is primarily designed for **Intel GPUs**.
SYCL's cross-platform capabilities enable support for other vendor GPUs as well.

## Recommended Release

The following releases are verified and recommended:

|Commit ID|Tag|Release|Verified Platform|Update date|
|-|-|-|-|-|
|24e86cae7219b0f3ede1d5abdf5bf3ad515cccb8|b5377|[llama-b5377-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b5377/llama-b5377-bin-win-sycl-x64.zip)|Arc B580/Linux/oneAPI 2025.1<br>LNL Arc GPU/Windows 11/oneAPI 2025.1.1|2025-05-15|
|3bcd40b3c593d14261fb2abfabad3c0fb5b9e318|b4040|[llama-b4040-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b4040/llama-b4040-bin-win-sycl-x64.zip)|Arc A770/Linux/oneAPI 2024.1<br>MTL Arc GPU/Windows 11/oneAPI 2024.1|2024-11-19|
|fb76ec31a9914b7761c1727303ab30380fd4f05c|b3038|[llama-b3038-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b3038/llama-b3038-bin-win-sycl-x64.zip)|Arc A770/Linux/oneAPI 2024.1<br>MTL Arc GPU/Windows 11/oneAPI 2024.1||


## News

- 2026.02
  - Removed support for Nvidia & AMD GPUs: the oneAPI plugin for Nvidia & AMD GPUs is unavailable and its download/installation channels are out of service, so users can no longer build the software for these GPUs.

- 2025.11
  - Support allocating more than 4GB of memory on the device.

- 2025.2
  - Optimized MUL_MAT Q4_0 on Intel GPUs, covering all dGPUs and built-in GPUs since MTL. This increases LLM performance (llama-2-7b.Q4_0.gguf) by 21%-87% on Intel GPUs (MTL, ARL-H, Arc, Flex, PVC).
    |GPU|Base tokens/s|Increased tokens/s|Percent|
    |-|-|-|-|
    |PVC 1550|39|73|+87%|
    |Flex 170|39|50|+28%|
    |Arc A770|42|55|+30%|
    |MTL|13|16|+23%|
    |ARL-H|14|17|+21%|

- 2024.11
  - Use syclcompat to improve performance on some platforms. This requires oneAPI 2025.0 or newer.

- 2024.8
  - Use oneDNN as the default GEMM library, improving compatibility with new Intel GPUs.

- 2024.5
  - Performance increased: 34 -> 37 tokens/s for llama-2-7b.Q4_0 on Arc A770.
  - Arch Linux verified successfully.

- 2024.4
  - Support data types: GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M.

- 2024.3
  - Released binary files for Windows.
  - A blog was published: **Run LLM on all Intel GPUs Using llama.cpp**: [intel.com](https://www.intel.com/content/www/us/en/developer/articles/technical/run-llm-on-all-gpus-using-llama-cpp-artical.html) or [medium.com](https://medium.com/@jianyu_neo/run-llm-on-all-intel-gpus-using-llama-cpp-fd2e2dcbd9bd).
  - New baseline is ready: [tag b2437](https://github.com/ggml-org/llama.cpp/tree/b2437).
  - Support multiple cards via **--split-mode**: [none|layer]; [row] is not supported yet and is under development.
  - Support assigning the main GPU with **--main-gpu**, replacing $GGML_SYCL_DEVICE.
  - Support detecting all Level-Zero GPUs that share the same top **Max compute units**.
  - Supported ops:
    - hardsigmoid
    - hardswish
    - pool2d

- 2024.1
  - Created SYCL backend for Intel GPU.
  - Support Windows build.

## OS

| OS      | Status  | Verified                                       |
|---------|---------|------------------------------------------------|
| Linux   | Support | Ubuntu 22.04, Fedora Silverblue 39, Arch Linux |
| Windows | Support | Windows 11                                     |


## Hardware

### Intel GPU

The SYCL backend supports the following Intel GPU families:

- Intel Data Center Max Series
- Intel Flex Series, Arc Series
- Intel Built-in Arc GPU
- Intel iGPU in Core CPUs (11th Generation Core CPUs and newer; refer to [oneAPI supported GPU](https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html#inpage-nav-1-1)).

On older Intel GPUs, you may try [OpenCL](/docs/backend/OPENCL.md), although the performance is not optimal; some GPUs may not support OpenCL or have any GPGPU capability at all.

#### Verified devices

| Intel GPU                     | Status  | Verified Model                                          |
|-------------------------------|---------|---------------------------------------------------------|
| Intel Data Center Max Series  | Support | Max 1550, 1100                                          |
| Intel Data Center Flex Series | Support | Flex 170                                                |
| Intel Arc A-Series            | Support | Arc A770, Arc A730M, Arc A750                           |
| Intel Arc B-Series            | Support | Arc B580                                                |
| Intel built-in Arc GPU        | Support | built-in Arc GPU in Meteor Lake, Arrow Lake, Lunar Lake |
| Intel iGPU                    | Support | iGPU in 13700k, 13400, i5-1250P, i7-1260P, i7-1165G7    |

*Notes:*

- **Memory**
  - Device memory is a limitation when running a large model. The loaded model size, *`llm_load_tensors: buffer_size`*, is displayed in the log when running `./bin/llama-completion` (see the sketch after these notes).
  - Please make sure the GPU shared memory from the host is large enough to account for the model's size. For example, *llama-2-7b.Q4_0* requires at least 8.0GB on an integrated GPU and 4.0GB on a discrete GPU.

- **Execution Units (EU)**
  - If the iGPU has fewer than 80 EUs, the inference speed will likely be too slow for practical use.
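
As a quick check of the loaded model size, you can filter the llama.cpp log for the reported buffer sizes. This is a minimal sketch; the model path and flags are assumptions, so adjust them to your setup:

```sh
# Run a short generation and keep only the buffer-size lines from the log
./build/bin/llama-completion -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Hello" -n 16 -ngl 99 2>&1 | grep -i "buffer"
```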

### Other Vendor GPU

N/A

## Docker

The docker build option is currently limited to *Intel GPU* targets.

### Build image

```sh
# Using FP32
docker build -t llama-cpp-sycl --build-arg="GGML_SYCL_F16=OFF" --target light -f .devops/intel.Dockerfile .

# Using FP16
docker build -t llama-cpp-sycl --build-arg="GGML_SYCL_F16=ON" --target light -f .devops/intel.Dockerfile .
```

*Notes*:

You can also use the `.devops/llama-server-intel.Dockerfile`, which builds the *"server"* alternative.
Check the [documentation for Docker](../docker.md) to see the available images.

### Run container

```sh
# First, find all the DRI cards
ls -la /dev/dri
# Then, pick the card that you want to use (e.g. /dev/dri/card0 and its render node below).
docker run -it --rm -v "/path/to/models:/models" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 llama-cpp-sycl -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -c 4096 -s 0
```

*Notes:*
- Docker has been tested successfully on native Linux. WSL support has not been verified yet.
- You may need to install the Intel GPU driver on the **host** machine *(please refer to the [Linux configuration](#linux) for details)*.
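
If you are unsure which card/render nodes map to your GPU, a simpler (if broader) option is to pass the whole `/dev/dri` directory into the container. A sketch, assuming it is acceptable to expose all GPUs on the host:

```sh
# Expose every DRI device to the container instead of picking individual nodes
docker run -it --rm -v "/path/to/models:/models" --device /dev/dri llama-cpp-sycl -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -c 4096
```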

## Linux

### I. Setup Environment

1. **Install GPU drivers**

  - **Intel GPU**

The installation guide and download page for Intel data center GPU drivers can be found here: [Get intel dGPU Drivers](https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps).

*Note*: for client GPUs *(iGPU & Arc A-Series)*, please refer to the [client iGPU driver installation](https://dgpu-docs.intel.com/driver/client/overview.html).

Once installed, add the user(s) to the `video` and `render` groups.

```sh
sudo usermod -aG render $USER
sudo usermod -aG video $USER
```

*Note*: log out and log back in for the changes to take effect.
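
You can confirm the group membership with a quick check:

```sh
groups $USER   # the output should now include 'render' and 'video'
```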

Verify installation through `clinfo`:

```sh
sudo apt install clinfo
sudo clinfo -l
```

Sample output:

```sh
Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics

Platform #0: Intel(R) OpenCL HD Graphics
 `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
```

2. **Install Intel® oneAPI Base toolkit**

The SYCL backend depends on:
  - Intel® oneAPI DPC++/C++ compiler and runtime.
  - Intel® oneAPI DPC++/C++ library (oneDPL).
  - Intel® oneAPI Deep Neural Network Library (oneDNN).
  - Intel® oneAPI Math Kernel Library (oneMKL).

- **For Intel GPU**

All of the above are included in both the **Intel® oneAPI Base toolkit** and the **Intel® Deep Learning Essentials** packages.

It's recommended to install **Intel® Deep Learning Essentials**, which provides only the necessary libraries and has a smaller footprint.

The **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.

Please follow the instructions for downloading and installing the Toolkit for Linux, and preferably keep the default installation values unchanged, notably the installation path *(`/opt/intel/oneapi` by default)*.

The following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable.

Upon a successful installation, SYCL is enabled for the available Intel devices, along with relevant libraries such as oneDNN for Intel GPUs.

|Verified release|
|-|
|2025.2.1|
|2025.1|
|2024.1|

3. **Verify installation and environment**

In order to check the available SYCL devices on the machine, please use the `sycl-ls` command.
```sh
source /opt/intel/oneapi/setvars.sh
sycl-ls
```

- **Intel GPU**

When targeting an Intel GPU, the user should expect one or more devices among the available SYCL devices. Please make sure that at least one GPU is present via `sycl-ls`, for instance `[level_zero:gpu]` in the sample output below:

```
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A770 Graphics 12.55.8 [1.3.29735+27]
[level_zero:gpu][level_zero:1] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) UHD Graphics 730 12.2.0 [1.3.29735+27]
[opencl:cpu][opencl:0] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i5-13400 OpenCL 3.0 (Build 0) [2025.20.8.0.06_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [24.39.31294]
[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 730 OpenCL 3.0 NEO  [24.39.31294]
```

### II. Build llama.cpp

#### Intel GPU

```sh
./examples/sycl/build.sh
```

or

```sh
# Export relevant ENV variables
source /opt/intel/oneapi/setvars.sh

# Option 1: Use FP32 (recommended for better performance in most cases)
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

# Option 2: Use FP16
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON

# Build all binaries
cmake --build build --config Release -j -v
```

It is possible to come across some precision issues when running tests that stem from using faster
instructions, which can be circumvented by setting the environment variable `SYCL_PROGRAM_COMPILE_OPTIONS`
to `-cl-fp32-correctly-rounded-divide-sqrt`.
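
For example, a sketch of re-running the test suite with that option set, assuming the tests were built in `build/` and that you drive them with `ctest`:

```sh
export SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt"
ctest --test-dir build --output-on-failure
```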

### III. Run the inference

#### Retrieve and prepare model

You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation, or download an already quantized model like [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf?download=true) or [Meta-Llama-3-8B-Instruct-Q4_0.gguf](https://huggingface.co/aptha/Meta-Llama-3-8B-Instruct-Q4_0-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf).

##### Check device

1. Enable oneAPI running environment

```sh
source /opt/intel/oneapi/setvars.sh
```

2. List devices information

Similar to the native `sycl-ls`, available SYCL devices can be queried as follows:

```sh
./build/bin/llama-ls-sycl-device
```

This command lists only the devices of the selected backend supported by SYCL. The default backend is level_zero. For example, in a system with 2 *Intel GPUs*, it would look like the following:
```
found 2 SYCL devices:

|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       1.3|        512|    1024|     32|    16225243136|
| 1|[level_zero:gpu:1]|                    Intel(R) UHD Graphics 770|       1.3|         32|     512|     32|    53651849216|
```

#### Choose level-zero devices

|Chosen Device ID|Setting|
|-|-|
|0|`export ONEAPI_DEVICE_SELECTOR="level_zero:0"` or no action|
|1|`export ONEAPI_DEVICE_SELECTOR="level_zero:1"`|
|0 & 1|`export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"`|
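
For instance, to restrict llama.cpp to the first Level-Zero GPU and confirm that the selector took effect, a minimal sketch using the device-listing tool shown above:

```sh
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
./build/bin/llama-ls-sycl-device   # should now report a single SYCL device
```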

#### Execute

Choose one of the following methods to run.

1. Script

- Use device 0:

```sh
./examples/sycl/test.sh -mg 0
```
- Use multiple devices:

```sh
./examples/sycl/test.sh
```

2. Command line

Launch inference.

There are two device selection modes:

- Single device: Use one device assigned by the user. The default device id is 0.
- Multiple devices: Automatically choose the devices with the same backend.

In both device selection modes, the default SYCL backend is level_zero; you can choose another backend supported by SYCL by setting the environment variable ONEAPI_DEVICE_SELECTOR.

| Device selection | Parameter                              |
|------------------|----------------------------------------|
| Single device    | --split-mode none --main-gpu DEVICE_ID |
| Multiple devices | --split-mode layer (default)           |

Examples:

- Use device 0:

```sh
ZES_ENABLE_SYSMAN=1 ./build/bin/llama-completion -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 99 -sm none -mg 0 --mmap
```

- Use multiple devices:

```sh
ZES_ENABLE_SYSMAN=1 ./build/bin/llama-completion -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 99 -sm layer --mmap
```

*Notes:*

- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follows:

```sh
detect 1 SYCL GPUs: [0] with top Max compute units:512
```
Or
```sh
use 1 SYCL GPUs: [0] with Max compute units:512
```

## Windows

### I. Setup Environment

1. Install GPU driver

The instructions and download page for Intel GPU drivers can be found here: [Get Intel GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html).

2. Install Visual Studio

If you already have a recent version of Microsoft Visual Studio, you can skip this step. Otherwise, please refer to the official download page for [Microsoft Visual Studio](https://visualstudio.microsoft.com/).

3. Install Intel® oneAPI Base toolkit

The SYCL backend depends on:
  - Intel® oneAPI DPC++/C++ compiler and runtime.
  - Intel® oneAPI DPC++/C++ library (oneDPL).
  - Intel® oneAPI Deep Neural Network Library (oneDNN).
  - Intel® oneAPI Math Kernel Library (oneMKL).

All of the above are included in both the **Intel® oneAPI Base toolkit** and the **Intel® Deep Learning Essentials** packages.

It's recommended to install **Intel® Deep Learning Essentials**, which provides only the necessary libraries and has a smaller footprint.

The **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.

Please follow the instructions for downloading and installing the Toolkit for Windows, and preferably keep the default installation values unchanged, notably the installation path *(`C:\Program Files (x86)\Intel\oneAPI` by default)*.

The following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable.

b. Enable oneAPI running environment:

- Type "oneAPI" in the search bar, then open the `Intel oneAPI command prompt for Intel 64 for Visual Studio 2022` App.

- On the command prompt, enable the runtime environment with the following:
```
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
```

- If you are using PowerShell, enable the runtime environment with the following:

```
cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'
```

c. Verify installation

In the oneAPI command line, run the following to print the available SYCL devices:

```
sycl-ls.exe
```

There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is an example of such output detecting an *Intel Iris Xe* GPU as a Level-Zero SYCL device:

Output (example):
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044]
```

4. Install build tools

a. Download & install CMake for Windows: https://cmake.org/download/ (CMake can also be installed from the Visual Studio Installer.)

b. Recent versions of Visual Studio install Ninja by default. (If not, please install it manually: https://ninja-build.org/)
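
You can verify that both tools are available from the oneAPI command prompt; any reasonably recent versions should work:

```
cmake --version
ninja --version
```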


### II. Build llama.cpp

You can download the release package for Windows directly, which includes the binaries and the required oneAPI DLL files.

Choose one of the following methods to build from source code.

#### 1. Script

```sh
.\examples\sycl\win-build-sycl.bat
```

#### 2. CMake

On the oneAPI command line window, step into the llama.cpp main directory and run the following:

```
@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force

# Option 1: Use FP32 (recommended for better performance in most cases)
cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release

# Option 2: Use FP16
cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL_F16=ON

cmake --build build --config Release -j
```

Or, use CMake presets to build:

```sh
cmake --preset x64-windows-sycl-release
cmake --build build-x64-windows-sycl-release -j --target llama-completion

cmake -DGGML_SYCL_F16=ON --preset x64-windows-sycl-release
cmake --build build-x64-windows-sycl-release -j --target llama-completion

cmake --preset x64-windows-sycl-debug
cmake --build build-x64-windows-sycl-debug -j --target llama-completion
```
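
If you are unsure which presets your checkout provides, CMake can list them (requires CMake 3.19 or newer):

```sh
cmake --list-presets
```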

#### 3. Visual Studio

You have two options for using Visual Studio to build llama.cpp:
- As a CMake project, using CMake presets.
- By creating a Visual Studio solution to handle the project.

**Note**:

All following commands are executed in PowerShell.

##### - Open as a CMake Project

You can use Visual Studio to open the `llama.cpp` folder directly as a CMake project. Before compiling, select one of the SYCL CMake presets:

- `x64-windows-sycl-release`

- `x64-windows-sycl-debug`

*Notes:*
- For a minimal experimental setup, you can build only the inference executable using:

    ```Powershell
    cmake --build build --config Release -j --target llama-completion
    ```

##### - Generating a Visual Studio Solution

You can use a Visual Studio solution to build and work on llama.cpp on Windows. To do so, you need to convert the CMake project into a `.sln` file.

If you want to use the Intel C++ Compiler for the entire `llama.cpp` project, run the following command:

```Powershell
cmake -B build -G "Visual Studio 17 2022" -T "Intel C++ Compiler 2025" -A x64 -DGGML_SYCL=ON -DCMAKE_BUILD_TYPE=Release
```

If you prefer to use the Intel C++ Compiler only for `ggml-sycl`, ensure that `ggml` and its backend libraries are built as shared libraries (i.e. `-DBUILD_SHARED_LIBS=ON`, which is the default behaviour):

```Powershell
cmake -B build -G "Visual Studio 17 2022" -A x64 -DGGML_SYCL=ON -DCMAKE_BUILD_TYPE=Release `
      -DSYCL_INCLUDE_DIR="C:\Program Files (x86)\Intel\oneAPI\compiler\latest\include" `
      -DSYCL_LIBRARY_DIR="C:\Program Files (x86)\Intel\oneAPI\compiler\latest\lib"
```

If successful, the build files will be written to *path/to/llama.cpp/build*.
Open the project file **build/llama.cpp.sln** with Visual Studio.

Once the Visual Studio solution is created, follow these steps:

1. Open the solution in Visual Studio.

2. Right-click on `ggml-sycl` and select **Properties**.

3. In the left column, expand **C/C++** and select **DPC++**.

4. In the right panel, find **Enable SYCL Offload** and set it to `Yes`.

5. Apply the changes and save.

*Navigation Path:*

```
Properties -> C/C++ -> DPC++ -> Enable SYCL Offload (Yes)
```

Now you can build `llama.cpp` with the SYCL backend as a Visual Studio project.
To do it from the menu: `Build -> Build Solution`.
Once completed, the final results will be in **build/Release/bin**.

*Additional Notes*

- You can avoid specifying `SYCL_INCLUDE_DIR` and `SYCL_LIBRARY_DIR` in the CMake command by setting the environment variables `SYCL_INCLUDE_DIR_HINT` and `SYCL_LIBRARY_DIR_HINT` instead; see the sketch after this list.

- The above instructions have been tested with Visual Studio 17 (2022) Community edition and oneAPI 2025.0. We expect them to also work with future versions if the instructions are adapted accordingly.
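
A minimal PowerShell sketch of the hint-variable approach; the paths assume the default oneAPI installation location:

```Powershell
# Hint the SYCL include/library locations instead of passing -DSYCL_INCLUDE_DIR/-DSYCL_LIBRARY_DIR
$env:SYCL_INCLUDE_DIR_HINT = "C:\Program Files (x86)\Intel\oneAPI\compiler\latest\include"
$env:SYCL_LIBRARY_DIR_HINT = "C:\Program Files (x86)\Intel\oneAPI\compiler\latest\lib"
cmake -B build -G "Visual Studio 17 2022" -A x64 -DGGML_SYCL=ON -DCMAKE_BUILD_TYPE=Release
```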

### III. Run the inference

#### Retrieve and prepare model

You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation, or download an already quantized model like [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) or [Meta-Llama-3-8B-Instruct-Q4_0.gguf](https://huggingface.co/aptha/Meta-Llama-3-8B-Instruct-Q4_0-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf).

##### Check device

1. Enable oneAPI running environment

On the oneAPI command line window, run the following and step into the llama.cpp directory:
```
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
```

2. List devices information

Similar to the native `sycl-ls`, available SYCL devices can be queried as follows:

```
build\bin\llama-ls-sycl-device.exe
```

This command lists only the devices of the selected backend supported by SYCL. The default backend is level_zero. For example, in a system with 2 *Intel GPUs*, it would look like the following:
```
found 2 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       1.3|        512|    1024|     32|    16225243136|
| 1|[level_zero:gpu:1]|                    Intel(R) UHD Graphics 770|       1.3|         32|     512|     32|    53651849216|
```

#### Choose level-zero devices

|Chosen Device ID|Setting|
|-|-|
|0|Default option. You may also want to `set ONEAPI_DEVICE_SELECTOR="level_zero:0"`|
|1|`set ONEAPI_DEVICE_SELECTOR="level_zero:1"`|
|0 & 1|`set ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"` or `set ONEAPI_DEVICE_SELECTOR="level_zero:*"`|
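
For instance, to pin the run to device 0 and confirm that the selector took effect, a minimal sketch using the device-listing tool shown above (quoting follows the table):

```
set ONEAPI_DEVICE_SELECTOR="level_zero:0"
build\bin\llama-ls-sycl-device.exe
```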

#### Execute

Choose one of the following methods to run.

1. Script

```
examples\sycl\win-test.bat
```

2. Command line

Launch inference.

There are two device selection modes:

- Single device: Use one device assigned by the user. The default device id is 0.
- Multiple devices: Automatically choose the devices with the same backend.

In both device selection modes, the default SYCL backend is level_zero; you can choose another backend supported by SYCL by setting the environment variable ONEAPI_DEVICE_SELECTOR.

| Device selection | Parameter                              |
|------------------|----------------------------------------|
| Single device    | --split-mode none --main-gpu DEVICE_ID |
| Multiple devices | --split-mode layer (default)           |

Examples:

- Use device 0:

```
build\bin\llama-completion.exe -no-cnv -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 99 -sm none -mg 0 --mmap
```

- Use multiple devices:

```
build\bin\llama-completion.exe -no-cnv -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 99 -sm layer --mmap
```

Note:

- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follows:

```sh
detect 1 SYCL GPUs: [0] with top Max compute units:512
```

Or

```sh
use 1 SYCL GPUs: [0] with Max compute units:512
```


## Environment Variable

#### Build

| Name               | Value                                 | Function                                    |
|--------------------|---------------------------------------|---------------------------------------------|
| GGML_SYCL          | ON (mandatory)                        | Enable build with SYCL code path.           |
| GGML_SYCL_TARGET   | INTEL *(default)*                     | Set the SYCL target device type.            |
| GGML_SYCL_DEVICE_ARCH | Optional                           | Set the SYCL device architecture. Setting the device architecture can improve performance. See the table [--offload-arch](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OffloadDesign.md#--offload-arch) for a list of valid architectures. |
| GGML_SYCL_F16      | OFF *(default)* \|ON *(optional)*     | Enable FP16 build with SYCL code path. (1.) |
| GGML_SYCL_GRAPH    | OFF *(default)* \|ON *(optional)*     | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
| GGML_SYCL_DNN      | ON *(default)* \|OFF *(optional)*     | Enable build with oneDNN.                   |
| CMAKE_C_COMPILER   | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path.      |
| CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)*   | Set `icpx/icx` compiler for SYCL code path. |

1. FP32 and FP16 have different performance impacts on LLMs. It is recommended to test both for better prompt-processing performance with your models. You need to rebuild the code after changing `GGML_SYCL_F16=OFF/ON`.

#### Runtime

| Name              | Value            | Function                                                                                                                  |
|-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------|
| GGML_SYCL_DEBUG   | 0 (default) or 1 | Enable the debug logging controlled by the GGML_SYCL_DEBUG macro.                                                         |
| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimized features for Intel GPUs. (Recommended value: 1 for Intel devices older than Gen 10.) |
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through the SYCL Graph feature. Disabled by default because SYCL Graph support is still under development and does not yet improve performance. |
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Enable querying the GPU's free memory via sycl::aspect::ext_intel_free_memory.<br>Recommended when --split-mode = layer. |
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Allow allocating more than 4GB of device memory.|
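
As an illustration, a sketch combining several of the runtime variables above on Linux; the model path is an assumption:

```sh
export GGML_SYCL_DEBUG=1     # verbose SYCL debug logging
export ZES_ENABLE_SYSMAN=1   # allow free-memory queries, recommended with --split-mode layer
./build/bin/llama-completion -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Hello" -n 32 -ngl 99 -sm layer
```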


## Known Issues

- `Split-mode:[row]` is not supported.

## Q&A

- Error: `error while loading shared libraries: libsycl.so: cannot open shared object file: No such file or directory`.

  - Potential cause: missing oneAPI installation or unset environment variables.
  - Solution: Install the *oneAPI base toolkit* and enable its environment through: `source /opt/intel/oneapi/setvars.sh`.

- General compiler error:

  - Remove the **build** folder or try a clean build.

- I can **not** see `[ext_oneapi_level_zero:gpu]` after installing the GPU driver on Linux.

  Please double-check with `sudo sycl-ls`.

  If it's present in the list, please add the video/render groups to your user, then **logout/login** or restart your system:

  ```
  sudo usermod -aG render $USER
  sudo usermod -aG video $USER
  ```
  Otherwise, please double-check the GPU driver installation steps.

- Can I report an Ollama issue on Intel GPU to the llama.cpp SYCL backend?

  No. We can't support Ollama issues directly, because we aren't familiar with Ollama.

  We suggest reproducing the issue on llama.cpp and reporting a similar issue to llama.cpp; we will support that.

  The same applies to other projects that embed the llama.cpp SYCL backend.

- `Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)`, `ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 3503030272 Bytes of memory on device`, or `failed to allocate SYCL0 buffer`

  You are running out of device memory.

  |Reason|Solution|
  |-|-|
  | The default context is too big. It leads to excessive memory usage.|Set `-c 8192` or a smaller value.|
  | The model is too big and requires more memory than what is available.|Choose a smaller model or change to a smaller quantization, like Q5 -> Q4;<br>alternatively, use more than one device to load the model.|
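
  For example, a sketch of retrying with a smaller context window; the model path is an assumption:

  ```sh
  ./build/bin/llama-completion -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Hello" -n 64 -ngl 99 -c 8192
  ```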

- `ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 5000000000 Bytes of memory on device`

  You need to enable support for device allocations larger than 4GB:
  ```
  # Linux
  export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
  # Windows
  set UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
  ```

### **GitHub contribution**:
Please add the `SYCL :` prefix/tag to issue/PR titles to help the SYCL contributors check and address them without delay.

## TODO

- Review ZES_ENABLE_SYSMAN: https://github.com/intel/compute-runtime/blob/master/programmers-guide/SYSMAN.md#support-and-limitations