# Granite Vision

Download the model and point your `GRANITE_MODEL` environment variable to the path.

```bash
$ git clone https://huggingface.co/ibm-granite/granite-vision-3.2-2b
$ export GRANITE_MODEL=./granite-vision-3.2-2b
```


### 1. Running llava surgery v2
First, we need to run the llava surgery script as shown below:

`python llava_surgery_v2.py -C -m $GRANITE_MODEL`

You should see two new files (`llava.clip` and `llava.projector`) written into your model's directory, as shown below.

```bash
$ ls $GRANITE_MODEL | grep -i llava
llava.clip
llava.projector
```

We should see that the projector and visual encoder get split out into the llava files. Quick check to make sure they aren't empty:
```python
import os
import torch

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip"))
projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))

assert len(encoder_tensors) > 0
assert len(projector_tensors) > 0
```

If you inspect the `.keys()` of the loaded tensors, you should see many `vision_model` tensors in `encoder_tensors`, and five tensors (`'multi_modal_projector.linear_1.bias'`, `'multi_modal_projector.linear_1.weight'`, `'multi_modal_projector.linear_2.bias'`, `'multi_modal_projector.linear_2.weight'`, `'image_newline'`) in the multimodal `projector_tensors`.
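
If you want to see those names directly, here is a small standalone sketch along the same lines as the check above; it simply reloads the two files and prints their keys:

```python
import os
import torch

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

# Reload the files written by llava_surgery_v2.py and list the tensor names.
encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip"))
projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))

print(f"{len(encoder_tensors)} encoder tensors, first few:")
for name in sorted(encoder_tensors)[:5]:
    print(f"  {name}")

print(f"{len(projector_tensors)} projector tensors:")
for name in sorted(projector_tensors):
    print(f"  {name}")
```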


### 2. Creating the Visual Component GGUF
Next, create a new directory to hold the visual components, and copy the llava.clip/projector files, as shown below.

```bash
$ ENCODER_PATH=$PWD/visual_encoder
$ mkdir $ENCODER_PATH

$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
```

Now, we need to write a config for the visual encoder; save the JSON below as `$ENCODER_PATH/config.json`. In order to convert the model, be sure to use the correct `image_grid_pinpoints`, as these may vary based on the model; you can find them in `$GRANITE_MODEL/config.json`.

```json
{
    "_name_or_path": "siglip-model",
    "architectures": [
        "SiglipVisionModel"
    ],
    "image_grid_pinpoints": [
        [384,384],
        [384,768],
        [384,1152],
        [384,1536],
        [384,1920],
        [384,2304],
        [384,2688],
        [384,3072],
        [384,3456],
        [384,3840],
        [768,384],
        [768,768],
        [768,1152],
        [768,1536],
        [768,1920],
        [1152,384],
        [1152,768],
        [1152,1152],
        [1536,384],
        [1536,768],
        [1920,384],
        [1920,768],
        [2304,384],
        [2688,384],
        [3072,384],
        [3456,384],
        [3840,384]
    ],
    "mm_patch_merge_type": "spatial_unpad",
    "hidden_size": 1152,
    "image_size": 384,
    "intermediate_size": 4304,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "patch_size": 14,
    "layer_norm_eps": 1e-6,
    "hidden_act": "gelu_pytorch_tanh",
    "projection_dim": 0,
    "vision_feature_layer": [-24, -20, -12, -1]
}
```
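
Since the grid pinpoints have to match the checkpoint you downloaded, it is worth double-checking them before converting. A minimal sketch, assuming `image_grid_pinpoints` sits at the top level of the composite `$GRANITE_MODEL/config.json` (as it does for LLaVA-NeXT style checkpoints):

```python
import json
import os

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

# Print the pinpoints from the downloaded checkpoint so they can be compared
# against the values in the encoder config above.
with open(os.path.join(MODEL_PATH, "config.json")) as f:
    config = json.load(f)

print(json.dumps(config.get("image_grid_pinpoints"), indent=2))
```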

At this point you should have something like this:
```bash
$ ls $ENCODER_PATH
config.json llava.projector pytorch_model.bin
```

Now convert the components to GGUF. Note that we also override the image mean/std dev to `[.5,.5,.5]` since we use the SigLIP visual encoder; in the transformers model, you can find these numbers in the `preprocessor_config.json`.
```bash
$ python convert_image_encoder_to_gguf.py \
    -m $ENCODER_PATH \
    --llava-projector $ENCODER_PATH/llava.projector \
    --output-dir $ENCODER_PATH \
    --clip-model-is-vision \
    --clip-model-is-siglip \
    --image-mean 0.5 0.5 0.5 \
    --image-std 0.5 0.5 0.5
```

This will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the absolute path of this file as `$VISUAL_GGUF_PATH`.


### 3. Creating the LLM GGUF
The Granite Vision model contains a Granite LLM as its language model. For now, the easiest way to get a GGUF for the LLM is to load the composite model in `transformers` and export the LLM so that it can be converted with the normal conversion path.

First, set `LLM_EXPORT_PATH` to the directory the `transformers` LLM should be exported to, then run the export script below.
```bash
$ export LLM_EXPORT_PATH=$PWD/granite_vision_llm
```

```python
import os
import transformers

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
if not LLM_EXPORT_PATH:
    raise ValueError("env var LLM_EXPORT_PATH is unset!")

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)

# NOTE: granite vision support was added to transformers very recently (4.49);
# if you get size mismatches, your version is too old.
# If you are running with an older version, set `ignore_mismatched_sizes=True`
# as shown below; it won't be loaded correctly, but the LLM part of the model that
# we are exporting will be loaded correctly.
model = transformers.AutoModelForImageTextToText.from_pretrained(MODEL_PATH, ignore_mismatched_sizes=True)

tokenizer.save_pretrained(LLM_EXPORT_PATH)
model.language_model.save_pretrained(LLM_EXPORT_PATH)
```
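
Before running the converter, it can help to sanity-check what the export wrote out. A minimal sketch that just lists the export directory; you should at least see a `config.json`, the tokenizer files, and the model weights:

```python
import os

LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
if not LLM_EXPORT_PATH:
    raise ValueError("env var LLM_EXPORT_PATH is unset!")

# List everything save_pretrained() wrote for the LLM and its tokenizer.
for name in sorted(os.listdir(LLM_EXPORT_PATH)):
    print(name)

assert os.path.isfile(os.path.join(LLM_EXPORT_PATH, "config.json"))
```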

Now you can convert the exported LLM to GGUF with the normal converter in the root of the llama.cpp project.
```bash
$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm.gguf
...
$ python convert_hf_to_gguf.py --outfile $LLM_GGUF_PATH $LLM_EXPORT_PATH
```


### 4. Quantization
If you want to quantize the LLM, you can do so with `llama-quantize` as you would any other LLM. For example:
```bash
$ ./build/bin/llama-quantize $LLM_EXPORT_PATH/granite_llm.gguf $LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf Q4_K_M
$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf
```

Note that you currently cannot quantize the visual encoder: granite vision models use SigLIP as the visual encoder, and some of its tensor dimensions are not divisible by 32 (for example, the `intermediate_size` of 4304 in the config above is not a multiple of 32).
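
If you want to see which tensors are affected, here is a rough sketch that scans the projector GGUF for dimensions that are not multiples of 32. It assumes the `gguf` Python package from llama.cpp's `gguf-py` is installed and that you exported `VISUAL_GGUF_PATH`; the reader API may differ slightly between versions:

```python
import os

from gguf import GGUFReader  # provided by llama.cpp's gguf-py package

VISUAL_GGUF_PATH = os.getenv("VISUAL_GGUF_PATH")
if not VISUAL_GGUF_PATH:
    raise ValueError("env var VISUAL_GGUF_PATH is unset!")

reader = GGUFReader(VISUAL_GGUF_PATH)
for tensor in reader.tensors:
    # Flag any dimension that a block size of 32 cannot split evenly; this is a
    # rough proxy, since the block-quant constraint applies to the row dimension.
    dims = [int(d) for d in tensor.shape]
    bad = [d for d in dims if d % 32 != 0]
    if bad:
        print(f"{tensor.name}: shape {dims} has dims not divisible by 32: {bad}")
```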


### 5. Running the Model in llama.cpp
Build llama.cpp normally; you should have a target binary named `llama-mtmd-cli`, to which you pass the two GGUF files built above (the LLM and the visual projector). An image such as the llama.cpp banner can then be supplied with the `--image` flag and a prompt with `-p`.

```bash
$ ./build/bin/llama-mtmd-cli -m $LLM_GGUF_PATH \
    --mmproj $VISUAL_GGUF_PATH \
    -c 16384 \
    --temp 0
```
