# LLaVA

Currently this implementation supports [llava-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-7b) variants,
as well as [llava-v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2) variants.

The pre-converted [7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
and [13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
models are available.
For llava-1.6, a variety of prepared gguf models are available as well ([7b-34b](https://huggingface.co/cmp-nct/llava-1.6-gguf)).

After the API is confirmed, more models will be supported / uploaded.

## Usage
Build the `llama-mtmd-cli` binary.

After building, run: `./llama-mtmd-cli` to see the usage. For example:

```sh
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna
```

**note**: A lower temperature like 0.1 is recommended for better quality. Add `--temp 0.1` to the command to do so.
**note**: For GPU offloading, use the `-ngl` flag as usual.
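
For example, combining both options with the command above (the `-ngl 99` layer count is just an illustrative value; pick whatever fits your GPU):

```sh
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna \
    --temp 0.1 -ngl 99
```
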
## LLaVA 1.5

1. Clone a LLaVA and a CLIP model ([available options](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). For example:

```sh
git clone https://huggingface.co/liuhaotian/llava-v1.5-7b

git clone https://huggingface.co/openai/clip-vit-large-patch14-336
```

2. Install the required Python packages:

```sh
pip install -r tools/mtmd/requirements.txt
```

3. Use `llava_surgery.py` to split the LLaVA model into its LLaMA and multimodal projector constituents:

```sh
python ./tools/mtmd/llava_surgery.py -m ../llava-v1.5-7b
```

4. Use `convert_image_encoder_to_gguf.py` to convert the LLaVA image encoder to GGUF:

```sh
python ./tools/mtmd/convert_image_encoder_to_gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
```

5. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:

```sh
python ./examples/convert_legacy_llama.py ../llava-v1.5-7b --skip-unknown
```

Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.
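
You can now do a quick test with an image of your choice. The image path and prompt below are placeholders; run `./llama-mtmd-cli` without arguments to check the exact flag names available in your build:

```sh
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna \
    --image path/to/an/image.jpg \
    -p "Describe the image in detail."
```
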

## LLaVA 1.6 gguf conversion
1) First clone a LLaVA 1.6 model:
```console
git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
```

2) Install the required Python packages:

```sh
pip install -r tools/mtmd/requirements.txt
```

3) Use `llava_surgery_v2.py`, which also supports llava-1.5 variants, for both pytorch and safetensor models:
```console
python tools/mtmd/llava_surgery_v2.py -C -m ../llava-v1.6-vicuna-7b/
```
- you will find a llava.projector and a llava.clip file in your model directory
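
A quick way to confirm that both files were produced:

```console
ls ../llava-v1.6-vicuna-7b/llava.projector ../llava-v1.6-vicuna-7b/llava.clip
```
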

4) Copy the llava.clip file into a subdirectory (like vit), rename it to pytorch_model.bin and add a fitting vit configuration to the directory:
```console
mkdir vit
cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
cp ../llava-v1.6-vicuna-7b/llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
```

5) Create the visual gguf model:
```console
python ./tools/mtmd/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
```
- This is similar to llava-1.5; the difference is that we tell the encoder that we are working with the pure vision model part of CLIP

6) Then convert the model to gguf format:
```console
python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown
```

7) And finally we can run the llava cli using the 1.6 model version:
```console
./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf
```

**note** llava-1.6 needs more context than llava-1.5; at least 3000 is needed (just run it at `-c 4096`)
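
For example, running the command from step `7)` with a 4096-token context window:

```console
./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf -c 4096
```
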

**note** llava-1.6 greatly benefits from batched prompt processing (defaults work)

**note** if the language model in step `6)` is incompatible with the legacy conversion script, the easiest way to handle the LLM conversion is to load the model in transformers and export only the LLM from the llava-next model.

```python
import transformers

# paths are placeholders: the original llava-next checkpoint and the directory to export the LLM to
model_path = ...
llm_export_path = ...

# load the tokenizer and the full llava-next model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)

# save the tokenizer together with only the language-model part
tokenizer.save_pretrained(llm_export_path)
model.language_model.save_pretrained(llm_export_path)
```

Then, you can convert the LLM using the `convert_hf_to_gguf.py` script, which handles more LLM architectures.
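
A minimal sketch, assuming the export directory from the snippet above (replace the path with your `llm_export_path`; `--outfile` and `--outtype` are optional):

```console
python ./convert_hf_to_gguf.py path/to/llm_export --outfile path/to/llm_export/model-f16.gguf --outtype f16
```
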

## Chat template

For llava-1.5 and llava-1.6, you need to use the `vicuna` chat template. Simply add `--chat-template vicuna` to activate this template.


## How to know if you are running in llava-1.5 or llava-1.6 mode

When running `llama-mtmd-cli` you will see visual information printed right before the prompt is processed:

**Llava-1.5:**
`encode_image_with_clip: image embedding created: 576 tokens`

**Llava-1.6 (anything above 576):**
`encode_image_with_clip: image embedding created: 2880 tokens`


Alternatively, just note how many "tokens" have been used for your prompt; it will also show 1000+ tokens for llava-1.6.