# LLaVA

Currently this implementation supports [llava-v1.5](https://huggingface.co/liuhaotian/llava-v1.5-7b) variants,
as well as [llava-v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2) variants.

The pre-converted [7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
and [13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
models are available.
For llava-1.6, a variety of prepared gguf models are available as well ([7b-34b](https://huggingface.co/cmp-nct/llava-1.6-gguf)).

After the API is confirmed, more models will be supported / uploaded.

## Usage
Build the `llama-mtmd-cli` binary.

After building, run: `./llama-mtmd-cli` to see the usage. For example:

```sh
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna
```

**note**: A lower temperature like 0.1 is recommended for better quality. Add `--temp 0.1` to the command to do so.
**note**: For GPU offloading, use the `-ngl` flag as usual.
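
For example, combining both options with the command above (the `-ngl 99` layer count is just an illustrative value; pick whatever fits your GPU):

```sh
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna \
    --temp 0.1 -ngl 99
```
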
## LLaVA 1.5

1. Clone a LLaVA and a CLIP model ([available options](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). For example:

```sh
git clone https://huggingface.co/liuhaotian/llava-v1.5-7b

git clone https://huggingface.co/openai/clip-vit-large-patch14-336
```

2. Install the required Python packages:

```sh
pip install -r tools/mtmd/requirements.txt
```

3. Use `llava_surgery.py` to split the LLaVA model into its LLaMA and multimodal projector constituents:

```sh
python ./tools/mtmd/llava_surgery.py -m ../llava-v1.5-7b
```

4. Use `convert_image_encoder_to_gguf.py` to convert the LLaVA image encoder to GGUF:

```sh
python ./tools/mtmd/convert_image_encoder_to_gguf.py -m ../clip-vit-large-patch14-336 --llava-projector ../llava-v1.5-7b/llava.projector --output-dir ../llava-v1.5-7b
```

5. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:

```sh
python ./examples/convert_legacy_llama.py ../llava-v1.5-7b --skip-unknown
```

Now both the LLaMA part and the image encoder are in the `llava-v1.5-7b` directory.
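
You can now do a quick test with an image of your choice. The image path and prompt below are placeholders; run `./llama-mtmd-cli` without arguments to check the exact flag names available in your build:

```sh
./llama-mtmd-cli -m ../llava-v1.5-7b/ggml-model-f16.gguf \
    --mmproj ../llava-v1.5-7b/mmproj-model-f16.gguf \
    --chat-template vicuna \
    --image path/to/an/image.jpg \
    -p "Describe the image in detail."
```
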

## LLaVA 1.6 gguf conversion
1) First clone a LLaVA 1.6 model:
```console
git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
```

2) Install the required Python packages:

```sh
pip install -r tools/mtmd/requirements.txt
```

3) Use `llava_surgery_v2.py`, which also supports llava-1.5 variants, for both pytorch and safetensor models:
```console
python tools/mtmd/llava_surgery_v2.py -C -m ../llava-v1.6-vicuna-7b/
```
- you will find a llava.projector and a llava.clip file in your model directory
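
A quick way to confirm that both files were produced:

```console
ls ../llava-v1.6-vicuna-7b/llava.projector ../llava-v1.6-vicuna-7b/llava.clip
```
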

4) Copy the llava.clip file into a subdirectory (like vit), rename it to pytorch_model.bin and add a fitting vit configuration to the directory:
```console
mkdir vit
cp ../llava-v1.6-vicuna-7b/llava.clip vit/pytorch_model.bin
cp ../llava-v1.6-vicuna-7b/llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
```

5) Create the visual gguf model:
```console
python ./tools/mtmd/convert_image_encoder_to_gguf.py -m vit --llava-projector vit/llava.projector --output-dir vit --clip-model-is-vision
```
- This is similar to llava-1.5; the difference is that we tell the encoder that we are working with the pure vision model part of CLIP

6) Then convert the model to gguf format:
```console
python ./examples/convert_legacy_llama.py ../llava-v1.6-vicuna-7b/ --skip-unknown
```

7) And finally we can run the llava cli using the 1.6 model version:
```console
./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf
```

**note** llava-1.6 needs more context than llava-1.5; at least 3000 is needed (just run it at `-c 4096`)
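
For example, running the command from step `7)` with a 4096-token context window:

```console
./llama-mtmd-cli -m ../llava-v1.6-vicuna-7b/ggml-model-f16.gguf --mmproj vit/mmproj-model-f16.gguf -c 4096
```
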

**note** llava-1.6 greatly benefits from batched prompt processing (defaults work)

**note** if the language model in step `6)` is incompatible with the legacy conversion script, the easiest way to handle the LLM conversion is to load the model in transformers and export only the LLM from the llava-next model.

```python
import transformers

# paths are placeholders: the original llava-next checkpoint and the directory to export the LLM to
model_path = ...
llm_export_path = ...

# load the tokenizer and the full llava-next model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForImageTextToText.from_pretrained(model_path)

# save the tokenizer together with only the language-model part
tokenizer.save_pretrained(llm_export_path)
model.language_model.save_pretrained(llm_export_path)
```

Then, you can convert the LLM using the `convert_hf_to_gguf.py` script, which handles more LLM architectures.
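
A minimal sketch, assuming the export directory from the snippet above (replace the path with your `llm_export_path`; `--outfile` and `--outtype` are optional):

```console
python ./convert_hf_to_gguf.py path/to/llm_export --outfile path/to/llm_export/model-f16.gguf --outtype f16
```
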

## Chat template

For llava-1.5 and llava-1.6, you need to use the `vicuna` chat template. Simply add `--chat-template vicuna` to activate this template.


## How to know if you are running in llava-1.5 or llava-1.6 mode

When running `llama-mtmd-cli` you will see visual information printed right before the prompt is processed:

**Llava-1.5:**
`encode_image_with_clip: image embedding created: 576 tokens`

**Llava-1.6 (anything above 576):**
`encode_image_with_clip: image embedding created: 2880 tokens`


Alternatively, just note how many "tokens" have been used for your prompt; it will also show 1000+ tokens for llava-1.6.