# MobileVLM

Currently this implementation supports the [MobileVLM-1.7B](https://huggingface.co/mtgv/MobileVLM-1.7B) / [MobileVLM_V2-1.7B](https://huggingface.co/mtgv/MobileVLM_V2-1.7B) variants.

For more information, please visit [Meituan-AutoML/MobileVLM](https://github.com/Meituan-AutoML/MobileVLM).

The implementation is based on LLaVA and is compatible with both LLaVA and MobileVLM. The usage is basically the same as for LLaVA.

Notice: the overall inference process is the same for both **MobileVLM** and **MobileVLM_V2**, but the model conversion differs slightly. Using **MobileVLM-1.7B** as the example, the conversion step that differs is called out below.

## Usage

Build the `llama-mtmd-cli` binary.
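
A minimal sketch of the standard CMake build from the repository root (adjust the generator and flags for your platform):

```sh
# Sketch: standard CMake build; llama-mtmd-cli ends up with the other
# binaries (the exact output path may vary with your configuration).
cmake -B build
cmake --build build --config Release -j
ls build/bin/llama-mtmd-cli
```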

After building, run `./llama-mtmd-cli` to see the usage. For example:

```sh
./llama-mtmd-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
    --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
    --chat-template deepseek
```

## Model conversion

1. Clone `MobileVLM-1.7B` and `clip-vit-large-patch14-336` locally:

```sh
git clone https://huggingface.co/mtgv/MobileVLM-1.7B

git clone https://huggingface.co/openai/clip-vit-large-patch14-336
```

2. Use `llava_surgery.py` to split the LLaVA model into its LLaMA and multimodal projector constituents:

```sh
python ./tools/mtmd/llava_surgery.py -m path/to/MobileVLM-1.7B
```
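
A quick sanity check (assumed layout: the surgery script writes the projector weights next to the model, and the `llava.projector` file is what the next step consumes):

```sh
# Assumed output location; step 3 below reads this file.
ls path/to/MobileVLM-1.7B/llava.projector
```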

3. Use `convert_image_encoder_to_gguf.py` with `--projector-type ldp` (for **V2**, use `--projector-type ldpv2`) to convert the LLaVA image encoder to GGUF:

```sh
python ./tools/mtmd/convert_image_encoder_to_gguf.py \
    -m path/to/clip-vit-large-patch14-336 \
    --llava-projector path/to/MobileVLM-1.7B/llava.projector \
    --output-dir path/to/MobileVLM-1.7B \
    --projector-type ldp
```

For **MobileVLM_V2-1.7B**, the same conversion uses `ldpv2` instead:

```sh
python ./tools/mtmd/convert_image_encoder_to_gguf.py \
    -m path/to/clip-vit-large-patch14-336 \
    --llava-projector path/to/MobileVLM-1.7B_V2/llava.projector \
    --output-dir path/to/MobileVLM-1.7B_V2 \
    --projector-type ldpv2
```

4. Use `examples/convert_legacy_llama.py` to convert the LLaMA part of LLaVA to GGUF:

```sh
python ./examples/convert_legacy_llama.py path/to/MobileVLM-1.7B --skip-unknown
```

5. Use `llama-quantize` to convert the LLaMA part's data type from `fp32` to `q4_k`:

```sh
./llama-quantize path/to/MobileVLM-1.7B/ggml-model-F32.gguf path/to/MobileVLM-1.7B/ggml-model-q4_k.gguf q4_k_s
```

Now both the LLaMA part and the image encoder are in the `MobileVLM-1.7B` directory.
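
A quick check of the expected artifacts (file names follow from the steps above; the conversion may leave additional files in the directory):

```sh
# Expected GGUF artifacts after conversion and quantization (assumed names):
ls path/to/MobileVLM-1.7B
# ggml-model-F32.gguf  ggml-model-q4_k.gguf  mmproj-model-f16.gguf  ...
```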

## Android compile and run
### compile
refer to `tools/mtmd/android/build_64.sh`
```sh
mkdir tools/mtmd/android/build_64
cd tools/mtmd/android/build_64
../build_64.sh
```
### run on Android
refer to `tools/mtmd/android/adb_run.sh` and modify the resources' `name` and `path` as needed; a sketch of the flow is shown below
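
Roughly, the script pushes the binary and model files to the device and then runs the CLI there (a sketch with assumed file names and paths; the actual script may differ):

```sh
# Sketch of the adb flow (assumed file names/paths; adapt to adb_run.sh):
adb push build_64/bin/llama-mtmd-cli /data/local/tmp/
adb push MobileVLM-1.7B/ggml-model-q4_k.gguf /data/local/tmp/
adb push MobileVLM-1.7B/mmproj-model-f16.gguf /data/local/tmp/
adb push demo.jpg /data/local/tmp/
adb shell chmod +x /data/local/tmp/llama-mtmd-cli
# Then invoke the CLI on-device, as in the cases below.
```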

## Some results on Android with a `Snapdragon 888` chip
### case 1
**input**
```sh
/data/local/tmp/llama-mtmd-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/demo.jpg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:"
```
**output**
```sh
encode_image_with_clip: image encoded in 21148.71 ms by CLIP ( 146.87 ms per image patch)
 Susan Wise Bauer
llama_print_timings: load time = 23574.72 ms
llama_print_timings: sample time = 1.24 ms / 6 runs ( 0.21 ms per token, 4850.44 tokens per second)
llama_print_timings: prompt eval time = 12460.15 ms / 246 tokens ( 50.65 ms per token, 19.74 tokens per second)
llama_print_timings: eval time = 424.86 ms / 6 runs ( 70.81 ms per token, 14.12 tokens per second)
llama_print_timings: total time = 34731.93 ms
```
### case 2
**input**
```sh
/data/local/tmp/llama-mtmd-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/cat.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:"
```
**output**
```sh
encode_image_with_clip: image encoded in 21149.51 ms by CLIP ( 146.87 ms per image patch)
 The image depicts a cat sitting in the grass near some tall green plants.
llama_print_timings: load time = 23257.32 ms
llama_print_timings: sample time = 5.25 ms / 18 runs ( 0.29 ms per token, 3430.53 tokens per second)
llama_print_timings: prompt eval time = 11900.73 ms / 232 tokens ( 51.30 ms per token, 19.49 tokens per second)
llama_print_timings: eval time = 1279.03 ms / 18 runs ( 71.06 ms per token, 14.07 tokens per second)
llama_print_timings: total time = 34570.79 ms
```

## Some results on Android with a `Snapdragon 778G` chip
### MobileVLM-1.7B case
#### mtmd-cli release-b2005
**input**
```sh
/data/local/tmp/llama-mtmd-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/many_llamas.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:"
```
**output**
```sh
encode_image_with_clip: image encoded in 18728.52 ms by CLIP ( 130.06 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that? ASSISTANT:

 A group of llamas are standing in a green pasture.

llama_print_timings: load time = 20357.33 ms
llama_print_timings: sample time = 2.96 ms / 14 runs ( 0.21 ms per token, 4734.53 tokens per second)
llama_print_timings: prompt eval time = 8119.49 ms / 191 tokens ( 42.51 ms per token, 23.52 tokens per second)
llama_print_timings: eval time = 1005.75 ms / 14 runs ( 71.84 ms per token, 13.92 tokens per second)
llama_print_timings: total time = 28038.34 ms / 205 tokens
```
#### mtmd-cli latest-version
**input**

Just the same as above.

**output** (seems to be much slower)
```sh
encode_image_with_clip: image embedding created: 144 tokens

encode_image_with_clip: image encoded in 288268.88 ms by CLIP ( 2001.87 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that? ASSISTANT:

 It is a group of sheep standing together in a grass field.

llama_print_timings: load time = 818120.91 ms
llama_print_timings: sample time = 3.44 ms / 14 runs ( 0.25 ms per token, 4067.40 tokens per second)
llama_print_timings: prompt eval time = 529274.69 ms / 191 tokens ( 2771.07 ms per token, 0.36 tokens per second)
llama_print_timings: eval time = 43894.02 ms / 13 runs ( 3376.46 ms per token, 0.30 tokens per second)
llama_print_timings: total time = 865441.76 ms / 204 tokens
```
### MobileVLM_V2-1.7B case
#### mtmd-cli release-b2005
**input**

Just the same as above.

**output**
```sh
encode_image_with_clip: image encoded in 20609.61 ms by CLIP ( 143.12 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that? ASSISTANT:

 This image captures a lively scene of 20 llamas in motion on an expansive, grassy field. The llama is scattered across the landscape with some standing and others sitting down as if taking rest or observing their surroundings from different vantage points within this verdant setting.

The background offers glimpses into a picturesque town nestled amidst hills under an overcast sky, adding depth to the scene while also emphasizing that distance between these llama and human-made structures like houses or roads in which they roam freely without any barriers around them. The image is framed by text at both right angles on white backgrounds against a contrasting blue backdrop with green foliage, further drawing attention to the llamas amidst their natural habitat while also inviting viewers into this picturesque landscape within town limits of Alta Llama

llama_print_timings: load time = 22406.77 ms
llama_print_timings: sample time = 49.26 ms / 186 runs ( 0.26 ms per token, 3776.27 tokens per second)
llama_print_timings: prompt eval time = 9044.54 ms / 191 tokens ( 47.35 ms per token, 21.12 tokens per second)
llama_print_timings: eval time = 14497.49 ms / 186 runs ( 77.94 ms per token, 12.83 tokens per second)
llama_print_timings: total time = 44411.01 ms / 377 tokens
```

## Orin compile and run
### compile
```sh
make GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_87 -j 32
```
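
If your checkout no longer ships the Makefile build, a CMake counterpart would look roughly like this (a sketch assuming the standard `GGML_CUDA` option; architecture `87` matches `sm_87`):

```sh
# Sketch: CMake equivalent of the Makefile invocation above (assumed options).
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87
cmake --build build --config Release -j 32
```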
### run on Orin
#### case 1
**input**
```sh
./llama-mtmd-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    --image /data/local/tmp/demo.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:" \
    --n-gpu-layers 999
```
**output**
```sh
encode_image_with_clip: image encoded in 296.62 ms by CLIP ( 2.06 ms per image patch)

 Susan Wise Bauer

llama_print_timings: load time = 1067.64 ms
llama_print_timings: sample time = 1.53 ms / 6 runs ( 0.25 ms per token, 3934.43 tokens per second)
llama_print_timings: prompt eval time = 306.84 ms / 246 tokens ( 1.25 ms per token, 801.72 tokens per second)
llama_print_timings: eval time = 91.50 ms / 6 runs ( 15.25 ms per token, 65.58 tokens per second)
llama_print_timings: total time = 1352.63 ms / 252 tokens
```

#### case 2
**input**
```sh
./llama-mtmd-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    --image /data/local/tmp/cat.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:" \
    --n-gpu-layers 999
```
**output**
```sh
encode_image_with_clip: image encoded in 302.15 ms by CLIP ( 2.10 ms per image patch)

 The image features a cat lying in the grass.

llama_print_timings: load time = 1057.07 ms
llama_print_timings: sample time = 3.27 ms / 11 runs ( 0.30 ms per token, 3360.83 tokens per second)
llama_print_timings: prompt eval time = 213.60 ms / 232 tokens ( 0.92 ms per token, 1086.14 tokens per second)
llama_print_timings: eval time = 166.65 ms / 11 runs ( 15.15 ms per token, 66.01 tokens per second)
llama_print_timings: total time = 1365.47 ms / 243 tokens
```

## Running on Intel(R) Core(TM) i7-10750H
### operating system
Ubuntu 22.04
### compile
```sh
make -j32
```
### MobileVLM-1.7B case
**input**
```sh
./llama-mtmd-cli \
    -m /path/to/ggml-model-q4_k.gguf \
    --mmproj /path/to/mmproj-model-f16.gguf \
    --image /path/to/many_llamas.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:"
```
**output**
```sh
encode_image_with_clip: image embedding created: 144 tokens

encode_image_with_clip: image encoded in 2730.94 ms by CLIP ( 18.96 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that?ASSISTANT:

 A group of llamas are walking together in a field.

llama_print_timings: load time = 5506.60 ms
llama_print_timings: sample time = 0.44 ms / 13 runs ( 0.03 ms per token, 29545.45 tokens per second)
llama_print_timings: prompt eval time = 2031.58 ms / 190 tokens ( 10.69 ms per token, 93.52 tokens per second)
llama_print_timings: eval time = 438.92 ms / 12 runs ( 36.58 ms per token, 27.34 tokens per second)
llama_print_timings: total time = 5990.25 ms / 202 tokens
```

### MobileVLM_V2-1.7B case
**input**

Just the same as above.

**output**
```sh
encode_image_with_clip: image embedding created: 144 tokens

encode_image_with_clip: image encoded in 3223.89 ms by CLIP ( 22.39 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that?ASSISTANT:

 The image captures a tranquil scene in a park, where a group of approximately 20 llamas are gathered. The llamas, a mix of white and black, are standing in a line, their black and white patterns contrasting with the lush green grass of the park. The lamas are arranged in a line, suggesting a social order.

The park itself is lush and green, with trees dotting the landscape in the background. A sign reading "Llamas Tico Ana" is also visible in the image, possibly indicating the location or the breed of the llamas. The image seems to be taken from a distance, providing a wide view of the scene and the surrounding environment.

The llamas' positions relative to each other, the sign, and the trees create a harmonious composition. The image does not contain any discernible text. The overall scene is one of peace and natural beauty, with the llamas in their natural habitat, surrounded by the vibrant colors and lush greenery of the park.

llama_print_timings: load time = 6642.61 ms
llama_print_timings: sample time = 8.15 ms / 223 runs ( 0.04 ms per token, 27358.61 tokens per second)
llama_print_timings: prompt eval time = 2475.07 ms / 190 tokens ( 13.03 ms per token, 76.77 tokens per second)
llama_print_timings: eval time = 8760.60 ms / 222 runs ( 39.46 ms per token, 25.34 tokens per second)
llama_print_timings: total time = 15513.95 ms / 412 tokens
```

## Running on Intel(R) Core(TM) Ultra 7 115H
### operating system
Windows 11
### compile
```sh
make -j32
```
### MobileVLM-1.7B case
**input**
```sh
./llama-mtmd-cli \
    -m /path/to/ggml-model-q4_k.gguf \
    --mmproj /path/to/mmproj-model-f16.gguf \
    --image /path/to/many_llamas.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:"
```
**output**
```sh
encode_image_with_clip: image encoded in 4902.81 ms by CLIP ( 34.05 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that? ASSISTANT:

 The image features a group of brown and white llamas standing in a grassy field.

llama_print_timings: load time = 7441.06 ms
llama_print_timings: sample time = 0.72 ms / 19 runs ( 0.04 ms per token, 26279.39 tokens per second)
llama_print_timings: prompt eval time = 2090.71 ms / 191 tokens ( 10.95 ms per token, 91.36 tokens per second)
llama_print_timings: eval time = 512.35 ms / 18 runs ( 28.46 ms per token, 35.13 tokens per second)
llama_print_timings: total time = 7987.23 ms / 209 tokens
```

### MobileVLM_V2-1.7B case
**input**

Just the same as above.

**output**
```sh
encode_image_with_clip: image encoded in 4682.44 ms by CLIP ( 32.52 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that? ASSISTANT:

 This image captures a lively scene of a group of 14 llamas in a grassy field. The llamas, with their distinctive black and white coats, are standing and walking in a line, seemingly engaged in a social activity. One
 of them, possibly the first in the line, has its back turned, perhaps observing something in the distance.

The llama in the front of the line stands out due to its black and white coloring, which is quite unusual for llama patterns. The llama in the front also seems to be more aware of its surroundings, as it faces the camera, giving a sense of engagement with the viewer.

The image is taken from the side of the llama, providing a clear view of the llama in the front and its companions. The lameness in the llama in
 front is not visible, indicating that it might not be the main focus of the photo.

The background of the image features a grassy field, with a fence and a tree visible in the distance. The tree appears to be bare, suggesting that it might be during a time of year when most trees are dormant or have shed their leaves.

llama_print_timings: load time = 7015.35 ms
llama_print_timings: sample time = 10.61 ms / 256 runs ( 0.04 ms per token, 24119.09 tokens per second)
llama_print_timings: prompt eval time = 2052.45 ms / 191 tokens ( 10.75 ms per token, 93.06 tokens per second)
llama_print_timings: eval time = 7259.43 ms / 255 runs ( 28.47 ms per token, 35.13 tokens per second)
llama_print_timings: total time = 14371.19 ms / 446 tokens
```

## TODO

- [x] Support non-CPU backend for the new operators, such as `depthwise`, `hardswish`, `hardsigmoid`
- [ ] Optimize LDP projector performance

  - Optimize the structure definition to avoid unnecessary memory rearrangements and reduce the use of `ggml_permute_cpy`;
  - Optimize operator implementations (ARM CPU/NVIDIA GPU), such as depthwise conv, hardswish, hardsigmoid, etc.
- [x] Run MobileVLM on `Jetson Orin`
- [ ] Support more model variants, such as `MobileVLM-3B`.

## Contributors
```sh
zhangjidong05, yangyang260, huyiming03, chenxiaotao03, ZiangWu-77
```