# llama.cpp/example/tts
This example demonstrates the Text To Speech feature. It uses a
[model](https://www.outeai.com/blog/outetts-0.2-500m) from
[outeai](https://www.outeai.com/).

## Quickstart
If you have built llama.cpp with SSL support you can simply run the
following command and the required models will be downloaded automatically:
```console
$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav
```
For details about the models and how to convert them to the required format
see the following sections.

### Model conversion
Check out or download the repository that contains the LLM model:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd
```
Convert the model to .gguf format:
```console
(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
    --outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16
```
The generated model will be `models/outetts-0.2-0.5B-f16.gguf`.

We can optionally quantize this to Q8_0 using the following command:
```console
$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \
    models/outetts-0.2-0.5B-q8_0.gguf q8_0
```
The quantized model will be `models/outetts-0.2-0.5B-q8_0.gguf`.
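
To get a feel for the size saving, Q8_0 in ggml stores blocks of 32 int8 quants plus one f16 scale (34 bytes per 32 weights), versus 2 bytes per weight for f16. A rough back-of-the-envelope sketch (the 500M parameter count is approximate, and real .gguf files also contain metadata and some tensors kept at higher precision):

```python
# Rough size estimate: f16 vs Q8_0 (ggml's Q8_0 block layout:
# 32 int8 quants + one f16 scale = 34 bytes per 32 weights).
def bytes_per_weight(fmt: str) -> float:
    if fmt == "f16":
        return 2.0
    if fmt == "q8_0":
        return 34 / 32  # 1.0625 bytes per weight
    raise ValueError(fmt)

n_params = 500_000_000  # OuteTTS-0.2-500M, approximate
f16_gb = n_params * bytes_per_weight("f16") / 1e9
q8_gb = n_params * bytes_per_weight("q8_0") / 1e9
print(f"f16: ~{f16_gb:.2f} GB, q8_0: ~{q8_gb:.2f} GB")
```

So quantizing to Q8_0 roughly halves the file size while keeping 8 bits of precision per weight.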

Next we do something similar for the audio decoder. First download or check out
the model for the voice decoder:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull
$ popd
```
This model file is a PyTorch checkpoint (.ckpt) and we first need to convert it
to Hugging Face format:
```console
(venv) python tools/tts/convert_pt_to_hf.py \
    models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt
...
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json
Config has been saved to models/WavTokenizer-large-speech-75token/config.json
```
Then we can convert the Hugging Face format to gguf:
```console
(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \
    --outfile models/wavtokenizer-large-75-f16.gguf --outtype f16
...
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf
```

### Running the example

With both of the models generated, the LLM model and the voice decoder model,
we can run the example:
```console
$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75-f16.gguf \
    -p "Hello world"
...
main: audio written to file 'output.wav'
```
The output.wav file will contain the audio of the prompt. This can be heard
by playing the file with a media player. On Linux the following command will
play the audio:
```console
$ aplay output.wav
```
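
If `aplay` is not available (for example on macOS), the generated file can also be inspected with Python's standard `wave` module. A small sketch that reports the basic properties of the output file:

```python
import os
import wave

def wav_info(path: str) -> dict:
    """Return basic properties of a WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate_hz": rate,
            "n_samples": frames,
            "duration_s": frames / rate,
        }

# Only inspect the file if the example has actually been run.
if os.path.exists("output.wav"):
    print(wav_info("output.wav"))
```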

### Running the example with llama-server
Running this example with `llama-server` is also possible and requires two
server instances to be started. One will serve the LLM model and the other
will serve the voice decoder model.

The LLM model server can be started with the following command:
```console
$ ./build/bin/llama-server -m ./models/outetts-0.2-0.5B-q8_0.gguf --port 8020
```

And the voice decoder model server can be started using:
```console
$ ./build/bin/llama-server -m ./models/wavtokenizer-large-75-f16.gguf --port 8021 --embeddings --pooling none
```

Then we can run [tts-outetts.py](tts-outetts.py) to generate the audio.
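
Conceptually the script drives the two servers in sequence: the prompt goes to the LLM server, which generates audio codes, and the codes go to the voice decoder server running in embedding mode, which returns the spectrogram embeddings that are converted to audio. A minimal sketch of how the two requests could be shaped; the endpoint paths and JSON fields below are assumptions for illustration, not copied from the script, so see [tts-outetts.py](tts-outetts.py) for the real payloads:

```python
# Sketch of the two-server flow. Endpoint names and fields are
# illustrative assumptions; the real payloads live in tts-outetts.py.

def llm_request(prompt: str) -> tuple[str, dict]:
    # Step 1: ask the LLM server (port 8020) to generate audio codes
    # for the prompt.
    return "/completion", {"prompt": prompt, "n_predict": 1024}

def decoder_request(codes: list[int]) -> tuple[str, dict]:
    # Step 2: send the codes to the voice decoder server (port 8021),
    # which was started with --embeddings and returns the embeddings
    # used to build the spectrogram.
    return "/embeddings", {"input": codes}

path, payload = llm_request("Hello world")
print(path, payload)
```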

First create a virtual environment for python and install the required
dependencies (this only needs to be done once):
```console
$ python3 -m venv venv
$ source venv/bin/activate
(venv) pip install requests numpy
```

And then run the python script using:
```console
(venv) python ./tools/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world"
spectrogram generated: n_codes: 90, n_embd: 1282
converting to audio ...
audio generated: 28800 samples
audio written to file "output.wav"
```
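
The numbers in this output are consistent with each other: each code decodes to a fixed number of samples at 24 kHz, so 90 codes yield 28800 samples, i.e. 1.2 seconds of audio. A quick check, where the 320 samples-per-code and 24 kHz figures are inferred from the checkpoint name `wavtokenizer_large_speech_320_24k`:

```python
n_codes = 90
samples_per_code = 320   # inferred from wavtokenizer_large_speech_320_24k
sample_rate = 24_000     # 24 kHz, likewise inferred from the name

n_samples = n_codes * samples_per_code
duration_s = n_samples / sample_rate
codes_per_second = sample_rate / samples_per_code
print(n_samples, duration_s, codes_per_second)  # 28800 1.2 75.0
```

The 75 codes per second also matches the `75token` in the decoder repository's name.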
And to play the audio we can again use aplay or any other media player:
```console
$ aplay output.wav
```