# llama.cpp/example/tts
This example demonstrates the Text To Speech feature. It uses a
[model](https://www.outeai.com/blog/outetts-0.2-500m) from
[outeai](https://www.outeai.com/).

## Quickstart
If you have built llama.cpp with SSL support you can simply run the
following command and the required models will be downloaded automatically:
```console
$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav
```
For details about the models and how to convert them to the required format
see the following sections.

### Model conversion
Check out or download the repository that contains the LLM model:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd
```
Convert the model to .gguf format:
```console
(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
    --outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16
```
The generated model will be `models/outetts-0.2-0.5B-f16.gguf`.

We can optionally quantize this to Q8_0 using the following command:
```console
$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \
    models/outetts-0.2-0.5B-q8_0.gguf q8_0
```
The quantized model will be `models/outetts-0.2-0.5B-q8_0.gguf`.
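
To get a feel for the size saving, Q8_0 in ggml stores blocks of 32 int8 quants plus one f16 scale (34 bytes per 32 weights), versus 2 bytes per weight for f16. A rough back-of-the-envelope sketch (the 500M parameter count is approximate, and real .gguf files also contain metadata and some tensors kept at higher precision):

```python
# Rough size estimate: f16 vs Q8_0 (ggml's Q8_0 block layout:
# 32 int8 quants + one f16 scale = 34 bytes per 32 weights).
def bytes_per_weight(fmt: str) -> float:
    if fmt == "f16":
        return 2.0
    if fmt == "q8_0":
        return 34 / 32  # 1.0625 bytes per weight
    raise ValueError(fmt)

n_params = 500_000_000  # OuteTTS-0.2-500M, approximate
f16_gb = n_params * bytes_per_weight("f16") / 1e9
q8_gb = n_params * bytes_per_weight("q8_0") / 1e9
print(f"f16: ~{f16_gb:.2f} GB, q8_0: ~{q8_gb:.2f} GB")
```

So quantizing to Q8_0 roughly halves the file size while keeping 8 bits of precision per weight.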

Next we do something similar for the audio decoder. First download or check out
the model for the voice decoder:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull
$ popd
```
This model file is a PyTorch checkpoint (.ckpt) and we first need to convert it
to Hugging Face format:
```console
(venv) python tools/tts/convert_pt_to_hf.py \
    models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt
...
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json
Config has been saved to models/WavTokenizer-large-speech-75token/config.json
```
Then we can convert the Hugging Face format to gguf:
```console
(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \
    --outfile models/wavtokenizer-large-75-f16.gguf --outtype f16
...
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf
```

### Running the example

With both of the models generated, the LLM model and the voice decoder model,
we can run the example:
```console
$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75-f16.gguf \
    -p "Hello world"
...
main: audio written to file 'output.wav'
```
The output.wav file will contain the audio of the prompt. This can be heard
by playing the file with a media player. On Linux the following command will
play the audio:
```console
$ aplay output.wav
```
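
If `aplay` is not available (for example on macOS), the generated file can also be inspected with Python's standard `wave` module. A small sketch that reports the basic properties of the output file:

```python
import os
import wave

def wav_info(path: str) -> dict:
    """Return basic properties of a WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate_hz": rate,
            "n_samples": frames,
            "duration_s": frames / rate,
        }

# Only inspect the file if the example has actually been run.
if os.path.exists("output.wav"):
    print(wav_info("output.wav"))
```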

### Running the example with llama-server
Running this example with `llama-server` is also possible and requires two
server instances to be started. One will serve the LLM model and the other
will serve the voice decoder model.

The LLM model server can be started with the following command:
```console
$ ./build/bin/llama-server -m ./models/outetts-0.2-0.5B-q8_0.gguf --port 8020
```

And the voice decoder model server can be started using:
```console
$ ./build/bin/llama-server -m ./models/wavtokenizer-large-75-f16.gguf --port 8021 --embeddings --pooling none
```

Then we can run [tts-outetts.py](tts-outetts.py) to generate the audio.
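
Conceptually the script drives the two servers in sequence: the prompt goes to the LLM server, which generates audio codes, and the codes go to the voice decoder server running in embedding mode, which returns the spectrogram embeddings that are converted to audio. A minimal sketch of how the two requests could be shaped; the endpoint paths and JSON fields below are assumptions for illustration, not copied from the script, so see [tts-outetts.py](tts-outetts.py) for the real payloads:

```python
# Sketch of the two-server flow. Endpoint names and fields are
# illustrative assumptions; the real payloads live in tts-outetts.py.

def llm_request(prompt: str) -> tuple[str, dict]:
    # Step 1: ask the LLM server (port 8020) to generate audio codes
    # for the prompt.
    return "/completion", {"prompt": prompt, "n_predict": 1024}

def decoder_request(codes: list[int]) -> tuple[str, dict]:
    # Step 2: send the codes to the voice decoder server (port 8021),
    # which was started with --embeddings and returns the embeddings
    # used to build the spectrogram.
    return "/embeddings", {"input": codes}

path, payload = llm_request("Hello world")
print(path, payload)
```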

First create a virtual environment for python and install the required
dependencies (this only needs to be done once):
```console
$ python3 -m venv venv
$ source venv/bin/activate
(venv) pip install requests numpy
```

And then run the python script using:
```console
(venv) python ./tools/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world"
spectrogram generated: n_codes: 90, n_embd: 1282
converting to audio ...
audio generated: 28800 samples
audio written to file "output.wav"
```
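
The numbers in this output are consistent with each other: each code decodes to a fixed number of samples at 24 kHz, so 90 codes yield 28800 samples, i.e. 1.2 seconds of audio. A quick check, where the 320 samples-per-code and 24 kHz figures are inferred from the checkpoint name `wavtokenizer_large_speech_320_24k`:

```python
n_codes = 90
samples_per_code = 320   # inferred from wavtokenizer_large_speech_320_24k
sample_rate = 24_000     # 24 kHz, likewise inferred from the name

n_samples = n_codes * samples_per_code
duration_s = n_samples / sample_rate
codes_per_second = sample_rate / samples_per_code
print(n_samples, duration_s, codes_per_second)  # 28800 1.2 75.0
```

The 75 codes per second also matches the `75token` in the decoder repository's name.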
And to play the audio we can again use aplay or any other media player:
```console
$ aplay output.wav
```