Want to turn text back into speech?
Coqui Text to Speech can do that.
Rather than building it from the Python source, you can use a ready-made Docker image. This version works with a GPU.
$ docker pull ghcr.io/coqui-ai/tts
$ sudo docker images
REPOSITORY             TAG      IMAGE ID       CREATED         SIZE
ghcr.io/coqui-ai/tts   latest   bc76ce8030a6   11 months ago   10.3GB
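A small wrapper script reads the text file given as its first argument and writes a WAV file next to it:
text=$(cat "$1")
dir=$(pwd)
sudo docker run --rm -v "$dir:/root/tts-output" bc76ce8030a6 --text "$text" --model_name tts_models/en/jenny/jenny --out_path "/root/tts-output/$1.wav"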
A Python script can prepare the text (in practice perhaps using nltk), call the wrapper, then convert the result to an MP3:
import os
import subprocess

from pydub import AudioSegment  # MP3 export needs ffmpeg installed

text = "I'm going to generate that contract and that's it we've sent the quote to our customer they've accepted it and we've got a new contract based on the new terms."


def text_to_speech(text_file):
    # 'tts' is the wrapper script above, assumed to be executable and on PATH
    subprocess.run(['tts', text_file], check=True)
    # Convert the WAV written by the wrapper to MP3, then delete the WAV
    sound = AudioSegment.from_wav(f"{text_file}.wav")
    sound.export(f"{text_file}.mp3", format="mp3")
    os.remove(f"{text_file}.wav")


file = "temp.txt"
with open(file, 'w') as f:
    f.write(text)
text_to_speech(file)
os.remove(file)
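For longer texts it may help to feed the model a few sentences at a time. The sketch below is only a stand-in for what nltk would do properly: it splits on sentence-ending punctuation with a regular expression and groups sentences into chunks. The function name split_into_chunks and the max_chars default are assumptions for illustration, not anything from Coqui or nltk.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole.
    """
    # Naive split after ., ! or ? followed by whitespace.
    # nltk.sent_tokenize would handle abbreviations and the like far better.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be written to its own text file and passed to text_to_speech in turn, with pydub joining the resulting audio segments afterwards.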
The speech is pretty good, but without customisation the pronunciation is sometimes a bit odd, such as "vid-i-o" for video or "min-dful" for mindful. Also, a multilingual model will sometimes start rambling in what sounds like someone drunk speaking an unknown foreign language.
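One crude workaround, short of customising the model, is to respell troublesome words before the text reaches the model. This is a sketch only: the respellings in the table are guesses that would need tuning by ear, and the names RESPELLINGS and apply_respellings are made up for illustration rather than being part of Coqui TTS.

```python
import re

# Hypothetical respellings, to be tuned by listening to the output.
# Keys must be lowercase.
RESPELLINGS = {
    "video": "viddy-oh",
    "mindful": "mind-full",
}


def apply_respellings(text, respellings=RESPELLINGS):
    """Replace whole words (case-insensitively) with their respelt forms."""
    if not respellings:
        return text
    # \b anchors keep "video" from matching inside "videos"
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in respellings) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: respellings[m.group(0).lower()], text)
```

Calling apply_respellings(text) just before writing the temporary text file would slot this into the script above. Whether a given respelling actually sounds better depends on the model, so expect some trial and error.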
Warning: if processing gets too intensive, this sort of thing can bring even a high-spec workstation to a standstill.
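One possible mitigation is to cap what the container may use with the --cpus and --memory options of docker run. The sketch below only builds the command line. The helper name build_docker_command and the default limits of 4 CPUs and 8 GB are assumptions for illustration, so pick values that suit your machine. A container that hits its memory cap may be killed rather than slow the whole system down.

```python
def build_docker_command(text, out_name, host_dir,
                         cpus=4, memory="8g",
                         image="ghcr.io/coqui-ai/tts",
                         model="tts_models/en/jenny/jenny"):
    """Return a docker run argument list with CPU and memory caps applied."""
    return [
        "docker", "run", "--rm",
        "--cpus", str(cpus),      # limit the number of CPUs the container can use
        "--memory", memory,       # hard memory limit for the container
        "-v", f"{host_dir}:/root/tts-output",
        image,
        "--text", text,
        "--model_name", model,
        "--out_path", f"/root/tts-output/{out_name}",
    ]
```

The list can be passed straight to subprocess.run(cmd, check=True), prefixed with "sudo" if Docker needs it on your system, as in the wrapper script above.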
This file was updated at 2025-03-01 19:15:47