Introduction
Warning
We assume no responsibility for any illegal use of the codebase. Please refer to the local laws regarding DMCA (Digital Millennium Copyright Act) and other relevant laws in your area.
This codebase is released under the BSD-3-Clause license, and all models are released under the CC-BY-NC-SA-4.0 license.
Requirements
- GPU Memory: 4GB (for inference), 16GB (for fine-tuning)
- System: Linux, Windows
We recommend that Windows users run the codebase under WSL2 or Docker, or use the integrated environment developed by the community.
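To check a machine against the memory figures above, a minimal sketch along these lines may help. The helper names `supported_workloads` and `detect_gpu_memory_gib` are hypothetical and not part of fish-speech, and reading the 4GB and 16GB figures as GiB is an assumption. PyTorch is imported lazily, so the script still runs before PyTorch is installed.

```python
# Thresholds taken from the Requirements list; treating them as GiB is an assumption.
INFERENCE_GIB = 4.0
FINETUNE_GIB = 16.0


def supported_workloads(total_gib: float) -> list[str]:
    """Return the workloads a GPU with `total_gib` GiB of memory should handle."""
    workloads = []
    if total_gib >= INFERENCE_GIB:
        workloads.append("inference")
    if total_gib >= FINETUNE_GIB:
        workloads.append("fine-tuning")
    return workloads


def detect_gpu_memory_gib():
    """Return total memory of CUDA device 0 in GiB, or None if it cannot be read."""
    try:
        import torch  # optional: only needed for detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    gib = detect_gpu_memory_gib()
    if gib is None:
        print("No CUDA GPU detected (or PyTorch is not installed).")
    else:
        print(f"{gib:.1f} GiB available for: {supported_workloads(gib) or 'nothing listed'}")
```

Drivers often report slightly less than the nominal card size, so a result just under a threshold should be read as a rough guide rather than a hard failure.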
Setup
# Create a Python 3.10 virtual environment (virtualenv works as well)
conda create -n fish-speech python=3.10
conda activate fish-speech
# Install PyTorch
pip3 install torch torchvision torchaudio
# Install fish-speech
pip3 install -e .
# (Ubuntu / Debian users) Install sox
apt install libsox-dev
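After running the commands above, a short script can confirm that the environment looks as expected. This is only a sketch: the helper names `missing_packages` and `python_matches` are hypothetical, and the check covers only the Python version and the three packages named in the setup commands, not fish-speech itself.

```python
import importlib.util
import sys

# Values taken from the setup commands above.
REQUIRED_PYTHON = (3, 10)
REQUIRED_PACKAGES = ("torch", "torchvision", "torchaudio")


def missing_packages(names):
    """Return the top-level packages in `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_matches(version_info=sys.version_info):
    """Return True when the interpreter's major.minor version is 3.10."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


if __name__ == "__main__":
    if not python_matches():
        print(f"Expected Python 3.10, found {sys.version_info.major}.{sys.version_info.minor}")
    missing = missing_packages(REQUIRED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

`importlib.util.find_spec` only locates a package without importing it, so the check stays fast and does not initialize CUDA.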
Changelog
- 2024/05/10: Updated Fish-Speech to version 1.1 and implemented a VITS decoder to reduce WER and improve timbre similarity.
- 2024/04/22: Finished Fish-Speech version 1.0, with significant changes to the VQGAN and LLAMA models.
- 2023/12/28: Added LoRA fine-tuning support.
- 2023/12/27: Added gradient checkpointing, causal sampling, and flash-attn support.
- 2023/12/19: Updated webui and HTTP API.
- 2023/12/18: Updated fine-tuning documentation and related examples.
- 2023/12/17: Updated text2semantic model, supporting phoneme-free mode.
- 2023/12/13: Beta version released, includes VQGAN model and a language model based on LLAMA (phoneme support only).