Phoneme level lyrics aligner

This repository can be used to align lyrics transcripts with the corresponding audio signals. The audio signals may contain solo singing or singing voice mixed with other instruments. The repository contains a trained deep neural network that performs alignment and singing voice separation jointly. Details about the model, training, and data are described in the associated paper:

Schulze-Forster, K., Doire, C., Richard, G., & Badeau, R. "Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation." IEEE/ACM Transactions on Audio, Speech, and Language Processing (2021). doi: 10.1109/TASLP.2021.3091817. A public version is also available.

If you use the model or code, please cite the paper:

@article{schulze2021phoneme,
    author={Schulze-Forster, Kilian and Doire, Clement S. J. and Richard, Gaël and Badeau, Roland},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
    title={Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation}, 
    year={2021},
    volume={29},
    number={},
    pages={2382-2395},
    doi={10.1109/TASLP.2021.3091817}
    }

Installation

Clone the repository:

git clone https://github.com/schufo/lyrics-aligner.git

Install the conda environment:
- If you want to run the model on a CPU:
```
conda env create -f environment_cpu.yml
```
- If you want to run the model on a GPU:
```
conda env create -f environment_gpu.yml
```

Remember to activate the conda environment.

Data preparation

Audio

Please prepare one directory containing all audio files. The audio files are loaded with librosa, so any format supported by librosa can be used, for example .wav and .mp3. See the librosa documentation for more details.

Lyrics

Please prepare a separate directory with all lyrics files in .txt-format. Each lyrics file must have the same name as the corresponding audio file (e.g. song1.wav --> song1.txt).
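Since each lyrics file must share its base name with an audio file, it can help to check the two directories before going further. The following is a minimal sketch and not part of this repository; the function name `find_unmatched` is hypothetical, and it assumes the audio directory contains only audio files.

```python
from pathlib import Path


def find_unmatched(audio_dir, lyrics_dir):
    """Compare base names (file name without extension) across both directories.

    Returns a tuple (audio files without lyrics, lyrics files without audio).
    Assumption: every regular file in audio_dir is an audio file.
    """
    audio_stems = {p.stem for p in Path(audio_dir).iterdir() if p.is_file()}
    lyrics_stems = {p.stem for p in Path(lyrics_dir).glob("*.txt")}
    missing_lyrics = sorted(audio_stems - lyrics_stems)
    missing_audio = sorted(lyrics_stems - audio_stems)
    return missing_lyrics, missing_audio
```

For example, `find_unmatched("PATH/TO/AUDIO/DIRECTORY", "PATH/TO/LYRICS/DIRECTORY")` returns two lists of base names; both should be empty when every song has its transcript.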

You can provide the lyrics as words or as phonemes.

If your lyrics are already decomposed into phonemes, please consider the following:

- We support only the 39 phonemes in ARPAbet notation listed on the website of the CMU Pronouncing Dictionary.
- The provided .txt-file should contain one phoneme per line.
- The first and the last symbol should be the space character: >. It should also be placed between words and at positions where silence between phonemes is expected in the singing voice signal.
- In this case only phoneme onsets can be computed, not word onsets.
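The constraints listed above can be checked before alignment. The sketch below is an illustration only and not part of this repository: the function name `check_phoneme_lines` is hypothetical, blank lines are ignored, and the comparison is case-insensitive. Both of these are assumptions, since the expected letter case is not stated here.

```python
# The 39 ARPAbet phonemes of the CMU Pronouncing Dictionary (without stress markers).
ARPABET_39 = {
    "AA", "AE", "AH", "AO", "AW", "AY", "B", "CH", "D", "DH", "EH", "ER", "EY",
    "F", "G", "HH", "IH", "IY", "JH", "K", "L", "M", "N", "NG", "OW", "OY",
    "P", "R", "S", "SH", "T", "TH", "UH", "UW", "V", "W", "Y", "Z", "ZH",
}
SPACE = ">"


def check_phoneme_lines(lines):
    """Return a list of problems found in a one-symbol-per-line phoneme transcript."""
    # Assumption: blank lines carry no meaning and can be skipped.
    symbols = [line.strip() for line in lines if line.strip()]
    if not symbols:
        return ["file is empty"]
    problems = []
    if symbols[0] != SPACE:
        problems.append("first symbol is not '>'")
    if symbols[-1] != SPACE:
        problems.append("last symbol is not '>'")
    for number, symbol in enumerate(symbols, start=1):
        if symbol == SPACE:
            continue
        if len(symbol.split()) != 1:
            problems.append(f"entry {number} holds more than one symbol: {symbol!r}")
        elif symbol.upper() not in ARPABET_39:  # assumption: case-insensitive match
            problems.append(f"entry {number} is not a supported phoneme: {symbol!r}")
    return problems
```

A transcript can be checked with `check_phoneme_lines(open("song1.txt").read().splitlines())`, where `song1.txt` is a placeholder file name; an empty list means no problem was found.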

If the lyrics are provided as words, they must be processed as follows to be used with the alignment model:

1. Generate a .txt-file with a list of unique words:
```
python make_word_list.py PATH/TO/LYRICS/DIRECTORY --dataset-name NAME
```
The --dataset-name flag is optional. It is useful when several datasets are to be aligned with this model. The output file names contain the dataset name, which defaults to 'dataset1'. This command generates the files NAME_word_list.txt and NAME_word2phoneme.txt in the files directory.
2. Go to http://www.speech.cs.cmu.edu/tools/lextool.html, upload NAME_word_list.txt as the word file, and click COMPILE.
3. Click on the link to see the list of output files. Then click on the .dict-file. You should now see a list of all words with their corresponding phoneme decomposition.
4. Copy the whole list and paste it into NAME_word2phoneme.txt in the files directory.
5. Run the following command:
```
python make_word2phoneme_dict.py --dataset-name NAME
```
Use the same dataset name as in step 1. This will generate a Python dictionary that translates each word into phonemes and save it as NAME_word2phonemes.pickle in files.
6. Done!
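To illustrate what the word-to-phoneme dictionary is for, the sketch below parses a pasted pronunciation list and expands a line of lyrics into phonemes. It is not the code of make_word2phoneme_dict.py. The function names are hypothetical, and the input layout (one word per line followed by whitespace-separated phonemes, with alternative pronunciations marked as `WORD(2)`) and the lower-casing of words are assumptions.

```python
import re


def parse_pronunciations(dict_text):
    """Map each word to its first listed phoneme sequence."""
    word2phonemes = {}
    for line in dict_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Assumption: alternative pronunciations appear as WORD(2), WORD(3), ...
        word = re.sub(r"\(\d+\)$", "", fields[0]).lower()
        # Keep only the first pronunciation seen for each word.
        word2phonemes.setdefault(word, fields[1:])
    return word2phonemes


def lyrics_to_phonemes(words, word2phonemes, space=">"):
    """Expand words into phonemes, with the space symbol before, between, and after words."""
    sequence = [space]
    for word in words:
        sequence.extend(word2phonemes[word.lower()])
        sequence.append(space)
    return sequence
```

With the entries `HELLO  HH AH L OW` and `WORLD  W ER L D`, the words "hello world" expand to `>`, `HH`, `AH`, `L`, `OW`, `>`, `W`, `ER`, `L`, `D`, `>`, which matches the phoneme format described earlier in this section.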

Usage

The model has been trained on the MUSDB18 dataset using the lyrics extension. Therefore, it will probably work best with similar music. However, we also found it works well on solo singing. Some errors can be expected in challenging mixtures with long instrumental sections.

You can compute phoneme onsets and/or word onsets as follows:

python align.py PATH/TO/AUDIO/DIRECTORY PATH/TO/LYRICS/DIRECTORY \
--lyrics-format w --onsets p --dataset-name dataset1 --vad-threshold 0
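When the aligner is called from another script, the command can be assembled programmatically. The sketch below is a hypothetical helper (`build_align_command` is not part of this repository) that uses only the flags documented in this section and rejects combinations the README rules out, namely word onsets with phoneme lyrics.

```python
import sys


def build_align_command(audio_dir, lyrics_dir, lyrics_format="w", onsets="p",
                        dataset_name="dataset1", vad_threshold=0):
    """Return the argument list for align.py, validating the flag values first."""
    if lyrics_format not in ("w", "p"):
        raise ValueError("--lyrics-format must be 'w' or 'p'")
    if onsets not in ("p", "w", "pw"):
        raise ValueError("--onsets must be 'p', 'w', or 'pw'")
    if lyrics_format == "p" and onsets != "p":
        # Word onsets are only available when the lyrics are provided as words.
        raise ValueError("word onsets require lyrics provided as words")
    return [
        sys.executable, "align.py", str(audio_dir), str(lyrics_dir),
        "--lyrics-format", lyrics_format,
        "--onsets", onsets,
        "--dataset-name", dataset_name,
        "--vad-threshold", str(vad_threshold),
    ]
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root, with the conda environment activated.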

Optional flags (defaults are shown above):

--lyrics-format Must be w if the lyrics are provided as words (and have been processed as described above) and p if the lyrics are provided as phonemes.

--onsets Set to p to compute phoneme onsets, to w to compute word onsets, or to pw to compute both (pw and w are only possible if the lyrics are provided as words).

--dataset-name Should be the same name as used for data preparation above.

--vad-threshold The model also computes an estimate of the isolated singing voice, which can be used as a Voice Activity Detector (VAD). This may be useful in challenging scenarios where the singer makes long pauses while instruments are playing (e.g. intro, soli, outro). The magnitude of the vocals estimate is computed, and a threshold (float) can be set here to discriminate between active and inactive voice based on that magnitude. The default is 0, which means that no VAD is used. The optimal value for a given audio signal may be difficult to determine because it depends on the loudness of the voice. In our experiments we used values between 0 and 30. You could print or plot the voice magnitude (computed in line 235) to get an intuition for an appropriate value. We recommend using this option only if large errors are made on audio files with long instrumental sections.
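The effect of a threshold can be explored offline on a list of magnitude values. The sketch below shows plain thresholding only and is not the code used in align.py: it assumes one non-negative magnitude value per time frame and a strict greater-than comparison, and the function names are hypothetical.

```python
def voice_activity(frame_magnitudes, threshold):
    """Mark a frame as active (True) when its magnitude exceeds the threshold."""
    return [magnitude > threshold for magnitude in frame_magnitudes]


def active_fraction(frame_magnitudes, threshold):
    """Fraction of frames marked as active for a given threshold."""
    flags = voice_activity(frame_magnitudes, threshold)
    return sum(flags) / len(flags) if flags else 0.0


def sweep_thresholds(frame_magnitudes, thresholds):
    """Active fraction for each candidate threshold, to compare values side by side."""
    return {t: active_fraction(frame_magnitudes, t) for t in thresholds}
```

Calling `sweep_thresholds(magnitudes, [0, 5, 10, 20, 30])` on values printed from the model gives a rough idea of how much of the signal each threshold would treat as silence.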

Acknowledgment

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068.
