This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

22 Jul 09:50

Mte90

2021.07.22

188c828

2021.07.22 Latest

Changelog

About a year after the last update, many changes have been made to the model and scripts.
The main work was improving the audio+text dataset importer and bumping to DeepSpeech 0.9.3.

This isn't a stable release: we don't have time right now to do a proper one, and the new CV dataset will be out soon, giving Italian more than 300 hours compared to the version used to generate this model.

For instructions on how it was generated, the parameters and other details, check https://github.com/MozillaItalia/DeepSpeech-Italian-Model/wiki/Training-Notes-DeepSpeech-0.9.3-(2021.07.22-pre-release)

Trainer

CommonVoice 6.1 (Cleaned) : 126h
MITADS-Speech (Cleaned): 349h

Total 475h

Available in 2 versions: the transfer version uses transfer learning from the official English model released by Mozilla, the other one is trained from scratch.

Thanks

This release would not have been possible without @eziolotta, who did... everything!
I (@Mte90) worked on the project management side of the model and, with the help of the server offered by Turin University, we were able to do everything.

License

CC0 as public domain.

Contributors

Mte90 and eziolotta

07 Aug 10:30

Mte90

2020.08.07

714b07e

2020.08.07

Changelog

After 5 months we release a new model with a lot of improvements!

We no longer use the Wikipedia dump as text corpus but Mitads (see https://github.com/MozillaItalia/DeepSpeech-Italian-Model/releases/tag/Mitads-1.0.0-alpha2)
Based on DeepSpeech 0.8
New docker scripts
- Sets of .env files to try different parameters: https://github.com/MozillaItalia/DeepSpeech-Italian-Model/tree/master/DeepSpeech/env_files
In the meantime we also released the notebooks!
We are using the official Mozilla DS model for English for transfer learning to improve the quality

Second version of Italian model, trained with:

~130 hours for Common Voice IT dataset
~127 hours of m-ailabs Italian dataset
total: ~257h

Available in 2 versions: the transfer version uses transfer learning from the official English model released by Mozilla, the other one is trained from scratch.

model hyper-parameters:

batch_size=64
n_hidden=2048
epochs=30
learning_rate=0.0001
dropout=0.4
lm_alpha=0
lm_beta=0
es_epochs=10
early_stop=1
amp=1

For transfer learning model:

amp=0
drop_source_layer=1
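
As a rough illustration of how the two parameter sets relate, the sketch below parses the KEY=VALUE lists above and layers the transfer-learning values on top of the base ones. It assumes the transfer-learning list overrides the base list and that the values can be read as plain KEY=VALUE text; the helper names `parse_env` and `effective_params` are hypothetical and are not taken from the repository scripts.

```python
# Hyper-parameters copied from the 2020.08.07 release notes above.
BASE_PARAMS = """
batch_size=64
n_hidden=2048
epochs=30
learning_rate=0.0001
dropout=0.4
lm_alpha=0
lm_beta=0
es_epochs=10
early_stop=1
amp=1
"""

# Values listed for the transfer learning model.
TRANSFER_OVERRIDES = """
amp=0
drop_source_layer=1
"""


def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments.

    Hypothetical helper: values are kept as strings, as in an .env file.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


def effective_params(base, overrides):
    """Return a copy of base with overrides applied (assumed semantics)."""
    merged = dict(base)
    merged.update(overrides)
    return merged


scratch = parse_env(BASE_PARAMS)
transfer = effective_params(scratch, parse_env(TRANSFER_OVERRIDES))

# Show only the settings that differ between the two variants.
for key in sorted(transfer):
    if scratch.get(key) != transfer[key]:
        print(key, scratch.get(key), "->", transfer[key])
```

Under these assumptions the two variants differ only in `amp` and `drop_source_layer`; every other value is shared.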

Check the README for usage instructions.

Thanks

This release wouldn't have been possible without the huge work of @nefastosaturo on the docker and DS side, as well as on generating the new model.

I (@Mte90) worked on the project management side of the model and, with @astrastefania and the help of the server offered by Turin University, we were able to do everything.

License

CC0 as public domain.

07 Aug 10:15

Mte90

Mitads-1.0.0-alpha2

634b063

Mitads-1.0.0-alpha2 Pre-release

7 days ago we released the first version of Mitads and you can find all the information here.

Differences from the previous version:

11922393 lines instead of 11828730 (93663 more)
More cleaning from wikiquote
More speech from wikisource

License

CC0 as public domain.

31 Jul 12:15

Mte90

Mitads-1.0.0-alpha

e66cae3

Mitads-1.0.0-alpha Pre-release

First official release of the Mitads text corpus!

What?

Mitads is an Italian text corpus of sentences extracted from discussions, chats and books, meant to capture a kind of spoken Italian that can be used with AI tools like DeepSpeech.
This dataset is released as Public Domain. It is generated with the scripts available at https://github.com/MozillaItalia/DeepSpeech-Italian-Model/tree/master/MITADS and aggregates different datasets or resources whose terms allow release in this aggregated form (basically, the original datasets cannot be recreated from it).
As it is generated on the fly, we cannot release the file cache or the files generated during the process (for license reasons), only the final corpus with a log file.
The corpus doesn't include repeated sentences. We implemented various sanitization steps, but this task is never-ending and requires your help to improve the quality of the corpus itself.

How it works

Every script in the Mitads folder targets a specific resource and handles its download and parsing, generating txt files.
Usually every script caches external resources to speed up development and generation, and applies specific rules to ignore lines, words and so on.
A Python library is included for tasks common to the various scripts.
A final Bash script executes all of them, does a final sanitization, removes duplicate sentences and generates the final corpus.
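
To make the last step more concrete, here is a minimal sketch of merging per-resource sentence lists, sanitizing them and dropping duplicates. It is not the project's actual code: the `sanitize` rule (collapsing whitespace), the case-insensitive duplicate check and the function names are all assumptions chosen for illustration.

```python
import re


def sanitize(line):
    """Hypothetical cleanup: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", line).strip()


def build_corpus(sources):
    """Merge sentence lists, skipping empty lines and repeated sentences.

    Duplicates are compared case-insensitively (an assumption); the first
    occurrence is kept and the original order is preserved.
    """
    seen = set()
    corpus = []
    for lines in sources:
        for line in lines:
            sentence = sanitize(line)
            if not sentence:
                continue
            key = sentence.casefold()
            if key in seen:
                continue
            seen.add(key)
            corpus.append(sentence)
    return corpus


# Two made-up "resources" standing in for the txt files the scripts produce.
demo = build_corpus([
    ["Ciao,  come stai?", "Tutto bene."],
    ["ciao, come stai?", "", "A domani!"],
])
print(demo)
```

Keeping a set of already seen sentences makes the duplicate check a single pass over the input, which matters for a corpus of millions of lines.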

Numbers

Total sentences: 11828730
10 different datasets or resources used
9 months of work (and discussions) by 7 volunteers: @Mte90 @nefastosaturo @mone27 @dag7dev @ilyasmg @eziolotta @GianluigiMemoli
With the help of the University of Turin, which offered a server (and of @astrastefania, who was the spokesperson), we were able to generate it (we have a partnership as a community)

Tickets to do before final release:

Next steps

Close the last tickets and integrate this corpus with the script to generate a new model version. From our internal discussions, using a text corpus closer to the Italian spoken between people should improve word recognition a lot.
After the official release we will evaluate how to improve performance and quality, and maybe find new datasets suitable for this project.

Reach us!

Check with @mozitabot on Telegram and join the Mozilla Italia Developers group (we speak Italian there).

13 Mar 17:19

Mte90

2020.03.13

d053ba7

2020.03.13

The zip file includes the TensorFlow and TFLite versions.

Second version of Italian model, trained with:

~85 hours for Common Voice IT dataset
~127 hours of m-ailabs Italian dataset
total: ~212h

model hyper-parameters:

batch_size=64
n_hidden=2048
epochs=30
learning_rate=0.0003
dropout=0.5
lm_alpha=0.65
lm_beta=1.45
beam_width=500
early_stop=1
amp=0

Check the README for usage instructions.

16 Oct 16:25

mone27

model-0.1a

2a550f4

Trained models 0.1a Pre-release

First version of Italian model, trained with:

~40 hours for Common Voice IT dataset
~127 hours of m-ailabs Italian dataset
total: ~170h

model hyper-parameters:

batch_size=68
n_hidden=2048
epochs=30
learning_rate=0.00025
dropout=0.15
lm_alpha=0.75
lm_beta=1.85
early_stop=0

Requires DeepSpeech 0.6.0a9!

10 Oct 19:42

Mte90

lm-0.1

76e1546

lm-0.1

Compressed Italian Wikipedia dump

Releases: MozillaItalia/DeepSpeech-Italian-Model
