Vietnamese Accent Prediction

A simple, fast, and accurate accent (diacritic) predictor for unaccented Vietnamese text, based on an n-gram language model with a Markov chain.
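As a rough illustration of the approach named above, here is a minimal sketch of bigram scoring with Viterbi decoding. It is an assumption about how such a predictor can work, not the code in this repository: the class name `ViterbiAccentSketch`, the toy candidate table, the counts, and the add-one smoothing are all made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy bigram + Viterbi accent restorer. Illustrative only; all data below is hypothetical. */
public class ViterbiAccentSketch {
    // Hypothetical table: unaccented syllable -> possible accented forms.
    static final Map<String, List<String>> CANDIDATES = Map.of(
            "anh", List.of("anh", "ảnh"),
            "yeu", List.of("yêu", "yếu"),
            "em", List.of("em"));

    // Hypothetical 1-gram and 2-gram counts (made-up numbers).
    static final Map<String, Integer> UNIGRAMS =
            Map.of("anh", 50, "ảnh", 10, "yêu", 40, "yếu", 15, "em", 60);
    static final Map<String, Integer> BIGRAMS =
            Map.of("anh yêu", 20, "yêu em", 25, "ảnh yếu", 1);
    static final int TOTAL = UNIGRAMS.values().stream().mapToInt(Integer::intValue).sum();

    // Add-one smoothed log P(word).
    static double logUnigram(String word) {
        return Math.log((UNIGRAMS.getOrDefault(word, 0) + 1.0) / (TOTAL + UNIGRAMS.size()));
    }

    // Add-one smoothed log P(cur | prev): the first-order Markov assumption.
    static double logBigram(String prev, String cur) {
        int pair = BIGRAMS.getOrDefault(prev + " " + cur, 0);
        return Math.log((pair + 1.0) / (UNIGRAMS.getOrDefault(prev, 0) + UNIGRAMS.size()));
    }

    static String restore(String input) {
        String[] words = input.trim().toLowerCase().split("\\s+");
        List<Map<String, Double>> score = new ArrayList<>(); // best log score per candidate
        List<Map<String, String>> back = new ArrayList<>();  // best predecessor per candidate
        for (int i = 0; i < words.length; i++) {
            Map<String, Double> cur = new HashMap<>();
            Map<String, String> from = new HashMap<>();
            // Unknown syllables are passed through unchanged.
            for (String cand : CANDIDATES.getOrDefault(words[i], List.of(words[i]))) {
                if (i == 0) {
                    cur.put(cand, logUnigram(cand));
                    continue;
                }
                for (Map.Entry<String, Double> prev : score.get(i - 1).entrySet()) {
                    double s = prev.getValue() + logBigram(prev.getKey(), cand);
                    if (!cur.containsKey(cand) || s > cur.get(cand)) {
                        cur.put(cand, s);
                        from.put(cand, prev.getKey());
                    }
                }
            }
            score.add(cur);
            back.add(from);
        }
        // Pick the best final candidate, then follow the back-pointers.
        Map<String, Double> last = score.get(words.length - 1);
        String best = null;
        for (Map.Entry<String, Double> e : last.entrySet()) {
            if (best == null || e.getValue() > last.get(best)) {
                best = e.getKey();
            }
        }
        String[] out = new String[words.length];
        for (int i = words.length - 1; i >= 0; i--) {
            out[i] = best;
            best = back.get(i).get(best);
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(restore("anh yeu em")); // prints: anh yêu em
    }
}
```

The sketch lowercases its input for simplicity, so it does not preserve capitalization.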

Performance

All tests were run on a MacBook with a 2.5 GHz Intel Core i7 and 16 GB of RAM.

Speed: about 350 sentences per second (roughly 3,500 words/syllables per second)
Accuracy: 96.52% on test.txt, provided in the datasets folder

// Measure accuracy on the bundled test set
AccuracyCalculator ac = new AccuracyCalculator();
System.out.println("Accuracy: " + ac.getAccuracy("datasets/test.txt") + "%");
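The README does not say how the accuracy figure is computed. One plausible reading is the share of syllables whose predicted form matches the reference. The sketch below shows that metric under that assumption; `AccuracySketch` and `tokenAccuracy` are hypothetical names and do not describe `AccuracyCalculator` itself.

```java
/** Position-wise token accuracy; an assumed metric, not the repository's AccuracyCalculator. */
public class AccuracySketch {
    /** Returns the percentage of reference tokens reproduced at the same position. */
    static double tokenAccuracy(String gold, String predicted) {
        String[] g = gold.trim().split("\\s+");
        String[] p = predicted.trim().split("\\s+");
        int correct = 0;
        for (int i = 0; i < Math.min(g.length, p.length); i++) {
            if (g[i].equals(p[i])) {
                correct++;
            }
        }
        return 100.0 * correct / g.length;
    }

    public static void main(String[] args) {
        // One of five syllables ("lich" vs "lịch") is wrong.
        System.out.println(tokenAccuracy("Tôi đang đi du lịch", "Tôi đang đi du lich")); // prints: 80.0
    }
}
```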

Examples

Anh yeu em --> Anh yêu em (I love you)
Toi dang di du lich o ha long --> Tôi đang đi du lịch ở hạ long (I am visiting Ha Long)
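To build unaccented inputs like the ones above from accented text (for example, when preparing your own test pairs), the accents can be stripped with the standard `java.text.Normalizer`. This is a minimal sketch; the class name `AccentStripper` is hypothetical and not part of this repository. Note that `đ`/`Đ` have no Unicode decomposition, so they are replaced explicitly.

```java
import java.text.Normalizer;

/** Removes Vietnamese diacritics; a helper sketch, not part of the repository. */
public class AccentStripper {
    static String stripAccents(String text) {
        // NFD splits each accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed
                .replaceAll("\\p{M}+", "") // drop the combining marks
                .replace('đ', 'd')         // đ and Đ do not decompose under NFD
                .replace('Đ', 'D');
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("Tôi đang đi du lịch ở hạ long"));
        // prints: Toi dang di du lich o ha long
    }
}
```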

API

Using the provided n-gram data

AccentPredictor ap = new AccentPredictor();
String str = "Toi thich di du lich Ha Noi";
String predictedStr = ap.predictAccents(str);

You can also get the top N predictions as follows:

AccentPredictor ap = new AccentPredictor();
String str = "Toi thich di du lich Ha Noi";

// Map of (matched string -> match score); requires java.util.LinkedHashMap
LinkedHashMap<String, Double> matches = ap.predictAccentsWithMultiMatches(str, 5); // returns the 5 best matches
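A returned map of this shape can be walked in iteration order, which a `LinkedHashMap` keeps equal to insertion order. Whether the library inserts the best match first, and what scale the scores use, is not stated here, so treat the ranking below as an assumption. `TopMatchesSketch` is a hypothetical helper, and the scores in `main` are placeholders, not real output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Formats a (sentence -> score) map; a hypothetical helper, not part of the repository. */
public class TopMatchesSketch {
    /** Lists each entry on its own line, numbered in the map's iteration order. */
    static String format(LinkedHashMap<String, Double> matches) {
        StringBuilder sb = new StringBuilder();
        int rank = 1;
        for (Map.Entry<String, Double> e : matches.entrySet()) {
            sb.append(rank++).append(". ").append(e.getKey())
              .append(" (score=").append(e.getValue()).append(")\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder data standing in for the result of predictAccentsWithMultiMatches.
        LinkedHashMap<String, Double> matches = new LinkedHashMap<>();
        matches.put("candidate A", -1.5);
        matches.put("candidate B", -2.0);
        System.out.print(format(matches));
    }
}
```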

Using your own n-gram data

AccentPredictor ap = new AccentPredictor("_Your1GramFile", "_Your2GramsFile");
String str = "Toi thich di du lich Ha Noi";
String predictedStr = ap.predictAccents(str);

To create your own n-gram data, you can use the following API:

String dataFolderPath = "path_to_your_data"; // The folder that contains your text data
int numberOfProcessingFiles = -1; // The maximum number of files to process (-1 means use all the data)
boolean toLowercase = true; // If true, the n-grams are converted to lowercase
String _1GramFileOut = "datasets/news1gram";
String _2GramsFileOut = "datasets/news2grams";
new NGramer(dataFolderPath).statisticNGrams(numberOfProcessingFiles, toLowercase, _1GramFileOut, _2GramsFileOut);
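For intuition about what the 1-gram and 2-gram statistics contain, here is a minimal counting sketch over whitespace-separated syllables. It is an assumption about the general technique, not `NGramer`'s actual implementation or file format; `NGramCountSketch` and `count` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Counts 1-grams and 2-grams in a sentence; illustrative only, not the repository's NGramer. */
public class NGramCountSketch {
    static void count(String sentence, boolean toLowercase,
                      Map<String, Integer> unigrams, Map<String, Integer> bigrams) {
        String text = (toLowercase ? sentence.toLowerCase() : sentence).trim();
        if (text.isEmpty()) {
            return; // nothing to count
        }
        String[] tokens = text.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            unigrams.merge(tokens[i], 1, Integer::sum);
            if (i + 1 < tokens.length) {
                // A 2-gram is a pair of adjacent tokens.
                bigrams.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        count("Anh yêu em", true, unigrams, bigrams);
        count("Em yêu anh", true, unigrams, bigrams);
        System.out.println(unigrams.get("yêu"));   // prints: 2
        System.out.println(bigrams.get("yêu em")); // prints: 1
    }
}
```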
