
Get better result after human validation #36

Open
simonefrancia opened this issue Jan 7, 2019 · 12 comments
@simonefrancia

simonefrancia commented Jan 7, 2019

Hi,
we are using indic-trans to transliterate from Hindi to Roman/English script.
After applying your model we get good results in general, but there are still some errors, as some Hindi speakers have pointed out to us:

चैन should be chain, not chaiyn;
कमल should be kamal, not camel;
मिलने should be milne, not milane.

Any advice on how to get better results? For example, increasing the training set, or choosing between beamsearch and Viterbi decoding?

Thanks

@irshadbhat
Member

irshadbhat commented Jan 7, 2019

Beamsearch will definitely help in this case. It returns n outputs (by default n = 5), and in most cases the desired transliteration is the first or second one.

```python
>>> from indictrans import Transliterator
>>> trn = Transliterator(source='hin', target='eng', decode='beamsearch')
>>> trn.transform(u'चैन')
[u'chaiyn', u'chain', u'chann', u'chen', u'chan']
>>> trn.transform(u'कमल')
[u'camel', u'kamal', u'camal', u'kamel', u'comel']
>>> trn.transform(u'मिलने')
[u'milane', u'milne', u'miline', u'milene', u'mine']
```

As you can see, the expected result is the second output in all three cases.

Adding more training data might not help in this case. Since Roman is not the original script for Hindi, one can choose any spelling. For example, for the word बहुत the actual pronunciation is bahut, but many Hindi speakers (including me) prefer the spelling bohat. So bohut, bahut and bohat are all acceptable to me. Calling the above transliterations erroneous doesn't seem right; each is just one of the possible transliterations.

That said, the system can of course fail in some cases. After all, it is a machine that learned its parameters from training data, not a human doing the transliteration, so expecting 100% accuracy is not reasonable.

@simonefrancia
Author

OK, thanks for the clear response.
Do you think it's a good idea to consider a transliteration reliable if the Viterbi output matches one of the n beamsearch outputs?

@irshadbhat
Member

I don't think so. The Viterbi output matches the first beamsearch output in almost all cases.

You can use back-transliteration to estimate the quality of the target transliteration by comparing its back-transliteration with the source word.

```python
>>> from indictrans import Transliterator
>>> trn = Transliterator(source='hin', target='eng', decode='beamsearch')
>>> trn_revr = Transliterator(source='eng', target='hin')  # back-transliteration
>>> trn.transform(u'मिलने')
[u'milane', u'milne', u'miline', u'milene', u'mine']
>>> print(trn_revr.transform('milane'))
मिलाने
>>> print(trn_revr.transform('milne'))
मिलने
```

As you can see, in the above example the second beamsearch output back-transliterates correctly to the original word while the first one does not, so in this case you can prefer the second output over the first.
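The round-trip check described above can be wrapped in a small helper. This is only a sketch, not part of indictrans itself: the `back` argument stands in for a reverse `Transliterator.transform`, and the function returns the first candidate whose back-transliteration matches the source word, falling back to the top beamsearch result if none round-trips.

```python
def pick_by_round_trip(source, candidates, back):
    """Return the first candidate whose back-transliteration equals
    the original source word; fall back to the first (highest-scoring)
    candidate if no candidate round-trips exactly."""
    for cand in candidates:
        if back(cand) == source:
            return cand
    return candidates[0]

# Hypothetical usage with indictrans (assumed API, not run here):
# trn = Transliterator(source='hin', target='eng', decode='beamsearch')
# trn_revr = Transliterator(source='eng', target='hin')
# best = pick_by_round_trip(u'मिलने', trn.transform(u'मिलने'),
#                           trn_revr.transform)
```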

@simonefrancia
Author

Thank you very much!
I mainly use the terminal commands and work with text at the sentence level.
So, given a sentence, how can I choose the best transliteration for each tokenized word? And how can I tokenize a Hindi sentence?

@irshadbhat
Member

irshadbhat commented Jan 8, 2019

You can tokenize Hindi text using polyglot-tokenizer.
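A sentence-level pipeline would then be tokenize-then-transliterate, roughly as in the sketch below. Note the assumptions: the regex tokenizer is only a naive stand-in for polyglot-tokenizer (it splits on whitespace and detaches punctuation), and the `translit` argument stands in for a `Transliterator.transform` call.

```python
import re

def naive_tokenize(sentence):
    """Naive stand-in for polyglot-tokenizer: split on whitespace and
    detach punctuation marks from word tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence, re.UNICODE)

def transliterate_sentence(sentence, translit):
    """Tokenize a sentence and transliterate each word token,
    leaving punctuation untouched."""
    out = []
    for tok in naive_tokenize(sentence):
        out.append(translit(tok) if re.match(r"\w", tok, re.UNICODE) else tok)
    return " ".join(out)
```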

@simonefrancia
Author

Great! I think I have everything I need to continue.
Thanks

Regards

@simonefrancia
Author

simonefrancia commented Jan 9, 2019

Sorry, one last question: how can I plug polyglot-tokenizer into indictrans so that it works from the command-line binary?
Thanks

@simonefrancia
Author

Hi,
I would like to learn more about this repo. I have some problems choosing the correct transliteration from an Indic script to Roman.
I followed your suggestion; here is a summary of my approach.

I have the word ಕವನ in Kannada, and if I show the top results with beamsearch (n=5), I get:

OUTPUT=kavan,cuvan,kavana,covan,kuvan

How do I choose the "best" one? I do what you suggested: I back-transliterate every word in OUTPUT and check that each back-transliteration corresponds to the original input word, in this case ಕವನ.
Doing so, kavan and cuvan are accepted, but kavana, covan and kuvan are discarded.
Google's transliteration of ಕವನ is kavana, but the tool discards it.
How can I modify this behaviour of the tool?

Thanks

@irshadbhat
Member

Based on the original question, my suggestion was not to add more training data but rather to use a workaround; that was because the language pair under consideration was hin-eng, whose current model is trained on around 100k pairs. The kan-eng model, by comparison, is trained on only 10k pairs, so for kan-eng adding more training data might help. You can go through my blog to learn how to train a new system.
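In the meantime, one way to keep near-misses like kavana from being hard-rejected is to replace the exact round-trip match with a similarity ranking. This is only a sketch of that idea, not a feature of indictrans: `back` stands in for a reverse `Transliterator.transform`, and `difflib.SequenceMatcher` scores how closely each candidate's back-transliteration matches the source word, with ties broken by the original beamsearch order.

```python
from difflib import SequenceMatcher

def rank_by_round_trip(source, candidates, back):
    """Rank candidates by how closely their back-transliteration
    matches the source word, instead of requiring an exact match.
    Ties keep the original beamsearch order."""
    def key(item):
        idx, cand = item
        sim = SequenceMatcher(None, back(cand), source).ratio()
        return (-sim, idx)
    return [cand for _, cand in sorted(enumerate(candidates), key=key)]
```

With this ranking, a candidate whose back-transliteration differs by one character is merely demoted rather than discarded.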

@simonefrancia
Author

simonefrancia commented Feb 14, 2019

OK, so I think this is the link: http://irshadbhat.github.io/rom-ind/
Do you know where I can find a large corpus to do the training from scratch?
I would also like to know which language pairs are considered reliable.
Thanks

@irshadbhat
Member

irshadbhat commented Feb 14, 2019

If you read the blog, I have mentioned a couple of sources from which I collected/generated the training data. Apart from those, you can search online for additional data.
The data I used for training is auto-extracted, not gold-annotated, so it is not 100% correct. If you create some data yourself (say another 10k word pairs), that will give you a much better model.

Regarding your second question about which language pairs are reliable: the reliability of a model is highly relative; whether you consider the output good or bad depends on the downstream task. But since you have asked, the best performing models are hin-urd, hin-eng and urd-eng, followed by ben-eng and hin-ben. All the rest are less accurate than these, mainly because of less training data.

@simonefrancia
Author

Hi,
we are facing a Tamil transliteration problem (tam-eng), and I would like to know which phonetic transliteration scheme was used for training: Azhagi or Jaffna, if I'm not wrong.
We are getting feedback from our translation checkers, and at the moment it is not very good. We would like to know whether this could be a model problem, or whether our checkers are instead referring to a phonetic system different from the one used for the model training.
Thanks
