
Get better result after human validation #36

Open
simonefrancia opened this issue Jan 7, 2019 · 12 comments
@simonefrancia

simonefrancia commented Jan 7, 2019

Hi,
we are using indic-trans to transliterate from Hindi to Roman/English script.
After applying your model we get good results in general, but there are still some errors, as some Hindi speakers have pointed out to us:

चैन should be chain, not chaiyn;
कमल should be kamal, not camel;
मिलने should be milne, not milane.

Any advice on how to get better results? For example, increasing the training set, or choosing between beamsearch and Viterbi decoding?

Thanks

@irshadbhat
Member

irshadbhat commented Jan 7, 2019

Beamsearch will definitely help in this case. It returns n outputs (by default n = 5), and in most cases the desired transliteration is the first or second one.

```python
>>> from indictrans import Transliterator
>>> trn = Transliterator(source='hin', target='eng', decode='beamsearch')
>>> trn.transform(u'चैन')
[u'chaiyn', u'chain', u'chann', u'chen', u'chan']
>>> trn.transform(u'कमल')
[u'camel', u'kamal', u'camal', u'kamel', u'comel']
>>> trn.transform(u'मिलने')
[u'milane', u'milne', u'miline', u'milene', u'mine']
```

As you can see, the expected result is the second output in all three cases.

Adding more training data might not help in this case. Since Roman is not the original script for Hindi, one can choose any spelling. For example, for the word बहुत the actual pronunciation is bahut, but many Hindi speakers (including me) prefer the spelling bohat. So bohut, bahut and bohat are all acceptable to me. Calling the above transliterations erroneous doesn't seem right; each is just one of the possible transliterations.

That said, the system can of course fail in some cases. After all, it is a machine that learned its parameters from training data, not a human doing the transliteration, so expecting 100% accuracy is not reasonable.

@simonefrancia
Author

OK, thanks for the clear response.
Do you think it's a good idea to consider a transliteration reliable if the Viterbi output matches one of the n beamsearch outputs?

@irshadbhat
Member

I don't think so. The Viterbi output matches the first beamsearch output in almost all cases.

You can use back-transliteration to estimate the quality of the target transliteration by comparing its back-transliteration with the source word.

```python
>>> from indictrans import Transliterator
>>> trn = Transliterator(source='hin', target='eng', decode='beamsearch')
>>> trn_revr = Transliterator(source='eng', target='hin')  # back-transliteration
>>> trn.transform(u'मिलने')
[u'milane', u'milne', u'miline', u'milene', u'mine']
>>> print(trn_revr.transform('milane'))
मिलाने
>>> print(trn_revr.transform('milne'))
मिलने
```

As you can see, in the above example the second beamsearch output back-transliterates correctly to the original word while the first one does not, so in this case you can prefer the second output over the first.
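The round-trip check described above can be wrapped in a small helper. This is only a sketch, not part of indictrans itself: the `back` argument stands in for a reverse `Transliterator.transform`, and the function returns the first candidate whose back-transliteration matches the source word, falling back to the top beamsearch result if none round-trips.

```python
def pick_by_round_trip(source, candidates, back):
    """Return the first candidate whose back-transliteration equals
    the original source word; fall back to the first (highest-scoring)
    candidate if no candidate round-trips exactly."""
    for cand in candidates:
        if back(cand) == source:
            return cand
    return candidates[0]

# Hypothetical usage with indictrans (assumed API, not run here):
# trn = Transliterator(source='hin', target='eng', decode='beamsearch')
# trn_revr = Transliterator(source='eng', target='hin')
# best = pick_by_round_trip(u'मिलने', trn.transform(u'मिलने'),
#                           trn_revr.transform)
```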

@simonefrancia
Author

Thank you very much!
I mainly use the terminal commands and work with text at the sentence level.
So, given a sentence, how can I choose the best transliteration for each tokenized word? And how can I tokenize a Hindi sentence?

@irshadbhat
Member

irshadbhat commented Jan 8, 2019

You can tokenize Hindi text using polyglot-tokenizer.
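A sentence-level pipeline would then be tokenize-then-transliterate, roughly as in the sketch below. Note the assumptions: the regex tokenizer is only a naive stand-in for polyglot-tokenizer (it splits on whitespace and detaches punctuation), and the `translit` argument stands in for a `Transliterator.transform` call.

```python
import re

def naive_tokenize(sentence):
    """Naive stand-in for polyglot-tokenizer: split on whitespace and
    detach punctuation marks from word tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence, re.UNICODE)

def transliterate_sentence(sentence, translit):
    """Tokenize a sentence and transliterate each word token,
    leaving punctuation untouched."""
    out = []
    for tok in naive_tokenize(sentence):
        out.append(translit(tok) if re.match(r"\w", tok, re.UNICODE) else tok)
    return " ".join(out)
```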

@simonefrancia
Author

Great! I think I have everything I need to continue.
Thanks

Regards

@simonefrancia
Author

simonefrancia commented Jan 9, 2019

Sorry, one last question: how can I plug polyglot-tokenizer into indictrans so that it works from the command-line binary?
Thanks

@simonefrancia
Author

Hi,
I would like to learn more about this repo. I have some problems choosing the correct transliteration from an Indic script to Roman.
I followed your suggestion; here is a summary of my approach.

I have the word ಕವನ in Kannada, and if I show the top results with beamsearch (n=5), I get:

OUTPUT=kavan,cuvan,kavana,covan,kuvan

How do I choose the "best" one? I do what you suggested: I back-transliterate every word in OUTPUT and check that each back-transliteration corresponds to the original input word, in this case ಕವನ.
Doing so, kavan and cuvan are accepted, but kavana, covan and kuvan are discarded.
Google's transliteration of ಕವನ is kavana, but the tool discards it.
How can I modify this behaviour of the tool?

Thanks

@irshadbhat
Member

Based on the original question, my suggestion was not to add more training data but rather to use a workaround; that was because the language pair under consideration was hin-eng, whose current model is trained on around 100k pairs. The kan-eng model, by comparison, is trained on only 10k pairs, so for kan-eng adding more training data might help. You can go through my blog to learn how to train a new system.
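In the meantime, one way to keep near-misses like kavana from being hard-rejected is to replace the exact round-trip match with a similarity ranking. This is only a sketch of that idea, not a feature of indictrans: `back` stands in for a reverse `Transliterator.transform`, and `difflib.SequenceMatcher` scores how closely each candidate's back-transliteration matches the source word, with ties broken by the original beamsearch order.

```python
from difflib import SequenceMatcher

def rank_by_round_trip(source, candidates, back):
    """Rank candidates by how closely their back-transliteration
    matches the source word, instead of requiring an exact match.
    Ties keep the original beamsearch order."""
    def key(item):
        idx, cand = item
        sim = SequenceMatcher(None, back(cand), source).ratio()
        return (-sim, idx)
    return [cand for _, cand in sorted(enumerate(candidates), key=key)]
```

With this ranking, a candidate whose back-transliteration differs by one character is merely demoted rather than discarded.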

@simonefrancia
Author

simonefrancia commented Feb 14, 2019

OK, so I think this is the link: http://irshadbhat.github.io/rom-ind/
Do you know where I can find a large corpus to do the training from scratch?
I would also like to know which language pairs are considered reliable.
Thanks

@irshadbhat
Member

irshadbhat commented Feb 14, 2019

If you read the blog, I have mentioned a couple of sources from which I collected/generated the training data. Apart from those, you can search online for additional data.
The data I used for training is auto-extracted, not gold-annotated, so it is not 100% correct. If you create some data yourself (say another 10k word pairs), that will give you a much better model.

Regarding your second question about which language pairs are reliable: the reliability of a model is highly relative; whether you consider the output good or bad depends on the downstream task. But since you have asked, the best performing models are hin-urd, hin-eng and urd-eng, followed by ben-eng and hin-ben. All the rest are less accurate than these, mainly because of less training data.

@simonefrancia
Author

Hi,
we are facing a Tamil transliteration problem (tam-eng), and I would like to know which phonetic transliteration scheme was used for training: Azhagi or Jaffna, if I'm not wrong.
We are getting feedback from our translation checkers, and at the moment it is not very good. We would like to know whether this could be a model problem, or whether our checkers are instead referring to a phonetic system different from the one used for the model training.
Thanks
