Request for new ranking feature based on pairs1-rank, possibly to replace pairs_combined_rank: MedianPairsCCFrequencies #33

Open
martinreynaert opened this issue Dec 1, 2018 · 5 comments

@martinreynaert
Collaborator

Hi,

This concerns ranking features:

  (skip[8]?0:(*vit)->pairs1_rank) +

  (skip[10]?0:(*vit)->pairs_combined_rank) +

This is a request for a more informed ranking-feature. This may be a new one or may replace the existing pairs_combined one (preferred).

The pairs1 ranking feature currently counts, for each anagram confusion value, the pairs transferred from LDcalc to rank. Within the set of Correction Candidates (CCs) for a particular variant, the confusion value with the highest number of transferred pairs is ranked highest.
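
As a rough illustration of the count-based ranking described above, the following Python sketch counts pairs per confusion value and ranks one variant's CCs by that count. The function name `pairs1_ranks`, the tuple layout, the tie handling (equal counts share a rank) and the sample data are all assumptions made for illustration; this is not the TICCL-rank code.

```python
from collections import Counter


def pairs1_ranks(candidates, pair_counts):
    """Rank the CCs of one variant by how many pairs share their confusion value.

    candidates:  list of (cc, confusion_value) tuples for a single variant.
    pair_counts: mapping confusion_value -> number of pairs with that value.
    Returns {cc: rank}; rank 1 goes to the highest count, equal counts share a rank.
    """
    distinct = sorted({pair_counts[conf] for _cc, conf in candidates}, reverse=True)
    rank_of = {count: i + 1 for i, count in enumerate(distinct)}
    return {cc: rank_of[pair_counts[conf]] for cc, conf in candidates}


# Hypothetical pairs: (variant, cc, confusion_value). The values are made up.
pairs = [
    ("v1", "cc_a", 101),
    ("v2", "cc_c", 101),
    ("v3", "cc_d", 101),
    ("v1", "cc_b", 202),
]

# Count how many pairs carry each confusion value over the whole set.
pair_counts = Counter(conf for _variant, _cc, conf in pairs)

# Rank only the candidates proposed for variant "v1".
v1_candidates = [(cc, conf) for variant, cc, conf in pairs if variant == "v1"]
ranks = pairs1_ranks(v1_candidates, pair_counts)
print(ranks)
```

In this made-up data, confusion value 101 occurs in more pairs than 202, so the candidate reached through 101 is ranked first regardless of how frequent the candidates themselves are.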

This does not always give the most likely CC the highest rank. Quite spurious confusions, particularly over shorter words, may be ranked higher than confusions that ostensibly recur often in the particular corpus being corrected.

After some experimentation it seems that weighing the frequencies of the CCs proposed for a particular confusion might help. We have tried the mean of the frequencies, but this results in pretty much the same ranking as we currently get in pairs1.

The median of the CCs frequencies, however, appears more likely to deliver the better ranking.

This will probably have to be implemented at the end of rank.

So, given the overall set of pairs in rank that share a particular character confusion value, this new feature needs to calculate the median of the CCs frequencies (their own, not the summed frequency of their capitalised versions). Also, here, the highest median wins, i.e. is accorded rank 1.
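
A minimal Python sketch of the requested feature, under stated assumptions: every pair contributes its CC's own frequency once to the list for its confusion value, the median is taken over that list, and within one variant the CC whose confusion value has the highest median gets rank 1. The names `median_cc_frequency` and `median_ranks`, the tuple layout, the tie handling and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_cc_frequency(pairs):
    """Return {confusion_value: median of the CC frequencies over all pairs}.

    pairs: iterable of (variant, cc, cc_freq, confusion_value) tuples, where
    cc_freq is the CC's own frequency (not summed with capitalised versions).
    """
    freqs = defaultdict(list)
    for _variant, _cc, cc_freq, conf in pairs:
        freqs[conf].append(cc_freq)
    return {conf: median(values) for conf, values in freqs.items()}


def median_ranks(candidates, medians):
    """Rank one variant's CCs: highest median wins; equal medians share a rank."""
    distinct = sorted({medians[conf] for _cc, conf in candidates}, reverse=True)
    rank_of = {value: i + 1 for i, value in enumerate(distinct)}
    return {cc: rank_of[medians[conf]] for cc, conf in candidates}


# Hypothetical data: confusion 101 has many pairs with rare CCs,
# confusion 202 has fewer pairs but with frequent CCs.
pairs = [
    ("v1", "cc_a", 1, 101),
    ("v2", "cc_c", 1, 101),
    ("v3", "cc_d", 2, 101),
    ("v4", "cc_e", 3, 101),
    ("v5", "cc_f", 1, 101),
    ("v1", "cc_b", 40, 202),
    ("v6", "cc_g", 55, 202),
    ("v7", "cc_h", 70, 202),
]

medians = median_cc_frequency(pairs)
v1_candidates = [(cc, conf) for variant, cc, _freq, conf in pairs if variant == "v1"]
ranks = median_ranks(v1_candidates, medians)
print(medians, ranks)
```

With this made-up data a pure pair count would favour confusion value 101, while the median of the CC frequencies favours 202, which is the kind of reordering the request is after.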

I would very much like to be able to experiment with this soon.

Thanks!

M.

@kosloot
Collaborator

kosloot commented Dec 3, 2018

I am a bit confused about your remark "This will probably have to be implemented at the end of rank."
Does this mean that you suggest to calculate the median of all frequencies belonging to a character confusion, for all variants it appears in?

My first impression was that it is a 'local' calculation, for one variant with its N CCs.
E.g. consider this variant:

-eveuzoo~1~1~Eveu_zoo~1~2~25723051649~2~6~0~0~1~0~0
-eveuzoo~1~1~Eveuzoo~28~95~35723051649~1~7~0~0~1~0~0
-eveuzoo~1~1~Kveuzoo~2~2~28061646568~2~6~0~0~1~0~0
-eveuzoo~1~1~evenzoo~100002930~100004079~44559939201~2~6~1~0~1~0~2
-eveuzoo~1~1~eveozoo~2~2~40621368225~2~6~0~0~1~0~0
-eveuzoo~1~1~eveu_zoo~1~2~25723051649~2~6~0~0~1~0~0
-eveuzoo~1~1~eveuzoo~67~95~35723051649~1~7~0~0~1~0~0
-eveuzoo~1~1~geve_zoo~100000003~100000003~28302116432~2~6~1~0~1~0~0

This has the frequencies:
1 1 2 2 28 67 100000003 100002930
The median would be 15, which seems quite useless.
So this is apparently NOT what you want.
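
For reference, the 'local' calculation can be reproduced with a short Python sketch. It assumes that the fifth `~`-separated field of each record is the CC's own frequency, which appears to be the column the frequencies above are taken from; the helper name `cc_frequencies` is hypothetical.

```python
from statistics import median

# The records quoted above, copied verbatim (including the leading dash).
RECORDS = """\
-eveuzoo~1~1~Eveu_zoo~1~2~25723051649~2~6~0~0~1~0~0
-eveuzoo~1~1~Eveuzoo~28~95~35723051649~1~7~0~0~1~0~0
-eveuzoo~1~1~Kveuzoo~2~2~28061646568~2~6~0~0~1~0~0
-eveuzoo~1~1~evenzoo~100002930~100004079~44559939201~2~6~1~0~1~0~2
-eveuzoo~1~1~eveozoo~2~2~40621368225~2~6~0~0~1~0~0
-eveuzoo~1~1~eveu_zoo~1~2~25723051649~2~6~0~0~1~0~0
-eveuzoo~1~1~eveuzoo~67~95~35723051649~1~7~0~0~1~0~0
-eveuzoo~1~1~geve_zoo~100000003~100000003~28302116432~2~6~1~0~1~0~0
"""


def cc_frequencies(records):
    """Return the sorted CC frequencies (assumed: fifth '~' field) of the records."""
    freqs = []
    for line in records.splitlines():
        line = line.strip().lstrip("-")  # drop the leading dash of the listing
        if not line:
            continue
        fields = line.split("~")
        freqs.append(int(fields[4]))
    return sorted(freqs)


local_freqs = cc_frequencies(RECORDS)
local_median = median(local_freqs)
print(local_freqs, local_median)
```

Because the list mixes a handful of tiny frequencies with two very large ones, the median lands between the two groups and says little about any individual CC.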

Could you clarify a bit?

@martinreynaert
Collaborator Author

No, the local calculation is not what I want. I do suggest to calculate the median of all frequencies belonging to a character confusion, for all variants it appears in.

OK, for my tests I have used the following information:

reynaert@red:/reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL$ grep '#9496960451#' .RUNAMALGAM5.clean.ldcalc.debug.ranked | cut -d '#' -f 1,3,4,6,16 >bla3
reynaert@red:/reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL$ grep '#745481551#' .RUNAMALGAM5.clean.ldcalc.debug.ranked | cut -d '#' -f 1,3,4,6,16 >bla4

I have imported these output files into Excel and calculated the mean and the median over column 3 of this output, i.e. the base frequency of the CCs.
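
The same spreadsheet calculation can be sketched in Python. This assumes the debug file is `#`-separated and that the CC base frequency is the fourth field of each line (column 3 of the `cut -f 1,3,4,6,16` output); the function name `column_stats` and the sample lines are made up, as the real debug-file contents are not shown here.

```python
from statistics import mean, median


def column_stats(lines, confusion_value, freq_field=3):
    """Mean and median of one '#'-separated column over lines matching a confusion value.

    lines:           iterable of text lines from the debug file.
    confusion_value: value to look for as '#<value>#', as in the grep commands.
    freq_field:      0-based index of the frequency field (assumed to be 3).
    Returns (mean, median), or None when no line matches.
    """
    needle = f"#{confusion_value}#"
    freqs = [
        int(line.rstrip("\n").split("#")[freq_field])
        for line in lines
        if needle in line
    ]
    if not freqs:
        return None
    return mean(freqs), median(freqs)


# Hypothetical lines; only the separator and the two confusion values
# come from the commands above, everything else is invented.
sample_lines = [
    "v1#0#cc_a#10#0#745481551#0",
    "v2#0#cc_b#20#0#745481551#0",
    "v3#0#cc_c#900#0#745481551#0",
    "v4#0#cc_d#5#0#9496960451#0",
]

stats = column_stats(sample_lines, 745481551)
print(stats)
```

In the invented sample a single frequent CC pulls the mean far above the median, which is the usual reason the two statistics lead to different rankings.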

So I based this on the info in the debug file output by TICCL-rank.

If you do this on the output of LDcalc, you get larger subsets per confusion value. So extra filtering in TICCL-rank seems to discard a number of pairs, so we lose some (I hope we do not actually lose any). It would probably be easier to calculate the mean over these from LDcalc. Who knows, the net result might be the same, but I do not know this. Let us say this is an option if it proves too hard to implement this on the subsets actually output to the debug file of rank.

Hope this sufficiently clarifies matters.

@kosloot
Collaborator

kosloot commented Dec 3, 2018

Ok,
calculating the median per confusion value is a simple preprocessing step on the LDcalc data.
On only the results stored in Rank, it would require a post-processing step, which might be more expensive.
I suggest we start with the LDcalc data and see what that brings us.
We then use that global value in ranking the CCs per variant.

@kosloot kosloot changed the title Request for new rankng feature based on pairs1-rank, possibly to replace pairs_combined_rank: MedianPairsCCFrequencies Request for new ranking feature based on pairs1-rank, possibly to replace pairs_combined_rank: MedianPairsCCFrequencies Dec 3, 2018
@martinreynaert
Collaborator Author

First test on server Black running with command-line:

reynaert@black:/reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL$ nohup /exp/sloot/usr/local/bin//TICCL-rank -t max --alph /reddata/PILOTS/MORSE/Aspell/eng.aspell.hyphen.dict.clip0.lc.chars --charconf /reddata/PILOTS/MORSE/Aspell/eng.aspell.hyphen.dict.clip0.ld2.charconfus -o /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.subtractartifrqfeature1.MEDIAN.ranked --debugfile /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/.RUNAMALGAM5.tsv.clean.ldcalc.subtractartifrqfeature1.MEDIAN.debug.ranked --subtractartifrqfeature1 1000000000 --clip 1 --skipcols=9,10,13 --charconfreq /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.subtractartifrqfeature1.ranked.chrconfreq /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.wordfreqlist.1to3.tsv.clean.ldcalc 2>/reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.RANK.subtractartifrqfeature1.charconfreq.MEDIAN.20181204.stderr &

@kosloot
Collaborator

kosloot commented Jan 22, 2019

@martinreynaert Small addition:
I reported:
If you run TICCL-rank (on black) with the --ALTERNATIVE option, it
computes the median only over the frequencies of the CCs found
per variant.

I see (minimal) differences.

I would like to hear which approach we are going to choose.


2 participants