TICCL-rank: 'filter out' unigram correction variants where a bigram to unigram CC is present. #26

Open
kosloot opened this issue Jul 10, 2018 · 1 comment
@kosloot
Collaborator

kosloot commented Jul 10, 2018

@martinreynaert provided the following examples:

<mre> veroor_zaakt#1#veroorzaakt#100000002#1#0.815385
<mre> veroor_zaakt_door#1#veroorzaakt_door#100000001#1#1
<mre> veroor#1#verloor#100000024#1#0.998869

The last entry is undesirable.

<mre> veroor_zaakt#1#veroorzaakt#100000002#1#0.815385
<mre> veroor_zaakt_door#1#veroorzaakt_door#100000001#1#1
<mre> zaakt_door#1#zaak_voor#100000001#2#1
<mre> zaakt#1#nazakt#100000000#2#0.998757

The last entry is undesirable.

<mre> verlaa_ten#1#verlaaten#100000010#1#0.984416
<mre> verlaa#1#verlaan#100000000#1#0.998726

Idem: the last entry is undesirable.

<mre> acobs_Nakomelingen#1#j_acobs_Nakomelingen#1#2#1
<mre> acobs#1#Jacobs#100000001#1#0.993398
<mre> j_acobs#1#Jacobs#100000001#1#0.977545

Here the second is undesirable.

This last example also illustrates why filtering is not that easy: the undesirable unigram entry comes before the bigram entry that should suppress it.
It would be handy if this were a sequential process, but unfortunately it is not.

At the moment TICCL-rank processes its input and output in chunks, but we have to change that and store all results so that we can filter out the above cases afterwards.
A major change! It consumes more memory and is less easy to handle multi-threaded.
Some more investigation is needed.
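
As a starting point for that investigation, here is a minimal sketch in Python of the "store everything, filter afterwards" idea. It is not TICCL-rank code. It assumes the ranked lines are `#`-separated with the variant in the first field and the correction candidate in the third, that n-gram parts are joined with `_`, and that the `<mre> ` prefix shown above is not part of the data. The function names `parse_record` and `filter_split_unigrams` are hypothetical.

```python
def parse_record(line):
    """Return (variant, correction) from one '#'-separated ranked line.

    Assumption: field 1 is the variant, field 3 the correction candidate.
    The other fields are carried along untouched and not interpreted here.
    """
    fields = line.strip().split("#")
    return fields[0], fields[2]


def filter_split_unigrams(lines):
    """Drop unigram variants that are one half of a bigram-to-unigram correction."""
    # All results are held in memory: the bigram entry may come after
    # the unigram entry it should suppress, so one pass is not enough.
    records = [line.strip() for line in lines if line.strip()]

    # Pass 1: collect both parts of every bigram variant whose
    # correction candidate is a single word.
    blocked = set()
    for rec in records:
        variant, correction = parse_record(rec)
        parts = variant.split("_")
        if len(parts) == 2 and "_" not in correction:
            blocked.update(parts)

    # Pass 2: keep everything except unigram variants found in pass 1.
    kept = []
    for rec in records:
        variant, _ = parse_record(rec)
        if "_" not in variant and variant in blocked:
            continue
        kept.append(rec)
    return kept


ranked = [
    "acobs_Nakomelingen#1#j_acobs_Nakomelingen#1#2#1",
    "acobs#1#Jacobs#100000001#1#0.993398",
    "j_acobs#1#Jacobs#100000001#1#0.977545",
]
for line in filter_split_unigrams(ranked):
    print(line)
# prints the first and third line; the 'acobs' entry is dropped
```

The two passes make the cost visible: nothing can be written out until every record has been seen, which is exactly the memory and multi-threading concern raised above.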

@martinreynaert
Collaborator

martinreynaert commented Dec 15, 2021

Might it be possible to shift solving this to the next module, TICCL-chain?

I still consider this quite a major problem. I also consider it closely related to what currently goes wrong with the ngram filtering (I'll probably come back to this in relation to another issue, but not issues 33 and 34, which actually do not seem to be related at all).
