Operation 'hemp' parameter in FoLiA-stats #29

Open
martinreynaert opened this issue Feb 5, 2019 · 5 comments
Comments

@martinreynaert
Contributor

martinreynaert commented Feb 5, 2019

The hemp parameter in FoLiA-stats collects spaced words. It currently breaks on ligatures (see example). It also fails to collect the last letter when that letter is followed by a punctuation mark, which happens often.

reynaert@black:/reddata/PILOTS/LEVITICUS$ grep 'F r a n' /reddata/PILOTS/LEVITICUS/FOLIA/NOFOREIGN/levit.03.NoForeigns.folia.xml.txt
F r a n s c h zal
Z. F r a n k r ij k.
uitgeoefend. Z. F r a n k r ij k.
F r a n k r ij k.
reynaert@black:/reddata/PILOTS/LEVITICUS$ cat TESTFRQ/TESTFRQFOLIAtagdiv.hemp |grep 'F_r_a_n'
F_r_a_n_k_r

1/ ligatures should be seen as single characters.
2/ a final character with a trailing punctuation mark should also be collected.

Perhaps both little issues might be solved by allowing for the 'occasional' two character sequence, given repetitions of single characters in historically emphasised text.

@martinreynaert
Contributor Author

martinreynaert commented Feb 5, 2019

I was notified that FoLiA-stats, as installed on the new server 'violet', should now be able to handle ligatures.

I tested this on 'violet'. Note this was the very first time I ran any FoLiA- or TICCL tool on this new machine.

It seemed very slow.

And it did not work as can be seen from the output file:

reynaert@violet:/reddata$ grep 'F_r_a_n' /reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW.hemp
F_r_a_n_k_r
F_r_a_n_s_c_h

The command run was:

reynaert@violet:/reddata$ /exp/sloot/usr/local/bin/FoLiA-stats --max-ngram=3 --separator='_' --collect --tags=div -t max --hemp=/reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW.hemp -e folia.xml$ -o /reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW /reddata/PILOTS/LEVITICUS/FOLIA/NOFOREIGN/

@kosloot
Contributor

kosloot commented Feb 6, 2019

Ok, a closer examination of the provided data reveals that the 'ij' ISN'T a ligature but indeed just 2 separate characters. So the patch to handle multi-byte characters didn't help.
I assume the conversion to FoLiA already 'solved' the ligature.
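
A quick way to check whether a text holds the single-code-point ligature or two plain letters is to list the code points. The following is a minimal Python sketch, not part of FoLiA-stats; the helper name `describe` and the sample strings are illustrative assumptions, not taken from the corpus.

```python
import unicodedata

def describe(text):
    """List (character, code point, Unicode name) for every character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in text]

two_letters = "ij"     # two separate code points: 'i' followed by 'j'
ligature = "\u0133"    # one code point: LATIN SMALL LIGATURE IJ

print(describe(two_letters))
print(describe(ligature))

# NFKC normalization folds the ligature into the two-letter form, which is
# one way a conversion step could have 'solved' the ligature already.
print(unicodedata.normalize("NFKC", ligature) == two_letters)  # True
```

If the data contains only the two-letter form, treating multi-byte characters as single units cannot help, which matches the observation above.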

We really need to be more lax here and accept 2-byte sequences too.
This might well turn out to be too permissive, in which case we could add restrictions, like
'only certain 2-grams' and 'a punctuation mark, but only in the last position'.

@kosloot
Contributor

kosloot commented Feb 6, 2019

Ok, I improved 'hemp' detection. The bi-gram 'ij' is now always accepted, and bi-grams with a trailing punctuation mark too, but the latter are assumed to END the 'hemp'.
@martinreynaert please test this, it is installed on violet.
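
The rules described in this comment can be sketched roughly as follows. This is an illustrative Python sketch, not the FoLiA-stats implementation: the names `classify` and `find_hemps`, the minimum of two parts per hemp, and the restriction to ASCII punctuation are all assumptions.

```python
import string

SEPARATOR = "_"
ALLOWED_BIGRAMS = {"ij", "IJ", "Ij", "iJ"}   # 'ij' and its case variants
PUNCT = set(string.punctuation)              # assumption: ASCII punctuation only

def classify(token):
    """Return 'part' for a hemp member, 'end' for a hemp-stopper, else None."""
    if len(token) == 1 and token.isalnum():
        return "part"
    if token in ALLOWED_BIGRAMS:
        return "part"
    if len(token) == 2 and token[0].isalnum() and token[1] in PUNCT:
        return "end"                         # character plus trailing punctuation
    return None

def find_hemps(text, min_parts=2):
    """Collect runs of spaced characters; min_parts is an assumed threshold."""
    hemps, current = [], []

    def flush():
        if len(current) >= min_parts:
            hemps.append(SEPARATOR.join(current))
        current.clear()

    for token in text.split():
        kind = classify(token)
        if kind is None:
            flush()                          # an ordinary word breaks the run
        else:
            current.append(token)
            if kind == "end":
                flush()                      # a punctuated token closes the run
    flush()
    return hemps

print(find_hemps("F r a n s c h zal"))       # ['F_r_a_n_s_c_h']
print(find_hemps("Z. F r a n k r ij k."))    # ['F_r_a_n_k_r_ij_k.']
```

Under these assumptions the lone 'Z.' is dropped because it forms a run of one part, and the trailing 'k.' is collected together with its punctuation mark.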

@kosloot
Contributor

kosloot commented Feb 13, 2019

@martinreynaert I would like to improve and clarify 'hemp' detection a bit, especially since we are now using the same procedure in FoLiA-correct. I will use some corner cases to illustrate the difficulties.

Take the following examples:

  1. H E M P
  2. een H E M P dus
  3. een H E M P in een zin

I suppose the hemp to be detected is H_E_M_P

Some cases with a punctuated hemp:

  1. H E M P.
  2. een H E M P. dus
  3. een H E M P. in een zin
  4. een H E. M P. in een zin

1, 2 and 3 will give the hemp: H_E_M_P.
4 will give 2 hemps: H_E. and M_P. as we consider a punctuated 2-gram as a hemp-stopper.
This may be questionable....

1-digit numbers can also be part of a hemp, as in: 1 2 3 yielding 1_2_3, but note that
1_2._3 does not yield any hemps. Probably 1_2. is desired, or even 1_2._3?

NOTE: as an exception the bi-gram 'ij' (and case variants) is also part of a hemp.

To summarize:
We need a clear definition of a hemp :)
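
One way to pin such a definition down is to write it as a regular expression and run it against the corner cases above. The following Python sketch is only a candidate definition under stated assumptions, not what FoLiA-stats or FoLiA-correct does: a hemp is taken to be at least two whitespace-separated parts, each a single letter or digit or the bi-gram 'ij' in any case, with at most one punctuation mark allowed directly after the last part. The names `PART`, `HEMP_RE` and `regex_hemps` are hypothetical.

```python
import re

# One hemp part: the bi-gram 'ij' (any case) or a single letter or digit.
PART = r"(?:[iI][jJ]|[^\W_])"

# Candidate definition: a part, then one or more further parts, then an
# optional punctuation mark; the whole match must be delimited by
# whitespace or by the start or end of the text.
HEMP_RE = re.compile(rf"(?<!\S){PART}(?:\s+{PART})+[^\w\s]?(?!\S)")

def regex_hemps(text, separator="_"):
    """Return all hemps in text, with their parts joined by separator."""
    return [separator.join(m.group().split()) for m in HEMP_RE.finditer(text)]

for line in ["een H E M P in een zin",
             "een H E M P. in een zin",
             "een H E. M P. in een zin",
             "1 2 3",
             "1 2. 3"]:
    print(repr(line), "->", regex_hemps(line))
```

With this definition the punctuated cases give H_E_M_P. and the pair H_E. and M_P., matching the behaviour described above, while the input 1 2. 3 gives 1_2. only. Accepting 1_2._3 instead would mean allowing punctuation inside a hemp, which is exactly the choice that still needs to be made.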

@kosloot
Contributor

kosloot commented Mar 25, 2020

still waiting for an answer
