folia-idf yielded empty output #70

Open
pirolen opened this issue Jul 14, 2023 · 5 comments

Comments

pirolen commented Jul 14, 2023

I got empty output when running folia-idf in the foliautils container.

/data # FoLiA-idf --class FoLiA-txt  -O myidf bla 
Processed :bla/TRAINING_VALIDATION_SET_Combined_VKS_2_Silvestrovskij_0_01_GT.folia.xml with 0 unique words. still 0 files to go.
start calculating the results
created IDF list 'myidf.idf.tsv'
done: 

kosloot (Contributor) commented Jul 29, 2023

@pirolen it would be helpful if you could send me the file
bla/TRAINING_VALIDATION_SET_Combined_VKS_2_Silvestrovskij_0_01_GT.folia.xml
so I can try to reproduce the problem.

(Does it indeed have words with class=FoLiA-txt?)

pirolen (Author) commented Jul 30, 2023

Here you are.
Yes, it has that text class, since the file was generated using FoLiA-txt.

TRAINING_VALIDATION_SET_Combined_VKS_2_Silvestrovskij_0_01_GT.folia.xml.txt

kosloot (Contributor) commented Jul 30, 2023

AHA.
FoLiA-idf extracts its information from either <w> nodes or <str> nodes.
In this case you have no <w>, but you DO have <str>.
To handle this, you must provide the --strings option:

$ FoLiA-idf -O grr --strings --class=FoLiA-txt IDFbug.xml 
Processed :IDFbug.xml with 5355 unique words. still 0 files to go.
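
To check up front which node type a file contains, something like the following could be used. This is only a minimal sketch and not part of foliautils: it assumes the standard FoLiA namespace `http://ilk.uvt.nl/folia`, the function name `count_token_nodes` is made up for illustration, and the sample document is hand-written rather than taken from the file in this issue.

```python
import xml.etree.ElementTree as ET
from collections import Counter

FOLIA_NS = "http://ilk.uvt.nl/folia"  # FoLiA XML namespace


def count_token_nodes(xml_text: str) -> Counter:
    """Count <w> and <str> elements in a FoLiA document passed as a string.

    Hypothetical helper for illustration; not part of foliautils.
    """
    root = ET.fromstring(xml_text)
    counts = Counter({"w": 0, "str": 0})
    for elem in root.iter():
        if elem.tag == f"{{{FOLIA_NS}}}w":
            counts["w"] += 1
        elif elem.tag == f"{{{FOLIA_NS}}}str":
            counts["str"] += 1
    return counts


# Tiny hand-written sample (an assumption, not the file from this issue):
# it has <str> nodes but no <w> nodes.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
  <text xml:id="doc.text">
    <p xml:id="doc.p.1">
      <str xml:id="doc.p.1.str.1"><t class="FoLiA-txt">hello</t></str>
      <str xml:id="doc.p.1.str.2"><t class="FoLiA-txt">world</t></str>
    </p>
  </text>
</FoLiA>"""

counts = count_token_nodes(SAMPLE)
if counts["w"] == 0 and counts["str"] > 0:
    print("no <w> nodes found; FoLiA-idf will probably need --strings")
```

For a real file, the same function can be fed the file contents, for example `count_token_nodes(open(path, encoding="utf-8").read())`.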

Hope this helps.

pirolen (Author) commented Jul 30, 2023

Brilliant, thanks! Will try out asap, right now I can't...

kosloot (Contributor) commented Jul 31, 2023

Just a thought: I think you would be better off running ucto on this file first.
That will create <w> nodes AND remove unwanted hyphens (and also handle punctuation).

NOTE: by default ucto creates words with <t> nodes in textclass current, so you will need to specify that class when running FoLiA-idf.
FoLiA-idf does not use current as its default (which is a bug, now fixed in Git).
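
To see which text class the word-level <t> nodes actually carry, and so which value to pass to --class, a quick inspection along these lines may help. Again a rough sketch under assumptions: the helper name `word_text_classes` and the sample document are invented for illustration, and the code relies on the FoLiA convention that a <t> without a class attribute belongs to the class current.

```python
import xml.etree.ElementTree as ET
from collections import Counter

FOLIA_NS = "http://ilk.uvt.nl/folia"  # FoLiA XML namespace


def word_text_classes(xml_text: str) -> Counter:
    """Tally the text class of every <t> that is a direct child of a <w>.

    Hypothetical helper for illustration; not part of foliautils or ucto.
    """
    root = ET.fromstring(xml_text)
    classes = Counter()
    for w in root.iter(f"{{{FOLIA_NS}}}w"):
        for t in w.findall(f"{{{FOLIA_NS}}}t"):
            # A missing class attribute means the default class "current".
            classes[t.get("class", "current")] += 1
    return classes


# Hand-written sample (an assumption, not real ucto output): the paragraph
# text is in class FoLiA-txt, the word-level text is in class current.
TOKENIZED_SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
  <text xml:id="doc.text">
    <p xml:id="doc.p.1">
      <t class="FoLiA-txt">hello world</t>
      <w xml:id="doc.p.1.w.1"><t>hello</t></w>
      <w xml:id="doc.p.1.w.2"><t class="current">world</t></w>
    </p>
  </text>
</FoLiA>"""

for text_class, n in word_text_classes(TOKENIZED_SAMPLE).items():
    print(f"{text_class}: {n} word-level <t> nodes")
```

Only <t> nodes directly under <w> are counted, so the paragraph-level text in class FoLiA-txt does not show up in the tally.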
