Duplicate feature names in OSKF and WBKF #4

Open
iris2hu opened this issue May 1, 2022 · 2 comments

iris2hu commented May 1, 2022

Hello, thanks for this great project!

We have recently been trying to reproduce the experimental results in your paper:

Lee, Bruce W., Yoo Sung Jang, and Jason Lee. "Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features." Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.

We found that the OSKF_() method in lingfeat returns exactly the same 16 feature names as WBKF_(). Please see the example below:

```python
from lingfeat import extractor

text = "When you see the word Amazon, what’s the first thing that springs to mind – the world’s biggest forest, the longest river or the largest internet retailer – and which do you consider most important?"
LingFeat = extractor.pass_text(text)
LingFeat.preprocess()

WBKF = LingFeat.WBKF_() # WeeBit Corpus Knowledge Features
OSKF = LingFeat.OSKF_() # OneStopEng Corpus Knowledge Features

print('WeeBit Corpus Knowledge Features:', WBKF)
print('OneStopEng Corpus Knowledge Features:', OSKF)
```

Terminal Output

```
WeeBit Corpus Knowledge Features:  {'BRich05_S': 1.1274421401321888, 'BRich10_S': 4.858168950304389, 'BRich15_S': 20.647890945896506, 'BRich20_S': 21.932124523445964, 'BClar05_S': 0.5823907653490702, 'BClar10_S': 0.718731752038002, 'BClar15_S': 0.7291195740302404, 'BClar20_S': 0.7486800486626832, 'BNois05_S': 1.5104791224775047, 'BNois10_S': 6.548753840448406, 'BNois15_S': 7.018329580783902, 'BNois20_S': 8.321480132061497, 'BTopc05_S': 3, 'BTopc10_S': 10, 'BTopc15_S': 18, 'BTopc20_S': 23}
OneStopEng Corpus Knowledge Features:  {'BRich05_S': 2.9044833183288574, 'BRich10_S': 3.5476092249155045, 'BRich15_S': 9.398028403520584, 'BRich20_S': 14.846967313438654, 'BClar05_S': 0.00015333294868469238, 'BClar10_S': 0.25143229961395264, 'BClar15_S': 0.6553432226181031, 'BClar20_S': 0.7100768367449443, 'BNois05_S': 1.0000004289882432, 'BNois10_S': 1.4495860709293316, 'BNois15_S': 4.214530509499038, 'BNois20_S': 5.500046277858743, 'BTopc05_S': 2, 'BTopc10_S': 3, 'BTopc15_S': 10, 'BTopc20_S': 15}
```

According to Appendix B of the above paper, the feature names in OSKF should start with 'O', e.g. 'ORich05_S', 'ORich10_S', etc.
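A quick way to confirm the collision is to intersect the key sets of the two result dictionaries. The sketch below uses small stand-in dictionaries built from a few of the values printed above, and `colliding_names` is a hypothetical helper, not part of lingfeat.

```python
# Stand-in dictionaries holding a few of the values printed above.
# In practice these would be the results of LingFeat.WBKF_() and LingFeat.OSKF_().
wbkf = {"BRich05_S": 1.1274421401321888, "BClar05_S": 0.5823907653490702, "BTopc05_S": 3}
oskf = {"BRich05_S": 2.9044833183288574, "BClar05_S": 0.00015333294868469238, "BTopc05_S": 2}


def colliding_names(a, b):
    """Return the sorted feature names that appear in both dictionaries."""
    return sorted(set(a) & set(b))


collisions = colliding_names(wbkf, oskf)
print(collisions)
```

With the real outputs, every one of the 16 OSKF names would be expected to show up in this list.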

Because the 16 OSKF names collide with the 16 WBKF names, this bug yields 239 distinct feature names rather than the 255 features described in the paper (255 - 16 = 239). The same count shows up in another open-source project for this paper:

https://github.com/brucewlee/pushingonreadability_traditional_ML

The CSV files in Research_Data include only 239 linguistic features, which we believe is caused by these duplicate feature names.
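The drop from 255 to 239 is what one would expect if the feature groups are merged into a single dictionary or table keyed by name. How the released CSV files were actually produced is an assumption here. The sketch below only illustrates the arithmetic, using placeholder feature names (`feat_0`, `feat_1`, ...) for the 223 features outside the two corpus-knowledge groups.

```python
# Placeholder names for the 223 features outside the two 16-feature groups
# (223 + 16 + 16 = 255). These names are illustrative, not lingfeat's.
other = {f"feat_{i}": 0.0 for i in range(255 - 32)}

# The 16 suffixes shared by both groups: 4 feature kinds x 4 topic settings.
suffixes = [
    f"{kind}{n}_S"
    for kind in ("Rich", "Clar", "Nois", "Topc")
    for n in ("05", "10", "15", "20")
]

wbkf = {"B" + s: 1.0 for s in suffixes}
oskf_buggy = {"B" + s: 2.0 for s in suffixes}  # names as currently returned
oskf_fixed = {"O" + s: 2.0 for s in suffixes}  # names as listed in Appendix B

# Later keys overwrite earlier ones, so the buggy OSKF replaces the WBKF values.
merged_buggy = {**other, **wbkf, **oskf_buggy}
merged_fixed = {**other, **wbkf, **oskf_fixed}

print(len(merged_buggy), len(merged_fixed))  # 239 255
```

Note that in the buggy merge the WBKF values are silently overwritten by the OSKF values, so the lost information is the WeeBit group, not the OneStopEng group.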

brucewlee (Owner) commented

Hi. I sincerely apologize for my late reply and thank you for your interest.

I'm a little busy with EMNLP 2022. I will fix the mistake you pointed out in mid-June.

If you need any other help in reproducing the results, please email me so I can help!

Thanks :)

MarioGalindoQ commented

Hi Bruce,
The solution to this bug is easy.
In the file _AdvancedSemantic/OSKF.py, from line 90 onward, it is necessary to replace:
"BRich" with "ORich", "BClar" with "OClar", "BNois" with "ONois" and "BTopc" with "OTopc"
Obviously you know this, but I wrote the solution to help others.
Thank you.
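
For anyone who cannot patch the installed package, the same renaming can be applied on the caller's side to the dictionary returned by OSKF_(). This is a minimal sketch, not the library's own fix; `PREFIX_MAP` and `rename_oskf_keys` are hypothetical names introduced here.

```python
# Prefix replacements listed in the comment above.
PREFIX_MAP = {"BRich": "ORich", "BClar": "OClar", "BNois": "ONois", "BTopc": "OTopc"}


def rename_oskf_keys(features):
    """Return a copy of `features` with the four B-prefixes replaced by O-prefixes."""
    renamed = {}
    for name, value in features.items():
        for old, new in PREFIX_MAP.items():
            if name.startswith(old):
                name = new + name[len(old):]
                break
        renamed[name] = value
    return renamed


# Stand-in for the dictionary returned by LingFeat.OSKF_().
oskf = {"BRich05_S": 2.9044833183288574, "BTopc20_S": 15}
print(rename_oskf_keys(oskf))
```

Apply the helper only to the OSKF result, since the WBKF names are meant to keep their "B" prefix.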
