Duplicate feature names in OSKF and WBKF #4

iris2hu · 2022-05-01T08:17:16Z

Hello, thanks for this great project!

We have recently been trying to reproduce the experimental results in your paper:

Lee, Bruce W., Yoo Sung Jang, and Jason Lee. "Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features." Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.

We just found that the OSKF method in lingfeat returns exactly the same 16 feature names as WBKF. Please see the example below:

from lingfeat import extractor

text = "When you see the word Amazon, what’s the first thing that springs to mind – the world’s biggest forest, the longest river or the largest internet retailer – and which do you consider most important?"
LingFeat = extractor.pass_text(text)
LingFeat.preprocess()

WBKF = LingFeat.WBKF_() # WeeBit Corpus Knowledge Features
OSKF = LingFeat.OSKF_() # OneStopEng Corpus Knowledge Features

print('WeeBit Corpus Knowledge Features:', WBKF)
print('OneStopEng Corpus Knowledge Features:', OSKF)

Terminal Output

WeeBit Corpus Knowledge Features:  {'BRich05_S': 1.1274421401321888, 'BRich10_S': 4.858168950304389, 'BRich15_S': 20.647890945896506, 'BRich20_S': 21.932124523445964, 'BClar05_S': 0.5823907653490702, 'BClar10_S': 0.718731752038002, 'BClar15_S': 0.7291195740302404, 'BClar20_S': 0.7486800486626832, 'BNois05_S': 1.5104791224775047, 'BNois10_S': 6.548753840448406, 'BNois15_S': 7.018329580783902, 'BNois20_S': 8.321480132061497, 'BTopc05_S': 3, 'BTopc10_S': 10, 'BTopc15_S': 18, 'BTopc20_S': 23}
OneStopEng Corpus Knowledge Features:  {'BRich05_S': 2.9044833183288574, 'BRich10_S': 3.5476092249155045, 'BRich15_S': 9.398028403520584, 'BRich20_S': 14.846967313438654, 'BClar05_S': 0.00015333294868469238, 'BClar10_S': 0.25143229961395264, 'BClar15_S': 0.6553432226181031, 'BClar20_S': 0.7100768367449443, 'BNois05_S': 1.0000004289882432, 'BNois10_S': 1.4495860709293316, 'BNois15_S': 4.214530509499038, 'BNois20_S': 5.500046277858743, 'BTopc05_S': 2, 'BTopc10_S': 3, 'BTopc15_S': 10, 'BTopc20_S': 15}

According to Appendix B of the above paper, the feature names in OSKF should start with 'O', e.g. 'ORich05_S', 'ORich10_S', etc.

This bug yields 239 distinct feature names instead of the 255 features described in the paper (255 - 16 = 239). Accordingly, in another open-source project for this paper:

https://github.com/brucewlee/pushingonreadability_traditional_ML

The CSV files in Research_Data include only 239 linguistic features, which we believe is caused by these duplicate feature names.
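
The drop from 255 to 239 is consistent with how Python dictionaries behave when merged: a later value silently overwrites an earlier one stored under the same key. Here is a minimal sketch using a few values copied from the output above instead of calling lingfeat. The merging step is only an assumption about how the feature table might have been built, not something we confirmed in the project's code.

```python
# Two values per group, copied from the terminal output above (truncated).
wbkf = {"BRich05_S": 1.1274421401321888, "BTopc05_S": 3}
oskf = {"BRich05_S": 2.9044833183288574, "BTopc05_S": 2}

# Assumption: the per-group dicts are merged into one row of features.
merged = {}
merged.update(wbkf)
merged.update(oskf)  # same keys, so the WBKF values are overwritten

n_expected = len(wbkf) + len(oskf)  # features we meant to keep
n_actual = len(merged)              # features that survive the merge
print(n_expected, n_actual)
```

With all 16 names shared between the two groups, the same effect would remove 16 columns.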


brucewlee · 2022-05-29T05:29:26Z

Hi. I sincerely apologize for my late reply and thank you for your interest.

I'm a little busy with EMNLP 2022. I will fix the mistake you pointed out in mid-June.

If you need any other help in reproducing the results, please email me so I can help!

Thanks :)

MarioGalindoQ · 2022-10-06T01:16:31Z

Hi Bruce,
The solution to this bug is easy.
In the file _AdvancedSemantic/OSKF.py, starting from line 90, it is necessary to replace:
"BRich" with "ORich", "BClar" with "OClar", "BNois" with "ONois" and "BTopc" with "OTopc"
Obviously you know this, but I wrote the solution to help others.
Thank you.
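
For anyone who cannot patch the installed package, the same renaming can be applied to the returned dictionary instead. This is only a sketch: rename_oskf_keys is a hypothetical helper, not part of lingfeat, and it assumes the OSKF result is a plain dict keyed by the names shown in the issue.

```python
def rename_oskf_keys(oskf):
    """Return a copy of an OSKF result with the leading 'B' replaced by 'O'.

    Hypothetical helper, not part of lingfeat. Keys that do not start with
    one of the four known stems are left unchanged.
    """
    stems = ("BRich", "BClar", "BNois", "BTopc")
    return {
        ("O" + name[1:] if name.startswith(stems) else name): value
        for name, value in oskf.items()
    }


# Sample values copied from the OSKF output in the issue (truncated).
sample = {"BRich05_S": 2.9044833183288574, "BTopc20_S": 15}
fixed = rename_oskf_keys(sample)
print(sorted(fixed))
```

Apply it only to the OSKF result, since the WBKF names are already correct.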

brucewlee mentioned this issue Oct 18, 2022

Wrong formula for the Coleman–Liau index #6

Open
