This repository has been archived by the owner on Mar 8, 2021. It is now read-only.

remove_punc concatenates words #15

Open
xjlc opened this issue Feb 4, 2015 · 1 comment

xjlc commented Feb 4, 2015

`remove_punc` and `remove_punc2` concatenate some words. For example, "Woodhouse.--Dear" becomes "WoodhouseDear". This arguably makes the results of the later tests questionable. For example, with an implementation of `remove_punc` that replaces each punctuation character with a space and then collapses double spaces into a single space, the count of "Woodhouse" is 314. Similarly, the frequency count of "the" is 5204 rather than 5146.
You are probably aware of this, but I think a cautionary note in the documentation would be warranted.
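
To make the difference concrete, here is a minimal sketch of the two behaviours. The function names `remove_punc_delete` and `remove_punc_space` and the sample string are hypothetical, chosen for illustration; this is not the course's actual `remove_punc` code, only an approximation of what is described above.

```python
import re
import string


def remove_punc_delete(text):
    """Hypothetical stand-in: drop every punctuation character outright."""
    return "".join(ch for ch in text if ch not in string.punctuation)


def remove_punc_space(text):
    """Hypothetical alternative: swap punctuation for a space, then collapse whitespace."""
    spaced = "".join(" " if ch in string.punctuation else ch for ch in text)
    return re.sub(r"\s+", " ", spaced).strip()


sample = "Miss Woodhouse.--Dear Emma"  # made-up sample around the reported token

print(remove_punc_delete(sample))  # Miss WoodhouseDear Emma
print(remove_punc_space(sample))   # Miss Woodhouse Dear Emma

# Word counts differ because the first variant fuses two words into one.
print(remove_punc_delete(sample).split().count("Woodhouse"))  # 0
print(remove_punc_space(sample).split().count("Woodhouse"))   # 1
```

One trade-off of the space-based variant is that it splits contractions ("don't" becomes "don t"), so neither approach is free of side effects.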

fbkarsdorp (Owner) commented

Hi! Thanks for your comments. I'll have a look at this.

@fbkarsdorp fbkarsdorp added the bug label Feb 4, 2015