word_tokenize replaces characters #1630
Comments
Yes, that is the expected output. The double quote characters are changed to explicitly denote opening and closing double quotes. The opening double quote becomes two backticks and the closing double quote becomes two single quotes:
>>> from nltk import word_tokenize
>>> sent = '"this is a sentence inside double quotes."'
>>> word_tokenize(sent)
['``', 'this', 'is', 'a', 'sentence', 'inside', 'double', 'quotes', '.', "''"]
>>> word_tokenize(sent)[0]
'``'
>>> len(word_tokenize(sent)[0])
2
>>> word_tokenize(sent)[0] == '`'*2
True
>>> len(word_tokenize(sent)[-1])
2
>>> word_tokenize(sent)[-1] == "'" * 2
True
I'm not sure what the reason for this behavior is, though. Possibly it is meant to be explicit when identifying opening/closing quotes.
Thanks for the explanation. I guess I'll have to keep it in mind, but I would prefer that the original elements of the string remain the same.
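For anyone who wants the original double quote character back, a minimal post-processing sketch follows. The helper name `restore_double_quotes` is hypothetical and not part of NLTK; the token list is copied from the output shown above. Note that the mapping is lossy if the input text itself already contained two backticks or two single quotes.

```python
def restore_double_quotes(tokens):
    """Map Treebank-style quote tokens back to a plain double quote.

    Hypothetical helper, not an NLTK function. It cannot distinguish
    converted quotes from a literal `` or '' in the original text.
    """
    treebank_quotes = {"``": '"', "''": '"'}
    return [treebank_quotes.get(tok, tok) for tok in tokens]


# Token list taken from the word_tokenize output quoted in this thread.
tokens = ['``', 'this', 'is', 'a', 'sentence', 'inside',
          'double', 'quotes', '.', "''"]
restored = restore_double_quotes(tokens)
print(restored)
```

This only changes the quote tokens; every other token is passed through unchanged.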
@mwess After some checking, the conversion only happens when there are double quotes. The regex rules that do the substitutions are at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L49
And as for the single quotes, the treebank tokenizer
I hope the clarification helps.
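To illustrate the kind of substitution being described, here is a simplified sketch using only the standard `re` module. It is an approximation written for this thread, not the code in treebank.py; the rule lists and the function name `convert_double_quotes` are assumptions made for illustration, and the real tokenizer applies further rules (for example, splitting off the final period).

```python
import re

# Simplified approximation of double-quote handling; NOT NLTK's actual code.
STARTING_QUOTES = [
    (re.compile(r'^"'), r"``"),               # double quote at the very start
    (re.compile(r"(``)"), r" \1 "),           # pad the converted opening quote
    (re.compile(r'([ (\[{<])"'), r"\1 `` "),  # quote after a space or opening bracket
]
ENDING_QUOTES = [
    (re.compile(r'"'), " '' "),               # any remaining double quote closes
]


def convert_double_quotes(text):
    """Apply the opening rules first, then treat what is left as closing."""
    for pattern, replacement in STARTING_QUOTES + ENDING_QUOTES:
        text = pattern.sub(replacement, text)
    return text


sent = '"this is a sentence inside double quotes."'
tokens = convert_double_quotes(sent).split()
print(tokens)

# Text without double quotes passes through unchanged in this sketch.
unchanged = convert_double_quotes("it's 'fine'")
```

Because the opening rules run first, whatever double quote remains afterwards is assumed to be a closing quote, which is why the two ends of a quotation receive different tokens.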
Thank you very much. It actually helps a lot.
Altering the original text is not recommended in many applications. I wish the
When using the word_tokenize function the quotation marks get replaced with different quotation marks.
Example (german):
Is this a bug, or is there a reason behind this behaviour?