word_tokenize replaces characters #1630

Closed
mwess opened this issue Feb 15, 2017 · 5 comments

@mwess
mwess commented Feb 15, 2017

When using the word_tokenize function, the quotation marks get replaced with different quotation marks.

Example (German):

import nltk
sentence = "\"Ja.\""  # sentence[0] == '"'
tokens = nltk.word_tokenize(sentence)  # tokens[0] == '``'
print(tokens[0] == sentence[0])  # Prints False.

Is this a bug or is there a reasoning behind this behaviour?

@alvations
Contributor

alvations commented Feb 15, 2017

Yes, that is the expected output. Double quotes are rewritten to mark opening and closing quotes explicitly: an opening " is converted to two backticks (``) and a closing " to two single quotes ('').

>>> from nltk import word_tokenize
>>> sent = '"this is a sentence inside double quotes."'
>>> word_tokenize(sent)
['``', 'this', 'is', 'a', 'sentence', 'inside', 'double', 'quotes', '.', "''"]
>>> word_tokenize(sent)[0]
'``'

>>> len(word_tokenize(sent)[0])
2
>>> word_tokenize(sent)[0] == '`'*2
True

>>> len(word_tokenize(sent)[-1])
2
>>> word_tokenize(sent)[-1] == "'" * 2
True

I'm not sure what the reason for this behavior is, though. Possibly it's to be explicit about which quotes are opening and which are closing.

@mwess
Author

mwess commented Feb 21, 2017

Thanks for the explanation.
But when I replace the double quotes with one (or two) single quotes or backticks, this behaviour doesn't occur.
I also find it a bit strange that the tokenizer swaps out parts of the original text, since that could lead to problems and is not really transparent.

I guess I'll have to keep it in mind, but I would prefer that the original elements of the string remain the same.

@alvations
Contributor

@mwess After some checking, the conversion from " to `` is an artifact of the original Penn Treebank word tokenizer.

It only happens with double quotes; the regex rules that do the substitutions are at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L49

As for single quotes, the treebank tokenizer's STARTING_QUOTES regexes show that it doesn't indicate directionality for them. I think this is kept to be consistent with Penn Treebank annotations.

I hope the clarification helps.
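
To make the mechanism concrete, here is a minimal sketch of how regex substitutions of this kind can turn double quotes into directional tokens. The patterns below are simplified and hypothetical: they are modeled on the idea, not copied from treebank.py, and normalize_quotes is a made-up helper name that only splits on whitespace afterwards.

```python
import re

# Simplified, hypothetical rules modeled on Penn Treebank-style quote
# normalization. They are NOT the actual patterns from nltk/tokenize/treebank.py.
STARTING_QUOTES = [
    # A double quote at the very start of the text is an opening quote.
    (re.compile(r'^"'), '`` '),
    # A double quote right after a space or an opening bracket is an opening quote.
    (re.compile(r'([ (\[{<])"'), r'\1 `` '),
]
ENDING_QUOTES = [
    # Any double quote still left is treated as a closing quote.
    (re.compile(r'"'), " '' "),
]


def normalize_quotes(text):
    """Apply the substitutions in order, then split on whitespace."""
    for pattern, replacement in STARTING_QUOTES + ENDING_QUOTES:
        text = pattern.sub(replacement, text)
    return text.split()


print(normalize_quotes('"Ja."'))
print(normalize_quotes("'Ja.'"))
```

Because none of these patterns mention a single quote, text that is quoted with single quotes passes through unchanged, which matches what was observed above.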

@mwess
Author

mwess commented May 5, 2017

Thank you very much. It actually helps a lot.

mwess closed this as completed May 5, 2017
@kovvalsky

Altering the original text is not recommended in many applications. I wish word_tokenize had a flag to turn off altering the text.
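
In the absence of such a flag, one possible workaround is to post-process the tokens against the original string. The sketch below is only an illustration: restore_quotes is a hypothetical helper, not part of NLTK, and it assumes the tokens appear in the same order as in the text and that the only alteration is " becoming `` or ''.

```python
def restore_quotes(text, tokens):
    """Map `` and '' tokens back to " where the original text has a double quote.

    Hypothetical helper, not an NLTK API. Tokens that cannot be aligned
    with the text are kept unchanged.
    """
    restored = []
    pos = 0
    for tok in tokens:
        # Skip whitespace between tokens in the original text.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if tok in ("``", "''") and text.startswith('"', pos):
            # The tokenizer rewrote a double quote; put the original back.
            restored.append('"')
            pos += 1
        elif text.startswith(tok, pos):
            # Token matches the text as-is (this includes a literal `` or '').
            restored.append(tok)
            pos += len(tok)
        else:
            # Alignment failed; keep the token and try to resynchronize.
            restored.append(tok)
            idx = text.find(tok, pos)
            if idx != -1:
                pos = idx + len(tok)
    return restored


sentence = '"Ja."'
tokens = ["``", "Ja", ".", "''"]  # what word_tokenize returned in the report above
print(restore_quotes(sentence, tokens))
```

Aligning against the original string, rather than blindly replacing every `` and '' token, keeps any literal backticks or doubled single quotes that were already in the input.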
