word_tokenize replaces characters #1630

mwess · 2017-02-15T10:37:22Z

When using the word_tokenize function, double quotation marks get replaced with different quotation characters.

Example (German):

import nltk
sentence = "\"Ja.\""                   # sentence[0] == '"'
tokens = nltk.word_tokenize(sentence)  # tokens[0] == '``'
print(tokens[0] == sentence[0])        # prints False

Is this a bug, or is there a reason behind this behaviour?


alvations · 2017-02-15T12:01:53Z

Yes, that is the expected output. Double quotes are changed to explicitly denote opening and closing quotes: an opening " is converted to two backticks (``) and a closing " to two single quotes ('').

>>> from nltk import word_tokenize
>>> sent = '"this is a sentence inside double quotes."'
>>> word_tokenize(sent)
['``', 'this', 'is', 'a', 'sentence', 'inside', 'double', 'quotes', '.', "''"]
>>> word_tokenize(sent)[0]
'``'

>>> len(word_tokenize(sent)[0])
2
>>> word_tokenize(sent)[0] == '`'*2
True

>>> len(word_tokenize(sent)[-1])
2
>>> word_tokenize(sent)[-1] == "'" * 2
True

I'm not sure what the reason for this behavior is, though. Possibly it's to be explicit about which quotes are opening and which are closing.

mwess · 2017-02-21T11:45:55Z

Thanks for the explanation.
But when I replace the double quotes with one (or two) single quotes or backticks, this behaviour doesn't occur.
I also think it is a little strange that the tokenizer swaps out parts of the original text, since this could lead to problems and is not really transparent.

I guess I'll have to keep it in mind, but I would prefer that the original elements of the string remain the same.

alvations · 2017-05-05T04:11:58Z

@mwess After some checking, the conversion from " to `` is an artifact of the original Penn Treebank word tokenizer.

It only happens with double quotes; the regex rules that do the substitutions are at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L49

As for single quotes, the STARTING_QUOTES regexes in the treebank tokenizer don't indicate directionality for them. I think this is kept to be consistent with Penn Treebank annotations.

I hope the clarification helps.
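
To make the mechanism concrete, here is a rough sketch of how regex rules of this kind rewrite double quotes. The patterns are simplified for illustration and are not copied from treebank.py; the function name convert_double_quotes is made up for this example, and the sketch only splits on whitespace, so it does not separate other punctuation the way the real tokenizer does.

```python
import re

# Simplified, illustrative rules (assumption: NOT NLTK's actual rule set).
STARTING_QUOTES = [
    (re.compile(r'^"'), '`` '),               # double quote at the very start
    (re.compile(r'([ (\[{<])"'), r'\1`` '),   # double quote after a space or opening bracket
]
ENDING_QUOTES = [
    (re.compile(r'"'), " '' "),               # any double quote left over is treated as closing
]


def convert_double_quotes(text):
    """Rewrite double quotes Treebank-style, then split on whitespace."""
    for pattern, replacement in STARTING_QUOTES:
        text = pattern.sub(replacement, text)
    for pattern, replacement in ENDING_QUOTES:
        text = pattern.sub(replacement, text)
    return text.split()


print(convert_double_quotes('"Ja."'))
print(convert_double_quotes("'Ja.'"))
```

Because the starting rules run first, a double quote that survives them can only be a closing one, which is how direction gets encoded. Single quotes match none of these patterns, so they pass through unchanged.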

mwess · 2017-05-05T12:22:09Z

Thank you very much. It actually helps a lot.

kovvalsky · 2020-04-12T14:19:39Z

Altering the original text is not recommended in many applications. I wish word_tokenize had a flag to turn off altering the text.
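
Until something like that exists, one possible workaround is to map the converted tokens back onto the original string after tokenizing. The sketch below is only an illustration: restore_double_quotes is a hypothetical helper, not part of NLTK, and it assumes the quote conversion is the only change the tokenizer made to the text. It takes a plain token list, so it runs without NLTK installed.

```python
def restore_double_quotes(tokens, original):
    """Walk tokens and original text in parallel; put '"' back where it was.

    Hypothetical helper (not an NLTK function). Raises ValueError if a token
    cannot be aligned with the original text.
    """
    restored = []
    pos = 0
    for tok in tokens:
        # Skip whitespace between tokens in the original text.
        while pos < len(original) and original[pos].isspace():
            pos += 1
        if tok in ("``", "''") and original.startswith('"', pos):
            # The tokenizer rewrote a double quote here; restore it.
            restored.append('"')
            pos += 1
        elif original.startswith(tok, pos):
            # Token appears verbatim in the original (including literal `` or '').
            restored.append(tok)
            pos += len(tok)
        else:
            raise ValueError("cannot align token %r at offset %d" % (tok, pos))
    return restored


# Token list as word_tokenize would return it for this input, per the thread above.
tokens = ['``', 'Ja', '.', "''"]
print(restore_double_quotes(tokens, '"Ja."'))
```

Aligning against the original text, rather than blindly replacing every `` and '' token, keeps literal backticks or doubled single quotes intact when the input really contained them.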

mwess closed this as completed May 5, 2017
