
span_tokenize failed when sentence contains double quotation #1750

Closed
albertauyeung opened this issue Jun 12, 2017 · 16 comments · Fixed by #2877

@albertauyeung
Contributor

If we feed a sentence containing double quotation marks into TreebankWordTokenizer's span_tokenize function, an error is raised. This is probably because the function passes the raw input string, along with the tokenized string, to the align_tokens function without considering that the tokenize function replaces the double quotation marks with something else.
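
For context, the tokenize function rewrites double quotation marks into Treebank-style quote tokens, so they no longer appear verbatim in the raw string that align_tokens searches. A small illustration (not from the original report; the exact output may vary slightly between NLTK versions):

from nltk.tokenize import TreebankWordTokenizer

tokens = TreebankWordTokenizer().tokenize('He said "hello" to me.')
# The opening " becomes `` and the closing " becomes '', so these tokens
# cannot be found by a substring search in the original text:
print(tokens)  # something like ['He', 'said', '``', 'hello', "''", 'to', 'me', '.']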

@alvations added the bug label Jun 12, 2017
@alvations
Contributor

alvations commented Jun 12, 2017

Thanks @albertauyeung for reporting the issue. Do you have an example where you hit the error with TreebankWordTokenizer.span_tokenize()?

Do you mean something like this?

>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> tbw = TreebankWordTokenizer()
>>> s = '''This is a sentence with "quotes inside" and alsom some 'single quotes', etc.'''
>>> print(s)
This is a sentence with "quotes inside" and alsom some 'single quotes', etc.
>>> tbw.span_tokenize(s)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/util.py", line 230, in align_tokens
    start = sentence.index(token, point)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/treebank.py", line 167, in span_tokenize
    return align_tokens(tokens, text)
  File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/util.py", line 232, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "``" not found in "This is a sentence with "quotes inside" and alsom some 'single quotes', etc."

Suboptimal solution:

>>> s = '''This is a sentence with `` quotes inside '' and alsom some 'single quotes', etc.''' 
>>> tbw.span_tokenize(s)
[(0, 4), (5, 7), (8, 9), (10, 18), (19, 23), (24, 26), (27, 33), (34, 40), (41, 43), (44, 47), (48, 53), (54, 58), (59, 66), (67, 73), (73, 74), (74, 75), (76, 79), (79, 80)]

@albertauyeung
Contributor Author

@alvations Yes, that's the exact error I got. Right now it seems we have to preprocess the sentence before passing it to span_tokenize.

@alvations
Contributor

alvations commented Jun 12, 2017

A simple solution would be to map the converted quote tokens back to plain double quotes before calling the nltk.tokenize.util.align_tokens function at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L147

    def span_tokenize(self, text):
        tokens = self.tokenize(text)
        # Map the Treebank quote tokens (`` and '') back to " so that
        # align_tokens can find each token verbatim in the raw text.
        tokens = ['"' if tok in ['``', "''"] else tok for tok in tokens]
        return align_tokens(tokens, text)

After the patch:

>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> tbw = TreebankWordTokenizer()
>>> s = '''This is a sentence with "quotes inside" and alsom some 'single quotes', etc.'''
>>> print(s)
This is a sentence with "quotes inside" and alsom some 'single quotes', etc.
>>> tbw.span_tokenize(s)
[(0, 4), (5, 7), (8, 9), (10, 18), (19, 23), (24, 25), (25, 31), (32, 38), (38, 39), (40, 43), (44, 49), (50, 54), (55, 62), (63, 69), (69, 70), (70, 71), (72, 75), (75, 76)]

@albertauyeung do you want to take a stab at adding the patch and creating a new pull request?

@albertauyeung
Contributor Author

@alvations Yes, sure. Will do!

@alvations
Contributor

Fixed in #1751

@alyaxey

alyaxey commented Aug 11, 2017

Please note that this fix will still throw an exception for a text with both types of quotes:
nltk.TreebankWordTokenizer().span_tokenize('" ``')

@albertauyeung
Contributor Author

Hi @alyaxey, what is the exception you see?

I executed nltk.TreebankWordTokenizer().span_tokenize('" ``') and got the following:
[(0, 1), (2, 4)]

@alyaxey

alyaxey commented Aug 13, 2017

Sorry, I provided the wrong test case. Please take a look at this one:

import nltk
print(nltk.TreebankWordTokenizer().span_tokenize('``` "'))

The expected output is [(0, 2), (2, 3), (4, 5)] if we follow the logic of the current tokenize method; [(0, 3), (4, 5)] would also be acceptable.
Here's my output on the develop branch:

Traceback (most recent call last):
  File "/Users/alyaxey/Downloads/nltk-develop/nltk/tokenize/util.py", line 254, in align_tokens
    start = sentence.index(token, point)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    print(nltk.TreebankWordTokenizer().span_tokenize('``` "'))
  File "/Users/alyaxey/Downloads/nltk-develop/nltk/tokenize/treebank.py", line 179, in span_tokenize
    return align_tokens(tokens, text)
  File "/Users/alyaxey/Downloads/nltk-develop/nltk/tokenize/util.py", line 256, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "`" not found in "``` ""

I'd like to suggest a different solution that would 1) fix this and similar bugs, 2) give users more flexibility, and 3) make the code clearer. We could add a boolean parameter to the tokenize method that enables or disables the quote transformation, and disable that transformation inside span_tokenize so that the only manipulation performed is the insertion of spaces.
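
A minimal, self-contained sketch of the idea using a toy tokenizer rather than NLTK's actual classes (the convert_quotes parameter name and the simplistic regex are illustrative assumptions, not NLTK's API):

import re

def toy_tokenize(text, convert_quotes=True):
    # Naive split into quote tokens, word runs, or single non-space
    # characters, just to illustrate the flag.
    tokens = re.findall(r"``|''|\"|\w+|\S", text)
    if not convert_quotes:
        return tokens
    # Treebank-style rewrite: opening " becomes ``, closing " becomes ''.
    out, opened = [], False
    for tok in tokens:
        if tok == '"':
            out.append("''" if opened else "``")
            opened = not opened
        else:
            out.append(tok)
    return out

def toy_span_tokenize(text):
    # With the rewrite disabled, every token occurs verbatim in the text,
    # so spans can be recovered with a simple left-to-right search.
    spans, point = [], 0
    for tok in toy_tokenize(text, convert_quotes=False):
        start = text.index(tok, point)
        spans.append((start, start + len(tok)))
        point = start + len(tok)
    return spans

print(toy_tokenize('He said "hi" to me.'))
print(toy_span_tokenize('He said "hi" to me.'))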

@tholor

tholor commented May 23, 2018

I am running into an exception with the current version of span_tokenize for strings that contain brackets before quotation marks. I believe the regex is wrong since it also matches brackets and later replaces quotation marks in the "raw_tokens" with those brackets. Or am I missing something?

Example:

from nltk.tokenize import TreebankWordTokenizer

s = ' ( see 6)  Biotin " " affinity'
w_spans = TreebankWordTokenizer().span_tokenize(s)

Exception:

...
  File "/home/mp/miniconda3/envs/py36/lib/python3.6/site-packages/nltk/tokenize/treebank.py", line 179, in span_tokenize
    return align_tokens(tokens, text)
  File "/home/mp/miniconda3/envs/py36/lib/python3.6/site-packages/nltk/tokenize/util.py", line 256, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "(" not found in " ( see 6)  Biotin " " affinity"

Suggested fix:
Change the regex in span_tokenize from r'[(``)(\'\')(")]+' to r'(``)|(\'\')|(")'
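
A quick check of the two patterns on the example above shows the difference (illustrative snippet, not NLTK code): the character class also matches the parentheses, while the alternation matches only the quotes.

import re

s = ' ( see 6)  Biotin " " affinity'

# The character class treats ( ) ` ' " as individual characters,
# so it also matches the parentheses in the raw text:
print([m.group() for m in re.finditer(r'[(``)(\'\')(")]+', s)])   # ['(', ')', '"', '"']

# The alternation matches only the quote sequences themselves:
print([m.group() for m in re.finditer(r'(``)|(\'\')|(")', s)])    # ['"', '"']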

@tholor

tholor commented May 23, 2018

Ok, my bad. It has actually already been fixed in commit 4b21300 and is working like a charm in nltk 3.3.

@fseasy

fseasy commented Oct 23, 2018

Oh, it still has a problem in nltk 3.3.

Like this:

File "/home/users/----/.miniconda2/lib/python2.7/site-packages/nltk/tokenize/util.py", line 258, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "''" not found in "''Elton's been through a lot," he told The Sun newspaper."

@albertauyeung
Contributor Author

@MeMeDa I confirm I can reproduce this bug. A solution is to add one more regex to match single quotes at the beginning of a string. Please see my branch on https://github.com/albertauyeung/nltk/tree/hotfix-span-tokenizer
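
As a rough illustration of that kind of extra pattern (a hypothetical example, not the actual code in the branch): a '' token at the very start of the string should also be treated as a candidate double quote when re-aligning spans.

import re

# Hypothetical pattern, for illustration only: match '' either at the
# start of the string or after whitespace.
pattern = re.compile(r"^''|(?<=\s)''")
s = "''Elton's been through a lot,\" he told The Sun newspaper."
print(pattern.findall(s))  # ["''"]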

@zzj0402

zzj0402 commented Apr 14, 2020

Confirmed:

raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "enriched" not found in "The Hindu describing his Cricket, once said: `` His batting resembles very closely that of his father -dashing and carefree -and his cover-drive, a joy to watch, has amazing impetus...''And it added that he had ``enriched Madras sport as his father had''."

@wadimiusz
Copy link

wadimiusz commented Jan 28, 2021

Hi, I also experienced this bug, e.g. it happens on the following text:

''Cosita Linda' - Lisandro (2013)\n\"El Clon (2010) .... Mohammed

The resulting error is as follows:

ValueError: substring "''" not found in "''Cosita Linda' - Lisandro (2013)
"El Clon (2010) .... Mohammed"

Are there any updates on this problem?

@Toz3wm

Toz3wm commented Nov 3, 2021

I experienced this same crash today. Reproducible example:

from nltk import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
list(tokenizer.span_tokenize('\'\'Y\'"'))

yields
ValueError: substring "''" not found in "''Y'""

@tomaarsen
Member

@Toz3wm Thank you for the reproducible example, and for pointing me to this issue. I've submitted a PR which ought to solve this. Perhaps in the meantime you can use tokenizer.tokenize() (which currently results in ["''Y", "'", "''"]), and then use the lengths of each of the tokens to get the expected outcome of [(0, 3), (3, 4), (4, 5)].

If this PR gets merged, then the tokenized output might become: ["''", 'Y', "'", "''"], with a span of [(0, 2), (2, 3), (3, 4), (4, 5)]. That said, I'm not expecting the PR to be merged in its current state yet.
