
span_tokenize failed when sentence contains double quotation #1750

Closed
albertauyeung opened this issue Jun 12, 2017 · 16 comments · Fixed by #2877

@albertauyeung
Contributor

If we feed a sentence containing double quotation marks into TreebankWordTokenizer's span_tokenize function, an error is raised. This is probably because the function passes the raw input string, along with the tokenized string, to the align_tokens function without considering that the tokenize function replaces the double quotation marks with something else.
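
For context, the tokenize function rewrites double quotation marks into Treebank-style quote tokens, so they no longer appear verbatim in the raw string that align_tokens searches. A small illustration (not from the original report; the exact output may vary slightly between NLTK versions):

from nltk.tokenize import TreebankWordTokenizer

tokens = TreebankWordTokenizer().tokenize('He said "hello" to me.')
# The opening " becomes `` and the closing " becomes '', so these tokens
# cannot be found by a substring search in the original text:
print(tokens)  # something like ['He', 'said', '``', 'hello', "''", 'to', 'me', '.']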

@alvations added the bug label Jun 12, 2017
@alvations
Contributor

alvations commented Jun 12, 2017

Thanks @albertauyeung for reporting the issue. Do you have an example where you hit the error with TreebankWordTokenizer.span_tokenize()?

Do you mean something like this?

>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> tbw = TreebankWordTokenizer()
>>> s = '''This is a sentence with "quotes inside" and alsom some 'single quotes', etc.'''
>>> print(s)
This is a sentence with "quotes inside" and alsom some 'single quotes', etc.
>>> tbw.span_tokenize(s)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/util.py", line 230, in align_tokens
    start = sentence.index(token, point)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/treebank.py", line 167, in span_tokenize
    return align_tokens(tokens, text)
  File "/usr/local/lib/python3.5/site-packages/nltk/tokenize/util.py", line 232, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "``" not found in "This is a sentence with "quotes inside" and alsom some 'single quotes', etc."

Suboptimal solution:

>>> s = '''This is a sentence with `` quotes inside '' and alsom some 'single quotes', etc.''' 
>>> tbw.span_tokenize(s)
[(0, 4), (5, 7), (8, 9), (10, 18), (19, 23), (24, 26), (27, 33), (34, 40), (41, 43), (44, 47), (48, 53), (54, 58), (59, 66), (67, 73), (73, 74), (74, 75), (76, 79), (79, 80)]

@albertauyeung
Contributor Author

@alvations Yes, that's the exact error I got. Right now it seems we have to preprocess the sentence before passing it to span_tokenize.

@alvations
Contributor

alvations commented Jun 12, 2017

A simple solution would be to map the converted quote tokens back to plain double quotes before calling the nltk.tokenize.util.align_tokens function at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L147

    def span_tokenize(self, text):
        tokens = self.tokenize(text)
        # Map the Treebank quote tokens (`` and '') back to " so that
        # align_tokens can find each token verbatim in the raw text.
        tokens = ['"' if tok in ['``', "''"] else tok for tok in tokens]
        return align_tokens(tokens, text)

After the patch:

>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> tbw = TreebankWordTokenizer()
>>> s = '''This is a sentence with "quotes inside" and alsom some 'single quotes', etc.'''
>>> print(s)
This is a sentence with "quotes inside" and alsom some 'single quotes', etc.
>>> tbw.span_tokenize(s)
[(0, 4), (5, 7), (8, 9), (10, 18), (19, 23), (24, 25), (25, 31), (32, 38), (38, 39), (40, 43), (44, 49), (50, 54), (55, 62), (63, 69), (69, 70), (70, 71), (72, 75), (75, 76)]

@albertauyeung do you want to take a stab at adding the patch and creating a new pull request?

@albertauyeung
Contributor Author

@alvations Yes, sure. Will do!

@alvations
Contributor

Fixed in #1751

@alyaxey

alyaxey commented Aug 11, 2017

Please note that this fix will still throw an exception for a text with both types of quotes:
nltk.TreebankWordTokenizer().span_tokenize('" ``')

@albertauyeung
Contributor Author

Hi @alyaxey, what is the exception you see?

I executed nltk.TreebankWordTokenizer().span_tokenize('" ``') and got the following:
[(0, 1), (2, 4)]

@alyaxey

alyaxey commented Aug 13, 2017

Sorry, I provided the wrong test case. Please take a look at this one:

import nltk
print(nltk.TreebankWordTokenizer().span_tokenize('``` "'))

The expected output is [(0, 2), (2, 3), (4, 5)] if we follow the logic of the current tokenize method; [(0, 3), (4, 5)] would also be acceptable.
Here's my output on the develop branch:

Traceback (most recent call last):
  File "/Users/alyaxey/Downloads/nltk-develop/nltk/tokenize/util.py", line 254, in align_tokens
    start = sentence.index(token, point)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    print(nltk.TreebankWordTokenizer().span_tokenize('``` "'))
  File "/Users/alyaxey/Downloads/nltk-develop/nltk/tokenize/treebank.py", line 179, in span_tokenize
    return align_tokens(tokens, text)
  File "/Users/alyaxey/Downloads/nltk-develop/nltk/tokenize/util.py", line 256, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "`" not found in "``` ""

I'd like to suggest a different solution that would 1) fix this and similar bugs, 2) give users more flexibility, and 3) make the code clearer. We could add a boolean parameter to the tokenize method that enables or disables the quote transformation, and disable that transformation inside span_tokenize so that the only manipulation performed is the insertion of spaces.
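
A minimal, self-contained sketch of the idea using a toy tokenizer rather than NLTK's actual classes (the convert_quotes parameter name and the simplistic regex are illustrative assumptions, not NLTK's API):

import re

def toy_tokenize(text, convert_quotes=True):
    # Naive split into quote tokens, word runs, or single non-space
    # characters, just to illustrate the flag.
    tokens = re.findall(r"``|''|\"|\w+|\S", text)
    if not convert_quotes:
        return tokens
    # Treebank-style rewrite: opening " becomes ``, closing " becomes ''.
    out, opened = [], False
    for tok in tokens:
        if tok == '"':
            out.append("''" if opened else "``")
            opened = not opened
        else:
            out.append(tok)
    return out

def toy_span_tokenize(text):
    # With the rewrite disabled, every token occurs verbatim in the text,
    # so spans can be recovered with a simple left-to-right search.
    spans, point = [], 0
    for tok in toy_tokenize(text, convert_quotes=False):
        start = text.index(tok, point)
        spans.append((start, start + len(tok)))
        point = start + len(tok)
    return spans

print(toy_tokenize('He said "hi" to me.'))
print(toy_span_tokenize('He said "hi" to me.'))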

@tholor

tholor commented May 23, 2018

I am running into an exception with the current version of span_tokenize for strings that contain brackets before quotation marks. I believe the regex is wrong since it also matches brackets and later replaces quotation marks in the "raw_tokens" with those brackets. Or am I missing something?

Example:

from nltk.tokenize import TreebankWordTokenizer

s = ' ( see 6)  Biotin " " affinity'
w_spans = TreebankWordTokenizer().span_tokenize(s)

Exception:

...
  File "/home/mp/miniconda3/envs/py36/lib/python3.6/site-packages/nltk/tokenize/treebank.py", line 179, in span_tokenize
    return align_tokens(tokens, text)
  File "/home/mp/miniconda3/envs/py36/lib/python3.6/site-packages/nltk/tokenize/util.py", line 256, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "(" not found in " ( see 6)  Biotin " " affinity"

Suggested fix:
Change the regex in span_tokenize from r'[(``)(\'\')(")]+' to r'(``)|(\'\')|(")'
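
A quick check of the two patterns on the example above shows the difference (illustrative snippet, not NLTK code): the character class also matches the parentheses, while the alternation matches only the quotes.

import re

s = ' ( see 6)  Biotin " " affinity'

# The character class treats ( ) ` ' " as individual characters,
# so it also matches the parentheses in the raw text:
print([m.group() for m in re.finditer(r'[(``)(\'\')(")]+', s)])   # ['(', ')', '"', '"']

# The alternation matches only the quote sequences themselves:
print([m.group() for m in re.finditer(r'(``)|(\'\')|(")', s)])    # ['"', '"']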

@tholor

tholor commented May 23, 2018

Ok, my bad. It has actually already been fixed in commit 4b21300 and is working like a charm in nltk 3.3.

@fseasy

fseasy commented Oct 23, 2018

Oh, it still has a problem in nltk 3.3.

Like this:

File "/home/users/----/.miniconda2/lib/python2.7/site-packages/nltk/tokenize/util.py", line 258, in align_tokens
    raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "''" not found in "''Elton's been through a lot," he told The Sun newspaper."

@albertauyeung
Contributor Author

@MeMeDa I confirm I can reproduce this bug. A solution is to add one more regex to match single quotes at the beginning of a string. Please see my branch on https://github.com/albertauyeung/nltk/tree/hotfix-span-tokenizer
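
As a rough illustration of that kind of extra pattern (a hypothetical example, not the actual code in the branch): a '' token at the very start of the string should also be treated as a candidate double quote when re-aligning spans.

import re

# Hypothetical pattern, for illustration only: match '' either at the
# start of the string or after whitespace.
pattern = re.compile(r"^''|(?<=\s)''")
s = "''Elton's been through a lot,\" he told The Sun newspaper."
print(pattern.findall(s))  # ["''"]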

@zzj0402

zzj0402 commented Apr 14, 2020

Confirmed:

raise ValueError('substring "{}" not found in "{}"'.format(token, sentence))
ValueError: substring "enriched" not found in "The Hindu describing his Cricket, once said: `` His batting resembles very closely that of his father -dashing and carefree -and his cover-drive, a joy to watch, has amazing impetus...''And it added that he had ``enriched Madras sport as his father had''."

@wadimiusz
Copy link

wadimiusz commented Jan 28, 2021

Hi, I also experienced this bug, e.g. it happens on the following text:

''Cosita Linda' - Lisandro (2013)\n\"El Clon (2010) .... Mohammed

The resulting error is as follows:

ValueError: substring "''" not found in "''Cosita Linda' - Lisandro (2013)
"El Clon (2010) .... Mohammed"

Are there any updates on this problem?

@Toz3wm

Toz3wm commented Nov 3, 2021

I experienced this same crash today. Reproducible example:

from nltk import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
list(tokenizer.span_tokenize('\'\'Y\'"'))

yields
ValueError: substring "''" not found in "''Y'""

@tomaarsen
Member

@Toz3wm Thank you for the reproducible example, and for pointing me to this issue. I've submitted a PR which ought to solve this. Perhaps in the meantime you can use tokenizer.tokenize() (which currently results in ["''Y", "'", "''"]), and then use the lengths of each of the tokens to get the expected outcome of [(0, 3), (3, 4), (4, 5)].

If this PR gets merged, then the tokenized output might become: ["''", 'Y', "'", "''"], with a span of [(0, 2), (2, 3), (3, 4), (4, 5)]. That said, I'm not expecting the PR to be merged in its current state yet.
