CoreNLPParser tag() should allow properties overloading #2112

alvations · 2018-09-10T07:50:26Z

With the current CoreNLPParser.tag(), the "retokenization" by Stanford CoreNLP is unexpected:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'),
 ('phone', 'O'),
 ('number', 'O'),
 ('is', 'O'),
 ('1111\xa01111\xa01111', 'NUMBER')]

The expected behavior should be:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> sent = ['my', 'phone', 'number', 'is', '1111', '1111', '1111']
>>> ner_tagger.tag(sent)
[('my', 'O'), ('phone', 'O'), ('number', 'O'), ('is', 'O'), ('1111', 'DATE'), ('1111', 'DATE'), ('1111', 'DATE')]

Proposed solution is to allow properties arguments overloading for .tag() and .tag_sents(), i.e. at https://github.com/nltk/nltk/blob/develop/nltk/parse/corenlp.py#L348 and by default use properties = {'tokenize.whitespace':'true'} because we are concatenating the tokens by spaces in tag_sents().

    def tag_sents(self, sentences, properties=None):
        """
        Tag multiple sentences.

        Takes multiple sentences as a list where each sentence is a list of
        tokens.

        :param sentences: Input sentences to tag
        :type sentences: list(list(str))
        :rtype: list(list(tuple(str, str))
        """
        # Converting list(list(str)) -> list(str)
        sentences = (' '.join(words) for words in sentences)
        if properties == None:
            properties = {'tokenize.whitespace':'true'}
        return [sentences[0] for sentences in self.raw_tag_sents(sentences, properties)]

    def tag(self, sentence, properties=None):
        """
        Tag a list of tokens.

        :rtype: list(tuple(str, str))

        >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
        >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
        >>> parser.tag(tokens)
        [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
        ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

        >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
        >>> tokens = "What is the airspeed of an unladen swallow ?".split()
        >>> parser.tag(tokens)
        [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
        ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
        ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
        """
        return self.tag_sents([sentence], properties)[0]

    def raw_tag_sents(self, sentences, properties=None):
        """
        Tag multiple sentences.

        Takes multiple sentences as a list where each sentence is a string.

        :param sentences: Input sentences to tag
        :type sentences: list(str)
        :rtype: list(list(list(tuple(str, str)))
        """
        default_properties = {'ssplit.isOneSentence': 'true',
                              'annotators': 'tokenize,ssplit,' }

        default_properties.update(properties or {})

        # Supports only 'pos' or 'ner' tags.
        assert self.tagtype in ['pos', 'ner']
        default_properties['annotators'] += self.tagtype
        for sentence in sentences:
            tagged_data = self.api_call(sentence, properties=default_properties)
            yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
                    for tagged_sentence in tagged_data['sentences']]

That should enforce the list of string tokens input by the users.

Details on https://stackoverflow.com/questions/52250268/why-do-corenlp-ner-tagger-and-ner-tagger-join-the-separated-numbers-together

If we allow the .tag() to overload the properties before the raw_tag_sents, that'll also allow users to easily handle cases like #1876

The text was updated successfully, but these errors were encountered:

dimazest · 2018-09-10T12:51:41Z

Looks good.

Just some minor comments. It should be if properties is None, not if properties == None. assert self.tagtype in ['pos', 'ner'] should be assert self.tagtype in ['pos', 'ner'], "CoreNLP tagger supports only 'pos' or 'ner' tags.".

I don't really like idea of joining and splitting strings, maybe there could be a way to pass a list of words to CoreNLP as a sentence instead of a simple string.

rishabhkumar296 · 2019-01-06T19:25:51Z

Hi, I would like to take this up as my first issue.

dimazest · 2019-01-07T20:16:28Z

It's great you are interested in the issue. If you have any questions, ask them here.

BatMrE · 2021-08-26T11:32:59Z

Hi @dimazest @alvations if its currently open can I get involved with contributing to it?

dimazest · 2021-08-26T16:05:21Z

Sure! A good first step is to open a pull request.

BatMrE · 2021-08-29T12:30:18Z

Hi @dimazest @alvations if have created a PR
https://github.com/nltk/nltk/pull/2789
Can you please check for any changes

alvations added good first issue bug stanford api labels Sep 10, 2018

BatMrE mentioned this issue Aug 29, 2021

Allowed properties overloading for CoreNLPParser tag, Authors.md #2789

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CoreNLPParser tag() should allow properties overloading #2112

CoreNLPParser tag() should allow properties overloading #2112

alvations commented Sep 10, 2018 •

edited

dimazest commented Sep 10, 2018

rishabhkumar296 commented Jan 6, 2019

dimazest commented Jan 7, 2019

BatMrE commented Aug 26, 2021

dimazest commented Aug 26, 2021

BatMrE commented Aug 29, 2021

CoreNLPParser tag() should allow properties overloading #2112

CoreNLPParser tag() should allow properties overloading #2112

Comments

alvations commented Sep 10, 2018 • edited

dimazest commented Sep 10, 2018

rishabhkumar296 commented Jan 6, 2019

dimazest commented Jan 7, 2019

BatMrE commented Aug 26, 2021

dimazest commented Aug 26, 2021

BatMrE commented Aug 29, 2021

alvations commented Sep 10, 2018 •

edited