Deprecating the old Stanford Parser #1843

Merged: 11 commits into nltk:develop, Nov 26, 2017

Conversation

@artiemq (Contributor) commented Oct 2, 2017

Worked on issue #1839.

Also refactored the CoreNLPTokenizer and CoreNLPTagger tests.

@dimazest (Contributor) left a comment

I'm not sure it's worth creating a drop-in replacement for the Stanford parser, mostly because the API is incompatible: CoreNLPParser requires a running CoreNLPServer.

Deprecation warnings are useful!

@@ -281,6 +282,15 @@ class StanfordParser(GenericStanfordParser):

_OUTPUT_FORMAT = 'penn'

def __init__():
warnings.simplefilter('always', DeprecationWarning)

I'm not sure whether warnings.simplefilter should be called here. It should be up to the user to decide which warnings are shown.

"be deprecated\n"
"Please use \033[91mnltk.parse.corenlp.CoreNLPParser\033[0m instead.'"),
DeprecationWarning, stacklevel=2)
warnings.simplefilter('ignore', DeprecationWarning)

Same here: what if the user has set a custom filter?
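A minimal sketch of the alternative being suggested: emit the DeprecationWarning with warnings.warn and leave the global filters untouched, so a user's own filter configuration survives. The class name and message here are illustrative, not the actual NLTK code:

```python
import warnings

class StanfordParser:
    """Illustrative stand-in, not the real NLTK class."""

    def __init__(self):
        # Emit the warning, but do not call warnings.simplefilter():
        # whether deprecation warnings are shown stays under the
        # user's control.
        warnings.warn(
            "StanfordParser will be deprecated; "
            "please use nltk.parse.corenlp.CoreNLPParser instead.",
            DeprecationWarning,
            stacklevel=2,
        )

# A user who wants to see the warning opts in explicitly:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    StanfordParser()
assert caught[0].category is DeprecationWarning
```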


is str("...") really needed? would "..." be enough?


It looks like there is an unmatched '.

@@ -23,6 +23,7 @@
from nltk.parse.api import ParserI
from nltk.parse.dependencygraph import DependencyGraph
from nltk.tree import Tree
from nltk.parse.corenlp import CoreNLPParser, CoreNLPDependencyParser

I would do this:

from nltk.parse import corenlp

and later refer to CoreNLPParser and CoreNLPDependencyParser as corenlp.CoreNLPParser and corenlp.CoreNLPDependencyParser.

@@ -400,6 +426,76 @@ def _make_tree(self, result):
return DependencyGraph(result, top_relation_label='ROOT')


class CoreNLPParser(CoreNLPParser):

With a new import this becomes:

class CoreNLPParser(corenlp.CoreNLPParser):

I suggest not using CoreNLPParser as the name, as it's confusing.

@@ -228,6 +228,7 @@ def _execute(self, cmd, input_, verbose=False):

return stdout


class StanfordParser(GenericStanfordParser):

Actually, why not rewrite the implementation of StanfordParser so that it uses the CoreNLP parser? But then one needs to be careful with the CoreNLPServer.

@artiemq (Contributor, Author) commented Oct 3, 2017

StanfordTagger and StanfordTokenizer have been deprecated in the same way. If you think it's really not worth replacing the StanfordParser, should I rewrite the tagger and tokenizer too?

@dimazest (Contributor) commented Oct 3, 2017

Right, those are a tagger and a tokenizer, not a parser. I would deprecate the Stanford parser and suggest using corenlp.CoreNLPParser instead.

@alvations what do you think?

@alvations (Contributor)

Deprecating the StanfordParser and leaving the old interface in place while advising users to use CoreNLPParser is a good idea. That way, code on the users' side doesn't break if they decide to keep using the old Stanford tools without CoreNLP.

But we have to ensure that it behaves as a ParserI would; it should, since the inheritance tree is: =)

> ParserI, TokenizerI
    > GenericCoreNLPParser
        > CoreNLPParser
        > CoreNLPDependencyParser

Maybe something like this:

/nltk
    /parse
        /corenlp
            GenericCoreNLPParser
            CoreNLPParser
            CoreNLPDependencyParser
            CoreNLPNeuralDependencyParser
        /stanford
            GenericStanfordParser
            StanfordParser
            StanfordDependencyParser
            StanfordNeuralDependencyParser
    /tag
        /stanford
            StanfordTagger
            StanfordPOSTagger
            StanfordNERTagger
            CoreNLPTagger
            CoreNLPPOSTagger
            CoreNLPNERTagger
    /tokenize
        /stanford
            StanfordTokenizer
            CoreNLPTokenizer
        /stanford_segmenter
            StanfordSegmenter
@dimazest (Contributor) commented Oct 4, 2017

Another way is to make CoreNLPParser implement all those interfaces and get rid of CoreNLPTokenizer and CoreNLP*Tagger. The reasoning is that we provide an interface to a single tool that is a tagger, a parser, and a tokenizer.

@artiemq (Contributor, Author) commented Oct 6, 2017

What should I do with the different return types? The original CoreNLPParser.tokenize() returns a generator of strings, but StanfordTokenizer.tokenize() returns a list. Should I create something like CoreNLPParser.stanford_tokenize() that returns a list, or just leave the tokenizer part as is?

Also, StanfordParser and CoreNLPParser have different method names for parsing: StanfordParser has parse(), but CoreNLPParser has raw_parse().

@dimazest (Contributor) commented Oct 6, 2017

@artiemq a list and a generator are both iterables, so I would not pay attention to the difference. Personally, I prefer a generator, because then it's up to the user to decide which container to use (a list, a set, or none at all, just iterating over the generator). I would not add .stanford_tokenize(), as it would clutter the already quite mysterious API.
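A tiny illustration of the point about generators (a toy tokenizer, not the NLTK one): yielding tokens leaves the choice of container entirely to the caller:

```python
def tokenize(text):
    """Toy tokenizer: yields tokens one at a time."""
    for token in text.split():
        yield token

assert list(tokenize('a b a')) == ['a', 'b', 'a']   # caller wants a list
assert set(tokenize('a b a')) == {'a', 'b'}         # caller wants a set
```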

.raw_parse() is defined only on StanfordParser, and consequently in CoreNLP, because in the beginning it was meant to be a drop-in replacement. Since CoreNLP is not a drop-in replacement, it's worth removing it and following the interface defined in https://github.com/nltk/nltk/blob/develop/nltk/parse/api.py#L14

It seems that the difference between .raw_parse() and .parse() is that .raw_parse() takes a sentence as a string, while .parse() expects a list of tokens.
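A toy sketch of that interface difference (ToyParser is hypothetical; a real parser would yield nltk.tree.Tree objects rather than token lists):

```python
class ToyParser:
    """Hypothetical parser illustrating the two entry points."""

    def parse(self, tokens):
        # ParserI-style entry point: expects an already tokenized sentence.
        yield list(tokens)

    def raw_parse(self, sentence):
        # Convenience entry point: takes a raw string and tokenizes it first.
        return self.parse(sentence.split())

p = ToyParser()
assert next(p.parse(['The', 'dog', 'barks'])) == ['The', 'dog', 'barks']
assert next(p.raw_parse('The dog barks')) == ['The', 'dog', 'barks']
```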

import requests

self.url = url
self.encoding = encoding

assert tagtype in ['pos', 'ner', None]

Throw an exception here saying that tagtype must be either 'pos', 'ner' or None.
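A sketch of the suggested change, using a hypothetical helper name: an explicit exception instead of a bare assert, since asserts are stripped under `python -O` and the message can name the valid values:

```python
def check_tagtype(tagtype):
    """Hypothetical helper: validate tagtype with a clear error message."""
    if tagtype not in ('pos', 'ner', None):
        raise ValueError("tagtype must be either 'pos', 'ner' or None")
    return tagtype

assert check_tagtype('ner') == 'ner'
try:
    check_tagtype('lemma')
    raise AssertionError('expected a ValueError')
except ValueError:
    pass  # invalid value rejected as intended
```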

tagged_data = self.api_call(sentence, properties=default_properties)
assert len(tagged_data['sentences']) == 1
# Taggers only need to return 1-best sentence.
yield [(token['word'], token[self.tagtype]) for token in tagged_data['sentences'][0]['tokens']]

But it can return several, according to the API (I guess). I would not limit the functionality here: if CoreNLP returns several parses, pass all of them.
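A sketch of the suggested generalization, shaped after the response layout visible in the snippet above ({'sentences': [{'tokens': [...]}]}); the function name is hypothetical:

```python
def tag_sentences(response, tagtype='pos'):
    """Hypothetical helper: yield one tagged token list per sentence in a
    CoreNLP JSON response, instead of asserting there is exactly one."""
    for sentence in response['sentences']:
        yield [(token['word'], token[tagtype]) for token in sentence['tokens']]

# Toy response with two sentences.
response = {'sentences': [
    {'tokens': [{'word': 'Hello', 'pos': 'UH'}]},
    {'tokens': [{'word': 'World', 'pos': 'NN'}]},
]}
assert list(tag_sentences(response)) == [[('Hello', 'UH')], [('World', 'NN')]]
```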

@@ -23,9 +23,11 @@
from nltk.parse.api import ParserI
from nltk.parse.dependencygraph import DependencyGraph
from nltk.tree import Tree
from nltk.parse import corenlp

This looks redundant.

@@ -14,7 +14,7 @@
from six import string_types
import codecs

import nltk.data
from nltk.data import LazyLoader

It might be a good idea to leave this code as it was.

@@ -181,6 +181,7 @@
from six.moves.urllib.error import HTTPError, URLError

import nltk
from nltk import data, internals

Same here; it might not be worth playing with the imports.

@artiemq (Contributor, Author) commented Oct 9, 2017

I've updated the raw_tag_sents() method; it now returns a list of parses for each sentence, but tag() still returns list(tuple(str, str)) with only one parse. Is that fine?

@dimazest (Contributor)

Looks good, thanks!

@alvations (Contributor)

Lets get this merged and prepare for #1892 =)

@stevenbird stevenbird merged commit b2d622e into nltk:develop Nov 26, 2017
@stevenbird (Member)

Thanks @artiemq, @dimazest, @alvations

@artiemq artiemq deleted the issue_1839 branch November 26, 2017 06:04
@hexingren commented Apr 25, 2018

Hello,

I was trying to move from StanfordNERTagger to CoreNLPPOSTagger and CoreNLPNERTagger with NLTK v3.2.5. The CoreNLPPOSTagger worked as expected, but CoreNLPNERTagger threw an HTTPError: 500 Server Error. I tried both CoreNLP v3.9.1 (the latest version) and v3.8.0; both threw the same error. I posted more details at #2010.

Does anyone have an idea about this problem? Or would I be safe staying with StanfordNERTagger on v3.2.5? Thanks.
