Deprecating the old Stanford Parser #1843

Merged: 11 commits into nltk:develop, Nov 26, 2017

Conversation

@artiemq (Contributor) commented Oct 2, 2017

Worked on issue #1839.

Also refactored the CoreNLPTokenizer and CoreNLPTagger tests.

@dimazest (Contributor) left a comment

I'm not sure it's worth creating a drop-in replacement for the Stanford parser, mostly because the API is incompatible: CoreNLPParser requires a running CoreNLPServer.

Deprecation warnings are useful!

@@ -281,6 +282,15 @@ class StanfordParser(GenericStanfordParser):

_OUTPUT_FORMAT = 'penn'

def __init__():
warnings.simplefilter('always', DeprecationWarning)

I'm not sure whether warnings.simplefilter should be called here. It should be up to the user to decide which warnings are shown.

"be deprecated\n"
"Please use \033[91mnltk.parse.corenlp.CoreNLPParser\033[0m instead.'"),
DeprecationWarning, stacklevel=2)
warnings.simplefilter('ignore', DeprecationWarning)

Same here: what if the user has set a custom filter?
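A minimal sketch of the alternative being suggested: emit the DeprecationWarning with warnings.warn and leave the global filters untouched, so a user's own filter configuration survives. The class name and message here are illustrative, not the actual NLTK code:

```python
import warnings

class StanfordParser:
    """Illustrative stand-in, not the real NLTK class."""

    def __init__(self):
        # Emit the warning, but do not call warnings.simplefilter():
        # whether deprecation warnings are shown stays under the
        # user's control.
        warnings.warn(
            "StanfordParser will be deprecated; "
            "please use nltk.parse.corenlp.CoreNLPParser instead.",
            DeprecationWarning,
            stacklevel=2,
        )

# A user who wants to see the warning opts in explicitly:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    StanfordParser()
assert caught[0].category is DeprecationWarning
```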


is str("...") really needed? would "..." be enough?


It looks like there is an unmatched '.

@@ -23,6 +23,7 @@
from nltk.parse.api import ParserI
from nltk.parse.dependencygraph import DependencyGraph
from nltk.tree import Tree
from nltk.parse.corenlp import CoreNLPParser, CoreNLPDependencyParser

I would do this:

from nltk.parse import corenlp

and later refer to CoreNLPParser and CoreNLPDependencyParser as corenlp.CoreNLPParser and corenlp.CoreNLPDependencyParser.

@@ -400,6 +426,76 @@ def _make_tree(self, result):
return DependencyGraph(result, top_relation_label='ROOT')


class CoreNLPParser(CoreNLPParser):

With a new import this becomes:

class CoreNLPParser(corenlp.CoreNLPParser):

I suggest not using CoreNLPParser as the name, as it's confusing.

@@ -228,6 +228,7 @@ def _execute(self, cmd, input_, verbose=False):

return stdout


class StanfordParser(GenericStanfordParser):

Actually, why not rewrite the implementation of StanfordParser so that it uses the CoreNLP parser? But then one needs to be careful with the CoreNLPServer.

@artiemq (Contributor, Author) commented Oct 3, 2017

StanfordTagger and StanfordTokenizer have been deprecated in the same way. If you think it's really not worth replacing the StanfordParser, should I rewrite the tagger and tokenizer too?

@dimazest (Contributor) commented Oct 3, 2017

Right, those are a tagger and a tokenizer, not a parser. I would deprecate the Stanford parser and suggest using corenlp.CoreNLPParser instead.

@alvations what do you think?

@alvations (Contributor)

Deprecating the StanfordParser and leaving the old interface in place while advising users to use CoreNLPParser is a good idea. That way, code on the users' side doesn't break if they decide to keep using the old Stanford tools without CoreNLP.

But we have to ensure that it behaves as a ParserI would; it should, since the inheritance tree is: =)

> ParserI, TokenizerI
    > GenericCoreNLPParser
        > CoreNLPParser
        > CoreNLPDependencyParser

Maybe something like this:

/nltk
    /parse
        /corenlp
            GenericCoreNLPParser
            CoreNLPParser
            CoreNLPDependencyParser
            CoreNLPNeuralDependencyParser
        /stanford
            GenericStanfordParser
            StanfordParser
            StanfordDependencyParser
            StanfordNeuralDependencyParser
    /tag
        /stanford
            StanfordTagger
            StanfordPOSTagger
            StanfordNERTagger
            CoreNLPTagger
            CoreNLPPOSTagger
            CoreNLPNERTagger
    /tokenize
        /stanford
            StanfordTokenizer
            CoreNLPTokenizer
        /stanford_segmenter
            StanfordSegmenter
@dimazest (Contributor) commented Oct 4, 2017

Another way is to make CoreNLPParser implement all those interfaces and get rid of CoreNLPTokenizer and CoreNLP*Tagger. The reasoning is that we provide an interface to a single tool that is a tagger, a parser, and a tokenizer.

@artiemq (Contributor, Author) commented Oct 6, 2017

What should I do with the different return types? The original CoreNLPParser.tokenize() returns a generator of strings, but StanfordTokenizer.tokenize() returns a list. Should I create something like CoreNLPParser.stanford_tokenize() that returns a list, or just leave the tokenizer part as is?

Also, StanfordParser and CoreNLPParser have different method names for parsing: StanfordParser has parse(), but CoreNLPParser has raw_parse().

@dimazest (Contributor) commented Oct 6, 2017

@artiemq a list and a generator are both iterables, so I would not pay attention to the difference. Personally, I prefer a generator, because then it's up to the user to decide which container to use (a list, a set, or none at all, just iterating over the generator). I would not add .stanford_tokenize(), as it would clutter the already quite mysterious API.
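A tiny illustration of the point about generators (a toy tokenizer, not the NLTK one): yielding tokens leaves the choice of container entirely to the caller:

```python
def tokenize(text):
    """Toy tokenizer: yields tokens one at a time."""
    for token in text.split():
        yield token

assert list(tokenize('a b a')) == ['a', 'b', 'a']   # caller wants a list
assert set(tokenize('a b a')) == {'a', 'b'}         # caller wants a set
```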

.raw_parse() is defined only on StanfordParser, and consequently in CoreNLP, because in the beginning it was meant to be a drop-in replacement. Since CoreNLP is not a drop-in replacement, it's worth removing it and following the interface defined in https://github.com/nltk/nltk/blob/develop/nltk/parse/api.py#L14

It seems that the difference between .raw_parse() and .parse() is that .raw_parse() takes a sentence as a string, while .parse() expects a list of tokens.
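A toy sketch of that interface difference (ToyParser is hypothetical; a real parser would yield nltk.tree.Tree objects rather than token lists):

```python
class ToyParser:
    """Hypothetical parser illustrating the two entry points."""

    def parse(self, tokens):
        # ParserI-style entry point: expects an already tokenized sentence.
        yield list(tokens)

    def raw_parse(self, sentence):
        # Convenience entry point: takes a raw string and tokenizes it first.
        return self.parse(sentence.split())

p = ToyParser()
assert next(p.parse(['The', 'dog', 'barks'])) == ['The', 'dog', 'barks']
assert next(p.raw_parse('The dog barks')) == ['The', 'dog', 'barks']
```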

import requests

self.url = url
self.encoding = encoding

assert tagtype in ['pos', 'ner', None]

Throw an exception here saying that tagtype must be either 'pos', 'ner' or None.
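A sketch of the suggested change, using a hypothetical helper name: an explicit exception instead of a bare assert, since asserts are stripped under `python -O` and the message can name the valid values:

```python
def check_tagtype(tagtype):
    """Hypothetical helper: validate tagtype with a clear error message."""
    if tagtype not in ('pos', 'ner', None):
        raise ValueError("tagtype must be either 'pos', 'ner' or None")
    return tagtype

assert check_tagtype('ner') == 'ner'
try:
    check_tagtype('lemma')
    raise AssertionError('expected a ValueError')
except ValueError:
    pass  # invalid value rejected as intended
```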

tagged_data = self.api_call(sentence, properties=default_properties)
assert len(tagged_data['sentences']) == 1
# Taggers only need to return 1-best sentence.
yield [(token['word'], token[self.tagtype]) for token in tagged_data['sentences'][0]['tokens']]

But it can return several, according to the API (I guess). I would not limit the functionality here: if CoreNLP returns several parses, pass all of them.
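A sketch of the suggested generalization, shaped after the response layout visible in the snippet above ({'sentences': [{'tokens': [...]}]}); the function name is hypothetical:

```python
def tag_sentences(response, tagtype='pos'):
    """Hypothetical helper: yield one tagged token list per sentence in a
    CoreNLP JSON response, instead of asserting there is exactly one."""
    for sentence in response['sentences']:
        yield [(token['word'], token[tagtype]) for token in sentence['tokens']]

# Toy response with two sentences.
response = {'sentences': [
    {'tokens': [{'word': 'Hello', 'pos': 'UH'}]},
    {'tokens': [{'word': 'World', 'pos': 'NN'}]},
]}
assert list(tag_sentences(response)) == [[('Hello', 'UH')], [('World', 'NN')]]
```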

@@ -23,9 +23,11 @@
from nltk.parse.api import ParserI
from nltk.parse.dependencygraph import DependencyGraph
from nltk.tree import Tree
from nltk.parse import corenlp

This looks redundant.

@@ -14,7 +14,7 @@
from six import string_types
import codecs

import nltk.data
from nltk.data import LazyLoader

It might be a good idea to leave this code as it was.

@@ -181,6 +181,7 @@
from six.moves.urllib.error import HTTPError, URLError

import nltk
from nltk import data, internals

Same here; it might not be worth playing with the imports.

@artiemq (Contributor, Author) commented Oct 9, 2017

I've updated the raw_tag_sents() method; it now returns a list of parses for each sentence, but tag() still returns list(tuple(str, str)) with only one parse. Is that fine?

@dimazest (Contributor)

Looks good, thanks!

@alvations (Contributor)

Lets get this merged and prepare for #1892 =)

@stevenbird stevenbird merged commit b2d622e into nltk:develop Nov 26, 2017
@stevenbird (Member)

Thanks @artiemq, @dimazest, @alvations

@artiemq artiemq deleted the issue_1839 branch November 26, 2017 06:04
@hexingren commented Apr 25, 2018

Hello,

I was trying to move from StanfordNERTagger to CoreNLPPOSTagger and CoreNLPNERTagger with NLTK v3.2.5. The CoreNLPPOSTagger worked as expected, but CoreNLPNERTagger threw an HTTPError: 500 Server Error. I tried both CoreNLP v3.9.1 (the latest version) and v3.8.0; both threw the same error. I posted more details at #2010.

Does anyone have an idea about this problem? Or would I be safe staying with StanfordNERTagger on v3.2.5? Thanks.
