New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CoreNLPParser tag() should allow properties overloading #2112
Comments
Looks good. Just some minor comments. It should be I don't really like idea of joining and splitting strings, maybe there could be a way to pass a list of words to CoreNLP as a sentence instead of a simple string. |
Hi, I would like to take this up as my first issue. |
It's great you are interested in the issue. If you have any questions, ask them here. |
Hi @dimazest @alvations if its currently open can I get involved with contributing to it? |
Sure! A good first step is to open a pull request. |
Hi @dimazest @alvations if have created a PR |
With the current
CoreNLPParser.tag()
, the "retokenization" by Stanford CoreNLP is unexpected:The expected behavior should be:
Proposed solution is to allow
properties
arguments overloading for.tag()
and.tag_sents()
, i.e. at https://github.com/nltk/nltk/blob/develop/nltk/parse/corenlp.py#L348 and by default useproperties = {'tokenize.whitespace':'true'}
because we are concatenating the tokens by spaces intag_sents()
.That should enforce the list of string tokens input by the users.
Details on https://stackoverflow.com/questions/52250268/why-do-corenlp-ner-tagger-and-ner-tagger-join-the-separated-numbers-together
If we allow the
.tag()
to overload the properties before theraw_tag_sents
, that'll also allow users to easily handle cases like #1876The text was updated successfully, but these errors were encountered: