ValueError: Invalid control character at: line 1120 column 21 (char 28474) #43

Open
aliabbasjp opened this issue Oct 7, 2016 · 2 comments

Comments

@aliabbasjp

I'm getting the following error:

17:42:39: Parsing finished. Moving parsed files into place ...
Traceback (most recent call last):
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/env.py", line 2168, in interpreter
    out = run_command(tokens)  
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/env.py", line 1113, in run_command
    out = command(tokens[1:])
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/env.py", line 1437, in parse_corpus
    parsed = to_parse.parse(**kwargs)  
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/corpus.py", line 930, in parse
    **kwargs
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/make.py", line 356, in make_corpus
    coref=coref, metadata=metadata)
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/conll.py", line 1113, in convert_json_to_conll
    data = json.load(fo)
  File "/home/d/anaconda2/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/home/d/anaconda2/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/home/d/anaconda2/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/d/anaconda2/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid control character at: line 1120 column 21 (char 28474)
@interrogator
Owner

Thanks for these reports. This is a weird one: the JSON output of the CoreNLP parser cannot be parsed by Python's json module. So the problem is not really on corpkit's side, but on CoreNLP's.

Similar bugs have been reported to CoreNLP: stanfordnlp/CoreNLP#241
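
For illustration only, here is a minimal sketch of how this kind of error arises and one possible way to read such a file anyway. It assumes the CoreNLP output contains a raw, unescaped control character inside a JSON string, which is what the error message suggests. The sample string and the `load_lenient` helper are hypothetical and are not part of corpkit or CoreNLP.

```python
import json

# Hypothetical sample: a JSON string value containing a raw (unescaped)
# vertical-tab control character, which strict JSON parsing rejects.
raw = '{"word": "foo\x0bbar"}'


def load_lenient(text):
    """Parse JSON strictly first; on failure, retry with strict=False.

    strict=False tells the json module to accept control characters
    (code points below 0x20) inside strings. This is a workaround sketch,
    not corpkit's actual fix.
    """
    try:
        return json.loads(text)
    except ValueError:
        return json.loads(text, strict=False)


data = load_lenient(raw)
```

The same `strict=False` keyword can be passed to `json.load` when reading from a file object. It only relaxes the control-character check; other kinds of malformed JSON will still raise an error.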

I'm guessing that it relates to the encoding in your text files. Would you be able to zip and upload the files in the unparsed/parsed versions of the corpus? This would help me diagnose the problem and make a fix.

@interrogator
Owner

interrogator commented Oct 7, 2016

Also, I'd recommend encoding your text files as UTF-8; that should fix this problem in your case. Or, as per the instructions in the issue linked above, update your CoreNLP installation to the GitHub version. If corpkit installed CoreNLP for you, it should be in your ~/corenlp directory.
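
A rough sketch of the re-encoding step, in case it helps. The original encoding of the files is not known from this thread, so the `latin-1` default below is purely an assumption; the function name and paths are hypothetical too.

```python
import io


def reencode_to_utf8(src_path, dst_path, source_encoding="latin-1"):
    """Read a text file in source_encoding and write it back out as UTF-8.

    source_encoding is an assumption: replace it with whatever encoding
    the corpus files actually use.
    """
    # io.open behaves the same on Python 2.7 and Python 3.
    with io.open(src_path, "r", encoding=source_encoding) as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

Writing to a separate destination path keeps the original files untouched, which is useful if the guessed source encoding turns out to be wrong.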
