Kytea stops at first newline in input file (Doesn't process files without newlines) #17

Open
olsgaard opened this issue May 4, 2016 · 2 comments

olsgaard commented May 4, 2016

When calling KyTea with a raw input file, KyTea stops processing at the first newline character.

Some sentences (particularly headlines) are delimited by a newline rather than by punctuation. If you remove the newline following a headline in a blog post or a newspaper article, the headline will run into the first sentence of the article.

On the other hand, KyTea seems to quit without doing any processing if the raw input file doesn't contain any newline character.

Moreover, KyTea doesn't seem to do word segmentation when input is tokenised ($ kytea infile.tok -in tok), seemingly making the -nows flag redundant.

neubig (Owner) commented May 4, 2016

Thanks for the report. I can't confirm this behavior.
Can you give me more information about the version of KyTea you are using, the exact command you ran, the environment you are running it in, and an example file that causes the problem?

olsgaard (Author) commented May 6, 2016

Thanks for the quick reply. I run OS X El Capitan 10.11.3 and Ubuntu 12.10, both with kytea-0.4.7 compiled from source.

I did some more tests, and it appears that I over-interpreted my initial test case. KyTea doesn't stop processing at the first newline character, but at the last newline character.

If you create a file on Unix (with echo and the > operator on the command line, or with nano), a newline is automatically appended at the end of the file. Files generated with Python, Perl or Sublime Text don't necessarily end with a newline, and this is how I stumbled upon the problem.

$ echo "社長兼業務部長。" > infile.txt
$ cat infile.txt
社長兼業務部長。
$ kytea infile.txt
社長/名詞/しゃちょう 兼/名詞/けん 業務/名詞/ぎょうむ 部長/名詞/ぶちょう 。/補助記号/。
$ perl -pi -e 'chomp if eof' infile.txt  # deletes the trailing \n from infile.txt
$ cat infile.txt
社長兼業務部長。%
$ kytea infile.txt  # hangs for a while, then doesn't give any output
$ 
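
A possible workaround in the meantime, as a minimal sketch: make sure the input file ends with a newline before handing it to KyTea. The helper name `ensure_trailing_newline` is hypothetical (it is not part of KyTea), and this assumes the missing final newline is the only trigger for the behaviour shown above.

```shell
# Hypothetical helper, not part of KyTea: append "\n" to a file only if
# its last byte is not already a newline. Empty files are left untouched.
ensure_trailing_newline() {
    # -s is true only for files that exist and are non-empty.
    [ -s "$1" ] || return 0
    # Command substitution strips a trailing newline, so the result is
    # empty exactly when the last byte of the file is "\n".
    if [ -n "$(tail -c 1 "$1")" ]; then
        printf '\n' >> "$1"
    fi
}

# Intended use (assumes kytea is installed and on PATH):
#   ensure_trailing_newline infile.txt
#   kytea infile.txt
```

Calling the helper twice on the same file is safe, since the second call finds the newline already in place and does nothing.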
