
AbstractNLPDecoder and Tokenizer make a character encoding assumption #21

Open
dlutz2 opened this issue Oct 20, 2016 · 2 comments

dlutz2 commented Oct 20, 2016

The various decode operations in AbstractNLPDecoder and its underlying tokenizer use String.getBytes(), which converts the String to bytes using the platform's default character set. This can corrupt the text whenever the default character set differs from the one the data was written in, and it will happen on Windows for any UTF-8 data beyond the ASCII range, since the Windows default character set is windows-1252 (CP-1252).
Using operations that accept an explicit character set, such as InputStreamReader, avoids this.
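The failure mode can be demonstrated in isolation with a minimal round-trip test (the class name and sample string below are illustrative, not taken from the NLP4J codebase): encoding through a single-byte charset such as windows-1252 silently replaces the Arabic characters, while an explicit UTF-8 round trip is lossless on every platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTripDemo {
    public static void main(String[] args) {
        String text = "Europe (قارة اوروبة)";

        // Explicit UTF-8 round trip: lossless regardless of the platform default.
        String safe = new String(text.getBytes(StandardCharsets.UTF_8),
                                 StandardCharsets.UTF_8);

        // windows-1252 round trip: String.getBytes(Charset) substitutes '?'
        // for every character the charset cannot represent, so the Arabic is lost.
        Charset cp1252 = Charset.forName("windows-1252");
        String garbled = new String(text.getBytes(cp1252), cp1252);

        System.out.println(text.equals(safe));    // true
        System.out.println(text.equals(garbled)); // false: Arabic became '?'
    }
}
```

Calling the no-argument text.getBytes() behaves like the second case whenever Charset.defaultCharset() is a single-byte charset, which is exactly the Windows situation described above.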

jdchoi77 (Member) commented

Thanks for the comment; could you please give an example of where you think this needs to be fixed using InputStreamReader? We'll do the evaluation and apply the update. Thanks.


dlutz2 commented Oct 25, 2016

The simple test below produces the expected results if run on a platform whose default character set is UTF-8, or if the character set is set explicitly (-Dfile.encoding=UTF-8). Run on Windows without explicitly setting the character set, it uses the OS default (equivalent to -Dfile.encoding=windows-1252) and garbles the non-Latin characters.
Note that running this in a development environment like Eclipse may not show the error, since Eclipse automatically adds the -Dfile.encoding property to the invocation.
The reference to InputStreamReader was just a suggestion; you could also do something like someString.getBytes(someCharSet), as long as the Strings/Streams/Files are read with an explicit character set. It would be nice if this character set were a parser/tokenizer config option. If it must be hardcoded, UTF-8 would likely be the best choice.
thanks

// Imports below assume the NLP4J (edu.emory.mathcs.nlp) package layout.
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.List;

import edu.emory.mathcs.nlp.common.util.IOUtils;
import edu.emory.mathcs.nlp.component.template.node.NLPNode;
import edu.emory.mathcs.nlp.decode.NLPDecoder;

public class CharsetTest {
    public static void main(String[] args) throws IOException {
        // Shows which charset the no-argument String.getBytes() will silently use.
        System.out.println("Default Charset=" + Charset.defaultCharset());

        String configFile = "src/main/resources/org/opensextant/relish/config-decode-en.xml";
        NLPDecoder parser = new NLPDecoder(IOUtils.createFileInputStream(configFile));

        // Mixed Latin/Arabic input; the Arabic portion is garbled under windows-1252.
        String text = "We live in Europe (قارة اوروبة).";

        List<NLPNode[]> sentences = parser.decodeDocument(text);
        for (NLPNode[] sentence : sentences) {
            for (NLPNode node : sentence) {
                System.out.println(node);
            }
        }
    }
}
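The explicit-charset reading suggested above can be sketched as follows. This is a standalone illustration, not NLP4J code: the class name, the readUtf8 helper, and the temporary file are all hypothetical, and the point is simply that wrapping the stream in an InputStreamReader with a named charset makes the result independent of -Dfile.encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;

public class ExplicitCharsetRead {
    // Reads the whole file as UTF-8, regardless of the platform default charset.
    static String readUtf8(Path path) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("charset-demo", ".txt");
        String text = "We live in Europe (قارة اوروبة).";
        // Write and read back with an explicit charset: a lossless round trip
        // on any platform, including Windows under windows-1252.
        Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
        System.out.println(text.equals(readUtf8(tmp))); // true
        Files.delete(tmp);
    }
}
```

The same pattern applies to a config-as-charset-option design: the decoder would pass its configured Charset wherever it currently relies on the default.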
