
AbstractNLPDecoder and Tokenizer make a character encoding assumption #21

Open
dlutz2 opened this issue Oct 20, 2016 · 2 comments

dlutz2 commented Oct 20, 2016

The various decode operations in AbstractNLPDecoder and its underlying tokenizer use String.getBytes(), which converts the String to bytes using the platform's default character set. This can corrupt the text whenever the default character set differs from the one the data was written in, and it will happen on Windows for any UTF-8 data beyond the ASCII range, since the Windows default character set is windows-1252 (CP-1252).
Using operations that accept an explicit character set, such as InputStreamReader, avoids this.
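The failure mode can be demonstrated in isolation with a minimal round-trip test (the class name and sample string below are illustrative, not taken from the NLP4J codebase): encoding through a single-byte charset such as windows-1252 silently replaces the Arabic characters, while an explicit UTF-8 round trip is lossless on every platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTripDemo {
    public static void main(String[] args) {
        String text = "Europe (قارة اوروبة)";

        // Explicit UTF-8 round trip: lossless regardless of the platform default.
        String safe = new String(text.getBytes(StandardCharsets.UTF_8),
                                 StandardCharsets.UTF_8);

        // windows-1252 round trip: String.getBytes(Charset) substitutes '?'
        // for every character the charset cannot represent, so the Arabic is lost.
        Charset cp1252 = Charset.forName("windows-1252");
        String garbled = new String(text.getBytes(cp1252), cp1252);

        System.out.println(text.equals(safe));    // true
        System.out.println(text.equals(garbled)); // false: Arabic became '?'
    }
}
```

Calling the no-argument text.getBytes() behaves like the second case whenever Charset.defaultCharset() is a single-byte charset, which is exactly the Windows situation described above.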

jdchoi77 (Member) commented

Thanks for the comment; could you please give an example of where you think this needs to be fixed using InputStreamReader? We'll do the evaluation and apply the update. Thanks.


dlutz2 commented Oct 25, 2016

The simple test below produces the expected results if run on a platform whose default character set is UTF-8, or if the character set is set explicitly (-Dfile.encoding=UTF-8). Run on Windows without explicitly setting the character set, it uses the OS default (equivalent to -Dfile.encoding=windows-1252) and garbles the non-Latin characters.
Note that running this in a development environment like Eclipse may not show the error, since Eclipse automatically adds the -Dfile.encoding property to the invocation.
The reference to InputStreamReader was just a suggestion; you could also do something like someString.getBytes(someCharSet), as long as the Strings/Streams/Files are read with an explicit character set. It would be nice if this character set were a parser/tokenizer config option. If it must be hardcoded, UTF-8 would likely be the best choice.
thanks

// Imports below assume the NLP4J (edu.emory.mathcs.nlp) package layout.
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.List;

import edu.emory.mathcs.nlp.common.util.IOUtils;
import edu.emory.mathcs.nlp.component.template.node.NLPNode;
import edu.emory.mathcs.nlp.decode.NLPDecoder;

public class CharsetTest {
    public static void main(String[] args) throws IOException {
        // Shows which charset the no-argument String.getBytes() will silently use.
        System.out.println("Default Charset=" + Charset.defaultCharset());

        String configFile = "src/main/resources/org/opensextant/relish/config-decode-en.xml";
        NLPDecoder parser = new NLPDecoder(IOUtils.createFileInputStream(configFile));

        // Mixed Latin/Arabic input; the Arabic portion is garbled under windows-1252.
        String text = "We live in Europe (قارة اوروبة).";

        List<NLPNode[]> sentences = parser.decodeDocument(text);
        for (NLPNode[] sentence : sentences) {
            for (NLPNode node : sentence) {
                System.out.println(node);
            }
        }
    }
}
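The explicit-charset reading suggested above can be sketched as follows. This is a standalone illustration, not NLP4J code: the class name, the readUtf8 helper, and the temporary file are all hypothetical, and the point is simply that wrapping the stream in an InputStreamReader with a named charset makes the result independent of -Dfile.encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;

public class ExplicitCharsetRead {
    // Reads the whole file as UTF-8, regardless of the platform default charset.
    static String readUtf8(Path path) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("charset-demo", ".txt");
        String text = "We live in Europe (قارة اوروبة).";
        // Write and read back with an explicit charset: a lossless round trip
        // on any platform, including Windows under windows-1252.
        Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
        System.out.println(text.equals(readUtf8(tmp))); // true
        Files.delete(tmp);
    }
}
```

The same pattern applies to a config-as-charset-option design: the decoder would pass its configured Charset wherever it currently relies on the default.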
