
Parsing fails when "bad data" is not in the first input stream buffer #73

Open
jomnius opened this issue Dec 9, 2014 · 1 comment

jomnius commented Dec 9, 2014

I'm trying to parse external TSV data, which I cannot fix before parsing. The weird problem is that the data contains "bad" values, e.g.

    "Englanti|Lontoo|kartat|matkaoppaat|n�aht�avyydet"

If such a string appears early in the file, everything is fine. If it appears further in, CHCSVParser's _loadMoreIfNecessary fails to get past the "�". My guess is that it's being interpreted as two separate halves of a real character, with the first half being 0x00?

_stringBuffer may contain 15000 bytes, but [NSString initWithBytes] just keeps returning nil while readLength is decremented from 15k+ down to zero one byte at a time. The document ends at input file row 20, even though the file contains 1200 rows.

Forcing the encoding to UTF-8 helped a little bit, but not with this character.

    NSInputStream *stream = [NSInputStream inputStreamWithURL:[NSURL URLWithString:urlPath]];
    NSStringEncoding encoding = NSUTF8StringEncoding;
    CHCSVParser *p = [[CHCSVParser alloc] initWithInputStream:stream usedEncoding:&encoding delimiter:'\t'];

Any ideas why the first buffer[CHUNK_SIZE] read by _sniffEncoding would behave differently from the rest read by _loadMoreIfNecessary? I can't see much of a difference. _streamEncoding is always NSUTF8StringEncoding. Is there any way to fix the input stream data before trying to parse it?

Bad data in the first buffer looks like this:

    "Englanti|Lontoo|kartat|matkaoppaat|n\Ufffdaht\Ufffdavyydet"

jomnius commented Feb 15, 2015

I debugged this a little. The problem in my case is that the data is auto-recognised as UTF-8 based on the first 512 bytes, while later on it contains unexpected Unicode characters due to the corrupted input. I have no control over the input data; it's generated by an external closed system.

The problem seems to be that CHCSVParser is pretty much built on top of NSString methods, which do not like unexpected u'\U0000fffd' characters when they expect UTF-8. I tried to skip over the bad data with [self _advance], but that ends up calling NSString methods instead of actually skipping over the raw data.

By the way, according to the NSString documentation, "If the length of the byte string is greater than the specified length a nil value is returned", so I don't really understand what readLength--; is supposed to accomplish. It should cause a failure immediately, and in my case it did. About 15000 times in a row at some point.

    - (void)_loadMoreIfNecessary {
    ...
        // try to turn the next portion of the buffer into a string
        NSUInteger readLength = [_stringBuffer length];
        while (readLength > 0) {
            NSString *readString = [[NSString alloc] initWithBytes:[_stringBuffer bytes] length:readLength encoding:_streamEncoding];
            if (readString == nil) {
                readLength--;
            } else {
                [_string appendString:readString];
                break;
            }
        };

No easy fixes: either I have to parse and fix the input before the real parsing, or modify CHCSVParser to drop NSString and work with raw data. Pre-parsing should be easier.
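
For reference, if modifying CHCSVParser turned out to be necessary after all, a more forgiving version of the loop quoted above might look roughly like this (untested sketch; the U+FFFD substitution and the length > 4 guard are my own additions, not existing library behaviour):

    // When no prefix of the buffer decodes at all, consume one raw byte and
    // substitute U+FFFD instead of giving up, so parsing keeps moving past
    // corrupted input. readLength still ends up holding the number of bytes
    // that were successfully decoded and appended to _string.
    NSUInteger readLength = [_stringBuffer length];
    while (readLength > 0) {
        NSString *readString = [[NSString alloc] initWithBytes:[_stringBuffer bytes] length:readLength encoding:_streamEncoding];
        if (readString != nil) {
            [_string appendString:readString];
            break;
        }
        readLength--;
        if (readLength == 0 && [_stringBuffer length] > 4) {
            // Nothing decodes, and a legitimate multi-byte character split at
            // the chunk boundary would be at most 4 bytes, so the first byte
            // must be corrupt: drop it, mark it, and retry with the rest.
            [_string appendString:@"\uFFFD"];
            [_stringBuffer replaceBytesInRange:NSMakeRange(0, 1) withBytes:NULL length:0];
            readLength = [_stringBuffer length];
        }
    }

It keeps the slow byte-by-byte backoff, but at least the document would not end at row 20.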
