
Parsing fails when "bad data" is not in the first input stream buffer #73

Open
jomnius opened this issue Dec 9, 2014 · 1 comment

jomnius commented Dec 9, 2014

I'm trying to parse external TSV data, which I cannot fix before parsing. The weird problem is that the data contains "bad" values, e.g.

    "Englanti|Lontoo|kartat|matkaoppaat|n�aht�avyydet"

If such a string appears early in the file, everything is fine. If it appears further in, CHCSVParser's _loadMoreIfNecessary fails to get past the "�". My guess is that it's being interpreted as two separate halves of a real character, with the first half being 0x00?

_stringBuffer may contain 15000 bytes, but [NSString initWithBytes] just keeps returning nil while readLength is decremented from 15k+ down to zero one byte at a time. The document ends at input file row 20, even though the file contains 1200 rows.

Forcing the encoding to UTF-8 helped a little bit, but not with this character.

    NSInputStream *stream = [NSInputStream inputStreamWithURL:[NSURL URLWithString:urlPath]];
    NSStringEncoding encoding = NSUTF8StringEncoding;
    CHCSVParser *p = [[CHCSVParser alloc] initWithInputStream:stream usedEncoding:&encoding delimiter:'\t'];

Any ideas why the first buffer[CHUNK_SIZE] read by _sniffEncoding would behave differently from the rest read by _loadMoreIfNecessary? I can't see much of a difference. _streamEncoding is always NSUTF8StringEncoding. Is there any way to fix the input stream data before trying to parse it?

Bad data in the first buffer looks like this:

    "Englanti|Lontoo|kartat|matkaoppaat|n\Ufffdaht\Ufffdavyydet"

jomnius commented Feb 15, 2015

I debugged this a little. The problem in my case is that the data is auto-recognised as UTF-8 based on the first 512 bytes, while later on it contains unexpected Unicode characters due to the corrupted input. I have no control over the input data; it's generated by an external closed system.

The problem seems to be that CHCSVParser is pretty much built on top of NSString methods, which do not like unexpected u'\U0000fffd' characters when they expect UTF-8. I tried to skip over the bad data with [self _advance], but that ends up calling NSString methods instead of actually skipping over the raw data.

By the way, according to the NSString documentation, "If the length of the byte string is greater than the specified length a nil value is returned", so I don't really understand what readLength--; is supposed to accomplish. It should cause a failure immediately, and in my case it did. About 15000 times in a row at some point.

    - (void)_loadMoreIfNecessary {
    ...
        // try to turn the next portion of the buffer into a string
        NSUInteger readLength = [_stringBuffer length];
        while (readLength > 0) {
            NSString *readString = [[NSString alloc] initWithBytes:[_stringBuffer bytes] length:readLength encoding:_streamEncoding];
            if (readString == nil) {
                readLength--;
            } else {
                [_string appendString:readString];
                break;
            }
        };

No easy fixes: either I have to parse and fix the input before the real parsing, or modify CHCSVParser to drop NSString and work with raw data. Pre-parsing should be easier.
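
For reference, if modifying CHCSVParser turned out to be necessary after all, a more forgiving version of the loop quoted above might look roughly like this (untested sketch; the U+FFFD substitution and the length > 4 guard are my own additions, not existing library behaviour):

    // When no prefix of the buffer decodes at all, consume one raw byte and
    // substitute U+FFFD instead of giving up, so parsing keeps moving past
    // corrupted input. readLength still ends up holding the number of bytes
    // that were successfully decoded and appended to _string.
    NSUInteger readLength = [_stringBuffer length];
    while (readLength > 0) {
        NSString *readString = [[NSString alloc] initWithBytes:[_stringBuffer bytes] length:readLength encoding:_streamEncoding];
        if (readString != nil) {
            [_string appendString:readString];
            break;
        }
        readLength--;
        if (readLength == 0 && [_stringBuffer length] > 4) {
            // Nothing decodes, and a legitimate multi-byte character split at
            // the chunk boundary would be at most 4 bytes, so the first byte
            // must be corrupt: drop it, mark it, and retry with the rest.
            [_string appendString:@"\uFFFD"];
            [_stringBuffer replaceBytesInRange:NSMakeRange(0, 1) withBytes:NULL length:0];
            readLength = [_stringBuffer length];
        }
    }

It keeps the slow byte-by-byte backoff, but at least the document would not end at row 20.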
