Problems parsing CSV with æ, ø and å characters #65

Open
brwnx opened this issue May 3, 2014 · 10 comments
@brwnx

brwnx commented May 3, 2014

`[csvString CSVComponents];` fails when values contain special characters such as æ, ø and å.

Thanks

@davedelong davedelong added the High label Jul 9, 2014
@davedelong
Owner

This seems to work for me (although I admit I have not pushed all of my changes yet). Can you provide an example of how this fails for you?

@davedelong
Owner

Hi @brwnx,

I have a unit test in place to test for this, but it appears to be passing. Can you provide more information about how you're seeing this fail?

@skyvalleystudio

I think I had the same problem. When a name contains special characters, the parser stops at that character for the rest of the line. My CSV file came from an Excel export. However, I believe it is Excel that is failing to export UTF-8 characters correctly.

e.g.:
148,S†TTERLIN Jasha,MOV,MOVISTAR TEAM

should have been:
148,SÜTTERLIN Jasha,MOV,MOVISTAR TEAM

So, the fault was with the file Excel created when I used Save As ... CSV.

@davedelong
Owner

@skyvalleystudio both of those strings parse correctly with the latest release of the parser.

@skyvalleystudio

I tried with the July version and still had the problem (first on the line with bib 148). My test file is here:

https://drive.google.com/file/d/0B7DnwOciz86uWWk0UDNXV1IteXM/edit?usp=sharing

Download with:
https://docs.google.com/uc?authuser=0&id=0B7DnwOciz86uWWk0UDNXV1IteXM&export=download

I still think Excel is not really saving in Unicode.

@davedelong
Owner

Thanks @skyvalleystudio, I'll start working on it. Is this CSV file something that I could check into the repository as part of the unit tests?

@skyvalleystudio

Feel free to use the file. I wish I understood character sets better right about now...

I work around the problem by exporting to UTF-16 .txt in Excel, then replacing each tab with a comma and renaming the file. The result imports fine with your parser.
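For anyone who wants to script that workaround, here is a rough sketch in Python rather than Objective-C. It assumes the Excel export is tab-delimited UTF-16 with a byte-order mark; the function name `utf16_tsv_to_utf8_csv` is made up for illustration. It goes through a CSV writer instead of a blind tab-to-comma replacement so that fields which already contain commas get quoted.

```python
import csv
import io

def utf16_tsv_to_utf8_csv(data: bytes) -> bytes:
    """Convert tab-delimited UTF-16 bytes into comma-delimited UTF-8 bytes."""
    # The "utf-16" codec reads the byte-order mark to pick the endianness.
    text = data.decode("utf-16")
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    # csv.writer quotes any field that contains a comma or a quote.
    csv.writer(out).writerows(reader)
    return out.getvalue().encode("utf-8")
```

The resulting bytes can be written to a `.csv` file and handed to the parser as UTF-8.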

@davedelong
Owner

It's a file encoding problem. The parser is coming across the Ü in the file, which is encoded as the single byte 0x86. However, 0x86 cannot start a character in UTF-8: it is only valid as a continuation byte inside a multi-byte sequence, so the parser is not able to extract a character from it. The likely cause is that the file isn't actually encoded as UTF-8 (if it were, Ü would have been encoded as the two bytes 0xC3 0x9C, not as 0x86).
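The byte-level behavior is easy to reproduce. The sketch below is Python rather than the parser's Objective-C, and it assumes the export used Mac OS Roman. That is a guess based on 0x86 mapping to Ü in that encoding; the thread does not confirm which encoding Excel actually used.

```python
# The problem row as raw bytes, with Ü stored as the single byte 0x86.
row = b"148,S\x86TTERLIN Jasha,MOV,MOVISTAR TEAM"

# Strict UTF-8 decoding rejects 0x86 because it cannot start a character.
try:
    row.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the file is Mac OS Roman, where 0x86 is Ü.
as_mac_roman = row.decode("mac_roman")

# Windows-1252 maps the same byte to a dagger, matching the garbled sample.
as_cp1252 = row.decode("cp1252")

# For comparison, genuine UTF-8 stores Ü as two bytes.
utf8_bytes = "Ü".encode("utf-8")
```

The dagger in the garbled sample earlier in the thread is what the same byte looks like when read as Windows-1252, which is consistent with a single-byte legacy encoding rather than UTF-8.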

You could work around this by explicitly specifying a different encoding for the file, but I'll try and figure out what the parser is supposed to do.
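As a language-neutral illustration of that workaround, here is a minimal Python sketch that decodes the raw bytes with an explicitly chosen encoding before any CSV parsing happens. The helper `parse_csv_bytes` and the choice of `mac_roman` are assumptions for the example; they are not part of the parser's API.

```python
import csv
import io

def parse_csv_bytes(data: bytes, encoding: str) -> list:
    """Decode raw CSV bytes with an explicit encoding, then split into rows."""
    text = data.decode(encoding)
    # newline="" leaves line endings untouched so the csv module handles them.
    return list(csv.reader(io.StringIO(text, newline="")))

# Sample row from this thread, with Ü stored as the single byte 0x86.
sample = b"148,S\x86TTERLIN Jasha,MOV,MOVISTAR TEAM\r\n"
rows = parse_csv_bytes(sample, "mac_roman")
```

The same idea applies in Objective-C: read the file into a string with the encoding you believe it uses, then pass that string to the parser.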

@jomnius

jomnius commented Dec 12, 2014

Any progress with this? I have the same problem, and I just realised I created a duplicate issue report :/

I tried forcing different encodings on the parser; none helped. I have no control over the actual file and have to use it as given. I don't care how long parsing takes, so I would be happy to modify each row in my own code before the parser sees it.
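One way to pre-process rows before a parser sees them is to decode each line on its own, trying UTF-8 first and falling back to a legacy encoding. This is only a sketch in Python: the function names and the `mac_roman` default are hypothetical, and it may not help with files whose problem is something other than a mislabeled single-byte encoding.

```python
def decode_row(raw: bytes, fallback: str = "mac_roman") -> str:
    """Decode one line as UTF-8, or with a fallback encoding if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps going even if the fallback has gaps.
        return raw.decode(fallback, errors="replace")

def decode_rows(data: bytes, fallback: str = "mac_roman") -> list:
    """Split raw bytes into lines and decode each line independently."""
    return [decode_row(line, fallback) for line in data.splitlines()]
```

Splitting on line breaks first will mis-handle quoted fields that contain embedded newlines, so this only suits files where every record is a single line.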

@golopupinsky

I am also facing this with some special characters in files of roughly 2-7 MB on both iOS and OS X.
Choosing encoding manually helps sometimes and sometimes it doesn't.
I also don't have control over the file encoding/structure.

@jomnius's #73 is totally related
