Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding UTF-8 BOM to CSV exports #792

Open
lognaturel opened this issue Mar 15, 2023 · 4 comments
Open

Consider adding UTF-8 BOM to CSV exports #792

lognaturel opened this issue Mar 15, 2023 · 4 comments
Labels
needs discussion Discussion needed before work can begin

Comments

@lognaturel
Copy link
Member

I'm opening this issue so we have a record that this was considered but I'm currently leaning towards doing nothing.

CSV exports currently do not include UTF-8 byte-order-marks. In general, this is a good practice because "the Unicode Standard permits the BOM in UTF-8 but does not require or recommend its use." (https://en.wikipedia.org/wiki/Byte_order_mark)

Unfortunately, modern versions of Excel continue to open CSVs that don't have BOMs as ASCII. That regularly trips up users:

  • They only have a few UTF-8 characters so don't notice and end up with bad chars in their analysis
  • They notice bad chars but are stuck on what to do next
  • They somehow find the tip at the bottom of this docs section and are annoyed by that process

If we add the BOM, exported CSVs would open as expected in Excel when double-clicked. However, it's likely that other downstream tools would then have trouble opening our CSVs. The advantage of the current state of things is that there is one known bad behavior. With a BOM, there are likely to be various different kinds of problems that manifest differently.

@lognaturel lognaturel added the needs discussion Discussion needed before work can begin label Mar 15, 2023
@matthew-white
Copy link
Member

matthew-white commented Mar 22, 2023

Do we have a sense of how many ODK users use Excel? Probably many, right? I think the dream would be to provide an option to download submission data as .xlsx such that there is a single workbook with multiple sheets rather than multiple CSV files, each of which needs to be individually imported. However, I'm not sure what options there are in Node to stream to an .xlsx file. Assuming that we're not going to pursue that route, I think it's an interesting question whether to add the BOM to CSV files.

However, it's likely that other downstream tools would then have trouble opening our CSVs.

I agree that there would likely be some issue. I think it's likely that there are downstream tools that only work with ASCII data and would have trouble with the BOM, yet are still useful to a subset of users. (I would mention odkmeta, which only works with ASCII data, but I'm not sure how many users still use it.) That said, I would guess that most downstream tools would be able to process the BOM. If that's the case, and if many users use Excel and would find the BOM convenient, maybe that's reason to include the BOM.

However, since we know that Excel can work without the BOM — it's just less convenient — my instinct is that we should leave things as-is. If we were to make a change, I think we could do so as an option to include the BOM or not. Maybe the option even defaults to including the BOM. But we probably shouldn't just always include the BOM given that we think that downstream tools would break.

@lognaturel
Copy link
Member Author

tools that only work with ASCII data and would have trouble with the BOM,

Any custom scripts that read those files would need to handle the BOM explicitly or use a library that does. I've experienced not handling the BOM and it's pretty surprising/weird. Maybe it's easy enough for a dev to look this up and find out more. I don't have a good sense of other systems that would or would not handle it. Do you have an idea of what odkmeta would do, for example?

would guess that most downstream tools would be able to process the BOM

Citation? 😊

we probably shouldn't just always include the BOM given that we think that downstream tools would break.

Yes, that sounds right. 😢

@matthew-white
Copy link
Member

matthew-white commented Apr 4, 2023

Do you have an idea of what odkmeta would do, for example?

Usually the type column is the first column of the survey sheet. odkmeta will look for the type column, but won't find it in this case, because it thinks the BOM is part of the column header.

would guess that most downstream tools would be able to process the BOM

Citation? 😊

Haha emphasis on "guess"! But for example, Stata for a long time didn't support Unicode at all (that's why odkmeta has issues). But these days it's happy to import a CSV file with Unicode. I have to think that most data analysis tools that are in widespread use are able to import Unicode somehow. Though maybe there's software that's able to import Unicode, but would still get tripped up on the BOM? Anyway, all guesses!

I'm still mulling over the idea of an optional BOM. Maybe we could include the BOM by default in downloads from Frontend, but give users a way to opt out if it's breaking? I'm not sure I've ever seen such an option in other software though. 🤔

@lognaturel
Copy link
Member Author

I just lost a couple of minutes of my life to UTF-8 BOMs again. I saved a CSV from Excel and it did have the BOM. Then I tried to parse it in a Python script and the first column name wasn't found. I added encoding='utf-8-sig' to the open and that worked.

It's not clear to me what someone without prior knowledge of BOMs would have done. Maybe printed the contents of the parsed file, noticed \ufeff and ended up at this Stackoverflow post? It's hard for me to gauge how much of a roadblock it would be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs discussion Discussion needed before work can begin
Projects
None yet
Development

No branches or pull requests

2 participants