
Occasional crasher when importing CSVs #119

Open
shacker opened this issue Apr 1, 2021 · 7 comments

@shacker
Owner

shacker commented Apr 1, 2021

Slightly awkward: I'm the project's main author, and I'm filing this bug because I need help. I see occasional crash reports when people import CSVs into the demo site. The tracebacks don't tell me anything useful beyond Exception Value: 'utf-8' codec can't decode bytes in position 15-16: invalid continuation byte. I don't have access to the uploaded files because they're InMemory files, and I've tried everything I can think of to reproduce the problem but just can't make it crash.

If you uploaded a CSV and got it to crash, can you provide details in this thread? Thanks.
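For anyone attempting a repro: here's a minimal sketch (the data is invented; we don't have the actual uploads) of how a non-UTF-8 file produces this class of error:

# Bytes that are valid Latin-1 but not valid UTF-8, e.g. from a CSV saved
# with a Western European encoding (hypothetical data, not a real upload).
data = "Café run,Meet André\n".encode("latin-1")
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte

The reported message says "bytes in position 15-16" (plural), which is what you get when the first continuation byte of a multi-byte sequence happens to be valid but a later one isn't; either way it points at non-UTF-8 input.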

@datatalking
Copy link

Hi shacker,

An error of a similar nature was reported on Stack Overflow a few years back: https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte. If that doesn't help, I could check with my data scientist friends.

@datatalking

One of the things I like about shacker/django-todo is that it doesn't crash.

OK, some further findings. Without being able to check the actual file, I'd guess one of two things is happening: they're passing a file path instead of a file object, or the file isn't UTF-8 encoded.

Should we use a chardet-style function to verify the encoding is one of the approved types before importing, and insert a "this file might need to be converted to utf-8" warning when it isn't? A sketch of that idea is below.
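Something like this (check_csv_bytes and the warning text are names of my own invention; chardet.detect() and its 'encoding'/'confidence' keys are the library's real API):

import chardet

def check_csv_bytes(raw):
    # chardet.detect() returns a dict with 'encoding' and 'confidence' keys.
    guess = chardet.detect(raw)
    encoding = (guess["encoding"] or "").lower()
    if encoding not in ("utf-8", "utf-8-sig", "ascii"):
        raise ValueError(
            f"This file looks like {guess['encoding']}; "
            "it might need to be converted to utf-8 before importing."
        )
    # utf-8-sig tolerates (and strips) an optional BOM
    return raw.decode("utf-8-sig")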

@shacker
Owner Author

shacker commented Apr 3, 2021

@datatalking Good theory; the file encoding could well be the culprit. Maybe I (or one of us) just needs to intentionally save a CSV file in some other encoding and see what happens. We shouldn't need a warning, though: if that turns out to be the problem, we could wrap the file opener so that it opens the file "as" UTF-8. Do you know of a good way to save a CSV as non-UTF-8 that you could test with? (Or provide one and I can test it?) A sketch of one way to do this is below.
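For the record, here's one way to produce such a test file (the filename and rows are invented for the test):

import csv

# Write a small CSV in Latin-1 rather than UTF-8. Importing it through a
# reader that assumes UTF-8 should raise the same UnicodeDecodeError as
# the crash reports.
with open("latin1_test.csv", "w", encoding="latin-1", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Note"])
    writer.writerow(["Café run", "Meet André at 9"])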

@bernd-wechner
Contributor

@shacker, I run into file-encoding issues with CSV files a lot, and I've written a couple of small routines that I use routinely to fix them. I'll post them here later for you.

@bernd-wechner
Contributor

Here's the function I wrote:

import magic

def file_encoding(filepath):
    '''
    Text encoding is a bit of a schmozzle in Python and csv data files. Alas.

    A quick summary:

    1. CSV files in the wild may be written with either a UTF-8 or a UTF-16 encoding
    2. Python wants to know the encoding when we open the file
    3. UTF-16 is fine, but UTF-8 comes in two flavours, with and without a BOM
    4. The BOM (byte order mark) is an optional marker at the start of the file
       specifying byte order, which is redundant for UTF-8
    5. In fact Unicode standards recommend against including a BOM with UTF-8
        https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
    6. Python assumes it's not there
    7. Most apps write it (at least sometimes)
    8. The encoding must therefore be specified as:
        utf-16    for UTF-16 files
        utf-8     for UTF-8 files with no BOM
        utf-8-sig for UTF-8 files with a BOM
    9. The "magic" library reliably and efficiently determines the encoding by
       looking at the magic numbers at the start of the file
    10. Alas, it returns a rich string describing the encoding
    11. That string contains either "UTF-16" or "UTF-8"
    12. It contains "(with BOM)" if a BOM is detected
    13. Because of this schmozzle, this quick function translates "magic" output
        to standard encoding names

    :param filepath: The path to a file
    '''
    # magic reports the file type as a descriptive string; we look for the
    # encoding keywords within it.
    m = magic.from_file(filepath)
    utf16 = "UTF-16" in m
    utf8 = "UTF-8" in m
    bom = "(with BOM)" in m

    if utf16:
        return "utf-16"
    elif utf8:
        if bom:
            return "utf-8-sig"
        else:
            return "utf-8"
    else:
        # Anything else (e.g. plain ASCII) decodes fine as UTF-8
        return "utf-8"
and how I use it:

import csv

with open(data_file, "r", encoding=file_encoding(data_file), newline='') as file:
    reader = csv.DictReader(file)

This has basically solved all the encoding issues I've hit with the diverse CSV files I've encountered. For in-memory uploads like the ones crashing here, the same idea can sniff the bytes directly, as sketched below.
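A sketch under assumptions (rows_from_upload is a hypothetical name; magic.from_buffer is python-magic's real byte-string counterpart to from_file):

import csv
import io

import magic

def rows_from_upload(uploaded_file):
    # Read the raw bytes of the upload and let magic describe them,
    # mirroring file_encoding() above but without touching the filesystem.
    raw = uploaded_file.read()
    description = magic.from_buffer(raw)
    if "UTF-16" in description:
        encoding = "utf-16"
    elif "(with BOM)" in description:
        encoding = "utf-8-sig"
    else:
        encoding = "utf-8"
    return csv.DictReader(io.StringIO(raw.decode(encoding)))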

@shacker
Owner Author

shacker commented Sep 28, 2021

@bernd-wechner Awesome, thanks a bunch! Do you by chance have a CSV that can crash django-todo on import? If so, can you share the file?

@bernd-wechner
Contributor

Alas no, not I. I haven't needed the CSV import yet. I do have some PRs open for you, though, fixing stuff that I did need ;-)
