
Set field_size_limit to max long int size on csv plugin #352

Open · wants to merge 1 commit into develop
Conversation

humrochagf (Contributor)

This pull request addresses issue #306 and fixes it on all platforms without falling back to a hardcoded value.

The overflow was happening because Python's C-level CSV implementation stores the field size limit in a long (https://github.com/python/cpython/blob/c88239f864a27f673c0f0a9e62d2488563f9d081/Modules/_csv.c#L21).

The previous implementation, which used sys.maxsize, has a drawback on 64-bit Windows: its LLP64 data model keeps long as a 32-bit integer even though pointers are 64 bits wide, to preserve type compatibility (see https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models). So sys.maxsize overflows the C long that holds the limit.
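
A quick way to see the mismatch (the printed values assume a 64-bit Windows interpreter; on LP64 Linux/macOS both sizes agree):

import sys
import ctypes

# On 64-bit Windows (LLP64), pointers are 64 bits but long stays 32 bits:
print(sys.maxsize)                   # 9223372036854775807 == 2**63 - 1
print(ctypes.sizeof(ctypes.c_long))  # 4 bytes, so LONG_MAX is only 2**31 - 1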

So, to get a reliable maximum long value on the target system, you can use ctypes.c_ulong(-1), which wraps around to the maximum unsigned long value; dividing that by 2 yields the maximum signed long value for the target platform.
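
A minimal sketch of the technique (shown here with the stdlib csv module; the plugin itself calls unicodecsv.field_size_limit, which works the same way):

import csv
import ctypes

# c_ulong(-1) wraps around to the platform's maximum unsigned long
# (2**32 - 1 on LLP64 Windows, 2**64 - 1 on LP64 Linux/macOS); integer
# division by 2 gives the maximum signed long, the widest value the
# C-level field limit can hold.
max_long = ctypes.c_ulong(-1).value // 2

csv.field_size_limit(max_long)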

Regarding memory impact: the long holding the limit is allocated regardless of its value, and it is used only to reject fields larger than the limit. The CSV data itself is allocated on demand.

I ran the memory profiler on a simple load of a big file from brasil.io (https://brasil.io/dataset/genero-nomes/nomes/) and got the same values after the CSV import operation:

Filename: test.py (with the unicodecsv.field_size_limit call commented out, using the default limit)

Line #    Mem usage    Increment   Line Contents
================================================
     4     38.1 MiB     38.1 MiB   @profile
     5                             def import_file():
     6     87.3 MiB     49.2 MiB       import rows
     7
     8    179.0 MiB     91.7 MiB       data = rows.import_from_csv('./nomes.csv')
     9
    10    179.0 MiB      0.0 MiB       for field in data:
    11    179.0 MiB      0.0 MiB           print(field)

versus

Filename: test.py (using the max long size as the limit)

Line #    Mem usage    Increment   Line Contents
================================================
     4     38.1 MiB     38.1 MiB   @profile
     5                             def import_file():
     6     87.3 MiB     49.2 MiB       import rows
     7
     8    179.0 MiB     91.7 MiB       data = rows.import_from_csv('./nomes.csv')
     9
    10    179.0 MiB      0.0 MiB       for field in data:
    11    179.0 MiB      0.0 MiB           print(field)
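
For reference, the test script as reconstructed from the profiler output above (assuming memory_profiler's @profile decorator; the __main__ guard is my addition):

from memory_profiler import profile


@profile
def import_file():
    import rows

    data = rows.import_from_csv('./nomes.csv')

    for field in data:
        print(field)


if __name__ == '__main__':
    import_file()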
