
read_csv() is 3.5X Slower in Pandas 0.23.4 on Python 3.7.1 vs Pandas 0.22.0 on Python 3.5.2 #23516

Closed
dragoljub opened this issue Nov 5, 2018 · 56 comments · Fixed by #23981
Labels: IO CSV (read_csv, to_csv), Performance (memory or execution speed)

Comments

@dragoljub

Code Sample, a copy-pastable example if possible

import io
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
csv = io.StringIO(df.to_csv(index=False))
df2 = pd.read_csv(csv)
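For reproducing outside IPython, the same workload can be timed with the standard library alone (a sketch; the row count is reduced here so it runs quickly, and it assumes only pandas and numpy are installed):

```python
import io
import time

import numpy as np
import pandas as pd

# Build the same kind of all-float frame, round-trip it through CSV text,
# and time only the parse step.
df = pd.DataFrame(np.random.randn(100000, 10),
                  columns=["COL{}".format(i) for i in range(10)])
payload = df.to_csv(index=False)

start = time.perf_counter()
df2 = pd.read_csv(io.StringIO(payload))
elapsed = time.perf_counter() - start

print("parsed {} rows x {} cols in {:.3f}s".format(df2.shape[0], df2.shape[1], elapsed))
```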

Problem description

pd.read_csv() using _libs.parsers.TextReader read() method is 3.5X slower on Pandas 0.23.4 on Python 3.7.1 compared to Pandas 0.22.0 on Python 3.5.2.

 4244 function calls (4210 primitive calls) in 10.273 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.202   10.202   10.204   10.204 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.039    0.039    0.039    0.039 internals.py:5017(_stack_arrays)
        1    0.011    0.011   10.262   10.262 parsers.py:414(_read)
        1    0.011    0.011   10.273   10.273 <string>:1(<module>)
        1    0.004    0.004    0.004    0.004 parsers.py:1685(__init__)
      321    0.001    0.000    0.002    0.000 common.py:811(is_integer_dtype)

Expected Output

3229 function calls (3222 primitive calls) in 2.944 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.881    2.881    2.882    2.882 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.045    0.045    0.045    0.045 internals.py:4801(_stack_arrays)
        1    0.010    0.010    2.944    2.944 parsers.py:423(_read)
        1    0.004    0.004    0.004    0.004 parsers.py:1677(__init__)
      320    0.001    0.000    0.001    0.000 common.py:777(is_integer_dtype)
        1    0.001    0.001    0.001    0.001 {method 'close' of 'pandas._libs.p

Output of pd.show_versions() -- Latest Python 3.7.1, Pandas 0.23.4: slow read_csv

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.9.2
pip: 18.1
setuptools: 40.4.3
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.11.0
xarray: 0.10.9
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: 1.6.1
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None

Output of pd.show_versions() -- Older Python 3.5.2, Pandas 0.22.0: fast read_csv

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 20.10.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.3.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Can you benchmark just the change in python, and just the change in pandas separately?

@dragoljub
Author

Can you benchmark just the change in python, and just the change in pandas separately?

Yes, are there older builds of Pandas on Python 3.7.1? I suppose I can try newer pandas version on old Python.

@TomAugspurger
Contributor

TomAugspurger commented Nov 5, 2018 via email

@gfyoung gfyoung added Performance Memory or execution speed performance IO CSV read_csv, to_csv Needs Info Clarification about behavior needed to assess issue labels Nov 5, 2018
@dragoljub
Author

I ran the test on an older Python 3.5 stack with the latest pandas 0.23.4 but with many older versions of other modules, and it runs faster on Python 3.5. So I'm not sure whether the slowdown is pandas directly on Python 3.7.1 or one of its dependencies.

Does the parser's _read() method rely on some other library that may be the culprit?

 %prun df2 = pd.read_csv(csv)
         5154 function calls (5041 primitive calls) in 2.004 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.960    1.960    1.962    1.962 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.030    0.030    0.030    0.030 internals.py:5017(_stack_arrays)
        1    0.006    0.006    2.004    2.004 parsers.py:414(_read)
        1    0.003    0.003    0.003    0.003 parsers.py:1685(__init__)
      321    0.001    0.000    0.001    0.000 common.py:811(is_integer_dtype)
      518    0.000    0.000    0.000    0.000 common.py:1835(_get_dtype_type)

Installed Python 3.5 Stack With Latest pandas 0.23.4:

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.5.0
pip: 18.0
setuptools: 38.6.0
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.10.0
xarray: 0.10.2
IPython: 6.3.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None


@dragoljub
Author

Interestingly, if I specify float_precision='round_trip' I get similar parsing speeds. If I specify 'high' or None, it's back to the same 3.5x difference.

Python 3.7.1

 %prun df2 = pd.read_csv(csv, float_precision='round_trip')
         4320 function calls (4286 primitive calls) in 4.074 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.862    3.862    3.864    3.864 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.168    0.168    4.074    4.074 <string>:1(<module>)
        1    0.030    0.030    0.030    0.030 internals.py:5017(_stack_arrays)
        1    0.006    0.006    3.906    3.906 parsers.py:414(_read)
        1    0.003    0.003    0.003    0.003 parsers.py:1685(__init__)
      321    0.001    0.000    0.002    0.000 common.py:811(is_integer_dtype)
      516    0.000    0.000    0.001    0.000 common.py:1835(_get_dtype_type)
      952    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}

Python 3.5.2

 %prun df2 = pd.read_csv(csv, float_precision='round_trip')
         4582 function calls (4545 primitive calls) in 3.716 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.665    3.665    3.667    3.667 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.031    0.031    0.031    0.031 internals.py:5017(_stack_arrays)
        1    0.006    0.006    3.710    3.710 parsers.py:414(_read)
        1    0.006    0.006    3.716    3.716 <string>:1(<module>)
        1    0.003    0.003    0.003    0.003 parsers.py:1685(__init__)
      321    0.001    0.000    0.001    0.000 common.py:811(is_integer_dtype)
      518    0.000    0.000    0.001    0.000 common.py:1835(_get_dtype_type)

@dragoljub
Author

Adding another data point: if I specify the 'python' engine, pandas._libs.lib.maybe_convert_numeric is 3x slower on Python 3.7.1 than on Python 3.5.2.

Could this be due to the Cython version?

Python 3.7.1

%prun df2 = pd.read_csv(csv, engine='python')
         7003613 function calls (7003575 primitive calls) in 14.411 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10    9.698    0.970    9.698    0.970 {pandas._libs.lib.maybe_convert_numeric}
  1000004    3.221    0.000    3.221    0.000 {built-in method builtins.next}
        1    0.386    0.386    4.066    4.066 parsers.py:2926(_get_lines)
        1    0.263    0.263   14.399   14.399 parsers.py:1029(read)
        4    0.154    0.038    0.247    0.062 parsers.py:2738(_remove_empty_lines)
  1000002    0.138    0.000    3.359    0.000 parsers.py:2681(_next_iter_line)
  2000072    0.125    0.000    0.125    0.000 {method 'append' of 'list' objects}
        1    0.116    0.116    0.116    0.116 {pandas._libs.lib.to_object_array}
  1000001    0.103    0.000    0.138    0.000 parsers.py:2869(<genexpr>)
2000130/2000117    0.069    0.000    0.069    0.000 {built-in method builtins.len}
       14    0.067    0.005    0.204    0.015 {built-in method builtins.max}

Python 3.5.2

 %prun df2 = pd.read_csv(csv, engine='python')
         7004040 function calls (7004000 primitive calls) in 8.411 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000003    3.662    0.000    3.662    0.000 {built-in method builtins.next}
       10    3.270    0.327    3.270    0.327 {pandas._libs.lib.maybe_convert_numeric}
        1    0.378    0.378    4.492    4.492 parsers.py:2926(_get_lines)
        1    0.263    0.263    8.398    8.398 parsers.py:1029(read)
        4    0.167    0.042    0.245    0.061 parsers.py:2738(_remove_empty_lines)
  1000002    0.141    0.000    3.803    0.000 parsers.py:2681(_next_iter_line)
        1    0.128    0.128    0.128    0.128 {pandas._libs.lib.to_object_array}
  2000067    0.109    0.000    0.109    0.000 {method 'append' of 'list' objects}
  1000001    0.108    0.000    0.133    0.000 parsers.py:2869(<genexpr>)
       14    0.060    0.004    0.193    0.014 {built-in method builtins.max}

@jbrockmendel
Member

These last numbers are with what pandas version?

@dragoljub
Author

These last numbers are with what pandas version?

They are both Pandas 0.23.4

@dragoljub
Author

I tried building the latest Pandas version from source on Python 3.7.1 and still got the same slower performance. Are there any build/compile/cython flags I can set to optimize the parser?

@jreback
Contributor

jreback commented Nov 6, 2018

the entire perf issue is simply the precision flag

you can choose higher precision but it takes more time; this is rarely useful though

@dragoljub
Author

dragoljub commented Nov 6, 2018

the entire perf issue is simply the precision flag

you can choose higher precision but it takes more time; this is rarely useful though

I tried all three float_precision= settings, and for 'high' and None the 3.5x slowdown was still present on Python 3.7.1 vs Python 3.5.2.

I also tried specifying a float_format= in the pd.to_csv() and I still see the same consistent 3.5x gap.

Can you reproduce in Python 3.6?

I should reiterate this perf difference is on the same version of Pandas 0.23.4 just different version of Python.

@dragoljub
Author

Is there a way to specify 'xstrtod', or is that what float_precision=None selects?

I see no performance changes between 'high' and None.

@dragoljub
Author

Is anyone able to reproduce this on Python 3.7.1? I tested the code above on Python 3.7.0 using the Python.org interactive interpreter and it seemed to run faster than in my local 3.7.1 install.

@dragoljub
Author

dragoljub commented Nov 7, 2018

Something is definitely up. I did a side-by-side comparison reading the same CSV file from disk: Python 3.5 reads at 111 MB/s while Python 3.7 reads at only 28 MB/s from the same SSD, both running Pandas 0.23.4.

Could Python 3.7 have changed something in its I/O system?

Python 3.5.2 & Pandas 0.23.4

In [38]: %timeit pd.read_csv(r'out.csv', float_precision='high')
1.86 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python 3.7.1 & Pandas 0.23.4

In [17]: %timeit pd.read_csv(r'out.csv', float_precision='high')
7.97 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


@TomAugspurger
Contributor

TomAugspurger commented Nov 7, 2018

I don't see the difference you're seeing

Python 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:30:03)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.23.4'

In [3]: %time _ = pd.read_csv('out.csv', float_precision='high')
CPU times: user 2.59 s, sys: 214 ms, total: 2.81 s
Wall time: 2.73 s

3.7

Python 3.7.1 (default, Oct 23 2018, 14:07:42)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.23.4'

In [3]: %time _ = pd.read_csv('out.csv', float_precision='high')
CPU times: user 2.61 s, sys: 211 ms, total: 2.82 s
Wall time: 2.74 s

Both of those are using Anaconda's packages.

@dragoljub
Author

Tom, thanks for running this benchmark. Can you post your pd.show_versions()? I want to re-create your stack exactly to do some more testing.

@TomAugspurger
Contributor

3.5

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.2.0
Cython: None
numpy: 1.15.2
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

3.7:

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.5.0
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@dragoljub
Author

I tried several fresh Python installs on Windows. Every Python 3.7 install, 32- or 64-bit, with pandas 0.23.4 installed from pip shows the slower CSV parsing speed. For comparison I also installed a fresh Python 3.6.7, and it parses the same CSV 3x faster.

Is there anyone who could test this on Windows 10 and Python 3.7.1? 😕

@jorisvandenbossche
Member

cc @chris-b1 in case you can test on Windows

@gfyoung gfyoung removed the Needs Info Clarification about behavior needed to assess issue label Nov 8, 2018
@gfyoung
Member

gfyoung commented Nov 8, 2018

Indeed, I can confirm that there is a 3.5X slowdown when using Python 3.7.1 on Windows 10.

When I use Python 3.5.6, the performance is unchanged from 0.22.0 to 0.23.4.

These observations are consistent with what @dragoljub was observing and appear to suggest that this is a Cython / Python issue and not pandas.

@toniatop

toniatop commented Nov 8, 2018

On Windows 10, I see a noticeable slowdown between Python 3.6 and Python 3.7 as well.

(py36) PS C:\Users\ttttt> ipython
Python 3.6.4 | packaged by conda-forge | (default, Dec 24 2017, 10:11:43) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: %time _ = pd.read_csv('out.csv', float_precision='high')
Wall time: 7.03 s

In [3]: %time _ = pd.read_csv('out.csv')
Wall time: 7.04 s

python 3.7

(py37) PS C:\Users\ttttt> ipython
Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
In [2]: df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
In [6]: df.to_csv('out.csv')
In [7]: %time _ = pd.read_csv('out.csv', float_precision='high')
Wall time: 29.4 s

In [8]: %time _ = pd.read_csv('out.csv')
Wall time: 31.3 s

@TomAugspurger
Contributor

For people on windows, how are you installing pandas? From source, wheels, or conda packages? And if conda, from defaults or from conda-forge?

@toniatop

toniatop commented Nov 8, 2018

Here conda-forge:

PS C:\Users\ttttt> activate py37
(py37) PS C:\Users\ttttt> conda install ipython pandas
Solving environment: done
## Package Plan ##
  environment location: C:\Miniconda\envs\py37
  added / updated specs:
    - ipython
    - pandas
The following packages will be downloaded:
    package                    |            build
    ---------------------------|-----------------
    ipython-7.1.1              |py37h39e3cac_1000         1.1 MB  conda-forge
    wcwidth-0.1.7              |             py_1          17 KB  conda-forge
    six-1.11.0                 |        py37_1001          21 KB  conda-forge
    pytz-2018.7                |             py_0         226 KB  conda-forge
    icc_rt-2017.0.4            |       h97af966_0         8.0 MB
    pygments-2.2.0             |             py_1         622 KB  conda-forge
    pickleshare-0.7.5          |        py37_1000          12 KB  conda-forge
    certifi-2018.10.15         |        py37_1000         137 KB  conda-forge
    backcall-0.1.0             |             py_0          13 KB  conda-forge
    mkl_random-1.0.1           |   py37h77b88f5_1         267 KB
    decorator-4.3.0            |             py_0          10 KB  conda-forge
    numpy-1.15.4               |   py37ha559c80_0          36 KB
    mkl-2019.0                 |              118       178.1 MB
    pandas-0.23.4              |py37h830ac7b_1000         8.7 MB  conda-forge
    prompt_toolkit-2.0.7       |             py_0         218 KB  conda-forge
    python-dateutil-2.7.5      |             py_0         218 KB  conda-forge
    colorama-0.4.0             |             py_0          15 KB  conda-forge
    mkl_fft-1.0.6              |   py37hdbbee80_0         120 KB
    jedi-0.13.1                |        py37_1000         228 KB  conda-forge
    intel-openmp-2019.0        |              118         1.7 MB
    parso-0.3.1                |             py_0          59 KB  conda-forge
    traitlets-4.3.2            |        py37_1000         130 KB  conda-forge
    ipython_genutils-0.2.0     |             py_1          21 KB  conda-forge
    numpy-base-1.15.4          |   py37h8128ebf_0         3.9 MB
    blas-1.0                   |              mkl           6 KB
    ------------------------------------------------------------
                                           Total:       203.7 MB

@TomAugspurger
Contributor

Thanks @toniatop. Can you create a couple environments with just defaults to see if it's an issue with how it was compiled for conda-forge?

@toniatop

toniatop commented Nov 8, 2018

I redid everything forcing --channel anaconda; same results.

@TomAugspurger
Contributor

cc @jjhelmus any thoughts on
#23516 (comment)? The tldr is that

  • pd.read_csv is 3-4x slower on python 3.7 vs. python 3.6
  • the slowdown is (apparently) windows only
  • observed on packages from conda-forge and defaults

@dragoljub
Author

I'd be surprised that change matters, but I'm at a loss here, so maybe! Another possibility is that cython made some tweaks to threading logic for python 3.7 compat - again, wouldn't think that's the issue here, but possible some kind of bad interaction.
cython/cython#1978

Good info. I'm just surprised that people do not see this on Linux. I'll try OSX next.

@TomAugspurger
Contributor

TomAugspurger commented Nov 9, 2018 via email

@cgohlke
Contributor

cgohlke commented Nov 9, 2018

Looks like the slowdown first shows up in Python 3.7.0a4:

>C:\python-3.7.0a3-amd64\python.exe -m cProfile -s tottime pandascsv.py
         235992 function calls (229477 primitive calls) in 21.525 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      200   11.316    0.057   11.316    0.057 {method 'astype' of 'numpy.ndarray' objects}
      100    6.596    0.066    6.596    0.066 {pandas._libs.writers.write_csv_rows}
        1    2.111    2.111    2.112    2.112 {method 'read' of 'pandas._libs.parsers.TextReader' objects}       
>C:\python-3.7.0a4-amd64\python.exe -m cProfile -s tottime pandascsv.py
         236639 function calls (230127 primitive calls) in 26.550 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    9.849    9.849    9.850    9.850 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
      200    8.766    0.044    8.766    0.044 {method 'astype' of 'numpy.ndarray' objects}
      100    6.469    0.065    6.469    0.065 {pandas._libs.writers.write_csv_rows}      

@dragoljub
Author

Very interesting. I'll try Py 3.7.0a3 to confirm this on my systems. Is the diff between 0a3 and 0a4 easy to find from Python's release notes?

@cgohlke
Contributor

cgohlke commented Nov 9, 2018

easy to find from Pythons release notes?

https://docs.python.org/3.7/whatsnew/changelog.html#python-3-7-0-alpha-4

Maybe bpo-29240: Add a new UTF-8 mode: implementation of the PEP 540?

See also python/cpython@v3.7.0a3...v3.7.0a4

@dragoljub
Author

I can also confirm that the change from Python 3.7.0a3 to 3.7.0a4 introduces the slowdown on my Win10 test system. Thanks for pinpointing when it appeared.

Python 3.7.0a3 -- Fast Parse

 %prun df2 = pd.read_csv(csv)
         5781 function calls (5743 primitive calls) in 3.062 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.953    2.953    2.955    2.955 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.063    0.063    0.063    0.063 internals.py:5017(_stack_arrays)
        1    0.016    0.016    3.052    3.052 parsers.py:414(_read)
        1    0.009    0.009    3.062    3.062 <string>:1(<module>)
        1    0.009    0.009    0.009    0.009 parsers.py:1685(__init__)
       32    0.004    0.000    0.004    0.000 {built-in method nt.stat}
        1    0.001    0.001    0.001    0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
      321    0.001    0.000    0.002    0.000 common.py:811(is_integer_dtype)
      516    0.001    0.000    0.001    0.000 common.py:1835(_get_dtype_type)
        7    0.001    0.000    0.001    0.000 {built-in method numpy.core.multiarray.empty}
       32    0.000    0.000    0.005    0.000 <frozen importlib._bootstrap_external>:1235(find_spec)
      988    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      163    0.000    0.000    0.000    0.000 common.py:1527(is_float_dtype)
      718    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
      192    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:59(<listcomp>)
      192    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:57(_path_join)
        8    0.000    0.000    0.005    0.001 <frozen importlib._bootstrap_external>:1119(_get_spec)
      133    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
       68    0.000    0.000    0.000    0.000 generic.py:7(_check)

Python 3.7.0a4 -- Slow Parse

 %prun df2 = pd.read_csv(csv)
         8007 function calls (7219 primitive calls) in 14.192 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   14.092   14.092   14.094   14.094 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.061    0.061    0.062    0.062 internals.py:5017(_stack_arrays)
        1    0.016    0.016   14.192   14.192 parsers.py:414(_read)
        1    0.008    0.008    0.008    0.008 parsers.py:1685(__init__)
       32    0.004    0.000    0.004    0.000 {built-in method nt.stat}
        1    0.001    0.001    0.001    0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
      321    0.001    0.000    0.002    0.000 common.py:811(is_integer_dtype)
      516    0.001    0.000    0.001    0.000 common.py:1835(_get_dtype_type)
        7    0.001    0.000    0.001    0.000 {built-in method numpy.core.multiarray.empty}
    115/4    0.000    0.000    0.001    0.000 abc.py:194(__subclasscheck__)
       32    0.000    0.000    0.005    0.000 <frozen importlib._bootstrap_external>:1322(find_spec)
 1324/988    0.000    0.000    0.002    0.000 {built-in method builtins.isinstance}
  937/725    0.000    0.000    0.002    0.000 {built-in method builtins.issubclass}
      163    0.000    0.000    0.000    0.000 common.py:1527(is_float_dtype)
      192    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:57(_path_join)
      192    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:59(<listcomp>)
        8    0.000    0.000    0.005    0.001 <frozen importlib._bootstrap_external>:1206(_get_spec)
    89/78    0.000    0.000    0.000    0.000 {built-in method builtins.len}
      192    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}

@dragoljub
Author

I tried playing around with the UTF-8 mode settings with ENV variables and cmd line args on Windows and was not able to get faster parsing speed on Python 3.7.0a4.

https://www.python.org/dev/peps/pep-0540/#proposal

The benefit of the locale coercion approach is that it helps ensure that encoding handling in binary extension modules and child processes is consistent with Python's encoding handling. The upside of the UTF-8 Mode approach is that it allows an embedding application to change the interpreter's behaviour without having to change the process global locale settings.

So is it possible that somewhere in the C parser extension we could just set the locale to UTF-8 and this issue would go away on Windows? I was hoping the ENV variable settings would fix it, but they made no difference in my testing.
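For reference, the two documented switches for PEP 540's UTF-8 Mode on 3.7 are an environment variable and a -X option (shown here as a sketch; on Windows cmd the variable is set with `set PYTHONUTF8=1` before launching Python):

```shell
# Enable PEP 540 UTF-8 Mode via environment variable (prints 1 when active)
PYTHONUTF8=1 python3 -c "import sys; print(sys.flags.utf8_mode)"

# ...or via the -X command-line option
python3 -X utf8 -c "import sys; print(sys.flags.utf8_mode)"
```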

@cgohlke
Contributor

cgohlke commented Nov 10, 2018

I compared the statement df2 = pd.read_csv(csv) on Python 3.7.0a3 and a4 in the Visual Studio profiler. The culprit is the isdigit function called in the parsers extension module. On 3.7.0a3 the function is fast at ~8% of samples. On 3.7.0a4 the function is slow at ~64% samples because it calls the _isdigit_l function, which seems to update and restore the locale in the current thread every time...

3.7.0a3:
Function Name	Inclusive Samples	Exclusive Samples	Inclusive Samples %	Exclusive Samples %	Module Name
 + [parsers.cp37-win_amd64.pyd]	705	347	28.52%	14.04%	parsers.cp37-win_amd64.pyd
   isdigit	207	207	8.37%	8.37%	ucrtbase.dll
 - _errno	105	39	4.25%	1.58%	ucrtbase.dll
   toupper	24	24	0.97%	0.97%	ucrtbase.dll
   isspace	21	21	0.85%	0.85%	ucrtbase.dll
   [python37.dll]	1	1	0.04%	0.04%	python37.dll
3.7.0a4:
Function Name	Inclusive Samples	Exclusive Samples	Inclusive Samples %	Exclusive Samples %	Module Name
 + [parsers.cp37-win_amd64.pyd]	8,613	478	83.04%	4.61%	parsers.cp37-win_amd64.pyd
 + isdigit	6,642	208	64.04%	2.01%	ucrtbase.dll
 + _isdigit_l	6,434	245	62.03%	2.36%	ucrtbase.dll
 + _LocaleUpdate::_LocaleUpdate	5,806	947	55.98%	9.13%	ucrtbase.dll
 + __acrt_getptd	2,121	1,031	20.45%	9.94%	ucrtbase.dll
   FlsGetValue	647	647	6.24%	6.24%	KernelBase.dll
 - RtlSetLastWin32Error	296	235	2.85%	2.27%	ntdll.dll
   _guard_dispatch_icall_nop	101	101	0.97%	0.97%	ucrtbase.dll
   GetLastError	46	46	0.44%	0.44%	KernelBase.dll
 + __acrt_update_multibyte_info	1,475	246	14.22%	2.37%	ucrtbase.dll
 - __crt_state_management::get_current_state_index	1,229	513	11.85%	4.95%	ucrtbase.dll
 + __acrt_update_locale_info	1,263	235	12.18%	2.27%	ucrtbase.dll
 - __crt_state_management::get_current_state_index	1,028	429	9.91%	4.14%	ucrtbase.dll
   _ischartype_l	383	383	3.69%	3.69%	ucrtbase.dll

@dragoljub
Author

Great work debugging this. I would guess any other code paths calling isdigit would also be slowed down on windows.

@jbrockmendel
Member

Just a note for people looking at xstrtod (and thanks for doing so BTW, this looks like a really tough issue): there are two of them (#19361). Off the top of my head I'm not sure which is used in what context.

@dragoljub
Author

dragoljub commented Nov 13, 2018

I may have found a pure-Python example that shows a similar but smaller 2.5x slowdown. Also note the run-to-run variability is 15x higher on 3.7.1, possibly indicating that the locale argument is passed/used in some calls but not others.

Can someone test this on linux and see if you see a difference?

Python 3.7.1

digits = ''.join([str(i) for i in range(10)]*10000000)
%timeit digits.isdigit() # --> 2.5X slower on python 3.7.1
537 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Python 3.5.2

digits = ''.join([str(i) for i in range(10)]*10000000)
%timeit digits.isdigit() # --> 2.5X slower on python 3.7.1
215 ms ± 986 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
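A stdlib-only version of the same microbenchmark, for anyone checking on Linux without IPython (a sketch; string size and repeat counts are illustrative):

```python
import timeit

# 10 MB of ASCII digits, same shape as the IPython example above
digits = "0123456789" * 1000000

# Best of 3 rounds of 10 calls each
per_call = min(timeit.repeat("digits.isdigit()",
                             globals={"digits": digits},
                             number=10, repeat=3)) / 10
print("str.isdigit over {:,} chars: {:.1f} ms/call".format(len(digits), per_call * 1000))
```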

Based on the comments at https://bugs.python.org/msg329789, str.isdigit is a pure-Unicode code path, so this example may be unrelated.

@cgohlke has posted a nice minimal example showing the slowdown: https://bugs.python.org/msg329790 Thanks! 👍

@chris-b1
Contributor

Thanks for the investigation @cgohlke. For 0.24 I suppose we should just shim in an ASCII isdigit function?

(MUSL, MIT licensed)
https://github.com/esmil/musl/blob/master/src/ctype/isdigit.c#L5

@chris-b1 chris-b1 added this to the 0.24.0 milestone Nov 18, 2018
@dragoljub
Author

@chris-b1 I was thinking the same thing, since it's quite a simple function; the downside is that changing the locale would then have no effect on parsing. I wonder how the Windows isdigit function ends up calling the locale version. I don't think that source is available.

@cgohlke
Contributor

cgohlke commented Nov 19, 2018

I wonder how the windows isdigit function ends up calling the locale version. I don't think that source is available.

The source code for the Windows UCRT is available with recent Windows SDK. It is usually installed under C:\Program Files (x86)\Windows Kits\10\Source.

The isdigit and _isdigit_l functions are defined in ucrt\convert\_ctype.cpp:

extern "C" extern __inline int (__cdecl isdigit)(int const c)
{
    return __acrt_locale_changed()
        ? (_isdigit_l)(c, nullptr)
        : fast_check(c, _DIGIT);
}

extern "C" extern __inline int (__cdecl _isdigit_l)(int const c, _locale_t const locale)
{
    _LocaleUpdate locale_update(locale);

    return _isdigit_l(c, locale_update.GetLocaleT());
}

The following comment is from the _wsetlocale function:

// If no call has been made to setlocale to change locale from "C" locale
// to some other locale, we keep locale_changed = 0. Other functions that
// depend on locale use this variable to optimize performance for C locale
// which is normally the case in applications.

@dragoljub
Author

So if I'm understanding correctly: even if we set the locale in Python to "C", the Windows isdigit function would still call the locale-aware version and slow down parsing, because the locale has 'changed'.

Is that also the case in Python 3.7.0a3? Does setting the locale to "C" slow parsing down there?
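One way to poke at this from Python is to call the C runtime's isdigit through ctypes before and after a setlocale call (a sketch: CDLL(None) resolves libc symbols on POSIX, on Windows you would load ucrtbase instead, and the ctypes call overhead may well swamp the effect the Visual Studio profile shows, so treat the profile above as authoritative):

```python
import ctypes
import locale
import time

# Resolve the C runtime's isdigit. CDLL(None) searches the current process
# on POSIX; on Windows use ctypes.CDLL("ucrtbase") instead.
libc = ctypes.CDLL(None)
libc.isdigit.argtypes = [ctypes.c_int]
libc.isdigit.restype = ctypes.c_int

def bench(n=200000):
    c = ord("7")
    start = time.perf_counter()
    for _ in range(n):
        libc.isdigit(c)
    return time.perf_counter() - start

before = bench()
locale.setlocale(locale.LC_ALL, "C")  # per the UCRT source, any setlocale call flips locale_changed
after = bench()
print("before setlocale: {:.3f}s  after setlocale: {:.3f}s".format(before, after))
```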

@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Nov 25, 2018
@dragoljub
Author

@jreback @TomAugspurger Do you think a simple shim of the isdigit function in the C parser code would be a fix we could entertain?

It would assume an ASCII-compatible encoding for numeric columns, which I think covers all or most CSV file encodings for digits.

int isdigit(int c)
{
    /* locale-independent ASCII check; avoids the UCRT locale lookup */
    return (unsigned)c - '0' < 10;
}

@chris-b1
Contributor

Yeah, if you want to submit a PR, ping me; if not, I'll try to get to it soon.

@dragoljub
Author

@chris-b1 Go for it! 😄

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Nov 28, 2018
@dragoljub
Author

@chris-b1 @jreback Thanks for getting this done! Really appreciate it! 👍 😄
