
PERF/TST: Figure out what's going on with vb_suite test result instability (frame_reindex_* in particular) #3158

Closed
stephenwlin opened this issue Mar 24, 2013 · 50 comments

@stephenwlin
Contributor

Example:

frame_reindex_axis1                  3.1934       2.0423       1.5636
frame_reindex_axis0                  1.4083       0.4687       3.0049

Target [b42175f] : dummy commit
Base   [da54321] : SCR: format tweaks to test_perf

with the following:

stephen@trantor:~/Projects/pandas$ git diff da54321 b42175f
diff --git a/TODO.rst b/TODO.rst
index fc87174..97aa56f 100644
--- a/TODO.rst
+++ b/TODO.rst
@@ -1,6 +1,6 @@
 DOCS 0.7.0
 ----------
-- ??? no sort in groupby
+- ???? no sort in groupby
 - DONE concat with dict
 - Gotchas re: integer indexing

These random outliers are making it hard to figure out what's really happening (the optimization I wanted to test seems to improve things by 25%-40% in a C testbed, but I can't get clean results.)

These tests aren't doing anything fancy either, so if they're affected, it's possible the entire suite is affected as well (just less dramatically...)

Possibly some combination of #3129, #3146 (cache and alignment) are related, or maybe some other source of non-determinism in Python and numpy. I can't reproduce this instability in C by trying to proxy it either by randomly adjusting byte alignment, but I haven't tried screwing around with memory paging yet.

@stephenwlin
Contributor Author

(I suspect the problem is due to positive but non-deterministic performance optimizations. If we can't resolve the source of the non-determinism, then we have to choose between less consistent but better average performance and more consistent but worse average performance, which is not a nice choice.)

@stephenwlin
Contributor Author

It doesn't seem like memmove() is the culprit of frame_reindex_axis0 instability:

-- frame_reindex_axis0 --
mean change with dummy == -0.0048 (i.e. -0.48%)
stddev change with dummy == 0.3068 (i.e. 35.91%)
hypothesis: mean change with dummy != 0.0 => p == 0.8760 (insignificant)
mean change with memmove == 0.0148 (i.e. 1.50%)
stddev change with memmove == 0.3442 (i.e. 41.09%)
hypothesis: mean change with memmove != 0.0 => p == 0.6673 (insignificant)
hypothesis: variance of change greater with dummy => p == 0.8743 (insignificant)
hypothesis: variance of change greater with memmove => p == 0.1257 (insignificant)
hypothesis: variance of change differs => p == 0.2515 (insignificant)

The changes quoted (before the percentage values) are natural logs of the ratios (target/base).

This is based on three commits: the base commit is current master but with memmove() disabled, dummy is a commit based on the base commit with a null change, and memmove is a commit based on base with memmove() re-enabled. The results are based on 100 vbench runs each of dummy vs base and memmove vs base. The variance is nominally greater in the memmove case, but the results don't appear to be significant (at the p < 0.05 level).

It's possible my p-values are off, but in any case just grepping the results makes clear that the large variance was present without memmove(). A lot of other results are somewhat curious, too...still trying to analyze it to make sure I haven't done anything wrong.
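
For reference, here's a rough sketch of how numbers like those could be computed from per-run timings (my reconstruction, not necessarily the exact script used; the one-sample t-test stands in for whatever hypothesis test actually produced the p-values):

    import numpy as np
    from scipy import stats

    def summarize_change(base_times, target_times, label):
        # "change" is the natural log of the per-run ratio target/base
        change = np.log(np.asarray(target_times) / np.asarray(base_times))
        mean, std = change.mean(), change.std(ddof=1)
        print("mean change with %s == %.4f (i.e. %.2f%%)" % (label, mean, (np.exp(mean) - 1) * 100))
        print("stddev change with %s == %.4f (i.e. %.2f%%)" % (label, std, (np.exp(std) - 1) * 100))
        # hypothesis: mean change != 0.0
        _, p = stats.ttest_1samp(change, 0.0)
        print("hypothesis: mean change with %s != 0.0 => p == %.4f" % (label, p))
        return change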

@ghost

ghost commented Mar 27, 2013

added -n to control the number of vb runs to average over for each vb
added -N to collect results from repeated runs on HEAD, which should make it very
easy to get min, max, variance, etc., even confidence intervals.

λ time ./test_perf.sh -H -N 3 -r axis0 -d o.pickle
This script compares the performance of two commits.
Make sure the python 'vbench' library is installed.

Setting the BUILD_CACHE_DIR env var to a temp directory will
potentially speed up subsequent runs.

*** LOG_FILE = /home/user1/src/pandas/vb_suite.log

profile_head
Running 2 benchmarks
Running 2 benchmarks
Running 2 benchmarks
*** 
------------------------------------------------------------------------
Test name                       |     #0     |     #1     |     #2     |
------------------------------------------------------------------------
frame_reindex_axis0             |     0.9001 |     0.8989 |     0.8986 |
stats_rank2d_axis0_average      |    40.2542 |    40.3576 |    40.2745 |
------------------------------------------------------------------------
Test name                       |     #0     |     #1     |     #2     |
------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [3844347] : BLD: test_perf, add -N, run HEAD vbenchmarks multiple times seperately

*** Results were also written to the logfile at '/home/user1/src/pandas/vb_suite.log'
*** The results DataFrame was written to '/home/user1/src/pandas/o.pickle'


real    0m18.495s
user    0m17.718s
sys 0m0.762s

@ghost

ghost commented Apr 2, 2013

Pushed some more bells and whistles to test_perf. vb_suite lets the timeit module choose
its own iteration count, which led to excessive runtimes without any boost to the stability of results
afaict; that's now controllable via the -c/--ncalls cmd arg.
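
(For illustration only, roughly the difference between letting timeit scale the loop count and fixing it; the statement and setup below are made-up stand-ins, not an actual vbench:)

    import timeit

    timer = timeit.Timer(
        stmt="df.reindex(idx)",
        setup=("import numpy as np; import pandas as pd; "
               "df = pd.DataFrame(np.random.randn(1000, 10)); "
               "idx = np.random.permutation(1000)"),
    )

    # Letting timeit pick its own loop count: keep scaling `number` up until a
    # single measurement takes a noticeable amount of wall time.
    number = 1
    while timer.timeit(number=number) < 0.2:
        number *= 10

    # With -c/--ncalls the loop count is fixed instead, trading some per-call
    # noise for a much shorter total runtime.
    elapsed_ms = timer.timeit(number=1) * 1000.0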

I ran a few iterations, still variable (though much quicker about it). One
thing I seem to notice is that the variation is strongly bimodal, even though
my box was quiet for the duration. Also, when there's a spike,
it runs through the entire suite; see columns #1 and #3.

                                                 #0          #1          #2          #3          #4
name                                                                                               
setattr_dataframe_index                    0.004625    0.003004    0.003195    0.004816    0.004578
getattr_dataframe_index                    0.000191    0.000191    0.000191    0.000191    0.000191
frame_mult_no_ne                          10.584211   10.447598   10.479403   11.007833   12.671614
frame_add                                 10.768604   10.713387   10.706806   10.860205   11.874008
frame_multi_and_no_ne                    106.887960  125.117826   95.654011   92.890835   96.286201
frame_add_no_ne                           10.350227   11.106586   10.406637   10.443211   10.560036
frame_multi_and_st                        45.163012   60.193014   45.246792   45.523214   46.834993
frame_mult_st                             10.841370   11.248827   10.709620   10.581207   11.228418
frame_multi_and                           43.519592   43.063021   35.759592   34.731817   36.842394
frame_mult                                10.872030   10.694170   10.713577   10.757208   11.470413
frame_add_st                              11.837816   11.473227   11.109018   11.459017   10.744190
series_constructor_ndarray                 0.013399    0.020409    0.013638    0.020790    0.013399
frame_constructor_ndarray                  0.040627    0.063610    0.043201    0.063801    0.043392
ctor_index_array_string                    0.014162    0.020981    0.014019    0.020981    0.014162
series_ctor_from_dict                      2.901793    4.373789    2.982616    4.417038    2.894592
frame_get_numeric_data                     0.058603    0.089598    0.059795    0.091028    0.059414
frame_ctor_nested_dict_int64              99.905586  147.639990   99.647570  144.639587  102.165985
frame_ctor_nested_dict                    80.397224  116.105175   79.887819  114.512205   82.165766
frame_ctor_list_of_dict                  104.632187  136.345196  103.941393  134.804010  105.473423
frame_boolean_row_select                   0.201607    0.296593    0.197792    0.312614    0.199986
frame_reindex_both_axes                    0.331402    0.499201    0.333214    0.499964    0.332403
frame_fancy_lookup                         1.435184    2.169800    1.453590    2.197599    1.404572
frame_iteritems_cached                     0.054598    0.083017    0.055981    0.081015    0.053406
frame_reindex_both_axes_ix                 0.406599    0.612402    0.409603    0.611019    0.409174
frame_insert_500_columns                  90.004826  132.124567   89.684772  131.825781   90.292788
frame_get_dtype_counts                     0.096035    0.144386    0.103378    0.145435    0.095987
frame_wide_repr                            0.554991    0.826406    0.556040    0.818205    0.549984
frame_iteritems                            2.042198    3.060579    2.065802    3.067398    2.058601
frame_fancy_lookup_all                    29.863596   30.314827   24.519205   32.511568   23.088789
frame_fillna_inplace                      15.973997   19.704390   15.753412   19.800377   15.702391
series_string_vector_slice               190.852976  275.263023  188.290787  273.380995  190.084171
frame_reindex_upcast                      22.756624   28.754187   20.848370   27.637005   21.184397
frame_to_string_floats                    44.160223   64.106607   43.102169   63.887978   42.306566
frame_reindex_axis0                        1.006842    0.920630    0.856400    0.931978    0.858164
frame_reindex_axis1                        2.419615    3.101826    2.277994    3.038025    2.289772
groupby_multi_different_functions         15.848398   22.987604   15.746832   22.593784   15.834999
groupby_frame_apply_overhead               9.788418   14.041424    9.688568   14.140987    9.358597
groupby_frame_median                       9.151220   12.225819    8.844614   12.213230    8.840609
series_value_counts_strings                4.848003    7.175398    4.818583    7.148600    4.795408
groupby_apply_dict_return                 48.514414   73.194981   50.197601   73.086405   48.549414
groupby_frame_cython_many_columns          4.768229    5.860615    4.372168    5.847406    4.351425
groupby_indices                           10.500813   14.535189   11.385632   14.624834   10.096216
groupby_first_float32                      5.170012    6.640196    4.579020    6.672382    4.540825
groupby_last_float32                       5.550194    7.085371    4.863214    7.158184    4.772377
groupby_multi_cython                      20.029211   27.652216   19.573402   28.120232   19.629812
series_value_counts_int64                  2.928400    4.360199    2.777624    4.222822    2.896833
groupby_multi_python                     104.568195  154.496574  105.849409  155.120993  104.800367
groupby_first                              6.722784    6.712580    4.733801    6.899405    4.831600
groupby_transform                        207.882404  210.259581  151.535606  210.865211  155.929804
groupby_multi_different_numpy_functions   22.729206   22.763395   15.735197   22.696781   15.843391

@stephenwlin
Contributor Author

OK, good to know it's not just my machine that has this issue.

I'm also not sure if it's bimodal with a large enough sample size,
actually... it seemed pretty (log-)normally distributed when I ran 100
iterations of the reindexing tests. (This was before your multiple-runs-against-head
additions, so I was looking at the log of the ratios between base and
target times rather than the raw times.)

@ghost

ghost commented Apr 5, 2013

I pushed another update to test_perf: it now sets cpu affinity to protect against cold-cache surprises
due to thread migration on multicore machines, and also does a gc collect between
benchmarks, including the libc trim voodoo from #2659.
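
(A minimal sketch of what those two steps amount to, assuming the affinity package and glibc; illustrative only, not the actual test_perf code:)

    import ctypes
    import ctypes.util
    import gc

    import affinity  # pip install affinity

    def pin_to_one_cpu(cpu=0):
        # Pin the current process to a single core so thread migration
        # doesn't keep cold-starting the CPU caches mid-benchmark.
        affinity.set_process_affinity_mask(0, 1 << cpu)

    def settle_memory_between_benchmarks():
        # Collect garbage now so a collection is less likely to fire in the
        # middle of the next benchmark...
        gc.collect()
        # ...and the libc trim voodoo from #2659: ask glibc to hand free
        # arena memory back to the OS.
        libc = ctypes.CDLL(ctypes.util.find_library("c"))
        libc.malloc_trim(0)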

using

 ./test_perf.sh --ncalls 1 --head --hrepeats 10 --outdf o.pickle

before and after definitely seems to have made an impact:

df=pd.load('o.pickle')
df=df[df.icol(0)>0.1]
df['max-min [% of mean]']=(df.max(1)-df.min(1))/df.mean(1)*100

before

Out[58]: 
                                      #0          #1  max-min [% of mean]
name                                                                     
frame_mult_no_ne               10.228872   11.081934             8.068553
frame_add                      10.586977   10.816097             5.836094
frame_multi_and_no_ne         107.260942   91.881990            32.076782
frame_add_no_ne                10.552168   11.853933            14.014677
frame_multi_and_st             45.034885   45.400143            30.600854
frame_mult_st                  10.640144   10.517120            16.973979
frame_multi_and                35.143137   35.255194            20.836927
frame_mult                     10.622978   10.655880             1.857611
frame_add_st                   10.926008   11.259079            11.381603
series_ctor_from_dict           2.861023    4.272938            44.420730
frame_ctor_nested_dict_int64   97.750902  142.644882            41.801547
frame_ctor_nested_dict         79.347849  115.172148            39.968815
frame_ctor_list_of_dict       102.699041  131.904125            32.164763
frame_boolean_row_select        0.298023    0.354052            52.197071
frame_reindex_both_axes         0.344992    0.579119            58.129294

after

In [59]: df[:15].iloc[:,[0,1,-1]]
Out[59]: 
                                      #0          #1  max-min [% of mean]
name                                                                     
frame_mult_no_ne               10.447979   10.501146             8.025565
frame_add                      10.565996   10.575056             4.184737
frame_multi_and_no_ne         105.782032   92.047930            15.303844
frame_add_no_ne                10.586023   10.318995             2.968311
frame_multi_and_st             45.217037   45.341015             3.462760
frame_mult_st                  10.675907   10.501862             4.573387
frame_multi_and                45.274973   45.464039             2.581288
frame_mult                     10.690928   10.643005             4.730513
frame_add_st                   10.578156   10.483980             2.867322
series_ctor_from_dict           2.854109    2.865076             2.026706
frame_ctor_nested_dict_int64  100.254774  102.154016             2.055916
frame_ctor_nested_dict         77.082872   78.242064             2.039485
frame_ctor_list_of_dict        99.193096  100.985050             5.003410
frame_boolean_row_select        0.237942    0.235081            45.588937
frame_reindex_both_axes         0.344038    0.356197             9.131816

There are still outliers in the data (and more in the entire dataset), but that's definitely
a marked improvement.

@jreback
Contributor

jreback commented Apr 5, 2013

FYI, the vbenches like frame_add, frame_add_no_ne and such are using numexpr (or not, for comparison).
numexpr will try to use as many cores as you have; in theory this could cause affinity issues (as they mess with the caches). Maybe we should move these benches to the end?

@ghost

ghost commented Apr 5, 2013

Ok, good to know.

After disabling the ext4 journal commit interval and rerunning again 25 times,
here are all the vbenches that have more than 10% in the first column.
Some of these are really not an issue; some are repeat offenders
(frame_multi_and_no_ne and frame_reindex_axis1, for example).

Some of these are IO functions, which I'd expect to be a bit wonky.

@jreback: got any more insights?

                                        max-min / mean %       std         min         max
name                                                                                      
reindex_frame_level_align                      10.093073  0.025031    0.896613    0.991344
frame_reindex_axis0                            10.251804  0.022512    0.812372    0.898361
indexing_dataframe_boolean_no_ne               10.320826  3.102481  135.820309  150.233984
stats_rank_average                             10.432113  0.841592   28.482993   31.641324
frame_reindex_axis1                            10.799220  0.060237    2.086322    2.328316
groupby_last_float32                           11.341204  0.106584    4.812638    5.368948
frame_mult_no_ne                               11.934345  0.233660   10.227362   11.478027
groupby_indices                                11.951571  0.331111   10.515690   11.878411
indexing_dataframe_boolean_rows_object         12.426390  0.011096    0.468651    0.528018
frame_fancy_lookup_all                         12.527807  0.600790   22.018671   24.979353
query_store_table                              12.993687  0.156258    4.257679    4.854043
frame_drop_dup_na_inplace                      13.472561  0.087586    3.214677    3.657023
frame_multi_and_no_ne                          14.501948  2.623803   91.494640  104.933977
frame_boolean_row_select                       16.282358  0.005699    0.196298    0.230710
frame_reindex_upcast                           17.674176  0.855286   19.288619   23.082654
groupby_frame_cython_many_columns              19.232371  0.158495    4.280965    5.124966
frame_reindex_both_axes_ix                     19.647393  0.014742    0.408014    0.491619
join_dataframe_index_single_key_small          19.757504  0.487586    7.910967    9.624640
timeseries_period_downsample_mean              20.035858  0.323574    7.982969    9.604692
append_frame_single_homogenous                 21.374095  0.015835    0.231981    0.284990
read_store                                     23.731843  0.078693    1.738310    2.167940
groupby_simple_compress_timing                 27.898217  2.048904   36.627372   47.003667
reindex_multiindex                             31.810866  0.074761    1.168331    1.557271

@stephenwlin
Contributor Author

nice! that looks much better.


@jreback
Contributor

jreback commented Apr 5, 2013

There is another source of variability here: the groupby timing in simple_compress is related to how random the input is, and thus how many groups are formed.

@ghost

ghost commented Apr 5, 2013

any idea why it's the "no_ne" vbenches that are misbehaving?

@jreback
Contributor

jreback commented Apr 5, 2013

they are really large (100k rows), prob cache missing a lot

@jreback
Contributor

jreback commented Apr 5, 2013

also, multi_ne is 3 Boolean ops, but because the first 2 ops are Boolean ops among random numbers, maybe the 3rd op has more variance - just a guess

@ghost

ghost commented Apr 5, 2013

Actually, I trimmed those down to 20000 rows because they were slowing down the
suite too much for doing this sort of testing.

Knocking frame_multi_and_no_ne down to 5000/1000 rows doesn't do much;
worse, if anything.

                       max-min / mean %       std     min       max
name                                                               
frame_multi_and_no_ne         54.453012  0.416195  4.7369  7.462978

                       max-min / mean %       std        min        max
name                                                                   
frame_multi_and_no_ne         46.707422  2.355675  24.782896  36.755085

@ghost

ghost commented Apr 5, 2013

At least it's now a breeze to test a hypothesis.

@ghost

ghost commented Apr 5, 2013

Looks like the first result is always off (cache, probably); without it things look
better. That looks pretty close to me.

                                        max-min / mean %       std        min        max
name                                                                                    
frame_reindex_columns                          10.150123  0.012239   0.342131   0.378132
join_dataframe_integer_2key                    10.278537  0.192762   6.077051   6.716967
stats_rank_average                             10.570859  0.983739  28.692961  31.923056
join_dataframe_index_single_key_bigger         10.972698  0.270094   8.975029  10.004997
melt_dataframe                                 11.212243  0.069079   1.907825   2.129078
reindex_frame_level_reindex                    11.430236  0.032220   0.875950   0.979900
frame_reindex_upcast                           11.762654  0.886662  20.659924  23.224115
stat_ops_series_std                            11.769578  0.049148   1.159906   1.302958
frame_reindex_axis0                            12.432828  0.038989   0.890017   1.007080
append_frame_single_homogenous                 15.578219  0.015927   0.262022   0.306129
frame_reindex_axis1                            16.136271  0.121112   2.275944   2.653122
join_dataframe_index_single_key_small          17.261100  0.569997   8.114100   9.642839
reindex_frame_level_align                      18.296387  0.054913   0.951052   1.132965
frame_boolean_row_select                       45.346263  0.034772   0.228882   0.342846

@ghost

ghost commented Apr 5, 2013

@jreback, is numexpr completely out of the code path in the no_ne vbenches? Because the longer
time for the first iteration could be the JIT doing its thing.

@jreback
Contributor

jreback commented Apr 5, 2013

it should be completely bypassed on the _ne; different code path

@stephenwlin
Contributor Author

also does a gc collect

I guess this might be due to other libraries rather than pandas, but, just out of curiosity, do we know if pandas actually uses circular references anywhere anyway? I feel like everything I've seen is a directed acyclic graph, at worst... if there really are any reference cycles we ought to remove them or convert them to use weak refs in one direction, imho.

@wesm, do you know off the top of your head?

max-min / mean %

This could be too sensitive to outliers to be a useful headline sort metric...why don't you do something like std / mean instead, or interquartile range / mean?

really large (100k rows), prob cache missing a lot

I found out in my experimental SIMD proxy testing that testing very large arrays can be misleading, btw, so maybe we shouldn't be doing this (or should be supplementing the tests with other array sizes)... will explain in a post soon.

@ghost

ghost commented Apr 5, 2013

I'll do that. But I think outliers would produce false positives rather than false negatives,
for which the other columns give you a good overview.
And I'm using more samples this time, so the mean shouldn't be that sensitive
to the typical proportion of outliers I've seen so far.

Interquartile range would obscure the outliers, wouldn't it? That would produce false negatives.

@stephenwlin
Contributor Author

I just think complete outliers could always be because of one-time hiccups on the host system; what is annoying is the consistent overall variability; but I guess if you're ok with false positives and have the underlying data anyway it'd be ok.

@ghost

ghost commented Apr 5, 2013

Well, it's not a lot of data, and I'm actively looking for causes, so I think it's ok. std/mean
did produce slightly different results.

It'd be good if you and jeff could repeat the test on your own machines:

# in a shell: update the repo and build in place
pull master + setup.py develop

pip install affinity
time ./test_perf.sh -H -c 1 -N 25 -d 1.pickle # takes about 15-20 minutes

# then in python: summarize the 25 HEAD runs
df=pd.load('./1.pickle')
df=df[df.icol(0)>0.1] # vbs over 100us
_p=(df.max(1)-df.min(1))/df.mean(1)*100
_std=df.std(1)
_std_o_mean=df.std(1)/df.mean(1)
_min=df.min(1)
_max=df.max(1)
df=DataFrame(None,index=_std.index)
df['max-min / mean %']=_p
df['std/mean']=_std_o_mean
df['std']=_std
df['min']=_min
df['max']=_max
# or whatever you prefer
df=df.sort('max-min / mean %') 
df=df[df.icol(0)>10]

@stephenwlin
Contributor Author

I found out in my experimental SIMD proxy testing that testing very large arrays can be misleading, btw, so maybe we shouldn't be doing this (or should be supplementing the tests with other array sizes)... will explain in a post soon.

see new comment on #3146

@stephenwlin
Contributor Author

Everything looks good to me, I think either the affinity, gc thing, or both did it :)

Here are the "worst" offenders:

                                             std/mean %  max-min / mean %   std     min     max
read_csv_thou_vb                                   1.26              5.38  0.37   28.80   30.38
write_store_table                                  1.27              5.58  0.72   55.27   58.43
write_store_table_wide                             1.41              5.44  1.32   92.38   97.46
frame_drop_duplicates_na                           1.44              5.43  0.20   13.94   14.71
frame_sort_index_by_columns                        1.74              5.36  0.58   32.83   34.62
frame_fillna_many_columns_pad                      1.88              8.48  0.25   12.92   14.03
timeseries_add_irregular                           1.89              9.85  0.33   17.08   18.78
indexing_dataframe_boolean_no_ne                   2.06              7.27  1.36   63.43   68.21
merge_2intkey_sort                                 2.40              9.66  0.80   33.01   36.25
frame_reindex_upcast                               2.69              8.39  0.31   11.02   11.98
frame_fancy_lookup_all                             3.08             16.10  0.40   12.64   14.70
write_store_table_mixed                            4.21             12.85  5.87  135.24  153.17
stats_rank_average                                 4.21             13.63  1.04   22.98   26.34
dti_reset_index_tz                                 4.91             14.11  0.56   10.39   11.99
frame_ctor_list_of_dict                            5.61             17.28  4.00   66.53   78.86

which isn't much...maybe I got lucky though, will try again with more runs

@ghost

ghost commented Apr 5, 2013

The burn-in helped some too....

I did another 25 runs on an old laptop; a few stragglers, but all in all, pretty solid.
At the very least, the median of 5 would definitely produce very clean results.

If you or Jeff come back with another positive report, I think we can close this.

@stephenwlin
Contributor Author

burn-in? quae?

anyway, if it is resolved, we should systematically try to figure out what actually made the difference: if gc is really having an effect then we need to fix cycles or hopefully get someone else to.

i looked at the linked issue about malloc_trim() and I don't actually think that matters much, but i could be wrong: i think it just marks memory as freed when it's basically already free anyway, just not released back to the system. calling malloc_trim() should actually slow things down, then, because it'll require more new requests...although possibly that explains the variability? (do requests for new memory pages really cost that much?)

@ghost

ghost commented Apr 5, 2013

It was partly cargo cult, I admit; I just wanted to make sure any kind of
special casing in the python memory pool management (fragmentation?)
wasn't affecting the results, as that's a kind of global state.
Not at all sure it makes any difference. I also thought about running each test
in a new process using multiprocessing for the same reason, but since it seems
to have worked, I guess that's not necessary.

burn-in: do foo + N runs, and throw away the foo results. I thought I was seeing
the code hitting a cold cache, since a disproportionate chunk of the outliers
were always in the first column of results.
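
(A toy illustration of the idea; the helper and its names are made up:)

    import timeit

    def timings_with_burn_in(timer, repeats, burn_in=1, ncalls=1):
        # Time burn_in + repeats iterations of a timeit.Timer and discard the
        # first burn_in results, so cold-cache / first-run effects don't skew
        # the statistics.
        runs = [timer.timeit(number=ncalls) for _ in range(burn_in + repeats)]
        return runs[burn_in:]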

As to finding out the exact cause in excruciating detail, you're the man for the job.
...I broke it in for ya.

@stephenwlin
Contributor Author

ok, i'll look into it, it'll be a learning experience =D

@stephenwlin
Contributor Author

Second run (100 iterations) looks fine as well

                                             std/mean %  max-min / mean %   std     min     max
frame_fancy_lookup_all                             2.05             13.31  0.27   12.98   14.74
merge_2intkey_nosort                               2.13             20.61  0.30   13.96   16.87
frame_reindex_upcast                               2.35              9.99  0.27   10.85   11.98
series_align_left_monotonic                        2.42             13.79  0.27   11.18   12.75
reshape_pivot_time_series                          2.65             12.78  3.86  143.19  161.79
frame_multi_and_no_ne                              3.38             14.59  2.43   70.80   81.30
lib_fast_zip_fillna                                3.42             18.44  0.35    9.93   11.79
indexing_dataframe_boolean_no_ne                   3.67             28.89  2.48   65.38   84.92
write_store_table                                  3.73             14.10  2.05   53.72   61.48
frame_sort_index_by_columns                        3.96             17.48  1.34   32.95   38.87
stats_rank_average                                 3.97             19.89  0.98   22.67   27.55
groupbym_frame_apply                               4.21             16.92  1.61   37.12   43.58
frame_insert_500_columns                           4.21             14.27  2.83   65.66   75.25

@ghost

ghost commented Apr 5, 2013

Hooray.
Let's keep this open until you do the heavy forensics.
The GC thing might not be a cyclic reference problem at all, by triggering
a collect() explicitly, we reduce the probability of a reclaim cycle in the middle
of our benchmark. Again, that's a complicated piece of machinery, and I don't
know if my mental model fits the real deal. I bet you will, though, in a couple of days :)

@jreback
Contributor

jreback commented Apr 5, 2013

for completeness, my run (with your above call),

I got some errors like this:

indexing_dataframe_boolean_no_ne died with:
No module named expressions
Skipping...
In [29]: df
Out[29]: 
                                       max-min / mean %  std/mean        std         min         max
write_csv_standard                            13.311859  0.024586   6.255969  239.377975  273.250103
stats_rank_average                            13.444024  0.037948   0.993256   23.990154   27.508974
frame_reindex_upcast                          13.743141  0.028133   0.490885   16.968012   19.366026
stat_ops_series_std                           13.809028  0.037830   0.007968    0.202894    0.231981
timeseries_sort_index                         18.774980  0.036265   0.679534   18.425941   21.944046
groupby_first_float32                         20.197005  0.039377  13.013614  326.046944  392.794847
join_dataframe_index_single_key_small         22.947052  0.058263   0.367445    5.759954    7.207155
frame_to_csv                                  27.141578  0.062693  16.363624  223.124981  293.968201
frame_sort_index_by_columns                   33.596222  0.110327   3.726274   32.346010   43.693066
index_int64_intersection                      38.007098  0.140326   3.028997   20.231009   28.434992
groupby_multi_python                          50.885136  0.138745   5.790584   39.759874   60.997009
datetimeindex_add_offset                      61.074919  0.121092   0.024817    0.198841    0.324011

@ghost

ghost commented Apr 5, 2013

I got that running in a venv with no numexpr.
Correction: a venv with a stale version of pandas, from before the core.expressions module was introduced.

Pushed an update; gc is now disabled during vbenches. vbench actually
has code to do this, hidden away from the main entry point. Looks like Wes
covered this territory already at one point. How common that is.

@stephenwlin
Contributor Author

the GC thing might not be a cyclic reference problem at all, by triggering
a collect() explicitly, we reduce the probability of a reclaim cycle in the middle
of our benchmark.

oh, right, gc will have to run anyway to verify the lack of cycles. although without cycles, memory usage should stay bounded and hopefully not trigger whatever heuristics CPython uses to determine when to run gc, which ought to be fairly conservative since gc is really just a backup memory-management method anyway (but I suppose it might not be, or we might be using up enough memory, even in a bounded fashion, to trigger them)

@ghost

ghost commented Apr 5, 2013

Some of the vbs are memory hungry (and were even more so until a recent commit),
and they were being run back to back with several iterations. All that memory
activity is bound to create some GC activity, if not some mayhem (normally not a problem)
in the internal memory pools.

Anyway, as I added above, gc is now disabled during the benchmark; we just need to make
sure tests don't stress the memory system in terms of allocation capacity.

Btw, the gc theory would also explain why some vbs are more prone than others:
first of all there's the repeated order of tests/memory usage pattern, and secondly,
if some vbs create many small objects, that probably provoked a gc cycle somewhere
during the tests.
I wonder how much telemetry the gc module provides to test that theory out.
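
(It provides a little; assuming the stdlib gc module on CPython, something like this would be a start:)

    import gc

    # Per-generation allocation counts and the thresholds that trigger a
    # collection when exceeded:
    print(gc.get_count())      # e.g. (412, 3, 1)
    print(gc.get_threshold())  # default (700, 10, 10)

    # Print a line to stderr every time a collection actually runs, which
    # would show whether collections fire in the middle of a vbench:
    gc.set_debug(gc.DEBUG_STATS)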

@stephenwlin
Contributor Author

back to back with several iterations

but in theory all of that memory goes back immediately after each iteration? i guess it depends on the vagaries of python's allocator/deallocator (particularly its free-store management) whether that really keeps the memory usage bounded across iterations, though.

@jreback
Contributor

jreback commented Apr 5, 2013

I don't think the memory is released back to the OS after an iteration. Instead the python process keeps it and tries to allocate from the new pool. This causes fragmentation over time. In theory it would be better to run each iteration from a top-level script (that is not python) to have a constant allocation regime...

@stephenwlin
Contributor Author

right, it might not go back to the OS (that's why malloc_trim() has the effect it does), but if the same tests are being run over and over, there are no cycles, and the python allocator/deallocator is written robustly (i.e. it correctly finds all free blocks that are big enough within the allocated heap space), then in theory the same memory ought to be allocated each time and the amount of memory consumed should stay bounded? that's why i'm curious whether there are any cycles...

that's an argument for the burn-in to make a difference, though: in theory only the first cycle will request new pages from the OS and every subsequent iteration will just reuse the same memory, if all the stars align just right...

i actually think instead of malloc_trim() we might want to do the opposite, and just allocate a whole bunch of memory to the process ahead of time...then we won't be profiling system calls as much.

@ghost

ghost commented Apr 5, 2013

The malloc_trim issue was all about a libc optimization behaving unexpectedly, IIUC.
Surely a libc function could do nothing directly about a chunk of memory if python still
held on to it.

My guess would be that memory pressure/pool fragmentation is not the issue here, merely
the time taken up by the collection process itself to scan through objects in memory.

As Jeff said, and as I mentioned earlier as an idea, running each test in a new process would
probably give the most isolation, as every test would run with a blank slate.
I consider the problem solved at this point, but that's certainly an interesting experiment to
go through (Edit: that could be multiprocessing or just multiple invocations of test_perf in single mode).
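
(Something along these lines is what I have in mind; run_single_vbench.py is a hypothetical helper script that would run one named benchmark and print its timing in ms:)

    import subprocess
    import sys

    def run_in_fresh_process(bench_name, ncalls=1):
        # Spawn a brand-new interpreter per benchmark so each one starts from
        # a blank allocator / GC / import state.
        out = subprocess.check_output(
            [sys.executable, "run_single_vbench.py", bench_name, str(ncalls)]
        )
        return float(out)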

Here's a blog post that suggests actively managing GC can yield perf wins in certain situations;
that's obviously worth looking into for lib optimizations. We should power up on our tooling
and see if we can locate hotspots to squeeze some cache and GC wins from.

@stephenwlin
Contributor Author

hmm, anyway, it's not a big deal, but i think we might be talking about different things at this point. i just was under the impression that in the absence of cycles all the memory allocated by a test should be released by the end of it, in theory, so each iteration past the first one should have stable memory usage if it's running the same tests each time.

even with stable memory usage, though, it's possible that gc will run because of whatever heuristics CPython uses to determine it's time to start a trace, so it's definitely better to isolate that as a variable anyway. but if memory usage is not stable and it's because of cycles, we ought to eliminate them, if possible, that's all.

@ghost

ghost commented Apr 5, 2013

I agree with the first part and test_perf now calls gc.disable() around benchmarks, so
that bit is already implemented (even if not "isolated").

Are we still talking about vbench variability, or thinking out loud about #2659?

@stephenwlin
Contributor Author

Well, I don't think #2659 is an issue really -- just because the OS reports higher memory usage doesn't mean it has any practical effect on performance unless things start swapping, in which case glibc is probably smart enough to give back memory on demand. I just think cycles ought to be eliminated if they're there, independently of how much they're contributing to other issues.

@ghost

ghost commented Apr 5, 2013

sounds good.

@ghost

ghost commented Apr 5, 2013

Can we close this then?

@wesm
Member

wesm commented Apr 6, 2013

Nice job on all this forensics. I'd say it's safe to close if no objections.

I think I got rid of all the circular references in pandas-- the main issue is the wily "engine" objects inside the Index classes which utilize weakrefs. One of these days when we migrate all the index object code to Cython/C we can clean up that mess.

@wesm
Member

wesm commented Apr 7, 2013

Closing

@wesm wesm closed this as completed Apr 7, 2013
@stephenwlin
Contributor Author

hmm...well turning off affinity doesn't seem to have any effect, actually, and neither does doing the gc.collect(), as it shouldn't because apparently timeit() already turns off gc:

    def timeit(self, number=default_number):
        """Time 'number' executions of the main statement.

        To be precise, this executes the setup statement once, and
        then returns the time it takes to execute the main statement
        a number of times, as a float measured in seconds.  The
        argument is the number of times through the loop, defaulting
        to one million.  The main statement, the setup statement and
        the timer function to be used are passed to the constructor.
        """
        if itertools:
            it = itertools.repeat(None, number)
        else:
            it = [None] * number
        gcold = gc.isenabled()
        gc.disable()
        try:
            timing = self.inner(it, self.timer)
        finally:
            if gcold:
                gc.enable()
        return timing

When running multiple runs against head, you're doing this all within one interpreter session, right? Not 100% sure yet, but I'm guessing that's what's helping the most, especially with dropping the burn-in runs. But we can't do that when running two different commits, no?

I'm guessing this means we might have to run multiple runs on each commit, with burn-in, and compare the entire sets, to get the best results... @y-p, can we do this already conveniently with your recent changes, or is there a bit more scripting required for that?
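
(Sketch of what "compare the entire sets" could look like once we have a per-run frame for each commit; the Mann-Whitney test is just a suggestion, and it assumes scipy is available:)

    import numpy as np
    from scipy import stats

    def compare_vbench(base_runs, target_runs):
        # base_runs / target_runs: 1-D arrays of per-run timings for one
        # vbench, with the burn-in runs already dropped.
        ratio = np.median(target_runs) / np.median(base_runs)
        # Compare the two distributions rather than two single numbers:
        _, p = stats.mannwhitneyu(base_runs, target_runs)
        return ratio, p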

@ghost

ghost commented Apr 9, 2013

So basically, I did nothing except the libc trim, which I didn't think had any effect either,
yet the results were convincing. I've actually been running benches continually,
and I agree that strange behaviour is still present.

The burn-in, too, might have been a red herring; I'm not sure, as I didn't
do things rigorously (hypothesis testing, anyone?) because of the amount of time it takes to iterate.

You can control the burn-in with -b (default 1); -N will run head multiple times and give you
a frame of results. You can also invoke the script multiple times with -H (-N 1), if you
think the python interpreter at init has an effect.

I've been trying various combinations of -c -n -N -b, and I can't get something consistently better.
I also tried dropping down to single-user mode on linux to shut down services, and ran one hour
vs the regular runlevel (without a user active); not much difference.

@ghost

ghost commented Apr 9, 2013

fwiw, top is normal operation, bottom is single-user. Certainly quieter (for low-to-medium-energy noise),
but the spikes are still there.
Note these are not an equal number of datapoints, so the viz may be misleading.

[plot: benchmark timing noise, normal operation (top) vs. single-user mode (bottom)]
800 points in both.

@stephenwlin
Contributor Author

yeah, i'm not totally sure what it is either, but presumably if we use whatever methodology you're using for the multiple-runs-against-head tests in the normal comparison runs as well, then we should be ok

@ghost

ghost commented Apr 9, 2013

So, Panel didn't respect OrderedDict up to about 5 minutes ago, and now
I don't trust my results. Either way. Again. Sigh.
