Support multilevel groupby #69

jtleider · 2018-08-05T17:33:42Z

Hi,

This code closes #18, adding support for multilevel groupby. It also fixes a bug where, in some cases, descriptives for categorical and continuous variables were shown in separate columns when the groupby variable had dtype category.

Best,
Julien

tompollard · 2018-08-06T17:37:48Z

Excellent, thanks again Julien. This is something that I've been putting off for a while! @jraffa, if possible, please could you take a look at this change from a user perspective?

Two things in particular that we need to think about are (1) if/how p-values should be reported for multilevel grouping and (2) how n (%) should be reported for categorical variables.

jraffa · 2018-08-10T17:54:42Z

Couple of comments:

Percentages: It seems that within a (row) variable the column percentages add up to 100%. This is fine, but may not be the desired result. I wonder whether an option to compute percentages by row, or by row within the first tier of the column variable, is a good idea or complicates things too much. I usually think about what the denominator is. When setting groupby = ['death','MechVent']:

a. Columnwise: denominator for first column for ICU variable is 110+50+205+103=468 (as in the table header).
b. Rowwise: For CCU: 110+27+11+14=162
c. Rowwise within death=0: 110+27 = 137

Columnwise is probably a good default, but the choice of denominator should be explained somewhere in the docs.
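To make the three denominators concrete, here is a minimal sketch in plain Python using only the counts quoted above. The cell-to-group mapping is an assumption on my part: I take the first column to be death=0, MechVent=0, and I guess the order of the two death=1 cells. This is not tableone code, just the arithmetic.

```python
# CCU counts keyed by (death, MechVent). The counts come from this comment;
# which MechVent level each count belongs to is an assumption.
ccu_row = {(0, 0): 110, (0, 1): 27, (1, 0): 11, (1, 1): 14}

# Counts for every ICU level in the first column (assumed death=0, MechVent=0).
first_column = [110, 50, 205, 103]

cell = ccu_row[(0, 0)]  # the CCU count in the first column


def pct(count, denominator):
    """Percentage of `count` out of `denominator`, rounded to one decimal."""
    return round(100 * count / denominator, 1)


# a. Columnwise: denominator is the column total (468).
columnwise = pct(cell, sum(first_column))

# b. Rowwise: denominator is the CCU total across all four columns (162).
rowwise = pct(cell, sum(ccu_row.values()))

# c. Rowwise within death=0: denominator is the CCU total in that tier (137).
within_tier = pct(
    cell, sum(n for (death, _), n in ccu_row.items() if death == 0)
)

print(columnwise, rowwise, within_tier)  # 23.5 67.9 80.3
```

The same cell reads as 23.5%, 67.9%, or 80.3% depending on the denominator, which is why the choice should be documented.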

Hypothesis testing: The present way of doing the testing seems to take the column levels (n and m levels) and make n*m groups. So setting groupby = ['death','MechVent'] results in a comparison via, e.g., one-way ANOVA with 4 levels (0.0,0.1,1.0,1.1). This seems to be an OK behaviour. In theory two-way or multi-way ANOVA is possible, but it results in two or more p-values (with no interaction). Instead of multi-way ANOVA, I think it's more likely that someone would want to compare the values within a level of the first tier of a column. E.g., among those who died, compare the mean SysABP: 122.51 (35.68) vs. 110.24 (39.40) for those with vent and no vent. This would result in separate p-values for death=0 and death=1. So I would have these two potential methods:

a. If factor one has n levels and factor two has m levels: by default, treat the crosses of the n and m levels as groups and run a test with n*m - 1 degrees of freedom (as currently done).
b. Alternatively, stratify into n groups and test within each group across the m levels of factor two.

I think type b. is probably more intuitive to someone who hasn't read the docs, but I could see the argument for the other side as well.
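A rough sketch of how methods a. and b. differ, using a hand-rolled one-way ANOVA F statistic in plain Python. The data values are made up for illustration, and `one_way_anova_f` is a hypothetical helper, not part of tableone. In practice something like `scipy.stats.f_oneway` would also give the p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up SysABP-like values keyed by (death, MechVent); n = m = 2 levels.
data = {
    (0, 0): [120, 125, 130],
    (0, 1): [110, 115, 120],
    (1, 0): [100, 105, 110],
    (1, 1): [90, 95, 100],
}

# Method a: treat the n*m crossed cells as groups, one test, n*m - 1 = 3 df.
crossed = one_way_anova_f(list(data.values()))

# Method b: stratify by death, then test across MechVent levels within each
# stratum, giving one result (m - 1 = 1 df) per death level.
stratified = {
    death: one_way_anova_f([v for (d, _), v in data.items() if d == death])
    for death in (0, 1)
}

print(crossed[1:])                              # (3, 8)
print({d: r[1:] for d, r in stratified.items()})  # {0: (1, 4), 1: (1, 4)}
```

Method a. yields a single test on 3 degrees of freedom, while method b. yields one test per level of the first factor, so the table would need one p-value column per stratum.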

Let me know if I have confused you.

jtleider added 4 commits August 4, 2018 16:04

Support multilevel groupby

e92cd13

Add test that groupby with categorical variable runs without error

9b71fb5

Fix issue where descriptives were misaligned with categorical groupby…

c4ef642

… variable

Fix bug that was misaligning table with multiple groupby levels

6de33b5

tompollard requested a review from jraffa August 6, 2018 17:37

tompollard force-pushed the master branch 3 times, most recently from 9ff9874 to 1db34db November 16, 2019 02:51

tompollard force-pushed the master branch 3 times, most recently from 6f62620 to fabcd16 May 6, 2020 17:49
