Adding interaction terms to the design matrix #181

khalilouardini · 2023-10-17T12:51:35Z

This PR has several purposes:

it leverages formulaic to allow using interaction terms of the form a:b:...:zfor design_factors and formulas based on a combination of single design factors and of such interaction terms a + b + a:b:...:z (no support for other more complex structures or alternative syntax enabled by formulaic's grammar yet such as i.e. ~ C(X, contr.treatment("x")), a * b, contr.poly, ...)
allows the user to pass a design_matrixto a dds dataset
fixes pyproject.toml syntax (linters complained)

Completion milestones:

adds @anaischossegros test on edge-case from deseq2 vignette
test to check that passing a design_matrix populates all necessary attributes such as design_factors (kind of as end2end integration tests are passing)
continuous factors well-handled
deseq2 examples matched (at least at the level of the design matrix)
arbitrary number of interactions well handled
deseq2 examples matched end2end

pydeseq2/dds.py

pydeseq2/interaction_utils.py

jeandut · 2023-10-19T17:13:57Z

Youpi tests pass ! @khalilouardini and @BorisMuzellec you are on !

BorisMuzellec · 2023-10-20T08:09:02Z

Hi @jeandut and @khalilouardini, thanks for this PR!

It's nice that you went all the way to even support recursive interactions like "a: b:c".

I experimented a bit with your code, and there seem to be a few remaining issues though.
Running

from pydeseq2.utils import build_design_matrix, load_example_data

counts = load_example_data()
metadata = load_example_data(modality = "metadata")

design = build_design_matrix(metadata=clinical, 
                   design_factors=["condition", "condition:group"])

I get the following design:

          intercept  condition_B_vs_A  condition:group_AY_vs_A_vs_AX  \
sample1            1                 0                              0   
sample2            1                 0                              1   
sample3            1                 0                              0   
sample4            1                 0                              1   
sample5            1                 0                              0   
...              ...               ...                            ...   
sample96           1                 1                              0   
sample97           1                 1                              0   
sample98           1                 1                              0   
sample99           1                 1                              0   
sample100          1                 1                              0   

           condition:group_BX_vs_A_vs_AX  condition:group_BY_vs_A_vs_AX  
sample1                                0                              0  
sample2                                0                              0  
sample3                                0                              0  
sample4                                0                              0  
sample5                                0                              0  
...                                  ...                            ...  
sample96                               0                              1  
sample97                               1                              0  
sample98                               0                              1  
sample99                               1                              0  
sample100                              0                              1  

[100 rows x 5 columns]

The columns seem to contain the intended values, but:

There's an issue with variable names. We would expect something like condition:group_AY_vs_AX instead of condition:group_AY_vs_A_vs_AX.
The design is not full rank, e.g. ((design.iloc[:,-1] + design.iloc[:,-2] - design.iloc[:,1])**2).sum() returns 0. This means that one of the last two columns is redundant. Not sure how to determine this automatically though, perhaps looking at the way DESeq2 handles this could help.

Let me know if you need help with this :)

EDIT: I think that in the example above issue 2 is due to the fact the design is of the form "~a + a:b", which creates a redundancy that probably wouldn't be there if it was just "~a:b".

BorisMuzellec

See my comment above

jeandut · 2023-10-20T14:56:04Z

@BorisMuzellec see the modifications I made. LGTM but it's hard to be sure wo test data available.
In order for us to converge faster and minimize iterations of reviews if the PR is still not doing exactly what you want I would appreciate if you could give me couples of:
(design_factors, metadata, continuous_factors) -> expected_design_matrix
For a sufficiently representative set of design_factors and continuous_factors.

jeandut · 2023-10-20T16:07:52Z

As a start is this matching DeSeq2 ? and if not what should be changed ?

jeandut · 2023-10-20T16:15:20Z

And introducing continuous factors, same question:

BorisMuzellec · 2023-10-23T12:40:39Z

Thanks @jeandut for the code updates. The variable names look fine now :).

Matrix rank

We still have the rank issue with designs of the form `"~ factor1 + factor1:factor2" though. E.g., in the same example as above,

from pydeseq2.utils import build_design_matrix
counts = load_example_data()
metadata = load_example_data(modality = "metadata")
design = build_design_matrix(metadata=metadata, 
                   design_factors=["condition", "condition:group"])

we have the following design:

    intercept  condition_B_vs_A  condition:group_AY_vs_AX  \
sample1            1                 0                         0   
sample2            1                 0                         1   
sample3            1                 0                         0   
sample4            1                 0                         1   
sample5            1                 0                         0   
...              ...               ...                       ...   
sample96           1                 1                         0   
sample97           1                 1                         0   
sample98           1                 1                         0   
sample99           1                 1                         0   
sample100          1                 1                         0   

           condition:group_BX_vs_AX  condition:group_BY_vs_AX  
sample1                           0                         0  
sample2                           0                         0  
sample3                           0                         0  
sample4                           0                         0  
sample5                           0                         0  
...                             ...                       ...  
sample96                          0                         1  
sample97                          1                         0  
sample98                          0                         1  
sample99                          1                         0  
sample100                         0                         1

which does not have full column rank, because condition_B_vs_A = condition:group_BX_vs_AX condition:group_BY_vs_AX (since group has only two values X and Y, knowing BX and BY is enough to know B).

In comparison, the design matrix output by DESeq2 only has the following columns ["intercept", "condition_B_vs_A", "condition:group_AY_vs_AX", "condition:group_BY_vs_AX"].

I think that when there are interaction terms in the design, we need to check whether those variables are also present on their own, and if so remove an additional column.
I added a test to check that the design has full rank in this case, which is why the CI now fails.

In-place modification

On a side note, the present code modifies the metadata that is being passed (it adds colums). E.G:

from pydeseq2.utils import build_design_matrix
counts = load_example_data()
metadata = load_example_data(modality = "metadata")
print(metadata)

 condition group
sample1           A     X
sample2           A     Y
sample3           A     X
sample4           A     Y
sample5           A     X
...             ...   ...
sample96          B     Y
sample97          B     X
sample98          B     Y
sample99          B     X
sample100         B     Y

_ = build_design_matrix(metadata=metadata, 
                   design_factors=["condition", "condition:group"])
print(metadata)

      condition group condition:group
sample1           A     X              AX
sample2           A     Y              AY
sample3           A     X              AX
sample4           A     Y              AY
sample5           A     X              AX
...             ...   ...             ...
sample96          B     Y              BY
sample97          B     X              BX
sample98          B     Y              BY
sample99          B     X              BX
sample100         B     Y              BY

It would be better to avoid this, e.g. by adding an inplace argument to the interaction term utilities, or deleting added columns after the code is done running.

I'm not a huge fan of adding to many dependencies, but I'm starting to wonder if we could save us some pain by relying on formulaic, as suggested in #125...

pydeseq2/dds.py

pydeseq2/interaction_utils.py

thondeboer · 2024-01-05T22:15:21Z

Has this attempt to introduce interaction terms been abandoned? I was hoping to not have to resort to R to get interaction designs to work, since that is a very important part of DESeq2 in R and was quite surprised this was not part of the original pyDESeq2...Is this specifically hard to implement for some reason in Python, just curious...

jeandut · 2024-01-05T22:35:59Z

Has this attempt to introduce interaction terms been abandoned? I was hoping to not have to resort to R to get interaction designs to work, since that is a very important part of DESeq2 in R and was quite surprised this was not part of the original pyDESeq2...Is this specifically hard to implement for some reason in Python, just curious...

This attempt has not been abandoned. Currently it has been because of lack of bandwidth that I could not make more progress, I don't want to make hard commitments but I hope this gets done in Q1.
However indeed this turned out to be more complicated than I expected mainly because of the versatility of the formula and its interaction with the rank-reduction step. For this reason I will rely on formulaic
In the meantime all those manipulations can be done in Python on a case by case basis outside of pydeseq2

examples/plot_pandas_io_example.py

tests/test_build_design_matrix.py

khalilouardini requested review from BorisMuzellec, maikia, arthurPignetOwkin and mandreux-owkin as code owners October 17, 2023 12:51

khalilouardini marked this pull request as draft October 17, 2023 12:51

This was referenced Oct 17, 2023

Refactor utils.py #182

Open

Tutorial including interactions terms in the design #183

Open

[Enhancement] expanded argument not used in utils.build_design_matrix #184

Open