Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding interaction terms to the design matrix #181

Open
wants to merge 111 commits into
base: main
Choose a base branch
from

Conversation

khalilouardini
Copy link

@khalilouardini khalilouardini commented Oct 17, 2023

This PR has several purposes:

  • it leverages formulaic to allow using interaction terms of the form a:b:...:zfor design_factors and formulas based on a combination of single design factors and of such interaction terms a + b + a:b:...:z (no support for other more complex structures or alternative syntax enabled by formulaic's grammar yet such as i.e. ~ C(X, contr.treatment("x")), a * b, contr.poly, ...)
  • allows the user to pass a design_matrixto a dds dataset
  • fixes pyproject.toml syntax (linters complained)

Completion milestones:

  • adds @anaischossegros test on edge-case from deseq2 vignette
  • test to check that passing a design_matrix populates all necessary attributes such as design_factors (kind of as end2end integration tests are passing)
  • continuous factors well-handled
  • deseq2 examples matched (at least at the level of the design matrix)
  • arbitrary number of interactions well handled
  • deseq2 examples matched end2end

pydeseq2/dds.py Outdated Show resolved Hide resolved
pydeseq2/dds.py Outdated Show resolved Hide resolved
pydeseq2/dds.py Outdated Show resolved Hide resolved
@jeandut jeandut marked this pull request as ready for review October 19, 2023 17:13
@jeandut
Copy link
Contributor

jeandut commented Oct 19, 2023

Youpi tests pass ! @khalilouardini and @BorisMuzellec you are on !

@BorisMuzellec
Copy link
Collaborator

BorisMuzellec commented Oct 20, 2023

Hi @jeandut and @khalilouardini, thanks for this PR!

It's nice that you went all the way to even support recursive interactions like "a: b:c".

I experimented a bit with your code, and there seem to be a few remaining issues though.
Running

from pydeseq2.utils import build_design_matrix, load_example_data

counts = load_example_data()
metadata = load_example_data(modality = "metadata")

design = build_design_matrix(metadata=clinical, 
                   design_factors=["condition", "condition:group"])

I get the following design:

          intercept  condition_B_vs_A  condition:group_AY_vs_A_vs_AX  \
sample1            1                 0                              0   
sample2            1                 0                              1   
sample3            1                 0                              0   
sample4            1                 0                              1   
sample5            1                 0                              0   
...              ...               ...                            ...   
sample96           1                 1                              0   
sample97           1                 1                              0   
sample98           1                 1                              0   
sample99           1                 1                              0   
sample100          1                 1                              0   

           condition:group_BX_vs_A_vs_AX  condition:group_BY_vs_A_vs_AX  
sample1                                0                              0  
sample2                                0                              0  
sample3                                0                              0  
sample4                                0                              0  
sample5                                0                              0  
...                                  ...                            ...  
sample96                               0                              1  
sample97                               1                              0  
sample98                               0                              1  
sample99                               1                              0  
sample100                              0                              1  

[100 rows x 5 columns]

The columns seem to contain the intended values, but:

  1. There's an issue with variable names. We would expect something like condition:group_AY_vs_AX instead of condition:group_AY_vs_A_vs_AX.

  2. The design is not full rank, e.g. ((design.iloc[:,-1] + design.iloc[:,-2] - design.iloc[:,1])**2).sum() returns 0. This means that one of the last two columns is redundant. Not sure how to determine this automatically though, perhaps looking at the way DESeq2 handles this could help.

Let me know if you need help with this :)

EDIT: I think that in the example above issue 2 is due to the fact the design is of the form "~a + a:b", which creates a redundancy that probably wouldn't be there if it was just "~a:b".

Copy link
Collaborator

@BorisMuzellec BorisMuzellec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above

@jeandut
Copy link
Contributor

jeandut commented Oct 20, 2023

@BorisMuzellec see the modifications I made. LGTM but it's hard to be sure wo test data available.
In order for us to converge faster and minimize iterations of reviews if the PR is still not doing exactly what you want I would appreciate if you could give me couples of:
(design_factors, metadata, continuous_factors) -> expected_design_matrix
For a sufficiently representative set of design_factors and continuous_factors.

@jeandut
Copy link
Contributor

jeandut commented Oct 20, 2023

As a start is this matching DeSeq2 ? and if not what should be changed ?
Capture d’écran 2023-10-20 à 18 07 11

@jeandut
Copy link
Contributor

jeandut commented Oct 20, 2023

And introducing continuous factors, same question:
Capture d’écran 2023-10-20 à 18 15 01

@BorisMuzellec
Copy link
Collaborator

BorisMuzellec commented Oct 23, 2023

Thanks @jeandut for the code updates. The variable names look fine now :).

Matrix rank

We still have the rank issue with designs of the form `"~ factor1 + factor1:factor2" though. E.g., in the same example as above,

from pydeseq2.utils import build_design_matrix
counts = load_example_data()
metadata = load_example_data(modality = "metadata")
design = build_design_matrix(metadata=metadata, 
                   design_factors=["condition", "condition:group"])

we have the following design:

    intercept  condition_B_vs_A  condition:group_AY_vs_AX  \
sample1            1                 0                         0   
sample2            1                 0                         1   
sample3            1                 0                         0   
sample4            1                 0                         1   
sample5            1                 0                         0   
...              ...               ...                       ...   
sample96           1                 1                         0   
sample97           1                 1                         0   
sample98           1                 1                         0   
sample99           1                 1                         0   
sample100          1                 1                         0   

           condition:group_BX_vs_AX  condition:group_BY_vs_AX  
sample1                           0                         0  
sample2                           0                         0  
sample3                           0                         0  
sample4                           0                         0  
sample5                           0                         0  
...                             ...                       ...  
sample96                          0                         1  
sample97                          1                         0  
sample98                          0                         1  
sample99                          1                         0  
sample100                         0                         1  

which does not have full column rank, because condition_B_vs_A = condition:group_BX_vs_AX condition:group_BY_vs_AX (since group has only two values X and Y, knowing BX and BY is enough to know B).

In comparison, the design matrix output by DESeq2 only has the following columns ["intercept", "condition_B_vs_A", "condition:group_AY_vs_AX", "condition:group_BY_vs_AX"].

I think that when there are interaction terms in the design, we need to check whether those variables are also present on their own, and if so remove an additional column.
I added a test to check that the design has full rank in this case, which is why the CI now fails.

In-place modification

On a side note, the present code modifies the metadata that is being passed (it adds colums). E.G:

from pydeseq2.utils import build_design_matrix
counts = load_example_data()
metadata = load_example_data(modality = "metadata")
print(metadata)

 condition group
sample1           A     X
sample2           A     Y
sample3           A     X
sample4           A     Y
sample5           A     X
...             ...   ...
sample96          B     Y
sample97          B     X
sample98          B     Y
sample99          B     X
sample100         B     Y

_ = build_design_matrix(metadata=metadata, 
                   design_factors=["condition", "condition:group"])
print(metadata)

      condition group condition:group
sample1           A     X              AX
sample2           A     Y              AY
sample3           A     X              AX
sample4           A     Y              AY
sample5           A     X              AX
...             ...   ...             ...
sample96          B     Y              BY
sample97          B     X              BX
sample98          B     Y              BY
sample99          B     X              BX
sample100         B     Y              BY

It would be better to avoid this, e.g. by adding an inplace argument to the interaction term utilities, or deleting added columns after the code is done running.

I'm not a huge fan of adding to many dependencies, but I'm starting to wonder if we could save us some pain by relying on formulaic, as suggested in #125...

pydeseq2/dds.py Outdated Show resolved Hide resolved
@thondeboer
Copy link

Has this attempt to introduce interaction terms been abandoned? I was hoping to not have to resort to R to get interaction designs to work, since that is a very important part of DESeq2 in R and was quite surprised this was not part of the original pyDESeq2...Is this specifically hard to implement for some reason in Python, just curious...

@jeandut
Copy link
Contributor

jeandut commented Jan 5, 2024

Has this attempt to introduce interaction terms been abandoned? I was hoping to not have to resort to R to get interaction designs to work, since that is a very important part of DESeq2 in R and was quite surprised this was not part of the original pyDESeq2...Is this specifically hard to implement for some reason in Python, just curious...

  1. This attempt has not been abandoned. Currently it has been because of lack of bandwidth that I could not make more progress, I don't want to make hard commitments but I hope this gets done in Q1.
  2. However indeed this turned out to be more complicated than I expected mainly because of the versatility of the formula and its interaction with the rank-reduction step. For this reason I will rely on formulaic
  3. In the meantime all those manipulations can be done in Python on a case by case basis outside of pydeseq2

@jeandut jeandut marked this pull request as ready for review April 10, 2024 18:22
@jeandut jeandut marked this pull request as draft April 11, 2024 19:58
@jeandut jeandut marked this pull request as ready for review April 19, 2024 12:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants