Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Overflow in .deseq2() and and lfc_shrink() #169

Open
BeyondTheProof opened this issue Aug 24, 2023 · 3 comments
Open

[BUG] Overflow in .deseq2() and and lfc_shrink() #169

BeyondTheProof opened this issue Aug 24, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@BeyondTheProof
Copy link

BeyondTheProof commented Aug 24, 2023

To Reproduce

dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design_factors=['condition', "RNA_Qubit_actual", 'treatment'],
    continuous_factors=["RNA_Qubit_actual"],
    refit_cooks=True,
    n_cpus=12,
)

dds.deseq2()

Output:

... done in 0.09 seconds.

/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:553: RuntimeWarning: overflow encountered in exp
  mu_ = np.maximum(size_factors * np.exp(X @ beta), min_mu)
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:404: RuntimeWarning: invalid value encountered in subtract
  -logbinom
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:557: RuntimeWarning: overflow encountered in exp
  mu_ = np.maximum(size_factors * np.exp(X @ beta), min_mu)
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:560: RuntimeWarning: invalid value encountered in divide
  + ((1 / disp + counts) * mu_ / (1 / disp + mu_)) @ X
...
Fitting dispersions...
... done in 5.68 seconds.

Fitting dispersion trend curve...
... done in 10.39 seconds.

Fitting MAP dispersions...
... done in 6.41 seconds.

Fitting LFCs...
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:406: RuntimeWarning: invalid value encountered in multiply
  - counts * np.log(mu)
... done in 6.06 seconds.

Refitting 0 outliers.

This part finishes, but gives runtime warnings. Will this make the output incorrect? Why would this happen?
Also, I'm getting something similar with stat_res.lfc_shrink():

cooks_filter=True

stat_res = DeseqStats(
    dds,
    contrast=["RNA-Qubit-actual", "", ""],
    alpha=0.05,
    cooks_filter=True,
    independent_filter=True,
)
stat_res.run_wald_test()

# Filter based on Cook's outliers
if cooks_filter:
    stat_res._cooks_filtering()

# Does either independent filtering, where genes are filtered out based on low counts
if stat_res.independent_filter:
    stat_res._independent_filtering()
# or uses the Benjamini-Hochberg procedure to adjust p-values
else:
    stat_res._p_value_adjustment()

# Shrink log fold changes toward 0
stat_res.lfc_shrink()

Output:

Fitting MAP LFCs...
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:1185: RuntimeWarning: overflow encountered in exp
  counts - (counts + size) / (1 + size * np.exp(-xbeta - offset))
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:1185: RuntimeWarning: overflow encountered in exp
  counts - (counts + size) / (1 + size * np.exp(-xbeta - offset))
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:1185: RuntimeWarning: overflow encountered in exp
...
... done in 27.00 seconds.

Using version 0.4.0

Expected behavior
Not getting overflows. Maybe due to precision? Maybe need long floats? My suspicion is that this is happening when certain combinations of design factors yield 0 counts, and it is impossible to determine a dispersion for that gene. What is the best approach here?

Desktop (please complete the following information):

  • OS: Ubuntu
  • Version 20.04.5 LTS (Focal Fossa)
@BeyondTheProof BeyondTheProof added the bug Something isn't working label Aug 24, 2023
@BorisMuzellec
Copy link
Collaborator

Hi @BeyondTheProof, it's a bit hard for me to guess what's going wrong here...
Apparently the overflow starts in fit_genewise_dispersions when calling irls_solver to fit initial mu values, but I'm not sure why.
Does your data have very large counts values?

@sermare
Copy link

sermare commented Sep 27, 2023

A silly solution @BeyondTheProof Is to limit the number of genes to < 500. Here:

n = 499  # Change this to the desired number of columns not greater than 500
random_genes = np.random.choice(genes_to_keep, size=n, replace=False)

# Create a copy of the DataFrame with only the randomly selected columns
counts_unstranded_random = gbm_counts_unstranded[random_genes]

`Fitting size factors...
... done in 0.01 seconds.

Fitting dispersions...
... done in 0.32 seconds.

Fitting dispersion trend curve...
... done in 0.25 seconds.

Fitting MAP dispersions...
... done in 0.36 seconds.

Fitting LFCs...
... done in 0.15 seconds.

Refitting 30 outliers.

Fitting dispersions...
... done in 0.03 seconds.

Fitting MAP dispersions...
... done in 0.03 seconds.

Fitting LFCs...
... done in 0.02 seconds.
`

@sermare
Copy link

sermare commented Sep 27, 2023

Let me correct myself just do:


def rename_duplicate_columns(df):
    cols = pd.Series(df.columns)
    for dup in cols[cols.duplicated()].unique():
        cols[cols[cols == dup].index.values.tolist()] = [dup + '_' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
    df.columns = cols

avg_counts = gbm_counts_unstranded.mean()

# Filter columns based on the average
filtered_columns = avg_counts[(avg_counts > 10)].index # & (avg_counts < 10000)

print(len(filtered_columns))

# Subset the data frame based on the filtered columns
filtered_df = gbm_counts_unstranded[filtered_columns]

It should work after it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants