Problem in the deconvolution step #148

Open
Xuyen21 opened this issue Jan 8, 2024 · 4 comments

Xuyen21 commented Jan 8, 2024

Following the guide for wastewater experimental branch. I got stuck in the last stage of vpipe deconvolution as you can see in this error log: deconvoluted.err.log.

More specifically, the problem occurs when running this code inside LolliPop (deconvolute.py):

preproc = preproc.general_preprocess(
    variants_list=variants_list,
    variants_pangolin=variants_pangolin,
    variants_not_reported=variants_not_reported,
    to_drop=to_drop,
    start_date=start_date,
    end_date=end_date,
    no_date=no_date,
    remove_deletions=remove_deletions,
)

will return an empty data frame. I noticed that start_date and end_date are not added to the generated variants_pangolin.yaml file in the previous step. Adding them manually does not solve the problem.

The content of the input files is as follows:
results/tallymut.tsv.zst contains:

             sample               batch  reads proto location_code       date  \
0  A2_10_2022_09_18  20230331_HN3YHDRX2    151   v41            10 2022-09-18
1  A2_10_2022_09_18  20230331_HN3YHDRX2    151   v41            10 2022-09-18
2  A2_10_2022_09_18  20230331_HN3YHDRX2    151   v41            10 2022-09-18
3  A2_10_2022_09_18  20230331_HN3YHDRX2    151   v41            10 2022-09-18
4  A2_10_2022_09_18  20230331_HN3YHDRX2    151   v41            10 2022-09-18
5  A2_10_2022_09_18  20230331_HN3YHDRX2    151   v41            10 2022-09-18
6  A2_10_2022_09_18  20230331_HN3YHDRX2    151   v41            10 2022-09-18
7  A2_10_2022_09_18  20230331_HN3YHDRX2    151   v41            10 2022-09-18
8  A2_10_2022_09_18  20230331_HN3YHDRX2    151   v41            10 2022-09-18
9  A2_10_2022_09_18  20230331_HN3YHDRX2    151   v41            10 2022-09-18

  location          sample.1             batch.1   pos    gene base  cov  var  \
0   Schanf  A2_10_2022_09_18  20230331_HN3YHDRX2   210     NaN    T    0    0
1   Schanf  A2_10_2022_09_18  20230331_HN3YHDRX2   241     NaN    T    0    0
2   Schanf  A2_10_2022_09_18  20230331_HN3YHDRX2   405  ORF1ab    G    0    0
3   Schanf  A2_10_2022_09_18  20230331_HN3YHDRX2   670  ORF1ab    G    0    0
4   Schanf  A2_10_2022_09_18  20230331_HN3YHDRX2   733  ORF1ab    C    0    0
5   Schanf  A2_10_2022_09_18  20230331_HN3YHDRX2   913  ORF1ab    T    0    0
6   Schanf  A2_10_2022_09_18  20230331_HN3YHDRX2  2749  ORF1ab    T    0    0
7   Schanf  A2_10_2022_09_18  20230331_HN3YHDRX2  2790  ORF1ab    T    0    0
8   Schanf  A2_10_2022_09_18  20230331_HN3YHDRX2  2832  ORF1ab    G    0    0
9   Schanf  A2_10_2022_09_18  20230331_HN3YHDRX2  3037  ORF1ab    T    0    0

   frac omxbb19   al     om1   de   ga
0   NaN     NaN  NaN     NaN  mut  NaN
1   NaN     mut  mut  shared  mut  mut
2   NaN     mut  NaN     NaN  NaN  NaN
3   NaN     mut  NaN     NaN  NaN  NaN
4   NaN     NaN  NaN     NaN  NaN  mut
5   NaN     NaN  mut     NaN  NaN  NaN
6   NaN     NaN  NaN     NaN  NaN  mut
7   NaN     mut  NaN     NaN  NaN  NaN
8   NaN     NaN  NaN     mut  NaN  NaN
9   NaN     mut  mut  shared  mut  mut

I printed only the first 10 of the 2096 rows of the above data frame in Python.

deconv_bootstrap_cowwid.yaml:

bootstrap: 100

kernel: 'gaussian'

kernel_params:
  bandwidth: 10

regressor: robust
regressor_params:
  f_scale: 0.01

deconv_params:
  min_tol: 1e-3

results/variants_pangolin.yaml:

variants_pangolin:
  omxbb19: XBB.1.9
  al: B.1.1.7
  om1: BA.1
  de: B.1.617.2
  ga: P.1

var_dates.yaml:

var_dates:
  '2022-06-12':
  - B.1.1.7 
  - B.1.617.2
  - P.1
  - BA.1
  '2022-07-17':
  - B.1.1.7 
  - B.1.617.2
  - P.1
  - BA.1
  '2022-08-14':
  - B.1.1.7 
  - B.1.617.2
  - P.1
  - BA.1
  '2022-09-18':
  - B.1.1.7 
  - B.1.617.2
  - P.1
  - BA.1

What can I change to make the V-pipe deconvolution work? Alternatively, which Anaconda and Jupyter Notebook versions do I need for the LolliPop code to run?
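A quick cross-check of the three inputs listed above can be sketched as follows. This is only an illustrative, stdlib-only sketch: the function name `check_variant_consistency` and the keys of its result are hypothetical, the values are copied by hand from the files shown in this post, and it does not reproduce what LolliPop does internally or claim to identify the cause of the empty data frame.

```python
# Values copied by hand from the files shown in this post.
variants_pangolin = {
    "omxbb19": "XBB.1.9",
    "al": "B.1.1.7",
    "om1": "BA.1",
    "de": "B.1.617.2",
    "ga": "P.1",
}
var_dates = {
    "2022-06-12": ["B.1.1.7", "B.1.617.2", "P.1", "BA.1"],
    "2022-07-17": ["B.1.1.7", "B.1.617.2", "P.1", "BA.1"],
    "2022-08-14": ["B.1.1.7", "B.1.617.2", "P.1", "BA.1"],
    "2022-09-18": ["B.1.1.7", "B.1.617.2", "P.1", "BA.1"],
}
tally_columns = [
    "sample", "batch", "reads", "proto", "location_code", "date",
    "location", "sample.1", "batch.1", "pos", "gene", "base", "cov",
    "var", "frac", "omxbb19", "al", "om1", "de", "ga",
]


def check_variant_consistency(variants_pangolin, var_dates, tally_columns):
    """Hypothetical helper: compare variant names across the three inputs."""
    known = set(variants_pangolin.values())
    listed = {name.strip() for names in var_dates.values() for name in names}
    return {
        # Pangolin names used in var_dates but absent from variants_pangolin
        "in_var_dates_but_unknown": sorted(listed - known),
        # Pangolin names defined but never requested in any period
        "never_listed_in_var_dates": sorted(known - listed),
        # Short names that have no matching column in the tally table
        "short_names_missing_from_tally": sorted(
            set(variants_pangolin) - set(tally_columns)
        ),
    }


report = check_variant_consistency(variants_pangolin, var_dates, tally_columns)
for key, names in report.items():
    print(f"{key}: {names}")
```

With the values above, the only mismatch is that XBB.1.9 has a column in the tally table but is never listed in var_dates.yaml; whether that matters for the deconvolution is not established here.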

DrYak (Member) commented Jan 18, 2024

Hi, sorry for the slow answer, I only came back from vacation this week.

> Following the guide for wastewater experimental branch.

By the way, the wastewater specifics have now been merged into the main master branch.

> I got stuck in the last stage of vpipe deconvolution as you can see in this error log: deconvoluted.err.log.

Could you also provide the deconvoluted.out.log? It would give details about the parameters used for the deconvolution. (It recapitulates everything that was loaded from the various .yaml files and/or auto-guessed from the data.)

> will return an empty data frame.

Indeed that's the problematic part. For some reason it can't generate a deconvolution for the given input parameters.
Most likely it's not considering the correct date range or the correct variants for the period.

> I noticed that start_date and end_date are not added to the generated variants_pangolin.yaml file in the previous step.

Normally, the dates should be auto-guessed from the range of the "date" column in results/tallymut.tsv.zst
(this should be mentioned in the deconvoluted.out.log).
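To make "the range of the date column" concrete, here is a minimal stdlib-only sketch of such a guess. The helper `guess_date_range` is hypothetical and is not LolliPop's actual code; it assumes the tally table has already been decompressed to tab-separated text and that dates are in ISO format, as in the rows posted above.

```python
import csv
import io
from datetime import date


def guess_date_range(tsv_text, column="date"):
    """Hypothetical helper: return (earliest, latest) date found in `column`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    dates = [date.fromisoformat(row[column]) for row in reader if row.get(column)]
    if not dates:
        raise ValueError(f"no usable values in column {column!r}")
    return min(dates), max(dates)


# Two rows reduced from the tally excerpt in the first post (only 3 columns kept).
example = (
    "sample\tdate\tpos\n"
    "A2_10_2022_09_18\t2022-09-18\t210\n"
    "A2_10_2022_09_18\t2022-09-18\t241\n"
)
start, end = guess_date_range(example)
print(start, end)
```

Comparing such a range against the start_date and end_date reported in deconvoluted.out.log would show whether the samples fall inside the period being deconvoluted.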

Could you also provide your V-pipe config file?

Among other things:

  • are you making your own variants-config YAML? Or are you letting V-pipe re-use the results/variants_pangolin.yaml automatically created during the previous step?

Regarding file var_dates.yaml:
You only need entries when the mixture of present variants (e.g.: as detected by COJAC) changes.

E.g.: if before 2022-07-01 you have one mixture of variants, and it changes afterward, you just write:

var_dates:
  '2022-05-01':
  # at the beginning of the project, only B.1.1.7 'Alpha' and P.1 'Gamma' are present
  - B.1.1.7 
  - P.1
  '2022-07-01':
  # starting from July, Delta B.1.617.2 and Omicron BA.1 showed up to the party
  - B.1.1.7 
  - B.1.617.2
  - P.1
  - BA.1

This will cause Lollipop to do one deconvolution for all samples between May and July while looking only for quantification of B.1.1.7 and P.1,
then a second deconvolution for all samples after July and this time also looking for B.1.617.2 and BA.1 in addition,
then concatenating the curves chronologically.

The way you wrote your YAML, LolliPop will start a separate deconvolution for each period (from 2022-06-12 to 2022-07-17, then 2022-07-17 to 2022-08-14, then 2022-08-14 to 2022-09-18, then everything after 2022-09-18), but each time you asked it to estimate the proportions for the same mixture of variants (B.1.1.7, B.1.617.2, P.1, BA.1).
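The period splitting described above can be illustrated with a small stdlib-only sketch. It is only a reading of this explanation, not LolliPop's implementation: the helpers `split_into_periods` and `merge_identical_neighbours` are hypothetical, and how LolliPop treats the boundary dates themselves is not specified here.

```python
from datetime import date


def split_into_periods(var_dates):
    """Turn {start_date: [variants]} into (start, end, variants) tuples.

    `end` is the start of the next entry, or None for the last,
    open-ended period.
    """
    starts = sorted(var_dates)  # ISO date strings sort chronologically
    periods = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else None
        periods.append((
            date.fromisoformat(start),
            date.fromisoformat(end) if end else None,
            list(var_dates[start]),
        ))
    return periods


def merge_identical_neighbours(periods):
    """Collapse consecutive periods that list the same set of variants."""
    merged = []
    for start, end, variants in periods:
        if merged and sorted(merged[-1][2]) == sorted(variants):
            merged[-1] = (merged[-1][0], end, merged[-1][2])
        else:
            merged.append((start, end, variants))
    return merged


same_mix = ["B.1.1.7", "B.1.617.2", "P.1", "BA.1"]
var_dates = {
    "2022-06-12": same_mix,
    "2022-07-17": same_mix,
    "2022-08-14": same_mix,
    "2022-09-18": same_mix,
}

periods = split_into_periods(var_dates)
needed = merge_identical_neighbours(periods)
print(len(periods), "periods as written;", len(needed), "needed")
```

Since all four entries list the same mixture, a single entry dated 2022-06-12 describes the same information.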

DrYak self-assigned this Jan 18, 2024

Xuyen21 (Author) commented Jan 18, 2024

Hi @DrYak ,
Thank you for your reply.

This time I ran it with only 2 variants (Alpha and Delta) from references/voc.

Here is the deconvoluted.out.log file
deconvoluted.out.log

> Could you also provide your V-pipe config file?

Here is my config.yaml file:

general:
    virus_base_config: 'sars-cov-2'
    primers_trimmer: samtools
    # for Oxford nanopore
    aligner: minimap
    reprocessor: skip

input:
    datadir: samples/
    samples_file: samples.tsv
    # for Oxford nanopore
    paired: false
    # generated with COJAC (or obtained from us)
    variants_def_directory: references/voc/
    protocols_file: references/primers.yaml

output:
    datadir: results/

    trim_primers: true
    snv: false
    local: false
    global: false
    visualization: false
    diversity: false
    QA: false
    upload: false
    dehumanized_raw_reads: false
    # note no wastewater output flag for now, rules called explicitly

# for Oxford nanopore
minimap_align:
    preset: 'map-ont'

# if dates and location are extracted from sample names:
timeline:
    # timeline_tsv: timeline.tsv
    regex_yaml: regex.yaml
    locations_table: wastewater_plants.tsv

deconvolution:
    threads: 8
    # this file corresponds to the parameters used now on our curves:
    # (provided by us)
    deconvolution_config: deconv_bootstrap_cowwid.yaml
    # file that specifies which variant are present at which time point, as determined by looking at COJAC's results
    # done manually by user
    variants_dates: var_dates.yaml
    # automatically generated
    variants_config: results/variants_pangolin.yaml

> are you making your own variants-config YAML? Or are you letting V-pipe re-use the results/variants_pangolin.yaml automatically created during the previous step?

I let V-pipe reuse the results/variants_pangolin.yaml file.

> Regarding file var_dates.yaml:
> You only need entries when the mixture of present variants (e.g.: as detected by COJAC) changes.

The variants Alpha (B.1.1.7) and Delta (B.1.617.2) are always present, so I adjusted var_dates.yaml like this:

var_dates:
  '2022-06-12':
  - B.1.1.7 
  - B.1.617.2

My regex.yaml file:

sample: \w{2}_(?P<location>\d+)_(?P<year>20\d{2})_(?P<month>[01]?\d)_(?P<day>[0-3]?\d)

My wastewater_plants.tsv file:

code	location
05	Davos
10	Schanf
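As a sanity check of these two files, the sketch below applies the regex to the sample name from the tally table and looks up the captured location code. It is a stdlib-only illustration: the helper `parse_sample` is hypothetical, the lookup table is copied by hand from wastewater_plants.tsv, and the use of `re.match` and the comparison of codes as strings are assumptions, not a description of how V-pipe applies regex.yaml.

```python
import re
from datetime import date

# Pattern copied from regex.yaml above.
SAMPLE_RE = re.compile(
    r"\w{2}_(?P<location>\d+)_(?P<year>20\d{2})_(?P<month>[01]?\d)_(?P<day>[0-3]?\d)"
)

# Copied by hand from wastewater_plants.tsv; codes kept as strings so that
# a leading zero such as "05" is preserved.
LOCATIONS = {"05": "Davos", "10": "Schanf"}


def parse_sample(name):
    """Hypothetical helper: extract location and date from a sample name."""
    m = SAMPLE_RE.match(name)
    if m is None:
        return None
    code = m.group("location")
    return {
        "location_code": code,
        "location": LOCATIONS.get(code),
        "date": date(int(m.group("year")), int(m.group("month")), int(m.group("day"))),
    }


parsed = parse_sample("A2_10_2022_09_18")
print(parsed)
```

For the sample shown in the tally table this gives location code 10, location Schanf and date 2022-09-18, which agrees with the location_code, location and date columns posted earlier.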

After all that, it returned the same error as before :)

DrYak (Member) commented Jan 19, 2024

Well I don't see anything anomalous...
Something weird is happening.

Could you share the compressed tallymut.tsv.zst with me via, e.g., PolyBox, Switch Drives, etc., so that I can try to see what's wrong?

Xuyen21 (Author) commented Jan 25, 2024

Hi @DrYak
Sorry for the late response.

> Could you share the compressed tallymut.tsv.zst with me via, e.g., PolyBox, Switch Drives, etc., so that I can try to see what's wrong?

Here is the file on Google Drive:
tallymut.tsv.zst

Thank you!
