Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Application to Smart-seq data #1

Open
jmfa opened this issue Aug 30, 2023 · 15 comments
Open

Application to Smart-seq data #1

jmfa opened this issue Aug 30, 2023 · 15 comments

Comments

@jmfa
Copy link

jmfa commented Aug 30, 2023

Hi,
Thanks for this very useful package! I'm interesting in running the tool on some smart-seq2 datasets.
How should I set up the files so that I can keep using the same snakefile?

Thanks in advance,
Joao

@jefftc
Copy link
Contributor

jefftc commented Sep 13, 2023 via email

@jmfa
Copy link
Author

jmfa commented Sep 18, 2023

Hi Jeff,

Thanks for your help.
I did manage to modify the snakefile to get the tool running.
However, I got this error during the parse_vcf_files step:

[Mon Sep 18 11:07:08 2023]
rule parse_vcf_files:
    input: output/05_call/variants.0.snp_only.vcf
    output: output/05_call/variants.table.txt
    jobid: 34

Traceback (most recent call last):
  File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpx196dkvi.parse_vcf_files.py", line 72, in <module>
    main()
  File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpx196dkvi.parse_vcf_files.py", line 27, in main
    assert len(cols) == 10
AssertionError
[Mon Sep 18 11:07:09 2023]
Error in rule parse_vcf_files:
    jobid: 34
    output: output/05_call/variants.table.txt

RuleException:
CalledProcessError in line 1066 of /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_ED:
Command 'set -euo pipefail;  /mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/bin/python3.7 /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpx196dkvi.parse_vcf_files.py' returned non-zero exit status 1.
  File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2347, in run_wrapper
  File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_ED", line 1066, in __rule_parse_vcf_files
  File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 568, in _callback
  File "/mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
  File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 554, in cached_or_run
  File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2359, in run_wrapper
Removing output files of failed job parse_vcf_files since they might be corrupted:
output/05_call/variants.table.txt
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/log/2023-09-18T110707.810852.snakemake.log

The input file is correctly placed under "output/05_call/variants.0.snp_only.vcf".
Any idea of what might be causing this problem?

Thanks again,
J

@jefftc
Copy link
Contributor

jefftc commented Sep 18, 2023 via email

@jmfa
Copy link
Author

jmfa commented Sep 19, 2023

05_call.zip
Hi Jeff,
I'm dropping here the files in the 05_call folder.
Thanks for your help,
J

@jefftc
Copy link
Contributor

jefftc commented Sep 20, 2023 via email

@jmfa
Copy link
Author

jmfa commented Oct 19, 2023

Hi Jeff,

I updated the code and this issue seems to be solved.
However, I got a different error at a later stage of the pipeline:

`
[Thu Oct 19 07:04:21 2023]
Finished job 64.
36 of 55 steps (65%) done
Select jobs to execute...

[Thu Oct 19 07:04:21 2023]
rule list_sites_with_mixed_genotypes:
input: output/07_filter/genotype.count.txt
output: output/07_filter/mixed_genotypes.txt
jobid: 63

[Thu Oct 19 07:04:21 2023]
Error in rule list_sites_with_mixed_genotypes:
jobid: 63
output: output/07_filter/mixed_genotypes.txt

RuleException:
CalledProcessError in line 1479 of /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_v2:
Command 'set -euo pipefail; /mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/bin/python3.7 /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpvo_o2ml8.list_sites_with_mixed_genotypes.py' returned non-zero exit status 1.
File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 2347, in run_wrapper
File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_v2", line 1479, in __rule_list_sites_with_mixed_genotypes
File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 568, in _callback
File "/mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 554, in cached_or_run
File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 2359, in run_wrapper
Removing output files of failed job list_sites_with_mixed_genotypes since they might be corrupted:
output/07_filter/mixed_genotypes.txt
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/log/2023-10-18T192121.168608.snakemake.log
`

The input file (output/07_filter/genotype.count.txt) looks fine so I'm not sure what's causing this error.
Any idea?

Thanks
J

@jefftc
Copy link
Contributor

jefftc commented Oct 24, 2023 via email

@jmfa
Copy link
Author

jmfa commented Oct 24, 2023

Hi Jeff,

I'm attaching the files you requested.
Also, this is what I have in the snakefile:
"KEEP_SITES_WITH_MIXED_GENOTYPES = 20"

Thanks,
J
2023-10-18T192121.168608.snakemake.log
genotype.count.txt

@jefftc
Copy link
Contributor

jefftc commented Oct 24, 2023 via email

@jmfa
Copy link
Author

jmfa commented Oct 30, 2023

Hi Jeff,
Ok so I think we're pretty close but now there's an error with the beast2 section of the pipeline.
This is the error I get:

[1] "Start with 46 sequences and 1716 positions." [1] "Final data has 46 sequences and 1716 positions, 0.0% sparse." [1] "Fitting tree over 100,000 iterations." Error instop_vctrs()`:
! Input must be a vector, not an environment.
Backtrace:

  1. ├─global run.beast(...)
  2. │ └─beautier::create_tracelog(trace.log, log_every = sample.interval)
  3. │ └─beautier::check_tracelog(tracelog)
  4. │ └─beautier::check_tracelog_values(tracelog)
  5. │ └─beautier::check_filename(tracelog$filename, allow_na = TRUE)
  6. │ └─stringr::str_detect(filename, " ")
  7. │ └─stringr:::check_lengths(string, pattern)
  8. │ └─vctrs::vec_size_common(...)
  9. └─vctrs:::stop_scalar_type(<fn>(<env>), "")
  10. └─vctrs:::stop_vctrs(msg, "vctrs_error_scalar_type", actual = x)
  11. └─rlang::abort(message, class = c(class, "vctrs_error"), ...)
    

`

Any idea of what might be causing this? I had to install a lot of packages manually as I'm currently using R version 4.0.4.
However all required packages seem to be installed and working.

Thanks
J

@jefftc
Copy link
Contributor

jefftc commented Nov 19, 2023 via email

@jmfa
Copy link
Author

jmfa commented Nov 23, 2023

Hi Jeff,
Thanks again for your help.
I'm now re-running the full pipeline with the updated R packages.

Small question though:

  • is the mutations.fa the final callset from Phylinsic or are there any additional filtering/genotyping steps after the beast run? The reason I'm asking is because I may want to test a different tree estimation method using the genotype calls from Phylinsic but I'm unsure as to whether this fasta file represents the final genotype calls.

Thanks,
J

@jmfa
Copy link
Author

jmfa commented Nov 24, 2023

Hi (again) Jeff,

Sorry to keep posting here but I was comparing the genotypes and fasta files that Phylinsic outputs and I have noticed that the make_fasta_file.py script may be assigning incorrect alleles for heterozygous sites.

For instance, here's the info printed in the genotypes.txt file for Cell_02:

# genotypes.txt
Site	Cell_02
chr1_139213_A_G	A
chr1_139266_T_C	R
chr1_139441_A_G	R
chr1_629274_G_A	H
chr1_630596_C_T	A
chr1_631580_T_C	H
chr1_958492_T_C	R
chr1_1562772_C_T	R

If I'm not mistaken, the genotype calls should be used as follows:
A = Homozygous alternative, R = Homozygous reference, H = Heterozygous call.

However, this is the output info in the fasta file:

>Cell_02
GTACTATC 

As you can see, the a C & A alleles were assigned to position 4 and 6, respectively, instead of R & Y.

I didn't get any errors in this step of the pipeline so I'm wondering whether this is a bug in the code.

Cheers,
J

@jefftc
Copy link
Contributor

jefftc commented Nov 24, 2023 via email

@jefftc
Copy link
Contributor

jefftc commented Nov 24, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants