Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTH ratio 1 #93

Open
flde opened this issue May 24, 2023 · 5 comments
Open

OTH ratio 1 #93

flde opened this issue May 24, 2023 · 5 comments
Labels
good first issue Good for newcomers

Comments

@flde
Copy link

flde commented May 24, 2023

Hello cellsnp-lite team,

I run cellsnp-lite with default parameters in mode 1 and get SNPs with OTH ratio 1 OTH_ratio <- Matrix::rowSums(OTH) / Matrix::rowSums(DP+OTH). Does that make sense? Highly appreciate your input.

download

Best wishes,
Florian

@hxj5
Copy link
Collaborator

hxj5 commented May 24, 2023

Hi, thanks for the feedback. It could happen that some SNPs have OTH ratio 1 in mode 1, since the REF and ALT alleles are not inferred from data but specified in the input VCF by the user. Sometimes the two specified alleles do not "express", either truely happened as there are other alleles expressed on that SNP, or artifacts due to sequencing / alignment errors etc, especially for those SNPs with low coverage.

@flde
Copy link
Author

flde commented May 24, 2023

Hi @hxj5,

Many thanks for your help.

The AD, DP, and OTH matrix can be used to compute the REF, ALT, and OTHER frequency. But the annotation of the matrix is based on the reference VCF file and not derived from the single cell data. For example, a SNP which is ALT in the reference could be the REF in my own data.

Therefore, to filter by minMAF either the REF or ALT frequency needs to be greater than the threshold. But you ignore the OTHER frequency, is that correct?

Also, if AD and DP is based on the reference VCF file and not the single cell data wouldn't that be a problem for Vireo?

Best wishes,
Florian

@hxj5
Copy link
Collaborator

hxj5 commented May 25, 2023

Thanks for the good question.

  1. Cellsnp was designed for bi-allelic SNPs. In its mode 1, REF and ALT alleles are specified by user while in mode 2, REF and ALT are inferred from data as the alleles with highest and second highest read(UMI) counts. Therefore, in mode 1, the REF or ALT in the reference VCF could be different from the major or minor allele inferred from data. For example, the ALT in VCF could be REF in the data, as you mentioned.

  2. In cellsnp cmdline (for both mode 1 and 2), MAF is always caculated as the fraction read(UMI)_count_of_minor_allele / read(UMI)_count_of_all_alleles, where the minor allele is the allele with second highest read(UMI) count inferred from data. Sorry for the misleading (updated in issue Question for minMAF parameter #77).

  3. Therefore, in mode 1, post-filtering SNPs based on the minimum allele frequency of the REF and ALT alleles in VCF file could be different from filtering SNPs with --minMAF in the cellsnp cmdline, for a small subset of SNPs whose major allele (with highest read/UMI count) or minor allele (second highest) is neither REF or ALT allele but one of the OTH alleles (updated in issue Filter SNPs by minMAF after running cellsnp-lite  #90).

  4. The number of SNPs whose major or minor allele is one of the OTH alleles is expected to be quite small (in mode 1), given the input reference VCF is reliable (e.g., with common SNPs compiled from 1000 genome project), hence should have limited influence on downstream donor deconvolution.

@flde
Copy link
Author

flde commented Jun 6, 2023

Dear @hxj5,

I was on vaccation last week, sorry for the delay. Many thanks for the detailed answer that is very helpful! I managed to write a script that sorts and filters the cellsnp output into vartrix format. Here the REF and ALT matrix is based on the single cell data. I will do some tests and close update/close the thread afterwards.

Again, many thanks for your time!

Best wishes, Florian

@flde
Copy link
Author

flde commented Jun 16, 2023

Dear @hxj5 and @huangyh09,

Many thanks for helping me with this issue and related #90 #62. I wrote a script to convert the cellsnp output to vatrix format and filter the vatrix matrix by minMAF. Hence, filter the matrix by the ALT frequency according to the single cell data.

For the plots below I show the cellsnp output with minMAF=0 and varied the minMAF for the vatrix output. For each patient we have samples from different time points which are either only host or mixed host/donr samples. I pooled those per patient and run cellsnp mode 1 with minMAF=0. Vireo was run with -M 150. The Vireo output can then be finally by time point to test if only donor samples were annotated correctly (not shown). But below I show only the results for the pooled samples per patient.

Some quick information on the patients:
Patient_1: Only very few donor cells expected.
Patient_2: Only donor cells
Patient_5: Siblings
Patient_7: Close relatives
Patient_9: Siblings

Vireo is capable of handling samples where there is only one genotype present (patient 2) and seems quite sensible if one genotype is overrepresented (patient 1). In that case smaller minMAF give better results. In case the patient is unrelated (patient 3 and 4) the results are independent of minMAF. In case they are closely related the analysis yields mostly unassigned cells and the change of minMAF either has minor effects (patient_5 and patient_7) or very strong effects (patient 9).

From the second figure I conclude that for closely related patients I could try to relax the min_prob filter to reduce the unassigned cells #24.

download
download

I hope that is interesting for you and I want to say many thanks again with all your help!

Best wishes,
Florian

@hxj5 hxj5 added the good first issue Good for newcomers label Aug 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants