how to deal with "too many missing data" #24

Open
jeffchen2000 opened this issue Oct 24, 2022 · 2 comments

@jeffchen2000

Hi BnpC support,

According to your published paper, your tool is the best at dealing with missing data (more than 20% missing).

We have targeted-sequencing snDNA samples from FFPE tissue, which have a lot of missing data because of the random fragmentation of the DNA. (By the way, we cannot use the usual filter thresholds to reduce the percentage of missing data; otherwise we would not get any cells.) With such a large fraction of missing data (~50%), we get too many singleton clusters, which certainly does not make sense. What are the best parameter settings to avoid this issue?

@NBMueller

NBMueller commented Oct 25, 2022

The α0 prior parameter of the Chinese Restaurant Process influences the number of clusters. You can change it with the -ap or --DPa_prior argument, which requires two floats.
The two values define the Gamma prior distribution for α0; in short: the higher the product of the two values, the more likely a new cluster will be 'opened'. The default parameters are [(#cells)^(1/2), 1], i.e. the square root of the number of cells and 1, with the reasoning that more cells will result in more clusters. Setting them to something lower (e.g., [1, 1]) should result in fewer clusters.
Hope that helps, otherwise feel free to also drop me a mail.
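To make the effect of α0 concrete, here is a minimal sketch (not BnpC code) that computes the expected number of clusters a Chinese Restaurant Process produces a priori for a fixed concentration value. It assumes α0 is held at its prior mean; with the second prior value equal to 1, that mean equals the first value under either the rate or the scale parametrization of the Gamma distribution. The cell count of 1000 is a hypothetical example, and the data likelihood will shift the actual number of clusters away from this prior expectation.

```python
import math


def expected_clusters(n_cells: int, alpha: float) -> float:
    """Prior expected number of clusters in a CRP with concentration alpha.

    The i-th cell (0-based) opens a new cluster with probability
    alpha / (alpha + i), so the expectation is the sum of these terms.
    """
    return sum(alpha / (alpha + i) for i in range(n_cells))


if __name__ == "__main__":
    n_cells = 1000  # hypothetical number of cells
    # alpha = 1 corresponds to the mean of a [1, 1] prior,
    # alpha = sqrt(n_cells) to the mean of the default prior.
    for alpha in (1.0, math.sqrt(n_cells)):
        print(f"alpha = {alpha:.2f}: "
              f"expected clusters = {expected_clusters(n_cells, alpha):.1f}")
```

The expectation grows roughly like α0 · log(1 + n/α0), which is why lowering the prior from the square-root default towards [1, 1] favours far fewer clusters before any data are seen.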

NBMueller self-assigned this Oct 25, 2022
NBMueller added the question (Further information is requested) label Oct 25, 2022
@jeffchen2000

Hi NBMueller,

Thanks for your comments. When I set "-ap 1 1", the number of resulting clusters sometimes decreased and sometimes did not.
I realized that the random seed makes a huge difference.

I then modified the input table, labeling each cell with its sample ID so that I know which sample each cell comes from after clustering.

From the BnpC output I can then generate a summary table in which each row is a cluster ID, each column is a sample, and each field is the number of cells per sample per cluster.
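A summary table of this kind could be built roughly as follows. This is a sketch under assumptions that are not stated in the thread: cell labels are taken to have the hypothetical form `<sampleID>_<cellID>`, and the per-cell cluster assignments are assumed to have already been parsed from the BnpC output into a list in the same order as the cell labels.

```python
from collections import Counter


def cluster_by_sample_table(cell_labels, cluster_ids, sep="_"):
    """Count cells per (cluster, sample) pair.

    cell_labels: labels such as "S1_cell7" (hypothetical naming scheme).
    cluster_ids: one cluster ID per cell, in the same order.
    Returns (clusters, samples, table), where table[r][c] is the number of
    cells of samples[c] that were assigned to clusters[r].
    """
    counts = Counter()
    for label, cluster in zip(cell_labels, cluster_ids):
        sample = label.split(sep, 1)[0]  # text before the first separator
        counts[(cluster, sample)] += 1

    clusters = sorted({cluster for cluster, _ in counts})
    samples = sorted({sample for _, sample in counts})
    # A Counter returns 0 for pairs that never occurred.
    table = [[counts[(c, s)] for s in samples] for c in clusters]
    return clusters, samples, table


if __name__ == "__main__":
    cells = ["S1_c1", "S1_c2", "S2_c1", "S2_c2", "S2_c3"]
    assignment = [0, 0, 0, 1, 1]
    clusters, samples, table = cluster_by_sample_table(cells, assignment)
    print("cluster", *samples, sep="\t")
    for cluster, row in zip(clusters, table):
        print(cluster, *row, sep="\t")
```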

Based on this summary table, I can draw PCA plots in which each dot represents a sample, so the plots show the closeness/relationship between samples. It turned out that the PCA plots can be very different for different random seeds (which means the results have not converged, if I understand correctly). How can I solve this issue? Is there any parameter I can modify?

Thanks in advance,

Jeff C
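One way to put a number on the seed-to-seed variability described above, before comparing PCA plots by eye, is to compare the cluster assignments of two runs directly. The sketch below computes the plain Rand index, the fraction of cell pairs on which two runs agree about "same cluster" versus "different cluster". It is a generic diagnostic, not part of BnpC, and it assumes the assignments of both runs are available as lists in the same cell order.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of cell pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two cells
    together, or both put them apart. Cluster IDs need not match between
    runs; only the grouping matters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same cells")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total if total else 1.0


if __name__ == "__main__":
    run_seed_1 = [0, 0, 1, 1, 2]  # hypothetical assignments from one seed
    run_seed_2 = [5, 5, 3, 3, 3]  # hypothetical assignments from another
    print(f"Rand index: {rand_index(run_seed_1, run_seed_2):.2f}")
```

Values near 1 across pairs of seeds suggest the runs agree on the grouping. Note that the unadjusted index can look high by chance when there are many small clusters; a chance-corrected variant such as scikit-learn's `adjusted_rand_score` may be preferable in that case.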
