how to deal with "too many missing data" #24

jeffchen2000 · 2022-10-24T20:34:13Z

Hi BnpC support

From your published paper, your tool is the best at dealing with missing data (more than 20% missing).

We now have targeted sequencing snDNA samples from FFPE, which have lots of missing data due to the random fragmentation of DNA. (By the way, we cannot use regular filter thresholds to reduce the % of missing data; otherwise, we would not get any cells.) With such a large % of missing data (~50%), we got too many singleton clusters (which certainly does not make sense). What are the best parameter settings to avoid such an issue?

NBMueller · 2022-10-25T08:16:42Z

The α₀ prior parameter of the Chinese Restaurant Process influences the number of clusters. You can change it with the -ap or --DPa_prior argument, requiring two floats.
The two values define the prior Gamma distribution for α₀; in short: the higher the product of the two values, the more likely a new cluster will be 'opened'. The default parameters are [(#cells)^(1/2), 1], i.e. the square root of the number of cells and 1, with the reasoning that more cells will result in more clusters. Setting them to something lower (e.g., [1, 1]) should result in fewer clusters.
Hope that helps, otherwise feel free to also drop me a mail.
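
To make the effect of the prior concrete, here is a minimal sketch of the prior expected number of clusters in a Chinese Restaurant Process, E[K] = Σ_{i=0}^{n-1} α₀ / (α₀ + i). It is not BnpC code. It assumes that the prior mean of α₀ is the product of the two values passed to -ap (a shape/scale reading of the Gamma prior, consistent with the remark about the product above but not confirmed here), and it fixes α₀ at that mean instead of integrating over the prior. The function name `expected_clusters` is made up for this illustration.

```python
import math


def expected_clusters(alpha: float, n_cells: int) -> float:
    """Prior expected number of clusters in a CRP with concentration alpha.

    Cell i (0-based) opens a new cluster with probability alpha / (alpha + i),
    so the expectation is the sum of these probabilities.
    """
    return sum(alpha / (alpha + i) for i in range(n_cells))


n_cells = 1000

# Assumption: prior mean of alpha_0 = product of the two -ap values.
default_alpha = math.sqrt(n_cells) * 1.0  # default [(#cells)^(1/2), 1]
low_alpha = 1.0 * 1.0                     # suggested [1, 1]

print("default prior:", round(expected_clusters(default_alpha, n_cells), 1))
print("[1, 1] prior: ", round(expected_clusters(low_alpha, n_cells), 1))
```

This only describes the prior. The number of clusters actually reported also depends on the data, so a lower prior makes fewer clusters more likely but does not guarantee them.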

jeffchen2000 · 2023-01-23T17:55:21Z

hi NBMueller

Thanks for your comments. When I set "-ap 1 1", the number of resulting clusters was sometimes reduced and sometimes not.
I realized that the random seed makes a huge difference.

I then modified your input table, labeling each cell with its sample ID so that I know which sample each cell comes from after clustering.

This lets me generate a summary table (based on the BnpC output) in which each row is a cluster ID, each column is a sample, and each field is the number of cells per sample per cluster.
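
A minimal sketch of such a summary table, using only the Python standard library. It assumes each cell name starts with its sample ID followed by an underscore (e.g. `S1_cell7`) and that the cluster assignments are available as a list in the same order as the cells. Both the naming convention and the function `cluster_by_sample_table` are hypothetical and do not reflect the actual BnpC output format.

```python
from collections import Counter


def cluster_by_sample_table(cell_names, cluster_ids, sep="_"):
    """Count cells per (cluster, sample).

    Returns a nested dict: table[cluster][sample] -> number of cells.
    The sample ID is taken as the part of the cell name before `sep`.
    """
    if len(cell_names) != len(cluster_ids):
        raise ValueError("cell_names and cluster_ids must have equal length")
    samples = [name.split(sep, 1)[0] for name in cell_names]
    counts = Counter(zip(cluster_ids, samples))
    all_samples = sorted(set(samples))
    return {
        cluster: {s: counts.get((cluster, s), 0) for s in all_samples}
        for cluster in sorted(set(cluster_ids))
    }


# Toy input: five cells from two samples, assigned to two clusters.
cells = ["S1_c1", "S1_c2", "S2_c1", "S2_c2", "S2_c3"]
clusters = [0, 0, 0, 1, 1]

table = cluster_by_sample_table(cells, clusters)
for cluster, row in table.items():
    print(cluster, row)
```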

Based on this summary table, I can draw PCA plots in which each dot represents a sample, so the plots show the closeness/relationship between samples. It turned out that the PCA plots can be very different for different random seeds (which means the results have not converged, if I understood correctly). How can I solve this issue? Is there any parameter I can modify?

thanks in advance

Jeff C
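
One way to quantify how much two runs with different random seeds disagree, independent of the PCA step, is to compare their cell-to-cluster assignments directly. The sketch below computes the adjusted Rand index (ARI) between two assignment lists: 1.0 means the same partition (cluster labels may be permuted), and values near 0 mean agreement no better than chance. This is a suggested diagnostic and not a BnpC feature; the function name and the toy assignments are made up for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same cells")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs of cells to disagree on

    # Number of cell pairs that share a cluster in both, in A, and in B.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both runs are all singletons
    return (pairs_both - expected) / (maximum - expected)


# Toy assignments from two hypothetical runs with different seeds.
run_seed_1 = [0, 0, 1, 1, 2, 2]
run_seed_2 = [1, 1, 0, 0, 2, 2]  # same partition, different label names

print(adjusted_rand_index(run_seed_1, run_seed_2))
```

If the ARI between seeds stays low, the differences are already present in the cluster assignments themselves and are not introduced by the summary table or the PCA.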

NBMueller self-assigned this Oct 25, 2022

NBMueller added the "question" label (Further information is requested) Oct 25, 2022
