Question: how to improve cell barcode correction? #74

Open
AnnaAMonaco opened this issue Jun 29, 2022 · 2 comments

@AnnaAMonaco

Hi,
I have been having trouble recovering as many cell barcodes after correction with salmon alevin as with Cell Ranger. Maybe alevin-fry can help with this?

What I have

  • scRNA-seq reads from the 10X kit v3.1 from interspecific hybrid embryos
  • A diploid reference transcriptome
  • A preliminary alignment of the scRNA-seq reads to just one species with CellRanger

What I need
I have been running salmon alevin to get alignments and do some allele-specific analysis in a single-cell setting. I already use a whitelist containing only the cell barcodes that are passed as true cells by CellRanger (~80% of the total). This is the generic line I usually run:
salmon alevin -lISR -1 $Read1 -2 $Read2 --chromiumV3 -i $index -p 12 -o $outDir --tgMap $tsv --whitelist $whitelist --numCellBootstraps 20 --dumpFeatures

What the issue is
This gives me only a 37% mapping rate (~70% is expected, based on running salmon pseudobulk on the data), and troubleshooting shows that a large number of reads are discarded because of "noisy barcodes". From alevin_meta_info.json:
"total_reads": 82603614, "reads_with_N": 0, "noisy_cb_reads": 35104396, "noisy_umi_reads": 7532

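As a quick sanity check on these counts, the share of reads lost to noisy barcodes can be computed directly from the JSON. The snippet below is a minimal sketch that hard-codes the four values quoted above; the helper name `discarded_fraction` is illustrative and not part of salmon or alevin-fry.

```python
import json

# Counts copied from the alevin_meta_info.json excerpt quoted above.
meta = json.loads(
    '{"total_reads": 82603614, "reads_with_N": 0, '
    '"noisy_cb_reads": 35104396, "noisy_umi_reads": 7532}'
)


def discarded_fraction(meta, key):
    """Fraction of all reads that are counted under `key`."""
    return meta[key] / meta["total_reads"]


noisy_cb = discarded_fraction(meta, "noisy_cb_reads")
noisy_umi = discarded_fraction(meta, "noisy_umi_reads")

print(f"noisy cell-barcode reads: {noisy_cb:.1%}")
print(f"noisy UMI reads:          {noisy_umi:.3%}")
```

With these numbers, a bit over 42% of all reads are dropped at the cell barcode stage, while noisy UMIs are negligible, so the barcode step alone accounts for a large part of the gap to the expected mapping rate.
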
Question
From my understanding, alevin-fry could help with cell barcode correction, so I tried following the docs for "generate-permit-list", but I still have some questions.

First, I am actually having trouble generating the RAD directory that the --rad and --sketch options should produce. Running salmon alevin as above with these two options added -- either individually or together -- generated a "map.rad" file that alevin-fry doesn't take as input.

$ alevin-fry generate-permit-list --input map.rad --output-dir $outDir --expected-ori either --valid-bc $whitelist
error: Invalid value "map.rad" for '--input <INPUT>': No valid directory was found at this path.

For more information try --help

This brings me to my second question: the above code would take my whitelist that contains barcodes I already know are true cells and correct against it, right? But in theory salmon alevin also does this in my quantification step. I know the other option is to use a list of all available barcodes and change the --min-reads threshold, but is this actually better than knowing which barcodes are true cells? Why not set this true whitelist as --unfiltered-pl and then --min-reads 10?

I hope I was clear enough but I would obviously be happy to elaborate on any unclear part or anything I might have left out :)

Cheers,
Anna

@rob-p
Contributor

rob-p commented Jun 29, 2022

Hi @AnnaAMonaco,

Thanks for the detailed report, and we're happy to help! To your first question:

First, I am actually having trouble generating the RAD directory that the --rad and --sketch options should produce. Running salmon alevin as above with these two options added -- either individually or together -- generated a "map.rad" file that alevin-fry doesn't take as input.

  • The issue here is that the argument to -i (--input) should be the folder containing map.rad, not the map.rad file itself. The reason is that the output folder of salmon alevin also contains other information (e.g. the number of unmapped reads corresponding to each barcode) that is used subsequently by alevin-fry.
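
The fix above can be sketched as a small wrapper that accepts either the directory or the map.rad file and always hands alevin-fry the directory. This is only an illustration: `resolve_rad_dir` and `permit_list_command` are hypothetical helper names, and the flags are exactly the ones used in the command earlier in this thread.

```python
from pathlib import Path


def resolve_rad_dir(path):
    """Return the directory to pass to alevin-fry's --input.

    Accepts either the salmon alevin output directory or the map.rad file
    inside it; in the latter case the parent directory is returned.
    """
    p = Path(path)
    rad_dir = p.parent if p.name == "map.rad" else p
    if not (rad_dir / "map.rad").is_file():
        raise FileNotFoundError(f"no map.rad found in {rad_dir}")
    return rad_dir


def permit_list_command(rad_path, out_dir, whitelist):
    """Build the generate-permit-list call with a directory as --input."""
    return [
        "alevin-fry", "generate-permit-list",
        "--input", str(resolve_rad_dir(rad_path)),
        "--output-dir", str(out_dir),
        "--expected-ori", "either",
        "--valid-bc", str(whitelist),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`; building it as a list avoids shell quoting issues with paths that contain spaces.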

This brings me to my second question: the above code would take my whitelist that contains barcodes I already know are true cells and correct against it, right? But in theory salmon alevin also does this in my quantification step. I know the other option is to use a list of all available barcodes and change the --min-reads threshold, but is this actually better than knowing which barcodes are true cells? Why not set this true whitelist as --unfiltered-pl and then --min-reads 10?

  • This is a great question. There are several strategies in alevin-fry for generating a permit list. The --unfiltered-pl option is meant to mimic (but potentially improve upon) what existing tools like CellRanger do for protocols (like Chromium) where the list of potential valid barcodes is known. The idea is that you may want a comprehensive (or near-comprehensive) set of barcodes in your output, so that you can use a method like EmptyDrops downstream to apply a statistical filter that separates "high quality" cells from likely empty droplets. The alternative behavior you mention (i.e. where you have a list of known true barcodes against which you want to correct) is also supported in alevin-fry via the command line parameter --valid-bc. Here, you provide a list of known-valid barcodes (not a list of potential barcodes) and then all observed barcodes are corrected against this list (corrected if they are within 1-edit and the correction is unambiguous). There are other modes like --knee-distance, --force-cells and --expect-cells too; you can read about them all here.
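
To make the "within 1-edit and unambiguous" rule concrete, here is a minimal sketch of that kind of correction against a known-valid list. It is not alevin-fry's implementation: it assumes that an edit means a single substitution (no insertions or deletions), and the function names and toy barcodes are made up for illustration.

```python
def one_substitution_neighbors(barcode, alphabet="ACGT"):
    """Yield every sequence that differs from `barcode` at exactly one position."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def correct_barcode(observed, valid):
    """Return the corrected barcode, or None if it cannot be corrected.

    An exact match is kept as-is. Otherwise the barcode is corrected only
    when exactly one valid barcode lies one substitution away (assumption:
    substitutions only).
    """
    if observed in valid:
        return observed
    hits = {n for n in one_substitution_neighbors(observed) if n in valid}
    return hits.pop() if len(hits) == 1 else None


valid = {"AAAA", "CCCC", "AATT"}
print(correct_barcode("AAAA", valid))  # exact match, kept
print(correct_barcode("AAAC", valid))  # one mismatch from AAAA only, corrected
print(correct_barcode("AAAT", valid))  # one mismatch from AAAA and AATT, ambiguous
print(correct_barcode("GGTT", valid))  # at least two mismatches from every entry
```

Under this rule, an observed barcode that is two or more mismatches from every valid entry, or equally close to two of them, is left uncorrected regardless of how many reads carry it.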

Thanks!
Rob

@AnnaAMonaco
Author

Thanks for the reply!

When it comes to my second question, I was wondering more about what the way to go would be when I know the valid barcodes, but providing this whitelist still leaves me with many noisy barcodes.

Here, you provide a list of known-valid barcodes (not a list of potential barcodes) and then all observed barcodes are corrected against this list (corrected if they are within 1-edit and the correction is unambiguous)

This makes me think that maybe many of these barcodes have more than 1 mismatch? So maybe giving it a list of potential barcodes and working with the min-reads threshold could help. But I was really wondering whether I could run something like --unfiltered-pl $whitelist --min-reads 10, so that it only takes barcodes with at least 10 reads from the valid list. Would this have a negative impact on barcode recovery, because I am restricting the BCs to keep to those found multiple times (which are the ones I am interested in at the end of the day anyway), or would it have a positive impact, because maybe --unfiltered-pl allows for more mismatches in the sequenced barcodes?

I hope my question makes sense :)

Anna
