salmon quantmerge skipped the nucleotide IDs that have multiple sequences - Metagenome dataset #910

Open
jiazhou0116 opened this issue Jan 30, 2024 · 2 comments

Comments

@jiazhou0116

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
The issue existed in both bulk and single-cell mode

Describe the bug
When using Salmon to quantify non-redundant (NR) genes in metagenomic datasets, the generated output is missing a summary for nucleotide IDs that correspond to multiple sequences.

To Reproduce
Steps and data to reproduce the behavior:

  1. Merge the quantifications with salmon quantmerge:
    salmon quantmerge \
    --quants temp/salmon/L1EHI0900465--Q_S1_N6.quant \
    -o result/salmon/gene_L1EHI0900465--Q_S1_N6.TPM
  2. Search for a specific gene ID in the quantification file:
    grep "k141_1346622_1" temp/salmon/L1EHI0900465--Q_S1_N6.quant/quant.sf
    # Multiple lines are found for this gene ID
  3. Search for the same gene ID in the resulting TPM file:
    grep "k141_1346622_1" result/salmon/gene_L1EHI0900465--Q_S1_N6.TPM
    # No results are found, which is unexpected
[Two screenshots attached: "Screenshot 2024-01-30 21 56 23" and "Screenshot 2024-01-30 21 55 28"]

Specifically, please provide at least the following information:

  • Which version of salmon was used? salmon 1.4.0
  • How was salmon installed (compiled, downloaded executable, through bioconda)? conda install salmon -y
  • Which reference (e.g. transcriptome) was used? metagenome data
  • Which read files were used? L1EHI0900465--Q_S1_N6.quant/
  • Which program options were used?
    salmon quantmerge \
    --quants temp/salmon/L1EHI0900465--Q_S1_N6.quant \
    -o result/salmon/gene_L1EHI0900465--Q_S1_N6.TPM

Expected behavior
I would like to keep all gene IDs and, for those that appear on more than one line, take the average of the values for each gene ID.

@jiazhou0116

Updated Expected behavior:
I aim to retain all gene IDs, and for those represented by multiple lines, I intend to calculate the sum of values for each unique gene ID.

I came across a few posts about this issue, but have not yet found a good solution for salmon quantmerge.
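The updated expectation (one summed value per unique gene ID) can be sketched as a post-processing step on quant.sf. This is only an illustration, not a feature of salmon quantmerge: the function name `sum_by_name` and the example rows are hypothetical, and the sketch assumes a tab-separated quant.sf with the column names shown in the issue (Name, Length, EffectiveLength, TPM, NumReads).

```python
import csv
import io
from collections import defaultdict


def sum_by_name(quant_sf_text, column="TPM"):
    """Sum one numeric column over all rows that share the same Name.

    quant_sf_text: contents of a tab-separated quant.sf file (assumed layout).
    column: header of the column to sum, e.g. "TPM" or "NumReads".
    """
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(quant_sf_text), delimiter="\t")
    for row in reader:
        totals[row["Name"]] += float(row[column])
    return dict(totals)


# Hypothetical rows for illustration only; they are not taken from the dataset.
example = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "geneA\t534\t216.520\t1.5\t10.0\n"
    "geneA\t384\t99.234\t2.5\t4.0\n"
    "geneB\t333\t73.044\t0.0\t0.0\n"
)

if __name__ == "__main__":
    for name, value in sum_by_name(example).items():
        print(name, value)
```

To apply it to a real file, the text could be read with `open(path).read()` and passed in. Whether summing TPM across entries that share an ID is appropriate depends on the downstream analysis.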

@jiazhou0116 changed the title from "salmon quantmerge skipped the nucleotide IDs that have multiple sequences" to "salmon quantmerge skipped the nucleotide IDs that have multiple sequences - Metagenome dataset" on Jan 31, 2024
@jiazhou0116

In 2018, issue #214 suggested --keepDuplicates for dealing with duplicate transcripts. The FAQ at https://combine-lab.github.io/salmon/faq/ also says: "If you really want to go through with quantification of sequence duplicates. You can pass --keepDuplicates to the salmon indexing command. This will tell salmon not to discard these duplicates, and they will appear in the output quantifications." As I understand it, however, that option is for sequence-identical duplicates. In our case the sequences and their full annotations are different, but the shortened gene ID (the part before the first "#") can be identical for multiple sequences.
e.g.,

k97_3_1 # 1 # 534 # 1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.672
k97_3_1
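One way to check how many headers collapse to the same short ID is sketched below. It assumes, as the example above suggests, that the short ID is the header text up to the first whitespace. The function names and the second and third headers in the demo are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def short_id(header):
    """Return the first whitespace-delimited token of a FASTA header line."""
    return header.lstrip(">").split()[0]


def colliding_ids(headers):
    """Map each short ID that occurs more than once to its number of occurrences."""
    counts = Counter(short_id(h) for h in headers)
    return {name: n for name, n in counts.items() if n > 1}


# The first header follows the format shown in the issue; the others are made up.
demo_headers = [
    ">k97_3_1 # 1 # 534 # 1 # ID=2_1;partial=11;start_type=Edge",
    ">k97_3_1 # 2 # 400 # -1 # ID=9_1;partial=00;start_type=ATG",
    ">k97_5_1 # 1 # 384 # 1 # ID=3_1;partial=11;start_type=Edge",
]

if __name__ == "__main__":
    print(colliding_ids(demo_headers))
```

Running this over the header lines of the reference FASTA would show up front which IDs are at risk of being dropped at the merge step.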

  1. After the salmon quant step, the gene IDs are shortened, but all entries are kept, even when the same gene ID appears with different lengths, etc.
    Name Length EffectiveLength TPM NumReads
    k97_3_1 534 216.520 0.000000 0.000
    k97_5_1 384 99.234 0.000000 0.000
    k97_6_1 333 73.044 0.000000 0.000
    k97_9_1 387 101.041 0.000000 0.000
  2. However, at the salmon quantmerge step, the gene IDs with multiple sequences are removed.
    Name NP1.clean.quant
    k141_743617_3 0
    k141_742060_5 0
    k141_910930_3 0.015907
    k141_1078715_3 0
    k141_527785_4 0
    This causes the dataset to lose a large share of its gene information, since genes with multiple sequences may play important biological roles. I think I need to keep all the genes by relabeling those that have multiple sequences, numbering them in order. I am not sure whether this is something I can do through salmon quantmerge.
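The relabeling idea described above could look roughly like the following sketch, which numbers repeated IDs in order of appearance so that every name becomes unique. It is not part of salmon: the function name `relabel_duplicates` and the `__` separator are hypothetical choices, and the renaming would have to be applied outside quantmerge, for example to the Name column of each quant.sf or to the FASTA headers before indexing.

```python
from collections import Counter


def relabel_duplicates(names, sep="__"):
    """Append an ordinal suffix to every name that occurs more than once.

    Names that occur exactly once are returned unchanged; repeated names
    become name__1, name__2, ... in their original order.
    """
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how often each name has been seen so far
    relabeled = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            relabeled.append(f"{name}{sep}{seen[name]}")
        else:
            relabeled.append(name)
    return relabeled


if __name__ == "__main__":
    # IDs in the style used in the issue; the repetition here is illustrative.
    print(relabel_duplicates(["k97_3_1", "k97_5_1", "k97_3_1", "k97_6_1"]))
```

The numbering depends on row order, so the same ordering would need to hold across all samples for the merged columns to line up. The separator should also be one that does not already occur in any existing ID.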
