Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kallisto total UMI difference between velocity and novelocity modes. (single cell) #419

Open
mortunco opened this issue Dec 11, 2023 · 4 comments

Comments

@mortunco
Copy link

Hello,

Thank you for keeping kallisto alive. These days, more and more I realise how its actually important and hard maintain and update tools across many years.

Background:

I am processing same 10x 5' v2 chemistry sample fastqs with kallisto. I found great number of total UMI difference between velocity and nonvelocity runs. I shared the commands i used to run the scripts

Velocity mode

conda activate ~/mambaforge/envs/kallisto
KALLISTO_INDEX_DIR="foo/bar/kallisto-velocity-index" ### downloaded with kb ref -d human --workflow nac
kb count --h5ad -i $KALLISTO_INDEX_DIR/index.idx -g $KALLISTO_INDEX_DIR/t2g.txt -x 10xv2 --workflow nac --h5ad -o kallisto \
        -c1 $KALLISTO_INDEX_DIR/spliced_t2c.txt -c2 $KALLISTO_INDEX_DIR/unspliced_t2c.txt --filter bustools \
            SRR13806026_S1_R1_001.fastq \
                SRR13806026_S1_R2_001.fastq

velocity ref

$ ls kallisto-velocity-index
index.idx  spliced_t2c.txt  t2g.txt  unspliced_t2c.txt

Nonvelocity mode

KALLISTO_INDEX_DIR="foo/bar/kallisto-prebuilt-index" ### downloaded with kb ref -d human --workflow standard
kb count --h5ad -i $KALLISTO_INDEX_DIR/index.idx -g $KALLISTO_INDEX_DIR/t2g.txt -x 10xv2 --workflow standard --filter bustools -o kallisto_novelo_prebuilt SRR13806026_S1_R1_001.fastq SRR13806026_S1_R2_001.fastq

novelocity ref

$ ls kallisto-prebuilt-index
human_index_standard.tar.xz  index.idx  t2g.txt

Attempts

I calculated total UMI counts using the following lines. For this question please ignore the other kallisto runs. Although I used them trying to figure out whats going on, its not related to this question. Here you can see the total UMI difference between velocity and nonvelocity_prebuilt runs. One important note here is that I added ambigous + spliced and unspliced to velocity.

kallisto_novelocity_prebuilt=ad.read_h5ad("foo/bar/sample/kallisto_novelo_prebuilt/counts_filtered/adata.h5ad")
kallisto_novelocity_prebuilt.var_names_make_unique()

kallisto_velocity=ad.read_h5ad("foo/bar/sample/kallisto_velocity/counts_filtered/adata.h5ad")
kallisto_velocity.var_names_make_unique()

total_umi["kallisto_velocity"]= kallisto_velocity.layers["spliced"].sum().sum() + kallisto_velocity.layers["ambiguous"].sum().sum() + kallisto_velocity.layers["unspliced"].sum().sum()
total_umi["kallisto_novelocity_prebuilt"]= kallisto_novelocity_prebuilt.X.sum().sum()

image

In other comparison, you may see that even we add spliced+ambigous it still does not add up to velocity.

print(f' kalisto velocty spliced: {kallisto_velocity[list(common_barcodes),:].layers["spliced"].sum().sum()}')
print(f' kalisto velocty ambigous + spliced: {kallisto_velocity[list(common_barcodes),:].layers["spliced"].sum().sum() + kallisto_velocity[list(common_barcodes),:].layers["ambiguous"].sum().sum()}')
print(f' kalisto noveloctiy: {kallisto_novelocity_prebuilt[list(common_barcodes),:].X.sum().sum()}')

kalisto velocty spliced: 216012.0
kalisto velocty ambigous + spliced: 781301.0
kalisto noveloctiy: 905870.0

Questions

  1. It is clearly the impact of index + workflow. But in terms of good practice could you please show me guidance. I would like to run kallisto only once and use spliced+unspliced counts for standard sc analysis calculation and their ratio for the velocity/trajectory analysis.

Thank you very much for your time and patience,

Best regards,

Tunc.

@Yenaled
Copy link
Collaborator

Yenaled commented Dec 12, 2023

Hello! I just looked at the dataset -- for some reason, there seems to be strand-bias [causing mapping rate < 30%] under the default 10x settings (forward-stranded) so set --strand=unstranded.

As for your actual question: Why do mature+ambiguous gives fewer UMIs than just standardly mapping to the transcriptome (while discarding out-of-transcriptome reads)? This is because of UMI deduplication. In standard index, the same UMI sequences may simply map to a single gene, but in nac index, some of the same UMI sequences may map to multiple places (because there are more things in the index) -- and hence get filtered out. tl;dr it's because of UMI counting.

And yes, you want to sum ambiguous+mature.

@mortunco
Copy link
Author

Dear Yanaled, I am gonna give --strand=unstranded a try. Thank you very much for the hasty reply. I really appreciate it.

@NikTuzov
Copy link

Hello Tunc:

I am using a similar velocity workflow and I was wondering if you could please answer my question here:
pachterlab/kb_python#228 (comment)

@Yenaled
Copy link
Collaborator

Yenaled commented Jan 4, 2024

@mortunco I just released a new version of kb-python (version 0.28.1) -- it should cause less UMI differences between the standard index and the nac index.

Previously, if there was an exon/intron overlap of two different genes, the intron was being prioritized (which caused issues since a UMI could map to many introns). Now, the exon is being prioritized and, from what I tested, there is very high agreement between the standard index and the nac index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants