TPM values depend on Combination #282

Open
keksundso opened this issue Sep 19, 2023 · 2 comments

@keksundso

Using sleuth, I can calculate the TPM value of a gene in a sample (sleuth_prep followed by sleuth_to_matrix).

Now, if I have 6 samples (6 abundance tables from kallisto), two for each of the three conditions P, G and H, I would expect to get one TPM value per gene per sample.

However, the TPM value for gene i in sample j is not fixed; it varies depending on which combination of conditions goes into sleuth_prep.

E.g. gene i in P1 has a different TPM value when conditions P and G go into sleuth_prep than when conditions P and H do. See the minimal example below:

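library(sleuth)
library(readr)  # provides read_delim()

# transcript2genePath and numberOfCores are assumed to be defined elsewhere in the session.
getSleuthObj <- function(s2c){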
    transcript2gene <- read_delim(transcript2genePath, delim = "\t", col_names=c("target_id","ens_gene","ext_gene"),show_col_types = FALSE )
    sleuth.obj <- sleuth_prep(sample_to_covariates = s2c, 
                              target_mapping = transcript2gene, 
                              extra_bootstrap_summary = TRUE,
                              read_bootstrap_tpm = TRUE, 
                              aggregation_column = 'ens_gene',
                              num_cores = numberOfCores,
                              gene_mode = TRUE
    )
    
    tpms <- sleuth_to_matrix(sleuth.obj, "obs_norm", "tpm")
    
    tpms <- as.data.frame(tpms)
    tpms$ens_gene <- rownames(tpms)
    tpms$ext_gene <- sleuth.obj$target_mapping$ext_gene[match(tpms$ens_gene, sleuth.obj$target_mapping$ens_gene)]
    rownames(tpms) <- NULL
    
    return(tpms)
    }

sleuth_PG <- getSleuthObj(rbind( 
    data.frame(sample = c("P1","P2"),
               condition = "P",
               path = c("/home/keks/app/data/P1","/home/keks/app/data/P2"),
               stringsAsFactors = FALSE)
    ,
    data.frame(sample = c("G5","G6"),
               condition = "G",
               path = c("/home/keks/app/data/G5","/home/keks/app/data/G6"),
               stringsAsFactors = FALSE)
))
sleuth_PH <- getSleuthObj(rbind( 
    data.frame(sample = c("P1","P2"),
               condition = "P",
               path = c("/home/keks/app/data/P1","/home/keks/app/data/P2"),
               stringsAsFactors = FALSE)
    ,
    data.frame(sample = c("H3","H4"),
               condition = "H",
               path = c("/home/keks/app/data/H3","/home/keks/app/data/H4"),
               stringsAsFactors = FALSE)
)) 
@mschilli87
Contributor

I didn't take a deep look, but my first guess would be that, depending on your input, a different set of transcripts passes sleuth's internal filters, resulting in a different total number of reads and features being used as the basis for the TPM normalization. If you have fewer features, all remaining features will end up with higher values because TPMs always sum to 1 million. So maybe you can get the behaviour you want by manually overriding the filter settings.
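
To make the arithmetic behind this guess concrete, here is a minimal sketch in base R. It only illustrates how renormalizing to one million after dropping features inflates the remaining values; it does not reproduce sleuth's internals. The transcript names, the numbers, and the helper `to_tpm` are hypothetical.

```r
# Hypothetical per-transcript rates (think est_counts / eff_length); not real data
rate <- c(t1 = 10, t2 = 30, t3 = 60, t4 = 100)

# Hypothetical helper: rescale a vector so that it sums to one million
to_tpm <- function(x) x / sum(x) * 1e6

# TPM computed over all four transcripts
tpm_all <- to_tpm(rate)

# TPM recomputed after a filter has dropped t4
tpm_filtered <- to_tpm(rate[c("t1", "t2", "t3")])

# Ratio of the filtered to the unfiltered values for the surviving transcripts
scale_factor <- tpm_filtered / tpm_all[names(tpm_filtered)]
print(scale_factor)
```

In this toy case every surviving transcript is scaled by the same factor, which is the kind of proportional shift one would expect if the set of features entering the normalization differs between runs.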

@keksundso
Author

This sounds reasonable, especially since the variation is a constant shift proportional to the TPM values, as would be expected from a different total read number.
I replaced the filter function with a custom one that should not filter anything:

myFilter <- function (row, min_reads = 0, min_prop = 0) 
{
    mean(row >= min_reads) >= min_prop
}

sleuth.obj <- sleuth_prep(sample_to_covariates = s2c, 
                           target_mapping = transcript2gene, 
                           extra_bootstrap_summary = TRUE,
                           read_bootstrap_tpm = TRUE, 
                           aggregation_column = 'ens_gene',
                           num_cores = numberOfCores,
                           gene_mode = TRUE,
                           filter_fun = myFilter
)

As expected, the same number of targets and genes now pass the filter in both combinations.
However, the difference in the TPM values between the two combinations persists.
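
One way to quantify the proportional shift mentioned earlier in this comment is to compare the P1 column of the two result tables gene by gene. The sketch below uses small hypothetical data frames as stand-ins for the `sleuth_PG` and `sleuth_PH` tables returned by `getSleuthObj`; the gene IDs and values are invented for illustration.

```r
# Hypothetical stand-ins for the tables returned by getSleuthObj(); not real data
sleuth_PG <- data.frame(ens_gene = c("g1", "g2", "g3"),
                        P1 = c(10, 200, 3000))
sleuth_PH <- data.frame(ens_gene = c("g3", "g1", "g2"),
                        P1 = c(3300, 11, 220))

# Align the two runs by gene ID, keeping only the shared sample P1
shared <- merge(sleuth_PG[, c("ens_gene", "P1")],
                sleuth_PH[, c("ens_gene", "P1")],
                by = "ens_gene", suffixes = c("_PG", "_PH"))

# Restrict to genes with non-zero TPM in both runs to avoid dividing by zero
expressed <- shared$P1_PG > 0 & shared$P1_PH > 0

# Per-gene ratio of the P1 values between the two runs
ratio <- shared$P1_PH[expressed] / shared$P1_PG[expressed]
print(summary(ratio))
```

If the ratios are essentially identical across genes, the discrepancy behaves like a single per-sample scaling factor; a wide spread would instead point to gene-specific differences between the two runs.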
