Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing reads when transitioning from Dada2 to phyloseq #1933

Open
GingerMATE opened this issue Apr 23, 2024 · 3 comments
Open

Missing reads when transitioning from Dada2 to phyloseq #1933

GingerMATE opened this issue Apr 23, 2024 · 3 comments

Comments

@GingerMATE
Copy link

GingerMATE commented Apr 23, 2024

Hello,

im encountering a similar issue to related to my sample reads. After completing the Dada2 pipeline according to the tutorial, i created a phyloseq object, but in doing so i loose reads in almost all my samples. Some of my samples loose over 50%.

Here is an excerpt from my track_reads file with added entry for when i create my phyloseq object:

sample | input | filtered | dadaF | dadaR | merged | nonchim | finalPercReadsKept % | reads physeq obj
ArcB1 | 82650 | 47038 | 45576 | 46265 | 43855 | 42770 | 51.7 | 42770
AuB1 | 90559 | 56066 | 54249 | 55233 | 52257 | 50278 | 55.5 | 32479
AuB2 | 100718 | 62615 | 59976 | 60933 | 57438 | 55009 | 54.6 | 53549
AuB3 | 137090 | 81540 | 80356 | 80983 | 79213 | 76150 | 55.5 | 27221

There is a large iscrepancy between the seqtab. nochim reads and the phyloseq object.

This is the code i use to create my seqtab.nochim file in the pipeline:

seqtab <- makeSequenceTable(mergers)

output_directory="/Analysis/Pipeline"

saveRDS(mergers, "/Analysis/RData/mergers.rds")

save.image(file.path(output_directory, "mergers.RData"))

#Inspect the merger data.frame from the first sample

head(mergers)

class(mergers)

dim(seqtab)

#Inspect distribution of sequence lengths

table(nchar(getSequences(seqtab)))

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE) # minFoldParentOverAbundance = 8, multithread= parallel::detectCores(),

table(nchar(getSequences(seqtab.nochim)))

dim(seqtab.nochim)

sum(seqtab.nochim)/sum(seqtab)

saveRDS(seqtab, "/Analysis/RData/seqtab.rds")

saveRDS(seqtab.nochim, "/Analysis/RData/seqtab_nochim.rds")

And here is the code i use to create my phyloseq object:

Read sample Metadata and sample names

sample_metadata <- read.table("dna_metadata.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
rownames(sample_metadata) <- sample_metadata$sample_id
head(sample_metadata)
Read asv and taxa files

asv_counts_table <- readRDS("seqtab_nochim.rds")
asv_seqs <- colnames(asv_counts_table)
asv_headers <- vector(dim(asv_counts_table)[2], mode="character")
for (i in 1:dim(asv_counts_table)[2]) {
asv_headers[i] <- paste("ASV", i, sep="_")
}
colnames(asv_counts_table) <- asv_headers
head(asv_counts_table)
class(asv_counts_table)
sample_id <- sample_metadata$sample_id
rownames(asv_counts_table) <- sample_id
load taxa file

taxa_with_species <- read.table("taxa_.csv", header=TRUE, row.names=1, sep=",", stringsAsFactors=FALSE)
head(taxa_with_species)
class(taxa_with_species)
taxa_with_species <- as.matrix(taxa_with_species)
Create phyloseq object

physeq_ps137 <- phyloseq(otu_table(asv_counts_table, taxa_are_rows=FALSE),
tax_table(taxa_with_species), sample_data(sample_metadata))

physeq_ps137
readcount(physeq_ps137)

Any idea what could cause the reads to disappear? Or am i using the readcount function wrong? Thanks in advance!

@benjjneb
Copy link
Owner

I'm not sure what is defining the readcount function, but it is not a phyloseq function.

If you are trying to get the total abundances (numbers of reads) in each sample from a phyloseq object, the appropriate function to use is sample_sums.

@GingerMATE
Copy link
Author

GingerMATE commented Apr 24, 2024

I tried sample_sums to check the reads, but it gave me the same result.

Create phyloseq object

physeq_ps137 <- phyloseq(otu_table(asv_counts_table, taxa_are_rows=TRUE), 
                         tax_table(taxa_with_species), sample_data(sample_metadata)) 

physeq_ps137
sample_sums(physeq_ps137)
readcount(physeq_ps137)

ArcB1 AuP1 AuB1 AuB2 AuB3
42770 45601 32479 53549 27221

ArcB1 AuP1 AuB1 AuB2 AuB3
42770 45601 32479 53549 27221

The problematic step is between the seqtab.nochim and the physeq object. I assigend my taxa with the tryRC=TRUE argument in the assignTaxonomy function. Do you think that this could have an affect on the number of reads?

Here is the code i used to assign the taxonomy:

Assign taxonomy

unite.ref <- "DNA_results/Dada2/silva_nr99_v138.1_train_set.fa.gz"  # CHANGE ME to location on your machine

seqtab.nochim <- readRDS("DNA_results/seqtab_nochim.rds")

taxa <- assignTaxonomy(seqtab.nochim, unite.ref, multithread = parallel::detectCores(), tryRC = TRUE)

taxa <- addSpecies(taxa, "DNA_results/Dada2/silva_species_assignment_v138.1.fa.gz")

taxa.print <- taxa  # Removing sequence rownames for display only

rownames(taxa.print) <- NULL

head(taxa.print)

write.table (taxa, file ="DNA_results/Dada2/taxa_PS137_general_def_tryRCTRUE.csv", row.names = T, sep = "\t")

asv_seqs <- rownames(taxa_table)

asv_names <- character(nrow(taxa_table))

for (i in 1:nrow(taxa_table)) {
  asv_names[i] <- paste("ASV", i, sep="_")
}

rownames(taxa_table) <- asv_names

write.table(taxa_table, "DNA_results/Dada2/taxa_ps137_tryRCTRUE", sep="\t", quote=F, col.names=T)

@benjjneb
Copy link
Owner

I would suggest trying to construct the phyloseq object directly from the original sequence table and tax table from DADA2 (i.e. that contain the full DNA sequences as IDs), and checking if that solves the change in read counts issue.

It is simpler and less error prone to then switch to "ASV1" style IDs in the phyloseq object. Code for this is included in the DADA2 tutorial "Handoff to phyloseq" section:

dna <- Biostrings::DNAStringSet(taxa_names(ps))
names(dna) <- taxa_names(ps)
ps <- merge_phyloseq(ps, dna)
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
ps

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants