Suspiciously high number of low-abundance ASVs (10s of 1000s)? #1951

Open
SJohnsonMayo opened this issue May 10, 2024 · 2 comments

Comments

@SJohnsonMayo

SJohnsonMayo commented May 10, 2024

Hi all,

First, thank you for all the work that you do on maintaining this package, it is truly invaluable for microbiome analysis.

I'm trying to process ~1100 dental plaque samples. These are V3V4, 2x250, with primers removed by the sequencing core prior to delivery to me. I have also confirmed with Cutadapt that the primers are not present in the data, and preprocessing with Fastp does not show substantial numbers of reads with adapter content. I'm able to get all of the data through DADA2 without problems using truncLen = c(240,240). As far as I can tell the data is high quality: reads are generally > Q30 on average, and a majority of reads make it all the way through the pipeline (passing filters, merging, and not being removed as chimeras).

However, when I inspect the final ASV table, there are over 53,000 ASVs across all samples. 45,000 of these are found in exactly 1 sample. By my math, these 45,000 ASVs comprise only ~1% of the total reads. I inspected the taxonomy table and didn't see anything out of the ordinary (i.e., lots of oral-associated taxa).
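For readers who want to reproduce this kind of tally, here is a minimal sketch of the calculation. It is written in plain Python rather than R, and it assumes the ASV table has been exported as a mapping of sample to per-ASV read counts. The toy counts and the function name `single_sample_summary` are made up for illustration and are not taken from this dataset or from DADA2.

```python
# Toy ASV table: sample -> {ASV id -> read count}. All values are invented.
toy_table = {
    "S1": {"ASV1": 500, "ASV2": 300, "ASV3": 2},
    "S2": {"ASV1": 450, "ASV2": 0, "ASV4": 3},
    "S3": {"ASV1": 600, "ASV2": 140, "ASV5": 5},
}


def single_sample_summary(table):
    """Return (# single-sample ASVs, # ASVs observed, fraction of reads in single-sample ASVs)."""
    prevalence = {}  # ASV -> number of samples with a non-zero count
    totals = {}      # ASV -> total reads across all samples
    for counts in table.values():
        for asv, n in counts.items():
            if n > 0:
                prevalence[asv] = prevalence.get(asv, 0) + 1
                totals[asv] = totals.get(asv, 0) + n
    singles = [asv for asv, p in prevalence.items() if p == 1]
    total_reads = sum(totals.values())
    single_reads = sum(totals[asv] for asv in singles)
    return len(singles), len(prevalence), single_reads / total_reads


n_single, n_asvs, read_fraction = single_sample_summary(toy_table)
print(f"{n_single} of {n_asvs} ASVs occur in exactly one sample "
      f"({read_fraction:.1%} of all reads)")
```

In the toy table, three of the five ASVs are single-sample ASVs, yet they hold only 10 of 2,000 reads. That is the same shape as the pattern described above: many ASVs, very few reads.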

When I browsed the other issues pages, many times there were upstream issues such as low quality scores, adapter content, low overlap, etc., that seemed to fix the numbers of ASVs, but as far as I can tell those issues can be ruled out with this dataset.

Is there anything else I can do to diagnose this dataset? Should I just remove these and move on? Any advice will be greatly appreciated!

EDIT: I have also tried running everything using just the R1 reads, but still have the same issue (actually slightly more ASVs this way -- 57,000)

Thanks.

@benjjneb
Owner

There are several scenarios that could lead to what you are observing. One is that this is real biology -- many taxa appear relatively rarely and patchily in oral samples, and you are simply observing this pattern. A second is that, since DNA does not necessarily represent a living organism inhabiting the environment, transient microbial DNA (or even cross-amplification of mito/chloroplast DNA) could lead to something like this in saliva samples. A third possibility is some sort of technical issue with the sampling/sequencing process that is introducing contaminant or non-target reads into the final measurements. A fourth is that there is some sort of issue with the error control/denoising of the data.

Fully disentangling these possibilities can be difficult, but I would start by pulling a sampling of these low abundance 1-sample ASVs and just BLAST-ing them against a broad database like nt. What do they look like? Do they seem to plausibly be oral inhabiting microbes, or could it be something else?
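Pulling a random subset for BLAST could be sketched roughly as follows. This is an illustration in plain Python, assuming the single-sample ASV sequences have already been collected into a set of strings. The function name `sample_to_fasta`, the header prefix, and the toy sequences are hypothetical.

```python
import random


def sample_to_fasta(sequences, k, seed=0):
    """Pick up to k sequences at random and format them as FASTA text."""
    rng = random.Random(seed)  # fixed seed so the same subset can be re-drawn
    pool = sorted(sequences)   # sort first so the draw does not depend on set ordering
    chosen = rng.sample(pool, min(k, len(pool)))
    lines = []
    for i, seq in enumerate(chosen, start=1):
        lines.append(f">single_sample_asv_{i}")  # hypothetical header naming
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Invented stand-ins for real ASV sequences.
toy_sequences = {"ACGTACGTAC", "GGCCGGCCAA", "TTAATTAAGG", "CCGGTTAACC"}
print(sample_to_fasta(toy_sequences, k=2), end="")
```

The resulting text can be saved to a file and submitted to BLAST against a broad database such as nt.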

I would also consider enforcing some basic prevalence/abundance filters on the output that goes into your more complete analysis. It is perfectly valid to focus your analysis on the set of ASVs that are above some minimum abundance/prevalence threshold -- I typically require ASVs to be present in at least 2 samples to keep for most of my final analyses.
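A prevalence filter of this kind might look like the sketch below. It is plain Python, not the DADA2 or phyloseq API, and it assumes the table is a mapping of sample to per-ASV counts. The function name `filter_by_prevalence` and the toy counts are illustrative. The default of 2 samples mirrors the threshold mentioned above.

```python
def filter_by_prevalence(table, min_samples=2):
    """Keep only ASVs with a non-zero count in at least `min_samples` samples."""
    prevalence = {}  # ASV -> number of samples in which it was observed
    for counts in table.values():
        for asv, n in counts.items():
            if n > 0:
                prevalence[asv] = prevalence.get(asv, 0) + 1
    keep = {asv for asv, p in prevalence.items() if p >= min_samples}
    return {
        sample: {asv: n for asv, n in counts.items() if asv in keep}
        for sample, counts in table.items()
    }


# Toy table with invented counts: ASV3, ASV4 and ASV5 each occur in one sample.
toy_table = {
    "S1": {"ASV1": 500, "ASV2": 300, "ASV3": 2},
    "S2": {"ASV1": 450, "ASV2": 0, "ASV4": 3},
    "S3": {"ASV1": 600, "ASV2": 140, "ASV5": 5},
}
filtered = filter_by_prevalence(toy_table, min_samples=2)
print(filtered)
```

Only ASV1 and ASV2 survive here. Raising `min_samples` or adding a minimum total-abundance condition would be a natural extension, depending on the analysis.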

@SJohnsonMayo
Author

Thank you, I am a bit on the fence on whether or not these ASVs are real. I ran Barrnap on all of the ASVs and nearly all of them have better e-values with the bacterial 16S model compared to archaea/mito/euk. Another thing I've noticed is that every sample seems to be contributing roughly equally to this high number. The mean/median sample has 40 of these unique ASVs (and 40 * 1130 ≈ 45,000 1-sample ASVs).
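The per-sample breakdown described here can be sketched as follows. This is plain Python on an invented toy table. The function name `single_sample_asvs_per_sample` is hypothetical, and the last line only rechecks the arithmetic quoted above.

```python
from statistics import median


def single_sample_asvs_per_sample(table):
    """For each sample, count the ASVs that are observed in that sample and nowhere else."""
    prevalence = {}  # ASV -> number of samples with a non-zero count
    for counts in table.values():
        for asv, n in counts.items():
            if n > 0:
                prevalence[asv] = prevalence.get(asv, 0) + 1
    return {
        sample: sum(1 for asv, n in counts.items() if n > 0 and prevalence[asv] == 1)
        for sample, counts in table.items()
    }


# Toy table with invented counts.
toy_table = {
    "S1": {"ASV1": 500, "ASV3": 2},
    "S2": {"ASV1": 450, "ASV4": 3},
    "S3": {"ASV1": 600, "ASV5": 5, "ASV6": 1},
}
per_sample = single_sample_asvs_per_sample(toy_table)
print(per_sample, "median:", median(per_sample.values()))

# Arithmetic from the comment: 40 single-sample ASVs per sample across 1130 samples.
print(40 * 1130)  # 45200, i.e. roughly 45,000
```

A fairly even spread of per-sample counts, as reported here, is one pattern such a tally can show. A few samples contributing most of the single-sample ASVs would point in a different direction.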

BLASTing the high-abundance, high-prevalence ASVs gives results that totally make sense -- i.e., essentially identical matches to 16S sequences from common oral-associated microbes (100% query coverage, low e-value, 99-100% identity).

BLASTing these "unique" ASVs is a bit messier. There will still be the odd perfect match, but I see a lot more sequences with lower query coverage (e.g., 60-90%) and/or lower percent identity (e.g., 80-95%). These imperfect hits do still tend to be oral-associated taxa, though.

I usually do a similar prev/abund filter for my other datasets (mostly fecal), but the sheer magnitude of ASVs in this dataset did catch me off guard a bit!
