interpreting log file for contamination removal #40

Open

rjsorr opened this issue Apr 28, 2022 · 3 comments

Comments

rjsorr commented Apr 28, 2022

Hi @khyox,
I'm trying to understand/interpret the log file so that I can remove contaminants manually based on their taxids. I would like to remove those that have been flagged as "critical". However, it is difficult to work out from the attached log file what these actually are. Searching for "critical" gives 18 hits, at different taxonomic levels. Some of these critical hits are class-level taxids (e.g. Actinobacteria and Gammaproteobacteria), and I cannot imagine that whole classes should be removed from the dataset; I suspect a lower taxonomic rank is actually being flagged, but as it stands that is not easy to see. A separate contamination output file that gives a simple result for interpretation and downstream processing would be a welcome addition.
Recentrifuge17_log.txt

regards

khyox (Owner) commented Apr 30, 2022

Hi @rjsorr,
Contaminants are automatically removed when you use one or more negative controls (with the -c flag), so you don't need to remove contaminants manually when using Recentrifuge. As you have seen, they are removed at different taxonomic levels depending on the contaminants and on the taxonomic level being considered in the analysis. You can identify the samples with contamination removed because they contain the substring _CTRL_ in any of the different outputs that Recentrifuge provides. In some cases, the control samples are so different from the regular samples that the default values for the filters may not be the most appropriate. For such cases, rcf has a couple of flags to manually fine-tune the algorithm parameters:

  -z NUMBER, --ctrlminscore NUMBER
                        minimum score/confidence of the classification of a
                        read in control samples to pass the quality filter; it
                        defaults to "minscore"
  -w INT, --ctrlmintaxa INT
                        minimum taxa to avoid collapsing one level into the
                        parent in control samples (if not specified a value
                        will be automatically assigned)

If you think that your control samples are too noisy, you can increase the values of these parameters to reduce the chances of false positives (false contaminants detected in the negative control samples). Finally, sure, an additional, optional, separate output (beyond the console log) devoted to the contamination removal algorithm would be a welcome addition.
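
To make that concrete, a rough sketch of a run with two negative controls and stricter control filters could look like the lines below. The file names and threshold values are made up; -f and -n are assumed here to be the usual rcf flags for the classifier output files and the taxonomy dump, and -c is assumed to take the number of control samples listed first, so please double-check rcf --help for your version:

  # Sketch only: hypothetical file names and threshold values.
  # Assuming the first two -f files are treated as negative controls (-c 2);
  # -z/--ctrlminscore and -w/--ctrlmintaxa are the control filters listed above.
  rcf -n ./taxdump \
      -f blank_ctrl1.out -f blank_ctrl2.out \
      -f sample_A.out -f sample_B.out \
      -c 2 -z 40 -w 10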

rjsorr (Author) commented May 12, 2022

Sorry for the slow reply @khyox,
The problem with removing read contaminants based on read classification, which I think is what I'm now struggling with, is the reliance on databases and their completeness. I see now that the Gammaproteobacteria flagged as a contaminant in the negative controls is a novel species that cannot be classified to a lower taxonomic level. Its uncertain classification against current databases is causing an interpretation problem where an entire class is flagged as a contaminant, when it is actually a single, poorly classified novel species that is causing the issue. I don't see how changing the above parameters will help when the underlying problem is database/classification related? Maybe you have some suggestions on how to attack this, other than assembling the MAG (which I have done) and then mapping the reads to it?

khyox (Owner) commented May 13, 2022

Metagenomic databases have improved a lot over time, but they are still far from perfect. I would say that, if you have identified a clear problem in the DB, you can try to correct it at the source instead of having to correct it downstream, although I understand there are times when that is not so easy. If the classification is poor, as you mention, you may luckily have a low classification score for that taxon (that's another benefit of using score-oriented classification!), so --ctrlminscore would be very helpful. In addition, if such a taxon is a minority one in the control samples, you can set --ctrlmintaxa very low so that just the lowest possible level is flagged as a contaminant and not an upper level, which minimizes the "damage" (of the DB problem upstream) by keeping it at the lowest level. Alternatively, if you used Centrifuge, you can use rextract to get the reads that were misclassified and remove them from the controls. You could also write a small script to delete assignments to that taxon in the results from the controls; if you used Recentrifuge's --exclude option you would also remove them from the regular samples, so unfortunately that is not an option in this case.
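
For the small-script route, a minimal sketch could be the one-liner below. It assumes the standard Centrifuge classification output, where taxID is the third tab-separated column and the first line is a header, and uses 1236 (the NCBI taxid for Gammaproteobacteria) only as an example:

  # Sketch: drop control reads assigned to one taxid (here 1236, Gammaproteobacteria)
  # before feeding the filtered control file back to rcf.
  # Assumes taxID is the 3rd tab-separated column and line 1 is a header.
  awk -F'\t' 'NR == 1 || $3 != "1236"' control.out > control.filtered.out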
