Is RawHash fast enough for cDNA enrichment? #2

Open
andreaswallberg opened this issue Oct 10, 2023 · 2 comments

@andreaswallberg

Dear developers,

This tool looks super interesting! I wonder whether you have tried it coupled with Read Until functionality for cDNA or other "short" long reads (e.g. 1-2 kbp).

If not, do you think it could tell whether a read is on or off target against a relatively small database of sequences (e.g. a transcriptome or a panel of selected genes) within the first 100-200 bases?

@canfirtina
Member

Dear @andreaswallberg,

Thank you for your interest. In recent weeks, we have been working hard to add new features, and we would be glad to discuss what we can improve to better support cDNA. However, we have not specifically tested RawHash or RawHash2 (a newer version) with cDNA data.

Regarding 'short' long reads, we have used a dataset, D1, which consists of SARS-CoV-2 sequences. The average read length in this dataset is about 430 bases, so these can be considered 'short' long reads (even shorter than the range you mentioned). In our paper, we describe using the D1 dataset alongside the D5 dataset (human genome, average read length 6k bases) for on-/off-target analysis, with a focus on contamination analysis. The results show that RawHash2 achieves about 94% precision and 85% recall in this context. From this, we believe RawHash can effectively distinguish on-target from off-target reads in scenarios where the on-target reads are very short.
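
For reference, precision and recall for an on-/off-target decision can be computed as in the minimal sketch below. The function name `precision_recall` and the read counts are hypothetical, chosen only so the output lands near the figures quoted above; they are not taken from the paper or from RawHash itself.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for an on-/off-target classifier.

    tp: on-target reads correctly called on-target
    fp: off-target reads wrongly called on-target
    fn: on-target reads wrongly called off-target
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not from the paper).
p, r = precision_recall(tp=850, fp=54, fn=150)
print(f"precision={p:.2f} recall={r:.2f}")
```

In an enrichment setting, precision describes how many of the reads kept as on-target really are on-target, while recall describes how many of the true on-target reads were kept.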

We currently do not have a cDNA dataset in our evaluation set. We would be interested in evaluating such a dataset to better tailor RawHash for cDNA applications. If you have any suggestions or feedback, like recommending a dataset that includes signal files and basecalled reads for accurate ground truth mapping, and specifics on the analysis you wish to conduct with RawHash, we would welcome your input.

Best,
Can

@andreaswallberg
Author

Hi @canfirtina !

Sounds good. I can provide such data. I will contact you later this week.

Best regards,
Andreas
