Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

manysketch: check ambiguous nucleotide handling #297

Open
bluegenes opened this issue Apr 2, 2024 · 0 comments
Open

manysketch: check ambiguous nucleotide handling #297

bluegenes opened this issue Apr 2, 2024 · 0 comments

Comments

@bluegenes
Copy link
Contributor

bluegenes commented Apr 2, 2024

To be consistent with sourmash functionality, we should ignore k-mers with N's.
I initially used force to avoid throwing out records with N's, but I wanted to double check whether the behavior is as expected.

Digging in to see if I understand what is happening:

relevant branchwater manysketch code:

https://github.com/sourmash-bio/sourmash_plugin_branchwater/blob/main/src/manysketch.rs#L234-L236

 if moltype == "protein" {
     sig.add_protein(&record.seq())
     .expect("Failed to add protein");
} else {
    // if not force, panics with 'N' in dna sequence
    sig.add_sequence(&record.seq(), true)
    .expect("Failed to add sequence");

Relevant sourmash core code: https://github.com/sourmash-bio/sourmash/blob/0af3cbb828e77676496abd1dfe0196818c54c511/src/core/src/signature.rs#L38C5-L58C6

fn add_sequence(&mut self, seq: &[u8], force: bool) -> Result<(), Error> {
        let ready_hashes = SeqToHashes::new(
            seq,
            self.ksize(),
            force,
            false,
            self.hash_function(),
            self.seed(),
        );

        for hash_value in ready_hashes {
            match hash_value {
                Ok(0) => continue,
                Ok(x) => self.add_hash(x),
                Err(err) => return Err(err),
            }
        }

        // Should be always ok
        Ok(())
    }

Without force, I think add_sequence returns an Error if any of the k-mers have an N, meaning we lose the entire record, rather than just k-mers with N's?

in SeqToHashes: https://github.com/sourmash-bio/sourmash/blob/0af3cbb828e77676496abd1dfe0196818c54c511/src/core/src/signature.rs#L274-L282

 if !VALID[self.sequence[j] as usize] {
    if !self.force {
       return Some(Err(Error::InvalidDNA {
            message: String::from_utf8(kmer.to_vec()).unwrap(),
       }));
    } else {
       self.kmer_index += 1;
       // Move the iterator to the next step
       return Some(Ok(0));

So force allows us to push past invalid k-mers, which I think is the right behavior!

ref: sourmash-bio/sourmash#137, https://github.com/sourmash-bio/sourmash/pull/138/files

cc @ctb in case you have thoughts

@bluegenes bluegenes changed the title manysketch: improve ambiguous nucleotide handling manysketch: check ambiguous nucleotide handling Apr 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant