You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
To be consistent with sourmash functionality, we should ignore k-mers with N's.
I initially used force to avoid throwing out records with N's, but I wanted to double check whether the behavior is as expected.
Digging in to see if I understand what is happening:
if moltype == "protein" {
sig.add_protein(&record.seq())
.expect("Failed to add protein");
} else {
// if not force, panics with 'N' in dna sequence
sig.add_sequence(&record.seq(), true)
.expect("Failed to add sequence");
fn add_sequence(&mut self, seq: &[u8], force: bool) -> Result<(), Error> {
let ready_hashes = SeqToHashes::new(
seq,
self.ksize(),
force,
false,
self.hash_function(),
self.seed(),
);
for hash_value in ready_hashes {
match hash_value {
Ok(0) => continue,
Ok(x) => self.add_hash(x),
Err(err) => return Err(err),
}
}
// Should be always ok
Ok(())
}
Without force, I think add_sequence returns an Error if any of the k-mers have an N, meaning we lose the entire record, rather than just k-mers with N's?
if !VALID[self.sequence[j] as usize] {
if !self.force {
return Some(Err(Error::InvalidDNA {
message: String::from_utf8(kmer.to_vec()).unwrap(),
}));
} else {
self.kmer_index += 1;
// Move the iterator to the next step
return Some(Ok(0));
So force allows us to push past invalid k-mers, which I think is the right behavior!
To be consistent with
sourmash
functionality, we should ignore k-mers with N's.I initially used
force
to avoid throwing out records with N's, but I wanted to double check whether the behavior is as expected.Digging in to see if I understand what is happening:
relevant branchwater
manysketch
code:https://github.com/sourmash-bio/sourmash_plugin_branchwater/blob/main/src/manysketch.rs#L234-L236
Relevant sourmash core code: https://github.com/sourmash-bio/sourmash/blob/0af3cbb828e77676496abd1dfe0196818c54c511/src/core/src/signature.rs#L38C5-L58C6
Without
force
, I thinkadd_sequence
returns an Error if any of the k-mers have anN
, meaning we lose the entire record, rather than just k-mers with N's?in
SeqToHashes
: https://github.com/sourmash-bio/sourmash/blob/0af3cbb828e77676496abd1dfe0196818c54c511/src/core/src/signature.rs#L274-L282So
force
allows us to push past invalid k-mers, which I think is the right behavior!ref: sourmash-bio/sourmash#137, https://github.com/sourmash-bio/sourmash/pull/138/files
cc @ctb in case you have thoughts
The text was updated successfully, but these errors were encountered: