Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat/better seed debug logging #1240

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

corneliusroemer
Copy link
Member

@corneliusroemer corneliusroemer commented Aug 30, 2023

  • feat: log basic seed and align stats at debug level

We're currently not really using debug level logging for anything. I think it'd be useful to output some basic stats for each sequence like what proportion of the query are covered by chained matches, mean band width, max indel, and things like that.

This is what this PR outputs at debug level (-vv):

2023-08-30 13:17:52.012 [D] compression.rs:68: When processing '"nextclade.tsv"': detected file extension 'tsv'. It will be using algorithm: 'None'
2023-08-30 13:17:52.012 [I] nextclade_loop.rs:87: Processing sequence 'Ireland/CO-NVRL-ecM21IRL00199074/2021'
2023-08-30 13:17:52.036 [D] seed_match2.rs:509: Chained seed stats. max indel: 12, # matches: 181, first/last match distance from start/end [start: (ref: 12, qry: 0), end: (ref: 129, qry: 0)], max unmatched stretch (ref: 173, qry stretch: 173)
2023-08-30 13:17:52.036 [D] seed_match2.rs:514: Seed alignment covers 87.09% of query length
2023-08-30 13:17:52.040 [D] align.rs:83: Nucleotide alignment band area=24370141, mean band width=814
2023-08-30 13:17:57.535 [D] align.rs:91: Attempt: 0, Alignment score: 86887

@vercel
Copy link

vercel bot commented Aug 30, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
nextclade ✅ Ready (Inspect) Visit Preview Aug 30, 2023 2:20pm

@corneliusroemer corneliusroemer changed the base branch from master to v3 August 30, 2023 13:16
Comment on lines +487 to +509
let max_shift = seed_matches.iter().map(|sm| sm.offset.abs()).max().unwrap_or(0);
// Need to iterate in pairs
let max_unmatched_ref_stretch = seed_matches
.iter()
.tuple_windows()
.map(|(sm1, sm2)| sm2.ref_pos - (sm1.ref_pos + sm1.length))
.max()
.unwrap_or(0);
let max_unmatched_qry_stretch = seed_matches
.iter()
.tuple_windows()
.map(|(sm1, sm2)| sm2.qry_pos - (sm1.qry_pos + sm1.length))
.max()
.unwrap_or(0);
let first_match = seed_matches.first().unwrap();
let last_match = seed_matches.last().unwrap();
let first_match_ref_pos = first_match.ref_pos;
let first_match_qry_pos = first_match.qry_pos;
let last_match_ref_pos = ref_seq.len() - last_match.ref_pos - last_match.length;
let last_match_qry_pos = qry_seq.len() - last_match.qry_pos - last_match.length;

let n_matches = seed_matches.len();
debug!("Chained seed stats. max indel: {max_shift}, # matches: {n_matches}, first/last match distance from start/end [start: (ref: {first_match_ref_pos}, qry: {first_match_qry_pos}), end: (ref: {last_match_ref_pos}, qry: {last_match_qry_pos})], max unmatched stretch (ref: {max_unmatched_ref_stretch}, qry stretch: {max_unmatched_qry_stretch})");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please extract the body of the condition into a separate function (keeping the condition itself where it is), so that it does not distract from the main algo code.

Note that the conditional jump will always be executed in production. (And every log macro invocation also adds a similar conditional jump underneath, which might hurt performance in tight loops. This particular place seems to be executed rarely though.)

Is this thing required in release builds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point to refactor into a function.

I know that this will be always executed in production - but it's only executed once per sequence, I don't think we need to think about perf at things so rarely executed.

Not required in release builds but it's nice if once can always get this info - we also include lots of trace statements in release builds. What's the argument against including it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. No particular argument other than it becomes unnecessary instructions for most executions. Let's hope branch predictor figures it out :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

2 participants