feat/better seed debug logging #1240

corneliusroemer · 2023-08-30T13:15:45Z

feat: log basic seed and align stats at debug level

We're currently not really using debug level logging for anything. I think it'd be useful to output some basic stats for each sequence like what proportion of the query are covered by chained matches, mean band width, max indel, and things like that.

This is what this PR outputs at debug level (-vv):

2023-08-30 13:17:52.012 [D] compression.rs:68: When processing '"nextclade.tsv"': detected file extension 'tsv'. It will be using algorithm: 'None'
2023-08-30 13:17:52.012 [I] nextclade_loop.rs:87: Processing sequence 'Ireland/CO-NVRL-ecM21IRL00199074/2021'
2023-08-30 13:17:52.036 [D] seed_match2.rs:509: Chained seed stats. max indel: 12, # matches: 181, first/last match distance from start/end [start: (ref: 12, qry: 0), end: (ref: 129, qry: 0)], max unmatched stretch (ref: 173, qry stretch: 173)
2023-08-30 13:17:52.036 [D] seed_match2.rs:514: Seed alignment covers 87.09% of query length
2023-08-30 13:17:52.040 [D] align.rs:83: Nucleotide alignment band area=24370141, mean band width=814
2023-08-30 13:17:57.535 [D] align.rs:91: Attempt: 0, Alignment score: 86887

vercel · 2023-08-30T13:15:48Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated (UTC)
nextclade	✅ Ready (Inspect)	Visit Preview	Aug 30, 2023 2:20pm

packages_rs/nextclade/src/align/score_matrix.rs

ivan-aksamentov · 2023-08-30T15:23:37Z

packages_rs/nextclade/src/align/seed_match2.rs

+    let max_shift = seed_matches.iter().map(|sm| sm.offset.abs()).max().unwrap_or(0);
+    // Need to iterate in pairs
+    let max_unmatched_ref_stretch = seed_matches
+      .iter()
+      .tuple_windows()
+      .map(|(sm1, sm2)| sm2.ref_pos - (sm1.ref_pos + sm1.length))
+      .max()
+      .unwrap_or(0);
+    let max_unmatched_qry_stretch = seed_matches
+      .iter()
+      .tuple_windows()
+      .map(|(sm1, sm2)| sm2.qry_pos - (sm1.qry_pos + sm1.length))
+      .max()
+      .unwrap_or(0);
+    let first_match = seed_matches.first().unwrap();
+    let last_match = seed_matches.last().unwrap();
+    let first_match_ref_pos = first_match.ref_pos;
+    let first_match_qry_pos = first_match.qry_pos;
+    let last_match_ref_pos = ref_seq.len() - last_match.ref_pos - last_match.length;
+    let last_match_qry_pos = qry_seq.len() - last_match.qry_pos - last_match.length;
+
+    let n_matches = seed_matches.len();
+    debug!("Chained seed stats. max indel: {max_shift}, # matches: {n_matches}, first/last match distance from start/end [start: (ref: {first_match_ref_pos}, qry: {first_match_qry_pos}), end: (ref: {last_match_ref_pos}, qry: {last_match_qry_pos})], max unmatched stretch (ref: {max_unmatched_ref_stretch}, qry stretch: {max_unmatched_qry_stretch})");


Could you please extract the body of the condition into a separate function (keeping the condition itself where it is), so that it does not distract from the main algo code.

Note that the conditional jump will always be executed in production. (And every log macro invocation also adds a similar conditional jump underneath, which might hurt performance in tight loops. This particular place seems to be executed rarely though.)

Is this thing required in release builds?

Good point to refactor into a function.

I know that this will be always executed in production - but it's only executed once per sequence, I don't think we need to think about perf at things so rarely executed.

Not required in release builds but it's nice if once can always get this info - we also include lots of trace statements in release builds. What's the argument against including it?

Okay. No particular argument other than it becomes unnecessary instructions for most executions. Let's hope branch predictor figures it out :)

corneliusroemer changed the base branch from master to v3 August 30, 2023 13:16

corneliusroemer commented Aug 30, 2023

View reviewed changes

packages_rs/nextclade/src/align/score_matrix.rs Outdated Show resolved Hide resolved

vercel bot deployed to Preview August 30, 2023 13:33 View deployment

feat: log basic seed and align stats at debug level

27d669e

corneliusroemer force-pushed the feat/better-seed-debug-logging branch from 05c958c to 27d669e Compare August 30, 2023 14:03

vercel bot deployed to Preview August 30, 2023 14:20 View deployment

Base automatically changed from v3 to master August 30, 2023 14:39

ivan-aksamentov reviewed Aug 30, 2023

View reviewed changes

ivan-aksamentov force-pushed the master branch 2 times, most recently from 93f31a9 to edefa32 Compare September 4, 2023 16:40

ivan-aksamentov force-pushed the master branch from 27d4a31 to 9010d85 Compare February 17, 2024 09:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat/better seed debug logging #1240

feat/better seed debug logging #1240

corneliusroemer commented Aug 30, 2023 •

edited

vercel bot commented Aug 30, 2023 •

edited

ivan-aksamentov Aug 30, 2023

corneliusroemer Aug 30, 2023

ivan-aksamentov Aug 30, 2023

feat/better seed debug logging #1240

Are you sure you want to change the base?

feat/better seed debug logging #1240

Conversation

corneliusroemer commented Aug 30, 2023 • edited

vercel bot commented Aug 30, 2023 • edited

ivan-aksamentov Aug 30, 2023

Choose a reason for hiding this comment

corneliusroemer Aug 30, 2023

Choose a reason for hiding this comment

ivan-aksamentov Aug 30, 2023

Choose a reason for hiding this comment

corneliusroemer commented Aug 30, 2023 •

edited

vercel bot commented Aug 30, 2023 •

edited