Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats redesign #216

Open
golobor opened this issue Mar 17, 2024 · 2 comments
Open

stats redesign #216

golobor opened this issue Mar 17, 2024 · 2 comments

Comments

@golobor
Copy link
Member

golobor commented Mar 17, 2024

stats are getting out of control.

A non-comprehensive list of issues:

  1. We have several stats, each generated by a different tool during their runtime (parse/stats, dedup, filterbycov). This is probably not an issue, but we do not have a standard for what stats are, how they should look like, and how should API look like.

  2. We lack a good naming scheme for stat modules and their output. The canonical "stats" is the summary produced by pairtools parse. It contains many unrelated sections and is called, confusingly, stats. Stats produced by other methods are called more specifically, e.g. "dedup.stats". Furthermore, some tools, e.g. dedup, produce multiple stats (e.g. duplication-by-tile).

  3. Currently, stats-calculating codes are interweaved with the main code, resulting in spaghetti code.

  4. Stats-calculating codes are duplicated: one version takes one pair at a time (PairCounter) and the other processes entire dataframes with many pairs.

  5. We have implemented summaries, but they may not be sufficiently documented/separated in the API. We may need a further category of stats: "counts" (i.e. additive statistics), "summaries" and "const" (i.e. non-additive stats, i.e. genome assembly properties). These categories should then probably be split right at the root level of the stats module.

  6. Saving stats is optional, but some (e.g. dedup) may not be regenerated after running.

  7. Stats can require many flags of their own (e.g. dedup now has five stats-related flags!), which makes the CLI confusing.

  8. We do not have any way to track which flags were applied to generate stats (e.g. --filter!!), which makes them less useful. CORRECTION: filter expression seems to be stored:

    self._stat[key]["filter_expression"] = self.filters[key]

  9. Pairtools stats API has many unrelated sections and is thus hard to navigate. It also has partially overlapping functionality with P(s).

  10. In the future, we may need even more stats, e.g. restrict (e.g. histogram of distances from cut site to restriction sites), scaling (P(s) as a stat, convergence point of scalings of different directionalities with high precision), phase.

  11. Stats are undocumented. As stats get more and more complicated, it becomes harder and harder to parse them - and to understand what some keys mean. Ideally, (a) .stats outputs should contain auto-generated comments explaining the content of each field, and (b) in the code, the text of these comments should be stored somewhere next to the code that generates them so it's easy to check the presence and the validity of the comment.

@agalitsyna
Copy link
Member

agalitsyna commented Mar 17, 2024

Excellent points. I would add also:

  1. Some stats are re-used in MultiQC that has to be up-to-date with the changes in format/reporting, which adds another layer of restrictions - we at least need to keep track of what types of stats are used there, and what can be safely modified.

  2. There are currently two variants to report scalings, which ideally can be a single table. (related to p. 8)

@Phlya
Copy link
Member

Phlya commented Mar 18, 2024

Thank you for bringing this up and writing up such a detailed list! I agree, many of these things also irked me.

Re --filter - indeed, the expression is stored!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants