Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to qcstats (fastq, mapping) formats - explicit behavior? #75

Open
unode opened this issue Jun 26, 2018 · 2 comments
Open

Comments

@unode
Copy link
Member

unode commented Jun 26, 2018

This is related to #74 and to some degree with #73 .

The current formats ({fastq}, {mapping}) are somewhat hard use.
Both formats leak internal information (e.g. 1:file preproc.lno10.pairs.1) and require contextual information for interpretation (ref. to line numbers in script.ngl or the filenames).

The fastq format is particularly hairy when mixing samples with a variable number of input files.
Since stats are computed per-file, you will end up with a variable number of pair and or single files. Once collect()ed, zero lines will be added to ensure alignment.

The parsing and zero-filling problem could be addressed by using a long format or some other form of tidy data.
These are popular with Python and R users that use pandas and tidyverse frameworks.

To address the information leak problem and avoid contextual information dependency, I believe an explicit function call is required. This could be use to define explicit requirements for computing stats but also to provide additional metadata.

A proposal: instead of implicitly calculating statistics on select() #65 , preprocess(), as_reads(), ... one could:

...
mapped = map(reads, reference='hg19')

mappingstats(mapped, name="Human reads mapped")          <----

mapped = select(mapped) using |mr|:
    if mr.flag({unmapped}):
        discard

mappingstats(mapped, name="Human reads kept")            <----

reads = as_reads(mapped)
reads = preprocess(reads, keep_singles=True) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

fastqstats(reads, name="Human reads after strict QC")    <----

collect(qcstats({stats}),                                <----
        ofile='outputs/qcstats.tsv',
        current=sample, allneeded=samples)

this could in turn produce outputs/qcstats.tsv (tab-delimited - visually aligned here):

sample     steptype   step                          identifier     feature                     value
SAMPLE_A   mapping    Human reads mapped                   sam   reference                      hg19  (if internal reference, /path otherwise)
SAMPLE_A   mapping    Human reads mapped                   sam       total                   8348246
SAMPLE_A   mapping    Human reads mapped                   sam     aligned                   5678246
SAMPLE_A   mapping    Human reads mapped                   sam        file                     /path  (/path or 'step + identifier' if exists)
SAMPLE_A   mapping    Human reads mapped                   sam         ...                       ...  (other fields in mapping output)
SAMPLE_A   mapping    Human reads kept                     sam       total                   5678246
SAMPLE_A   mapping    Human reads kept                     sam     aligned                   5678246
SAMPLE_A   mapping    Human reads kept                     sam        file    Human reads mapped-sam  (/path or 'step + identifier' if exists)
SAMPLE_A   mapping    Human reads kept                     sam         ...                       ...
SAMPLE_A   fastq      Human reads after strict QC       pair.1        file      Human reads kept-sam
SAMPLE_A   fastq      Human reads after strict QC       pair.1     numSeqs                   2839123
SAMPLE_A   fastq      Human reads after strict QC       pair.2     numSeqs                   2839123
SAMPLE_A   fastq      Human reads after strict QC       single     numSeqs                         0  (always present even if zero or non-existing)
SAMPLE_A   fastq      Human reads after strict QC          ...         ...                       ...
SAMPLE_B   mapping    Human reads mapped                   sam       total                   6412432
SAMPLE_B   mapping    Human reads mapped                   sam     aligned                   4327644
SAMPLE_B   mapping    Human reads mapped                   sam         ...                       ...
SAMPLE_B   mapping    Human reads kept                     sam       total                   4327644
SAMPLE_B   mapping    Human reads kept                     sam     aligned                   4327644
SAMPLE_B   mapping    Human reads kept                     sam         ...                       ...
SAMPLE_B   fastq      Human reads after strict QC       pair.1        file      Human reads kept-sam
SAMPLE_B   fastq      Human reads after strict QC       pair.1     numSeqs                         0  (always present)
SAMPLE_B   fastq      Human reads after strict QC       pair.2     numSeqs                         0  (always present)
SAMPLE_B   fastq      Human reads after strict QC       single     numSeqs                   4327644  (all single end)
SAMPLE_B   fastq      Human reads after strict QC          ...         ...                       ...

See notes on the rightmost side of above table.

Aspects not discussed:

  1. implicit dependency between *stats() functions and qcstats({stats}) - should this be explicit too?
  2. impact on internal hashing - dependency on user specified name= for qcstats().
@unode
Copy link
Member Author

unode commented Jul 18, 2018

Another perhaps simpler alternative is to add a stats_label="Human reads kept" argument to all functions that can generate statistics.

Passing this argument would also explicitly specify which functions should or shouldn't produce statistics.

@luispedro
Copy link
Member

Another perhaps simpler alternative is to add a stats_label="Human reads kept" argument to all functions that can generate statistics.

I like this idea, it could be made more general, namely label as it could potentially be used elsewhere.

But, I would always produce statistics. This would just add extra convenience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants