Improvements to qcstats (fastq, mapping) formats - explicit behavior? #75

unode · 2018-06-26T21:23:32Z

This is related to #74 and to some degree with #73 .

The current formats ({fastq}, {mapping}) are somewhat hard use.
Both formats leak internal information (e.g. 1:file preproc.lno10.pairs.1) and require contextual information for interpretation (ref. to line numbers in script.ngl or the filenames).

The fastq format is particularly hairy when mixing samples with a variable number of input files.
Since stats are computed per-file, you will end up with a variable number of pair and or single files. Once collect()ed, zero lines will be added to ensure alignment.

The parsing and zero-filling problem could be addressed by using a long format or some other form of tidy data.
These are popular with Python and R users that use pandas and tidyverse frameworks.

To address the information leak problem and avoid contextual information dependency, I believe an explicit function call is required. This could be use to define explicit requirements for computing stats but also to provide additional metadata.

A proposal: instead of implicitly calculating statistics on select() #65 , preprocess(), as_reads(), ... one could:

...
mapped = map(reads, reference='hg19')

mappingstats(mapped, name="Human reads mapped")          <----

mapped = select(mapped) using |mr|:
    if mr.flag({unmapped}):
        discard

mappingstats(mapped, name="Human reads kept")            <----

reads = as_reads(mapped)
reads = preprocess(reads, keep_singles=True) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

fastqstats(reads, name="Human reads after strict QC")    <----

collect(qcstats({stats}),                                <----
        ofile='outputs/qcstats.tsv',
        current=sample, allneeded=samples)

this could in turn produce outputs/qcstats.tsv (tab-delimited - visually aligned here):

sample     steptype   step                          identifier     feature                     value
SAMPLE_A   mapping    Human reads mapped                   sam   reference                      hg19  (if internal reference, /path otherwise)
SAMPLE_A   mapping    Human reads mapped                   sam       total                   8348246
SAMPLE_A   mapping    Human reads mapped                   sam     aligned                   5678246
SAMPLE_A   mapping    Human reads mapped                   sam        file                     /path  (/path or 'step + identifier' if exists)
SAMPLE_A   mapping    Human reads mapped                   sam         ...                       ...  (other fields in mapping output)
SAMPLE_A   mapping    Human reads kept                     sam       total                   5678246
SAMPLE_A   mapping    Human reads kept                     sam     aligned                   5678246
SAMPLE_A   mapping    Human reads kept                     sam        file    Human reads mapped-sam  (/path or 'step + identifier' if exists)
SAMPLE_A   mapping    Human reads kept                     sam         ...                       ...
SAMPLE_A   fastq      Human reads after strict QC       pair.1        file      Human reads kept-sam
SAMPLE_A   fastq      Human reads after strict QC       pair.1     numSeqs                   2839123
SAMPLE_A   fastq      Human reads after strict QC       pair.2     numSeqs                   2839123
SAMPLE_A   fastq      Human reads after strict QC       single     numSeqs                         0  (always present even if zero or non-existing)
SAMPLE_A   fastq      Human reads after strict QC          ...         ...                       ...
SAMPLE_B   mapping    Human reads mapped                   sam       total                   6412432
SAMPLE_B   mapping    Human reads mapped                   sam     aligned                   4327644
SAMPLE_B   mapping    Human reads mapped                   sam         ...                       ...
SAMPLE_B   mapping    Human reads kept                     sam       total                   4327644
SAMPLE_B   mapping    Human reads kept                     sam     aligned                   4327644
SAMPLE_B   mapping    Human reads kept                     sam         ...                       ...
SAMPLE_B   fastq      Human reads after strict QC       pair.1        file      Human reads kept-sam
SAMPLE_B   fastq      Human reads after strict QC       pair.1     numSeqs                         0  (always present)
SAMPLE_B   fastq      Human reads after strict QC       pair.2     numSeqs                         0  (always present)
SAMPLE_B   fastq      Human reads after strict QC       single     numSeqs                   4327644  (all single end)
SAMPLE_B   fastq      Human reads after strict QC          ...         ...                       ...

See notes on the rightmost side of above table.

Aspects not discussed:

implicit dependency between *stats() functions and qcstats({stats}) - should this be explicit too?
impact on internal hashing - dependency on user specified name= for qcstats().

The text was updated successfully, but these errors were encountered:

unode · 2018-07-18T11:06:10Z

Another perhaps simpler alternative is to add a stats_label="Human reads kept" argument to all functions that can generate statistics.

Passing this argument would also explicitly specify which functions should or shouldn't produce statistics.

luispedro · 2018-07-18T12:02:12Z

Another perhaps simpler alternative is to add a stats_label="Human reads kept" argument to all functions that can generate statistics.

I like this idea, it could be made more general, namely label as it could potentially be used elsewhere.

But, I would always produce statistics. This would just add extra convenience.

unode added enhancement question help wanted labels Jun 26, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvements to qcstats (fastq, mapping) formats - explicit behavior? #75

Improvements to qcstats (fastq, mapping) formats - explicit behavior? #75

unode commented Jun 26, 2018 •

edited

unode commented Jul 18, 2018

luispedro commented Jul 18, 2018

Improvements to qcstats (fastq, mapping) formats - explicit behavior? #75

Improvements to qcstats (fastq, mapping) formats - explicit behavior? #75

Comments

unode commented Jun 26, 2018 • edited

unode commented Jul 18, 2018

luispedro commented Jul 18, 2018

unode commented Jun 26, 2018 •

edited