Secondary channel for long reads #157

sjackman · 2018-10-22T04:37:45Z

I'd like to add a parameter --long-fastq for use by porechop and unicycler. When porechop is used, it consumes the raw long reads, and its output is the input to the long_fastq secondary channel of unicycler. When porechop is not used, unicycler consumes the raw long reads. The behaviour that I want seems to be handled by set_secondary_inputs, but I'm not entirely sure where or how to call this function. Could you please point me in the right direction?

flowcraft/flowcraft/generator/process.py

Lines 726 to 727 in 2173530

    def set_secondary_inputs(self, channel_dict):
        """ Adds secondary inputs to the start of the pipeline.


ODiogoSilva · 2018-10-22T21:30:42Z

The secondary_inputs process directive has been deprecated in v1.2.1. Secondary inputs for a component can be provided via the parameters directive. These parameters can be used in the nextflow template to create channels in any way you see fit.

However, your case requires a bit more than just specifying parameters. In this case, what would be the expected behavior of unicycler when no long reads have been provided? Or when only long reads were provided? I'm assuming that the correct behavior would be for the unicycler component to evaluate the types of input it has and, depending on that, it either assembles only short reads, only long reads, or a mixture of both?

sjackman · 2018-10-22T22:08:42Z

Unicycler is able to assemble only short reads, only long reads, or a hybrid assembly of both short and long reads. Ideally the flowcraft component would be able to handle all three of these cases. To give a concrete example, I've opened PR #158 which is able to assemble only short reads, or a hybrid assembly of both short reads and long reads. It has two shortcomings.

It doesn't handle the case of only long reads.
It can only use raw long reads. How do I connect the output of Porechop into the --long_fastq input of Unicycler?
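The three cases mentioned above (short reads only, long reads only, hybrid) could be handled inside a component along the following lines. This is only a sketch and not the code from PR #158: the function name `unicycler_command` and its arguments are hypothetical, and the flags assume Unicycler's usual short options (-1/-2 for paired short reads, -l for long reads, -o for the output directory).

```python
from typing import List, Optional


def unicycler_command(
    out_dir: str,
    short_1: Optional[str] = None,
    short_2: Optional[str] = None,
    long_reads: Optional[str] = None,
) -> List[str]:
    """Build a Unicycler argument list for whichever read sets are present.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    # Paired short reads must be given together or not at all.
    if bool(short_1) != bool(short_2):
        raise ValueError("paired short reads need both files")
    if not short_1 and not long_reads:
        raise ValueError("no reads were provided")

    cmd = ["unicycler"]
    if short_1:
        cmd += ["-1", short_1, "-2", short_2]
    if long_reads:
        cmd += ["-l", long_reads]
    cmd += ["-o", out_dir]
    return cmd


# Short reads only, long reads only, and hybrid assembly.
print(unicycler_command("out", "r1.fq", "r2.fq"))
print(unicycler_command("out", long_reads="long.fq"))
print(unicycler_command("out", "r1.fq", "r2.fq", "long.fq"))
```

Keeping the decision in one place means the long-read-only case is just another branch rather than a separate component.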

sjackman · 2018-10-22T23:44:06Z

Let's move this discussion over to PR #158.

ODiogoSilva · 2018-10-22T23:44:41Z

Right, so we can solve those shortcomings, but we'll need to make some additions to flowcraft, some of which we have wanted to make for some time:

Add new raw_input type for long reads. @cimendes has been pushing for this for some time 😄
Add support for 1 or more input types for components, where one is the main input (thus passed to the {{ input_channel }} placeholder) and the remaining ones would be defined as secondary channels.

In the components that can receive more than 1 input, we would have to write the code to handle the possible input combinations, since this would be highly software specific.

With these additions, assuming the new raw_input type is longFastq, it would just be a matter of adding a link_end = {'link':'__longFastq', 'alias':'someChannelName'} to unicycler. In this way, you could write a workflow like:

fastqc trimmomatic unicycler

And the workflow could be executed with --long_fastq to provide the long reads in addition to the short ones.

A pipeline like:

unicycler <other stuff>

Could be executed with only long reads, if the --long_fastq is provided and no --fastq is provided.

To use porechop, it could be:

// Alone
porechop unicycler
// process short and long reads at the same time
(porechop | fastqc unicycler)

In the last example, unicycler would fetch the latest process with a longFastq output type, which would be porechop. If porechop is not present, long reads can still be provided via --long_fastq.
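The lookup described above (take the latest upstream process with a longFastq output, otherwise fall back to the raw parameter) can be sketched as follows. This is an illustration under stated assumptions, not flowcraft's implementation: the `OUTPUT_TYPES` table and the `resolve_long_reads` function are hypothetical, and longFastq is the proposed type name from this thread.

```python
from typing import List, Tuple

# Hypothetical component -> output type table, for illustration only.
# flowcraft stores this information on its own process classes.
OUTPUT_TYPES = {
    "fastqc": "fastq",
    "trimmomatic": "fastq",
    "porechop": "longFastq",
    "unicycler": "fasta",
}


def resolve_long_reads(upstream: List[str], wanted: str = "longFastq") -> Tuple[str, str]:
    """Pick the source of long reads for a component.

    `upstream` lists the components placed before it, in pipeline order.
    Returns ("component", name) for the latest upstream producer of `wanted`,
    or ("parameter", "--long_fastq") when nothing upstream produces it.
    """
    for name in reversed(upstream):
        if OUTPUT_TYPES.get(name) == wanted:
            return ("component", name)
    return ("parameter", "--long_fastq")


print(resolve_long_reads(["porechop", "fastqc"]))     # porechop feeds unicycler
print(resolve_long_reads(["fastqc", "trimmomatic"]))  # falls back to --long_fastq
```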

Does this overall setup sound good to you?

sjackman · 2018-10-22T23:58:23Z

This all sounds great!

I changed the name in PR #158 from long_fastq to long_reads, because most long read tools accept reads in either FASTQ or FASTA format.

(porechop | fastqc unicycler)

This syntax doesn't seem intuitive to me. The syntax 1 (2a 3a | 2b 3b) is short for 1 2a 3a and 1 2b 3b. The results in the a branch are not available to the b branch.
In (porechop | fastqc unicycler), which is short for porechop and fastqc unicycler, I don't expect unicycler to have access to the output of porechop, since it's in a different branch of the pipeline. To me it ought to look like (porechop | fastqc) unicycler
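To make this reading of the fork syntax concrete, here is a small sketch that expands a pipeline string into independent lanes. It is not flowcraft's parser: `expand_forks` is a hypothetical name, and the sketch only handles a single, non-nested fork at the end of the string.

```python
from typing import List


def expand_forks(pipeline: str) -> List[List[str]]:
    """Expand 'a (b c | d e)' into the lanes [a, b, c] and [a, d, e].

    Assumption: at most one fork, not nested, closing the pipeline string.
    """
    pipeline = pipeline.strip()
    if "(" not in pipeline:
        return [pipeline.split()]

    head, _, rest = pipeline.partition("(")
    if not rest.endswith(")"):
        raise ValueError("expected the fork to close the pipeline string")
    body = rest[:-1]
    if "(" in body or ")" in body:
        raise ValueError("nested forks are not handled in this sketch")

    prefix = head.split()
    # Every lane starts from the shared prefix and never sees its siblings.
    return [prefix + lane.split() for lane in body.split("|")]


print(expand_forks("1 (2a 3a | 2b 3b)"))
print(expand_forks("(porechop | fastqc unicycler)"))
```

Under this expansion the second call yields one lane with only porechop and another with fastqc followed by unicycler, so nothing connects porechop's output to unicycler.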

sjackman · 2018-10-23T00:03:56Z

A typical hybrid assembly pipeline might look like
(trimmomatic | porechop filtlong) unicycler (bandage | quast)

ODiogoSilva · 2018-10-24T13:28:16Z

Hmm, yes, the problem with reducing the pipeline orchestration to a single raw string is that you also don't have many simple options for building it. Changing to the syntax you proposed should be possible, but will require reworking the parser. I really feel that the ability to merge the output of different components into a single one would be a very useful feature to add to flowcraft. However, we would need some time to think about the overall strategy so that it can be implemented in a general way. For instance, in the last example that you provided, it would be difficult to plug the output of trimmomatic into other assemblers that don't handle long reads.

Right now I'm on a tight deadline, but should be able to return to this shortly.

sjackman · 2018-10-24T20:45:14Z

Cool. Let's think on it and talk more.

cimendes · 2018-12-13T17:22:03Z

I'm bringing this issue back to life and adding my input, as I'm planning on adding long read components to flowcraft relatively soon. 😄
The main concern here relates to tools that can process either long reads, short reads, or both. The first point is that long reads can come from two different technologies, if we allow the introduction of components that deal directly with the raw data (for example: canu, basecallers). We might agree that, to simplify, long reads are already in fasta or fastq format. I'm in favour of this approach as it requires creating only a single new input parameter. @sjackman raised a valuable point that long reads can be in either fastq or fasta format. Should this new input handle both? Should we create two different inputs? Or should we just go with fastq format, as that is the most commonly used?
The other issue, pointed out by @ODiogoSilva, that the same tool can have more than one input type, requires some decisions. Supporting multiple input types is useful for other cases beyond long reads, such as kraken, which can work with both fastq and fasta. I really think we should start implementing this, as we already have some duplicated components (such as mash_sketch_fastq and mash_sketch_fasta).
In the case of hybrid assembly, the issues merge with what is already being discussed in #174.

sjackman · 2018-12-13T18:04:10Z

@sjackman raised a valuable point that long reads can be in either fastq or fasta format. Should this new input handle both? Should we create two different inputs? Or should we just go with fastq format, as that is the most commonly used?

I believe most long read assemblers can handle either type (unverified), so Flowcraft need not care what the file type is, and pass it straight through to the assembler. FASTQ is definitely more common though, and if we pick just one, that'd handle >90% of cases.

sjackman · 2018-12-13T18:07:40Z

Better names for the reads parameters than --fastq would be --short-reads and --long-reads, and Nextflow can determine the file type (based on either file extension or file content).
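Detecting the file type from content can be quite small. The sketch below is written in Python rather than Nextflow purely for illustration; `sniff_read_format` is a hypothetical helper, and it relies only on the convention that FASTA records start with ">" and FASTQ records start with "@", with gzip detected from the two-byte magic number.

```python
import gzip


def sniff_read_format(path: str) -> str:
    """Return "fasta", "fastq" or "unknown" based on the first non-blank line."""
    # gzip files start with the magic bytes 0x1f 0x8b.
    with open(path, "rb") as handle:
        magic = handle.read(2)
    opener = gzip.open if magic == b"\x1f\x8b" else open

    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            break  # first content line is neither header type
    return "unknown"
```

A component could call something like this once per input and pass the file straight through to the assembler, so a single --long-reads parameter would cover both formats.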

sjackman mentioned this issue Oct 22, 2018

unicycler: Add parameter --long_reads #158

Open

sjackman closed this as completed Oct 22, 2018

sjackman reopened this Oct 22, 2018

cimendes added this to related issues in Expand raw input types Jan 23, 2019

cimendes added the enhancement New feature or request label Aug 16, 2019
