Secondary channel for long reads #157

Open
sjackman opened this issue Oct 22, 2018 · 11 comments
Labels
enhancement New feature or request

Comments

@sjackman
Contributor

I'd like to add a parameter --long-fastq for use by porechop and unicycler. When porechop is used, it consumes the raw long reads, and its output is the input to the long_fastq secondary channel of unicycler. When porechop is not used, unicycler consumes the raw long reads. The behaviour that I want seems to be handled by set_secondary_inputs, but I'm not entirely sure where or how to call this function. Could you please point me in the right direction?

def set_secondary_inputs(self, channel_dict):
""" Adds secondary inputs to the start of the pipeline.

@ODiogoSilva
Collaborator

ODiogoSilva commented Oct 22, 2018

The secondary_inputs process directive has been deprecated in v1.2.1. Secondary inputs for a component can be provided via the parameters directive. These parameters can be used in the nextflow template to create channels in any way you see fit.

However, your case requires a bit more than just specifying parameters. In this case, what would be the expected behavior of unicycler when no long reads have been provided? Or when only long reads were provided? I'm assuming that the correct behavior would be for the unicycler component to evaluate the types of input it has and, depending on that, it either assembles only short reads, only long reads, or a mixture of both?

@sjackman
Contributor Author

sjackman commented Oct 22, 2018

Unicycler is able to assemble only short reads, only long reads, or a hybrid assembly of both short and long reads. Ideally the flowcraft component would be able to handle all three of these cases. To give a concrete example, I've opened PR #158 which is able to assemble only short reads, or a hybrid assembly of both short reads and long reads. It has two shortcomings.

  1. It doesn't handle the case of only long reads.
  2. It can only use raw long reads. How do I connect the output of Porechop into the --long_fastq input of Unicycler?
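
The three cases can be sketched as a small command builder. This is only an illustration, not the code in PR #158: the function name and its arguments are hypothetical, and the only Unicycler options it relies on are -1/-2 (paired short reads), -l (long reads) and -o (output directory).

```python
def unicycler_command(out_dir, short_1=None, short_2=None, long_reads=None):
    """Build a Unicycler command line for whichever read sets are present.

    Hypothetical helper for illustration only.
    """
    # Paired short reads must be given together or not at all.
    if bool(short_1) != bool(short_2):
        raise ValueError("short reads must be provided as a pair")
    if not short_1 and not long_reads:
        raise ValueError("no reads provided")

    cmd = ["unicycler"]
    if short_1:
        cmd += ["-1", short_1, "-2", short_2]  # short-read only or hybrid
    if long_reads:
        cmd += ["-l", long_reads]              # long-read only or hybrid
    cmd += ["-o", out_dir]
    return cmd
```

Calling it with only long_reads gives the long-read-only case, which is shortcoming 1 above.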

@sjackman
Contributor Author

Let's move this discussion over to PR #158.

@ODiogoSilva
Collaborator

Right, so we can solve those shortcomings, but we'll need to make some additions to flowcraft, some of which we have wanted to make for some time:

  • Add new raw_input type for long reads. @cimendes has been pushing for this for some time 😄
  • Add support for 1 or more input types for components, where one is the main input (thus passed to the {{ input_channel }} placeholder) and the remaining ones would be defined as secondary channels.

In the components that can receive more than 1 input, we would have to write the code to handle the possible input combinations, since this would be highly software specific.

With these additions, assuming the new raw_input type is longFastq, it would just be a matter of adding a link_end = {'link':'__longFastq', 'alias':'someChannelName'} to unicycler. In this way, you could write a workflow like:

fastqc trimmomatic unicycler

And the workflow could be executed with --long_fastq to provide the long reads in addition to the short ones.

A pipeline like:

unicycler <other stuff>

Could be executed with only long reads, if --long_fastq is provided and --fastq is not.

To use porechop, it could be:

// Alone
porechop unicycler
// process short and long reads at the same time
(porechop | fastqc unicycler)

In the last example, unicycler would fetch the latest process with a longFastq output type, which would be porechop. If porechop is not present, long reads can still be provided via --long_fastq.
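
A rough sketch of that lookup, to make the fallback explicit. None of this is existing flowcraft code: the function name, the (name, output_type) tuples and the returned pairs are made up for illustration.

```python
def resolve_secondary_source(upstream, wanted_type, raw_param):
    """Pick where a secondary channel should come from.

    upstream: list of (component_name, output_type) in pipeline order.
    Hypothetical helper, not part of flowcraft.
    """
    # Walk backwards so the latest matching process wins.
    for name, output_type in reversed(upstream):
        if output_type == wanted_type:
            return ("process", name)
    # No upstream producer: fall back to the raw input parameter.
    return ("param", raw_param)


# With porechop upstream, unicycler would take its output.
print(resolve_secondary_source(
    [("porechop", "longFastq"), ("fastqc", "fastq")], "longFastq", "--long_fastq"))
# Without porechop, the raw parameter is used instead.
print(resolve_secondary_source(
    [("fastqc", "fastq"), ("trimmomatic", "fastq")], "longFastq", "--long_fastq"))
```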

Does this overall setup sound good to you?

@sjackman sjackman reopened this Oct 22, 2018
@sjackman
Contributor Author

This all sounds great!

I changed the name in PR #158 from long_fastq to long_reads, because most long read tools accept reads in either FASTQ or FASTA format.

(porechop | fastqc unicycler)

This syntax doesn't seem intuitive to me. The syntax 1 (2a 3a | 2b 3b) is short for 1 2a 3a and 1 2b 3b. The results in the a branch are not available to the b branch.
In (porechop | fastqc unicycler), which is short for porechop and fastqc unicycler, I don't expect unicycler to have access to the output of porechop, since it's in a different branch of the pipeline. To me it ought to look like (porechop | fastqc) unicycler
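
To make the expansion rule concrete, here is a minimal sketch of how I read the fork syntax. It is not flowcraft's actual parser, and it assumes that components written after a closing parenthesis are appended to every branch.

```python
import re


def expand(pipeline):
    """Expand a pipeline string with forks into its linear lanes."""
    tokens = re.findall(r"[()|]|[^\s()|]+", pipeline)
    pos = 0

    def parse_alternatives():
        # branch ( "|" branch )*
        nonlocal pos
        lanes = parse_sequence()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            lanes += parse_sequence()
        return lanes

    def parse_sequence():
        # Components and forks in a row; each fork multiplies the lanes.
        nonlocal pos
        lanes = [[]]
        while pos < len(tokens) and tokens[pos] not in ("|", ")"):
            if tokens[pos] == "(":
                pos += 1
                branches = parse_alternatives()
                if pos >= len(tokens) or tokens[pos] != ")":
                    raise ValueError("unbalanced parentheses")
                pos += 1
            else:
                branches = [[tokens[pos]]]
                pos += 1
            lanes = [lane + branch for lane in lanes for branch in branches]
        return lanes

    lanes = parse_alternatives()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return lanes


print(expand("1 (2a 3a | 2b 3b)"))
print(expand("(porechop | fastqc unicycler)"))
```

The second call yields one lane with only porechop and another with fastqc and unicycler, so unicycler never shares a lane with porechop. Under this pure expansion, (porechop | fastqc) unicycler would give two separate unicycler lanes rather than one merged step, so a real merge would need more than this.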

@sjackman
Contributor Author

A typical hybrid assembly pipeline might look like
(trimmomatic | porechop filtlong) unicycler (bandage | quast)

@ODiogoSilva
Collaborator

Hmm, yes, the problem with simplifying the pipeline orchestration into a single raw string is that you don't have many simple options for building it. Changing to the syntax you proposed should be possible, but will require reworking the parser. I really feel this would be a very useful feature to add to flowcraft: the ability to merge the outputs of different components into a single one. However, we would need some time to think about the overall strategy so that it can be implemented in a general way. For instance, in the last example that you provided, it would be difficult to plug the output of trimmomatic into other assemblers that don't handle long reads.

Right now I'm on a tight deadline, but should be able to return to this shortly.

@sjackman
Contributor Author

Cool. Let's think on it and talk more.

@cimendes
Member

I'm bringing this issue back to life and adding my input as I'm planning on implementing long read components to flowcraft relatively soon. 😄
The main concern here relates to tools that can process either long reads, short reads, or both. The first point is that long reads can come from two different technologies, if we allow the introduction of components that deal directly with the raw data (example: canu, basecallers). To simplify, we might agree that long reads are already in fasta or fastq format. I'm in favour of this approach as it requires creating only a single new input parameter. @sjackman raised a valuable point that long reads can be in either fastq or fasta format. Should this new input handle both? Should we create two different inputs? Or should we just go with fastq, as that is the most commonly used format?
The other issue, pointed out by @ODiogoSilva, is that the same tool can have more than one input type, which requires some decisions. Supporting multiple input types is useful for other cases outside of the long read implementation, such as kraken, which can work with both fastq and fasta. I really think we should start implementing this, as we already have some duplicated components (such as mash_sketch_fastq and mash_sketch_fasta).
In the case of hybrid assembly, the issues merge with what is already being discussed in #174.

@sjackman
Contributor Author

@sjackman raised a valuable point that long reads can be in either fastq or fasta format. Should this new input handle both? Should we create two different inputs? Or should we just go with fastq, as that is the most commonly used format?

I believe most long read assemblers can handle either type (unverified), so Flowcraft need not care what the file type is, and pass it straight through to the assembler. FASTQ is definitely more common though, and if we pick just one, that'd handle >90% of cases.

@sjackman
Contributor Author

sjackman commented Dec 13, 2018

Better names for the reads parameters than --fastq would be --short-reads and --long-reads, and Nextflow can determine the file type (based on either file extension or file content).
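
As a sketch of the content-based option (written in Python rather than in a Nextflow template, and with a made-up function name), the first byte of the file is enough to tell the two formats apart: FASTA records start with ">" and FASTQ records start with "@". Gzipped files are recognised by their two-byte magic number.

```python
import gzip


def sniff_read_format(path):
    """Return "fasta" or "fastq" based on the first byte of the file.

    Hypothetical helper; handles plain and gzip-compressed files.
    """
    with open(path, "rb") as handle:
        magic = handle.read(2)
    # Gzip streams always begin with the bytes 0x1f 0x8b.
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as handle:
        first = handle.read(1)
    if first == b">":
        return "fasta"
    if first == b"@":
        return "fastq"
    raise ValueError("unrecognised read format: " + path)
```

This only inspects the first record, so it would not catch a file that mixes formats or starts with blank lines.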

@cimendes cimendes added this to related issues in Expand raw input types Jan 23, 2019
@cimendes cimendes added the enhancement New feature or request label Aug 16, 2019

3 participants