cutadapt fixes and improvements #5954

wm75 · 2024-04-16T13:29:44Z

FOR CONTRIBUTOR:

I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
License permits unrestricted use (educational + commercial)
This PR adds a new tool or tool collection
This PR updates an existing tool or tool collection
This PR does something else (explain below)

Now here's a list of changes:

makes 1 the default for --minimum-length, for the reason described in this comment: cutadapt fixes and improvements #5954 (comment)
groups forward and rv filtering and trimming options together, which were previously separated quite a bit in the tool interface
groups and orders parameters in a more intuitive way (I believe)
fixes the --discard-casava and --max-aer options that previously had no effect
fixes the edge case where --minimum_length2 would get ignored when minimum_length wasn't set
fixes same issue for --maximum_length2
switches the tool to use short options where possible to shorten the generated command line
adds support for the "new" --pair-adapters option
fixes output filter expressions to not trigger silent errors like:
dataset output filter (library['type'] == 'paired' and 'multiple_output' not in output_selector) failed: argument of type 'NoneType' is not iterable
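
For illustration, the failure mode behind that message can be reproduced in plain Python. This is a minimal sketch, not the code from the PR: the function names are hypothetical, and the `or []` guard is just one common way to make such an expression safe when nothing is selected (the PR may fix it differently).

```python
def unguarded_filter(library_type, output_selector):
    # Same shape as the failing filter expression: a membership test on a
    # value that is None when the user selected no optional outputs.
    return library_type == "paired" and "multiple_output" not in output_selector


def guarded_filter(library_type, output_selector):
    # Hypothetical guard: treat "nothing selected" (None) as an empty list.
    selected = output_selector or []
    return library_type == "paired" and "multiple_output" not in selected


if __name__ == "__main__":
    try:
        unguarded_filter("paired", None)
    except TypeError as exc:
        print("unguarded filter failed:", exc)
    print("guarded filter:", guarded_filter("paired", None))
```

Note that the unguarded version only fails for paired input, because `and` short-circuits for other library types, which is part of why the error is easy to miss.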

bernt-matthias · 2024-04-16T16:01:51Z

Would named macro tokens be an option, i.e. using <yield name="xyz"/>?

Then we could insert the read2 parameters only if needed .. and at the correct positions.

bernt-matthias · 2024-04-16T16:04:11Z

tools/cutadapt/cutadapt.xml

-    --minimum-length=$filter_options.minimum_length:$library.minimum_length2
-#else if str($filter_options.minimum_length):
-    --minimum-length=$filter_options.minimum_length
+#if $paired and str($filter_options.minimum_length2):


Should minimum_length2 be treated the same way as minimum_length?

No, it shouldn't. That part is a bit confusing, but minimum_length2 should normally be unset, in which case the value of minimum_length will also be used for rv reads in PE data.
Only if the user sets it explicitly does it special-case the treatment of rv reads.

So if a user specifies a value for minimum_length2, we need to use the colon syntax to pass both params. Otherwise, we need to pass only minimum_length, unless it's set to the command line default anyway.
The old version was buggy because it always ignored minimum_length2 when the user didn't specify minimum_length.
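
The decision described above can be sketched in Python roughly as follows. This is only an illustration of the logic, not the Cheetah template from the wrapper: the function name and signature are hypothetical, `None` stands for "minimum_length2 not set", and the command line default of 0 is taken from the discussion in this PR.

```python
def minimum_length_arg(minimum_length, minimum_length2=None, paired=False, cli_default=0):
    """Build the --minimum-length argument, or return '' if none is needed."""
    if paired and minimum_length2 is not None:
        # User set an explicit value for rv reads: pass both with the colon syntax.
        return f"--minimum-length={minimum_length}:{minimum_length2}"
    if minimum_length != cli_default:
        # Only R1 value given: cutadapt applies it to both reads of a pair.
        return f"--minimum-length={minimum_length}"
    # Value equals the command line default, so nothing needs to be passed.
    return ""


if __name__ == "__main__":
    print(minimum_length_arg(1, paired=True))
    print(minimum_length_arg(1, 20, paired=True))
```

The same structure would apply to --maximum-length and maximum_length2.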

bernt-matthias · 2024-04-16T16:04:50Z

tools/cutadapt/cutadapt.xml

@@ -227,32 +284,19 @@ $read_mod_options.zero_cap
            <param argument="--times" type="integer" min="1" value="1" label="Match times" help="Try to remove adapters at most COUNT times. Useful when an adapter gets appended multiple times." />
            <param argument="--overlap" type="integer" min="1" value="3" label="Minimum overlap length" help="Minimum overlap length. If the overlap between the adapter and the sequence is shorter than LENGTH, the read is not modified. This reduces the number of bases trimmed purely due to short random adapter matches." />
            <param argument="--match-read-wildcards" type="boolean" checked="false" truevalue="--match-read-wildcards" falsevalue="" label="Match wilcards in reads" help="Interpret IUPAC wildcards in reads"/>
-            <param argument="--no-match-adapter-wildcards" type="boolean" checked="true" truevalue="" falsevalue="--no-match-adapter-wildcards" label="Match wilcards in adapters" help="Interpret IUPAC wildcards in adapters."/>
-            <param argument="--revcomp" type="boolean" checked="false" truevalue="--revcomp" falsevalue="" label="Look for adapters in the reverse complement" help="Check both the read and its reverse complement for adapter matches. If match is on reverse-complemented version, output that one. Default: check only read." />
+            <param argument="--no-match-adapter-wildcards" type="boolean" checked="true" truevalue="" falsevalue="-N" label="Match wildcards in adapters" help="Interpret IUPAC wildcards in adapters."/>


Why not use the long one? Bit confusing, or?

We could, sure, but the current version just produces unnecessarily long command lines, which need to be stored by Galaxy. Not sold on using short option names throughout though. @bgruening what's your opinion here?

if we use argument, it will be part of the help text. From an educational point of view it would be nice to show the same param in the help and in the CLI

sure. https://galaxy-iuc-standards.readthedocs.io/en/latest/best_practices/tool_xml.html#parameter-name-argument-and-help says that argument should use the long form of the wrapped option, which then means that we would always use the long form on the command line, too.
If that's the consensus, then I'll revert the commit introducing all the short options.

I like that.

bernt-matthias · 2024-04-16T16:06:31Z

tools/cutadapt/cutadapt.xml

+        <section name="filter_options" title="Read Filtering Options">
+            <param argument="--discard-trimmed" type="boolean" checked="false" truevalue="--discard-trimmed" falsevalue="" label="Discard Trimmed Reads" help="Discard reads that contain the adapter instead of trimming them. Use the 'Minimum overlap length' option in order to avoid throwing away too many randomly matching reads!" />
+            <param argument="--discard_untrimmed" type="boolean" checked="false" truevalue="--discard-untrimmed" falsevalue="" label="Discard Untrimmed Reads" help="Discard reads that do not contain the adapter." />
+            <param argument="--minimum-length" type="integer" min="0" value="1" label="Minimum length (R1)" help="Discard reads that, after processing, are shorter than LENGTH. Note: You can set this parameter to zero to keep empty reads (with zero-length sequence and quality string) in the output, but some downstream tools may have problems with these. Default: 1" />


Is the default of the command line tool 1? Most of the time I go with those defaults .. but I see the argument for not doing so.

Well spotted :-)

The problem with the 0 default of the command line tool was what triggered this whole PR.
With this setting it's possible that the trimmed output contains empty reads like here:

tools-iuc/tools/cutadapt/test-data/cutadapt_shorten_3prime.out

Lines 5 to 9 in edb9fd7

@prefix:1_13_1259/1

+

@prefix:1_13_1440/1

We learnt this week that such output causes failures in RNAStar and picard Fastq2Sam, and floods the stdout of HISAT2 with warnings. Enough reason for us to change the default.
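
To check an existing output for this problem, something like the following could be used. It is a minimal sketch that assumes strict 4-line FASTQ records; the function name is hypothetical, and the second record in the example is made up for contrast.

```python
import io


def count_empty_reads(handle):
    """Count 4-line FASTQ records whose sequence line is empty."""
    empty = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        if header.startswith("@") and seq == "":
            empty += 1
    return empty


# First record mirrors the empty read shown above; the second one is invented.
example = "@prefix:1_13_1259/1\n\n+\n\n@prefix:1_13_1440/1\nACGT\n+\nIIII\n"

if __name__ == "__main__":
    print(count_empty_reads(io.StringIO(example)))
```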

Sorry, wasn't finished yet with an explanation of the PR and didn't expect so fast a review ;-)

These tests need -m 0, which was the old default for the param.

wm75 · 2024-04-16T16:30:37Z

> Would named macro tokens be an option, i.e. using <yield name="xyz"/>?
>
> Then we could insert the read2 parameters only if needed .. and at the correct positions.

That would require a huge macro though that would span like 2/3 of the whole wrapper if I understand you correctly. Would also change the nesting level of tons of params.

Do we want to go that far to provide the best user experience?

wm75 · 2024-04-17T11:19:13Z

@bernt-matthias ready for another round of comments/review :-)

Switch back to use long options where possible to match param argument values. Also use = as param/value separator because this gives nicer cutadapt json output.

bgruening · 2024-04-24T07:39:56Z

LGTM, @bernt-matthias?

mvdbeek · 2024-04-24T15:45:58Z

tools/cutadapt/cutadapt.xml

-                <option value="both">Both: filtering criteria must apply to both reads in order for a read pair to be discarded. </option>
-                <option value="first">First: will make a decision about the read pair by inspecting whether the filtering criterion applies to the first read, ignoring the second read.</option>
-
+        <section name="other_trimming_options" title="Other Read Trimming Options">


This is going to require rechecking every parameter within the changed sections in workflows, as we can't track that automatically. Is that worth it? Not saying it isn't, but it's something that looks like a good amount of work.

You have a point here, which is worth considering.
Personally, I think all this restructuring is worth it because it will make usage of the tool a lot clearer for the average user, but that's just my opinion.
The specific change you've highlighted affects just 3 parameters I think, but, yes, I've moved around quite a few others between sections as well.
My point would be that it's better to move everything in one update, then hope that we'll be able to keep things constant for a while, than doing lots of incremental changes that each break workflow updates.

> The specific change you've highlighted affects just 3 parameters I think,

Got lost trying to read the diff so that's not true. More like 10 params.

Very unlikely that any WF deviates from the default for a lot of them at once, but still you need to check whether that's the case.

What we could do as a compromise is to combine things in the existing "filter_options" section, change its title to sth like "Other trimming and filtering options" and simply reorder things properly within that section so that trimming options come before filtering ones. You're right that that would minimize effort during WF updates.

So now that I looked at this a bit more carefully: the options that I've moved into that new Other Trimming Options section all come from the Read Modifications section, which really mixed up different concepts.

The problem there is that the trimming options (cutting, trimming, shortening) should be presented before the filtering options because trimming also happens before filtering, but the remaining read modification options are really quite exotic and should go last I think. So a split of this section then seems unavoidable.

Currently, the split moves 6 out of 10 options to the new section and leaves 4 in the original one. Also, these 6 are more commonly used than the 4 that are left behind.
What I could do is to move the existing section up above filtering and split the 4 more exotic options out into a new bottom section. This way "only" 4 instead of 6 options would have to be rechecked on WF updates.
The downside is that the section that I wanted to name "other_trimming_options" will then be called "read_mod_options", and we'd have to think about a new name for the bottom section.

So all in all, a moderate gain in terms of WF maintainability vs somewhat inadequate internal section names. I'm undecided.

Here's what things would look like with the proposed split for comparison:

If you do change the parameters, I would stick with changing all at once; whether you check 4 or 10 options doesn't make a difference I think.
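
As a rough way to enumerate what would need rechecking, the old and new section layouts could be compared like this. It is only a sketch: the helper is hypothetical, the section names are the ones mentioned in this thread, and the parameter names `p1` to `p10` are placeholders rather than the real cutadapt params.

```python
def moved_params(old_layout, new_layout):
    """Return {param: (old_section, new_section)} for params whose section changed."""
    old_home = {p: s for s, params in old_layout.items() for p in params}
    new_home = {p: s for s, params in new_layout.items() for p in params}
    return {
        p: (old_home[p], new_home[p])
        for p in old_home
        if p in new_home and old_home[p] != new_home[p]
    }


# Placeholder layouts mirroring the 6-out-of-10 split discussed above.
old = {"read_mod_options": [f"p{i}" for i in range(1, 11)]}
new = {
    "other_trimming_options": [f"p{i}" for i in range(1, 7)],
    "read_mod_options": [f"p{i}" for i in range(7, 11)],
}

if __name__ == "__main__":
    for name, (src, dst) in sorted(moved_params(old, new).items()):
        print(f"{name}: {src} -> {dst}")
```

Any param reported here lives under a different section in the new wrapper, so its value in an existing workflow would have to be checked by hand after the update.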

bgruening · 2024-05-06T18:26:28Z

@mvdbeek any feedback on Wolfgang's latest comments?

wm75 added 10 commits April 16, 2024 10:41

Fix length filtering options and some cleanup

194dec0

Further cleanup and better grouping of cli options

c7c0665

Add new Other Trimming Options section

7f8be16

Fix --discard-casava option and minor improvements

d078741

Bring back PE options in their corresponding sections

c2340d2

Use short options where possible, fix --max-aer

8fe8a2f

Fix typo

7763b60

Adjust tests to new cli options use

3e32a4a

Test failure fixes

aad1770

More fixes

a9daa58

bernt-matthias reviewed Apr 16, 2024


Fix 3 tests with empty reads test output

2b7bc44

These tests need -m 0, which was the old default for the param.

wm75 added 5 commits April 17, 2024 09:19

Fix two more tests that expect empty reads in output

ef4e4fc

Adjust json output-based test

9c42740

Support --pair-adapters option

e70ea34

Add test for --pair-adapters option

526bac7

Fix output filters

1c52875

wm75 added 3 commits April 23, 2024 12:44

Revert option format switch

7dae425

Switch back to use long options where possible to match param argument values. Also use = as param/value separator because this gives nicer cutadapt json output.

Fix param names

663bad9

Fix two remaining tests

52ffd32

bernt-matthias approved these changes Apr 24, 2024


mvdbeek reviewed Apr 24, 2024


mvdbeek approved these changes May 7, 2024


bgruening merged commit a5b6cb4 into galaxyproject:main May 17, 2024
11 checks passed
