change gff re to include gff3 #1191

roldanjg · 2024-01-14T18:45:01Z

Hi nf-core team,
I work with genomes from different species, and people often use the 'gff3' extension instead of 'gff' in annotation file names, I guess to differentiate GFF2 from GFF3. This pull request introduces support for the 'gff3' extension in the annotation file name specification by modifying the gff regex.
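As a rough illustration of the kind of regex change described here, the sketch below accepts both extensions. The pipeline's actual pattern is not quoted in this thread, so the pattern itself and the names `GFF_PATTERN` and `is_gff_path` are assumptions made for the example.

```python
import re

# Assumed pattern (not copied from the pipeline): a path without whitespace
# ending in .gff or .gff3, optionally gzip-compressed.
GFF_PATTERN = re.compile(r"^\S+\.gff3?(\.gz)?$")


def is_gff_path(path: str) -> bool:
    """Return True if the path looks like a GFF or GFF3 annotation file name."""
    return GFF_PATTERN.match(path) is not None
```

The only difference from a plain `.gff` pattern is the optional `3`, so existing file names keep matching.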

PR checklist

This comment contains a description of changes (with reason).
If necessary, include test data in your PR.
Remove all TODO statements.
Follow the naming conventions.
Follow the parameters requirements.
Follow the input/output options guidelines.

github-actions · 2024-01-14T18:47:48Z

`nf-core lint` overall result: Passed ✅ ⚠️

Posted for pipeline commit 20310bd

✅ 146 tests passed
❔ 6 tests were ignored
❗ 4 tests had warnings

❗ Test warnings:

files_exist - File not found: .github/workflows/awstest.yml
files_exist - File not found: .github/workflows/awsfulltest.yml
pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
pipeline_todos - TODO string in WorkflowRnaseq.groovy: Optionally add in-text citation tools to this list.

❔ Tests ignored:

files_unchanged - File ignored due to lint config: assets/email_template.html
files_unchanged - File ignored due to lint config: assets/email_template.txt
files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
files_unchanged - File ignored due to lint config: .gitignore or .prettierignore or pyproject.toml
actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/rnaseq/rnaseq/.github/workflows/awstest.yml
multiqc_config - multiqc_config

✅ Tests passed:

files_exist - File found: .gitattributes
files_exist - File found: .gitignore
files_exist - File found: .nf-core.yml
files_exist - File found: .editorconfig
files_exist - File found: .prettierignore
files_exist - File found: .prettierrc.yml
files_exist - File found: CHANGELOG.md
files_exist - File found: CITATIONS.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: CODE_OF_CONDUCT.md
files_exist - File found: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_exist - File found: nextflow_schema.json
files_exist - File found: nextflow.config
files_exist - File found: README.md
files_exist - File found: .github/.dockstore.yml
files_exist - File found: .github/CONTRIBUTING.md
files_exist - File found: .github/ISSUE_TEMPLATE/bug_report.yml
files_exist - File found: .github/ISSUE_TEMPLATE/config.yml
files_exist - File found: .github/ISSUE_TEMPLATE/feature_request.yml
files_exist - File found: .github/PULL_REQUEST_TEMPLATE.md
files_exist - File found: .github/workflows/branch.yml
files_exist - File found: .github/workflows/ci.yml
files_exist - File found: .github/workflows/linting_comment.yml
files_exist - File found: .github/workflows/linting.yml
files_exist - File found: assets/email_template.html
files_exist - File found: assets/email_template.txt
files_exist - File found: assets/sendmail_template.txt
files_exist - File found: assets/nf-core-rnaseq_logo_light.png
files_exist - File found: conf/modules.config
files_exist - File found: conf/test.config
files_exist - File found: conf/test_full.config
files_exist - File found: docs/images/nf-core-rnaseq_logo_light.png
files_exist - File found: docs/images/nf-core-rnaseq_logo_dark.png
files_exist - File found: docs/output.md
files_exist - File found: docs/README.md
files_exist - File found: docs/README.md
files_exist - File found: docs/usage.md
files_exist - File found: lib/nfcore_external_java_deps.jar
files_exist - File found: lib/NfcoreTemplate.groovy
files_exist - File found: lib/Utils.groovy
files_exist - File found: lib/WorkflowMain.groovy
files_exist - File found: main.nf
files_exist - File found: assets/multiqc_config.yml
files_exist - File found: conf/base.config
files_exist - File found: conf/igenomes.config
files_exist - File found: lib/WorkflowRnaseq.groovy
files_exist - File found: modules.json
files_exist - File found: pyproject.toml
files_exist - File not found check: Singularity
files_exist - File not found check: parameters.settings.json
files_exist - File not found check: pipeline_template.yml
files_exist - File not found check: .nf-core.yaml
files_exist - File not found check: bin/markdown_to_html.r
files_exist - File not found check: conf/aws.config
files_exist - File not found check: .github/workflows/push_dockerhub.yml
files_exist - File not found check: .github/ISSUE_TEMPLATE/bug_report.md
files_exist - File not found check: .github/ISSUE_TEMPLATE/feature_request.md
files_exist - File not found check: docs/images/nf-core-rnaseq_logo.png
files_exist - File not found check: .markdownlint.yml
files_exist - File not found check: .yamllint.yml
files_exist - File not found check: lib/Checks.groovy
files_exist - File not found check: lib/Completion.groovy
files_exist - File not found check: lib/Workflow.groovy
files_exist - File not found check: .travis.yml
nextflow_config - Config variable found: manifest.name
nextflow_config - Config variable found: manifest.nextflowVersion
nextflow_config - Config variable found: manifest.description
nextflow_config - Config variable found: manifest.version
nextflow_config - Config variable found: manifest.homePage
nextflow_config - Config variable found: timeline.enabled
nextflow_config - Config variable found: trace.enabled
nextflow_config - Config variable found: report.enabled
nextflow_config - Config variable found: dag.enabled
nextflow_config - Config variable found: process.cpus
nextflow_config - Config variable found: process.memory
nextflow_config - Config variable found: process.time
nextflow_config - Config variable found: params.outdir
nextflow_config - Config variable found: params.input
nextflow_config - Config variable found: params.validationShowHiddenParams
nextflow_config - Config variable found: params.validationSchemaIgnoreParams
nextflow_config - Config variable found: manifest.mainScript
nextflow_config - Config variable found: timeline.file
nextflow_config - Config variable found: trace.file
nextflow_config - Config variable found: report.file
nextflow_config - Config variable found: dag.file
nextflow_config - Config variable (correctly) not found: params.nf_required_version
nextflow_config - Config variable (correctly) not found: params.container
nextflow_config - Config variable (correctly) not found: params.singleEnd
nextflow_config - Config variable (correctly) not found: params.igenomesIgnore
nextflow_config - Config variable (correctly) not found: params.name
nextflow_config - Config variable (correctly) not found: params.enable_conda
nextflow_config - Config timeline.enabled had correct value: true
nextflow_config - Config report.enabled had correct value: true
nextflow_config - Config trace.enabled had correct value: true
nextflow_config - Config dag.enabled had correct value: true
nextflow_config - Config manifest.name began with nf-core/
nextflow_config - Config variable manifest.homePage began with https://github.com/nf-core/
nextflow_config - Config dag.file ended with .html
nextflow_config - Config variable manifest.nextflowVersion started with >= or !>=
nextflow_config - Config manifest.version ends in dev: 3.15.0dev
nextflow_config - Config params.custom_config_version is set to master
nextflow_config - Config params.custom_config_base is set to https://raw.githubusercontent.com/nf-core/configs/master
nextflow_config - Lines for loading custom profiles found
nextflow_config - nextflow.config contains configuration profile test
files_unchanged - .gitattributes matches the template
files_unchanged - .prettierrc.yml matches the template
files_unchanged - CODE_OF_CONDUCT.md matches the template
files_unchanged - LICENSE matches the template
files_unchanged - .github/.dockstore.yml matches the template
files_unchanged - .github/CONTRIBUTING.md matches the template
files_unchanged - .github/ISSUE_TEMPLATE/bug_report.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/config.yml matches the template
files_unchanged - .github/ISSUE_TEMPLATE/feature_request.yml matches the template
files_unchanged - .github/PULL_REQUEST_TEMPLATE.md matches the template
files_unchanged - .github/workflows/branch.yml matches the template
files_unchanged - .github/workflows/linting_comment.yml matches the template
files_unchanged - .github/workflows/linting.yml matches the template
files_unchanged - assets/sendmail_template.txt matches the template
files_unchanged - assets/nf-core-rnaseq_logo_light.png matches the template
files_unchanged - docs/images/nf-core-rnaseq_logo_light.png matches the template
files_unchanged - docs/images/nf-core-rnaseq_logo_dark.png matches the template
files_unchanged - docs/README.md matches the template
files_unchanged - lib/nfcore_external_java_deps.jar matches the template
actions_ci - '.github/workflows/ci.yml' is triggered on expected events
actions_ci - '.github/workflows/ci.yml' checks minimum NF version
readme - README Nextflow minimum version badge matched config. Badge: 23.04.0, Config: 23.04.0
readme - README Zenodo placeholder was replaced with DOI.
pipeline_name_conventions - Name adheres to nf-core convention
template_strings - Did not find any Jinja template strings (273 files)
schema_lint - Schema lint passed
schema_lint - Schema title + description lint passed
schema_lint - Input mimetype lint passed: 'text/csv'
schema_params - Schema matched params returned from nextflow config
system_exit - No System.exit calls found
actions_schema_validation - Workflow validation passed: clean-up.yml
actions_schema_validation - Workflow validation passed: linting_comment.yml
actions_schema_validation - Workflow validation passed: fix-linting.yml
actions_schema_validation - Workflow validation passed: cloud_tests_small.yml
actions_schema_validation - Workflow validation passed: branch.yml
actions_schema_validation - Workflow validation passed: linting.yml
actions_schema_validation - Workflow validation passed: release-announcements.yml
actions_schema_validation - Workflow validation passed: ci.yml
actions_schema_validation - Workflow validation passed: cloud_tests_full.yml
merge_markers - No merge markers found in pipeline files
modules_json - Only installed modules found in modules.json
modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'

Run details

nf-core/tools version 2.11.1
Run at 2024-01-19 09:37:24

drpatelh · 2024-01-15T09:30:14Z

Thanks @JOAQUINGR ! Have you tested that the pipeline works with a GFF3 annotation? Asking just in case we need any other modifications to properly support the format. Also, it would be great if you could update the CHANGELOG please.

roldanjg · 2024-01-18T16:54:35Z

Hi @drpatelh ! Sorry for the messy commits, it's all fixed now. It looks like Prettier formatting has been updated because I had to run "prettier --write .devcontainer/devcontainer.json" to complete the tests.

Yes, I've been doing some tests and:

I think that "gffread" is the way to go for the semantic conversion, at least for this pipeline; I don't know whether the simplified GTF2 is enough for others.
The documentation:
https://github.com/gpertea/gffread/blob/master/examples/README.md
It's quite hard to get a standard and automated parser, but I think this one is great and goes as far as you can go automatically. Just being picky, there is one case we are not considering that I've seen in some GFF files from Ensembl. Example: https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-58/gff3/arabidopsis_thaliana/README

transcript types:
* ID: Unique identifier, format "transcript:<transcript_stable_id>"

As a result, the transcript_counts file looks like this:

tx gene_id ERX2558928 ERX2558929
transcript:AT1G01010.1 AT1G01010 0 0
transcript:AT1G01020.1 AT1G01020 0 0
transcript:AT1G01020.2 AT1G01020 0 0
transcript:AT1G01020.3 AT1G01020 0 0

I've modified the bin python file to correct these situations.
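To show the intended effect on the table above, here is a minimal sketch of stripping the prefix from the first column. It is not the change made in this PR; `strip_prefix` and the sample `rows` are hypothetical and only reuse the IDs shown above.

```python
def strip_prefix(identifier: str, prefix: str = "transcript:") -> str:
    """Remove a leading prefix from an ID; IDs without it are returned unchanged."""
    if identifier.startswith(prefix):
        return identifier[len(prefix):]
    return identifier


# Two rows from the transcript_counts example, tab-separated.
rows = [
    "transcript:AT1G01010.1\tAT1G01010\t0\t0",
    "transcript:AT1G01020.1\tAT1G01020\t0\t0",
]

cleaned = []
for row in rows:
    # Only the first column (tx) is rewritten; the rest is kept as-is.
    tx, rest = row.split("\t", 1)
    cleaned.append(f"{strip_prefix(tx)}\t{rest}")
```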

I think we're already restrictive enough about which GFF files are allowed to proceed downstream in the pipeline, so maybe these lines in bin/gtf2bed are redundant:

$gff = 2 if /^##gff-version 2/;
$gff = 3 if /^##gff-version 3/;
next if /^#/ && $gff;
s/\s+$//;
# 0-chr 1-src 2-feat 3-beg 4-end 5-scor 6-dir 7-fram 8-attr
my @f = split /\t/;
if ($gff) {
    # most ver 2's stick gene names in the id field
    ($id) = $f[8]=~ /\bID="([^"]+)"/;
    # most ver 3's stick unquoted names in the name field
    ($id) = $f[8]=~ /\bName=([^";]+)/ if !$id && $gff == 3;
} else {
    ($id) = $f[8]=~ /transcript_id "([^"]+)"/;
}

Would it be enough to use only

($id) = $f[8]=~ /transcript_id "([^"]+)"/;

at this point?
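To make the three branches of the Perl snippet easier to compare, here is a rough Python rendering of the same ID selection. It is for illustration only: `extract_id` is a hypothetical helper, and Perl's truthiness check on `$id` is approximated with a simple "no match" test.

```python
import re
from typing import Optional


def extract_id(attributes: str, gff_version: Optional[int] = None) -> Optional[str]:
    """Pick a feature ID from the attribute column (column 9)."""
    if gff_version:
        # GFF branch: look for a quoted ID="..." first.
        match = re.search(r'\bID="([^"]+)"', attributes)
        # GFF3 fallback: an unquoted Name=... value.
        if match is None and gff_version == 3:
            match = re.search(r'\bName=([^";]+)', attributes)
    else:
        # GTF branch: the quoted transcript_id attribute.
        match = re.search(r'transcript_id "([^"]+)"', attributes)
    return match.group(1) if match else None
```

If only GTF ever reaches this script, only the final branch is exercised, which is the simplification being asked about.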

drpatelh · 2024-03-12T09:23:22Z

Pinging @pinin4fjords for review here since he wrote the filter script.

pinin4fjords

I think this is more complex than it needs to be, assuming you just want to remove transcript: from lines and allow for a new extension. No need to disrupt the existing tab-checking logic.

pinin4fjords · 2024-03-12T10:20:54Z

bin/filter_gtf.py

@@ -20,16 +19,24 @@ def extract_fasta_seq_names(fasta_name: str) -> Set[str]:
        return {line[1:].split(None, 1)[0] for line in fasta if line.startswith(">")}


-def tab_delimited(file: str) -> float:
-    """Check if file is tab-delimited and return median number of tabs."""
+def tab_checks(file: str) -> (bool, bool):


This function doesn't need to be changed (see below), and you've changed the existing logic. Could you revert please?
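For context, the diff only shows the signature and docstring of the function being replaced. A minimal sketch consistent with them might look like the following; the body, including the 1024-character sample size, is an assumption rather than the script's actual code.

```python
import statistics


def tab_delimited(file: str) -> float:
    """Check if file is tab-delimited and return median number of tabs."""
    with open(file, "r") as handle:
        # Assumption: sampling the start of the file is representative enough.
        sample = handle.read(1024)
    # Count tabs per sampled line and take the median across lines.
    return statistics.median(line.count("\t") for line in sample.split("\n"))
```

A caller can treat a median of zero as "not tab-delimited", which keeps the check to a single return value.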

pinin4fjords · 2024-03-12T10:24:16Z

bin/filter_gtf.py

@@ -46,6 +53,8 @@ def filter_gtf(fasta: str, gtf_in: str, filtered_gtf_out: str, skip_transcript_i

                if seq_name in seq_names_in_genome:
                    if skip_transcript_id_check or re.search(r'transcript_id "([^"]+)"', line):
+                        if extra_id:
+                            line = line.replace("transcript:", "")


There's no need to make this conditional. All you did above was add a check on whether the line contained this prefix. Since this replacement won't happen unless it does, that logic is redundant and this line.replace alone will be sufficient.

But could you amend to something that only replaces transcript: in the relevant part of the string, with a regex? I'm a bit nervous about it applying across the whole line.
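Something along these lines would restrict the replacement to the transcript_id attribute. This is a sketch, not tested against the pipeline; `TRANSCRIPT_PREFIX` and `strip_transcript_prefix` are hypothetical names.

```python
import re

# Match "transcript:" only when it directly follows the opening quote
# of the transcript_id attribute value.
TRANSCRIPT_PREFIX = re.compile(r'(transcript_id ")transcript:')


def strip_transcript_prefix(line: str) -> str:
    """Drop a 'transcript:' prefix inside transcript_id, leaving the rest of the line alone."""
    return TRANSCRIPT_PREFIX.sub(r"\1", line)
```

Like `line.replace`, the substitution leaves lines without the prefix unchanged, so no separate check is needed before calling it.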

pinin4fjords · 2024-03-12T10:25:03Z

.devcontainer/devcontainer.json

            },

            // Add the IDs of extensions you want installed when the container is created.
-            "extensions": ["ms-python.python", "ms-python.vscode-pylance", "nf-core.nf-core-extensionpack"]


Still need to fix up Git etc so you're not changing this file.

drpatelh · 2024-05-13T08:12:54Z

Converting to draft as it appears that more work needs to be done here. Thanks for the review @pinin4fjords !

change gff re to include gff3

8dedcf5

roldanjg added 3 commits January 15, 2024 18:23

prepare_genome fatas index optimization

69abf96

Undo the last commit made by mistake

59e4293

Update the changelog

e9353ff

roldanjg marked this pull request as draft January 18, 2024 11:55

Fixed prettier linting error

be4ba79

roldanjg added 2 commits January 18, 2024 16:56

GFF3 transcripts standardization

00d866b

Black format correct

ebeef19

roldanjg marked this pull request as ready for review January 18, 2024 17:47

Tab checks bug fixed to pass all the tests.

20310bd

pinin4fjords requested changes Mar 12, 2024


drpatelh marked this pull request as draft May 13, 2024 08:12

drpatelh mentioned this pull request May 13, 2024

GFF file regex #1190

Open
