Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change gff re to include gff3 #1191

Draft
wants to merge 8 commits into
base: dev
Choose a base branch
from
Draft

change gff re to include gff3 #1191

wants to merge 8 commits into from

Conversation

roldanjg
Copy link

@roldanjg roldanjg commented Jan 14, 2024

Hi nf-core team,
I work with genomes from different species, and many times people use 'gff3' instead of 'gff' extension in the annotations file name, to differentiate GFF2 and GFF3 I guess. This pull request introduce 'gff3' extension for annotation file name specification. I modified the gff regex.

PR checklist

  • This comment contains a description of changes (with reason).
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.

Copy link

github-actions bot commented Jan 14, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 20310bd

+| ✅ 146 tests passed       |+
#| ❔   6 tests were ignored |#
!| ❗   4 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in WorkflowRnaseq.groovy: Optionally add in-text citation tools to this list.

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore or pyproject.toml
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/rnaseq/rnaseq/.github/workflows/awstest.yml
  • multiqc_config - multiqc_config

✅ Tests passed:

Run details

  • nf-core/tools version 2.11.1
  • Run at 2024-01-19 09:37:24

@drpatelh
Copy link
Member

Thanks @JOAQUINGR ! Have you tested the pipeline with a GFF3 annotation and it works? Asking just in case we need any other modifications to properly support the format. Also, be great if you can update the CHANGELOG please.

@roldanjg roldanjg marked this pull request as draft January 18, 2024 11:55
@roldanjg
Copy link
Author

roldanjg commented Jan 18, 2024

Hi @drpatelh ! Sorry for the messy commits, it's all fixed now. It looks like Prettier formatting has been updated because I had to run "prettier --write .devcontainer/devcontainer.json" to complete the tests.

Yes, I've being doing some tests and:

  1. I think that "gffread" is the one to go for the semantic conversion, at least for this pipeline, I don't know if the simplified GTF2 is enough for others.
    The documentation:
    https://github.com/gpertea/gffread/blob/master/examples/README.md

  2. It's quite hard to get a standard and automated parser. But I think that this one is great and goes as far as you can go automatically. Just being picky, there is a possibility that we are not considering and that I've seen in some GFF from Ensembl; Example: https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-58/gff3/arabidopsis_thaliana/README

  • transcript types:
    * ID: Unique identifier, format "transcript:<transcript_stable_id>"

As a result, the transcript_counts file looks like this:

tx gene_id ERX2558928 ERX2558929
transcript:AT1G01010.1 AT1G01010 0 0
transcript:AT1G01020.1 AT1G01020 0 0
transcript:AT1G01020.2 AT1G01020 0 0
transcript:AT1G01020.3 AT1G01020 0 0

I've modified the bin python file to correct these situations.

  1. I think we're restrictive enough about letting a GFF go downstream the pipeline, so maybe these lines in bin/gtf2bed are redundant:

$gff = 2 if /^##gff-version 2/;
$gff = 3 if /^##gff-version 3/;
next if /^#/ && $gff;
s/\s+$//;
# 0-chr 1-src 2-feat 3-beg 4-end 5-scor 6-dir 7-fram 8-attr
my @f = split /\t/;
if ($gff) {
# most ver 2's stick gene names in the id field
($id) = $f[8]=~ /\bID="([^"]+)"/;
# most ver 3's stick unquoted names in the name field
($id) = $f[8]=~ /\bName=([^";]+)/ if !$id && $gff == 3;
} else {
($id) = $f[8]=~ /transcript_id "([^"]+)"/;
}

Could be enough with

($id) = $f[8]=~ /transcript_id "([^"]+)"/;

at this point?

@roldanjg roldanjg marked this pull request as ready for review January 18, 2024 17:47
@drpatelh
Copy link
Member

Pinging @pinin4fjords for review here since he wrote the filter script.

Copy link
Member

@pinin4fjords pinin4fjords left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is more complex than it needs to be, assuming you just want to remove transcript: from lines and allow for a new extension. No need to disrupt the existing tab-checking logic.

@@ -20,16 +19,24 @@ def extract_fasta_seq_names(fasta_name: str) -> Set[str]:
return {line[1:].split(None, 1)[0] for line in fasta if line.startswith(">")}


def tab_delimited(file: str) -> float:
"""Check if file is tab-delimited and return median number of tabs."""
def tab_checks(file: str) -> (bool, bool):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't need to be changed (see below), and you've changed the existing logic- could you revert please?

@@ -46,6 +53,8 @@ def filter_gtf(fasta: str, gtf_in: str, filtered_gtf_out: str, skip_transcript_i

if seq_name in seq_names_in_genome:
if skip_transcript_id_check or re.search(r'transcript_id "([^"]+)"', line):
if extra_id:
line = line.replace("transcript:", "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to make this conditional. All you did above was add a check on whether the line contained this prefix. Since this replacement won't happen unless it does, that logic is redundant and this line.replace alone will be sufficient.

But could you amend to something that only replaces transcript: in the relevant part of the string, with a regex? I'm a bit nervous about it applying across the whole line.

},

// Add the IDs of extensions you want installed when the container is created.
"extensions": ["ms-python.python", "ms-python.vscode-pylance", "nf-core.nf-core-extensionpack"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to fix up Git etc so you're not changing this file.

@drpatelh drpatelh marked this pull request as draft May 13, 2024 08:12
@drpatelh
Copy link
Member

Converting to draft as it appears like more work needs to be done here. Thanks for the review @pinin4fjords !

@drpatelh drpatelh mentioned this pull request May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants