New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
change gff re to include gff3 #1191
base: dev
Are you sure you want to change the base?
Conversation
|
Thanks @JOAQUINGR ! Have you tested the pipeline with a GFF3 annotation and it works? Asking just in case we need any other modifications to properly support the format. Also, be great if you can update the CHANGELOG please. |
Hi @drpatelh ! Sorry for the messy commits, it's all fixed now. It looks like Prettier formatting has been updated because I had to run "prettier --write .devcontainer/devcontainer.json" to complete the tests. Yes, I've being doing some tests and:
As a result, the transcript_counts file looks like this:
I've modified the bin python file to correct these situations.
Could be enough with
at this point? |
Pinging @pinin4fjords for review here since he wrote the filter script. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is more complex than it needs to be, assuming you just want to remove transcript:
from lines and allow for a new extension. No need to disrupt the existing tab-checking logic.
@@ -20,16 +19,24 @@ def extract_fasta_seq_names(fasta_name: str) -> Set[str]: | |||
return {line[1:].split(None, 1)[0] for line in fasta if line.startswith(">")} | |||
|
|||
|
|||
def tab_delimited(file: str) -> float: | |||
"""Check if file is tab-delimited and return median number of tabs.""" | |||
def tab_checks(file: str) -> (bool, bool): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function doesn't need to be changed (see below), and you've changed the existing logic- could you revert please?
@@ -46,6 +53,8 @@ def filter_gtf(fasta: str, gtf_in: str, filtered_gtf_out: str, skip_transcript_i | |||
|
|||
if seq_name in seq_names_in_genome: | |||
if skip_transcript_id_check or re.search(r'transcript_id "([^"]+)"', line): | |||
if extra_id: | |||
line = line.replace("transcript:", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to make this conditional. All you did above was add a check on whether the line contained this prefix. Since this replacement won't happen unless it does, that logic is redundant and this line.replace alone will be sufficient.
But could you amend to something that only replaces transcript:
in the relevant part of the string, with a regex? I'm a bit nervous about it applying across the whole line.
}, | ||
|
||
// Add the IDs of extensions you want installed when the container is created. | ||
"extensions": ["ms-python.python", "ms-python.vscode-pylance", "nf-core.nf-core-extensionpack"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to fix up Git etc so you're not changing this file.
Converting to draft as it appears like more work needs to be done here. Thanks for the review @pinin4fjords ! |
Hi nf-core team,
I work with genomes from different species, and many times people use 'gff3' instead of 'gff' extension in the annotations file name, to differentiate GFF2 and GFF3 I guess. This pull request introduce 'gff3' extension for annotation file name specification. I modified the gff regex.
PR checklist