Failed to parse Saccharomyces cerevisiae S288C chromosome IX #313

Open
Koeng101 opened this issue May 29, 2023 · 8 comments
Labels
bug: Something isn't working; high priority: something is broken or missing that is critical for users or developers; intermediate: Will take some time to fix
Milestone

Comments

@Koeng101
Contributor

Koeng101 commented May 29, 2023

the <155222 is confusing parseLocation.

2023/05/29 10:24:00 Failed to parse ix.gb with err: strconv.Atoi: parsing "<155222": invalid syntax

What does it even mean tho

https://www.ncbi.nlm.nih.gov/nuccore/NC_001141.2

     gene            <155222..>155765
                     /gene="COX5B"
                     /locus_tag="YIL111W"
                     /db_xref="GeneID:854695"
     mRNA            join(<155222,155311..>155765)
                     /gene="COX5B"
                     /locus_tag="YIL111W"
                     /product="cytochrome c oxidase subunit Vb"
                     /transcript_id="NM_001179459.1"
                     /db_xref="GeneID:854695"
     CDS             join(155222,155311..155765)
                     /gene="COX5B"
                     /locus_tag="YIL111W"
                     /experiment="EXISTENCE:direct assay:GO:0005739
                     mitochondrion [PMID:16823961|PMID:24769239]"
                     /experiment="EXISTENCE:direct assay:GO:0005751
                     mitochondrial respiratory chain complex IV [PMID:2986105]"
                     /experiment="EXISTENCE:direct assay:GO:0006123
                     mitochondrial electron transport, cytochrome c to oxygen
                     [PMID:1331058]"
                     /experiment="EXISTENCE:mutant phenotype:GO:0004129
                     cytochrome-c oxidase activity [PMID:2986105]"
                     /experiment="EXISTENCE:mutant phenotype:GO:0050421 nitrite
                     reductase (NO-forming) activity [PMID:18388202]"
                     /note="Subunit Vb of cytochrome c oxidase; cytochrome c
                     oxidase is the terminal member of the mitochondrial inner
                     membrane electron transport chain; Cox5Bp is predominantly
                     expressed during anaerobic growth while its isoform Va
                     (Cox5Ap) is expressed during aerobic growth; COX5B has a
                     paralog, COX5A, that arose from the whole genome
                     duplication"
                     /codon_start=1
                     /product="cytochrome c oxidase subunit Vb"
                     /protein_id="NP_012155.1"
                     /db_xref="GeneID:854695"
                     /db_xref="SGD:S000001373"
                     /translation="MLRTSLTKGARLTGTRFVQTKALSKATLTDLPERWENMPNLEQK
                     EIADNLTERQKLPWKTLNNEEIKAAWYISYGEWGPRRPVHGKGDVAFITKGVFLGLGI
                     SFGLFGLVRLLANPETPKTMNREWQLKSDEYLKSKNANPWGGYSQVQSK"

@soypat
Contributor

soypat commented Jun 10, 2023

> What does it even mean tho

It looks like the parser tried to convert the string <155222 to an integer, which fails because it contains the non-numeric character "<". Looks like an off-by-one error when extracting the integer substring.

@Koeng101
Contributor Author

I know what the code means, but it is pretty unclear what it biologically means. All 3 of those are referring to the same gene/mRNA/CDS... but each one uses a different location string - and it looks like the gene at least is lossy.

<155222..>155765 doesn't make sense because it doesn't say where the gene actually starts (like with join(<155222,155311..>155765), which basically says there is an intron from 155222 to 155311, and then from 155311 to 155765 there is a gene). The better way to write that would be join(155222,155311..155765), but semantically I think they mean the same thing.

@carreter
Collaborator

Status update on this? Does it still need fixing?

@Koeng101
Contributor Author

I don't think it has been fixed. It does need fixing

I think the difficult part here is parsing out the join properly - without keeping a map of locus_tags, I'm not sure you can even parse <155222..>155765 properly, at all. It doesn't contain all the information necessary to get the sequence out. We could also just accept that it is fucked up, and not try to fix it all. I kinda like that solution. Here is what SnapGene displays:

[Image: Screen Shot 2023-09-15 at 2 07 58 PM]

I personally think this is a fine solution so long as we note it somewhere. We should probably have a note somewhere in the file of all the location exception cases we find.

@carreter carreter added bug Something isn't working medium priority The default priority for a new issue. intermediate Will take some time to fix labels Sep 16, 2023
@Koeng101 Koeng101 added high priority High priority - something is broken or missing that is critical for users or developers. and removed medium priority The default priority for a new issue. labels Sep 16, 2023
@carreter carreter added this to the v1.0 milestone Sep 23, 2023
@abondrn abondrn mentioned this issue Oct 30, 2023
6 tasks
@TimothyStiles
Collaborator

Should be fixed in #394 @Koeng101?

@Koeng101
Contributor Author

Probably not. I think the time to fix this would be after the merge of ioToBio.


This issue has had no activity in the past 2 months. Marking as stale.

@github-actions github-actions bot added the stale label Jan 29, 2024
@carreter carreter removed the stale label Jan 30, 2024
@carreter
Collaborator

This will be fixed once #437 is merged as a part of #434.

To clarify, the < and > syntax indicates that the sequence is unbounded, i.e. <155222..>155765 indicates the sequence starts before base 155222 and ends after base 155765.
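For illustration, here is a rough sketch of how the three location strings from the top of this issue could be parsed while keeping the `<` / `>` information. The `Span` type and the `parseSpan` and `parseLocationSketch` functions are hypothetical names made up for this example; this is not the code in #437. It only covers a bare span or a flat `join(...)`, and ignores `complement(...)` and nested operators.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Span is a hypothetical type for one contiguous piece of a location.
// Positions are kept exactly as written in the GenBank file.
type Span struct {
	Start, End   int
	StartPartial bool // written "<N": the feature begins before N
	EndPartial   bool // written ">N": the feature continues past N
}

// parseSpan handles "N", "N..M", "<N", ">N" and "<N..>M".
func parseSpan(s string) (Span, error) {
	startStr, endStr, isRange := strings.Cut(strings.TrimSpace(s), "..")
	if !isRange {
		endStr = startStr // a single base is a span of length one
	}
	var sp Span
	var err error
	sp.StartPartial = strings.HasPrefix(startStr, "<")
	sp.EndPartial = strings.HasPrefix(endStr, ">")
	if sp.Start, err = strconv.Atoi(strings.TrimLeft(startStr, "<>")); err != nil {
		return Span{}, err
	}
	if sp.End, err = strconv.Atoi(strings.TrimLeft(endStr, "<>")); err != nil {
		return Span{}, err
	}
	return sp, nil
}

// parseLocationSketch handles a bare span or a flat join(...).
func parseLocationSketch(s string) ([]Span, error) {
	s = strings.TrimSpace(s)
	if strings.HasPrefix(s, "join(") && strings.HasSuffix(s, ")") {
		s = s[len("join(") : len(s)-1]
	}
	var spans []Span
	for _, part := range strings.Split(s, ",") {
		sp, err := parseSpan(part)
		if err != nil {
			return nil, fmt.Errorf("location %q: %w", part, err)
		}
		spans = append(spans, sp)
	}
	return spans, nil
}

func main() {
	for _, loc := range []string{
		"<155222..>155765",
		"join(<155222,155311..>155765)",
		"join(155222,155311..155765)",
	} {
		spans, err := parseLocationSketch(loc)
		fmt.Printf("%s -> %+v (err: %v)\n", loc, spans, err)
	}
}
```

With this shape, the mRNA and CDS locations parse to the same two spans and differ only in the partial flags, while the gene location parses to a single span covering the whole range.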
