Issue with "source" in freshly created FeatureNodes in gtpython #977

Open
maol-corteva opened this issue Jun 8, 2021 · 3 comments
@maol-corteva

maol-corteva commented Jun 8, 2021

Problem description

Dear @satta:

Using Python 3, when I create a GFF FeatureNode from scratch and set its "source" field with the fn.set_source() method, the stored string appears to be encoded twice.

When the source is read back with the fn.get_source() method, the result is a proper string that misleadingly resembles a "bytes" object. Attempting to decode that object raises an AttributeError because it has no "decode" attribute.

The object is already a string, but it retains (as a string) the form b'text', which looks like a bytes object. (See the code below.)

I assume the source logic is attempting to be transparent to both Python 2 and Python 3. I did not test with Python 2.
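A minimal sketch of how a string of this shape can arise in plain Python 3, independent of GenomeTools. It assumes (this is not confirmed from the gtpython source) that str() is applied to an already encoded bytes value somewhere on the way in:

```python
# Assumption: str() is applied to an encoded value somewhere in the binding.
source = "BAR"
encoded = source.encode("utf-8")   # bytes: b'BAR'
stringified = str(encoded)         # Python 3 returns the repr text, it does not decode

print(type(stringified).__name__, stringified)
print(hasattr(stringified, "decode"))   # a str has no decode() in Python 3

# Decoding the bytes value is what recovers the original text.
print(encoded.decode("utf-8"))
```

In Python 2, str() on the encoded value would have returned it unchanged, which may be why the problem only shows up under Python 3.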

Example Python 3.8 code, "test_newfeat_source.py", follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from gt.extended.feature_node import FeatureNodeIteratorDirect
from gt.dlload import gtlib
from gt.extended import *
import sys
import re
print(sys.path)

if __name__ == "__main__":
    seqid = "foo"
    seqsource = "BAR"
    gene = FeatureNode.create_new(seqid, "gene", 100, 900, "+")
    gene.set_source(seqsource)
    exon1 = FeatureNode.create_new(seqid, "exon", 100, 200, "+")
    exon1.set_source(seqsource)
    gene.add_child(exon1)
    exon2 = FeatureNode.create_new(seqid, "exon", 800, 900, "+")
    exon2.set_source(seqsource)
    gene.add_child(exon2)

    fin = FeatureNodeIteratorDepthFirst(gene)
    while True:
        fn = fin.next()
        if not fn:
            break
        tfn_source = fn.get_source().decode('UTF-8')
        print(fn, tfn_source, fn.get_type())

    print("\n..After reformating source gff3 field...")
    locregex=re.compile("b'(.+)'$")
    fin = FeatureNodeIteratorDepthFirst(gene)
    while True:
        fn = fin.next()
        if not fn:
            break
        tfn_source = fn.get_source().decode('UTF-8')
        #tfn_source = str(fn.get_source())
        if (tfn_source.startswith("b'")):
            locpatternsearch = locregex.search(tfn_source)
            tfn_source = locpatternsearch.group(1)  # <=== workaround: keep only the text inside b'...'
        print(f'The type of obj "source" is {type(tfn_source)}')
        print(fn, tfn_source, fn.get_type())

Output from the above code under Ubuntu's system-installed Python 3.8:

 $  python3 test_newfeat_source.py
['/home/testy/Documents/work', '/home/testy/Documents/source/gt/gtpython', '/usr/lib/python38.zip', '/usr/lib/python3.8', '/usr/lib/python3.8/lib-dynload', '/usr/local/lib/python3.8/dist-packages', '/usr/lib/python3/dist-packages']
FeatureNode(start=100, end=900, seqid="foo") b'BAR' gene
FeatureNode(start=100, end=200, seqid="foo") b'BAR' exon
FeatureNode(start=800, end=900, seqid="foo") b'BAR' exon

..After reformating source gff3 field...
The type of obj "source" is <class 'str'>
FeatureNode(start=100, end=900, seqid="foo") BAR gene
The type of obj "source" is <class 'str'>
FeatureNode(start=100, end=200, seqid="foo") BAR exon
The type of obj "source" is <class 'str'>
FeatureNode(start=800, end=900, seqid="foo") BAR exon
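The regex workaround in the script above can also be sketched as a small standalone helper. The name strip_bytes_repr is hypothetical and not part of gtpython; the sketch assumes the affected value always has the exact shape of a Python bytes literal:

```python
import ast


def strip_bytes_repr(text, encoding="utf-8"):
    """Return the text inside a b'...' repr string; otherwise return text unchanged.

    Hypothetical helper for illustration, not part of gtpython.
    """
    looks_like_bytes_repr = (
        len(text) >= 3
        and text[0] == "b"
        and text[1] in "'\""
        and text[-1] == text[1]
    )
    if looks_like_bytes_repr:
        try:
            value = ast.literal_eval(text)  # safely parse the literal
        except (ValueError, SyntaxError):
            return text
        if isinstance(value, bytes):
            return value.decode(encoding)
    return text


print(strip_bytes_repr("b'BAR'"))
print(strip_bytes_repr("BAR"))
```

A source whose real name happens to look like b'...' would also be rewritten, so this is only a stopgap until the binding itself is fixed.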

What GenomeTools version are you reporting an issue for (as output by gt -version)?

I am using GenomeTools 1.6.1, installed by downloading the precompiled binary and Python libraries from GenomeTools.org (a single tar.gz package).

$ python3  -V
Python 3.8.5

$ which python3
/usr/bin/python3

$ echo $PYTHONPATH
/home/testy/Documents/source/gt/gtpython

$  /home/testy/Documents/source/gt/bin/gt --version
/home/testy/Documents/source/gt/bin/gt (GenomeTools) 1.6.1
Copyright (c) 2003-2016 G. Gremme, S. Steinbiss, S. Kurtz, and CONTRIBUTORS
Copyright (c) 2003-2016 Center for Bioinformatics, University of Hamburg
See LICENSE file or http://genometools.org/license.html for license details.

Did you compile GenomeTools from source? If so, please state the make parameters used.

The same downloaded distribution reports:

Used compiler: cc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Compile flags:  -g -Wall -Wunused-parameter -pipe -fPIC -Wpointer-arith -Wno-unknown-pragmas -O3 -m32 -Werror

What operating system (e.g. Ubuntu, Mac OS X), OS version (e.g. 15.10, 10.11) and platform (e.g. x86_64) are you using?

Ubuntu 20.04 LTS

@satta
Member

satta commented Jun 8, 2021

Well, feature_node.get_source() technically returns a gt.core.Str object, not a Python string. This is only implicitly converted (with UTF decoding) in its __str__() method, see https://github.com/genometools/genometools/blob/master/gtpython/gt/core/gtstr.py#L48.
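To illustrate the difference between a wrapper object and a Python string, here is a small stand-in. The class FakeStr is hypothetical and is not the real gt.core.Str; it only assumes, as described above, that the wrapper holds raw bytes and decodes them in __str__():

```python
class FakeStr:
    """Hypothetical stand-in for a wrapper around a C string (not gt.core.Str)."""

    def __init__(self, raw):
        self._raw = raw  # bytes, as a C char* typically appears through ctypes

    def __str__(self):
        # Implicit conversion: decode only when a Python string is requested.
        return self._raw.decode("utf-8")


wrapped = FakeStr(b"BAR")
print(isinstance(wrapped, str))  # the wrapper itself is not a str
print(str(wrapped))              # explicit conversion goes through __str__()
print(f"{wrapped}")              # so does formatting with an empty format spec
```

With a wrapper like this, print() and f-strings show decoded text even though the object has none of the str methods.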

The UTF8 decoding/encoding everywhere in the code is indeed a result of me trying to connect the C GtStrs with the correct representations in Python 2 and 3. Please keep in mind that this was written when Python 3 was just beginning to appear on the horizon in reality. I would be happy to get some pointers regarding this from someone who writes Python more regularly than I do ;)
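One common pattern for this kind of boundary, sketched here for Python 3 only, is to encode exactly once when handing text to C and decode exactly once when reading it back. The helper names to_c_bytes and from_c_bytes are hypothetical, and ctypes.c_char_p merely stands in for a C string; no GenomeTools function is called:

```python
import ctypes


def to_c_bytes(text, encoding="utf-8"):
    """Encode a str once for C; pass bytes through unchanged (hypothetical helper)."""
    if isinstance(text, bytes):
        return text
    return text.encode(encoding)


def from_c_bytes(raw, encoding="utf-8"):
    """Decode bytes from C once; pass str through unchanged (hypothetical helper)."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


# Round trip through a C-style string without ever calling str() on bytes.
c_value = ctypes.c_char_p(to_c_bytes("BAR"))
print(from_c_bytes(c_value.value))
```

Keeping the conversion in two small functions makes it harder to encode or decode the same value twice by accident.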

We also have the option of removing Python 2 compatibility altogether, as it's been deprecated for quite a while now and Debian, for instance, doesn't even ship it anymore. Any opinions on that? I wouldn't mind simplifying things this way.

satta added the bug label Jun 8, 2021
@maol-corteva
Author

It has been a while since I used Python 2 directly. I think it's OK to start phasing it out.

@satta
Member

satta commented Jun 23, 2021

I'd prefer to work on this with a bit more time, as I don't really have a lot of experience with the UTF encoding implications in the Python versions and I would like to get 1.6.2 with some other bugfixes out first.
