chainNameList, chainIdList are limited to 4 characters #37

Open
pwrose opened this issue Sep 3, 2018 · 3 comments

@pwrose
Collaborator

pwrose commented Sep 3, 2018

For some use cases longer chain names/Ids are required, e.g., to encode the symmetry operator when creating biological assemblies.

It would be best if the chain names/Ids can have a flexible length.

@gtauriello

One idea there: for decoders it should be fairly simple to accept Array and Binary types interchangeably (at least rcsb/mmtf-cpp doesn't have a problem with it, and I consider C++ to be rather strict about types). As such, one could relax the typing for lists so that they don't necessarily have to be of Binary type, which would give encoders more flexibility. Even now, the encoding strategy for binary fields is not strictly required to be fixed for decoders to work.
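As an illustration of the decoder side, here is a minimal sketch of a function that accepts either representation. The function name `decode_chain_names` is hypothetical, and the binary layout is an assumption based on the MMTF fixed-length string codec: a 12-byte big-endian header (codec type 5, element count, string width) followed by NUL-padded ASCII strings. It is not taken from any of the libraries mentioned in this thread.

```python
import struct


def decode_chain_names(value):
    """Return chain names as a list of str from either representation.

    Hypothetical helper: accepts a Binary field (bytes) or a plain list.
    """
    if isinstance(value, (bytes, bytearray)):
        # Assumed layout: three big-endian int32 values (codec, count, width),
        # then `count` strings of `width` bytes each, padded with NUL bytes.
        codec, count, width = struct.unpack(">iii", bytes(value[:12]))
        if codec != 5:
            raise ValueError(f"unexpected codec {codec} for a string array")
        body = bytes(value[12:])
        return [
            body[i * width:(i + 1) * width].rstrip(b"\x00").decode("ascii")
            for i in range(count)
        ]
    # Plain Array of String: already decoded, so only normalise the type.
    return [str(name) for name in value]
```

With a decoder shaped like this, the caller does not need to know which form the encoder chose.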

Now for your proposed change this would mean that the encoders would have to become "smarter" and choose an appropriate encoding for the chain names. If there is a reasonable max. chain name length (e.g. <= 4), the binary encoding can be used, and otherwise an Array of String can be used instead.
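The encoder-side choice described above could be sketched roughly as follows. The names `encode_chain_names` and `MAX_BINARY_WIDTH` are hypothetical, and the binary layout (12-byte big-endian header with codec type 5, count, and width, then NUL-padded ASCII strings) is an assumption about the fixed-length string codec rather than code from an existing encoder.

```python
import struct

MAX_BINARY_WIDTH = 4  # assumed width of the current binary chain name encoding


def encode_chain_names(names):
    """Pick an encoding for chain names based on their maximum length.

    Hypothetical helper: returns bytes for the Binary form, or a list
    for the Array-of-String fallback. Names are assumed to be ASCII.
    """
    encoded = [name.encode("ascii") for name in names]
    if all(len(raw) <= MAX_BINARY_WIDTH for raw in encoded):
        # Short names fit the fixed-width Binary form: header, then padded strings.
        header = struct.pack(">iii", 5, len(encoded), MAX_BINARY_WIDTH)
        body = b"".join(raw.ljust(MAX_BINARY_WIDTH, b"\x00") for raw in encoded)
        return header + body
    # At least one name is too long: fall back to a plain Array of String.
    return list(names)
```

Files whose chain names all fit in four characters would stay byte-compatible with the current spec under this scheme; only files with longer names would use the Array form.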

The alternative of course is to change the spec to be fixed to Array of String, but this would break compatibility with the current spec.

All of this assumes that no one currently relies on chain names being fixed at length 4.

In terms of implementing it, I can only speak for the rcsb/mmtf-cpp library where I don't see any problem with using chain names/ids of variable length.

@danpf
Contributor

danpf commented Dec 20, 2018

I support having long chain names... but just for the record,
http://mmcif.wwpdb.org/docs/large-pdbx-examples/
suggests that

Chain identifiers of up to 4 characters are permitted. The PDB chain identifier corresponds to the "_atom_site.auth_asym_id" data item.

which is sad.

@speleo3
Contributor

speleo3 commented Dec 21, 2018

for decoders it should be fairly simple to accept Array and Binary types interchangeably

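mmtf-c and simplemmtf-python already support this. Example:

import simplemmtf
d = simplemmtf.fetch('1rx1')
# Replace the chain names with a plain list of strings of varying length
d._data['chainNameList'] = ['ABCD', 'EFGHIJKL', 'MNOPQRSTUVWXY', 'Z']
with open('foo.mmtf', 'wb') as f:
    f.write(d.encode())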

The file can be loaded into PyMOL, which uses mmtf-c.

For the record, no length limitations are mentioned here: http://mmcif.wwpdb.org/dictionaries/mmcif_mdb.dic/Items/_atom_site.auth_asym_id.html

speleo3 added a commit to schrodinger/simplemmtf-python that referenced this issue Dec 21, 2018