Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Writing a missing value for a FORMAT field #172

Open
p69180 opened this issue Sep 28, 2020 · 7 comments
Open

Writing a missing value for a FORMAT field #172

p69180 opened this issue Sep 28, 2020 · 7 comments

Comments

@p69180
Copy link

p69180 commented Sep 28, 2020

Hi,

Please understand that I do not have any experience with C or htslib.

I would like to write missing values ('.') to FORMAT fields.

I incidentally found that when the FORMAT field is of type 'Integer', entering a numpy array including -2147483648 results in missing value ('.').
But I could not find such a way with 'Float' type FORMAT field. I tried with numpy.nan, but the corresponding FORMAT field of output vcf file showed 'nan' instead of '.'.

Is there any way to write '.' in a FORMAT field?

Thanks

@brentp
Copy link
Owner

brentp commented Sep 28, 2020

You can write 0x7F800001 or 2139095041 as the missing value.

@p69180
Copy link
Author

p69180 commented Sep 28, 2020

Thank you for a fast reply.

Unfortunately your suggestion did not work.

What I did:

import sys
sys.path.append('/home/users/pjh/python/lib/python3.7/site-packages')
import cyvcf2
import numpy as np

vcf_path = '/home/users/pjh/tmp/TEST.bcf.gz'

vcf = cyvcf2.VCF(vcf_path)
vcf.add_format_to_header({'ID':'NEW', 'Type':'Float', 'Number':1, 'Description':'...'})
v = next(vcf)
print(v)
print()

NA = 0x7F800001
arr = np.array([[1.1], [NA]])
v.set_format('NEW', arr)
print(v)


Output:

1 94533 . C A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=38,64|6,3;DP=116;ECNT=1;GERMQ=93;MBQ=37,37;MFRL=396,385;MMQ=23,40;MPOS=30;NALOD=1.5;NLOD=9.01;POPAF=6;ROQ=65;TLOD=19.97 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:72,9:0.12:81:39,5:33,4:24,48,6,3 0/0:30,0:0.031:30:16,0:14,0:14,16,0,0

1 94533 . C A . PASS AS_FilterStatus=SITE;AS_SB_TABLE=38,64|6,3;DP=116;ECNT=1;GERMQ=93;MBQ=37,37;MFRL=396,385;MMQ=23,40;MPOS=30;NALOD=1.5;NLOD=9.01;POPAF=6;ROQ=65;TLOD=19.97 GT:AD:AF:DP:F1R2:F2R1:SB:NEW 0/1:72,9:0.12:81:39,5:33,4:24,48,6,3:1.1 0/0:30,0:0.031:30:16,0:14,0:14,16,0,0:2.1391e+09

@brentp
Copy link
Owner

brentp commented Sep 28, 2020

you can use:

v.set_format('NEW', np.array([1.1, np.nan], dtype=np.float32))

that shows "nan" instead of "." and I'm not sure why yet.

@davmlaw
Copy link
Contributor

davmlaw commented Oct 6, 2020

On the topic of missing values for FORMAT fields:

gt_alt_depths returns -1 for missing values while
format() - returns -2147483648, which is -(2^31) or all bits set signed int32 - I think numpy converts NaN to this when casting to int

@LiterallyUniqueLogin
Copy link
Contributor

Is there any way forward on this? I can confirm that currently setting a float format field with np.nans results in 'nan' being written to the output file, and that setting the field with 0x7F800001 or 2139095041 just results in that number, and not any missing value.

I'd like setting values to np.nan to result in '.' being written, as '.' is currently read in as np.nan. If that's not possible, then perhaps we could pass a masked array to set_format() and the masked values would take on the missing value? That could also be a consistent solution across datatypes. But I'll take whichever is faster/easier to implement.

@LiterallyUniqueLogin
Copy link
Contributor

@brentp thoughts?

@brentp
Copy link
Owner

brentp commented Nov 19, 2020

Hi, I'm not sure hwo to fix this. I'd be happy to accept a PR to resolve it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants