Possible bug in htslib/bcftools 1.1: [E::bcf_hdr_add_sample_len] Duplicated sample name #1408

Closed
sahwa opened this issue Feb 9, 2021 · 3 comments · Fixed by samtools/htslib#1237
Labels: bug; D3: Easy; htslib-dependent (Cannot be fixed until htslib is fixed)

Comments


sahwa commented Feb 9, 2021

With a file test.vcf and htslib/bcftools 1.1, running:

bcftools view test.vcf

we get:

[E::bcf_hdr_add_sample_len] Duplicated sample name ....
Failed to read from test.vcf: could not parse header

We get the same error when running any bcftools command.

But with an older version of bcftools (1.9-207-g2299ab6, using htslib 1.9-271-g6738132), the file can be viewed OK.

sahwa commented Feb 9, 2021

Edit - just figured out that this is because one of the sample names contains an unusual special character ("B�R1"). Once I edited the name to remove it, the VCF could be parsed OK.

@daviesrob
Member

This is due to the way HTSlib looks for tabs in the header line, which currently (on x86-64, where plain char is signed) mistakes the bytes of non-ASCII UTF-8 characters for the end of the sample name. As a result it splits the name "B�R1" into several parts, and complains about a duplicate when it gets to "R1". Casting *q to uint8_t makes the comparison unsigned and keeps the entire name intact.

The VCF4.3 specification says it uses UTF-8 so presumably names like this ought to be allowed?
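
A minimal sketch of the signedness problem described above. This is not HTSlib's actual code: the function names, the loop shape (`*q > '\n'`) and the bytes used for the sample name (`B`, one two-byte UTF-8 character, then `R1`) are assumptions made for illustration. The first function forces `signed char` to reproduce the x86-64 behaviour on any platform; the second applies the `uint8_t` cast.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scan with signed bytes: any byte >= 0x80 reads as a
 * negative value, so it fails the "> '\n'" test and is mistaken for
 * the end of the name. */
size_t name_len_signed(const char *p)
{
    const signed char *q = (const signed char *)p;
    while (*q > '\n')
        q++;
    return (size_t)((const char *)q - p);
}

/* Same scan with the byte cast to uint8_t: only NUL, '\t', '\n' and
 * the other control bytes below '\n' end the name. */
size_t name_len_unsigned(const char *p)
{
    const char *q = p;
    while ((uint8_t)*q > '\n')
        q++;
    return (size_t)(q - p);
}
```

For the hypothetical header fragment `"B\xC3\x85" "R1\tS2\n"`, the signed scan stops after 1 byte (just `B`), while the unsigned scan returns 5, the full byte length of the first name.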

@jmarshall
Member

> The VCF4.3 specification says it uses UTF-8 so presumably names like this ought to be allowed?

The VCF spec is (as usual) vague on this. samtools/hts-specs#18 motivated UTF-8 by saying “in order to address the need to represent non-ASCII characters in INFO field values, VCF files are assumed to be encoded in UTF-8 […]” which I read as intending UTF-8 for use primarily in free text description fields and the like, as in SAM, but not necessarily in fields like these sample IDs that tools need to compare/etc. PR samtools/hts-specs#414 proposes rules for VCF sample IDs but remains vague. In a different context, ga4gh/seqcol-spec#2 (comment) wisely noted that allowing arbitrary Unicode would make testing sample IDs for equivalence (e.g. bcftools view -s) difficult.

Fortunately this HTSlib parsing code could be fixed to trigger only on \t and \n while remaining agnostic as to whether UTF-8 is actually allowed here.
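
One way such encoding-agnostic splitting could look, as a rough sketch only: `split_sample_names` and its parameters are hypothetical and are not part of HTSlib. It uses the standard `strcspn` to stop only at `\t`, `\n` or the terminating NUL, so every other byte, including UTF-8 continuation bytes, stays inside the name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split the sample part of a header line into
 * fields, treating only '\t' as a separator and '\n' or NUL as the
 * end of the line. Records up to max_names (start, length) pairs and
 * returns how many were recorded. */
size_t split_sample_names(const char *line, const char **starts,
                          size_t *lens, size_t max_names)
{
    size_t n = 0;
    const char *p = line;

    while (n < max_names) {
        size_t len = strcspn(p, "\t\n"); /* bytes before the next tab/newline/NUL */
        starts[n] = p;
        lens[n] = len;
        n++;
        p += len;
        if (*p != '\t')                  /* newline or end of string: stop */
            break;
        p++;                             /* skip the tab separator */
    }
    return n;
}
```

With the hypothetical input `"B\xC3\x85" "R1\tR1\tS3\n"`, this yields three fields of 5, 2 and 2 bytes, so the non-ASCII name and the plain `R1` stay distinct. Whether such names should be accepted at all is left to a later validation step.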
