Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

samtools faidx fails to retrieve large scaffolds #1660

Open
mcshane opened this issue May 26, 2022 · 7 comments
Open

samtools faidx fails to retrieve large scaffolds #1660

mcshane opened this issue May 26, 2022 · 7 comments
Assignees

Comments

@mcshane
Copy link
Contributor

mcshane commented May 26, 2022

Are you using the latest version of samtools and HTSlib? If not, please specify.

samtools 1.15.1
Using htslib 1.15.1
Copyright (C) 2022 Genome Research Ltd.

Please describe your environment.

  • OS (run uname -sr on Linux/Mac OS or wmic os get Caption, Version on Windows)
  • machine architecture (run uname -m on Linux/Mac OS or wmic os get OSArchitecture on Windows)
  • compiler (run gcc --version or clang --version)

Linux 4.15.0-175-generic/x86_64

Please specify the steps taken to generate the issue, the command you are running and the relevant output.

We have a large plant assembly with some scaffolds larger than 10G in length. samtools faidx has successfully indexed this file (first part of the fai file is below), but when trying to retrieve the contigs that are >2^32 in length, it fails. Scaffolds smaller than 2^32 appear to be okay.

$ samtools faidx out_scaffolds_final.fa scaffold_8 > /dev/null
[faidx] Failed to fetch sequence in scaffold_8

The end of the strace for the above is

openat(AT_FDCWD, "out_scaffolds_final.fa", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=104463251154, ...}) = 0
read(3, ">scaffold_1\nTATTTATATTTTATTCTATT"..., 32768) = 32768
brk(0x55e70f4dd000)                     = 0x55e70f4dd000
lseek(3, 57703942101, SEEK_SET)         = 57703942101
read(3, "TTCCTCTAACAACCCCTGGCTTAAGGAAATTG"..., 65536) = 65536
mmap(NULL, 6233518080, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x55e882d9c000)                     = 0x55e70f4dd000
mmap(NULL, 6233653248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 16), ...}) = 0
write(1, ">scaffold_8\n", 12>scaffold_8
)           = 12
write(2, "[faidx] Failed to fetch sequence"..., 47[faidx] Failed to fetch sequence in scaffold_8
) = 47
munmap(0x7fb88648f000, 331776)          = 0
close(3)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++

The top of the fai file looks like this:

$ head -20 out_scaffolds_final.fa.fai
scaffold_1	10872223031	12	60	61
scaffold_2	10019801872	11053426773	60	61
scaffold_3	8320839242	21240225355	60	61
scaffold_4	7846475048	29699745264	60	61
scaffold_5	6842170535	37676994909	60	61
scaffold_6	6433889001	44633201632	60	61
scaffold_7	6422577009	51174322129	60	61
scaffold_8	6233517787	57703942101	60	61
scaffold_9	4303301191	64041351864	60	61
scaffold_10	4269982256	68416374755	60	61
scaffold_11	4059627635	72757523395	60	61
scaffold_12	3516010586	76884811504	60	61
scaffold_13	3514780679	80459422280	60	61
scaffold_14	2397495093	84032782650	60	61
scaffold_15	2124436455	86470236008	60	61
scaffold_16	1867718782	88630079751	60	61
scaffold_17	1092189090	90528927193	60	61
scaffold_18	497128335	91639319448	60	61
scaffold_19	422453574	92144733269	60	61
scaffold_20	402869361	92574227749	60	61
@daviesrob
Copy link
Member

faidx is supposed to support long chromosomes, but the length seems to get capped at 2^32 bases. The problem is in this line, which originally prevented an overflow when the function returned an int *. It should have been updated when the return type was changed to hts_pos_t * but it looks like it got missed. That shouldn't be too difficult to fix.

I also notice that faidx currently works by copying the entire requested region into a buffer and then writing it out. That's OK when you only want a few bases, but leads to excessive memory use when you try to access something like your 10G scaffold. It might be a good idea to limit the amount it tries to store in such cases to get its memory use back down to a more reasonable level.

@daviesrob daviesrob self-assigned this May 31, 2022
daviesrob added a commit to daviesrob/htslib that referenced this issue Jun 7, 2022
This was probably a left-over from the transition to 64-bit
positions in HTSlib.  Having the limit in fai_retrieve() caused
very long references to be truncated even though programs like
`samtools faidx` should be able to support them (see issue
samtools/samtools#1660 - samtools faidx fails to retrieve large
scaffolds).

The limit is useful for legacy faidx interfaces that return
the size in an `int *`, so tests for sizes over INT_MAX
have been applied to them.
whitwham pushed a commit to samtools/htslib that referenced this issue Jun 8, 2022
This was probably a left-over from the transition to 64-bit
positions in HTSlib.  Having the limit in fai_retrieve() caused
very long references to be truncated even though programs like
`samtools faidx` should be able to support them (see issue
samtools/samtools#1660 - samtools faidx fails to retrieve large
scaffolds).

The limit is useful for legacy faidx interfaces that return
the size in an `int *`, so tests for sizes over INT_MAX
have been applied to them.
@daviesrob
Copy link
Member

samtools/htslib#1446 should have gone most of the way to fixing this. I'm leaving this issue open for the moment though as I notice that there are a few other improvements that could be made here, notably to the amount of memory used when you fetch a long sequence, which starts to get a bit excessive.

@mcshane
Copy link
Contributor Author

mcshane commented Jun 8, 2022

Thanks @daviesrob. Pulled in the current develop head(s) and still fails to retrieve scaffold_8 and larger... Let me know if you want the location of the file on Sanger disk for testing...

@daviesrob
Copy link
Member

Getting access to your files would be useful. It worked on my test one, but I cheated a bit by making my sequences all "N" so they compressed down to a reasonable size.

@mcshane
Copy link
Contributor Author

mcshane commented Jun 8, 2022

Have sent you an email.

@mcshane
Copy link
Contributor Author

mcshane commented Jun 8, 2022

Thanks to @daviesrob identified that this was because I was running out of memory on the node - running on a node with enough memory worked as expected.

@daviesrob
Copy link
Member

Good to hear that. I'm looking at making it a bit less memory hungry, but it'll need a bit more work in HTSlib to get there.

I've also noticed that samtools dict is even worse. It tries to read in the entire assembly and then dies in the attempt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants