
Saving large table FITS may lead to a truncated file #4307

Merged
embray merged 4 commits into astropy:master on Nov 20, 2015

Conversation

@embray (Member) commented Nov 13, 2015

While handling large catalogue FITS files, I sometimes encounter problems when saving them: the newly saved file is truncated, yet no error is raised at save time. For instance, I have a 37 GiB file, and if I open it and save it right away, I end up with a 2 GiB file.

I tried to write a script to generate a big FITS file and expose the bug. While doing so, I found that if I read a big file and count the rows before saving it, the newly saved file is fine.

Here is the script. Of course, you need a lot of memory and a lot of disk space to run it.

import os

import numpy as np
from astropy.io import fits

# Let's generate a BIG fits file (~25 GiB).
print("> Generating a 500x13,000,000 random table FITS file.")
list_cols = []
for i in range(500):
    list_cols.append(
        fits.Column(
            name="{}".format(i),
            format='E',
            array=np.random.random(13000000)
        )
    )
cols = fits.ColDefs(list_cols)
hdu = fits.BinTableHDU.from_columns(cols)
hdu.writeto("bigfile.fits")
print("> bigfile.fits file saved.")

# If we read the file and count the rows before saving it to another
# file, everything is OK.
print("> Read. Count rows. Save.")
hdu_list = fits.open("bigfile.fits")
print("bigfile.fits: {} rows".format(len(hdu_list[1].data)))
hdu_list.writeto("bigfile2.fits")
print("> bigfile2.fits saved.")

print("> Comparing the file sizes.")
print("bigfile.fits: {}".format(os.stat("bigfile.fits").st_size))
print("bigfile2.fits: {}".format(os.stat("bigfile2.fits").st_size))

print("Reading bigfile2.fits and counting the rows.")
hdu_list = fits.open("bigfile2.fits")
print("bigfile2.fits: {} rows".format(len(hdu_list[1].data)))

# If we read the file and save it to another file right after, the latter
# is truncated.
print("> Read the file and save it to another file just after.")
hdu_list = fits.open("bigfile.fits")
hdu_list.writeto("bigfile3.fits")
print("> bigfile3.fits saved.")

print("> Comparing the file sizes.")
print("bigfile.fits: {}".format(os.stat("bigfile.fits").st_size))
print("bigfile3.fits: {}".format(os.stat("bigfile3.fits").st_size))

print("> Reading bigfile3.fits produces a warning.")
hdu_list = fits.open("bigfile3.fits")  # Produces a warning.
print("> Accessing the data raises an error.")
print("bigfile3.fits: {} rows".format(len(hdu_list[1].data)))  # Fails.

The output of the script is:

> Generating a 500x13,000,000 random table FITS file.
> bigfile.fits file saved.
> Read. Count rows. Save.
bigfile.fits: 13000000 rows
> bigfile2.fits saved.
> Comparing the file sizes.
bigfile.fits: 26000087040
bigfile2.fits: 26000087040
Reading bigfile2.fits and counting the rows.
bigfile2.fits: 13000000 rows
> Read the file and save it to another file just after.
> bigfile3.fits saved.
> Comparing the file sizes.
bigfile.fits: 26000087040
bigfile3.fits: 2147565952
> Reading bigfile3.fits produces a warning.
WARNING: File may have been truncated: actual file length (2147565952) is smaller than the expected size (26000087040) [astropy.io.fits.file]
> Accessing the data raises an error.
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/utils/decorators.py", line 339, in __get__
    return obj.__dict__[self._key]
KeyError: 'data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "big_table_FITS_bug.py", line 52, in <module>
    print("bigfile3.fits: {} rows".format(len(hdu_list[1].data)))  # Fails.
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/utils/decorators.py", line 341, in __get__
    val = self._fget(obj)
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/io/fits/hdu/table.py", line 399, in data
    data = self._get_tbdata()
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/io/fits/hdu/table.py", line 168, in _get_tbdata
    self._data_offset)
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/io/fits/hdu/base.py", line 556, in _get_raw_data
    return self._file.readarray(offset=offset, dtype=code, shape=shape)
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/io/fits/file.py", line 271, in readarray
    buffer=self._mmap)
TypeError: buffer is too small for requested array

The main problem is that astropy does not raise an error when the save produces a truncated file. The second problem is that astropy should be able to handle large files.

I have no idea how to debug this, but I'm willing to help.

Yannick
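Not part of the original report, but until the underlying bug is fixed, a simple guard is to compare file sizes after a copy, exactly as the script above does by hand. A minimal sketch (`check_same_size` is an illustrative helper name, not an astropy function):

```python
import os

def check_same_size(src, dst):
    """Return True if dst is at least as large as src.

    A dst smaller than src suggests the copy was silently truncated,
    which is exactly the failure mode described in this issue.
    """
    return os.stat(dst).st_size >= os.stat(src).st_size
```

Running this after every `writeto` would have caught the 26 GB vs. 2 GB discrepancy shown in the output below immediately.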

@yannick1974 (Author)

The script result given above was obtained with astropy 1.0.5. With the development version of astropy on my laptop, I still have the truncated-file problem when opening and saving, but I cannot run the script.

@embray embray added the io.fits label Nov 12, 2015
@embray (Member) commented Nov 12, 2015

What platform is this on?

@embray (Member) commented Nov 12, 2015

Nevermind, I saw the thread on the mailing list about this. It occurs to me that this might be related to #1380. What happens if you open the file with fits.open(..., mode='denywrite')?

@yannick1974 (Author)

Hi Erik,
No, this does not solve the issue. Note that the problem occurs whether I open the file memmapped or not.

@embray (Member) commented Nov 12, 2015

What traceback do you get with memmap=False? What about mode='update'?

@yannick1974 (Author)

Using memmap=False leads to the same traceback.

But if I open the file in update mode and save it, the resulting file is not truncated.

@juliantaylor

io/fits/util.py's _write_string doesn't make sure the full dataset has been written.
As with low-level I/O, you need to loop until everything is written, since a single write call may decide not to write everything in one go.

@astrofrog (Member)

@juliantaylor - ah, does it need an f.flush()?

@juliantaylor

No, it needs:

sz = 0
while sz < len(data):
    sz += f.write(data[sz:])

and technically EINTR handling if Python < 3.5.
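To make that concrete, here is a sketch of such a loop including the EINTR handling mentioned above (`write_all` is an illustrative name, not an astropy function; it assumes a binary-mode file object whose write returns the number of bytes written):

```python
import errno

def write_all(f, data):
    """Write all of `data` to binary file object `f`.

    file.write may perform a short write on very large buffers, so loop
    until everything is out. On Python < 3.5 an interrupted system call
    raises an error with errno EINTR instead of being retried
    automatically (PEP 475), so swallow that and retry.
    """
    written = 0
    while written < len(data):
        try:
            written += f.write(data[written:])
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # real I/O error; EINTR just means "retry"
    return written
```

On Python >= 3.5 the except branch is effectively dead code, since the interpreter retries EINTR itself.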

@embray (Member) commented Nov 13, 2015

Using memmap=False leads to the same traceback.

@yannick1974 That doesn't sound right, since the last line in the traceback above is in code that is never entered if memmap=False.

@embray (Member) commented Nov 13, 2015

@juliantaylor That seems right, though that code isn't used except for a few smaller writes; it is not used to write data arrays, which generally go through ndarray.tofile.
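For contrast with the _write_string path, the bulk-data path mentioned above can be sketched as follows; ndarray.tofile handles writing the entire buffer itself, so no short-write loop is needed at the call site (the file handling here is illustrative):

```python
import os
import tempfile

import numpy as np

# A float32 array: 1000 elements * 4 bytes = 4000 bytes on disk.
arr = np.arange(1000, dtype=np.float32)

fd, path = tempfile.mkstemp()
os.close(fd)
arr.tofile(path)  # writes the whole buffer; no manual short-write loop

assert os.stat(path).st_size == arr.nbytes
```

The bug in this issue arose precisely because the large table data ended up on the _write_string path instead of this one.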

@juliantaylor

It is used by the posted script.

@embray (Member) commented Nov 13, 2015

Oh, well it's not supposed to be. That appears to be a bug.

…file to be skipped over on a direct copy write
@embray (Member) commented Nov 13, 2015

I think the attached fix should work around the main problem with truncation; how does this look to you, @yannick1974?

@juliantaylor

The _write_string issue should also be fixed; it's a bug even if, with this fix, it's no longer triggered by that code.

@juliantaylor

The fix works for me.

@embray (Member) commented Nov 13, 2015

Agreed! I might as well go ahead and add that to this PR.

@yannick1974 (Author)

Hi Erik,
I applied the patch to astropy 1.0.5 and the problem is no longer present. Thanks.

@embray embray added this to the v1.0.7 milestone Nov 16, 2015
@embray (Member) commented Nov 16, 2015

@juliantaylor @yannick1974 Excellent, thanks for confirming.

@astrofrog (Member)

@embray - can you add a changelog entry and merge?

… that the original problem is fixed this is unlikely to ever be an issue, since _write_string should mostly just be used for writing headers and other smaller strings). Added changelog entry for 1.0.7.
…use file.write does not have a return value. Will reconsider something like this later if the need arises.
embray added a commit that referenced this pull request Nov 20, 2015
Saving large table FITS may lead to a truncated file
@embray embray merged commit c34d866 into astropy:master Nov 20, 2015
@embray embray deleted the fits/issue-4307 branch November 20, 2015 21:41
embray added a commit that referenced this pull request Nov 20, 2015
Saving large table FITS may lead to a truncated file
embray added a commit that referenced this pull request Nov 20, 2015
Saving large table FITS may lead to a truncated file