
Saving large table FITS may lead to a truncated file #4307

Merged
embray merged 4 commits into astropy:master on Nov 20, 2015

Conversation

@embray (Member) commented Nov 13, 2015

While handling large catalogue FITS files, I sometimes encounter problems when saving them: the newly saved file is truncated, yet no error is raised at save time. For instance, I have a 37 GiB file, and if I open it and save it right away, I end up with a 2 GiB file.

I tried to write a script to generate a big FITS file and expose the bug. While doing so, I found that if I read a big file and count the rows before saving it, the newly saved file is fine.

Here is the script. Of course, you need a lot of memory and a lot of disk space to run it.

import os

import numpy as np
from astropy.io import fits

# Let's generate a BIG fits file (~25 GiB).
print("> Generating a 500x13,000,000 random table FITS file.")
list_cols = []
for i in range(500):
    list_cols.append(
        fits.Column(
            name="{}".format(i),
            format='E',
            array=np.random.random(13000000)
        )
    )
cols = fits.ColDefs(list_cols)
hdu = fits.BinTableHDU.from_columns(cols)
hdu.writeto("bigfile.fits")
print("> bigfile.fits file saved.")

# If we read the file and count the rows before saving it to another
# file, everything is OK.
print("> Read. Count rows. Save.")
hdu_list = fits.open("bigfile.fits")
print("bigfile.fits: {} rows".format(len(hdu_list[1].data)))
hdu_list.writeto("bigfile2.fits")
print("> bigfile2.fits saved.")

print("> Comparing the file sizes.")
print("bigfile.fits: {}".format(os.stat("bigfile.fits").st_size))
print("bigfile2.fits: {}".format(os.stat("bigfile2.fits").st_size))

print("Reading bigfile2.fits and counting the rows.")
hdu_list = fits.open("bigfile2.fits")
print("bigfile2.fits: {} rows".format(len(hdu_list[1].data)))

# If we read the file and save it to another file right after, the latter
# is truncated.
print("> Read the file and save it to another file just after.")
hdu_list = fits.open("bigfile.fits")
hdu_list.writeto("bigfile3.fits")
print("> bigfile3.fits saved.")

print("> Comparing the file sizes.")
print("bigfile.fits: {}".format(os.stat("bigfile.fits").st_size))
print("bigfile3.fits: {}".format(os.stat("bigfile3.fits").st_size))

print("> Reading bigfile3.fits produces a warning.")
hdu_list = fits.open("bigfile3.fits")  # Produces a warning.
print("> Accessing the data raises an error.")
print("bigfile3.fits: {} rows".format(len(hdu_list[1].data)))  # Fails.

The output of the script is:

> Generating a 500x13,000,000 random table FITS file.
> bigfile.fits file saved.
> Read. Count rows. Save.
bigfile.fits: 13000000 rows
> bigfile2.fits saved.
> Comparing the file sizes.
bigfile.fits: 26000087040
bigfile2.fits: 26000087040
Reading bigfile2.fits and counting the rows.
bigfile2.fits: 13000000 rows
> Read the file and save it to another file just after.
> bigfile3.fits saved.
> Comparing the file sizes.
bigfile.fits: 26000087040
bigfile3.fits: 2147565952
> Reading bigfile3.fits produces a warning.
WARNING: File may have been truncated: actual file length (2147565952) is smaller than the expected size (26000087040) [astropy.io.fits.file]
> Accessing the data raises an error.
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/utils/decorators.py", line 339, in __get__
    return obj.__dict__[self._key]
KeyError: 'data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "big_table_FITS_bug.py", line 52, in <module>
    print("bigfile3.fits: {} rows".format(len(hdu_list[1].data)))  # Fails.
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/utils/decorators.py", line 341, in __get__
    val = self._fget(obj)
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/io/fits/hdu/table.py", line 399, in data
    data = self._get_tbdata()
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/io/fits/hdu/table.py", line 168, in _get_tbdata
    self._data_offset)
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/io/fits/hdu/base.py", line 556, in _get_raw_data
    return self._file.readarray(offset=offset, dtype=code, shape=shape)
  File "/opt/anaconda3/lib/python3.4/site-packages/astropy/io/fits/file.py", line 271, in readarray
    buffer=self._mmap)
TypeError: buffer is too small for requested array

The main problem is that astropy does not raise an error when the save produces a truncated file. The second problem is that astropy should be able to handle large files.

I have no idea how to debug this, but I'm willing to help.

Yannick
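Not part of the original report, but until the underlying bug is fixed, a simple guard is to compare file sizes after a copy, exactly as the script above does by hand. A minimal sketch (`check_same_size` is an illustrative helper name, not an astropy function):

```python
import os

def check_same_size(src, dst):
    """Return True if dst is at least as large as src.

    A dst smaller than src suggests the copy was silently truncated,
    which is exactly the failure mode described in this issue.
    """
    return os.stat(dst).st_size >= os.stat(src).st_size
```

Running this after every `writeto` would have caught the 26 GB vs. 2 GB discrepancy shown in the output below immediately.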

@yannick1974 (Author)

The script result given above was obtained with astropy 1.0.5. With the development version of astropy on my laptop, I still have the truncated-file problem when opening and saving, but I cannot run the script.

@embray embray added the io.fits label Nov 12, 2015
@embray (Member) commented Nov 12, 2015

What platform is this on?

@embray (Member) commented Nov 12, 2015

Nevermind, I saw the thread on the mailing list about this. It occurs to me that this might be related to #1380. What happens if you open the file with fits.open(..., mode='denywrite')?

@yannick1974 (Author)

Hi Erik,
No, this does not solve the issue. Note that the problem occurs whether I open the file memmapped or not.

@embray (Member) commented Nov 12, 2015

What traceback do you get with memmap=False? What about mode='update'?

@yannick1974 (Author)

Using memmap=False leads to the same traceback.

But if I open the file in update mode and save it, the resulting file is not truncated.

@juliantaylor

io/fits/util.py's _write_string doesn't make sure the full dataset has been written.
As with low-level I/O, you need to loop until everything is written, since a single write call may decide not to write everything in one go.

@astrofrog (Member)

@juliantaylor - ah, does it need an f.flush()?

@juliantaylor

No, it needs:

sz = 0
while sz < len(data):
    sz += f.write(data[sz:])

and technically EINTR handling if Python < 3.5.
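To make that concrete, here is a sketch of such a loop including the EINTR handling mentioned above (`write_all` is an illustrative name, not an astropy function; it assumes a binary-mode file object whose write returns the number of bytes written):

```python
import errno

def write_all(f, data):
    """Write all of `data` to binary file object `f`.

    file.write may perform a short write on very large buffers, so loop
    until everything is out. On Python < 3.5 an interrupted system call
    raises an error with errno EINTR instead of being retried
    automatically (PEP 475), so swallow that and retry.
    """
    written = 0
    while written < len(data):
        try:
            written += f.write(data[written:])
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # real I/O error; EINTR just means "retry"
    return written
```

On Python >= 3.5 the except branch is effectively dead code, since the interpreter retries EINTR itself.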

@embray (Member) commented Nov 13, 2015

Using memmap=False leads to the same traceback.

@yannick1974 That doesn't sound right, since the last line in the traceback above is in code that is never entered if memmap=False.

@embray (Member) commented Nov 13, 2015

@juliantaylor That seems right, though that code isn't used except for a few smaller writes; it is not used to write data arrays, which generally go through ndarray.tofile.
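For contrast with the _write_string path, the bulk-data path mentioned above can be sketched as follows; ndarray.tofile handles writing the entire buffer itself, so no short-write loop is needed at the call site (the file handling here is illustrative):

```python
import os
import tempfile

import numpy as np

# A float32 array: 1000 elements * 4 bytes = 4000 bytes on disk.
arr = np.arange(1000, dtype=np.float32)

fd, path = tempfile.mkstemp()
os.close(fd)
arr.tofile(path)  # writes the whole buffer; no manual short-write loop

assert os.stat(path).st_size == arr.nbytes
```

The bug in this issue arose precisely because the large table data ended up on the _write_string path instead of this one.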

@juliantaylor

It is used by the posted script.

@embray (Member) commented Nov 13, 2015

Oh, well it's not supposed to be. That appears to be a bug.

…file to be skipped over on a direct copy write
@embray (Member) commented Nov 13, 2015

I think the attached fix should work around the main problem with truncation; how does this look to you, @yannick1974?

@juliantaylor

The _write_string issue should also be fixed; it's a bug even if, with this fix, it's no longer triggered by that code.

@juliantaylor

The fix works for me.

@embray (Member) commented Nov 13, 2015

Agreed! I might as well go ahead and add that to this PR.

@yannick1974 (Author)

Hi Erik,
I applied the patch to astropy 1.0.5 and the problem is no longer present. Thanks.

@embray embray added this to the v1.0.7 milestone Nov 16, 2015
@embray (Member) commented Nov 16, 2015

@juliantaylor @yannick1974 Excellent, thanks for confirming.

@astrofrog (Member)

@embray - can you add a changelog entry and merge?

… that the original problem is fixed this is unlikely to ever be an issue, since _write_string should mostly just be used for writing headers and other smaller strings). Added changelog entry for 1.0.7.
…use file.write does not have a return value. Will reconsider something like this later if the need arises.
embray added a commit that referenced this pull request Nov 20, 2015
Saving large table FITS may lead to a truncated file
@embray embray merged commit c34d866 into astropy:master Nov 20, 2015
@embray embray deleted the fits/issue-4307 branch November 20, 2015 21:41
embray added a commit that referenced this pull request Nov 20, 2015
Saving large table FITS may lead to a truncated file
embray added a commit that referenced this pull request Nov 20, 2015
Saving large table FITS may lead to a truncated file