
FITS I/O, WINDOWS only: files remain open despite the use of "with" statement #7404

Open
matteobachetti opened this issue Apr 24, 2018 · 45 comments


@matteobachetti
Contributor

In this PR on another package:
discos/srt-single-dish-tools#91
I'm getting a failure in Appveyor:

>           os.unlink(dst)
E           PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\srttools-test-wx27cxl4\\Lib\\site-packages\\dummyfile.fits'

However, dummyfile.fits should not be open. All I/O on that file with astropy.io.fits.open() goes through with statements, and this call to unlink is in a function called outside of them, so the file should already be closed by the time execution reaches it.
Has anyone run into the same error?
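For what it's worth, the underlying Windows behavior can be reproduced with nothing but the standard library: a memory map keeps its own OS-level handle on a file even after the file object that created it is closed. A minimal sketch (the temp file is a stand-in for the FITS file; this is an analogy, not astropy code):

```python
import mmap
import os
import tempfile

# A small temp file stands in for the FITS file in the report.
fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * 4096)
os.close(fd)

f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
f.close()  # the file object is closed...

# ...but the mapping still holds its own OS-level handle: on Windows,
# os.unlink(path) at this point raises PermissionError (WinError 32).
first_bytes = mm[:10]  # the data is still readable through the map

mm.close()       # release the mapping
os.unlink(path)  # deletion now succeeds on every platform
```

On Linux and macOS the unlink would succeed even while the map is open, which is why the failure only shows up on Appveyor's Windows workers.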

@pllim
Member

pllim commented Apr 24, 2018

Can you provide a standalone snippet here to reproduce your problem? Are you using memmap?

@matteobachetti
Contributor Author

Hi @pllim, I couldn't do it because it doesn't happen in all cases, and I don't understand what makes one specific test fail and not the others. I triple-checked that all FITS files are closed before overwriting them, and on Linux and Mac it works properly. But Appveyor throws this error.
I was hoping that someone else had hit the same error and knew what kind of conditions could produce it.

@pllim
Member

pllim commented May 2, 2018

@Alex-Ian-Hamilton any chance you also encounter this on Windows? 🙂

@pllim
Member

pllim commented May 2, 2018

@matteobachetti can you please check if you still see this behavior if you force memmap=False? Please see the warning box in http://astropy.readthedocs.io/en/stable/io/fits/index.html#working-with-large-files regarding memmap.

Or could it be that the handle cannot be freed fast enough before the OS attempts to move the file? In that case, maybe you can put a time.sleep(x) before force_move_file(...)?

A quick internet search also brought up "make sure you have the correct file permission" and "make sure you run it with admin privilege", and also a mention of a magical task scheduler. These might or might not be relevant.

@Alex-Ian-Hamilton

I haven't ever noticed this, but just send me some code if you want me to test.

@matteobachetti
Contributor Author

@pllim I thought that memmap was set to False by default? I will try to force it. Thanks for the suggestion

@Alex-Ian-Hamilton the failure happens when I run unit tests on Appveyor. If you have some time, could you install these tools and run the unit tests?

For your convenience, I attach a series of commands you can just copy-and-paste (assuming you use Anaconda):

conda create -n srttools_test python=3
source activate srttools_test
conda install astropy scipy numpy matplotlib pyyaml h5py statsmodels numba
git clone https://github.com/matteobachetti/srt-single-dish-tools.git
cd srt-single-dish-tools
git fetch
git checkout mbfits_converter
python setup.py install
python setup.py test

Thanks!

@saimn
Contributor

saimn commented May 3, 2018

@matteobachetti - memmap is True by default (and this should be better documented: #6497)

@matteobachetti
Contributor Author

@saimn @pllim (Also related to #7440 by @cdeil) -- yes, I worked around the problem by forcing memmap=False everywhere.
However, there is clearly something weird going on, because those files were explicitly closed. How can I be sure that I closed a file and can work on it, if it was very large and I used memmap? That option is there to be used, and it would be very convenient :).
Thanks for your help!

@cdeil
Member

cdeil commented May 10, 2018

I think this issue might be the same as #7440: you are seeing it on Windows, I was seeing it on macOS. In #7440 there are simple test cases to reproduce.

I don't fully understand it, but basically the explanation of what is going on was given by @saimn
in #7440 (comment): hdu_list.close(), which is what is called when you exit the with fits.open(filename) as hdu_list: block, doesn't always close the memmap file (which is kind of like a second open file, see explanation here).

It's not clear to me whether this behaviour makes sense; as seen in this issue and #7440, it's very confusing and inconvenient for users, and if Astropy could be changed to a more predictable behaviour, i.e. always closing the memmap in hdu_list.close(), that would be nice. Does anyone have an example where that would cause problems? Why isn't it implemented like that?
Calling gc.collect() in hdu_list.close() is probably not a good idea, no?

@astrofrog
Member

astrofrog commented May 10, 2018

Fundamentally I guess the question is what should happen in an ideal world after:

from astropy.io import fits

hdulist = fits.open('myfile.fits', memmap=True)
data = hdulist[0].data
hdulist.close()

There are three possibilities here:

D1. data continues to be a memory-mapped array, and therefore there will be one open file remaining after hdulist.close()
D2. data is then switched to be a full copy of the data
D3. data stops working if accessed

It's not clear to me that D2 is desirable, since closing a file would then lead to a (potentially) giant memory increase, and it's also not clear how one would swap a memory-mapped array for a non-memory-mapped one in place.

Option D3 would probably break a lot of existing code.

So then I think D1 is the only option in the above snippet.

So then here are (in my view) the possible ways forward assuming D1 works above:

O1. Keep things as-is and just document the issue about closed files
O2. Change memmap to False by default. This is a reasonably big change and could lead to big memory increases for people already using io.fits which is not great if 3.1 was supposed to be the performance release ;)
O3. Change to memmap=False for FITS files that are below a certain size (e.g. 1 or 10MB) since it's not clear that memmap=True actually provides any benefits there, and this addresses the use case of opening many small FITS files.
O4. Keep track of how many memory-mapped open files there are and once the number goes above e.g. 20, we emit a warning that says WARNING: the number of open FITS files with memory mapping is now >20. If you don't need to use memory-mapping, you can explicitly set memmap=False when opening the FITS files or something along those lines.

Personally I would prefer O4 and maybe also O3 (at the same time), plus some documentation, but I'd be curious what other people think?
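For concreteness, option O4 could be little more than a counter checked at open time. A rough sketch (the function names and the ResourceWarning category are hypothetical, not actual astropy internals):

```python
import warnings

_open_mmap_count = 0
_WARN_THRESHOLD = 20  # hypothetical cutoff from option O4

def register_mmap_open():
    """Toy bookkeeping for option O4: warn once the count passes the threshold."""
    global _open_mmap_count
    _open_mmap_count += 1
    if _open_mmap_count > _WARN_THRESHOLD:
        warnings.warn(
            "the number of open FITS files with memory mapping is now >20; "
            "pass memmap=False when opening if you don't need memory mapping",
            ResourceWarning,
        )

def register_mmap_close():
    """Called when a memory-mapped file is actually released."""
    global _open_mmap_count
    _open_mmap_count -= 1
```

fits.open would call register_mmap_open() whenever memmap is in effect, and the matching close path would call register_mmap_close().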

@cdeil
Member

cdeil commented May 10, 2018

@astrofrog - I think I want D3 and would use none of the other options you mention. In Gammapy some users were running out of open files and the process ended, so for them it doesn't help to warn about too many open files, and they can't go in and change lots of Gammapy code to memmap=False, we have to do that for them and make sure files are closed at the end of all of our Map.read and Spectrum.read etc methods.

As I mentioned in #7440 (comment), I don't understand yet why reliably closing the memmap file shouldn't be possible in that case. Probably there's a line or two that could be added to Table.read or HDUList.__exit__ to close the memmap file in the cases where it is safe to do so.

Is it really not possible to close the memmap file in HDUList.__exit__? Can you give an example snippet of code that people might have now that would break?

If that's the case, then it might still be possible to offer a small API addition to let people do option D3 without having to call gc.collect() like @saimn suggested in #7440 (comment) ?

Otherwise, as some people have already said or started working on, yes, improving the docs to explain a bit about memmap would be very helpful, e.g. I'm still unsure if memmap=True or memmap=False is better when reading FITS BINTABLE into Table objects.

@astrofrog
Member

astrofrog commented May 10, 2018

Sorry for confusing things - I think there is indeed a bug in #7440 and maybe here too. I do think:

from astropy.io import fits

s = 0
for _ in range(10):
    hdulist = fits.open('myfile.fits', memmap=True)
    data = hdulist[0].data
    hdulist.close()
    s += data.sum()

should work, which means D1 (not D3) is the right behavior (and changing this will break a lot), but at the end of each iteration the memory-mapped object should indeed be closed since data then no longer has any references, and if it doesn't then that's a bug.

So just to be clear, I don't think HDUList.close() should completely close the memory mapping, but the memory map should be garbage collected when it goes out of scope.

@saimn
Contributor

saimn commented May 23, 2018

Coming to this a bit late, sorry. D1 is the only possible (and right) behavior imo, but I don't think that there is an issue here with io.fits. And I may be confused, but I think that #7440 is a different thing, as it concerns Table, not directly io.fits.
If I take @astrofrog's example just above and print the number of open files with psutil, it doesn't increase.

To show the behavior in more detail (and it looks correct to me):

>>> from astropy.io import fits
>>> import psutil
>>> proc = psutil.Process()
>>> print(proc.open_files())
[]
>>> hdulist = fits.open('astropy/io/fits/tests/data/arange.fits', memmap=True)
>>> print(proc.open_files())
[popenfile(path='/home/simon/dev/astropy/astropy/io/fits/tests/data/arange.fits', fd=3, position=8640, mode='r', flags=557056)]

Ok, the file is opened, now let's add a reference to .data:

>>> data = hdulist[0].data
>>> print(proc.open_files())
[popenfile(path='/home/simon/dev/astropy/astropy/io/fits/tests/data/arange.fits', fd=3, position=8640, mode='r', flags=557056), popenfile(path='/home/simon/dev/astropy/astropy/io/fits/tests/data/arange.fits', fd=4, position=8640, mode='r', flags=557056)]

We now have two references, due to the memmap object.

>>> hdulist.close()
>>> print(proc.open_files())
[popenfile(path='/home/simon/dev/astropy/astropy/io/fits/tests/data/arange.fits', fd=4, position=8640, mode='r', flags=557056)]

When closing, HDUList.close checks the number of references to decide whether it can close the memmap or not. In this case, data still holds a reference, so the file is closed but the memmap is not.

>>> data = None
>>> print(proc.open_files())
[]

If we remove the reference to the array using the memmap object, then everything is closed.

So I think it works as expected. But it is possible that there is a Windows-specific issue here?

@jdavies-st
Contributor

Coming to this very late.

I don't understand why fits.open() can't properly respect the with context manager with memmap=True. If I've written some code using that context manager, I've explicitly done so to make sure that the file is closed at __exit__. If I try to use any arrays that are memory-mapped outside the context manager, I should expect some sort of error. The fact that this doesn't happen seems like it must be a bug.

Now I understand the legacy concerns of the HDUList.close() method as described here and elsewhere in linked issues. But if I make a pointer to a memmap array within the with context manager, without an explicit copy(), I should expect it to go out of scope when my open file has gone out of scope.

The solution to

with fits.open('file.fits', memmap=True) as hdulist:
    data = hdulist[0].data
print(data.shape)

is to not use it. I should not expect it to work. One should use:

with fits.open('file.fits', memmap=True) as hdulist:
    data = hdulist[0].data
    print(data.shape)

or

with fits.open('file.fits', memmap=True) as hdulist:
    data = hdulist[0].data.copy()
print(data.shape)

Those are explicit.

Now if that means __exit__() needs to call some other method like close_for_sure(), then perhaps that is needed. But it really must respect the context manager.

@pllim
Member

pllim commented Apr 11, 2019

@jdavies-st , great idea. If you want to put in a PR for it, it is too late for 3.2 but maybe 4.0 if it happens in the next 2-3 months. This is essentially an API change, so I propose we not backport the behavior.

@saimn
Contributor

saimn commented Apr 11, 2019

I don't understand why fits.open() can't properly respect the with context manager with memmap=True. If I've written some code using that context manager, I've explicitly done so to make sure that the file is closed at exit. If I try to use any arrays that are memmaped outside the context manager, I should expect some sort of error. The fact that this doesn't happen seems like it must be a bug.

I don't think that this is a bug. The context manager is closing the file, the one that was passed to it. And the fact that another handle stays open is inherent to the use of memmap, and should be transparent to users, who should then be able to use the numpy array like any array, without knowing that there is a memmap behind it.

And forcing a copy is not an option for me; it means losing all the benefit of using memmap, e.g. not being able to use memmap with Table (e.g. Table.read(.., memmap=True)) or any class that wants to "load" a FITS file without loading the data into memory!

@jdavies-st
Contributor

I think the fact that memmap=True is the default means that it is not transparent to users. And most users don't understand how memmap works under the hood. With context managers, if I open a file with the context manager, and then close it, the assumption is that the file object with its attributes is closed.

The simple solution to being able to continue using a memory-mapped array in an HDU is to not close the HDUList. I.e. don't try to access these memory-mapped arrays outside the context manager.

Nobody is forcing anyone to copy. But we are asking that if I explicitly use a context manager, it should act like one. Nobody is forcing you to use the with context manager if you want to keep the data attributes open. Just don't close the file. But when I do explicitly want it closed, there had better be a method to close it, and whatever that method is, it should be the one called by __exit__.

@saimn
Contributor

saimn commented Apr 11, 2019

Breaking a lot of code in the wild, including astropy's Table, and probably CCDData, is not a "simple solution".
And proposing to leak a second file handle does not help much.
The current situation is not so bad, and does not justify such big changes imo. The use case of opening a lot of files can already be managed with memmap=False, or by carefully removing references and calling gc.collect().
That said, one thing that could easily be done is to add a keyword to fits.open to tell it to close the memmap in .close.
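The "remove references and call gc.collect" workaround can be illustrated with only the standard library; here a memoryview over an mmap plays the role of the memmapped .data array (an analogy, not astropy code):

```python
import gc
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * 1024)
os.close(fd)

f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
view = memoryview(mm)  # stands in for the memmapped .data array
f.close()

# Drop every reference to the mapped data, then collect, so that the
# mapping (and the file handle it holds) is really gone before deleting.
view.release()
mm.close()
del view, mm
gc.collect()

os.unlink(path)  # safe now, even on Windows
```

With astropy objects, the equivalent would be deleting the data reference and the HDUList before calling gc.collect() and only then removing the file.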

@drdavella
Contributor

I agree with @jdavies-st here. The library should close memory maps when the context manager goes out of scope. The entire point of context managers is to manage underlying resources. A memory map is no different in this context (pun intended) than the main file handle. The current behavior is completely unintuitive, not to mention dangerous.

I agree with @jdavies-st that the user has one of several correct options here:

  1. Keep the context manager active for as long as memory mapped data is required
  2. Explicitly copy the data
  3. Do not use the context manager

@drdavella
Contributor

In my opinion, the truly correct thing would require the following steps:

  1. Close both the underlying file handle and any memory maps when the context manager goes out of scope
  2. Disallow all access to the HDUList instance after it has been closed

This is the only way to handle the resources in a consistent and secure way.

As an instructive example, consider what happens when you attempt to do the following:

>>> with open('foo.txt') as ff:
...     data = ff.read()
>>> other_data = ff.read()
...
ValueError: I/O operation on closed file.

This is what HDUList should do if a user attempts to access it either after .close() has been called or after the context manager has gone out of scope. Strong opinion ahead, but any other behavior is nonsensical.

Obviously this would require a deprecation period and could only be changed in a major release. But it would be quite simple to add a DeprecationWarning to the bad behavior in the interim.
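As a sketch of the proposed behavior, here is a toy class (hypothetical, not astropy code) that guards every data access behind a closed flag, exactly like Python file objects do:

```python
class ClosedFileGuard:
    """Toy model of the proposal: any access after close() raises,
    mirroring how Python file objects behave. Not astropy code."""

    def __init__(self, payload=b"FITS data"):
        self._payload = payload
        self._closed = False

    def close(self):
        self._closed = True

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

    @property
    def data(self):
        if self._closed:
            raise ValueError("I/O operation on closed file.")
        return self._payload


with ClosedFileGuard() as hdulike:
    inside = hdulike.data  # fine while the context is open

try:
    hdulike.data  # after __exit__, access fails loudly
    message = None
except ValueError as exc:
    message = str(exc)
```

An HDUList variant would additionally have to release the mmap in close(), which is exactly the contentious part of this thread.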

@pllim
Member

pllim commented Apr 11, 2019

This is not a trivial behavior change and should be discussed over at astropy dev mailing list.

@cdeil
Member

cdeil commented Apr 12, 2019

How much work is it to try out the proposed change for one of the experts here?

I agree with @pllim that the discussion / decision on such a change should also happen on the astropy-dev list, but note that apart from a few experts, many users or developers of affiliated packages like me will only be able to give informed and useful feedback if you make it really clear what will break and change. Specifically, triggered by the comment / concern by @saimn above about Table.read, I've put a question at #8573, because for us in Gammapy BINTABLE read performance is important, and I don't understand if we can use it or how this change would affect us.

@saimn
Contributor

saimn commented Apr 12, 2019

It seems that memmap has been True since 2011 (spacetelescope/PyFITS@d3c5153), so quite a long time, and I think that it has been working pretty well, with only a few bug reports for very specific cases.

The use case of opening lots of FITS files is already mentioned in the FAQ (#3827, http://docs.astropy.org/en/latest/io/fits/appendix/faq.html#i-m-opening-many-fits-files-in-a-loop-and-getting-oserror-too-many-open-files). It was proposed there by @embray to add "an optional argument to HDUList.close() to close all mmaps as well", which I'm also proposing above. I think that this would be a more reasonable option than what is proposed by @jdavies-st and @drdavella.

I could also add that fits.getdata works well with the current behavior: you get a Numpy array, which can be sliced later without loading the whole array into memory.
Yes, I care about memory usage, and the current behavior is really practical for this purpose. Changing the behavior means that a lot of code will have to be adapted to get the same memory efficiency, at the price of keeping a second file handle open and having to close the resources manually when they are no longer used. BTW, spectral-cube also relies on this to avoid loading the whole file into memory, from what I can see at first glance.

Is there a reason for this sudden worry and strong words?

@drdavella
Contributor

drdavella commented Apr 12, 2019

@jdavies-st has explained the problems he has encountered with using the io.fits context manager in the library code he maintains. I can let him elaborate on the specifics, but basically it doesn't manage resources properly which leads to difficult-to-diagnose problems in various scenarios.

As a result of this discussion, though, it has become apparent to me that the way the context manager in io.fits is implemented is just fundamentally wrong. There's no other way to say it: the fact that IO resources remain open after the context manager is out of scope is just plain wrong. It flies in the face of the entire purpose of context managers, which is to manage underlying resources. It is completely unintuitive for anyone who has ever used a context manager in any other IO library in Python (e.g. file IO, sockets, etc.). The only reason that people don't see this is because, as you said, this design decision was made a long time ago in io.fits, and so everyone is just used to it. That doesn't mean it's okay, though.

Remember that memory mapped IO requires an underlying file handle. So at the level of OS library calls, an open file handle must be passed in order to create the memory map. Currently, when the io.fits context manager goes out of scope, the original file handle is "closed", but it is actually still being used by the memory map(s). So it hasn't been closed at all! This means that the context manager is not only meaningless in this scenario, but it is tricking developers into thinking that resources have been managed properly when they have not been at all. This is the reason for my strong opinion.

To demonstrate this, consider the following example. Any FITS file you have that has a data array will work:

from astropy.io import fits

with fits.open('myfile.fits', memmap=True) as hdulist:
    data = hdulist['DATA'].data

So now I have a reference to a data array that corresponds to underlying data. Now try the following in the same process:

import psutil

p = psutil.Process()
p.open_files()

You will see something like the following as the output:

[...,
 popenfile(path='/path/to/myfile.fits', fd=<N>)]

The file is still open!! So what purpose does the context manager even serve? Why are we using it? The underlying resources are not being managed!

Let's try again, but with an explicit call to close:

from astropy.io import fits

with fits.open('myfile.fits', memmap=True) as hdulist:
    data = hdulist['DATA'].data

hdulist.close()

It doesn't really make any sense to do this, because the context manager has already closed the open file handle, right? But look at the open files using psutil again, and you'll see that the file is still open!

To be clear, I'm not saying that the use cases you mentioned shouldn't be supported. They are clearly important and have good motivations. We should continue to enable them. But what I am saying is that we should not enable the use of a context manager for something that is in direct contradiction with its basic purpose.

Obviously breaking downstream code is not desirable and should be avoided when possible, but I would argue that it is not possible in this case. We should respect deprecation periods and wait for a major release, but this is an important issue that should be addressed.

@drdavella
Contributor

By the way, I am speaking from some experience here because asdf had a very similar design issue which took some effort to resolve. There were a few PRs that were required to fix the problem completely, which may serve as a useful reference here:

asdf-format/asdf#355
asdf-format/asdf#407

@matteobachetti
Contributor Author

matteobachetti commented Apr 13, 2019

Why not go through a long deprecation period? It's not like Astropy hasn't broken code before (for example, putting a lower limit on the Python version), but what has been done in the past was correct and can still be done now: deprecation warnings for a fair time frame, and from then on, if one needs memmap, one doesn't use the data outside the context managers (or sticks with the previous Astropy version).

The Windows bug I originally reported is a clear symptom, in my opinion, that this behavior is much closer to a bug than a feature.

@MSeifert04
Contributor

The problem is that a deprecation needs some transition between not closing the memmap and closing the memmap when the context manager exits.

For example, an argument saying that the memmap should be closed when the context manager exits; but then code supporting older versions of astropy cannot use the argument, because it doesn't exist in old versions. If there's no transition or toggle to opt in, then it's really complicated to find out whether the problem is "active" and enable the better behavior.

It's also non-trivial to find out whether the memmap is still used (for a deprecation warning), for example whether a sliced region of the memmap is referenced or not. Maybe there are solutions for this, but it's certainly complicated (or not, if there's an easy way).

I totally agree that it's a bug; I'm just not sure how to raise awareness of problematic usages and fix it in a way that can be adopted easily. We could just provide an alternative that handles it correctly and deprecate the current context manager (it shouldn't be removed for several minor/major versions), but that also has the drawback that we then have to maintain both a correct and an incorrect version.

@saimn
Contributor

saimn commented Apr 15, 2019

@drdavella - But it has been like this for ages. I described this behavior just above (#7404 (comment)) one year ago, so I'm surprised that you are so surprised. And the context manager's __exit__ just calls .close, so whether you call .close once or ten more times, the result will be the same.
Yes, there is a file descriptor that remains open for the memmap, as long as there are references to the memmap.

In asdf you made the choice of closing the memmap, which is easier to do with a recent library: you can simply tell your users that using the memmap after close is not possible. With io.fits the history is different. I could also note that in asdf-format/asdf#168 embray mentioned the io.fits behavior, and mdboom suggested adding a keyword to leave the memmap open. Because yes, it can be useful!


I still argue that this is not a bug, because the current behavior is a choice, a tradeoff that has its advantages. I did not make this choice, but to me it is a reasonable one, with more advantages than drawbacks. And this choice was made ~8 years ago, so yes, there is certainly a lot of code that depends on this.

From a user's point of view there is currently no difference between opening with the memmap option or not. Closing the memmap in the context manager, or in .close (since there is no difference between the two), will break the assumption that a reference to .data behaves the same way in both cases. memmap is an implementation detail that users do not need to understand, and that does what they expect in terms of memory usage (slicing a big image or cube, even outside the context manager, will not blow up your memory). This is a great feature for big datasets.

If you break this, and ask libraries that need the benefits of memmap to stop using the context manager, firstly it will be a lot of work, and secondly it may have the inverse consequence of leaking more file handles. Classes such as CCDData, SpectralCube, or mpdaf's Cube will need a .close method to close the two file descriptors, and will have to ask users to call it. Many users will not do this, because it is quite unexpected and easy to forget, which will lead to more leaks.

And to add another example: the netcdf file reader in scipy uses memmap, and they are not closing the memmap if it is still referenced. The difference with io.fits is that they show a warning in this case, to tell the user that if they want the file to be cleanly closed they need to copy the data:
https://github.com/scipy/scipy/blob/90534919e139d2a81c24bf08341734ff41a3db12/scipy/io/netcdf.py#L295-L319
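That scipy-style compromise is easy to model: warn at close() time if mapped data is still referenced. A toy sketch (not scipy's or astropy's actual code; the class and method names are made up):

```python
import warnings

class MappedFile:
    """Toy file object that warns, scipy-netcdf style, when close() is
    called while its mapped data is still referenced."""

    def __init__(self):
        self._data_referenced = False

    def read_data(self):
        self._data_referenced = True  # a caller now holds mapped data
        return b"mapped bytes"

    def close(self):
        if self._data_referenced:
            warnings.warn(
                "Cannot close a file while there is still a reference to "
                "its mapped data; copy the data if you need it afterwards.",
                RuntimeWarning,
            )

mf = MappedFile()
data = mf.read_data()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mf.close()  # emits the RuntimeWarning because data is still held
```

The real implementation would use reference counting on the mmap rather than a boolean, but the user-visible effect is the same.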

I also understand the need for a way to be sure that all files are closed, and it is not difficult to add an option to do so. But changing the current default behavior seems a very difficult path, as explained by @MSeifert04. If the majority here really want to go this way, I wonder if the only sane choice is to deactivate memmap by default.

@matteobachetti
Contributor Author

@saimn: I do understand that a lot of code relies on this, and that solving this can be very painful, but whoever says that this use of the context manager is counterintuitive and non-standard has a very good point. In my original example, as a user, I was fed a choice I didn't make (memmap=True by default), undermining the choice I did make (closing the file in order to modify it later), without a warning.
Going the scipy-warning route and giving a long grace period (with DeprecationWarning) before moving to a new default can be painful, but in the long run I would say it's the way to go.
At the very least, this behavior should be documented and highlighted.

Would an APE be a good way to propose and discuss this change and its consequences for affiliated packages?

@saimn
Contributor

saimn commented Apr 15, 2019

@matteobachetti - Yes, this is certainly a valid point. In the case you reported above, the problem is that Windows does not allow deleting an open file, so it is platform-specific (you can do this on Linux, probably on macOS as well), but I agree that it is unexpected. The current behavior could indeed be better documented; the existing FAQ item should be more general than just the "too many open files" issue.

One remark about the scipy case: it is a RuntimeWarning, not a DeprecationWarning.

Yes, we can think about what would be the best default behavior, and whether we can find a not-too-painful way to make the transition, but this is really not easy. And we must be careful not to end up in a situation worse than the existing one. Leaking a few file descriptors is not a big deal in most cases; things can be worse: at first glance it seems that accessing a closed memmap leads to a segfault (also reproducible with np.memmap).

In [2]: with fits.open('astropy/io/fits/tests/data/arange.fits') as hdul: 
   ...:     data = hdul[0].data 

In [3]: data.base                                                                              
Out[3]: <mmap.mmap at 0x7f1372bbd4b0>

In [4]: data.base.close()                                                                      

In [5]: data                                                                                   
Out[5]: [1]    16269 segmentation fault (core dumped)  ipython

@MSeifert04
Contributor

MSeifert04 commented Apr 16, 2019

Just FYI:

Yes, there is a file descriptor that remains open for the memmap, as long as there are some references to the memmap.

While it's practically correct, this relies on a CPython implementation detail: objects (that don't participate in reference cycles) are destroyed immediately after the last reference is dropped. That's not how other Python implementations work, and it's not even a guarantee in CPython. One should always assume that destruction can happen anytime between (and including) "last reference to the object dropped" and "process end".

@drdavella
Contributor

drdavella commented Apr 16, 2019

I understand the arguments for backwards compatibility. If you insist on retaining the current behavior, the only correct solution is to deprecate and remove the context manager for io.fits entirely. I have already demonstrated that its current behavior is not only meaningless but also extremely misleading to developers and problematic in many cases. Clearly I am not the only developer who is surprised by this, nor the only one who has encountered problems due to the implementation. Just because a choice was made in the past does not mean that it was necessarily correct, or that it reflects current best practices. That is no reason not to fix the issue.

@drdavella
Contributor

@saimn I just looked at the netcdf case you posted. I think that a warning like this would be extremely helpful here in the short term and should probably be added for the next release. I still would argue that this is only a workaround, though, and the underlying issue still needs to be addressed.

Just to attempt to summarize my earlier points, here are my main objections with the current behavior:

  1. The use of a context manager implies to users that all resources are properly handled, when no such thing actually occurs in the case of memmap=True. I feel very strongly that this is wrong because it violates the principle of least surprise.
  2. Given that the memory-mapped resources are not handled at all by either the context manager or HDUList.close, there is no obvious way for users to actually release those resources when they are no longer required. If io.fits requires users to explicitly manage memory-mapped resources, then it should provide a way to do this. It should not obscure this fact by providing a context manager that actually does nothing.

@orionlee
Copy link
Contributor

orionlee commented Oct 31, 2020

I'd like to add that the problem also occurs when callers use open and close to explicitly open and release a FITS file, not only when using a with statement.

Example snippets:

    cbv = fits.open('tess_sample_cbv.fits')
    cbv_time = cbv[1].data['Time']
    cbv.close()
    
    os.remove('tess_sample_cbv.fits')  # PermissionError

@jdavies-st
Copy link
Contributor

Yes, that's exactly the current behavior. cbv.close() doesn't actually close the file, and there's no public interface to do so.

@embray
Copy link
Member

embray commented Nov 2, 2020

cbv.close() doesn't actually close the file

It actually does--the issue is that in the above example there are two handles open to the file: One that was initially used to parse it, and one that is being used to access the data array.

One thing I've thought of doing is adding a warning about this when .close() is called but there is still a handle held on the data, though this only affects users on Windows trying to do something like os.remove() on a file while they're still using the data in that file.

In the above example you can also call .copy() on the array and it should work.

@embray
Copy link
Member

embray commented Nov 2, 2020

I do agree with @drdavella's points about this being both surprising and problematic for sane resource management.
I'm just not sure what the best solution is. When calling HDUList.close() should we search for any existing references to the data, maybe even force the garbage collector to run (not a bad idea) to ensure any reference cycles are broken and any remaining data references are deleted?

I would actually like this, and to display a warning in this case, because I think continuing to use the data after closing the file is indicative of sloppy programming, albeit understandable since most people don't realize that when they access hdu.data the full array is not being read into memory.

A warning would help to more easily debug problems like this, and tidy up code. It can also offer an explanation and possible solutions, and/or link to the FAQ (which someone mentioned elsethread should be updated, which I agree).
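A stdlib-only sketch of what such a warning could look like; `MmapStillReferencedWarning` and `warn_if_mmap_referenced` are hypothetical names for illustration, not astropy API:

```python
import warnings

class MmapStillReferencedWarning(UserWarning):
    """Hypothetical: data arrays still reference the mmap after close()."""

def warn_if_mmap_referenced(n_refs):
    # Sketch of a check close() could perform before giving up on the mmap.
    if n_refs > 0:
        warnings.warn(
            f"{n_refs} array(s) still reference the memory-mapped file; "
            "the OS handle stays open until they are deleted. Call .copy() "
            "on the data if you need it after closing the file.",
            MmapStillReferencedWarning,
            stacklevel=2,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_mmap_referenced(2)

print(len(caught))  # one warning was emitted
```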

@orionlee
Copy link
Contributor

orionlee commented Nov 2, 2020

I wonder why this problem does not surface on Linux when the FITS file is removed: I don't mean the PermissionError, but problems accessing the underlying (memory-mapped) data.

I did a small test on google colab, running Ubuntu, astropy 4.1, numpy 1.18.5. Somehow one can access mem-mapped data even after the underlying FITS file is removed.

The FITS file is about 50 MB, in case the size matters.

#%%
# !curl -C - -L -o sample_tpf_bak.fits https://mast.stsci.edu/api/v0.1/Download/file/?uri=mast:TESS/product/tess2018206045859-s0001-0000000033877176-0120-s_tp.fits
!ls -l sample_tpf_bak.fits

#%%
from astropy.io import fits
import os

def type_of_original_base(ary):
  if getattr(ary, 'base', None) is None:
    return type(ary)
  else:
    return type_of_original_base(ary.base)


# create a copy to manipulate and possibly remove it
!cp sample_tpf_bak.fits sample_tpf.fits
f_path = 'sample_tpf.fits'

f_tpf = fits.open(f_path, memmap=True)
f_flux = f_tpf[1].data['FLUX']
print('before remove: ', f_flux.shape, type_of_original_base(f_flux))
f_tpf.close()
os.remove(f_path)

!ls -l sample_tpf.fits
# memory-mapped f_flux is still available after f_tpf is closed
print('after remove:  ', f_flux.shape, type_of_original_base(f_flux))
print(f_flux[0])

@ayshih
Copy link
Contributor

ayshih commented Nov 2, 2020

From the documentation for os.remove():

On Windows, attempting to remove a file that is in use causes an exception to be raised; on Unix, the directory entry is removed but the storage allocated to the file is not made available until the original file is no longer in use.

That is, on Unix systems, the file is "deleted" in that it doesn't show up to the user anymore, but the data is still protected.
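That behavior is easy to demonstrate with plain files on a Unix system (the path is a throwaway temp file; on Windows the os.remove() call would raise instead):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("still here")

handle = open(path)          # keep an open handle on the file
os.remove(path)              # Unix: only the directory entry disappears
print(os.path.exists(path))  # False: the name is gone...
content = handle.read()
print(content)               # "still here": the inode lives on
handle.close()               # now the storage is actually reclaimed
```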

@orionlee
Copy link
Contributor

orionlee commented Nov 2, 2020

A warning would help to more easily debug problems like this, and tidy up code. It can also offer an explanation and possible solutions, and/or link to the FAQ (which someone mentioned elsethread should be updated, which I agree).

Documentation / FAQ would be helpful. A warning could be helpful too, but I worry it might end up generating excessive warnings for what I suspect are the predominant cases that do not affect users. The issue arises only when code tries to remove (or move/rename) the underlying FITS file.

@embray
Copy link
Member

embray commented Nov 3, 2020

On *NIX removing a file simply removes the file from directory entries. This is why os.remove() is a synonym for os.unlink(). The inode for the file is only actually removed when no processes have an open handle to the file.

On Windows the corresponding function DeleteFileA/W has different semantics:

The DeleteFile function fails if an application attempts to delete a file that has other handles open for normal I/O or as a memory-mapped file (FILE_SHARE_DELETE must have been specified when other handles were opened).

The DeleteFile function marks a file for deletion on close. Therefore, the file deletion does not occur until the last handle to the file is closed. Subsequent calls to CreateFile to open the file fail with ERROR_ACCESS_DENIED.

The note about FILE_SHARE_DELETE is interesting; I hadn't heard of that before. Here's some more useful information I found about it. Unfortunately it wouldn't be straightforward to use this in Python without a bit of original C code, or at least ctypes. I think it's possible though--I'll give it a try if I have time and see if it helps at all.
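For reference, here is a rough ctypes sketch of the idea: open a file with FILE_SHARE_DELETE so other code may still delete it. The constants come from the Win32 headers, `open_with_share_delete` is a hypothetical helper, and on non-Windows platforms it simply returns None since the concept does not apply there.

```python
import ctypes
import sys

def open_with_share_delete(path):
    """Open `path` so other processes may still delete it (Windows only).

    Rough sketch: returns a raw Win32 handle, or None on platforms
    where FILE_SHARE_DELETE does not apply.
    """
    if sys.platform != "win32":
        return None
    GENERIC_READ = 0x80000000
    FILE_SHARE_READ = 0x00000001
    FILE_SHARE_DELETE = 0x00000004
    OPEN_EXISTING = 3
    handle = ctypes.windll.kernel32.CreateFileW(
        path,
        GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_DELETE,
        None,
        OPEN_EXISTING,
        0,
        None,
    )
    if handle == ctypes.c_void_p(-1).value:  # INVALID_HANDLE_VALUE
        raise ctypes.WinError()
    return handle

print(open_with_share_delete("dummyfile.fits"))
```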

@eslavich
Copy link
Contributor

eslavich commented May 7, 2021

Some fodder for discussion:

Over in the asdf library we have a feature that automatically closes np.memmap arrays when the associated ASDF file is closed. Unfortunately, views over those arrays don't know that the memory map has been closed, and an attempt to use them results in a segfault! Here's what that would look like with astropy.io.fits:

from astropy.io import fits
import numpy as np

fits.HDUList([fits.PrimaryHDU(), fits.ImageHDU(np.zeros(10))]).writeto("test.fits")

with fits.open("test.fits") as hdul:
  view = hdul[-1].data[0:5]
  # Simulate closing the memory map
  hdul[-1].data.base.close()

print("Segfault in 3... 2... 1...", flush=True)

print(view)

So far I haven't come up with a reasonable way to protect against this -- the view object's base array is a memoryview that points to the memory map's address and it has no idea that the address has become invalid. If we can't fix it, I think the segfault is enough of a catastrophe to justify removing the auto-close feature from asdf.
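One partial mitigation, sketched here with numpy's own memmap rather than the fits/asdf internals: walk a view's .base chain looking for the underlying mmap object, whose .closed flag does know when the mapping has gone away. This is a heuristic, not a guaranteed invariant of every array's base chain.

```python
import mmap
import os
import tempfile

import numpy as np

def backing_mmap(arr):
    # Walk the .base chain; numpy memmap views ultimately rest on an
    # mmap.mmap object (heuristic: other buffer types break the walk).
    while arr is not None:
        if isinstance(arr, mmap.mmap):
            return arr
        arr = getattr(arr, "base", None)
    return None

path = os.path.join(tempfile.mkdtemp(), "demo.dat")
np.zeros(10).tofile(path)

arr = np.memmap(path, dtype=float, mode="r")
view = arr[0:5]

mm = backing_mmap(view)
print(mm is not None and not mm.closed)  # True: reads are currently safe

arr._mmap.close()  # simulate an auto-close like asdf's
print(mm.closed)   # True: a guard could now refuse to touch `view`
```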

@saimn
Copy link
Contributor

saimn commented May 7, 2021

Yep, that's indeed a good reason for not closing the memmap. (#7404 (comment))
So what io.fits does is close the memmap only if there are no other references:

    def _maybe_close_mmap(self, refcount_delta=0):
        """
        When mmap is in use these objects hold a reference to the mmap of
        the file (so there is only one, shared by all HDUs that reference
        this file).

        This will close the mmap if there are no arrays referencing it.
        """
        if (self._mmap is not None and
                sys.getrefcount(self._mmap) == 2 + refcount_delta):
            self._mmap.close()
            self._mmap = None
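The `2 + refcount_delta` in that check comes from how `sys.getrefcount` reports its result (a CPython-specific detail: the function's own argument adds one temporary reference):

```python
import sys

obj = object()
# One binding (`obj`) plus the temporary reference held by the
# getrefcount call itself: a "lone" object reports 2.
count_lone = sys.getrefcount(obj)
print(count_lone)  # 2

alias = obj  # every extra binding raises the count by one
count_aliased = sys.getrefcount(obj)
print(count_aliased)  # 3
```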
