Allow reading/modifying arrays with Mmap #235

Open
JonasIsensee opened this issue Sep 20, 2020 · 4 comments

Comments

@JonasIsensee
Collaborator

To allow modifying arrays in existing files, we should be able to make use of the mmaparrays keyword.

Essentially, one can modify read_array to check for the mmaparrays flag and, if it is set,
use Mmap.mmap to return a memory-mapped array.
I have tested this locally and it works (some work on alignment is required, but that seems solvable).
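
For readers unfamiliar with memory mapping, here is a minimal sketch of the underlying mechanism using only the Mmap standard library. It does not touch JLD2 or read_array; the raw file of five Float64 values is just a stand-in for a dataset, and the variable names are made up for illustration.

```julia
using Mmap

# Write 5 Float64 values as raw bytes (a stand-in for a dataset inside a file).
path = tempname()
write(path, collect(1.0:5.0))

# Map the file read/write; `b` is backed by the file, not by a private copy.
io = open(path, "r+")
b = Mmap.mmap(io, Vector{Float64}, (5,))
b[1] = 0.0
Mmap.sync!(b)   # flush the modified page to disk
close(io)

# Re-read through ordinary file I/O to check that the change reached the file.
on_disk = reinterpret(Float64, read(path))
```

Note that nothing here knows about checksums: the bytes change on disk, and any checksum stored elsewhere in the file would silently go stale.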

The main problem is with checksums.
JLD2 computes a checksum for every dataset; modifying an array invalidates that checksum, so it has to be recomputed.
When and how to recompute it is the tricky part. Consider the following case:

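using JLD2
@save "test.jld2" a=rand(5)

# open for read/write and request memory-mapped arrays
f = jldopen("test.jld2", "r+"; mmaparrays=true)
b = f["a"]
b[1] = 0      # modified while the file is open
close(f)

b[2] = 1      # modified after the file was closed
exit()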

One of my initial ideas was to recompute the checksum inside close(f). This works, but only if the array is not updated again after the file has been closed.
What should happen when someone tries to edit the array after the file has been closed is open for discussion (error / nothing / array no longer accessible). I just don't want the above to corrupt the file or segfault Julia.

Other ideas include implementing a finalizer, but I must admit that I don't fully understand the docs for that
and haven't been able to make it work.
Yet another approach could be to implement our own Array wrapper type that knows about the state of the file and takes care of calling Mmap.sync! and recomputing the checksum.
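
A rough sketch of what such a wrapper could look like, under several assumptions: TrackedArray and finish! are hypothetical names that do not exist in JLD2, Base.hash stands in for whatever checksum JLD2 actually stores, and a raw temporary file stands in for a dataset. The point is only to show the "error after close" behaviour.

```julia
using Mmap

# Hypothetical wrapper: an array over mapped storage that knows whether its
# file is still open. None of these names exist in JLD2.
mutable struct TrackedArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}   # memory-mapped storage
    isopen::Bool       # false once the owning file has been closed
    checksum::UInt     # stand-in checksum (Base.hash), refreshed on close
end
TrackedArray(data::Array{T,N}) where {T,N} = TrackedArray{T,N}(data, true, hash(data))

Base.size(a::TrackedArray) = size(a.data)
Base.IndexStyle(::Type{<:TrackedArray}) = IndexLinear()
Base.getindex(a::TrackedArray, i::Int) = a.data[i]
function Base.setindex!(a::TrackedArray, v, i::Int)
    a.isopen || error("backing file is closed; writes are rejected")
    a.data[i] = v
end

# What close(f) could do for each outstanding array: flush the mapping,
# refresh the checksum, then block further writes.
function finish!(a::TrackedArray)
    Mmap.sync!(a.data)
    a.checksum = hash(a.data)
    a.isopen = false
    return a
end

path = tempname()
write(path, zeros(5))
io = open(path, "r+")
b = TrackedArray(Mmap.mmap(io, Vector{Float64}, (5,)))
b[1] = 1.0            # allowed while the file is open
finish!(b); close(io)
rejected = try
    b[2] = 1.0        # the late write from the example above
    false
catch
    true
end
```

Reads still work after finish!, since the mapping itself stays valid; whether that is desirable is part of the open design question.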

@ktdq

ktdq commented May 19, 2022

Quick fix: what if arrays are mapped in read-only mode whenever the mmaparrays flag is used? If you want read/write access, don't use mmaparrays for now.

@philbit

philbit commented May 31, 2022

This feature would be great (even if read-only). However, my use case would be to read only a very small portion of a large array from a file on an NFS volume, rather than the whole file. Would that even work the way I think it would? Or do I remember correctly that there were problems with mmap over NFS? Currently, to speed things up, I copy the whole file to a local directory before reading, because I expect that to be faster than accessing it directly (especially if there are multiple random accesses to the file), but that consideration might change if it were possible to read only a small portion over the network.

@Gregstrq

Gregstrq commented Mar 5, 2024

I am not completely sure, but it seems that my issue is related to the improvement discussed here. Therefore I would like to ask: what is the status of this proposal? Is there a WIP solution? What parts of it are working and what parts of it are not? What else needs to be done?

I wanted to fill a large dataset array part by part in a loop. To do that, I tried to create the dataset at the required size in advance and then update its parts afterwards. However, it did not work: none of the changes I made were saved to disk. As an MWE of what I tried to do, consider

using JLD2

a = zeros(100,100)
jldsave("test.jld2"; a=a)

jldopen("test.jld2", "r+") do file
    file["a"][:,1] = ones(100)
end

Interestingly, if I use HDF5 to update the array created by JLD2 (replacing jldopen with h5open), the update works.

Since JLD2 implements a subset of HDF5 and HDF5 is able to update the array, can we borrow the approach used there?

@JonasIsensee
Collaborator Author

Hi @Gregstrq,
there is currently no WIP.

jldopen("test.jld2", "r+") do file
    file["a"]   #[:,1] = ones(100)
end

This bit already loads the whole array from the file into an Array{Float64,2} in memory. Modifying that array changes only the in-memory copy, so it has little to do with JLD2.

As described above, one might be able to use mmap but would possibly need to recompute checksums to keep the file valid.
Another option could be a completely new API that wraps the dataset and behaves like an AbstractArray.
JLD2.lookup_offset can give you the position in the file where the dataset is stored.
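
To illustrate what one could do with such an offset, here is a sketch that maps a matrix at a known byte offset and overwrites one column in place. Everything about the file layout is an assumption: the 16-byte header and the hard-coded data_offset are stand-ins, and JLD2.lookup_offset is deliberately not called here. In a real JLD2 file this would also leave the stored checksum stale, as discussed above.

```julia
using Mmap

# Build a stand-in file: a 16-byte fake header followed by a 100x100 Float64
# matrix in column-major order. In a real JLD2 file the offset would have to
# come from the library; here it is hard-coded.
path = tempname()
header = zeros(UInt8, 16)
open(path, "w") do io
    write(io, header)
    write(io, zeros(100, 100))
end
data_offset = length(header)   # assumed byte offset of the dataset

# Map the dataset in place and overwrite its first column.
open(path, "r+") do io
    a = Mmap.mmap(io, Matrix{Float64}, (100, 100), data_offset)
    a[:, 1] .= 1.0
    Mmap.sync!(a)
end

# Read back only the first column with plain seek/read!, without loading
# the rest of the matrix.
col = Vector{Float64}(undef, 100)
open(path) do io
    seek(io, data_offset)      # column 1 starts at the dataset offset
    read!(io, col)
end
```

The same seek/read! pattern covers the partial-read use case raised earlier in the thread, since column j of an m-row Float64 matrix starts at data_offset + (j - 1) * m * 8 bytes.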
