Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

splitobs and DataLoader make views, but say they require only getobs #167

Open
mcabbott opened this issue Sep 19, 2023 · 3 comments
Open

Comments

@mcabbott
Copy link
Contributor

It's surprising that splitobs and DataLoader make views, when they mention only getobs in their docstrings, which does not:

help?> splitobs
  splitobs(data; at, shuffle=false) -> Tuple

  Split the data into multiple subsets proportional to the value(s) of at.

  If shuffle=true, randomly permute the observations before splitting.

  Supports any datatype implementing the numobs and getobs interfaces.
[...]

julia> splitobs(ones(1,5); at=0.2)[1]
1×1 view(::Matrix{Float64}, :, 1:1) with eltype Float64:
 1.0

julia> getobs(ones(1,5), 1:2)  # what it says it does
1×2 Matrix{Float64}:
 1.0  1.0

julia> obsview(ones(1,5), 1:2)  # what it actually uses
1×2 view(::Matrix{Float64}, :, 1:2) with eltype Float64:
 1.0  1.0

help?> DataLoader
search: DataLoader DataType

  DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])

  An object that iterates over mini-batches of data, each mini-batch containing batchsize
  observations (except possibly the last one).

  Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data
  object that implements the numobs and getobs methods.

[...]

This means that they do not preserve OneHotArrays, which is FluxML/OneHotArrays.jl#40 .

But more generally, perhaps copies are just safer?

@ToucheSir
Copy link
Contributor

Looking back through the blame, I'm guessing the idea was to not materialize large datasets. The easiest path would be to change the docs, but that doesn't solve the OneHotArrays issue. Maybe

MLUtils.jl/src/obsview.jl

Lines 224 to 227 in af7ebea

function obsview(A::AbstractArray{T,N}, idx) where {T,N}
I = ntuple(_ -> :, N-1)
return view(A, I..., idx)
end
is too general and obsview should be returning ObsView for more types?

@mcabbott
Copy link
Contributor Author

mcabbott commented Oct 9, 2023

Yes I'm sure not materialising was the goal. However, some views are more useful than others. I wonder if the default should be something like splitobs(ones(1,10); at=0.5) makes two contiguous views, which are almost as good as Arrays, but splitobs(ones(1,10); at=0.5, shuffle=true) makes copies?

The OneHotArrays issue could be solved on that side, by changing what view does. Or more narrowly by changing what obsview does.

@ToucheSir
Copy link
Contributor

Regardless of solution, the docstring should 100% be updated to mention some type of view is returned with ObsView being the detault.

The OneHotArrays issue could be solved on that side, by changing what view does

How straightforward do you think this would be? Since FluxML/OneHotArrays.jl#40 is mostly about performance, another idea would be adding a kwarg which controls whether a view or copy is returned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants