
Moving to an org / making a golden standard for NetCDF files in Julia #57

Open
Datseris opened this issue Jan 9, 2020 · 26 comments

@Datseris
Contributor

Datseris commented Jan 9, 2020

(continuing from an offtopic discussion in #47 (comment) )

Concerning moving NCDatasets to an organization: I prefer to keep it under my name, but I might reconsider should I have less time to work on it.

I've spent some time arguing why I think having packages in orgs instead of under individual owners is a good thing. Here is my summary: having more contributors is only a small part of why I am suggesting this. It's more about improving discoverability, improving trust by the community, inviting more collaborators (for some, having a single person as owner can be decisive in not contributing), having more sets of eyes trying to converge on the best design decisions, having more maintainers that can review and merge PRs as well as answer issues, and even fairness (because if one person absolutely takes all decisions according to their own personal agenda, that's not very fair to the rest of the contributors or the user community, I'd say). These are some of the benefits among others. Obviously there are downsides for you personally, e.g. sharing "power". Others have also argued why orgs can be "bad" here: https://discourse.julialang.org/t/newcomer-contributor-in-juliageo-and-co-help-me-get-started/32480 but I remain in favor of orgs.

Besides, something as central as "loading nc datasets", which I imagine is used by so many people working in these fields, is just too important to lack those aforementioned features. I am a huge proponent of having that "one" package that does that "one" thing as well as possible. I will for sure be spending the next two years working with .nc files, and thus I want to spend time and effort making this golden standard possible, even channeling some of the community into this effort. In my eyes, this seems easier to do with an org.

Ultimately it is obviously up to your personal agenda what happens with this repo, but I just hope you can see that my suggestions truly come from "improve the community experience" and not "I want to take power from you". I believe other people share the same view. I will of course continue contributing here weekly, but I am also convinced that such a central part of the ecosystem should be part of an organization. Of course, you don't have to decide immediately. As I contribute more and more here, the benefits of collaboration might sway your opinion towards the org.


The first reason for opening this discussion is simply laying down the facts from my perspective. The second reason is that I'd like @Alexander-Barth to please give a bit more transparency about what should or should not belong in this repo. This request is motivated by the following comments:

Concerning the v(:, :, 1:5): I am wondering if you can have this alternative syntax as an extension package, but it is a clever idea. (from)

(... discussion regarding sum on a dataset) but I am not sure if computing should be part of NCDatasets. (from)

The aforementioned comments hint at a lack of transparency about what would be accepted as a PR. Working on a feature that is then rejected as "unfitting" when submitted as a PR can be very demotivating... Can you please give some hints on what you think should be here and what not?

My journey with netcdf files started by porting Python code to Julia. The Python code used xarray for its .nc stuff, and, truth be told, xarray unfortunately has a very long list of features that do not exist in Julia (I was even considering using xarray for all my .nc related work). As I decided to go with the native approach and implement these features instead, doing it here is the only natural thing for me.
But that's only if they are accepted, obviously :P

@Alexander-Barth
Owner

Concerning v(:, :, 1:5) as an alias of view(v,:,:,1:5): I like to follow the approach There should be one-- and preferably only one --obvious way to do it. To me, reusing the view function of Julia with a similar meaning is more obvious than introducing a new syntax, wouldn't you agree?
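
For reference, Base's `view` already covers this pattern for plain arrays (a minimal pure-Julia sketch, not NCDatasets-specific):

```julia
# `view` returns a lazy SubArray that shares the parent's memory; no copy.
A = rand(10, 10, 10)
V = view(A, :, :, 1:5)   # a 10×10×5 window into A
size(V)                  # (10, 10, 5)
V[1, 1, 1] = 0.0         # writes through to A
A[1, 1, 1]               # 0.0
```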

but I am not sure if computing should be part of NCDatasets.

What I meant is that it could maybe be part of a different package, maybe SlowAbstractArrays (:-)) (where indexing of individual elements is slow but loading of slices is relatively fast). The CFVariable could derive from SlowAbstractArrays. Any data format for arrays (HDF5, zarr, image formats, video formats, ...) could then effectively reuse this. If every data format had to define on its own what it means to sum over a given dimension, it would not be very efficient.
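
A hypothetical sketch of that idea (the names `SlowAbstractArray` and `slicesum` are made up here): reductions are written once, slice by slice, so any disk-backed subtype that reads whole slices cheaply can reuse them.

```julia
# Backends with slow element access but fast slice reads would subtype this:
abstract type SlowAbstractArray{T,N} <: AbstractArray{T,N} end

# One generic slice-wise sum over the last dimension, reusable by any backend;
# written against AbstractArray so it also runs on plain in-memory arrays.
function slicesum(A::AbstractArray{T,3}) where {T}
    acc = zeros(float(T), size(A, 1), size(A, 2))
    for k in axes(A, 3)
        acc .+= A[:, :, k]   # one cheap slice read per iteration
    end
    return acc
end
```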

@Datseris
Contributor Author

To me reusing the view function of Julia with a similar meaning, is more obvious than introducing a new syntax, wouldn't you agree?

Yeap, I do. Justifications make things easier to see from others' perspectives :)

@visr
Contributor

visr commented Jan 10, 2020

@Alexander-Barth is https://github.com/meggart/DiskArrays.jl along the lines of what you had in mind for SlowAbstractArrays?

By the way, I'm thinking of proposing to effectively deprecate NetCDF.jl in favor of this package. We could do only bugfixes there and point to this package for further development. I haven't discussed this with anyone yet, so @meggart do chime in. (Plus we'd have to ask the other contributors in a NetCDF.jl issue.)

But at this point I don't see strong reasons to keep both around. I do like the name NetCDF.jl better. Perhaps you can take it if we deprecate the original haha.

@Datseris
Contributor Author

Thank you, I really applaud you for suggesting this, as I was going to do it anyway, but since I am not a contributor to NetCDF.jl, I would have appeared quite a mean person... :P

@Alexander-Barth
Owner

Alexander-Barth commented Jan 12, 2020

Yes, ~~DickArrays.jl~~ DiskArrays.jl is exactly what I had in mind. It is great to see that it already exists! Thanks @visr !

One of the reasons that I am reluctant to propose moving NCDatasets into a Julia organization is that I do not want to add to the confusion of Julia users who are not quite sure which library to pick, and I do not want to give the impression that there is a competition between organizations. But if there is just one non-deprecated netcdf package, then it should rather be part of an organization. The most NetCDF expertise indeed lies in the JuliaGeo organization, so it would be a good choice.

The name NetCDF.jl is indeed nicer than NCDatasets.jl, but we should try to minimize the impact on current users.

In any case, I will wait before releasing version 1.0.

@visr
Contributor

visr commented Jan 12, 2020

Nice typo in your first sentence ;)

In that case, if @meggart agrees, it is probably best to first create an issue in NetCDF.jl as I mentioned, and go from there.

To be honest, I have not done an extensive evaluation between the two. Are you aware of any significant things that you can do with NetCDF.jl that you cannot do with NCDatasets.jl?

@Alexander-Barth
Owner

Alexander-Barth commented Jan 12, 2020

Nice typo in your first sentence ;)

Haha, what a weird thing to have in one's mind! ;-)

To be honest, I have not done an extensive evaluation between the two. Are you aware of any significant things that you can do with NetCDF.jl that you cannot do with NCDatasets.jl?

I compared the docs, and I did not see any issue from this perspective. In NCDatasets.jl, we currently cannot read compound types. Can you read such data in NetCDF.jl? I have never come across a NetCDF file with such data in the wild. As far as I know, Matlab and Octave do not support such types either, but python-netcdf4 does.

@visr
Contributor

visr commented Jan 12, 2020

No, I'm quite sure NetCDF.jl doesn't support that either. I see no reference to it other than in the auto-wrapped code, and I have not come across compound types myself.

@meggart

meggart commented Jan 13, 2020

I think there is a feature in NetCDF.jl that is not present in this package. The only and most important reason for me to use NetCDF.jl instead of this package is type stability when reading from objects representing an NC variable. For example, let's generate an example file:

import NCDatasets
import NetCDF
NetCDF.nccreate("dataset.nc", "v1", "longitude",1000, "latitude", 500, "time", 31, atts=Dict("scale_factor"=>5.0, "add_offset"=>1.0), t=UInt8)
NetCDF.open("dataset.nc","v1",mode = NetCDF.NC_WRITE) do v
    v[:,:,:] = rand(UInt8, 1000,500, 31)
end;

and when we read a slice from a CFVariable we get

ds = NCDatasets.Dataset("dataset.nc")
v = ds["v1"]
@code_warntype v[:,:,1]
Variables
  #self#::Core.Compiler.Const(getindex, false)
  v::NCDatasets.CFVariable{Union{Missing, Dates.DateTime, CFTime.AbstractCFDateTime, Number},3,NCDatasets.Variable{UInt8,3},NCDatasets.Attributes}
  indexes::Core.Compiler.PartialStruct(Tuple{Colon,Colon,Int64}, Any[Core.Compiler.Const(Colon(), false), Core.Compiler.Const(Colon(), false), Int64])
  attnames::Array{String,1}
  isscalar::Any
  fillvalue::Any
  mask::Any
  units::Any
  calendar::Union{AbstractChar, String}
  tmp::Any
  data::Any
  datam::Any
  @_13::Bool
  @_14::Any

Body::Any
1 ──       Core.NewvarNode(:(isscalar))
│          Core.NewvarNode(:(fillvalue))
│          Core.NewvarNode(:(mask))
│          Core.NewvarNode(:(units))
│          Core.NewvarNode(:(calendar))
│          Core.NewvarNode(:(tmp))
│          Core.NewvarNode(:(datam))
│    %8  = Base.getproperty(v, :attrib)::NCDatasets.Attributes
│          (attnames = NCDatasets.keys(%8))
│    %10 = Base.getproperty(v, :var)::NCDatasets.Variable{UInt8,3}
│    %11 = Core.tuple(%10)::Tuple{NCDatasets.Variable{UInt8,3}}
│          (data = Core._apply(Base.getindex, %11, indexes))
│    %13 = NCDatasets.typeof(data)::DataType
│    %14 = (%13 == NCDatasets.String)::Bool
└───       goto #3 if not %14
2 ──       (@_13 = %14)
└───       goto #4
3 ── %18 = NCDatasets.typeof(data)::DataType
└───       (@_13 = %18 == NCDatasets.DateTime)
4 ┄─       goto #6 if not @_13
5 ──       (@_14 = true)
└───       goto #7
6 ── %23 = NCDatasets.ndims(data)::Any
└───       (@_14 = %23 == 0)
7 ┄─       (isscalar = @_14)
└───       goto #9 if not isscalar
8 ──       (data = Base.vect(data))
9 ┄─ %28 = ("_FillValue" in attnames)::Bool
└───       goto #11 if not %28
10 ─ %30 = Base.getproperty(v, :attrib)::NCDatasets.Attributes
│          (fillvalue = Base.getindex(%30, "_FillValue"))
│          (mask = NCDatasets.isfillvalue(data, fillvalue))
└───       goto #12
11 ─ %34 = NCDatasets.size(data)::Any
└───       (mask = NCDatasets.falses(%34))
12 ┄ %36 = Base.getproperty(v, :var)::NCDatasets.Variable{UInt8,3}
│    %37 = NCDatasets.eltype(%36)::Core.Compiler.Const(UInt8, false)
│    %38 = (%37 != NCDatasets.Char)::Core.Compiler.Const(true, true)
└───       goto #17 if not %38
13 ─ %40 = ("scale_factor" in attnames)::Bool
└───       goto #15 if not %40
14 ─ %42 = Base.getproperty(v, :attrib)::NCDatasets.Attributes
│    %43 = Base.getindex(%42, "scale_factor")::Any
└───       (data = %43 * data)
15 ┄ %45 = ("add_offset" in attnames)::Bool
└───       goto #17 if not %45
16 ─ %47 = data::Any
│    %48 = Base.getproperty(v, :attrib)::NCDatasets.Attributes
│    %49 = Base.getindex(%48, "add_offset")::Any
│    %50 = Base.broadcasted(NCDatasets.:+, %47, %49)::Any
└───       (data = Base.materialize(%50))
17 ┄ %52 = ("units" in attnames)::Bool
└───       goto #20 if not %52
18 ─ %54 = Base.getproperty(v, :attrib)::NCDatasets.Attributes
│          (units = Base.getindex(%54, "units"))
│    %56 = NCDatasets.occursin(" since ", units)::Union{Missing, Bool, Base.var"#58#59"{_A} where _A}
└───       goto #20 if not %56
19 ─ %58 = Base.getproperty(v, :attrib)::NCDatasets.Attributes
│    %59 = NCDatasets.get(%58, "calendar", "standard")::Any
│          (calendar = NCDatasets.lowercase(%59))
│    %61 = data::Any
│    %62 = Base.broadcasted(NCDatasets.:!, mask)::Base.Broadcast.Broadcasted{_A,Nothing,typeof(!),_B} where _B<:Tuple where _A<:Union{Nothing, Base.Broadcast.BroadcastStyle}
│    %63 = Base.materialize(%62)::Any
│    %64 = Base.getindex(%61, %63)::Any
│    %65 = units::Any
│          (tmp = NCDatasets.timedecode(%64, %65, calendar))
│    %67 = tmp::Any
│    %68 = NCDatasets.size(data)::Any
│          (data = NCDatasets.similar(%67, %68))
│    %70 = data::Any
│    %71 = tmp::Any
│    %72 = Base.broadcasted(NCDatasets.:!, mask)::Base.Broadcast.Broadcasted{_A,Nothing,typeof(!),_B} where _B<:Tuple where _A<:Union{Nothing, Base.Broadcast.BroadcastStyle}
│    %73 = Base.materialize(%72)::Any
└───       Base.setindex!(%70, %71, %73)
20 ┄       goto #24 if not isscalar
21 ─ %76 = Base.getindex(mask, 1)::Any
└───       goto #23 if not %76
22 ─       return NCDatasets.missing
23 ─ %79 = Base.getindex(data, 1)::Any
└───       return %79
24 ─ %81 = NCDatasets.eltype(data)::Any
│    %82 = Core.apply_type(NCDatasets.Union, %81, NCDatasets.Missing)::Type
│    %83 = Core.apply_type(NCDatasets.Array, %82)::Type{Array{_A,N} where N} where _A
│          (datam = (%83)(data))
│    %85 = Base.dotview(datam, mask)::Any
│    %86 = Base.broadcasted(Base.identity, NCDatasets.missing)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Base.RefValue{Missing}}}
│          Base.materialize!(%85, %86)
└───       return datam

while for NetCDF.jl:

NetCDF.open("dataset.nc","v1") do v
    @code_warntype v[:,:,1]
end
Variables
  #self#::Core.Compiler.Const(getindex, false)
  v::NetCDF.NcVar{UInt8,3,7}
  i1::Core.Compiler.Const(Colon(), false)
  i2::Core.Compiler.Const(Colon(), false)
  i3::Int64

Body::Array{UInt8,3}
1 ─ %1 = NetCDF.readvar(v, i1, i2, i3)::Array{UInt8,3}
└──      return %1

Where the call to the C function gets inlined. This is not a problem per se, since one can use function barriers to make this type-stable. However, as I remember, the initial reason for the split of the packages was that I, as a maintainer, was hesitant to accept a PR that would apply scale_factor and add_offset automatically but would silently introduce a type instability. The second issue I had was the temporary arrays allocated during this step:

if eltype(v.var) != Char
    if "scale_factor" in attnames
        data = v.attrib["scale_factor"] * data
    end
    if "add_offset" in attnames
        data = data .+ v.attrib["add_offset"]
    end
end
which might be a problem when working with very large files that barely fit into memory.

Before I switched my main focus to Zarr, I had a solution in mind for how to apply scale_factor and add_offset while still being type-stable and without making intermediate copies, but I don't know the codebase of NCDatasets well enough to easily see if it would apply here.
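
The function-barrier workaround mentioned above can be sketched in plain Julia (`read_slice` and `mean_of_slice` are hypothetical names, standing in for a type-unstable read and the downstream computation):

```julia
# `read_slice` simulates a read whose element type the compiler cannot
# infer, like the CFVariable indexing above.
read_slice() = rand(Bool) ? rand(UInt8, 4, 4) : rand(Float64, 4, 4)

# Function barrier: `process` is compiled separately for each concrete
# array type, so everything inside it is type-stable.
process(data::AbstractArray) = sum(data) / length(data)

function mean_of_slice()
    data = read_slice()   # type-unstable here (union of array types)
    return process(data)  # one dynamic dispatch, then fast specialized code
end
```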

@meggart

meggart commented Jan 13, 2020

Regarding the future of these packages, we might think about reaching out to JuliaIO. There are NetCDF files in the wild which are not in the Geo domain (like chemistry) and where CF conventions don't apply. So maybe it would be an option to move NetCDF.jl to JuliaIO (in case they would accept it) as a basic and more low-level package and have NCDatasets, which covers more CF-specific functionality, as part of JuliaGeo.

@Alexander-Barth
Owner

@meggart, thanks for pointing out the type stability, which is indeed a concern. I am currently working on a branch where I require that all attributes affecting the element type be available early enough to compute the element type based on the types of add_offset, scale_factor, ...

But so far, the type stability issues are still present.

So maybe it would be an option to move NetCDF.jl to JuliaIO (in case they would accept it) as a basic and more lowlevel package and have NCDatasets as part of JuliaGeo which covers more CF-specific functionality.

This sounds also like a good solution to me.

@Alexander-Barth
Owner

I think I now have a solution to make it type-stable. Essentially, all variables affecting the element type are now part of the CFVariable via a named tuple (called _storage_attrib). Parameters like add_offset can have a value or can be Nothing. The transformations of the CF conventions are implemented via functions like these:

@inline CFtransform_offset(data,add_offset) = data + add_offset
@inline CFtransform_offset(data,add_offset::Nothing) = data
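
A self-contained sketch of the same dispatch pattern (illustrative names, not NCDatasets' actual internals beyond the two methods quoted above) shows how the Nothing branch folds away:

```julia
# Dispatching on `Nothing` lets the compiler delete transforms whose
# attribute is absent; a NamedTuple keeps every field concretely typed.
@inline cf_scale(data, scale_factor) = data * scale_factor
@inline cf_scale(data, ::Nothing) = data
@inline cf_offset(data, add_offset) = data + add_offset
@inline cf_offset(data, ::Nothing) = data

attrib = (scale_factor = 5.0, add_offset = nothing)
decode(x, a) = cf_offset(cf_scale(x, a.scale_factor), a.add_offset)
decode(0x02, attrib)   # 10.0; the add_offset step compiles to a no-op
```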

If I define a variable with only a scale_factor, indexing a single element shows that the add-offset step is indeed optimized away:

NetCDF.nccreate("dataset_only_scale.nc", "v1", "longitude",1000, "latitude", 500, "time", 31, atts=Dict("scale_factor"=>5.0), t=UInt8)
ds = NCDatasets.Dataset("dataset_only_scale.nc"); v = ds["v1"]; @code_typed v[1,1,1]

Returns now:

julia> @code_typed v[1,1,1]
CodeInfo(
1 ─ %1  = Base.getfield(v, :var)::NCDatasets.Variable{UInt8,3}
│   %2  = (getfield)(indexes, 1)::Int64
│   %3  = (getfield)(indexes, 2)::Int64
│   %4  = (getfield)(indexes, 3)::Int64
│   %5  = invoke Base.getindex(%1::NCDatasets.Variable{UInt8,3}, %2::Int64, %3::Int64, %4::Int64)::UInt8
│   %6  = Base.getfield(v, :_storage_attrib)::NamedTuple{(:fillvalue, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor),Tuple{Nothing,Float64,Nothing,Nothing,Nothing,Nothing}}
│   %7  = Base.getfield(%6, :scale_factor)::Float64
└──       goto #3 if not true
2 ─       nothing::Nothing
3 ┄ %10 = Base.uitofp(Float64, %5)::Float64
│   %11 = Base.mul_float(%10, %7)::Float64
└──       goto #4
4 ─       goto #5
5 ─       goto #6
6 ─       goto #7
7 ─       return %11
) => Float64

But the important part is that for the example above, all types are now concrete:

ds = NCDatasets.Dataset("dataset.nc"); v = ds["v1"]; @code_warntype v[:,:,1]
Variables
  #self#::Core.Compiler.Const(getindex, false)
  v::NCDatasets.CFVariable{Float64,3,NCDatasets.Variable{UInt8,3},NCDatasets.Attributes,NamedTuple{(:fillvalue, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor),Tuple{Nothing,Float64,Float64,Nothing,Nothing,Nothing}}}
  indexes::Core.Compiler.PartialStruct(Tuple{Colon,Colon,Int64}, Any[Core.Compiler.Const(Colon(), false), Core.Compiler.Const(Colon(), false), Int64])
  data::Array{UInt8,2}

Body::Array{Float64,2}
1 ─ %1  = Base.getproperty(v, :var)::NCDatasets.Variable{UInt8,3}
│   %2  = Core.tuple(%1)::Tuple{NCDatasets.Variable{UInt8,3}}
│         (data = Core._apply(Base.getindex, %2, indexes))
│   %4  = data::Array{UInt8,2}
│   %5  = NCDatasets.fillvalue(v)::Core.Compiler.Const(nothing, false)
│   %6  = NCDatasets.scale_factor(v)::Float64
│   %7  = NCDatasets.add_offset(v)::Float64
│   %8  = NCDatasets.time_origin(v)::Core.Compiler.Const(nothing, false)
│   %9  = NCDatasets.time_factor(v)::Core.Compiler.Const(nothing, false)
│   %10 = NCDatasets.eltype(v)::Core.Compiler.Const(Float64, false)
│   %11 = NCDatasets.CFtransformdata(%4, %5, %6, %7, %8, %9, %10)::Array{Float64,2}
└──       return %11

If for some reason one wants the raw data, this is possible via the following code (this is now type-stable too):

ds = NCDatasets.Dataset("dataset.nc"); v = ds["v1"]; @code_warntype v.var[:,:,1]

I noticed that in NetCDF.jl, v[:,:,1] returns a 3D array of size (1000, 500, 1), while indexing a Julia array with such subscripts would typically return a 2D array. I guess this is done for simplicity/performance.
NCDatasets returns a 2D array, but currently at the expense of an additional copy, which I would really like to avoid:

data = v[ind...][drop_index...]
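
One way to avoid that second copy (sketched on a plain array; whether it fits the indexing code here is a separate question) is dropdims, which shares the parent's buffer:

```julia
# `dropdims` is a reshape under the hood: same buffer, no copy.
A = rand(Float64, 1000, 500, 1)   # stands in for the freshly read data
B = dropdims(A; dims = 3)         # 1000×500, aliasing A's memory
size(B)                           # (1000, 500)
B[1, 1] = 0.0
A[1, 1, 1]                        # 0.0 — confirms the buffer is shared
```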

@Datseris
Contributor Author

woohoo, huge progress!

@Balinus
Contributor

Balinus commented Jan 23, 2020

Nice!

Having a singleton dimension is quite useful to store the time information. If it returns only a 2D array, then there is no dimension left to store the associated timestamp.

@Alexander-Barth
Owner

Alexander-Barth commented Jan 23, 2020

@Balinus, you can keep the singleton dimension if you read the full variable, like Array(v) or v[:,:,:] (or even v[:,:,1:1] for a single time slice). Here the behavior is identical whether v is a regular Julia array or a CFVariable.

@Datseris
Contributor Author

If it's of any worth, my opinion is that v[:, :, 1] should indeed drop singleton dimensions, the same as Base Julia. The syntax v[:, :, 1:1] is the solution for Balinus; I agree with that as well.
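
For reference, this is exactly how plain Julia arrays behave:

```julia
A = rand(4, 3, 2)
size(A[:, :, 1])     # (4, 3)    — an integer index drops that dimension
size(A[:, :, 1:1])   # (4, 3, 1) — a length-1 range keeps it
```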

@meggart

meggart commented Feb 13, 2020

Just wanted to mention that I opened this issue at JuliaIO requesting to move NetCDF.jl there. Just in case someone wants to add something.

@visr
Contributor

visr commented Feb 13, 2020

Sounds good to me! We'd probably need an extra clear section in the readme on which package to use.

One thing I'm a bit concerned about is maintenance of NetCDF.jl. I'm not so interested in keeping up work on NetCDF.jl if JuliaGeo chooses NCDatasets. Will you still want to do that, @meggart? Because we should also be clear with the JuliaIO folks if its current maintainers effectively abandon it.

I see the point that NCDatasets has some geo-specific features whereas NetCDF doesn't. The question is: are those features an issue for non-geo users?

@Datseris
Contributor Author

I still can't see the reason for both existing; I hope someone writes down a reason for this one day...

@Alexander-Barth
Owner

I just came across this comment recently, which made me think.
JuliaLang/PackageCompiler.jl#304 (comment)

I propose that I mention Fabian and Martijn too in the License file (+ a link to github for a full list of contributors) for the use of the netcdf_c.jl file and the error handling code in NCDatasets.jl (unless you do not want to, of course).

@meggart

meggart commented Feb 13, 2020

Will you still want to do that @meggart?

Yes, regarding pure maintenance of the package I would like to continue, because there is some legacy code of mine that I don't want to update. I know that I am not as quick in replying to issues as others, but I think a waiting time of one or two weeks until an issue gets a reply should be acceptable for a package like this.

I still can't see the reason of both existing, I hope someone writes a reason for this one day...

I think there is no strong enough reason to delete one of them. There is already quite some legacy code around, and I think it would be nice if it continued to work.

On a personal note, I think that the scope of a package named NetCDF.jl should not be to implement as many features as possible, but to provide the basic IO functionality needed to read and write files and leave the rest to the user. I really like the added functionalities of NCDatasets.jl, but many of these are not really unique to NetCDF datasets. For example, the multifile Dataset is basically an implementation of a lazily concatenated array. Why not make a general Julia package for this, so that it is also usable for stacking HDF5 files or similar? The same is true for CF conventions. I really appreciate that @Alexander-Barth factored out CFTime.jl, and I am using it in my code for encoding and decoding time in Zarr files. To me, this is exactly the way to go. The same is true for applying add_offset and scale_factor, which can simply be done in a lazy manner using https://github.com/JuliaArrays/MappedArrays.jl if one does not want to make a copy of the data.
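
The lazy scale/offset idea can be sketched without any package dependency by wrapping the raw array (a pure-Julia stand-in for what MappedArrays' mappedarray(f, A) provides; LazyDecode is a made-up name):

```julia
# A read-only lazy view that decodes elements on access; the raw
# array is shared, never copied.
struct LazyDecode{A<:AbstractMatrix{UInt8}} <: AbstractArray{Float64,2}
    parent::A
    scale::Float64
    offset::Float64
end
Base.size(L::LazyDecode) = size(L.parent)
Base.getindex(L::LazyDecode, i::Int, j::Int) =
    L.parent[i, j] * L.scale + L.offset

raw = UInt8[0x01 0x02; 0x03 0x04]   # stands in for packed on-disk data
decoded = LazyDecode(raw, 5.0, 1.0)
decoded[1, 2]   # 11.0, computed on the fly
```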

If you look at the Python world, I find myself hardly using the python-netcdf4 package anymore, because it is so easy to read any kind of data using xarray, which uses different IO packages as drivers in the background. So, in my opinion, to use resources efficiently we should concentrate on making rather minimalistic IO packages (this is where I personally see a package like NetCDF.jl) and build convenience packages on top that can handle more than a single backend and give more user-friendliness.

@visr
Contributor

visr commented Feb 13, 2020

Yes, regarding pure maintenance of the package I would like to continue, because there is some legacy code of mine that I don't want to update.

Ok good. I would not recommend deleting NetCDF.jl altogether, but even if it is in bugfix-only mode, that would be good to signal to users. Waiting a week or two for replies is of course ok in any sense.

In general I agree with your favored approach of keeping focused packages and factoring out / reusing packages for non-NetCDF-specific features. I think the reality is, however, that sometimes this takes too long, and bundling things together in one package allows one to quickly provide users with the convenient APIs they are used to from packages like xarray. Developments in the ecosystem (including your own, for instance DiskArrays) may catch up, however, and should make it easier to apply your favored approach.

Perhaps if we keep NetCDF.jl around, we could move the NetCDF IO core of NCDatasets to NetCDF.jl and make NCDatasets depend on that? I don't mind exactly how we do it in the end, as long as there is a clear story for users, and ideally minimal duplication of core NetCDF IO / binary wrapping code.

@Alexander-Barth
Owner

For example the Multifile Dataset is basically an implementation of a lazily concatenated array. Why not make a general Julia package for this so that it is also usable for stacking HDF5 files or similar? The same is true for CF conventions.

If you are interested in why: it is simply that I am not familiar enough with raw HDF5 files and it is only recently that I became aware that CF conventions are used outside of NetCDF files.

An essential part of the multifile code (with aggregation over a given dimension) is implemented in a submodule CatArrays which does not import anything from NCDatasets. It could be factored out
(and I guess it won't be too hard to find a logo ;-) ), but it is not documented yet, and it is a fairly straightforward translation of some Octave code (https://github.com/Alexander-Barth/octave-ncarray/blob/master/inst/ncCatArray.m) without caring too much about type stability, unfortunately, but this can be fixed. If there are some (any) volunteers to help me in this process, this could be a separate package.

Factoring-out the CF part, would unfortunately be a lot harder.

@meggart

meggart commented Feb 17, 2020

An essential part of the multifile code (with an aggregation of a given dimension) is implemented in a submodule CatArrays which does not import anything from NCDatasets

This is very good to know. I will see what I can do. I also found https://github.com/JuliaArrays/LazyArrays.jl but for my purposes they do not fit, because they are developed around being in-memory and having fast access to single elements. So your code is definitely a better start for a separate package that could be reused with any DiskArray.

@visr
Contributor

visr commented Feb 29, 2020

Perhaps another package for stacked arrays to look at: https://github.com/mcabbott/LazyStack.jl.

@alex-s-gardner

It would be good to revive this discussion to see if we can come to an agreement within the community as to the path forward. It's been 3 years, both packages are reasonably mature now, and the two-package ecosystem isn't all that helpful. My vote would be to move NetCDF.jl to JuliaIO, move NCDatasets.jl to JuliaGeo, and then slowly remove duplication in NCDatasets.jl by building on top of NetCDF.jl. This would make it easier for users to determine which package is most appropriate to use and contribute to. To me this seems like the most sustainable solution in the long term. @Alexander-Barth and @meggart, do you think we can find a path forward that would work for everyone?
