
Archiving DataLoaders.jl #90

Open · 6 tasks

lorenzoh opened this issue May 28, 2022 · 20 comments

@lorenzoh (Contributor)

lorenzoh commented May 28, 2022

Now that MLUtils.jl has feature parity with DataLoaders.DataLoader, it's time to deprecate DataLoaders.jl.

We should:

@CarloLucibello (Member)

CarloLucibello commented May 28, 2022

I'm having second thoughts on this, since I've heard complaints that MLUtils.jl is now too heavy for someone who only needs basic utilities like a train/test split. For instance, the load latency of MLUtils is halved by removing the parallel dependencies:

julia> @time using MLUtils # on master
  2.090915 seconds (2.10 M allocations: 124.960 MiB, 3.50% gc time, 68.02% compilation time)
julia> @time using MLUtils  # after removing FLoops and FoldsThreads
  0.928274 seconds (890.85 k allocations: 53.444 MiB, 2.36% gc time, 73.03% compilation time)

@CarloLucibello (Member)

CarloLucibello commented May 28, 2022

Maybe we can keep DataLoaders.jl, move it to the JuliaML org, copy and paste the implementation from here, and tag a breaking release. Then we have Flux reexport it.

For MLUtils, we keep eachobs without the parallel option.

As a further breakdown, I also suggest moving getobs and numobs and their implementations for the Base types to an MLCore.jl repo. But that can come later; with the excision of the parallel dataloader, MLUtils is a lean dependency for someone who wants to implement the obs interface.
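For readers following along, the interface in question is tiny. A hedged sketch of what such an MLCore.jl could contain (illustrative names mirroring MLUtils.jl's public API; this is not the actual source):

```julia
# Illustrative sketch only: the getobs/numobs "obs interface" that could be
# excised into a lightweight MLCore.jl.
module ObsInterfaceSketch

# Fallbacks: any container supporting length/getindex works out of the box.
numobs(data) = length(data)
getobs(data, idx) = data[idx]

# Arrays: observations are indexed along the last dimension.
numobs(A::AbstractArray) = size(A)[end]
getobs(A::AbstractArray{<:Any,N}, idx) where {N} =
    A[ntuple(_ -> Colon(), N - 1)..., idx]

# Tuples of containers: index each element, assuming all elements agree
# on the number of observations.
numobs(t::Tuple) = numobs(first(t))
getobs(t::Tuple, idx) = map(x -> getobs(x, idx), t)

end # module
```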

Or we just go on with the plan outlined in the OP. What do you think @lorenzoh @darsnack @ToucheSir ?

@lorenzoh (Contributor, Author)

Right, I didn't think of the latency. I think what you're proposing makes sense, and makes for an easier migration as well.

@juliohm (Member)

juliohm commented May 28, 2022

My opinion is that we should really consider how many people are annoyed by this latency issue and whether that demand justifies dividing the work again into multiple repositories.

Premature modularization is a serious issue in Julia organizations, where people split things into packages because one user commented about load time. At some point we all hope that Julia will fix the latency issue, so let's think carefully before deciding to go separate ways.

Can you think of users of DataLoaders.jl that are not users of MLUtils.jl? If the answer is yes, then we have a more solid argument to keep these efforts separate with such a tiny community of maintainers.

@ToucheSir (Contributor)

How much can we reduce import times without removing core deps (FLoops can probably go, though, since it's mostly used for syntactic sugar)? I don't think this package has any precompilation, for example.

@darsnack (Member)

I agree with Brian that we should first look into reducing the load time contributed by FLoops. As a user, it would be nicer not to have to load different packages to set up the data pipeline.

I think the main complaint regarding load times comes from packages that want to extend the interface, not from end-user load times. For this, let's see how far people get with the getindex and length interface. If we find that lots of types need to define getobs and numobs specifically, then we can split the interface out. I think that makes more sense than splitting out the parallel loading.
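To illustrate that interface-lite approach, here is a hypothetical data container (the type and its methods are invented for this example) that implements only the Base interface:

```julia
# Hypothetical example: a data container that implements only Base.length
# and Base.getindex, and therefore needs no dependency on MLUtils at all.
struct SquaresDataset
    n::Int
end

Base.length(d::SquaresDataset) = d.n
Base.getindex(d::SquaresDataset, i::Int) = i^2

# Because MLUtils' numobs/getobs fall back to length/getindex, downstream
# code such as `DataLoader(SquaresDataset(100); batchsize = 10)` can consume
# this type without its author ever loading MLUtils.
```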

@lorenzoh (Contributor, Author)

lorenzoh commented May 29, 2022

As another data point, the import time difference is also quite big on my machine:

julia> @time using MLUtils
  0.222900 seconds (468.17 k allocations: 27.956 MiB, 1.72% compilation time) (no parallel deps)
  1.010500 seconds (1.67 M allocations: 95.612 MiB, 1.94% gc time, 43.90% compilation time) (master)
  0.995093 seconds (1.56 M allocations: 88.791 MiB, 46.31% compilation time) (without FLoops)

Edit: added time without FLoops.jl

@CarloLucibello (Member)

> Can you think of users of DataLoaders.jl that are not users of MLUtils.jl? If the answer is yes, then we have a more solid argument to keep these efforts separate with such a tiny community of maintainers.

I can think of many users of MLUtils not being users of DataLoaders.

> It would be nicer as a user to not have to load different packages to set up the data pipeline.

True. This is mitigated, though, by the fact that Flux's users just import Flux and have everything they need, without having to know anything about MLUtils or DataLoaders. Non-deep-learning users can just use eachobs without any parallel option.

> Edit: added time without FLoops.jl

So removing FLoops doesn't help much.

As for the maintenance burden of another repo, I expect DataLoaders.jl to be very low activity; we will hardly change anything until we start experimenting with other parallel loading strategies.

@ToucheSir (Contributor)

What about precompilation? I don't think any packages in JuliaML use it?

@CarloLucibello (Member)

Precompilation would help with TTFX, but not with speeding up `using`, right?
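For context, a sketch of what package-side precompilation meant under the tooling of the time (the module and function here are invented for illustration): explicit precompile directives are executed during the package's own precompile phase and cached, cutting time-to-first-call, while `using` still pays the cost of loading the cache.

```julia
# Illustrative only: explicit precompile directives inside a hypothetical package.
module BatchingSketch

batchranges(n::Integer, batchsize::Integer) =
    [i:min(i + batchsize - 1, n) for i in 1:batchsize:n]

# Compiled during package precompilation and cached, improving TTFX for this
# concrete signature; it does not reduce the cost of `using BatchingSketch`.
precompile(batchranges, (Int, Int))

end # module
```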

@ToucheSir (Contributor)

I thought it was part of SciML/DifferentialEquations.jl#786, but reading back through SciML/DifferentialEquations.jl#786 (comment) it appears invalidations + Requires were more important for import times. Has anyone tried running @time_imports from 1.9 on MLUtils? I don't have a nightly Julia installed locally.

@theabhirath (Contributor)

This is what I get:

julia> @time_imports using MLUtils
     10.4 ms    ┌ MacroTools
     17.8 ms  ┌ ZygoteRules 34.58% compilation time
      0.6 ms  ┌ DefineSingletons
      0.3 ms    ┌ IteratorInterfaceExtensions
      0.9 ms  ┌ TableTraits
      1.1 ms      ┌ DelimitedFiles
      3.3 ms    ┌ Compat
      1.2 ms    ┌ Requires
      0.4 ms    ┌ DataValueInterfaces
      1.0 ms      ┌ DataAPI
     11.2 ms    ┌ Tables
      1.0 ms    ┌ ConstructionBase
     16.2 ms    ┌ Setfield 29.96% compilation time (22% recompilation)
     13.6 ms    ┌ InitialValues
    127.6 ms  ┌ BangBang 55.14% compilation time (2% recompilation)
     10.0 ms  ┌ FunctionWrappers
     51.4 ms  ┌ DataStructures
      0.4 ms    ┌ CompositionsBase
     15.4 ms  ┌ Accessors 24.46% compilation time
      1.5 ms  ┌ ContextVariablesX
     60.8 ms    ┌ ChainRulesCore
     62.5 ms  ┌ ChangesOfVariables
      0.8 ms    ┌ InverseFunctions
      2.8 ms    ┌ DocStringExtensions 52.48% compilation time
      3.6 ms    ┌ IrrationalConstants
      0.7 ms    ┌ StatsAPI
      0.7 ms    ┌ SortingAlgorithms
      1.4 ms    ┌ LogExpFunctions
      6.1 ms    ┌ Missings
     38.9 ms  ┌ StatsBase 3.78% compilation time
      6.9 ms    ┌ MicroCollections
      0.7 ms    ┌ Adapt
      0.7 ms    ┌ ArgCheck
      6.4 ms    ┌ SplittablesBase
     34.7 ms    ┌ Baselet
     85.8 ms  ┌ Transducers 10.24% compilation time
     28.4 ms  ┌ MLStyle
      1.1 ms  ┌ PrettyPrint
      5.1 ms  ┌ ShowCases
    265.3 ms  ┌ FoldsThreads 86.56% compilation time
      1.2 ms  ┌ FLoopsBase
      0.7 ms  ┌ NameResolution
      3.4 ms  ┌ JuliaVariables
      6.1 ms  ┌ FLoops
    747.3 ms  MLUtils 43.14% compilation time (<1% recompilation)

@CarloLucibello (Member)

We could do the lazy import (or require import) thing we do in MLDatasets.jl here as well for FoldsThreads (see https://github.com/johnnychen94/LazyModules.jl), but it is a hackish solution that we should avoid if possible.
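A hedged sketch of that "require import" pattern in the Requires.jl style available pre-1.9 (module name, file name, and UUID below are placeholders invented for illustration; the real UUID would come from the dependency's Project.toml):

```julia
# Illustrative sketch: the heavy parallel dependency is only wired up
# if the user loads it themselves.
module LoaderSketch

using Requires

# Serial fallback, always available without heavy dependencies.
eachobs_serial(data) = (data[i] for i in 1:length(data))

function __init__()
    # Placeholder UUID; substitute the real one from FoldsThreads' Project.toml.
    @require FoldsThreads = "00000000-0000-0000-0000-000000000000" begin
        include("parallel_loader.jl")  # hypothetical file holding the parallel path
    end
end

end # module
```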

@ToucheSir (Contributor)

Some finer-grained timing locally suggests that the majority of import overhead for BangBang and FoldsThreads is self time and not from dependencies (unless there is significant invalidation happening, but that does not appear to be the case). It's not clear to me why this is though, @tkf do you have any thoughts?

@CarloLucibello (Member)

I'm waiting for the outcome of this discussion before tagging a new release, since it adds the parallel API to DataLoader, which we may have to break in the future if the parallel implementation is moved out. But a few changes have now accumulated, most importantly bug fixes like #97, so we should reach a decision soon.

@lorenzoh (Contributor, Author)

Would it be possible to add parallel as an experimental option? It could be removed later by simply ignoring it.

Thinking it through, I believe it would be better to keep DataLoader in MLUtils.jl. If criticisms mount afterwards, it can always be removed.

Re being able to implement the data container interface without taking on a large dependency: I think getindex and length should suffice in most use cases; where they don't is mostly for built-in types like Array, Tuple, and Dict, which are already covered in MLUtils.jl.

@darsnack (Member)

darsnack commented Jun 10, 2022

I share Lorenz's view here about waiting to see what people say. Also, keeping it in means we can release now, so I suggest doing that and moving this discussion to the next release cycle. The long load times and DataLoader as a type are already released, right?

@lorenzoh (Contributor, Author)

Yes, the load time is already released.

@CarloLucibello (Member)

Ok, let's release then.

@ablaom

ablaom commented Nov 3, 2023

> As a further breakdown, I also suggest to move getobs and numobs and their implementations for the Base types to a MLCore.jl repo.

Yes, please.
