Merge pull request #949 from alan-turing-institute/dev
For a 0.18.3 release
ablaom committed Jun 17, 2022
2 parents 5cf5511 + 946bed4 commit 9fc3fa8
Showing 11 changed files with 550 additions and 527 deletions.
31 changes: 17 additions & 14 deletions CONTRIBUTING.md
@@ -2,40 +2,43 @@

Contributions to MLJ are most welcome. Queries can be made through
issues or the Julia [slack
-channel](https://julialang.org/slack/), #MLJ.
+channel](https://julialang.org/slack/), #mlj.

- [Road map](ROADMAP.md)

- [Code organization](ORGANIZATION.md)

-- Issues: Currently issues are split between [MLJ issues](https://github.com/alan-turing-institute/MLJ.jl/issues) and issues in all other repositories, collected in [this GitHub Project](https://github.com/orgs/JuliaAI/projects/1).
+- Issues: Currently issues are split between [MLJ
+  issues](https://github.com/alan-turing-institute/MLJ.jl/issues) and
+  issues in all other repositories, collected in [this GitHub
+  Project](https://github.com/orgs/JuliaAI/projects/1).


### Conventions

-We follow
+Most larger MLJ repositories follow
[this](https://nvie.com/posts/a-successful-git-branching-model/) git
-work-flow and, in particular, ask that **all pull requests be made to
-the`dev` branch** of the appropriate repo, and not to `master`. This
-includes changes to documentation. All pull requests onto `master`
-come from `dev` and generally precede a tagged release.
+work-flow. In all cases please make **all pull requests to the default
+branch** of the appropriate repo (branch appearing on the repo's
+landing page). This is `dev` for larger repos, and `master`
+otherwise. This includes changes to documentation.

Contributors are kindly requested to adhere to the
[Blue](https://github.com/invenia/BlueStyle) style guide, with line
-widths capped at 80 characters.
+widths capped at 92 characters.


### Very brief design overview

MLJ has a basement level *model* interface, which must be implemented
for each new learning algorithm. Formally, each model is a `mutable
struct` storing hyperparameters and the implementer defines
-model-dispatched `fit` and `predict` methods; for details, see
-[here](docs/src/adding_models_for_general_use.md). The general user
-interacts using *machines* which bind models with data and have an
-internal state reflecting the outcomes of applying `fit!` and
-`predict` methods on them. The model interface is pure "functional";
-the machine interface more "object-oriented".
+model-dispatched `fit` and `predict`/`transform` methods; for details,
+see [here](docs/src/adding_models_for_general_use.md). The general
+user interacts using *machines* which bind models with data and have
+an internal state reflecting the outcomes of applying `fit!` and
+`predict`/`transform` methods on them. The model interface is pure
+"functional"; the machine interface more "object-oriented".

A generalization of machine, called a *nodal* machine, is a key
element of *learning networks* which combine several models together,
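To make the overview above concrete, here is a minimal, hypothetical sketch of the two interfaces. The type name `MyRegressor`, its `lambda` hyperparameter, and the toy fit/predict logic are invented for illustration only; the linked document is the authoritative reference.

```julia
import MLJModelInterface as MMI

# Toy deterministic regressor: predicts a single "shrunk mean" of the target.
mutable struct MyRegressor <: MMI.Deterministic
    lambda::Float64   # the model's only hyperparameter
end

# Model-dispatched `fit` returns (fitresult, cache, report):
function MMI.fit(model::MyRegressor, verbosity, X, y)
    fitresult = sum(y) / (length(y) + model.lambda)   # the "learned" parameter
    return fitresult, nothing, NamedTuple()
end

# Model-dispatched `predict` uses only the fitresult and the new input data:
MMI.predict(model::MyRegressor, fitresult, Xnew) = fill(fitresult, MMI.nrows(Xnew))
```

With MLJ loaded, a user would then interact through the machine interface, along the lines of `mach = machine(MyRegressor(0.1), X, y); fit!(mach); predict(mach, Xnew)`.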
4 changes: 4 additions & 0 deletions ORGANIZATION.md
@@ -100,3 +100,7 @@ its conventional use, are marked with a ⟂ symbol:
[DataScienceTutorials](https://github.com/alan-turing-institute/DataScienceTutorials.jl)
collects tutorials on how to use MLJ, which are deployed
[here](https://alan-turing-institute.github.io/DataScienceTutorials.jl/)

+* [MLJTestIntegration](https://github.com/JuliaAI/MLJTestIntegration.jl)
+  provides tests for implementations of the MLJ model interface, and
+  integration tests for the entire MLJ ecosystem
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "MLJ"
uuid = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
authors = ["Anthony D. Blaom <anthony.blaom@gmail.com>"]
-version = "0.18.2"
+version = "0.18.3"

[deps]
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
62 changes: 41 additions & 21 deletions docs/src/composing_models.md
@@ -3,23 +3,20 @@
Three common ways of combining multiple models together
have out-of-the-box implementations in MLJ:

-- [Linear Pipelines](@ref) - for unbranching chains that take the
+- [Linear Pipelines](@ref) (`Pipeline`) - for unbranching chains that take the
output of one model (e.g., dimension reduction, such as `PCA`) and
make it the input of the next model in the chain (e.g., a
classification model, such as `EvoTreeClassifier`). To include
transformations of the target variable in a supervised pipeline
model, see [Target Transformations](@ref).

-- [Homogeneous Ensembles](@ref) - for blending the predictions of
-  multiple supervised models all of the same type, but which receive different
-  views of the training data to reduce overall variance. The technique
-  is known as observation
-  [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating). Bagging
-  decision trees, like a `DecisionTreeClassifier`, gives what is known
-  as a *random forest*, although MLJ also provides several canned
-  random forest models.
+- [Homogeneous Ensembles](@ref) (`EnsembleModel`) - for blending the
+  predictions of multiple supervised models all of the same type, but
+  which receive different views of the training data to reduce overall
+  variance. The technique implemented here is known as observation
+  [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating).

-- [Model Stacking](@ref) - for combining the predictions of a smaller
+- [Model Stacking](@ref) (`Stack`) - for combining the predictions of a smaller
number of models of possibly *different* type, with the help of an
adjudicating model.
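
A hedged sketch of the first option in the list above. The component models, the synthetic dataset, and the assumption that MLJLinearModels is installed are illustrative choices, not part of the original text:

```julia
using MLJ

# Chain a standardizer into a ridge regressor, as a single composite model:
Ridge = @load RidgeRegressor pkg=MLJLinearModels verbosity=0
pipe = Pipeline(Standardizer(), Ridge(lambda=0.1))

X, y = make_regression(100, 3)   # synthetic toy regression data
mach = machine(pipe, X, y)
fit!(mach)
yhat = predict(mach, X)
```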

@@ -83,7 +80,7 @@ use and test the learning network as it is defined, which is also a
good way to understand how learning networks work under the hood. This
data, if specified, is ignored in the export process, for the exported
composite model, like any other model, is not associated with any data
-until wrapped in a machine.
+until bound to data in a machine.

In MLJ, learning networks treat the flow of information during training
and prediction/transforming separately.
@@ -130,7 +127,8 @@ x3 = rand(300)
y = exp.(x1 - x2 -2x3 + 0.1*rand(300))
X = DataFrames.DataFrame(x1=x1, x2=x2, x3=x3)
-train, test = partition(eachindex(y), 0.8); # hide
+train, test = partition(eachindex(y), 0.8);
+nothing # hide
```
Step one is to wrap the data in *source nodes*:

@@ -140,11 +138,10 @@ ys = source(y)
```

*Note.* One can omit the specification of data at the source nodes (by
-writing instead `Xs = source()` and `ys = source()`) and
-still export the resulting network as a stand-alone model using the
-@from_network macro described later; see the example under [Static
-operations on nodes](@ref). However, one will be unable to fit
-or call network nodes, as illustrated below.
+writing instead `Xs = source()` and `ys = source()`) and still export
+the resulting network as a stand-alone model, as discussed later; see
+the example under [Static operations on nodes](@ref). However, one
+will be unable to `fit!` or call network nodes, as illustrated below.

The contents of a source node can be recovered by simply calling the
node with no arguments:
@@ -227,6 +224,15 @@ rms(y[test], yhat(rows=test))
> **Notable feature.** The machine, `ridge::Machine{RidgeRegressor}`, is retrained, because its underlying model has been mutated. However, since the outcome of this training has no effect on the training inputs of the machines `stand` and `box`, these transformers are left untouched. (During construction, each node and machine in a learning network determines and records all machines on which it depends.) This behavior, which extends to exported learning networks, means we can tune our wrapped regressor (using a holdout set) without re-computing transformations each time a `ridge_model` hyperparameter is changed.
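The following hedged sketch illustrates the behaviour just described, re-using names from the worked example on this page (`ridge_model`, the prediction node `yhat`, and the training rows `train`); the regressor's hyperparameter is assumed to be called `lambda`, as in the full example:

```julia
# Mutate a hyperparameter of the wrapped regressor, then refit the node:
ridge_model.lambda = 0.01
fit!(yhat, rows=train)   # only the `ridge` machine retrains; `box` and `stand` are untouched
```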

+#### Multithreaded training
+
+A more complicated learning network (e.g., some inhomogeneous ensemble
+of supervised models) may contain machines that can be trained in
+parallel. In that case, a call to a node `N`, such as `fit!(N,
+acceleration=CPUThreads())`, will parallelize the training using
+multithreading.
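For instance (a hedged sketch, re-using the `yhat` node and the `train` rows from the example above):

```julia
# Train all machines in the sub-network delivering `yhat`, using threads for
# machines that do not depend on one another:
fit!(yhat, rows=train, acceleration=CPUThreads())
```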


### Learning network machines

As we show shortly, a learning network needs to be "exported" to create a
@@ -291,6 +297,17 @@ model). See [Exposing internal state of a learning network](@ref) for
this advanced feature.


+#### Learning network machines with multithreading
+
+To indicate that a learning network machine should be trained using
+multithreading (see above for the node case) add the
+`acceleration=CPUThreads()` keyword argument to the machine
+constructor, as in
+
+```julia
+machine(Deterministic(), Xs, ys; predict=yhat, acceleration=CPUThreads())
+```
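Such a machine is then trained and queried like any other; a hedged sketch, re-using `Xs`, `ys`, `yhat` and the `test` rows from the network above:

```julia
mach = machine(Deterministic(), Xs, ys; predict=yhat, acceleration=CPUThreads())
fit!(mach)                   # trains every machine in the underlying network
predict(mach, X[test, :])    # predictions for new input data
```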

## Exporting a learning network as a stand-alone model

Having satisfied ourselves that our learning network works on the synthetic
@@ -356,17 +373,20 @@ WrappedRegressor(regressor = KNNRegressor(K = 7,
weights = :uniform,),) @ 2…63
```

+!!! warning "Limitations of `@from_network`"
+
+    All the objects defined in an `@from_network` call need to be in the
+    global scope of the module from which it is called. A more robust method for
+    exporting learning networks is described under "Method II" below.
+
-### Method II: Finer control (advanced)
-
-This section describes an advanced feature that can be skipped on a
-first reading.
+### Method II: Finer control

In Method I above, only models appearing in the network will appear as
hyperparameters of the exported composite model. There is a second
more flexible method for exporting the network, which allows finer
control over the exported `Model` struct, and which also avoids
-macros. The two steps required are:
+limitations of using a macro. The two steps required are:

- Define a new `mutable struct` model type.

18 changes: 17 additions & 1 deletion docs/src/homogeneous_ensembles.md
@@ -2,7 +2,23 @@

Although an ensemble of models sharing a common set of hyperparameters
can be defined using the learning network API, MLJ's `EnsembleModel`
-model wrapper is preferred, for convenience and best performance.
+model wrapper is preferred, for convenience and best
+performance. Examples of using `EnsembleModel` are given in [this Data
+Science
+Tutorial](https://juliaai.github.io/DataScienceTutorials.jl/getting-started/ensembles/).

+When bagging decision trees, further randomness is normally introduced
+by subsampling *features*, when training each node of each tree ([Ho
+(1995)](https://web.archive.org/web/20160417030218/http://ect.bell-labs.com/who/tkh/publications/papers/odt.pdf),
+[Breiman and Cutler
+(2001)](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)). A
+bagged ensemble of such trees is known as a [Random
+Forest](https://en.wikipedia.org/wiki/Random_forest). You can see an
+example of using `EnsembleModel` to build a random forest in [this
+Data Science
+Tutorial](https://juliaai.github.io/DataScienceTutorials.jl/getting-started/ensembles-2/). However,
+you may also want to use a canned random forest model. Run
+`models("RandomForest")` to list such models.
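A hedged sketch of the `EnsembleModel` route just described. It assumes MLJDecisionTreeInterface is installed, uses a built-in toy dataset, and takes illustrative hyperparameter values; keyword names are as in recent MLJEnsembles releases:

```julia
using MLJ

Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

# A bagged ensemble of trees with per-node feature subsampling, i.e. a
# random-forest-style model:
forest = EnsembleModel(
    model = Tree(n_subfeatures=3),   # feature subsampling, per Ho (1995)
    n = 100,                         # number of trees
    bagging_fraction = 0.7,          # observation subsampling (bagging)
)

X, y = @load_iris                    # built-in toy classification dataset
mach = machine(forest, X, y)
fit!(mach)
```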

```@docs
MLJEnsembles.EnsembleModel
