Merge pull request #949 from alan-turing-institute/dev
For a 0.18.3 release
ablaom committed Jun 17, 2022
2 parents 5cf5511 + 946bed4 commit 9fc3fa8
Showing 11 changed files with 550 additions and 527 deletions.
31 changes: 17 additions & 14 deletions CONTRIBUTING.md
@@ -2,40 +2,43 @@

Contributions to MLJ are most welcome. Queries can be made through
issues or the Julia [slack
-channel](https://julialang.org/slack/), #MLJ.
+channel](https://julialang.org/slack/), #mlj.

- [Road map](ROADMAP.md)

- [Code organization](ORGANIZATION.md)

-- Issues: Currently issues are split between [MLJ issues](https://github.com/alan-turing-institute/MLJ.jl/issues) and issues in all other repositories, collected in [this GitHub Project](https://github.com/orgs/JuliaAI/projects/1).
+- Issues: Currently issues are split between [MLJ
+  issues](https://github.com/alan-turing-institute/MLJ.jl/issues) and
+  issues in all other repositories, collected in [this GitHub
+  Project](https://github.com/orgs/JuliaAI/projects/1).


### Conventions

-We follow
+Most larger MLJ repositories follow
[this](https://nvie.com/posts/a-successful-git-branching-model/) git
-work-flow and, in particular, ask that **all pull requests be made to
-the`dev` branch** of the appropriate repo, and not to `master`. This
-includes changes to documentation. All pull requests onto `master`
-come from `dev` and generally precede a tagged release.
+work-flow. In all cases please make **all pull requests to the default
+branch** of the appropriate repo (branch appearing on the repo's
+landing page). This is `dev` for larger repos, and `master`
+otherwise. This includes changes to documentation.

Contributors are kindly requested to adhere to the
[Blue](https://github.com/invenia/BlueStyle) style guide, with line
-widths capped at 80 characters.
+widths capped at 92 characters.


### Very brief design overview

MLJ has a basement level *model* interface, which must be implemented
for each new learning algorithm. Formally, each model is a `mutable
struct` storing hyperparameters and the implementer defines
-model-dispatched `fit` and `predict` methods; for details, see
-[here](docs/src/adding_models_for_general_use.md). The general user
-interacts using *machines* which bind models with data and have an
-internal state reflecting the outcomes of applying `fit!` and
-`predict` methods on them. The model interface is pure "functional";
-the machine interface more "object-oriented".
+model-dispatched `fit` and `predict`/`transform` methods; for details,
+see [here](docs/src/adding_models_for_general_use.md). The general
+user interacts using *machines* which bind models with data and have
+an internal state reflecting the outcomes of applying `fit!` and
+`predict`/`transform` methods on them. The model interface is pure
+"functional"; the machine interface more "object-oriented".

A generalization of machine, called a *nodal* machine, is a key
element of *learning networks* which combine several models together,
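To make the overview above concrete, here is a minimal, hypothetical sketch of the two interfaces. The type name `MyRegressor`, its `lambda` hyperparameter, and the toy fit/predict logic are invented for illustration only; the linked document is the authoritative reference.

```julia
import MLJModelInterface as MMI

# Toy deterministic regressor: predicts a single "shrunk mean" of the target.
mutable struct MyRegressor <: MMI.Deterministic
    lambda::Float64   # the model's only hyperparameter
end

# Model-dispatched `fit` returns (fitresult, cache, report):
function MMI.fit(model::MyRegressor, verbosity, X, y)
    fitresult = sum(y) / (length(y) + model.lambda)   # the "learned" parameter
    return fitresult, nothing, NamedTuple()
end

# Model-dispatched `predict` uses only the fitresult and the new input data:
MMI.predict(model::MyRegressor, fitresult, Xnew) = fill(fitresult, MMI.nrows(Xnew))
```

With MLJ loaded, a user would then interact through the machine interface, along the lines of `mach = machine(MyRegressor(0.1), X, y); fit!(mach); predict(mach, Xnew)`.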
4 changes: 4 additions & 0 deletions ORGANIZATION.md
@@ -100,3 +100,7 @@ its conventional use, are marked with a ⟂ symbol:
[DataScienceTutorials](https://github.com/alan-turing-institute/DataScienceTutorials.jl)
collects tutorials on how to use MLJ, which are deployed
[here](https://alan-turing-institute.github.io/DataScienceTutorials.jl/)

+* [MLJTestIntegration](https://github.com/JuliaAI/MLJTestIntegration.jl)
+  provides tests for implementations of the MLJ model interface, and
+  integration tests for the entire MLJ ecosystem
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "MLJ"
uuid = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
authors = ["Anthony D. Blaom <anthony.blaom@gmail.com>"]
-version = "0.18.2"
+version = "0.18.3"

[deps]
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
62 changes: 41 additions & 21 deletions docs/src/composing_models.md
@@ -3,23 +3,20 @@
Three common ways of combining multiple models together
have out-of-the-box implementations in MLJ:

-- [Linear Pipelines](@ref) - for unbranching chains that take the
+- [Linear Pipelines](@ref) (`Pipeline`) - for unbranching chains that take the
output of one model (e.g., dimension reduction, such as `PCA`) and
make it the input of the next model in the chain (e.g., a
classification model, such as `EvoTreeClassifier`). To include
transformations of the target variable in a supervised pipeline
model, see [Target Transformations](@ref).

-- [Homogeneous Ensembles](@ref) - for blending the predictions of
-  multiple supervised models all of the same type, but which receive different
-  views of the training data to reduce overall variance. The technique
-  is known as observation
-  [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating). Bagging
-  decision trees, like a `DecisionTreeClassifier`, gives what is known
-  as a *random forest*, although MLJ also provides several canned
-  random forest models.
+- [Homogeneous Ensembles](@ref) (`EnsembleModel`) - for blending the
+  predictions of multiple supervised models all of the same type, but
+  which receive different views of the training data to reduce overall
+  variance. The technique implemented here is known as observation
+  [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating).

-- [Model Stacking](@ref) - for combining the predictions of a smaller
+- [Model Stacking](@ref) (`Stack`) - for combining the predictions of a smaller
number of models of possibly *different* type, with the help of an
adjudicating model.
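
A hedged sketch of the first option in the list above. The component models, the synthetic dataset, and the assumption that MLJLinearModels is installed are illustrative choices, not part of the original text:

```julia
using MLJ

# Chain a standardizer into a ridge regressor, as a single composite model:
Ridge = @load RidgeRegressor pkg=MLJLinearModels verbosity=0
pipe = Pipeline(Standardizer(), Ridge(lambda=0.1))

X, y = make_regression(100, 3)   # synthetic toy regression data
mach = machine(pipe, X, y)
fit!(mach)
yhat = predict(mach, X)
```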

@@ -83,7 +80,7 @@ use and test the learning network as it is defined, which is also a
good way to understand how learning networks work under the hood. This
data, if specified, is ignored in the export process, for the exported
composite model, like any other model, is not associated with any data
-until wrapped in a machine.
+until bound to data in a machine.

In MLJ, learning networks treat the flow of information during training
and prediction/transforming separately.
@@ -130,7 +127,8 @@ x3 = rand(300)
y = exp.(x1 - x2 -2x3 + 0.1*rand(300))
X = DataFrames.DataFrame(x1=x1, x2=x2, x3=x3)
-train, test = partition(eachindex(y), 0.8); # hide
+train, test = partition(eachindex(y), 0.8);
+nothing # hide
```
Step one is to wrap the data in *source nodes*:

@@ -140,11 +138,10 @@ ys = source(y)
```

*Note.* One can omit the specification of data at the source nodes (by
-writing instead `Xs = source()` and `ys = source()`) and
-still export the resulting network as a stand-alone model using the
-@from_network macro described later; see the example under [Static
-operations on nodes](@ref). However, one will be unable to fit
-or call network nodes, as illustrated below.
+writing instead `Xs = source()` and `ys = source()`) and still export
+the resulting network as a stand-alone model, as discussed later; see
+the example under [Static operations on nodes](@ref). However, one
+will be unable to `fit!` or call network nodes, as illustrated below.

The contents of a source node can be recovered by simply calling the
node with no arguments:
@@ -227,6 +224,15 @@ rms(y[test], yhat(rows=test))
> **Notable feature.** The machine, `ridge::Machine{RidgeRegressor}`, is retrained, because its underlying model has been mutated. However, since the outcome of this training has no effect on the training inputs of the machines `stand` and `box`, these transformers are left untouched. (During construction, each node and machine in a learning network determines and records all machines on which it depends.) This behavior, which extends to exported learning networks, means we can tune our wrapped regressor (using a holdout set) without re-computing transformations each time a `ridge_model` hyperparameter is changed.
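The following hedged sketch illustrates the behaviour just described, re-using names from the worked example on this page (`ridge_model`, the prediction node `yhat`, and the training rows `train`); the regressor's hyperparameter is assumed to be called `lambda`, as in the full example:

```julia
# Mutate a hyperparameter of the wrapped regressor, then refit the node:
ridge_model.lambda = 0.01
fit!(yhat, rows=train)   # only the `ridge` machine retrains; `box` and `stand` are untouched
```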

+#### Multithreaded training
+
+A more complicated learning network (e.g., some inhomogeneous ensemble
+of supervised models) may contain machines that can be trained in
+parallel. In that case, a call to a node `N`, such as `fit!(N,
+acceleration=CPUThreads())`, will parallelize the training using
+multithreading.
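For instance (a hedged sketch, re-using the `yhat` node and the `train` rows from the example above):

```julia
# Train all machines in the sub-network delivering `yhat`, using threads for
# machines that do not depend on one another:
fit!(yhat, rows=train, acceleration=CPUThreads())
```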


### Learning network machines

As we show shortly, a learning network needs to be "exported" to create a
@@ -291,6 +297,17 @@ model). See [Exposing internal state of a learning network](@ref) for
this advanced feature.


+#### Learning network machines with multithreading
+
+To indicate that a learning network machine should be trained using
+multithreading (see above for the node case) add the
+`acceleration=CPUThreads()` keyword argument to the machine
+constructor, as in
+
+```julia
+machine(Deterministic(), Xs, ys; predict=yhat, acceleration=CPUThreads())
+```
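Such a machine is then trained and queried like any other; a hedged sketch, re-using `Xs`, `ys`, `yhat` and the `test` rows from the network above:

```julia
mach = machine(Deterministic(), Xs, ys; predict=yhat, acceleration=CPUThreads())
fit!(mach)                   # trains every machine in the underlying network
predict(mach, X[test, :])    # predictions for new input data
```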

## Exporting a learning network as a stand-alone model

Having satisfied ourselves that our learning network works on the synthetic
@@ -356,17 +373,20 @@ WrappedRegressor(regressor = KNNRegressor(K = 7,
weights = :uniform,),) @ 2…63
```

+!!! warning "Limitations of `@from_network`"
+
+    All the objects defined in an `@from_network` call need to be in the
+    global scope of the module from which it is called. A more robust method for
+    exporting learning networks is described under "Method II" below.
+
-### Method II: Finer control (advanced)
-
-This section describes an advanced feature that can be skipped on a
-first reading.
+### Method II: Finer control

In Method I above, only models appearing in the network will appear as
hyperparameters of the exported composite model. There is a second
more flexible method for exporting the network, which allows finer
control over the exported `Model` struct, and which also avoids
-macros. The two steps required are:
+limitations of using a macro. The two steps required are:

- Define a new `mutable struct` model type.

18 changes: 17 additions & 1 deletion docs/src/homogeneous_ensembles.md
@@ -2,7 +2,23 @@

Although an ensemble of models sharing a common set of hyperparameters
can be defined using the learning network API, MLJ's `EnsembleModel`
-model wrapper is preferred, for convenience and best performance.
+model wrapper is preferred, for convenience and best
+performance. Examples of using `EnsembleModel` are given in [this Data
+Science
+Tutorial](https://juliaai.github.io/DataScienceTutorials.jl/getting-started/ensembles/).

+When bagging decision trees, further randomness is normally introduced
+by subsampling *features*, when training each node of each tree ([Ho
+(1995)](https://web.archive.org/web/20160417030218/http://ect.bell-labs.com/who/tkh/publications/papers/odt.pdf),
+[Breiman and Cutler
+(2001)](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)). A
+bagged ensemble of such trees is known as a [Random
+Forest](https://en.wikipedia.org/wiki/Random_forest). You can see an
+example of using `EnsembleModel` to build a random forest in [this
+Data Science
+Tutorial](https://juliaai.github.io/DataScienceTutorials.jl/getting-started/ensembles-2/). However,
+you may also want to use a canned random forest model. Run
+`models("RandomForest")` to list such models.
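A hedged sketch of the `EnsembleModel` route just described. It assumes MLJDecisionTreeInterface is installed, uses a built-in toy dataset, and takes illustrative hyperparameter values; keyword names are as in recent MLJEnsembles releases:

```julia
using MLJ

Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

# A bagged ensemble of trees with per-node feature subsampling, i.e. a
# random-forest-style model:
forest = EnsembleModel(
    model = Tree(n_subfeatures=3),   # feature subsampling, per Ho (1995)
    n = 100,                         # number of trees
    bagging_fraction = 0.7,          # observation subsampling (bagging)
)

X, y = @load_iris                    # built-in toy classification dataset
mach = machine(forest, X, y)
fit!(mach)
```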

```@docs
MLJEnsembles.EnsembleModel
