
Merge pull request #977 from alan-turing-institute/dev
For a 0.18.6 release
ablaom committed Oct 26, 2022
2 parents 212dcf6 + 681432b commit 2f7bca6
Showing 9 changed files with 81 additions and 43 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "MLJ"
uuid = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
authors = ["Anthony D. Blaom <anthony.blaom@gmail.com>"]
version = "0.18.5"
version = "0.18.6"

[deps]
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
5 changes: 3 additions & 2 deletions docs/src/about_mlj.md
@@ -143,8 +143,9 @@ Extract:

## Key features

* Data agnostic, train models on any data supported by the
[Tables.jl](https://github.com/JuliaData/Tables.jl) interface.
* Data agnostic, train most models on any data `X` supported by the
[Tables.jl](https://github.com/JuliaData/Tables.jl) interface (needs `Tables.istable(X)
== true`).

* Extensive, state-of-the-art, support for model composition
(*pipelines*, *stacks* and, more generally, *learning networks*). See more
41 changes: 36 additions & 5 deletions docs/src/adding_models_for_general_use.md
@@ -697,6 +697,12 @@ convention, elements of `y` have type `CategoricalValue`, and *not*
[BinaryClassifier](https://github.com/JuliaAI/MLJModels.jl/blob/master/src/GLM.jl)
for an example.

#### Report items returned by predict

A `predict` method, or other operation such as `transform`, can contribute to the report
accessible in any machine associated with a model. See [Reporting byproducts of a
static transformation](@ref) below for details.


### The predict_joint method

@@ -1191,11 +1197,12 @@ Your document string must include the following components, in order:
Unsupervised models implement the MLJ model interface in a very
similar fashion. The main differences are:

- The `fit` method has only one training argument `X`, as in
`MLJModelInterface.fit(model, verbosity, X)`. However, it has
the same return value `(fitresult, cache, report)`. An `update`
method (e.g., for iterative models) can be optionally implemented in
the same way.
- The `fit` method has only one training argument `X`, as in `MLJModelInterface.fit(model,
  verbosity, X)`. However, it has the same return value `(fitresult, cache, report)`. An
  `update` method (e.g., for iterative models) can optionally be implemented in the same
  way. For models that subtype `Static <: Unsupervised` (see also [Static
  transformers](@ref)), `fit` has no training arguments and need not be implemented, as a
  fallback returns `(nothing, nothing, nothing)`.

- A `transform` method is compulsory and has the same signature as
`predict`, as in `MLJModelInterface.transform(model, fitresult, Xnew)`.
@@ -1220,6 +1227,30 @@ similar fashion. The main differences are:
input features into a space of lower dimension. See [Transformers
that also predict](@ref) for an example.

## Static models (models that do not generalize)

See [Static transformers](@ref) for basic implementation of models that do not generalize
to new data but do have hyperparameters.

### Reporting byproducts of a static transformation

As a static transformer does not implement `fit`, the usual mechanism for creating a
`report` is not available. Instead, byproducts of the computation performed by `transform`
can be returned by `transform` itself, in the form of a pair `(output, report)` instead of
just `output`. Here `report` should be a named tuple. In fact, any operation (e.g.,
`predict`) can do this, for any model type. However, this exceptional behavior must be
flagged with an appropriate trait declaration, as in

```julia
MLJModelInterface.reporting_operations(::Type{<:SomeModelType}) = (:transform,)
```

If `mach` is a machine wrapping a model of this kind, then `report(mach)` will include
the `report` item from `transform`'s output. For sample implementations, see [this
issue](https://github.com/JuliaAI/MLJBase.jl/pull/806) or the code for [DBSCAN
clustering](https://github.com/jbrea/MLJClusteringInterface.jl/blob/41d3c2195ad33f1840596c9762a3a67b9a124c6a/src/MLJClusteringInterface.jl#L125).
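To sketch the mechanism end-to-end (the model type `Thresholder` and its hyperparameter are hypothetical, not drawn from the linked implementations):

```julia
import MLJModelInterface as MMI

# Hypothetical static transformer with one hyperparameter:
mutable struct Thresholder <: MMI.Static
    threshold::Float64
end

# No `fit` is needed (the fallback supplies `fitresult = nothing`).
# `transform` returns an (output, report) pair, with `report` a named tuple:
function MMI.transform(model::Thresholder, ::Nothing, v)
    output = [x > model.threshold for x in v]
    report = (n_above = count(output),)
    return output, report
end

# Flag the exceptional return value with the trait declaration:
MMI.reporting_operations(::Type{<:Thresholder}) = (:transform,)
```

With this declaration in place, machines wrapping a `Thresholder` expose `n_above` in their report after each `transform` call.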


## Outlier detection models

!!! warning "Experimental API"
2 changes: 1 addition & 1 deletion docs/src/common_mlj_workflows.md
@@ -77,7 +77,7 @@ Loading a built-in data set already split into `X` and `y`:

```@example workflows
X, y = @load_iris;
selectrows(X, 1:4) # selectrows works for any Tables.jl table
selectrows(X, 1:4) # selectrows works whenever `Tables.istable(X) == true`.
```

```@example workflows
33 changes: 15 additions & 18 deletions docs/src/getting_started.md
@@ -33,12 +33,11 @@ schema(iris)
```

Because this data format is compatible with
[Tables.jl](https://tables.juliadata.org/stable/), many MLJ methods
(such as `selectrows`, `pretty` and `schema` used above) as well as
many MLJ models can work with it. However, as most new users are
already familiar with the access methods particular to
[DataFrames](https://dataframes.juliadata.org/stable/) (also
compatible with Tables.jl) we'll put our data into that format here:
[Tables.jl](https://tables.juliadata.org/stable/) (and satisfies `Tables.istable(iris) ==
true`), many MLJ methods (such as `selectrows`, `pretty` and `schema` used above) as well
as many MLJ models can work with it. However, as most new users are already familiar with
the access methods particular to [DataFrames](https://dataframes.juliadata.org/stable/)
(also compatible with Tables.jl) we'll put our data into that format here:

```@example doda
import DataFrames
@@ -334,14 +333,12 @@ scitype(X)

### Two-dimensional data

Generally, two-dimensional data in MLJ is expected to be *tabular*.
All data containers compatible with the
[Tables.jl](https://github.com/JuliaData/Tables.jl) interface (which
includes all source formats listed
[here](https://github.com/JuliaData/Tables.jl/blob/master/INTEGRATIONS.md))
have the scientific type `Table{K}`, where `K` depends on the
scientific types of the columns, which can be individually inspected
using `schema`:
Generally, two-dimensional data in MLJ is expected to be *tabular*. All data containers
`X` compatible with the [Tables.jl](https://github.com/JuliaData/Tables.jl) interface and
satisfying `Tables.istable(X) == true` (most of the formats in [this
list](https://github.com/JuliaData/Tables.jl/blob/master/INTEGRATIONS.md)) have the
scientific type `Table{K}`, where `K` depends on the scientific types of the columns,
which can be individually inspected using `schema`:

```@repl doda
schema(X)
@@ -385,10 +382,10 @@ resampling is always more efficient in this case.

### Inputs

Since an MLJ model only specifies the scientific type of data, if that
type is `Table` - which is the case for the majority of MLJ models -
then any [Tables.jl](https://github.com/JuliaData/Tables.jl) format is
permitted.
Since an MLJ model only specifies the scientific type of data, if that type is `Table` -
which is the case for the majority of MLJ models - then any
[Tables.jl](https://github.com/JuliaData/Tables.jl) container `X` is permitted, so long as
`Tables.istable(X) == true`.

Specifically, the requirement for an arbitrary model's input is `scitype(X)
<: input_scitype(model)`.
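For instance (a sketch assuming MLJ is installed; `ConstantClassifier` is chosen here only because, as a built-in model, it accepts any table input):

```julia
using MLJ
import Tables

# A named tuple of equal-length vectors is a valid Tables.jl table:
X = (x1 = [1.0, 2.0], x2 = [3.0, 4.0])
Tables.istable(X)  # true

# Admissibility check for a model:
model = ConstantClassifier()
scitype(X) <: input_scitype(model)
```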
5 changes: 3 additions & 2 deletions docs/src/list_of_supported_models.md
@@ -30,13 +30,14 @@ independent assessment.
[BetaML.jl](https://github.com/sylvaticus/BetaML.jl) | - | BetaMLGMMImputer, BetaMLGMMRegressor, BetaMLGenericImputer, BetaMLMeanImputer, BetaMLRFImputer, DecisionTreeClassifier, DecisionTreeRegressor, GMMClusterer, KMeans, KMedoids, KernelPerceptronClassifier, MissingImputator, PegasosClassifier, PerceptronClassifier, RandomForestClassifier, RandomForestRegressor | medium |
[Clustering.jl](https://github.com/JuliaStats/Clustering.jl) | [MLJClusteringInterface.jl](https://github.com/JuliaAI/MLJClusteringInterface.jl) | KMeans, KMedoids | high | †
[DecisionTree.jl](https://github.com/bensadeghi/DecisionTree.jl) | [MLJDecisionTreeInterface.jl](https://github.com/JuliaAI/MLJDecisionTreeInterface.jl) | DecisionTreeClassifier, DecisionTreeRegressor, AdaBoostStumpClassifier, RandomForestClassifier, RandomForestRegressor | high |
[EvoTrees.jl](https://github.com/Evovest/EvoTrees.jl) | - | EvoTreeRegressor, EvoTreeClassifier, EvoTreeCount, EvoTreeGaussian | medium | gradient boosting models
[EvoTrees.jl](https://github.com/Evovest/EvoTrees.jl) | - | EvoTreeRegressor, EvoTreeClassifier, EvoTreeCount, EvoTreeGaussian | medium | tree-based gradient boosting models
[EvoLinear.jl](https://github.com/jeremiedb/EvoLinear.jl) | - | EvoLinearRegressor | medium | linear boosting models
[GLM.jl](https://github.com/JuliaStats/GLM.jl) | [MLJGLMInterface.jl](https://github.com/JuliaAI/MLJGLMInterface.jl) | LinearRegressor, LinearBinaryClassifier, LinearCountRegressor | medium | †
[LIBSVM.jl](https://github.com/mpastell/LIBSVM.jl) | [MLJLIBSVMInterface.jl](https://github.com/JuliaAI/MLJLIBSVMInterface.jl) | LinearSVC, SVC, NuSVC, NuSVR, EpsilonSVR, OneClassSVM | high | also via ScikitLearn.jl
[LightGBM.jl](https://github.com/IQVIA-ML/LightGBM.jl) | - | LGBMClassifier, LGBMRegressor | high |
[Flux.jl](https://github.com/FluxML/Flux.jl) | [MLJFlux.jl](https://github.com/FluxML/MLJFlux.jl) | NeuralNetworkRegressor, NeuralNetworkClassifier, MultitargetNeuralNetworkRegressor, ImageClassifier | low |
[MLJLinearModels.jl](https://github.com/JuliaAI/MLJLinearModels.jl) | - | LinearRegressor, RidgeRegressor, LassoRegressor, ElasticNetRegressor, QuantileRegressor, HuberRegressor, RobustRegressor, LADRegressor, LogisticClassifier, MultinomialClassifier | medium |
[MLJModels.jl](https://github.com/JuliaAI/MLJModels.jl) (built-in) | - | StaticTransformer, FeatureSelector, FillImputer, UnivariateStandardizer, Standardizer, UnivariateBoxCoxTransformer, OneHotEncoder, ContinuousEncoder, ConstantRegressor, ConstantClassifier, BinaryThreshholdPredictor | medium |
[MLJModels.jl](https://github.com/JuliaAI/MLJModels.jl) (built-in) | - | ConstantClassifier, ConstantRegressor, ContinuousEncoder, DeterministicConstantClassifier, DeterministicConstantRegressor, FeatureSelector, FillImputer, InteractionTransformer, OneHotEncoder, Standardizer, UnivariateBoxCoxTransformer, UnivariateDiscretizer, UnivariateFillImputer, UnivariateTimeTypeToContinuous, BinaryThresholdPredictor | medium |
[MLJText.jl](https://github.com/JuliaAI/MLJText.jl) | - | TfidfTransformer, BM25Transformer, CountTransformer | low |
[MultivariateStats.jl](https://github.com/JuliaStats/MultivariateStats.jl) | [MLJMultivariateStatsInterface.jl](https://github.com/JuliaAI/MLJMultivariateStatsInterface.jl) | LinearRegressor, MultitargetLinearRegressor, RidgeRegressor, MultitargetRidgeRegressor, PCA, KernelPCA, ICA, LDA, BayesianLDA, SubspaceLDA, BayesianSubspaceLDA, FactorAnalysis, PPCA | high |
[NaiveBayes.jl](https://github.com/dfdx/NaiveBayes.jl) | [MLJNaiveBayesInterface.jl](https://github.com/JuliaAI/MLJNaiveBayesInterface.jl) | GaussianNBClassifier, MultinomialNBClassifier, HybridNBClassifier | low |
9 changes: 4 additions & 5 deletions docs/src/quick_start_guide_to_adding_models.md
@@ -12,11 +12,10 @@ of how things work with MLJ. In particular, you are familiar with:

- what `Probabilistic`, `Deterministic` and `Unsupervised` models are

- the fact that MLJ generally works with tables rather than
matrices. Here a *table* is a container satisfying the
[Tables.jl](https://github.com/JuliaData/Tables.jl) API (e.g.,
DataFrame, JuliaDB table, CSV file, named tuple of equal-length
vectors)
- the fact that MLJ generally works with tables rather than matrices. Here a *table* is a
container `X` satisfying the [Tables.jl](https://github.com/JuliaData/Tables.jl) API and
satisfying `Tables.istable(X) == true` (e.g., DataFrame, JuliaDB table, CSV file, named
tuple of equal-length vectors)

- [CategoricalArrays.jl](https://github.com/JuliaData/CategoricalArrays.jl)
(if working with finite discrete data, e.g., doing classification)
19 changes: 13 additions & 6 deletions docs/src/transformers.md
@@ -39,12 +39,16 @@ MLJModels.UnivariateTimeTypeToContinuous

## Static transformers

The main use-case for static transformers is for insertion into
[Linear Pipelines](@ref) or other exported learning networks (see [Composing
Models](@ref)). If a static transformer has no hyper-parameters, it is
tantamount to an ordinary function. An ordinary function can be
inserted directly into a pipeline; the situation for learning
networks is only slightly more complicated; see [Static operations on nodes](@ref).
A *static transformer* is a model for transforming data that does not generalize to new
data (does not "learn") but which nevertheless has hyperparameters. For example, the
`DBSCAN` clustering model from Clustering.jl can assign labels to a collection of
observations, but cannot directly assign a label to a new observation.

The general user may define their own static models. The main use-case is the insertion of
some parameter-dependent transformation into [Linear Pipelines](@ref). (If a static transformer
has no hyper-parameters, it is tantamount to an ordinary function. An ordinary function
can be inserted directly into a pipeline; the situation for learning networks is only
slightly more complicated; see [Static operations on nodes](@ref).)

The following example defines a new model type `Averager` to perform
the weighted average of two vectors (target predictions, for
@@ -151,6 +155,9 @@ _.fitted_params_per_fold = [ … ]
_.report_per_fold = [ ]
```

A static transformer can also expose byproducts of the transform computation in the report
of any associated machine. See [Static models (models that do not generalize)](@ref) for
details.

## Transformers that also predict

8 changes: 5 additions & 3 deletions src/MLJ.jl
@@ -77,9 +77,10 @@ export scitype, scitype_union, elscitype, nonmissing, trait
export coerce, coerce!, autotype, schema, info

# re-export from MLJBase:
import MLJBase: serializable, restore!
export nrows, color_off, color_on,
selectrows, selectcols, restrict, corestrict, complement,
training_losses, feature_importances,
training_losses, feature_importances,
predict, predict_mean, predict_median, predict_mode, predict_joint,
transform, inverse_transform, evaluate, fitted_params, params,
@constant, @more, HANDLE_GIVEN_ID, UnivariateFinite,
@@ -96,7 +97,8 @@ export nrows, color_off, color_on,
default_resource, pretty,
make_blobs, make_moons, make_circles, make_regression,
fit_only!, return!, int, decoder,
default_scitype_check_level
default_scitype_check_level,
serializable, restore!

# MLJBase/composition/abstract_types.jl:
for T in vcat(MLJBase.MLJModelInterface.ABSTRACT_MODEL_SUBTYPES,
@@ -132,7 +134,7 @@ export models, localmodels, @load, @iload, load, info, doc,
Standardizer, UnivariateBoxCoxTransformer,
OneHotEncoder, ContinuousEncoder, UnivariateDiscretizer,
FillImputer, matching, BinaryThresholdPredictor,
UnivariateTimeTypeToContinuous
UnivariateTimeTypeToContinuous, InteractionTransformer

# re-export from MLJIteration:
export MLJIteration
