
Merge pull request #977 from alan-turing-institute/dev
For a 0.18.6 release
ablaom committed Oct 26, 2022
2 parents 212dcf6 + 681432b commit 2f7bca6
Showing 9 changed files with 81 additions and 43 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "MLJ"
uuid = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
authors = ["Anthony D. Blaom <anthony.blaom@gmail.com>"]
version = "0.18.5"
version = "0.18.6"

[deps]
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
5 changes: 3 additions & 2 deletions docs/src/about_mlj.md
@@ -143,8 +143,9 @@ Extract:

## Key features

* Data agnostic, train models on any data supported by the
[Tables.jl](https://github.com/JuliaData/Tables.jl) interface.
* Data agnostic, train most models on any data `X` supported by the
[Tables.jl](https://github.com/JuliaData/Tables.jl) interface (needs `Tables.istable(X)
== true`).

* Extensive, state-of-the-art, support for model composition
(*pipelines*, *stacks* and, more generally, *learning networks*). See more
41 changes: 36 additions & 5 deletions docs/src/adding_models_for_general_use.md
@@ -697,6 +697,12 @@ convention, elements of `y` have type `CategoricalValue`, and *not*
[BinaryClassifier](https://github.com/JuliaAI/MLJModels.jl/blob/master/src/GLM.jl)
for an example.

#### Report items returned by predict

A `predict` method, or other operation such as `transform`, can contribute to the report
accessible in any machine associated with a model. See [Reporting byproducts of a
static transformation](@ref) below for details.


### The predict_joint method

@@ -1191,11 +1197,12 @@ Your document string must include the following components, in order:
Unsupervised models implement the MLJ model interface in a very
similar fashion. The main differences are:

- The `fit` method has only one training argument `X`, as in
`MLJModelInterface.fit(model, verbosity, X)`. However, it has
the same return value `(fitresult, cache, report)`. An `update`
method (e.g., for iterative models) can be optionally implemented in
the same way.
- The `fit` method has only one training argument `X`, as in `MLJModelInterface.fit(model,
  verbosity, X)`. However, it has the same return value `(fitresult, cache, report)`. An
  `update` method (e.g., for iterative models) can optionally be implemented in the same
  way. For models that subtype `Static <: Unsupervised` (see also [Static
  transformers](@ref)), `fit` has no training arguments and need not be implemented, as a
  fallback returns `(nothing, nothing, nothing)`.

- A `transform` method is compulsory and has the same signature as
`predict`, as in `MLJModelInterface.transform(model, fitresult, Xnew)`.
@@ -1220,6 +1227,30 @@ similar fashion. The main differences are:
input features into a space of lower dimension. See [Transformers
that also predict](@ref) for an example.

## Static models (models that do not generalize)

See [Static transformers](@ref) for basic implementation of models that do not generalize
to new data but do have hyperparameters.

### Reporting byproducts of a static transformation

As a static transformer does not implement `fit`, the usual mechanism for creating a
`report` is not available. Instead, byproducts of the computation performed by `transform`
can be returned by `transform` itself, in the form of a pair `(output, report)` instead of
just `output`. Here `report` should be a named tuple. In fact, any operation (e.g.,
`predict`) can do this, for any model type. However, this exceptional behavior must be
flagged with an appropriate trait declaration, as in

```julia
MLJModelInterface.reporting_operations(::Type{<:SomeModelType}) = (:transform,)
```

If `mach` is a machine wrapping a model of this kind, then `report(mach)` will include
the `report` item from `transform`'s output. For sample implementations, see [this
issue](https://github.com/JuliaAI/MLJBase.jl/pull/806) or the code for [DBSCAN
clustering](https://github.com/jbrea/MLJClusteringInterface.jl/blob/41d3c2195ad33f1840596c9762a3a67b9a124c6a/src/MLJClusteringInterface.jl#L125).
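To sketch the mechanism end-to-end (the model type `Thresholder` and its hyperparameter are hypothetical, not drawn from the linked implementations):

```julia
import MLJModelInterface as MMI

# Hypothetical static transformer with one hyperparameter:
mutable struct Thresholder <: MMI.Static
    threshold::Float64
end

# No `fit` is needed (the fallback supplies `fitresult = nothing`).
# `transform` returns an (output, report) pair, with `report` a named tuple:
function MMI.transform(model::Thresholder, ::Nothing, v)
    output = [x > model.threshold for x in v]
    report = (n_above = count(output),)
    return output, report
end

# Flag the exceptional return value with the trait declaration:
MMI.reporting_operations(::Type{<:Thresholder}) = (:transform,)
```

With this declaration in place, machines wrapping a `Thresholder` expose `n_above` in their report after each `transform` call.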


## Outlier detection models

!!! warning "Experimental API"
2 changes: 1 addition & 1 deletion docs/src/common_mlj_workflows.md
@@ -77,7 +77,7 @@ Loading a built-in data set already split into `X` and `y`:

```@example workflows
X, y = @load_iris;
selectrows(X, 1:4) # selectrows works for any Tables.jl table
selectrows(X, 1:4) # selectrows works whenever `Tables.istable(X) == true`.
```

```@example workflows
33 changes: 15 additions & 18 deletions docs/src/getting_started.md
@@ -33,12 +33,11 @@ schema(iris)
```

Because this data format is compatible with
[Tables.jl](https://tables.juliadata.org/stable/), many MLJ methods
(such as `selectrows`, `pretty` and `schema` used above) as well as
many MLJ models can work with it. However, as most new users are
already familiar with the access methods particular to
[DataFrames](https://dataframes.juliadata.org/stable/) (also
compatible with Tables.jl) we'll put our data into that format here:
[Tables.jl](https://tables.juliadata.org/stable/) (and satisfies `Tables.istable(iris) ==
true`), many MLJ methods (such as `selectrows`, `pretty` and `schema` used above) as well
as many MLJ models can work with it. However, as most new users are already familiar with
the access methods particular to [DataFrames](https://dataframes.juliadata.org/stable/)
(also compatible with Tables.jl) we'll put our data into that format here:

```@example doda
import DataFrames
@@ -334,14 +333,12 @@ scitype(X)

### Two-dimensional data

Generally, two-dimensional data in MLJ is expected to be *tabular*.
All data containers compatible with the
[Tables.jl](https://github.com/JuliaData/Tables.jl) interface (which
includes all source formats listed
[here](https://github.com/JuliaData/Tables.jl/blob/master/INTEGRATIONS.md))
have the scientific type `Table{K}`, where `K` depends on the
scientific types of the columns, which can be individually inspected
using `schema`:
Generally, two-dimensional data in MLJ is expected to be *tabular*. All data containers
`X` compatible with the [Tables.jl](https://github.com/JuliaData/Tables.jl) interface and
satisfying `Tables.istable(X) == true` (most of the formats in [this
list](https://github.com/JuliaData/Tables.jl/blob/master/INTEGRATIONS.md)) have the
scientific type `Table{K}`, where `K` depends on the scientific types of the columns,
which can be individually inspected using `schema`:

```@repl doda
schema(X)
@@ -385,10 +382,10 @@ resampling is always more efficient in this case.

### Inputs

Since an MLJ model only specifies the scientific type of data, if that
type is `Table` - which is the case for the majority of MLJ models -
then any [Tables.jl](https://github.com/JuliaData/Tables.jl) format is
permitted.
Since an MLJ model only specifies the scientific type of data, if that type is `Table` -
which is the case for the majority of MLJ models - then any
[Tables.jl](https://github.com/JuliaData/Tables.jl) container `X` is permitted, so long as
`Tables.istable(X) == true`.

Specifically, the requirement for an arbitrary model's input is `scitype(X)
<: input_scitype(model)`.
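For instance (a sketch assuming MLJ is installed; `ConstantClassifier` is chosen here only because, as a built-in model, it accepts any table input):

```julia
using MLJ
import Tables

# A named tuple of equal-length vectors is a valid Tables.jl table:
X = (x1 = [1.0, 2.0], x2 = [3.0, 4.0])
Tables.istable(X)  # true

# Admissibility check for a model:
model = ConstantClassifier()
scitype(X) <: input_scitype(model)
```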
5 changes: 3 additions & 2 deletions docs/src/list_of_supported_models.md
@@ -30,13 +30,14 @@ independent assessment.
[BetaML.jl](https://github.com/sylvaticus/BetaML.jl) | - | BetaMLGMMImputer, BetaMLGMMRegressor, BetaMLGenericImputer, BetaMLMeanImputer, BetaMLRFImputer, DecisionTreeClassifier, DecisionTreeRegressor, GMMClusterer, KMeans, KMedoids, KernelPerceptronClassifier, MissingImputator, PegasosClassifier, PerceptronClassifier, RandomForestClassifier, RandomForestRegressor | medium |
[Clustering.jl](https://github.com/JuliaStats/Clustering.jl) | [MLJClusteringInterface.jl](https://github.com/JuliaAI/MLJClusteringInterface.jl) | KMeans, KMedoids | high | †
[DecisionTree.jl](https://github.com/bensadeghi/DecisionTree.jl) | [MLJDecisionTreeInterface.jl](https://github.com/JuliaAI/MLJDecisionTreeInterface.jl) | DecisionTreeClassifier, DecisionTreeRegressor, AdaBoostStumpClassifier, RandomForestClassifier, RandomForestRegressor | high |
[EvoTrees.jl](https://github.com/Evovest/EvoTrees.jl) | - | EvoTreeRegressor, EvoTreeClassifier, EvoTreeCount, EvoTreeGaussian | medium | gradient boosting models
[EvoTrees.jl](https://github.com/Evovest/EvoTrees.jl) | - | EvoTreeRegressor, EvoTreeClassifier, EvoTreeCount, EvoTreeGaussian | medium | tree-based gradient boosting models
[EvoLinear.jl](https://github.com/jeremiedb/EvoLinear.jl) | - | EvoLinearRegressor | medium | linear boosting models
[GLM.jl](https://github.com/JuliaStats/GLM.jl) | [MLJGLMInterface.jl](https://github.com/JuliaAI/MLJGLMInterface.jl) | LinearRegressor, LinearBinaryClassifier, LinearCountRegressor | medium | †
[LIBSVM.jl](https://github.com/mpastell/LIBSVM.jl) | [MLJLIBSVMInterface.jl](https://github.com/JuliaAI/MLJLIBSVMInterface.jl) | LinearSVC, SVC, NuSVC, NuSVR, EpsilonSVR, OneClassSVM | high | also via ScikitLearn.jl
[LightGBM.jl](https://github.com/IQVIA-ML/LightGBM.jl) | - | LGBMClassifier, LGBMRegressor | high |
[Flux.jl](https://github.com/FluxML/Flux.jl) | [MLJFlux.jl](https://github.com/FluxML/MLJFlux.jl) | NeuralNetworkRegressor, NeuralNetworkClassifier, MultitargetNeuralNetworkRegressor, ImageClassifier | low |
[MLJLinearModels.jl](https://github.com/JuliaAI/MLJLinearModels.jl) | - | LinearRegressor, RidgeRegressor, LassoRegressor, ElasticNetRegressor, QuantileRegressor, HuberRegressor, RobustRegressor, LADRegressor, LogisticClassifier, MultinomialClassifier | medium |
[MLJModels.jl](https://github.com/JuliaAI/MLJModels.jl) (built-in) | - | StaticTransformer, FeatureSelector, FillImputer, UnivariateStandardizer, Standardizer, UnivariateBoxCoxTransformer, OneHotEncoder, ContinuousEncoder, ConstantRegressor, ConstantClassifier, BinaryThreshholdPredictor | medium |
[MLJModels.jl](https://github.com/JuliaAI/MLJModels.jl) (built-in) | - | ConstantClassifier, ConstantRegressor, ContinuousEncoder, DeterministicConstantClassifier, DeterministicConstantRegressor, FeatureSelector, FillImputer, InteractionTransformer, OneHotEncoder, Standardizer, UnivariateBoxCoxTransformer, UnivariateDiscretizer, UnivariateFillImputer, UnivariateTimeTypeToContinuous, BinaryThresholdPredictor | medium |
[MLJText.jl](https://github.com/JuliaAI/MLJText.jl) | - | TfidfTransformer, BM25Transformer, CountTransformer | low |
[MultivariateStats.jl](https://github.com/JuliaStats/MultivariateStats.jl) | [MLJMultivariateStatsInterface.jl](https://github.com/JuliaAI/MLJMultivariateStatsInterface.jl) | LinearRegressor, MultitargetLinearRegressor, RidgeRegressor, MultitargetRidgeRegressor, PCA, KernelPCA, ICA, LDA, BayesianLDA, SubspaceLDA, BayesianSubspaceLDA, FactorAnalysis, PPCA | high |
[NaiveBayes.jl](https://github.com/dfdx/NaiveBayes.jl) | [MLJNaiveBayesInterface.jl](https://github.com/JuliaAI/MLJNaiveBayesInterface.jl) | GaussianNBClassifier, MultinomialNBClassifier, HybridNBClassifier | low |
9 changes: 4 additions & 5 deletions docs/src/quick_start_guide_to_adding_models.md
@@ -12,11 +12,10 @@ of how things work with MLJ. In particular, you are familiar with:

- what `Probabilistic`, `Deterministic` and `Unsupervised` models are

- the fact that MLJ generally works with tables rather than
matrices. Here a *table* is a container satisfying the
[Tables.jl](https://github.com/JuliaData/Tables.jl) API (e.g.,
DataFrame, JuliaDB table, CSV file, named tuple of equal-length
vectors)
- the fact that MLJ generally works with tables rather than matrices. Here a *table* is a
container `X` satisfying the [Tables.jl](https://github.com/JuliaData/Tables.jl) API and
satisfying `Tables.istable(X) == true` (e.g., DataFrame, JuliaDB table, CSV file, named
tuple of equal-length vectors)

- [CategoricalArrays.jl](https://github.com/JuliaData/CategoricalArrays.jl)
(if working with finite discrete data, e.g., doing classification)
19 changes: 13 additions & 6 deletions docs/src/transformers.md
@@ -39,12 +39,16 @@ MLJModels.UnivariateTimeTypeToContinuous

## Static transformers

The main use-case for static transformers is for insertion into
[Linear Pipelines](@ref) or other exported learning networks (see [Composing
Models](@ref)). If a static transformer has no hyper-parameters, it is
tantamount to an ordinary function. An ordinary function can be
inserted directly into a pipeline; the situation for learning
networks is only slightly more complicated; see [Static operations on nodes](@ref).
A *static transformer* is a model for transforming data that does not generalize to new
data (does not "learn") but which nevertheless has hyperparameters. For example, the
`DBSCAN` clustering model from Clustering.jl can assign labels to a collection of
observations, but cannot directly assign a label to a new observation.

The general user may define their own static models. The main use-case is the insertion of
some parameter-dependent transformation into [Linear Pipelines](@ref). (If a static transformer
has no hyper-parameters, it is tantamount to an ordinary function. An ordinary function
can be inserted directly into a pipeline; the situation for learning networks is only
slightly more complicated; see [Static operations on nodes](@ref).)

The following example defines a new model type `Averager` to perform
the weighted average of two vectors (target predictions, for
@@ -151,6 +155,9 @@ _.fitted_params_per_fold = [ … ]
_.report_per_fold = [ ]
```

A static transformer can also expose byproducts of the transform computation in the report
of any associated machine. See [Static models (models that do not generalize)](@ref) for
details.

## Transformers that also predict

8 changes: 5 additions & 3 deletions src/MLJ.jl
@@ -77,9 +77,10 @@ export scitype, scitype_union, elscitype, nonmissing, trait
export coerce, coerce!, autotype, schema, info

# re-export from MLJBase:
import MLJBase: serializable, restore!
export nrows, color_off, color_on,
selectrows, selectcols, restrict, corestrict, complement,
training_losses, feature_importances,
training_losses, feature_importances,
predict, predict_mean, predict_median, predict_mode, predict_joint,
transform, inverse_transform, evaluate, fitted_params, params,
@constant, @more, HANDLE_GIVEN_ID, UnivariateFinite,
@@ -96,7 +97,8 @@ export nrows, color_off, color_on,
default_resource, pretty,
make_blobs, make_moons, make_circles, make_regression,
fit_only!, return!, int, decoder,
default_scitype_check_level
default_scitype_check_level,
serializable, restore!

# MLJBase/composition/abstract_types.jl:
for T in vcat(MLJBase.MLJModelInterface.ABSTRACT_MODEL_SUBTYPES,
@@ -132,7 +134,7 @@ export models, localmodels, @load, @iload, load, info, doc,
Standardizer, UnivariateBoxCoxTransformer,
OneHotEncoder, ContinuousEncoder, UnivariateDiscretizer,
FillImputer, matching, BinaryThresholdPredictor,
UnivariateTimeTypeToContinuous
UnivariateTimeTypeToContinuous, InteractionTransformer

# re-export from MLJIteration:
export MLJIteration
