easystats for ML #340
Replies: 22 comments
-
I think this goes in the right direction, but do you have a working example?

```r
mfc(mtcars, kmeans(mtcars))
#> Error in kmeans(mtcars) : 'centers' must be a number or a matrix
#> Error in mfc(mtcars, kmeans(mtcars)) :
#>   "x" must be a eclust object
#>   You have provided an object of class: try-error

mfc(mtcars, kmeans(mtcars, centers = 5))
#> Error in methods::as(x, "eclust", strict = TRUE) :
#>   no method or default for coercing "kmeans" to "eclust"
#> Error in mfc(mtcars, kmeans(mtcars, centers = 5)) :
#>   "x" must be a eclust object
#>   You have provided an object of class: try-error
```
-
@strengejacke good catch! So I was quickly troubleshooting, and am having too many difficulties with the
Note that if you want to go this route, we may have to take individual clustering algorithms (e.g., pam clustering via the
-
Yes, that's fine. If the core function is identical, you can wrap it in a small helper function, which is called from the class-specific functions.
-
Yes, the idea is to find the common "denominator" of all classes, implement this subfunction as internal, and then create exported class-specific wrappers around it.
-
Thanks for the dialogue here (this is really fun, by the way). I am not sure I know how to nest functions like that, specifically:
The only chunk within the function that would need to change is:
From here, everything else in the function is identical. Any guidance on the "helper function" idea? Apologies, as this is a bit outside my knowledge, though I am happy to learn.
-
For example:

```r
model_performance.kmeans <- function(d, x, metrics = "all", ...) {
  if (!requireNamespace("fpc", quietly = TRUE)) {
    stop("Package \"fpc\" needed for this function to work. Please install it.",
         call. = FALSE)
  }
  # "d" must be either a data frame or a matrix
  if (!is.data.frame(d) && !is.matrix(d)) {
    stop('"d" must be a data frame or matrix\n',
         "You have provided an object of class: ", class(d)[1])
  }
  x <- try(methods::as(x, "kmeans", strict = FALSE))
  if (!inherits(x, "kmeans")) {
    stop('"x" must be a kmeans object\n',
         "You have provided an object of class: ", class(x)[1])
  }
  my_repeated_stuff()
}

my_repeated_stuff <- function() {
  # identical code here
}
```

Maybe with some arguments.
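To make the wrapper pattern concrete, here is a minimal sketch of how a second class-specific method could reuse the same internal helper, so only the input checking and coercion differ per class. The function and helper names here are purely illustrative, not actual performance internals:

```r
# Hypothetical second class method reusing the shared internal helper.
# Only the class check/coercion is specific; all metrics live in the helper.
model_performance.pam <- function(d, x, metrics = "all", ...) {
  if (!inherits(x, "pam")) {
    stop('"x" must be a pam object\n',
         "You have provided an object of class: ", class(x)[1])
  }
  # hand off cluster assignments in a class-agnostic form
  .cluster_performance(d, clusters = x$clustering, metrics = metrics, ...)
}

# Shared internal helper (name is illustrative): all metric
# computation common to kmeans, pam, hclust, etc. goes here.
.cluster_performance <- function(d, clusters, metrics = "all", ...) {
  # identical metric code for all clustering classes
}
```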
-
This is brilliant, thanks for the suggestion. I will take a look a bit later on. Thanks!
-
Does this make sense in the
-
@DominiqueMakowski Just revisiting this issue again... We have some cluster-related code in parameters. I'm not sure if it makes sense to put parts of the code into that package and leave the parts related to model performance here?
-
Yeah, having code in separate places is a bit odd. However, given that there is currently no substantive support for ML methods (whether unsupervised like clustering or otherwise), perhaps it makes most sense as is for now; then, as we develop support for ML, we can revisit this and consolidate into a new easystats package.
-
I am not sure a dedicated package for ML would be conceptually meaningful, as in the end ML is just a fancy name for predictive modelling 😁, parts of which are already taken care of by performance, parameters, and the rest (I mean, "performance" and "parameters" have a scope large enough to describe parameters and performance from regressions, and why not neural nets and all). However, some aspects often used in ML, such as feature reduction/selection, could indeed require special treatment -> #47
-
@DominiqueMakowski fair enough re: certain packages already taking care of some things related to ML. However, I'd disagree that ML is just a fancy name for predictive modeling. Yes, forecasting and prediction are key goals of ML, but so is classification. Further, ML is premised on a fundamentally different way of building models, from train/test splits to numeric optimization techniques, all in service of building effective learners from historical patterns. See the canonical paper on this treatment, for example, https://kcir.pwr.edu.pl/~witold/ai/cacm12.pdf, for whatever it might be worth. Yet, ultimately, from an
-
True true :)
-
The one major ML-related thing I would like to see in easystats would be easier access to train-test and k-fold cross-validation workflows. This is currently quite hard to do with most of the core R statistical modeling packages. For example, there isn't a good way to compute a true out-of-sample sigma or MSE from an

Some specific things to this end I would like would be:

Thoughts?
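As a rough illustration of the kind of workflow being asked for, here is a minimal base-R sketch of k-fold cross-validated out-of-sample RMSE for an `lm` model. All names are illustrative, not a proposed easystats API:

```r
# Minimal k-fold cross-validation sketch (illustrative only)
cv_rmse <- function(formula, data, k = 5) {
  # randomly assign each row to one of k folds
  folds <- sample(rep_len(seq_len(k), nrow(data)))
  errs <- vapply(seq_len(k), function(i) {
    train <- data[folds != i, , drop = FALSE]
    test  <- data[folds == i, , drop = FALSE]
    fit   <- lm(formula, data = train)
    pred  <- predict(fit, newdata = test)
    # RMSE on the held-out fold
    sqrt(mean((test[[all.vars(formula)[1]]] - pred)^2))
  }, numeric(1))
  mean(errs)  # average out-of-sample RMSE across folds
}

set.seed(123)
cv_rmse(mpg ~ wt + hp, mtcars, k = 5)
```

The point is that nothing here is conceptually hard; it is just tedious to write by hand for every model class, which is exactly what a wrapper could absorb.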
-
As discussed in an earlier post, many of these ML-related tasks are covered elsewhere in the tidyverse, making an easystats version likely a duplication. For example,

Here is a more "complex" example (varying polynomial order), but in a still clean workflow.
-
Those examples require a pretty un-easystats-like syntax and philosophy. I'm saying that we should make basic cross-validation as easy to do as, say, bootstrapping or computing an AIC.
-
I agree with @bwiernik.

Would that be any different from a wrapper that 1) calls `update(model, data = test)` and 2) re-runs the performance functions? We already have:

```r
head(modelbased::estimate_prediction(lm(mpg ~ vs, data = mtcars)))
#> Model-based Prediction
#>
#> vs | Predicted | SE | 95% CI | Residuals
#> ----------------------------------------------------
#> 0.00 | 16.62 | 4.71 | [ 7.01, 26.23] | -4.38
#> 0.00 | 16.62 | 4.71 | [ 7.01, 26.23] | -4.38
#> 1.00 | 24.56 | 4.74 | [14.87, 34.24] | 1.76
#> 1.00 | 24.56 | 4.74 | [14.87, 34.24] | 3.16
#> 0.00 | 16.62 | 4.71 | [ 7.01, 26.23] | -2.08
#> 1.00 | 24.56 | 4.74 | [14.87, 34.24] | 6.46
#>
#> Variable predicted: mpg
```

Created on 2021-07-31 by the reprex package (v2.0.0)
-
@pdwaggoner rather than "does this feature already exist somewhere else?", the relevant question guiding easystats efforts should be "can easystats make this feature more easy to use / intuitive / neat / smart" etc. After all, the majority of our features do exist elsewhere, we just make it smooth and easy :)
-
Yes, we don't want to refit the model. We want to use the same parameters and estimate performance on the new data: essentially `predict()`, then computing deviance/residuals/R2/etc. So basically, the above plus a wrapper for cross-validation are what's missing.
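A minimal sketch of that predict-then-score idea, computing out-of-sample RMSE and R2 on held-out data without refitting the model. The function name is hypothetical, not an existing performance API:

```r
# Hypothetical helper: score an already-fitted model on new data
performance_newdata <- function(model, newdata) {
  # response name taken from the model formula
  y <- newdata[[all.vars(formula(model))[1]]]
  pred  <- predict(model, newdata = newdata)
  resid <- y - pred
  data.frame(
    RMSE = sqrt(mean(resid^2)),
    # out-of-sample R2 (can be negative for a poor model)
    R2   = 1 - sum(resid^2) / sum((y - mean(y))^2)
  )
}

# illustrative train/test split
set.seed(42)
idx <- sample(nrow(mtcars), 24)
m <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
performance_newdata(m, mtcars[-idx, ])
```

Note the design choice: the model's coefficients are left untouched; only the metrics are recomputed against `newdata`, which is exactly the "no refitting" behavior discussed above.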
-
Absolutely @DominiqueMakowski and I love that about the easystats ecosystem/approach. I was just referencing the earlier conversation in #165.
-
I agree with @pdwaggoner that there are other (popular) tools out there that do many of the things you've all discussed for ML models. Can we do them better? Probably (I am not a fan of the tidymodels piped-to-death design). But I think our time would be better spent focusing on non-ML models where possible. However, I also agree with @bwiernik that we should and can make simple out-of-(training-)sample validation easy (give model, give newdata, get predictions and R2, RMSE, etc.), as these are operations that are not restricted to the ML world, and we can help popularize them here (:
-
That's exactly my thought!
-
Here is where we can more succinctly workshop an idea for approaching the ML world from an easystats perspective. Per @strengejacke's suggestion, I wanted to just get the conversation started with a simple function idea to add to `performance` (based on `model_performance.lm`). Is this more in line with something you're considering? If not, let me know, as there are many directions we could take this.

Re: `caret`, I think it would take a bit more careful thought, as with most of the more advanced ML techniques, because there is much more behind "fitting" a model, such as tuning algorithms, creating and iterating over training and testing data sets, and so on. It's significantly more involved (though not overly arduous) than fitting a basic `lm` or `glm`, for example. Let me know your thoughts here, and we can decide how best to move forward in the ML world as it makes sense with the mission and focus of the `easyverse` (very nice, by the way ;) , love it).