Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

function to summarize loss reductions achieved by each base-learner #27

Open
fabian-s opened this issue Jan 18, 2016 · 1 comment
Open

Comments

@fabian-s
Copy link

Almond Stöcker und Tobi Kühn have started to write a function to extract the amount of risk reduction contributed by each base-learner from a fitted mboost model.
We'd like to see this included in mboost, with visualisation options, if possible, as it seems to answer a question that comes up frequently in postprocessing and interpreting boosting fits: which base-learners are the most important for the fit?
See code below.

# variable importance
RelRisk <- function( model, ohne1 = FALSE )
{
  baselearner <- names(model$baselearner)
  learner.type = sapply( strsplit(baselearner, "\\(") , "[[", 1)
  learner.type.new = as.vector(sapply(learner.type, FUN = function(x) 
    switch(x, bols = "(linear)",
              bbs  = "(nonlinear)",
              bspatial = "(spatial)")))
  var.names    = as.vector(variable.names(model))  

  baselearner.short = paste(var.names, learner.type.new) 
  n <- length(model$response)

  # Which Baselearners were selected while boosting:
  selected <- model$xselect()

  # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Emp. Risk:
  # Initial Risk for the Intercept Model:
  Risk0 <- with( model, family@risk( response, offset ) )
  # Risk after the Boosting-Steps:
  Risk <- model$risk()
  # Risk loss per step:
  RiskDif <- c(Risk0, Risk[-length(Risk)]) - Risk
  # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  # Falls gewuenscht, ohne den ersten Schritt:
  if(ohne1) 
  {
    RiskDif <- RiskDif[-1]
    selected <- selected[-1]
  }

  # Explained Risk attributed to Baselearners
  explained <- rep( 0, length(baselearner) )
  for( i in 1:length(baselearner)) explained[i] <- sum( RiskDif[which(selected==i)] )

  # Selection percentage of the baselearners
  frequence <- rep(0, length(baselearner))
  for( i in 1:length(baselearner)) frequence[i] <- mean( selected == i )

  par(mar = c(5, 13, 4, 2) + 0.1 )
  (b <- barplot(height = explained / n , names.arg = baselearner.short, las = 1, horiz = TRUE, main = paste("Endrisiko =", Risk[length(Risk)] / n) ) )
  text(x = max(explained/n)/10, y = b, labels = frequence)
  par(mar = c(5, 4, 4, 2) + 0.1 )
}

> cars.gb <- gamboost(dist ~ bols(speed) + bbs(speed, center=TRUE), data = cars,
+   control = boost_control(mstop = 50))
> RelRisk(cars.gb)

image

Any input on what such a function should or should not do would be highly appreciated!

@ja-thomas
Copy link
Member

Hi Fabian,

awesome!

Some notes/thoughts:

  • This does not work for binary response because of the risk implementation in mboost (note that factors are encoded with (-1,1)). Risk0 will be 0 and a negative importance is set for the first base-learner.
  • Support of glmboost would be nice
  • Decreasing order of the bars
  • fixed number of digits for frequence

Here is how it looks for my data:

RelRisk(zero)
example
RelRisk(zero, ohne1= FALSE)
example2

and to compare my ggplot solution:
zero_importance.pdf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants