Predicting house price using a random forest model and a smoothing spline model


  • Predicts house prices using a random forest model and a smoothing spline model.
  • Both models are built on dtrain and tested on dtest.


The comparison of the two models is outlined in this document. Detailed pre-processing and model-building processes can be found in the corresponding folders.

Smoothing Spline

Final Model

# Object name fit.gam is assumed, by analogy with fit.rf in the random forest section
fit.gam <- mgcv::gam(price ~ s(rooms) + s(bedrm) +
                       s(ayb, k = 20, by = cndtn) + s(eyb) + s(saledate) + s(gba) +
                       heat + ac + style + grade + cndtn + roof + kitchens + ward +
                       ti(eyb, ayb) + ti(gba, landarea) + ti(longitude, gba) +
                       ti(longitude, ayb) + ti(longitude, eyb) +
                       ti(saledate, latitude),
                     data = dat_full)
  • Improves prediction accuracy by 50% compared to the basic linear model

Random Forest

Final Model

fit.rf <- ranger::ranger(price ~ . - fold,
                data = dat_full,
                mtry = 37, splitrule = "extratrees",
                min.node.size = 5)
  • Improves prediction accuracy by 22% compared to the basic linear model


  • R is the primary language
  • The prediction error is evaluated by RMSLE (Root-Mean-Squared-Logarithmic-Error)
  • The same pre-processing is applied for both models, since the datasets have the same variables and differ only in their observations. However, the interaction variables inter1 to inter6 are used only in the random forest, not in the smoothing spline. Specific pre-processing can be found in the markdown/pdf in the corresponding folders.
  • 5-fold cross-validation is used for model comparison
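
As a rough illustration of the evaluation setup described above, the following sketch computes a root mean squared logarithmic error and averages it over five folds. The helper names `rmsle` and `cv_rmsle`, the use of `log1p` (rather than a plain `log`), and the simulated data are all assumptions made for this example; the project's own implementation lives in the model folders and may differ.

```r
# Root mean squared logarithmic error.
# Assumption: log1p(x) = log(1 + x) is used so that a price of 0 stays finite.
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# Mean k-fold RMSLE for any fit/predict pair (hypothetical helper).
cv_rmsle <- function(data, fold, fit_fun, predict_fun, response = "price") {
  scores <- sapply(sort(unique(fold)), function(k) {
    train <- data[fold != k, , drop = FALSE]   # fit on the other folds
    test  <- data[fold == k, , drop = FALSE]   # score on the held-out fold
    model <- fit_fun(train)
    rmsle(test[[response]], predict_fun(model, test))
  })
  mean(scores)
}

# Simulated stand-in data; these are not the project's data or results
set.seed(1)
toy <- data.frame(gba = runif(200, 500, 4000))
toy$price <- exp(11 + 0.0004 * toy$gba + rnorm(200, sd = 0.2))
toy_fold <- sample(rep(1:5, length.out = nrow(toy)))

# Baseline linear model on the log scale, scored with 5-fold CV
cv_lm <- cv_rmsle(toy, toy_fold,
                  fit_fun = function(d) lm(log(price) ~ gba, data = d),
                  predict_fun = function(m, d) exp(predict(m, newdata = d)))
cv_lm
```

Passing the fitting and prediction steps as functions lets the same fold assignment be reused for a linear model, a GAM, or a random forest, which keeps the comparison like-for-like.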

Data Description

Variable Description (after pre-processing)

13 Factors:

  • heat: type of heat used in the house
  • ac: whether the house has air conditioning or not
  • style: describes the number of stories and/or structure of the house
  • grade: overall rating of the house
  • cndtn: condition of the house
  • extwall: material used for exterior wall
  • roof: type of roof
  • intwall: material used for interior wall
  • nbhd: ID of the neighborhood the house belongs to
  • ward: ID of the ward the house belongs to
  • quadrant: quadrant the house belongs to
  • if_rmdl: whether the house has ever been remodeled
  • buy_first: indicator variable that equals 1 if the house was bought before it was built

6 Integers:

  • rooms: total number of rooms
  • bathrm: number of full bathrooms (shower + toilet)
  • bedrm: number of bedrooms
  • eyb: the year an improvement was built
  • kitchens: number of kitchens
  • fireplaces: number of fireplaces

19 Numerical:

  • ayb: the earliest time the main portion of the building was built
  • stories: number of stories in the primary dwelling
  • saledate: date of sale as numerical values
  • price (response): price of the house
  • gba: gross building area in square feet
  • landarea: land area of property in square feet
  • latitude: latitude of the house
  • longitude: longitude of the house
  • saleyear: year the house sold
  • rmdl_diff: the difference between the sale year and the remodel year; if the remodel was done after the sale, the value is 0
  • avg_room_size: average size of a room in square feet
  • build_age: age of the house (how long ago it was built)
  • total_bath: total number of full bathrooms and half bathrooms
  • inter1: interaction between latitude and saledate (used in random forest only)
  • inter2: interaction between longitude and saledate (used in random forest only)
  • inter3: interaction between gba and saledate (used in random forest only)
  • inter4: interaction between landarea and longitude (used in random forest only)
  • inter5: interaction between eyb and ayb (used in random forest only)
  • inter6: interaction between latitude and build_age (used in random forest only)
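
To make the derived variables above more concrete, here is a minimal sketch of how a few of them could be computed. The formulas are assumptions inferred from the descriptions (for example, that an interaction is a simple product and that build_age is measured at the time of sale), `yr_rmdl` is a hypothetical column name for the remodel year, and the three rows are made-up values. The actual definitions are in the pre-processing files.

```r
# Made-up rows for illustration only
toy <- data.frame(
  gba      = c(1200, 2400, 1800),
  rooms    = c(6, 8, 9),
  ayb      = c(1950, 1990, 1925),
  eyb      = c(1970, 2000, 1960),
  saleyear = c(2010, 2015, 2000),
  yr_rmdl  = c(2005, 2018, 1995)   # hypothetical name for the remodel year
)

# Assumed definitions, inferred from the variable descriptions
toy$avg_room_size <- toy$gba / toy$rooms                   # square feet per room
toy$build_age     <- toy$saleyear - toy$ayb                # years between build and sale
toy$rmdl_diff     <- pmax(toy$saleyear - toy$yr_rmdl, 0)   # 0 if remodeled after the sale
toy$inter5        <- toy$eyb * toy$ayb                     # interaction taken as a product

toy[, c("avg_room_size", "build_age", "rmdl_diff", "inter5")]
```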

Numerical Variables

library(dplyr)    # select(), all_of()
library(tidyr)    # pivot_longer()
library(ggplot2)

excluded_vars <- c("inter1", "inter2", "inter3", "inter4",
                   "inter5", "inter6", "fold")

# Draw one histogram per numeric column, skipping the excluded columns
plot_histograms <- function(df, exclude_vars = NULL,
                            bin_count = 30) {
  if (!is.null(exclude_vars)) {
    df <- select(df, -all_of(exclude_vars))
  }
  numeric_df <- df[sapply(df, is.numeric)]

  # Reshape to long format so each variable gets its own facet
  long_df <- pivot_longer(numeric_df, cols = everything(),
                          names_to = "Column", values_to = "Value")

  p <- ggplot(long_df, aes(x = Value)) +
    geom_histogram(bins = bin_count, fill = "orange", color = "black") +
    facet_wrap(~ Column, scales = "free") +
    theme_minimal() +
    theme(plot.title = element_text(size = 10, face = "bold"),
          axis.text = element_text(size = 6),
          axis.title = element_text(size = 6)) +
    labs(title = "Histograms for Numeric variables", x = "Value", y = "Count")

  p
}

plot_histograms(dat_full, exclude_vars = excluded_vars)
[Figure: histograms of the numeric variables]

library(ggplot2)
library(patchwork)  # `+` on ggplot objects and plot_layout()

# Scatter plot of target_var against every other numeric column,
# each with a fitted linear trend line
plot_numeric <- function(data, target_var, exclude_vars) {
  numeric_vars <- sapply(data, is.numeric)
  numeric_vars[intersect(exclude_vars, names(numeric_vars))] <- FALSE
  plots <- list()
  for (var in names(numeric_vars)[numeric_vars]) {
    if (var != target_var) {
      p <- ggplot(data, aes_string(x = var, y = target_var)) +
        geom_point(alpha = 0.5, col = "steelblue") +
        geom_smooth(method = "lm", color = "orange") +
        labs(title = paste(target_var, "vs", var),
             x = var,
             y = target_var) +
        theme(plot.title = element_text(size = 10),
              axis.text = element_text(size = 6),
              axis.title = element_text(size = 6))

      plots[[var]] <- p
    }
  }

  # Combine the panels into a single patchwork figure
  Reduce(`+`, plots) + plot_layout(guides = "collect")
}

excluded_vars <- c("inter1", "inter2", "inter3", "inter4", "inter5", "inter6", "fold")
plot_numeric(dat_full, "price", excluded_vars)
[Figure: scatter plots of price against each numeric variable, with linear trend lines]

Judging by the fitted linear trend lines, all the numerical variables other than longitude have a positive relationship with price.
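
One way to check the direction of these relationships numerically, rather than by eye, is to compute the correlation of each numeric variable with price. The sketch below is an illustration under stated assumptions: `cor_with_target` is a hypothetical helper, Pearson correlation is only one possible measure, and the data are simulated rather than taken from the project.

```r
# Pearson correlation of every numeric column with the target, sorted ascending.
cor_with_target <- function(data, target_var, exclude_vars = character()) {
  is_num <- vapply(data, is.numeric, logical(1))
  vars <- setdiff(names(data)[is_num], c(target_var, exclude_vars))
  sort(vapply(vars, function(v) {
    cor(data[[v]], data[[target_var]], use = "complete.obs")
  }, numeric(1)))
}

# Simulated stand-in data: price rises with gba and falls with longitude
set.seed(42)
toy <- data.frame(gba = runif(100, 500, 4000),
                  longitude = runif(100, -1, 1))
toy$price <- 100 * toy$gba - 150000 * toy$longitude + rnorm(100, sd = 20000)
toy$fold <- rep(1:5, 20)

cors <- cor_with_target(toy, "price", exclude_vars = "fold")
cors   # negative entries flag variables that move against the target
```

Applied to the real data, this would be called as `cor_with_target(dat_full, "price", excluded_vars)`.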


ggplot(dat_ori, aes(x = longitude, y = latitude, color = price, size = price)) +
  geom_point(alpha = 0.5, shape = 15) + 
  scale_color_gradient(low = "lightblue", high = "firebrick") +  
  ggtitle("Geospatial Distribution of Price") +
  xlab("Longitude") +
  ylab("Latitude")
[Figure: geospatial distribution of price]

Plotting price against latitude and longitude shows that the area around longitude = -74.2 and latitude = 40.725 has higher prices, which indicates that location is an important factor in house price.

Categorical Variables

library(ggplot2)
library(patchwork)  # wrap_plots()

# Jitter plot of target_var against every factor column, shown as two figures
plot_cate <- function(data, target_var) {
  factor_vars <- sapply(data, is.factor)
  plots <- list()
  for (var in names(factor_vars)[factor_vars]) {
    p <- ggplot(data, aes_string(x = var, y = target_var)) +
      geom_jitter(width = 0.2, alpha = 0.5, color = "darkblue") +
      labs(title = paste(target_var, "vs", var),
           x = var, y = target_var) +
      theme(plot.title = element_text(size = 10),
            axis.text = element_text(size = 6),
            axis.title = element_text(size = 6),
            axis.text.x = element_text(angle = 45, hjust = 1))
    plots[[var]] <- p
  }

  # First seven panels in one figure, the remaining panels in a second one
  # (the split point avoids plotting any variable twice)
  group1 <- head(plots, 7)
  group2 <- if (length(plots) > 7) plots[8:length(plots)] else NULL

  if (length(group1) > 0) {
    print(wrap_plots(group1))
  }
  if (!is.null(group2)) {
    print(wrap_plots(group2))
  }
}

plot_cate(data = dat_full, target_var = "price")
[Figures: price against each categorical variable, shown in two groups]

For some variables, prices differ clearly across levels. These variables are:

  • cndtn: the better the condition, the higher the price.
  • grade: the better the rating, the higher the price.
  • ward: houses located in wards 2 and 3 have higher prices, while houses located in wards 7 and 8 have the lowest prices.
  • quadrant: houses located in northwest tend to have higher prices.
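
The level-by-level differences listed above can also be summarised with a per-level median. The following is a small sketch, assuming the median is a reasonable summary for skewed prices; `median_by_level` is a hypothetical helper and the six rows are made-up values, not project data.

```r
# Median of the target per factor level, sorted from lowest to highest.
median_by_level <- function(data, factor_var, target_var = "price") {
  groups <- split(data[[target_var]], data[[factor_var]])
  sort(vapply(groups, median, numeric(1), na.rm = TRUE))
}

# Made-up rows for illustration only
toy <- data.frame(
  ward  = factor(c("2", "2", "3", "7", "7", "8")),
  price = c(900, 1100, 950, 300, 350, 280) * 1000
)

med <- median_by_level(toy, "ward")
med
```

On the real data the same call would be, for example, `median_by_level(dat_full, "ward")` or `median_by_level(dat_full, "grade")`.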

Comparison Summary

A detailed comparison can be found in the final report.

[Figure: model comparison summary]

  • Prediction accuracy: smoothing spline is better
  • Computational complexity and runtime: random forest is better
  • Ease of use/model building: random forest is easier to build as there are fewer fine-tuning parameters
  • Interpretation: smoothing spline is more interpretable than random forest
  • Sensitivity to outliers: random forest is more robust to outliers than smoothing spline


