PalmSpringsBnB

(Image: ./docs/assets/ruby_less_3_python.png)

A Machine Learning powered project that suggests a list of prices (nightly, weeknight, weekend night, weekly and monthly) based on information about the property:

Demo video: https://www.youtube.com/watch?v=ggG5sFgQwDM

The Machine Learning model has been trained on the Palm Springs Vacation Rentals dataset available on Kaggle.

This application uses the PyCall Ruby gem, which allows calling Python code from Ruby.

Unfortunately, PyCall has specific requirements for how Python must be installed so that it can be accessed from Ruby, and the way Python is installed on Heroku doesn't meet them, so it wasn't possible to deploy the app with default settings.

Because of this technical restriction, I recorded a short demo of how it works; it is available on YouTube here.

The full source code of the application is available on GitHub, and the whole research is available in this Jupyter Notebook.

Training the Model

  1. For training the model, the following features were selected (shown here with their correlation to the target "price" we want to predict):
    df.corr()["price"].sort_values(ascending=False)
    price          1.000000
    numBed         0.447529
    numBathroom    0.301408
    numBedroom     0.284166
    numPeople      0.241956
        
  2. The accuracy each model achieved was as follows (a sketch of how such scores are typically computed follows this list):
    Model             TRAIN Accuracy  TEST Accuracy
    Decision Tree     80.04%          78.39%
    Random Forest     80.04%          78.44%
    LinearRegression  35.23%          -
    Ridge             35.23%          -
  3. Choosing the model.

    The table shows the accuracy achieved on both the Train and Test sets (there are no Test-set results for the LinearRegression and Ridge models because they were already performing poorly on the Train set).

    As expected, the accuracy is a bit lower on the Test set (data the model hadn't seen before), but the difference from the training accuracy isn't big, so the models don't look prone to overfitting.

    We will productionise the Random Forest, as it has a slightly better score.
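
The notebook itself isn't reproduced here, but here is a minimal sketch of how such TRAIN/TEST scores are typically produced with scikit-learn (the data is a synthetic stand-in; for regressors, .score() returns the R² value, which is likely what the table reports as "accuracy"):

# A minimal sketch, not the project's actual notebook code
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target
X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

print("TRAIN accuracy:", model.score(X_train, y_train))
print("TEST accuracy: ", model.score(X_test, y_test))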

Deploy the Model

Because PyCall allows calling Python code from Ruby as if it were native, we can export the required pre-trained pipeline and estimator from Jupyter.

What are the pipeline and the estimator?

For a Machine Learning model to perform predictions, the data has to be passed in a specific format. It's not uncommon that the shape of the data provided by the user has to change before it can be passed to the model.

Assuming the user created a new property with the following data:

numBed  numBathroom  numBedroom  numPeople  city
1       1            1           2          Palm Springs

It will have to be transformed to something like:

1  1  1  2  0  0  1  0  0  0  0  0  0  1  0

The exact encoding rules are not important at the moment - it's the pipeline's job to know how to turn the first table into the second one.

It's this last table that will be passed to the estimator (our Model), which will perform the predictions.
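
The project's actual pipeline lives in the notebook, but a minimal sketch of this kind of transformation with scikit-learn could look as follows (the choice of ColumnTransformer and OneHotEncoder is an assumption for illustration - the real index_based_pipeline may be built differently):

# A sketch of a preprocessing pipeline; the project's real
# index_based_pipeline may be built differently
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

pipeline = ColumnTransformer(
    transformers=[
        # One-hot encode the categorical columns into 0/1 flags
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["city", "period"]),
    ],
    remainder="passthrough",  # numeric columns pass through unchanged
    sparse_threshold=0,       # force a dense array for easy printing
)

df = pd.DataFrame(
    [[1, 1, 1, 2, "Palm Springs", "Nightly"]],
    columns=["numBed", "numBathroom", "numBedroom", "numPeople", "city", "period"],
)

# In practice the pipeline is fitted once on the training data, then reused
print(pipeline.fit_transform(df))  # a purely numeric row: one-hot flags, then the numbers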

Exporting pipeline and estimator

# in Jupyter
import pickle

pickle.dump(index_based_pipeline, open('idx_pipeline.pickle', 'wb'))

pickle.dump(final_estimator, open('model.pickle', 'wb'))
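
Before wiring these files into Ruby, it can be worth a quick sanity check that they load back correctly in Python:

# Quick round-trip check of the exported files
import pickle

with open('idx_pipeline.pickle', 'rb') as f:
    pipeline = pickle.load(f)

with open('model.pickle', 'rb') as f:
    estimator = pickle.load(f)

print(type(pipeline), type(estimator))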

Ruby class that will use the exported Python code

# app/lib/ml/estimator.rb
require "pycall/import"
include PyCall::Import

class Ml::Estimator
  def initialize
    @pipeline = initialize_pipeline   # 1.
    @estimator = initialize_estimator # 2.

    @all_periods = ["Monthly", "Nightly", "Weekend night", "Weekly", "Weeknight"]
  end

  def predict(num_bed, num_bathroom, num_bedroom, num_people, city) # 3.
    all_variants = @all_periods.map { |period|   # 4.
      [num_bed, num_bathroom, num_bedroom, num_people, city, period] }

    prepared = @pipeline.transform(all_variants) # 5.
    predicted = @estimator.predict(prepared)     # 6.
    @all_periods.each_with_index                 # 7.
      .each_with_object({}) { |(period, idx), result| 
        result[period] = predicted[idx]
      }
  end

  def initialize_pipeline                                 # 1.
    pyimport :pickle                                      #
    pipeline_pkl = open("./ML/idx_pipeline.pickle", "rb") #
    pickle.load(pipeline_pkl)                             #
  end                                                     #

  def initialize_estimator                                # 2.
    pyimport :pickle                                      #
    estimator_pkl = open("./ML/model.pickle", "rb")       #
    pickle.load(estimator_pkl)                            #
  end                                                     #
end
  1. Initialises the pipeline - it loads and instantiates the exported Python object
  2. Initialises the estimator - a similar initialisation, for the estimator
  3. Defines a method that accepts the Property parameters specified by the user
  4. Prepares 5 variants - enriches the data by appending each of the five possible time periods
  5. Transforms the data into the format the estimator expects
  6. Performs the predictions
  7. Turns the predictions into a format that is easier for the caller to handle. The result is a hash that looks like:
{ "Monthly" => 123.45, 
  "Nightly" => 123.45,
  "Weekend night" => 123.45,
  "Weekly" => 123.45,
  "Weeknight" => 123.45 }

Ruby code that will allow cleaner integration with the rest of the application

The earlier code loads Python code and knows a lot of low-level details we don't want to expose to the rest of the application, so let's introduce another layer to facilitate that - it will accept a Property, extract its data, pass it on for predictions, and assign the results back to the Property:

# app/lib/ml.rb
class Ml
  def self.estimate_prices(property)
    predictions = 
      estimator.predict(
        property.number_of_beds,
        property.number_of_bathrooms,
        property.number_of_bedrooms,
        property.number_of_people,
        property.city
      )

    property.nightly_price = predictions["Nightly"]
    property.weeknight_price = predictions["Weeknight"]
    property.weekend_night_price = predictions["Weekend night"]
    property.weekly_price = predictions["Weekly"]
    property.monthly_price = predictions["Monthly"]

    property
  end

  def self.estimator
    @@estimator
  end

  def self.estimator=(estimator)
    @@estimator = estimator
  end
end

Initialising Ruby’s Ml::Estimator

Loading the Python code from the files turns out to be quite slow, and it wouldn't be a good idea to add this cost to every prediction we perform.

To address this issue, we can take advantage of Rails' initialisation phase.

Let’s add this code:

# config/initializers/ml_initializer.rb
Ml.estimator = Ml::Estimator.new

This ensures we load the Python code only once, when the Rails server (or any other command, e.g. a Rake task) is started. Without this overhead, all further predictions will be rapid.

Summary

Learnings

PyCall opens up great opportunities and much more flexibility for making Python code available to Ruby applications. Contrary to Sklearn-porter, it doesn't limit which kinds of Machine Learning algorithms it can support.

From my brief experience with PyCall, the rule of thumb "if you can export the code from Jupyter, you will be able to use it in a Ruby app" seems to hold. It is still important to understand how a particular model works internally to decide whether PyCall will be a good choice.

In this example application we exported a RandomForest - it is quite accurate and performant, and the generated pickle file is not very big. But consider KNN, which internally uses the whole dataset to make predictions: if that dataset is gigabytes (or even petabytes) in size, could you afford such a server to host your Ruby/Rails application? Would you accept the time it would take to start the application while it loads the whole dataset into memory?

Further considerations

PyCall was easy to use, but the initial setup might be a bit tricky. I have two distributions of Python installed on my machine - one managed by the asdf version manager, and Anaconda. I had to set environment variables (like PYTHON) to point at the proper executables in order for it to work.

For this very reason, it wasn't possible to deploy this simple app to Heroku with its default configuration. Maybe Docker would be the way to go forward?

Secondly, in this exercise I disregarded a lot of data from the original set in order to build a model quickly and try porting the pipeline and estimator to Ruby. As a result, the model is not super accurate - around 78.44% accuracy on the test set. I considered it good enough to productionise and release, in order to find out what the technical problems with this approach are and, most importantly, whether it would even be of interest to future users - but it is not a model that could be considered final in the longer term.

Now that it is (hypothetically) deployed and available for public use, we can get back to the research and continue improving the model while monitoring whether it is actually used. How much does 78.44% accuracy matter if no one really wants to use it?

Ideas for improving the Model

The dataset has many more features that we didn't consider during this exercise. We could spend more effort extracting data from the JSON columns (like property features or additional fees), or deriving a sentiment score from the property description.
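
For instance, here is a sketch of pulling simple flags out of a JSON column with pandas (the column name "amenities" and its keys are made-up assumptions - the real dataset's columns may differ):

# A sketch of extracting features from a JSON column;
# the column name and keys are illustrative assumptions
import json
import pandas as pd

df = pd.DataFrame({"amenities": ['{"pool": true, "wifi": false}',
                                 '{"pool": false, "wifi": true}']})

parsed = df["amenities"].apply(json.loads)
df["has_pool"] = parsed.apply(lambda a: int(a.get("pool", False)))
df["has_wifi"] = parsed.apply(lambda a: int(a.get("wifi", False)))
print(df[["has_pool", "has_wifi"]])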

At the moment we have a single model that makes the predictions for all the available periods - maybe it would be a good idea to train separate models, each focusing on the features that matter for a particular time period?
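
A sketch of what that could look like (with a tiny synthetic stand-in for the prepared data - the real features would come from the notebook):

# One model per time period instead of a single model that takes
# "period" as a feature; the DataFrame is a synthetic stand-in
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

periods = ["Monthly", "Nightly", "Weekend night", "Weekly", "Weeknight"]

df = pd.DataFrame({
    "numBed":    [1, 2, 3, 4] * len(periods),
    "numPeople": [2, 4, 6, 8] * len(periods),
    "period":    [p for p in periods for _ in range(4)],
    "price":     [100, 180, 260, 340] * len(periods),
})

models = {}
for period in periods:
    subset = df[df["period"] == period]
    X, y = subset[["numBed", "numPeople"]], subset["price"]
    models[period] = RandomForestRegressor(random_state=42).fit(X, y)

print(models["Nightly"].predict(pd.DataFrame({"numBed": [2], "numPeople": [4]})))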

It was fun

Analysing the dataset was great fun, and this little exercise has proven that it's the data cleaning and preparation that is the most time-consuming part of building Machine Learning models (it might not sound exciting, but I actually enjoyed this part - I got to understand the market a little better!).

Integrating the ready model and seeing it used from the application in a way that is completely transparent to the end user gives great satisfaction!

Learning and achieving all of this in my spare time was a little tiring, though. I guess I deserve some holidays in a nice place.

I’ve heard Palm Springs is quite nice :).

(Image: ./docs/assets/palm-springs.jpg)
