This repository contains a Python script mf.py that implements Matrix Factorization for collaborative filtering. Collaborative filtering is a technique used in recommendation systems to predict user preferences by collecting information from many users. Matrix Factorization is one of the popular methods used in collaborative filtering.

wangyuhsin/matrix-factorization

Matrix Factorization

This repository contains the implementation of Matrix Factorization in Python. Matrix Factorization is a collaborative filtering technique commonly used in recommender systems. It aims to factorize a user-item rating matrix into two lower-rank matrices, representing user and item embeddings. These embeddings capture the latent features of users and items and can be used to predict missing ratings and generate personalized recommendations.

Introduction to Matrix Factorization

Matrix Factorization is a dimensionality reduction technique that decomposes a matrix into two lower-rank matrices. In the context of recommender systems, the matrix being factorized is the user-item rating matrix, where each entry represents the rating given by a user to an item. The goal is to find two matrices, one representing users and the other representing items, such that their product approximates the original rating matrix.

The factorization process discovers latent features or factors that represent the underlying characteristics of users and items. These latent factors can capture various attributes such as genre preferences, item popularity, user tastes, etc. By multiplying the user and item embeddings, we obtain an approximation of the rating matrix, which can be used to predict missing ratings and generate personalized recommendations.
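The approximation described above can be illustrated with a minimal NumPy sketch. The sizes, the random embeddings, and the names `U`, `V`, and `R_hat` are assumptions chosen for illustration; this is not code taken from mf.py.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration.
num_users, num_items, K = 4, 6, 2

# One K-dimensional embedding (row) per user and per item.
U = rng.normal(size=(num_users, K))
V = rng.normal(size=(num_items, K))

# The full matrix of predicted ratings is the product U V^T.
R_hat = U @ V.T

# A single prediction is the dot product of one user row and one item row.
u, i = 1, 3
single = U[u] @ V[i]
```

Because `R_hat` is the product of a `num_users x K` and a `K x num_items` matrix, its rank is at most `K`, which is what "lower-rank" means here.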

Files

  • mf.py: This file contains the implementation of Matrix Factorization using Python. It includes functions for data encoding, creating embeddings, computing predictions, calculating the cost function, performing gradient descent, and more. The file is well-documented and provides detailed explanations for each function.
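As a rough idea of what a cost function for this model might look like, the sketch below computes the mean squared error over observed ratings only. The function name `mse_cost` and its signature are hypothetical; the actual `cost` function in mf.py takes a DataFrame and may differ in its details.

```python
import numpy as np

def mse_cost(user_idx, item_idx, ratings, emb_user, emb_item):
    """Mean squared error over the observed (user, item, rating) triples."""
    # Row-wise dot product between each rating's user and item embedding.
    preds = np.sum(emb_user[user_idx] * emb_item[item_idx], axis=1)
    return np.mean((ratings - preds) ** 2)

# Toy embeddings and ratings, made up for this example.
emb_user = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_item = np.array([[2.0, 1.0], [1.0, 3.0], [0.5, 0.5]])
user_idx = np.array([0, 0, 1])
item_idx = np.array([0, 1, 1])
ratings = np.array([2.0, 2.0, 5.0])

toy_cost = mse_cost(user_idx, item_idx, ratings, emb_user, emb_item)
```

Only the three observed pairs contribute to the cost; unobserved entries of the rating matrix are ignored rather than treated as zeros.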

Dataset

The implementation uses the MovieLens ml-latest-small dataset, which can be downloaded from GroupLens (the download command is given under Getting Started). The dataset contains user ratings of movies, which are used to train and evaluate the matrix factorization model.

Getting Started

To get started with this repository, follow these steps:

  1. Make sure you have Python installed on your system (version 3 or above).
  2. Install the required dependencies by running the following command:
$ pip install -r requirements.txt
  3. Download and extract the MovieLens dataset, and place the extracted dataset folder (ml-latest-small) in the same directory as the mf.py file.
$ wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
$ unzip ml-latest-small.zip

Usage

To use the matrix factorization implementation in mf.py, you can follow these steps:

  1. Import the necessary libraries and functions from mf.py into your Python script or interactive environment.
import numpy as np
import pandas as pd
from scipy import sparse
from mf import encode_data, encode_new_data, create_embedings, gradient_descent, cost
  2. Load the dataset using pandas and preprocess it with the provided encode_data function (encode_new_data is used later, for validation data).
df = pd.read_csv("ml-latest-small/ratings.csv")
df, num_users, num_movies = encode_data(df)
  3. Create initial user and item embeddings using the create_embedings function.
K = 50  # Number of factors in the embedding
emb_user = create_embedings(num_users, K)
emb_movie = create_embedings(num_movies, K)
  4. Use the gradient_descent function to train the matrix factorization model on the training dataset.
emb_user, emb_movie = gradient_descent(df, emb_user, emb_movie, iterations=2000, learning_rate=1, df_val=None)
  5. Optionally, you can evaluate the model's performance on a validation dataset using the cost function.
df_val = pd.read_csv("ratings_val.csv")
df_val = encode_new_data(df_val, df)
validation_cost = cost(df_val, emb_user, emb_movie)
print("Validation cost:", validation_cost)

Output:

Validation cost: 2.467201855162542
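  6. After training, you can make recommendations for a user by computing the dot product of that user's embedding with every item embedding.
user_id = 42  # row index into emb_user (the encoded user id)
user_embedding = emb_user[user_id]
item_embeddings = emb_movie
# Predicted rating for every movie: user vector dotted with each movie vector
predicted_ratings = np.dot(user_embedding, item_embeddings.T)
# Indices of the five highest predicted ratings, best first
top_movies = np.argsort(predicted_ratings)[-5:][::-1]
for movie_id in top_movies:
    print("Movie ID:", movie_id)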

Output:

Movie ID: 1395
Movie ID: 2427
Movie ID: 232
Movie ID: 2158
Movie ID: 28

Please refer to the code in mf.py for more details on each function and its parameters. The comments within the code explain the purpose and behavior of each function.
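The encoding steps used above can be pictured with the following sketch, which maps raw ids to contiguous row indices and applies the same mapping to validation data. It assumes that rows with ids unseen during training are simply dropped; the helper `build_index` and the toy DataFrames are hypothetical, and encode_data and encode_new_data in mf.py may behave differently.

```python
import pandas as pd

def build_index(values):
    """Map each distinct raw id to a contiguous integer index (0, 1, 2, ...)."""
    return {raw: idx for idx, raw in enumerate(pd.unique(values))}

# Toy data, made up for this example.
train = pd.DataFrame({"userId": [10, 10, 42],
                      "movieId": [500, 7, 500],
                      "rating": [4.0, 3.0, 5.0]})
val = pd.DataFrame({"userId": [42, 99],
                    "movieId": [7, 500],
                    "rating": [2.0, 4.5]})

user_index = build_index(train["userId"])
movie_index = build_index(train["movieId"])

# Encode the training data with the learned mappings.
train_enc = train.assign(userId=train["userId"].map(user_index),
                         movieId=train["movieId"].map(movie_index))

# Encode validation data with the SAME mappings; ids never seen in training
# map to NaN and are dropped (assumed policy, see the lead-in).
val_enc = val.assign(userId=val["userId"].map(user_index),
                     movieId=val["movieId"].map(movie_index))
val_enc = val_enc.dropna(subset=["userId", "movieId"])
val_enc = val_enc.astype({"userId": int, "movieId": int})

# Reverse lookup: turn an embedding row index back into a raw movie id.
index_to_movie = {idx: raw for raw, idx in movie_index.items()}
```

A reverse mapping like `index_to_movie` is useful when the indices returned by `np.argsort` need to be translated back into the ids found in the original ratings file.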

Conclusion

The Matrix Factorization repository provides a Python implementation of Matrix Factorization, a popular technique used in recommender systems. It allows you to factorize a user-item rating matrix into user and item embeddings, which can be used to predict ratings and generate personalized recommendations. By utilizing gradient descent with momentum, the repository enables efficient training and optimization of the embeddings. Feel free to explore the repository and customize the code to suit your specific needs.
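For readers who want a feel for gradient descent with momentum on this model, here is a small self-contained sketch. It assumes a mean squared error loss and an exponential-moving-average form of momentum; the function `train_mf`, its parameters, and the toy ratings are hypothetical and are not the gradient_descent function from mf.py.

```python
import numpy as np

def train_mf(user_idx, item_idx, ratings, num_users, num_items, K=2,
             iterations=1000, learning_rate=0.1, beta=0.9, seed=0):
    """Fit user/item embeddings by full-batch gradient descent with momentum."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(0, 1, size=(num_users, K))
    V = rng.uniform(0, 1, size=(num_items, K))
    vel_U, vel_V = np.zeros_like(U), np.zeros_like(V)
    n = len(ratings)
    history = []
    for _ in range(iterations):
        # Prediction error on the observed ratings only.
        err = np.sum(U[user_idx] * V[item_idx], axis=1) - ratings
        history.append(np.mean(err ** 2))
        # Gradients of the MSE, accumulated per user row and per item row.
        grad_U, grad_V = np.zeros_like(U), np.zeros_like(V)
        np.add.at(grad_U, user_idx, (2.0 / n) * err[:, None] * V[item_idx])
        np.add.at(grad_V, item_idx, (2.0 / n) * err[:, None] * U[user_idx])
        # Momentum: step along a running average of past gradients.
        vel_U = beta * vel_U + (1 - beta) * grad_U
        vel_V = beta * vel_V + (1 - beta) * grad_V
        U -= learning_rate * vel_U
        V -= learning_rate * vel_V
    return U, V, history

# Toy ratings: 3 users, 3 items, 6 observed entries (made up).
user_idx = np.array([0, 0, 1, 1, 2, 2])
item_idx = np.array([0, 1, 1, 2, 0, 2])
ratings = np.array([5.0, 3.0, 4.0, 1.0, 2.0, 5.0])

U, V, history = train_mf(user_idx, item_idx, ratings, num_users=3, num_items=3)
```

`np.add.at` is used because the same user or item can appear in several ratings, so each row's gradient has to be summed over all of them.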

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

The initial codebase and project structure are adapted from the MSDS 630 course materials provided by the University of San Francisco (USFCA-MSDS). Special thanks to the course instructors for the inspiration.
