
FSRS vs SM-17


This is a simple comparison between FSRS and SM-17. The notebook FSRS-v-SM16-v-SM17.ipynb performs the comparison.

Due to the differences between the workflows of SuperMemo and Anki, it is not easy to compare the two algorithms. I tried to make the comparison as fair as possible. Here are some notes:

  • The first interval in SuperMemo is the duration between creating the card and the first review. In Anki, the first interval is the duration between the first review and the second review. So I removed the first record of each card in SM-17 data.
  • There are six grades in SuperMemo, but only four grades in Anki. So I merged 0, 1 and 2 in SuperMemo to 1 in Anki, and mapped 3, 4, and 5 in SuperMemo to 2, 3, and 4 in Anki.
  • I use the R (SM17)(exp) recorded in sm18/systems/{collection_name}/stats/SM16-v-SM17.csv as the prediction of SM-17. Reference: Confusion among R(SM16), R(SM17)(exp), R(SM17), R est. and expFI.
  • To ensure FSRS has the same information as SM-17, I implemented an online-learning version of FSRS, which, like SM-17, has zero knowledge of future reviews.
  • The results are based on data from a small group of people, and the results for other SuperMemo users may differ.
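The grade mapping described above can be sketched as a small helper (the function name is hypothetical; the notebook's actual implementation may differ):

```python
def map_supermemo_grade(grade: int) -> int:
    """Map a SuperMemo grade (0-5) to an Anki rating (1-4).

    SuperMemo 0, 1, and 2 are merged into Anki 1 (Again);
    SuperMemo 3, 4, and 5 map to Anki 2, 3, and 4.
    """
    return 1 if grade <= 2 else grade - 1
```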

Metrics

We use two metrics in the FSRS benchmark to evaluate how well these algorithms work: log loss and a custom RMSE that we call RMSE (bins).

  • Log Loss (also known as Binary Cross Entropy): Utilized primarily for its applicability in binary classification problems, log loss serves as a measure of the discrepancies between predicted probabilities of recall and review outcomes (1 or 0). It quantifies how well the algorithm approximates the true recall probabilities, making it an important metric for model evaluation in spaced repetition systems.
  • Weighted Root Mean Square Error in Bins (RMSE (bins)): This is a metric engineered for the FSRS benchmark. In this approach, predictions and review outcomes are grouped into bins according to the predicted probabilities of recall. Within each bin, the squared difference between the average predicted probability of recall and the average recall rate is calculated. These values are then weighted according to the sample size in each bin, and then the final weighted root mean square error is calculated. This metric provides a nuanced understanding of model performance across different probability ranges.

Smaller is better. If you are unsure which metric to look at, look at RMSE (bins). That value can be interpreted as "the average difference between the predicted probability of recalling a card and the measured probability". For example, if RMSE (bins) = 0.05, the algorithm is, on average, wrong by 5% when predicting the probability of recall.
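The two metrics can be sketched as follows. Note that this is a simplified illustration: it bins predictions into equal-width probability bins only, whereas the actual FSRS benchmark implementation may bin and weight differently.

```python
import math

def log_loss(probs, outcomes, eps=1e-15):
    """Mean binary cross-entropy between predicted recall probabilities
    and observed review outcomes (1 = recalled, 0 = forgotten)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

def rmse_bins(probs, outcomes, n_bins=10):
    """Group predictions into bins by predicted probability, compare each
    bin's mean prediction with its observed recall rate, and weight the
    squared errors by bin size before taking the square root."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    num = den = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        mean_y = sum(y for _, y in b) / len(b)
        num += len(b) * (mean_p - mean_y) ** 2
        den += len(b)
    return math.sqrt(num / den)
```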

Result

Total users: 16

Total repetitions: 194,281

The following tables represent the weighted means and the 99% confidence intervals.

Weighted by number of repetitions

| Algorithm | Log Loss | RMSE (bins) |
| --- | --- | --- |
| FSRS-4.5 | 0.4±0.08 | 0.06±0.021 |
| FSRSv4 | 0.4±0.09 | 0.07±0.025 |
| FSRSv3 | 0.4±0.09 | 0.08±0.021 |
| SM-17 | 0.4±0.10 | 0.08±0.020 |
| SM-16 | 0.4±0.09 | 0.11±0.026 |

Weighted by ln(number of repetitions)

| Algorithm | Log Loss | RMSE (bins) |
| --- | --- | --- |
| FSRS-4.5 | 0.4±0.08 | 0.09±0.030 |
| SM-17 | 0.5±0.10 | 0.10±0.029 |
| FSRSv4 | 0.4±0.09 | 0.11±0.043 |
| FSRSv3 | 0.5±0.10 | 0.11±0.035 |
| SM-16 | 0.5±0.11 | 0.12±0.033 |
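The two weighting schemes above can be sketched as follows (the per-user values here are hypothetical, for illustration only). Weighting by ln(n) instead of n reduces the influence of users with very large collections:

```python
import math

def weighted_mean(values, weights):
    """Weighted mean of per-user metric values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical per-user RMSE (bins) values and review counts.
rmses = [0.06, 0.08, 0.11]
reps = [50_000, 10_000, 1_000]

by_n = weighted_mean(rmses, reps)                          # weighted by n
by_ln = weighted_mean(rmses, [math.log(n) for n in reps])  # weighted by ln(n)
```

In this example the user with 50,000 reviews dominates the n-weighted mean, so `by_n` is pulled toward that user's RMSE, while `by_ln` gives the smaller collections more relative influence.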

The image below shows the p-values obtained by running the Wilcoxon signed-rank test on the RMSE (bins) of all pairs of algorithms. Red means that the row algorithm performs worse than the corresponding column algorithm, and green means that the row algorithm performs better than the corresponding column algorithm. Grey means that the p-value is >0.05, and we cannot conclude that one algorithm performs better than the other.

It's worth mentioning that this test is not weighted, and therefore doesn't take into account that RMSE (bins) depends on the number of reviews.
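A hand-rolled sketch of the Wilcoxon signed-rank test is below. In practice one would use scipy.stats.wilcoxon; this normal-approximation version is only illustrative and omits zero-difference handling, tie variance corrections, and continuity corrections:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (normal approximation).

    Returns (W+, two-sided p-value), where W+ is the sum of ranks of
    positive differences. Zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions i..j get ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Normal approximation to the null distribution of W+.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 1 - math.erf(abs(z) / math.sqrt(2))  # two-sided p-value
    return w_plus, p
```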

Wilcoxon-16-collections

Share your data

If you would like to support this project, please consider sharing your data with us. The shared data will be stored in the ./dataset/ folder.

You can open an issue to submit it: https://github.com/open-spaced-repetition/fsrs-vs-sm17/issues/new/choose

Contributors

  • leee_ 🔣
  • Jarrett Ye 🔣
  • 天空守望者 🔣
  • reallyyy 🔣
  • shisuu 🔣
  • Winston 🔣
  • Spade7 🔣
  • John Qing 🔣
  • WolfSlytherin 🔣
  • HyFran 🔣
  • Hansel221 🔣
  • 曾经沧海难为水 🔣
  • Pariance 🔣
  • github-gracefeng 🔣