Dev diary: single-atom Beta power law #64

Open
fasiha opened this issue Nov 20, 2023 · 11 comments

Comments

@fasiha
Owner

fasiha commented Nov 20, 2023

I was doodling and realized that while everything in the single-atom Ebisu case (v2 and v3) takes pNow = p**(elapsed / t) for an atom parameterized by [a, b, t] (where the probability of recall at time t is assumed to be a Beta(a, b) random variable, that is, probability of recall at time t ~ Beta(a, b)), there's nothing stopping us from changing this.

The p**elapsed exponentiation is why our Beta random variable decays via an exponential, and we can very very easily get a single Beta to exhibit power-law forgetting by saying pNow = p**log2(1 + elapsed / t). Both these expressions share some nice properties:

  • both p**(elapsed / t) and p**log2(1 + elapsed / t) are 0.5 when elapsed == t and a == b, i.e., t remains a halflife under the new expression
  • both are 1.0 as elapsed → 0 and asymptotically approach 0 as elapsed grows very large.

The difference of course is that the power-law p**log2(1 + elapsed / t) decays muuuch slower than the exponential decay. It turns out that it's very easy to reuse the existing Ebisu v2 Beta+exponential library to do this power-law scheme, since in both cases pNow = p**f(elapsed), i.e., the Beta random variable is raised to some power: elapsed / t for exponential decay, log2(1 + elapsed / t) for power-law decay.
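Concretely, the expected recall under this power-law warp is just the Beta moment E[p**f], a ratio of Beta functions, exactly the trick Ebisu v2 already uses. Here's a minimal sketch (the function name and exact form here are mine, but it reproduces the predictRecall numbers in the IPython session below):

from math import exp, log2
from scipy.special import betaln


def predictRecallPowerLaw(model, elapsedHours):
  # model = (alpha, beta, t): recall probability at time t ~ Beta(alpha, beta)
  a, b, t = model
  f = log2(1 + elapsedHours / t)  # power-law time warp; elapsedHours / t would give exponential decay
  # E[p**f] for p ~ Beta(a, b), computed via log-Beta for numerical stability
  return exp(betaln(a + f, b) - betaln(a, b))


print(predictRecallPowerLaw((2, 2, 10), 100))  # ≈ 0.170, matching the session below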

I have a little script that demonstrates this: https://github.com/fasiha/ebisu/blob/v3-release-candidate/scripts/betapowerlaw.py

To run this,

  1. create a venv or Conda env,
  2. install dependencies: python -m pip install numpy scipy pandas matplotlib tqdm ipython "git+https://github.com/fasiha/ebisu@v3-release-candidate",
  3. then clone this repo and check out the v3-release-candidate branch: git clone https://github.com/fasiha/ebisu.git && cd ebisu && git fetch -a && git checkout v3-release-candidate,
  4. download my Anki reviews database: collection-no-fields.anki2.zip, unzip it, and place collection-no-fields.anki2 in the scripts folder so the script can find it
  5. start ipython: ipython
  6. run the script: %run scripts/betapowerlaw.py. This will produce some text and figures.

Now you can follow along:

In [3]: predictRecall((2, 2, 10), 100) # THIS IS THE NEWLY DEFINED FUNCTION IN betapowerlaw.py
Out[3]: 0.17014120906719243

In [4]: ebisu2.predictRecall((2,2,10), 100, exact=True)
Out[4]: 0.03846153846153846

In [5]: predictRecall((2, 2, 10), 1000)
Out[5]: 0.07175073430740214

In [6]: ebisu2.predictRecall((2,2,10), 1000, exact=True)
Out[6]: 0.0005711022272986858

Above we compare the predicted recall at 10 and 100 halflives:

  • power law decay: 17% and 7% respectively
  • exponential decay: 4% and 0.06% respectively

Running the script above will generate this chart comparing a few models for a few hundred quizzes in terms of log-likelihood:

Four curves, single-atom Beta power-law algorithm

I have a very similar script for benchmarking the v3 ensemble-of-Betas algorithm; %run scripts/analyzeHistory.py will run it and generate this:

Four curves, v3 ensemble algorithm

In the two charts above, higher is better (higher likelihood). Each point corresponds to the sum of log-likelihoods (product of raw likelihoods) over all quizzes for that flashcard. Each curve is sorted from worst log-likelihood to best, and the 125 right-most points are flashcards for which I have no failures.

Looking at these side-by-side:

  • the single-Beta power law algorithm is pretty damn good
  • the Beta-ensemble is better though. Assuming the best model in both cases is close to optimal, the best v3-ensemble algorithm is 2-3 units of log-likelihood higher than the Beta-power-law algorithm's best scenario.

Both scripts also spit out a text file containing per-flashcard, per-quiz details of what likelihood each model assigned to the current quiz and its current halflife. Looking through these is really interesting because you can see how different models result in very different halflives after each quiz. This also emphasizes why benchmarking algorithms via log-likelihood (see fasiha/ebisu.js#23) is tricky: an easy way to "cheat" is just to be overly optimistic, because in general failures are quite uncommon, and the penalty an algorithm incurs by being very wrong about occasional failures is more than made up for by the boost it gets by over-confidently predicting every quiz to be a success. This is really important: an algorithm/model that performs well in terms of sum-of-log-likelihoods isn't necessarily the best; we have to look at how it handles failures and whether the halflives it grows after each quiz are reasonable.

So right now I'm not sure what to do 😂 hence this dev diary—maybe writing things out will give me some ideas. I could try to see if there are better initial parameters that improve on these. I'm also going to investigate whether the halflives produced by the two algorithms are reasonable (since some apps will no doubt want to do the Anki thing and schedule reviews for when recall probability drops below a threshold).

If it turns out the single-atom Beta power law algorithm is good enough, should I scrap the Beta-ensemble model…? 😝!

@L-M-Sherlock

an easy way to "cheat" is just to be overly optimistic, because in general failures are quite uncommon, and the penalty an algorithm incurs by being very wrong about occasional failures is more than made up for by the boost it gets by over-confidently predicting every quiz to be a success.

We can use cross entropy to measure the performance of the model. It cannot be cheated by over-confident predictions, because it measures the difference between the predicted distribution and the real distribution.

@fasiha
Owner Author

fasiha commented Nov 20, 2023

We can use cross entropy to measure the performance of the model. It cannot be cheated by over-confident predictions, because it measures the difference between the predicted distribution and the real distribution.

Oh nice! Is there somewhere I can read about this? Does this work for real flashcard data? Or do you need to simulate flashcards from a known probability distribution in order to obtain the K-L distance between the "true" prior and the distribution assumed by Ebisu/FSRS?

@L-M-Sherlock

Or do you need to simulate flashcards from a known probability distribution in order to obtain the K-L distance between the "true" prior and the distribution assumed by Ebisu/FSRS?

It's very common to use cross entropy as the loss function in classification tasks: https://machinelearningmastery.com/cross-entropy-for-machine-learning/

@fasiha
Owner Author

fasiha commented Nov 20, 2023

Oh I see what you mean 😕 alas, in all my examples, what I call "log likelihood" is literally the cross-entropy from binary classification: per https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html cross-entropy is sum(y * log(p) + (1 - y) * log(1 - p) for y, p in zip(results, probabilities)) where results is a list of bools and probabilities is a list of floats (between 0 and 1), whereas I just call scipy.stats.binom.logpmf in

def _binomialLogProbability(k: int, n: int, p: float) -> float:
  assert k <= n
  return float(binomrv.logpmf(k, n, p))
which, for the Bernoulli case (binary quiz result, i.e., n=1), simplifies to exactly the same per-quiz term before summing. In the Bernoulli/binomial quiz case, the "likelihood" is just the probability mass function, so the two are identical.
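A quick standalone check of that equivalence (not from the scripts, just a sanity test):

from math import isclose, log
from scipy.stats import binom

# For n=1 the binomial log-PMF equals the per-quiz cross-entropy / log-likelihood term
for y, p in [(1, 0.9), (0, 0.9), (1, 0.6), (0, 0.6)]:
  assert isclose(binom.logpmf(y, 1, p), log(p if y else 1 - p))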

I've definitely noticed that this metric is vulnerable to being misled by overconfident models that assign overoptimistic predictions all the time, like I shared in fasiha/ebisu.js#23 (comment). Let's carry on discussing that there!

@ckoshka

ckoshka commented Nov 20, 2023

(disclaimer for the following: I am dyscalculic so I stumbled into this one more or less accidentally at 2am while making a combinatorial analogue of Anki that represents concepts latently and separately from their testing contexts. I have a rough mental image of what it's doing but I kind of just tried every single combination of "Math.log", "/", "*", etc. that I could think of and once it started spitting out good numbers I ran with it. I don't know how it works but I'm suspecting it's similar.)

let decayEstimate = -0.5; // gradient over time
let flexibility = 0.095; 
// how much it freaks out and adjusts if it gets something wrong
let base = 1.168;

//  -- learning --
// first entry is recorded
let $sinceLastSeen = 0.001; // in hours
let $accuracy = 1.0; // between 0 and 1

let impliedDecay = (Math.log($accuracy) / Math.log(base)) / $sinceLastSeen;
// impliedDecay: 0
// i.e, "wow! the user has a perfect memory and will never forget this ever"
let newDecay = average( [decayEstimate, impliedDecay], { ratio: [flexibility, 1 - flexibility] } )
// newDecay: -0.4525
decayEstimate = newDecay;

// ...

// typically after a few repetitions, you will get:
// decayEstimate: -0.35

// -- situation 1: user waited way too long after learning --
let $sinceLastSeen = 24;

let expectedAccuracy = Math.pow(1.168, -0.35 * 24)
// expectedAccuracy: 0.27

// case A (did better than we expected):
let accuracy = 0.84;
// usually was calculated via levenshtein distance, I don't think it matters too much
let impliedDecay = (Math.log(0.84) / Math.log(1.168)) / 24;
// impliedDecay = -0.047
// ...
// -> newDecay = -0.33

// case B (did as well as expected) -> decay remains identical
// case C (did worse than expected) -> decay increases

All that gets stored per concept entry is lastSeen and decay. In retrospect, what I think this is implicitly doing is modelling maturity/interval, strength of belief updates, cross-entropy loss, and inherent difficulty, all with a single parameter. I don't know what the downsides of this approach are, but it was fast to calculate, and it held up surprisingly well across the ~20 people I tested it on (ranging from beginners to advanced learners who wanted daily practice). I wish I knew how to apply it to that test data.

edit: important thing to note, I optimized the hyperparams here for minimizing boredom, not maximizing accuracy.

@fasiha
Owner Author

fasiha commented Nov 20, 2023

@ckoshka thank you so much for sharing! I love these simple, battle-tested models, and I totally adore language learning apps that break the mold 🤩 Langwitch is such a fresh take, I look forward to experimenting with it!

Can you check if the following is right? (I used Python, sorry; if you prefer, I can readily rewrite it in JavaScript.)

from math import log

initDecay = -0.5
flexibility = 0.095
base = 1.168


def update(decay: float, sinceLastSeen: float, accuracy: float) -> float:
  impliedDecay = (log(accuracy) / log(base)) / sinceLastSeen
  return decay * flexibility + impliedDecay * (1 - flexibility)


def predict(decay: float, sinceLastSeen: float) -> float:
  return base**(decay * sinceLastSeen)


model = initDecay
for sinceLastSeen, accuracy in [(24, .2), (12, .6), (36, .3)]:
  prediction = predict(model, sinceLastSeen)
  model = update(model, sinceLastSeen, accuracy)
  print(
      f'{sinceLastSeen} hours: pred={prediction:.2f}. Quiz accuracy={accuracy} → new decay={model}')

That last bit is a little demo that starts with the initial decay and goes through three quizzes, 24 hours, then 12 hours, then 36 hours, updating after each quiz. At each quiz it prints out the prediction and the new decay. This is the output:

24 hours: pred=0.16. Quiz accuracy=0.2 → new decay=-0.4383049072144775
12 hours: pred=0.44. Quiz accuracy=0.6 → new decay=-0.2897170796140843
36 hours: pred=0.20. Quiz accuracy=0.3 → new decay=-0.22242283525112294

If I did the coding right, these numbers hopefully make sense?

I haven't dug enough to know if Langwitch actively prompts users to review a concept or if it's fully passive, but is there an accuracy (returned by predict above) that's "bad", indicating you should review this?

If the above code is right, it might be straightforward to plug this into the test frameworks @L-M-Sherlock and I both have, and see how this algorithm would have handled a bunch of flashcards I'm familiar with (the same ~380 test cases from my original post). Thanks again!!!

@fasiha
Owner Author

fasiha commented Nov 21, 2023

(Apologies for spamming those subscribed 🙇 please feel free to mute this issue, I'll be continuing to use it as a dev diary)

I was thinking about how annoying it is that the sum-of-log-likelihoods under-penalizes overconfident models—for example, consider a flashcard for which the student got five successes followed by a failure. A model that predicted 90% for all six of these would score considerably better than a more conservative model that predicted 60% for all six:

In [10]: from scipy.stats import bernoulli

In [16]: bernoulli.logpmf(1, 0.9) * 5 + bernoulli.logpmf(0, 0.9)
Out[16]: -2.829387671283177

In [17]: bernoulli.logpmf(1, 0.6) * 5 + bernoulli.logpmf(0, 0.6)
Out[17]: -3.4704188507041085

In the sum-of-log-likelihood metric, higher is better. (I'm being very extra by importing bernoulli just to evaluate its probability mass function 😝 logpmf = lambda result, p: log(p if result else 1-p) does exactly the same thing of course.)

I feel like the conservative model that predicts 60% is "better" here and I'd prefer a metric that handled this.

So I was reading up on how machine learning folks handle unbalanced data, which is basically what we have here: failures being a lot less common than successes is what causes this problem. Thanks to @L-M-Sherlock's very kind pointer, I was able to understand that what I called "sum-of-log-likelihoods" is called cross-entropy 🙇. A few years ago a user on the Data Science Stack Exchange pointed to a paper on Arxiv introducing a focal loss metric, a modification to cross-entropy that ameliorates the impact of unbalanced data in an intriguing way. Rather than the score of a single observation being

log(p if result else 1-p)

Lin et al. propose

log(p**((1 - p)**gamma) if result else (1 - p)**(p**gamma))

for some gamma >= 0. When gamma is 0, you get standard cross-entropy (also known as the Bernoulli distribution's log-probability mass function). But for gamma>0 (they recommend gamma=2), this behaves quite interestingly: it ignores easily-predicted data and heavily focuses on "hard, misclassified examples".

Check it out:

from math import log


def bernoulliLogProbabilityFocal(result: bool, p: float, gamma: float = 2) -> float:
  assert 0 <= p <= 1
  assert 0 <= gamma
  focalP = p**((1 - p)**gamma)
  focalQ = (1 - p)**(p**gamma)
  return log(focalP if result else focalQ)

and

In [20]: bernoulliLogProbabilityFocal(True, 0.9, 2)*5 + bernoulliLogProbabilityFocal(False, 0.9, 2)
Out[20]: -1.8703619511080685

In [21]: bernoulliLogProbabilityFocal(True, 0.6, 2)*5 + bernoulliLogProbabilityFocal(False, 0.6, 2)
Out[21]: -0.7385251624874881

With the gamma=2 focal loss, for the example we saw before (five successes and one failure):

  • the overconfident model that predicted 90% for each has a score -1.9 (previously -2.8 under cross entropy)
  • whereas the conservative model that predicted 60% has a score -0.7 (previously -3.5 under cross entropy)

The important thing to see here is that the rankings have switched. The conservative model is better (higher number). Hmmmm, this might be what we want!

So last time we compared a few parameterizations of the v3-ensemble algorithm (an ensemble of weighted Beta distributions over logarithmically-increasing halflives, all decaying exponentially) and the single-Beta power law algorithm. I wanted to find the "best" parameterization for both algorithms for my training sample of ~380 flashcards (~4700 reviews):

  • for the v3-ensemble (see this branch): initial alpha/beta, initial halflife, and firstWeight
  • for the Beta-power-law: initial alpha/beta and initial halflife

I did a quick 2D grid search varying alpha/beta and halflife for both models to see what parameters led to the best sum of focal loss for all flashcards, all reviews in my test set:

Beta power law:

2D heat map of initial alpha/beta along the x-axis going from 1.25 to 4 and initial halflife along the y-axis going from 10 to 400 hours: the hottest/highest score is 1.25 and 100 hours

v3-ensemble, for firstWeight=0.5 (I also tried 0.7 and 0.9 but these were much worse than 0.5):

Similar heat map, x-axis initial alpha/beta going from 1.25 to 3 and halflives going from 10 to 400. The hotspot is at 1.25 and 35

So these tell us the "optimal" parameters for each of these algorithms that yield the best focal loss summed over all flashcards, all quizzes. The quotes around "optimal" are very heavy because this is such a rough grid search, and who knows, maybe focal loss isn't what we'll want to use anyway, but let's see where it leads us.
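For the shape of the computation, here's a toy version of the kind of sweep I mean. The real search sums the focal loss over every quiz of every flashcard in my Anki export (scripts/betapowerlaw.py with GRID_MODE=True); the review history below is made up purely for illustration, and for brevity there are no Bayesian updates between quizzes:

from math import exp, log, log2
from scipy.special import betaln


def predictRecallPowerLaw(model, elapsedHours):
  a, b, t = model
  f = log2(1 + elapsedHours / t)
  return exp(betaln(a + f, b) - betaln(a, b))  # E[p**f] for p ~ Beta(a, b)


def focalLoss(result: bool, p: float, gamma: float = 2) -> float:
  return log(p**((1 - p)**gamma) if result else (1 - p)**(p**gamma))


# (hours elapsed since last review, passed?): fabricated history, demo only
history = [(24, True), (72, True), (200, False), (50, True)]

best = max(((ab, hl) for ab in [1.25, 2, 4] for hl in [10, 100, 400]),
           key=lambda params: sum(
               focalLoss(passed, predictRecallPowerLaw((params[0], params[0], params[1]), dt))
               for dt, passed in history))
print('best (alpha=beta, halflife) on this toy data:', best)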

Picking some parameters around the optima for both algorithms and remaking the charts similar to the first post above yields:

Five curves, single-atom Beta power-law algorithm

Three curves, v3 ensemble algorithm

Umm, I'm not really sure how focal loss can go positive. It could be because I adapted it from its original Bernoulli-probability-mass-function setting to the binomial and fuzzy-binary probability distributions that Ebisu uses (I map my flashcard history from Anki's 1-4 ranks to binomial (4 = easy = 2 out of 2) and fuzzy-binary (1 = hard = q0 = 0.2)), and maybe that's causing some probabilities to become non-probabilities exceeding 1 (so their log goes positive). I'll dig into this more, but I'm hoping the impact of any bugs I find is small.

But assuming this is reasonable, the two algorithms perform surprisingly similarly. The best parameterization of either doesn't really outstrip the other.

Of course I then wanted to see what halflives these algorithms output. For each of the 382 flashcards I can take the final model after the last quiz, compute its 50-percentile decay time (halflife) and its 80-percentile decay time, and plot them:

Two side-by-side figures, that show the same data but sorted differently. X-axis runs from 0 to 381 flashcards. Y-axis is hours in the log scale. Each plot has four lines: v3-ensemble and Beta-power-law, versus 50-percentile and 80-percentile.

It's clear that the v3-ensemble algorithm behaves a lot better than the Beta-power-law algorithm when it comes to halflives. The v3-ensemble final halflives are between a few thousand and a few tens of thousands of hours, and the 80-percentile times are all uniformly a bit less than that. The Beta-power-law, however, has enormous halflives, up to 10^10 hours (a million years…): for a hundred out of 382 flashcards the Beta-power-law halflife is somewhat comparable to the v3-ensemble halflife, but the rest are considerably higher. The 80-percentile times are a lot more reasonable.

I don't think this is a major condemnation of the Beta-power-law model (though I might be biased, this is after all its dev diary 😝). There is literally no serious bound on the power law's recall probability: p**log(1 + elapsedTime) decays verrrrrrry slowly for all time, whereas the v3-ensemble's recall probability is a power-law up to the last atom's halflife (10,000 hours) and then collapses into an exponential. The halflife behavior here honestly makes sense.
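A rough numerical illustration of that last sentence: a weighted mixture of exponentials with log-spaced halflives decays like a power law until its longest halflife, after which the slowest exponential takes over. The halflives and halving weights below are assumptions for the demo, not the exact v3-ensemble construction (that's in the linked branch):

import numpy as np

halflives = np.geomspace(10, 10_000, 5)   # hours, log-spaced like the v3 ensemble's atoms
weights = 0.5 ** np.arange(1, 6)          # assumption: geometrically decaying weights
weights /= weights.sum()

times = np.geomspace(1, 1e6, 13)
mixture = (weights * 2.0 ** (-times[:, None] / halflives)).sum(axis=1)
powerlaw = 2.0 ** (-np.log2(1 + times / 10))  # = 1/(1 + t/10): p=0.5 under the power-law warp, shape comparison only
for t, m, p in zip(times, mixture, powerlaw):
  print(f"{t:>10.0f} h   mixture={m:.3g}   power-law={p:.3g}")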

In terms of usability, I think we could very reasonably build Anki-style quiz apps that schedule reviews when the recall probability drops to 80% or 70%? (I personally don't like such apps; I prefer apps that calculate which card to review now, when I'm free, on my schedule rather than the app's, so I could use some feedback.)

And of course both algorithms output reasonable probability of recall for each flashcard's review. I'll be digging into the individual reviews next and seeing if I prefer one or the other algorithm, now that I have a metric (focal loss) that I think helps deal with unbalanced data.

The code to generate all these charts is in this branch. Setup instructions are at the top of this issue; then:

cd ebisu # make sure you're in the right branch
python scripts/analyzeHistory.py # saves the last plot above, for v3-ensemble
python scripts/betapowerlaw.py # saves the second-to-last plot above, beta-power-law
python scripts/compareEnsemblePowerlawHalflives.py # generates halflife/80-percentile plots

If you run these in ipython you can zoom and analyze the plots. Both scripts analyzeHistory.py and betapowerlaw.py have a GRID_MODE boolean that you can set to True to generate the 2D heat maps.

@L-M-Sherlock

L-M-Sherlock commented Nov 21, 2023

I was thinking about how annoying it is that the sum-of-log-likelihoods under-penalizes overconfident models—for example, consider a flashcard for which the student got five successes followed by a failure. A model that predicted 90% for all six of these would score considerably better than a more conservative model that predicted 60% for all six:

I figured it out. The main divergence in our perspectives is that Ebisu considers the probability of recall at the card level, but FSRS considers P(recall) at the review-history level. I built a dataset here: https://huggingface.co/datasets/open-spaced-repetition/fsrs-dataset. I hope it will be helpful to your development and research.

@fasiha
Owner Author

fasiha commented Nov 25, 2023

The main divergence in our perspectives is that Ebisu considers the probability of recall at the card level, but FSRS considers P(recall) at the review-history level

@L-M-Sherlock yeah this is a great observation. I still am not exactly sure how you convert Ebisu's predictRecall predictions for each card review to the benchmarks in fasiha/ebisu.js#23 (comment) but I'm sure you're taking care to avoid the problems with cross-entropy that I'm finding pop up when handling unbalanced data.


Continuing with the dev diary for this single-Beta power-law model. I found the reason the focal loss was going positive (since focal loss is in the log domain, this means a probability > 1 🙊): I had a typo in the noisy-binary focal loss function, mixing up p vs (1 - p), and fixed it in 6d933d0 and 8cd4f08.

This changes the results of the grid search just a little, and I confirmed that if I map my Anki history to just binary (1 and 2 = fail, 3 and 4 = pass), the overall shape of this graph doesn't really change, giving us confidence that the extensions I've made taking the focal loss from modeling Bernoulli data to binomial/noisy-binary are valid.

Single-atom Beta power-law. Heatmap of initial alpha=beta on the x-axis versus initial halflife on the y-axis, with colors representing the sum of focal loss over all cards, all quizzes. The peak is at alpha=beta=1.25 and for halflives between 115 and 145 hours

Similar heatmap for the v3-ensemble case, but the shape of the peak is different: it's at 30-60 hours and alpha=beta=1.25

So it appears that the bug in focal loss hasn't destroyed our nice algorithm here 😅🙏!

I've been digging into the predictions and halflives generated by this algorithm and I like them. The v3-ensemble model (an ensemble of several Beta random variables at different halflives, each with exponential decay) produces different predictions and posterior halflives compared to this single-Beta power law when it comes to failures, early successes, and surprise long successes, but I think the power law reacts more quickly? In the table below (sorry for how big it is, if possible please read it on desktop), I have the results for flashcard 1300038030806 (in my Anki database, linked in the first post). The results for each quiz are 1-4 (a code sketch of this mapping follows the list):

  1. "fail", a binary quiz, 0 out of 1 point
  2. "hard", a noisy-binary quiz, result=1, q1 = 1, q0 = 0.2
  3. "good", a binary quiz, 1 out of 1 point
  4. "easy", a binomial quiz, 2 out of 2 points.

For each quiz, I show

  1. quiz number,
  2. quiz result,
  3. how many hours elapsed since it was last studied,
  4. power law (initial alpha=beta=1.25 and initial halflife = 125 hours) probability of recall at the time of quiz,
  5. power law posterior halflife after update,
  6. power law, posterior time to 80% probability of recall,
  7. power law focal loss for this quiz (higher is better)
  8. cumulative sum of focal loss,
  9. and similarly for v3-ensemble (firstHalflife=50, firstWeight=0.5, initialAlphaBeta=1.25, lastHalflife=10e3) algorithms predicted probability of recall before seeing the quiz,
  10. v3 posterior halflife after update,
  11. v3 posterior time to 80% recall probability,
  12. focal loss,
  13. and cumulative focal loss.
| # | result | time | plaw pRecall | plaw hl | plaw 80% hl | plaw floss | plaw ∑floss | v3 pRecall | v3 hl | v3 80% hl | v3 floss | v3 ∑floss |
|---|--------|------|--------------|---------|-------------|------------|-------------|------------|-------|-----------|----------|-----------|
| 1 | 2 ~ | 0.3h | 1.00 | 125.4 | 25.1 | -0.0 | -0.0 | 1.00 | 132.5 | 25.7 | -0.0 | -0.0 |
| 2 | 2 ~ | 22.7h | 0.81 | 150.0 | 28.8 | -0.0 | -0.0 | 0.82 | 177.6 | 34.1 | -0.0 | -0.0 |
| 3 | 1 ❌ | 70.3h | 0.65 | 70.6 | 17.2 | -0.4 | -0.4 | 0.68 | 57.2 | 14.8 | -0.5 | -0.5 |
| 4 | 1 ❌ | 34.1h | 0.67 | 46.4 | 12.3 | -0.5 | -0.9 | 0.63 | 31.9 | 9.2 | -0.4 | -0.9 |
| 5 | 3 ✅ | 13.7h | 0.78 | 51.1 | 13.4 | -0.0 | -0.9 | 0.72 | 37.2 | 10.6 | -0.0 | -1.0 |
| 6 | 3 ✅ | 43.8h | 0.54 | 65.1 | 16.6 | -0.1 | -1.1 | 0.45 | 57.1 | 15.9 | -0.2 | -1.2 |
| 7 | 3 ✅ | 148.0h | 0.30 | 104.6 | 24.9 | -0.6 | -1.7 | 0.25 | 169.2 | 42.6 | -0.8 | -2.0 |
| 8 | 3 ✅ | 204.3h | 0.35 | 162.9 | 35.5 | -0.4 | -2.1 | 0.45 | 384.1 | 94.8 | -0.2 | -2.2 |
| 9 | 3 ✅ | 357.9h | 0.35 | 267.6 | 51.2 | -0.5 | -2.5 | 0.52 | 733.7 | 186.6 | -0.2 | -2.4 |
| 10 | 3 ✅ | 541.1h | 0.38 | 449.3 | 72.5 | -0.4 | -2.9 | 0.58 | 1194.4 | 309.9 | -0.1 | -2.5 |
| 11 | 4 ✅+ | 1032.0h | 0.38 | 1444.9 | 142.1 | -0.7 | -3.7 | 0.54 | 2781.5 | 731.8 | -0.3 | -2.7 |
| 12 | 1 ❌ | 2545.4h | 0.44 | 916.0 | 113.4 | -0.1 | -3.8 | 0.52 | 1835.7 | 516.3 | -0.2 | -2.9 |
| 13 | 4 ✅+ | 49.1h | 0.89 | 1043.6 | 121.9 | -0.0 | -3.8 | 0.98 | 1881.5 | 529.1 | -0.0 | -2.9 |
| 14 | 4 ✅+ | 192.4h | 0.74 | 1492.4 | 147.6 | -0.0 | -3.8 | 0.92 | 2061.7 | 579.3 | -0.0 | -2.9 |
| 15 | 1 ❌ | 693.5h | 0.59 | 961.1 | 118.1 | -0.3 | -4.1 | 0.77 | 1401.0 | 408.1 | -0.9 | -3.8 |
| 16 | 4 ✅+ | 47.8h | 0.90 | 1064.5 | 125.0 | -0.0 | -4.1 | 0.97 | 1432.9 | 417.2 | -0.0 | -3.8 |
| 17 | 4 ✅+ | 119.3h | 0.81 | 1310.7 | 139.9 | -0.0 | -4.2 | 0.94 | 1513.3 | 440.0 | -0.0 | -3.8 |
| 18 | 3 ✅ | 359.8h | 0.67 | 1611.5 | 155.9 | -0.0 | -4.2 | 0.83 | 1637.0 | 474.9 | -0.0 | -3.8 |
| 19 | 3 ✅ | 747.6h | 0.59 | 2155.5 | 180.5 | -0.1 | -4.3 | 0.71 | 1904.5 | 549.7 | -0.0 | -3.8 |
| 20 | 3 ✅ | 94.9h | 0.87 | 2343.4 | 188.0 | -0.0 | -4.3 | 0.96 | 1939.6 | 559.4 | -0.0 | -3.8 |
| 21 | 1 ❌ | 1847.5h | 0.53 | 1592.5 | 156.5 | -0.2 | -4.5 | 0.51 | 1490.9 | 443.0 | -0.2 | -4.0 |
| 22 | 3 ✅ | 27.0h | 0.95 | 1632.5 | 158.5 | -0.0 | -4.5 | 0.99 | 1497.9 | 445.0 | -0.0 | -4.0 |
| 23 | 3 ✅ | 96.9h | 0.85 | 1755.2 | 164.5 | -0.0 | -4.5 | 0.95 | 1523.4 | 452.3 | -0.0 | -4.0 |
| 24 | 4 ✅+ | 236.9h | 0.75 | 2289.7 | 187.9 | -0.0 | -4.5 | 0.89 | 1651.2 | 488.7 | -0.0 | -4.0 |
| 25 | 3 ✅ | 809.3h | 0.62 | 2934.1 | 211.7 | -0.1 | -4.6 | 0.70 | 1884.1 | 554.0 | -0.0 | -4.0 |
| 26 | 3 ✅ | 713.2h | 0.66 | 3701.3 | 235.9 | -0.0 | -4.7 | 0.75 | 2106.7 | 615.4 | -0.0 | -4.1 |

We see that by the 11th quiz, after a run of ✅s with the last one 1000 hours after the previous review, the v3 ensemble (right-most columns) has exploded the halflife to almost 2800 hours, while the power-law algorithm is still conservative and predicts a roughly 1500-hour halflife. Anki chose to schedule the 12th quiz 2500 hours later, which failed. There are three runs of ✅s, and in each the power law chooses to strengthen its halflife in a different way from the v3 algorithm. Both choices seem defensible, and the probability of recall predicted by each is not bad.

The main issue that authors of Anki-style apps will see is that the time to reach 80% recall probability is extremely conservative for the power-law algorithm. Given that this quiz schedule (picked by Anki) results in the power-law algorithm assigning the quizzes 60-ish percent predicted recall probability, quiz apps might use the time to 60% (or 70%) recall to schedule reviews? I don't have good instincts when it comes to designing apps like this (as I said, I much prefer apps that don't schedule cards but rather check which are most at risk of being forgotten when I am ready to spend time reviewing, on my schedule), but it seems like there's a risk when the time-to-80% is too soon but time-to-70% is too far? Is this inevitable when scheduling reviews in the future, and should I not worry about it?
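For what it's worth, computing the time at which the power-law model's expected recall drops to any target (80%, 70%, 60%) is just a one-dimensional root-find. A sketch under the same predictRecall assumption as earlier (the actual scripts compute their percentile-decay times their own way):

from math import exp, log2
from scipy.special import betaln
from scipy.optimize import brentq


def predictRecallPowerLaw(model, elapsedHours):
  a, b, t = model
  f = log2(1 + elapsedHours / t)
  return exp(betaln(a + f, b) - betaln(a, b))


def timeToRecall(model, target: float) -> float:
  # find elapsed hours where expected recall crosses `target` (recall is monotone decreasing)
  hi = 1e-3
  while predictRecallPowerLaw(model, hi) > target:
    hi *= 2  # expand the bracket until we pass the target
  return brentq(lambda dt: predictRecallPowerLaw(model, dt) - target, 1e-9, hi)


model = (1.25, 1.25, 125.0)  # initial parameters near the grid-search optimum above
for target in (0.8, 0.7, 0.6):
  print(target, round(timeToRecall(model, target), 1), 'hours')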

For apps that I tend to write, of course, that don't schedule reviews, the fact that the power-law model predicts probability of recall reasonably is very exciting. We now have a few different algorithms that solve the core issue raised in #43: we have an ensemble of Gamma random variables, an ensemble of Beta random variables (reusing Ebisu2), and also potentially this single-Beta power-law algorithm (also leveraging Ebisu2).

@L-M-Sherlock

I still am not exactly sure how you convert Ebisu's predictions for each card review to the benchmarks

The code is here: https://github.com/open-spaced-repetition/fsrs-benchmark/blob/0da0a9adeecb4d354ce3f8933fc2d5ee300a9bb6/other.py#L234-L264

I'm sure you're taking care to avoid the problems with cross-entropy that I'm finding pop up when handling unbalanced data.

I think the unbalanced data is not a big deal for FSRS, because FSRS can predict the probability of recall accurately:


Source: https://www.reddit.com/r/Anki/comments/15mab6e/fsrs_explained_part_2_accuracy/

The main issue that authors of Anki-style apps will see is that the time to reach 80% recall probability is extremely conservative for the power-law algorithm. Given that this quiz schedule (picked by Anki) results in the power-law algorithm assigning the quizzes 60-ish percent predicted recall probability, quiz apps might use the time to 60% (or 70%) recall to schedule reviews?

FSRS allows users to set their desired retention, and uses this formula to schedule the interval based on desired retention and stability:

$I(r,S) = 9 \cdot S \cdot \left(\cfrac{1}{r} - 1\right)$
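In code, that's a direct transcription of the formula above (r is the desired retention, S is FSRS's stability; the returned interval is in the same time units as S):

def fsrsInterval(desiredRetention: float, stability: float) -> float:
  # I(r, S) = 9 * S * (1/r - 1)
  return 9 * stability * (1 / desiredRetention - 1)


print(fsrsInterval(0.9, 100.0))  # ≈ 100: at desired retention 0.9 the interval equals the stability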
