
Half-life does not meaningfully increase after reps in some conditions #35

brownbat opened this issue Aug 28, 2020 · 17 comments

brownbat commented Aug 28, 2020

Apologies in advance if this is a non-issue. My hunch is this is a failure on my part to understand the methods or documentation.

I'm coming from Anki and Memrise. After each rep, Anki will predict your "half-life" increases significantly, maybe doubles or more. Memrise seems to follow a similar pattern.

When testing how ebisu might behave in similar conditions, it seems that review spacing increases, but only very gradually.

The following short code assumes I'm reviewing a card every time it hits around 75% success. Suppose I succeed every single time. The code then prints the ratio between the prior review period and the next review period.

import ebisu

# Initial model: Beta(3, 3) prior on recall probability at t = 1 time unit.
model = (3, 3, 1)

m2pd = ebisu.modelToPercentileDecay
ur = ebisu.updateRecall

new_model = model
last_test_time = 0

for i in range(30):
    # Review the moment predicted recall decays to 75%...
    test_time = m2pd(new_model, 0.75)
    # ...and succeed every time (1 success out of 1 attempt).
    new_model = ur(new_model, 1, 1, test_time)
    if i > 2:
        # Ratio of this review interval to the previous one.
        print(round(test_time / last_test_time, 3))
    last_test_time = test_time

Based on Ebbinghaus's work and my own performance in Anki, I'd expect those review periods to more than double every time, but I'm not seeing that. The ratios hover around 1.1 (a ~10% increase), usually lower.

I take your point from another comment that you don't like scheduling reviews, that ebisu's strength is that it frees you from scheduling.

But this seems like it would still be an issue even with unscheduled reviews. It would predict that very strong memories are in the worst decile much more quickly than it should.

I'm probably just missing a core aspect of the algorithm, so sorry for the confusion. Maybe you manually double t after each review or something, or just use t as a coefficient to some other backoff function, I'm not sure.

Would appreciate a heads up as to where I went wrong, or let me know if this behavior is just expected. Maybe the algorithm is purely backwards-looking and doesn't try to take into account a rep's ability to strengthen a memory.

@brownbat brownbat changed the title Half-life do not meaningfully increase after reps in some conditions Half-life does not meaningfully increase after reps in some conditions Aug 28, 2020

fasiha commented Aug 28, 2020

Thanks for writing! This is a great question.

Your code snippet is working exactly as it should, and your findings jibe with this online simulator, which additionally lets you tweak the initial model and the success/failure of individual tests: https://svelte.dev/repl/67711abbe3784fccb67dc6b3f614a6d6?version=3.15.0. If you set a=b=3 and "Probability recall percent @ quiz" to 75%, you'll see the same results there as in Python (especially if you also print out the halflife in your loop: print(i, round(test_time / last_test_time, 3), m2pd(new_model, .5))).

The two knobs you should try tuning are:

  1. initial a=b: lowering this will weaken the confidence the algorithm has in your initial halflife and make it more eager to incorporate evidence from quizzes;
  2. also tune the "Probability recall percent @ quiz": lowering it means you spend more time between quizzes, which makes each quiz that much more informative to the algorithm.

In a nutshell, with initial a=b=3, reviewing when probability drops to 75% doesn't give the algorithm a lot of confidence to dramatically change the halflife, so after fifteen reviews, halflife has gone from 1 to 4-ish time units, over a calendar time of twelve time units.

If you drop from 75% to 50%, in fifteen quizzes, the halflife goes from 1 to almost 30, over a calendar time of 114 time units.

If you further drop to a=b=2, and have fifteen correct quizzes in a row, the increase in halflife is even more dramatic.

(I have to shout out @mustafa0x for creating this online explorer for Ebisu in #16; it's the gift that keeps on giving!)


Besides tuning the parameters of your quiz app (initial a=b, and the percentage at which you quiz), I use another trick: I'm going to expose a new function in the API called rescaleHalflife (see discussion) which will allow a user to indicate a numeric scaling after a quiz, something like "this quiz was way too easy and I'm annoyed at how quickly I saw it, increase its halflife by 10x" or "eek please reduce the halflife of this quiz by 1/2".

This gives you a kind of backdoor to the algorithm: you're saying, use the same uncertainty around the halflife (the same Beta distribution), just shift that distribution left or right to this new halflife. I tend to use this on maybe 15~30% of the quizzes on a given day (maybe because I like to micromanage?): I get pretty annoyed when I see things too frequently and I panic when I see things too infrequently. But it's another option available to you if you're overall happy with Ebisu but dissatisfied with a particular quiz.

I haven't released this yet (it's part of a bigger release), but if you'd like to use it, let me know and I'll hurry up and release it.

Hopefully this answers your questions, but please feel free to followup as much as needed!


fasiha commented Aug 28, 2020

And I didn't explicitly answer your question,

Maybe the algorithm is purely backwards looking and doesn't try to take into account a rep's ability to strengthen a memory.

but hopefully it's clear from the above that Ebisu can indeed increase and decrease its estimate of the halflife of a memory based on a review. Exactly how much is governed by the Bayesian math, and is a function of the initial model (the alpha and beta parameters), how long it's been since your last review, and the quiz result itself.


brownbat commented Aug 29, 2020

Super helpful, thanks for the detailed explainer, and the simulator is great to play with.

With a=b=1, targeting a 50% pass rate, continual successes will make the review period double. (Like a 200% ease factor in Anki.)

In fact, it looks like if a=b=1, then calculating the ratio of increased durations is really simple. With continual successes, the gaps between reviews will increase at 1/target_pass_rate. So targeting a 20% success rate would lead to 500% increases in gaps between reviews: 1 time unit, then 5, 25, 125, etc. That would match some of my strongest cards in Anki.
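A tiny sketch of that arithmetic (this is just the claimed 1/target_pass_rate growth rule applied repeatedly, not anything from the ebisu API):

```python
def review_gaps(target_pass_rate, n):
    """Gaps between reviews if each success multiplies the gap by 1/target_pass_rate."""
    growth = 1 / target_pass_rate
    gaps = [1.0]  # first review after 1 time unit
    for _ in range(n - 1):
        gaps.append(gaps[-1] * growth)
    return gaps

print(review_gaps(0.2, 4))  # roughly [1, 5, 25, 125]
```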

But... if you were actually hitting a 20% pass rate, I imagine that could get very demotivating: you'd be failing the overwhelming majority of your reviews.

Do you generally fail the majority of your reviews in order to properly advance the cards you do know?

Or, maybe in practice, you're succeeding consistently on cards ebisu predicts you will only pass 10-20% of the time? I guess that would mean the predictions are only well calibrated after years reviewing a card?

Thanks for your help, and don't mistake my curiosity for criticism, I still think this is a really bold and interesting project and a great set of tools, so just worth understanding in depth.


fasiha commented Aug 29, 2020

Great points!

With a=b=1, targeting a 50% pass rate, continual successes will make the review period double. (Like a 200% ease factor in Anki.)

Initial a=b=1 is the degenerate Beta distribution—equivalent to the Uniform distribution between 0 and 1, indicating to the algorithm that you have no prior belief on the recall probability 1 time unit from now. Therefore the Bayesian update gives you the result you see: exponential growth of the halflife until you hit your first failure.
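A quick way to see the degeneracy (a plain-math sketch, independent of the ebisu package): the Beta(a, b) density with a = b = 1 is constant, i.e., Uniform(0, 1):

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density, computed via the log-Beta function."""
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta)

# With a = b = 1, every recall probability in (0, 1) is equally likely:
# the prior expresses no opinion at all.
print([round(beta_pdf(x, 1, 1), 6) for x in (0.1, 0.5, 0.9)])  # [1.0, 1.0, 1.0]
```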

The README recommends initial alpha=beta≥2 because then the resulting Beta distribution is well-behaved—it's unimodal and has two inflection points at its edges (Wikipedia has a nice summary of this too), but honestly we don't really have a great sense of what initial alpha=beta ought to be set to operationally.

Or, maybe in practice, you're succeeding consistently on cards ebisu predicts you will only pass 10-20% of the time? I guess that would mean the predictions are only well calibrated after years reviewing a card?

It's exactly this. Even when Ebisu predicts 0.001, I regularly get them right.

And this isn't as pathological as it may seem: I like to use initial halflife of one hour, so if I learn a new fact just before bed and then review it when I wake up, I'm eight halflives out, so recall probability is microscopic, yet chances are decent I'll remember what I learned (depending on how vivid I made the mnemonic, how motivated I was to learn this fact, what I had for dinner, what drinks I had with dinner, etc.).
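That "eight halflives out" arithmetic, using just Ebbinghaus-style exponential decay (a sketch of the forgetting curve, not the ebisu API):

```python
def recall_probability(elapsed, halflife):
    """Ebbinghaus-style exponential forgetting: recall halves every halflife."""
    return 2 ** (-elapsed / halflife)

# Learn before bed with a 1-hour halflife, review 8 hours later:
print(recall_probability(8, 1))  # 0.00390625, i.e. ~0.4%
```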

I like using fifteen minutes or one hour initial halflife because I like over-studying things when I first learn them, so if I learn something during the morning, I just like having it reviewed several times throughout the day. This initial model is not remotely an accurate reflection of my prior beliefs on the recall strength of most cards. It just works out that, because my flashcard app will pick a card in the lowest decile (or log-decile) of recall probability, I tend to see cards that are more likely to be weak than not.

(Hence also the need for rescaling halflifes, because most cards are so ill-matched to the default initial model, to the point of annoyance.)

I'm very happy to field your questions and observations! I don't feel any particular sense of ownership or any sensitivity towards the algorithm, it's one of those things where, is mathematics discovered or invented?, the algorithm is what it is—if you start with the initial premises, Ebisu is what follows. And the initial premises aren't great, but they lead to analytically tractable results, and they're transparent (unlike many machine learning algorithms), so if it solves an engineering problem, so much the better :).


brownbat commented Aug 30, 2020

Ah, ok, very helpful!

So the model is:

  1. Assume the forgetting rate is constant,*
  2. Use Bayes to gradually gain confidence in our prediction of that forgetting rate.

And, sure, we all know that in reality, the forgetting rate isn't really constant at first. But ebisu's predictions strengthen slowly over time, and the forgetting rate may hit some plateau and become constant eventually, so they converge.

The downside: very inaccurate predictions at first, maybe for the majority of the reps in a card's life.
The workaround: ignore that completely and focus on the sorting of cards; don't worry about the prediction numbers too much. Just do the most at-risk cards first, and don't bother looking at the odds.

Is that right?

That seems perfectly reasonable for organizing reviews and workflows, and might be all most folks really need.

But it seems like you could pretty easily get much more accurate predictions throughout the entire lifecycle of a card, if you wanted that. It shouldn't change your workflow much, so maybe it's just cosmetic. But more accurate predictions seem useful for their own sake, if you could get them cheaply.

To do that... Suppose instead of predicting a fixed term for half-life, you instead predict some coefficient that we multiply the half-life by after each review.

So here's what that model looks like:

  1. Assume most memories are "well-behaved,"† and decay at a half-life that doubles with each review (i.e., coefficient = 2).
  2. Use Bayes to predict/test/adjust that coefficient.

In the beginning, half-life predictions would be much more accurate, since young memories (per Ebbinghaus) tend to increase exponentially in duration. In the long run, memories may revert to a fixed interval, like a year, at which point the coefficient would need to slowly slide back down to 1, giving a fixed interval.

The biggest improvement on your workflow would be that very new cards that you perform very well on will drop in priority more quickly, which should give you more useful work per review.

The biggest risk would be that the coefficient wouldn't revert to 1 quickly enough as the memory matures, leading to very long intervals with no new data to catch that they're not getting reviewed enough.

I am NOT/NOT recommending any change to ebisu, which is really well implemented and produces very consistent output for lots of its users now.

But... I might try to implement a wrapper for targeting increasing half-lifes for one of my own projects. Happy to let you know how it goes if it sounds intriguing at all.

* well, ok, at least that the half-life period is a consistent interval, even if decay is a curve.
† The first assumption here is also very imperfect! But should be accurate in more situations, I think many more.


fasiha commented Aug 31, 2020

Suppose instead of predicting a fixed term for half-life, you instead predict some coefficient that we multiply the half-life by after each review

So bear with me while I review what Ebisu is doing to make sure we're on the same page, because I'm not confident I understand your proposal—

  • An Ebisu model (a, b, t) specifies a belief on recall t time after review: (recall @ t) ~ Beta(a, b). The "Beta" here doesn't matter, it's just a probability distribution, but what does matter is, notice—there's no halflife specified here.
    • (Aside. When a=b, then t happens to be a halflife, i.e., recall's expectation is 50%. We start out with a=b, but deviate from this after you review. Why not always rebalance? #31 tracks the idea that we always force a=b after reviewing, so t will always be the halflife. But even when that lands, the rest of this continues to apply—model-wise, t is just some time in the future.)
  • When you ask Ebisu for the expected recall probability at time t2, it applies Ebbinghaus' exponential forgetting model to get you a number.
  • When you tell Ebisu that you did a quiz at time t3, poof, that's it, you don't give Ebisu anything else. It'll update the original model and give you (a4, b4, t4), a new model that specifies its belief on recall t4 time after last review.

So the thing I'd like to emphasize is—because the random variable Ebisu concerns itself with is recall, and not halflife, the update phase doesn't know anything about halflives or what halflives ought to do after quizzes. We are gratified to see numerically that halflife does increase after successful quizzes (a lot when the quiz happens way past the halflife, a little bit when the quiz happens way before the halflife), and similarly it decreases after failed quizzes (a lot if you failed just after reviewing, and not by much if you failed way past the halflife), gratified because that's not baked into the model anywhere, it just happens via Bayes.

To make the point even more concrete—we had a discussion in #32 about modeling the halflife explicitly with the Bayesian machinery (and it turns out to be possible but bad from an engineering perspective because predictRecall turned out to be much more expensive than what we have currently). Anyway point is, even then, the update step would update the model with the same parameters:

  • the old distribution around halflife,
  • the time of the quiz, and
  • the quiz result.

It wouldn't enforce any mechanical rule on what halflife does after successful quizzes, like Anki does with increasing the interval by a factor depending on the quiz's ease.

SO. If we were to think about modeling explicitly the change in halflife as a function of quiz result and quiz time, that's certainly worth thinking about—it'll be a big change in modeling, and it's less general than the existing model, but it might turn out to be more accurate. But I want to make sure that that's what you want to do here.

Because, Ebisu now is entirely dependent on you to specify meaningful priors if you want accurate predictions, and I prefer to not spend the effort to reflect on each item I learn and specify my prior on its initial recall probability. And furthermore, even when I use the as-yet-unpublished rescaleHalflife, I'm not concerned with accurate predictions of recall probability as much as moving individual cards relative to other cards.

But that doesn't necessarily mean you can't get accurate predictions out of Ebisu already, with meaningful priors. We can maybe test this actually, with any database of flashcard history data: given the entire history for a flashcard, what initial model is most accurate? If it turns out that there's no good way to fit a good initial model retroactively, then probably Ebisu as it stands now won't give you accurate predictions of recall probability—just predictions that are maybe internally consistent with all the flashcards you're tracking. But it might turn out that for large sets of cards, the same "best initial model" turns up, suggesting the existence of one or more clusters of cards for which we can get good accurate (in an absolute sense) predictions!


brownbat commented Sep 1, 2020

Thanks for the detailed reply.

because the random variable Ebisu concerns itself with is recall, and not halflife, the update phase doesn't know anything about halflives or what halflives ought to do after quizzes.

Yup, I understand. I should have said "x-life" or something. I'm not hung up on halflives specifically, it was just a convenient way to talk about two reviews with roughly equivalent amounts of decay. Or, roughly equivalent probability of expected recall. We might be able to pick any spot. Tenthlife. Fifthlife. Whatever it happens to be. Thanks for the link to #32, really interesting discussion on that!

SO. If we were to think about modeling explicitly the change in halflife as a function of quiz result and quiz time, that's certainly worth thinking about—

Yes, like in so many domains, the thing you're trying to predict changes whenever you measure it.

e.g., every review strengthens a memory somewhat. I think successful reviews strengthen memory by a little over a factor of 2.

By that I mean, a successful review roughly doubles the time it takes to forget a card. Or, roughly doubles the time it takes to get to halflife. Or tenthlife. Or we could say it roughly doubles the probability of successful recall at a certain time delta after the review, relative to the same time delta after the time of the prior review.[0]
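One reason the choice of halflife vs. tenthlife doesn't matter much here: under exponential decay, scaling the halflife scales the time to reach any recall level by the same factor. A stdlib sketch (not the ebisu API):

```python
import math

def time_to_recall(p, halflife):
    """Time for Ebbinghaus-style exponential recall to decay to probability p."""
    return halflife * math.log2(1 / p)

# Doubling the halflife doubles the time to reach ANY recall level, not just 50%:
for p in (0.5, 0.2, 0.1):
    print(round(time_to_recall(p, 2) / time_to_recall(p, 1), 6))  # 2.0 each time
```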

it'll be a big change in modeling,

Yes, this is not trivial at all! I hesitate to even raise it because of that. But if I'm right, the accuracy gains could be significant.

I'm still hoping that I can build a demo as a wrapper function and come back with test data. Will take a bit of work though.

I'm not concerned with accurate predictions of recall probability as much as moving individual cards relative to other cards.

Fair, but if I'm right, this could improve sorting too. I need to build it and test it to prove that though.

But that doesn't necessarily mean you can't get accurate predictions out of Ebisu already, with meaningful priors. We can maybe test this actually, with any database of flashcard history data: given the entire history for a flashcard, what initial model is most accurate?

Most accurate for which review though?

If the model is predicting the strength of memory, and I'm right that memory strength ~doubles after every review, can we really measure the model's accuracy across a range of reviews?

Say we were both watching a rocket launch, and I asked you which speed most accurately describes its movement. You might pick an average speed over some time, you might estimate its final speed... But you might also say, "no, it's not speed we should use to explain this movement at all, but acceleration."

And so with memories, if they are getting stronger by a factor of two every rep, a model that predicts strength will always be playing catch up. If you have a good model for growth over time, then you can derive their strength at any t0, and it will be more accurate.

But... maybe I'm wrong or maybe this is impractical.

Here are the (significant) outstanding questions I need to work on before pushing any further, as I see them:

  1. Am I actually right that memories dramatically increase in strength after every review?

  2. Say g represents growth rate. Could a model with (a, b, t, g) even work?

  3. Could such a model fit in some wrapper function that doesn't actually change how ebisu works at its core, to do no harm and stay completely optional?

  4. Testing: For a deck of many different cards,[1] does an (a, b, t, g) model with reasonable defaults lead to more accurate predictions of recall probability across all reviews? And,

  5. Does this increased accuracy, if it exists, even affect sorting in any meaningful way?

I think this was very clarifying. I definitely have some work to do on those items before I can make any real recommendations though. I wouldn't ever push this big a change unless I had some more solid proof it'd be beneficial, and I really don't yet, and maybe it won't work out for reasons I can't foresee.

[0] The impact of failed reviews on memory is more complicated. Sometimes it means a fact has been completely forgotten, sometimes it means the fact was on the tip of your tongue and the review itself strengthened it almost as much as a successful review. I think this is best set aside for now...

[1] including new easy cards, new hard cards, old easy cards, and old hard cards, and a bunch in between


brownbat commented Sep 1, 2020

For example, this generates "interesting" results, though not yet sure if they're any good. ;)

The final element in the 4-tuple is the growth coefficient. It produces basically reasonable results at first glance.

(The third element explodes exponentially and will appear disconcerting, just ignore it for now.)

I just need to find a way to test this, see if it does any useful work at all in a real deck. It might not!

import ebisu
import random

# Model is the usual (a, b, t) plus a fourth element: the growth coefficient.
model = (3, 3, 1)

m2pd = ebisu.modelToPercentileDecay
ur = ebisu.updateRecall

def update_growth(growth_model, successful, attempts, test_time):
    # Run the standard Ebisu update on the (a, b, t) part of the model...
    original_model = growth_model[:3]
    original_updated_model = ur(original_model, successful, attempts, test_time)

    # ...then fold the resulting halflife change into the growth coefficient.
    adjustment = m2pd(original_updated_model, 0.5) / m2pd(original_model, 0.5)

    new_a = original_updated_model[0]
    new_b = original_updated_model[1]
    new_growth_coefficient = growth_model[3] * adjustment
    new_t = new_growth_coefficient * growth_model[2]

    return (new_a, new_b, new_t, new_growth_coefficient)

new_model = model + (2.1,)  # start with a growth coefficient of 2.1
last_test_time = 0

for i in range(30):
    # Review when predicted recall decays to 85%, and pass 85% of the time.
    test_time = m2pd(new_model[:3], 0.85)
    success = (random.random() < 0.85)
    new_model = update_growth(new_model, success, 1, test_time)
    print(f"{i}: {success} at {round(test_time, 3)}")
    print(f"{new_model}")
    print()
    input()  # pause so you can inspect each step
    last_test_time = test_time

EDIT: ahh, syntax highlighting, neat.


fasiha commented Sep 2, 2020

For example, this generates "interesting" results, though not yet sure if they're any good. ;)

Interesting! Small modification that alas needs a lot of code: instead of just overwriting model[2] (the "time") with new_t, the following snippet uses setHalflife to do this in a more mathematically robust way. The differences between your update_growth and the update_growth2 below are small but might be important as you do more experiments:

# same 30 lines as your snippet, then

import numpy as np
from scipy.special import betaln


def _meanVarToBeta(mean, var):
  "This function was stolen from ebisu.py"
  tmp = mean * (1 - mean) / var - 1
  alpha = mean * tmp
  beta = (1 - mean) * tmp
  return alpha, beta


def setHalflife(prior, newHalflife):
  "Given an Ebisu model, safely time-shift it to any new halflife"
  # Step 1: time-shift `prior` to its halflife via GB1 and Beta-fit there
  (alpha, beta, t) = prior
  oldHalflife = ebisu.modelToPercentileDecay(prior)
  dt = oldHalflife / t

  logDenominator = betaln(alpha, beta)
  logMean = betaln(alpha + dt, beta) - logDenominator
  logm2 = betaln(alpha + 2 * dt, beta) - logDenominator
  mean = np.exp(logMean)
  var = np.exp(logm2) - np.exp(2 * logMean)

  newAlpha, newBeta = _meanVarToBeta(mean, var)

  # Step 2: <indiana-jones.gif> just swap the true old halflife for new
  return (newAlpha, newBeta, newHalflife)


def update_growth2(growth_model, successful, attempts, test_time):
  original_model = growth_model[:3]
  original_updated_model = ur(original_model, successful, attempts, test_time)

  adjustment = m2pd(original_updated_model, 0.5) / m2pd(original_model, 0.5)

  new_growth_coefficient = growth_model[3] * adjustment
  new_t = new_growth_coefficient * growth_model[2]

  return setHalflife(original_updated_model, new_t) + (new_growth_coefficient,)


new_model = model + (2.1,)
last_test_time = 0

rng = random.Random(123)

for i in range(30):
  test_time = m2pd(new_model[:3], 0.85)
  success = (rng.random() < 0.85)
  new_model = update_growth2(new_model, success, 1, test_time)
  print(f"{i}: {success} at {round(test_time,3)} :: {new_model}")
  last_test_time = test_time

Full script at https://gist.github.com/fasiha/f201c0631c34bffd6c6f6be98b32dc5e.

If you swap update_growth vs update_growth2, you'll see the differences (since I'm reusing the random number generator seed). The alpha & beta are slightly different:

  • 29: True at 1320068401.924 :: (7.79487455546467, 7.182445331526644, 12581404517.899096, 2.2354122330130277)
  • 29: True at 1287937938.899 :: (7.139150573286551, 7.139150573286018, 12784520519.795912, 2.238070102895908)

But that doesn't necessarily mean you can't get accurate predictions out of Ebisu already, with meaningful priors. We can maybe test this actually, with any database of flashcard history data: given the entire history for a flashcard, what initial model is most accurate?

Most accurate for which review though?

I was thinking of measuring accuracy over all quizzes of a given flashcard.

Forget for a second about Ebisu. Suppose you start with a list of quiz elapsed-times and results: history = [(tnow, result) for tnow, result in zip([0.5, 1.5, 2.1], [True, True, False])]. And suppose you have some function that predicts recall probability and you've collected them in a list, predictions = [0.9, 0.1, 0.5]. I'm just making up numbers here. Now you can compute the likelihood of all those predictions: likelihood = math.prod([prediction if result else (1-prediction) for ((_, result), prediction) in zip(history, predictions)]). This computes the likelihood of seeing each quiz result and combines them into a single number. The higher this number, the more accurate your predictions matched the observed results.
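Made runnable (same made-up numbers as above; pure standard library):

```python
import math

history = [(0.5, True), (1.5, True), (2.1, False)]  # (elapsed time, quiz result)
predictions = [0.9, 0.1, 0.5]  # some scheme's predicted recall probability per quiz

# Bernoulli probability of each observed result, multiplied across quizzes
likelihood = math.prod(p if result else (1 - p)
                       for (_, result), p in zip(history, predictions))
print(likelihood)  # 0.9 * 0.1 * (1 - 0.5) ≈ 0.045
```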

In the following script, I do this using Ebisu: setting up a specific initial model, I first generate a history of quiz times and results of a student whose memory works exactly as Ebisu predicts. Then, I pretend that I don't know that true model that the history is generated from, and compute the likelihood of that history for a variety of initial models. The higher the likelihood, the more accurate that model.

import ebisu
from random import Random
from scipy.special import logsumexp
import numpy as np


def generateHistory(initialModel, Ntrials, minPrecall, maxPrecall):
  tnows = []
  results = []
  idealProbs = []

  model = initialModel
  for _ in range(Ntrials):
    # When do we find time to review? Random draw, since that's life.
    PrecallAtQuiz = rng.uniform(minPrecall, maxPrecall)
    tnow = ebisu.modelToPercentileDecay(model, PrecallAtQuiz)
    # What's Ebisu's prediction at that time?
    p = ebisu.predictRecall(model, tnow, exact=True)
    # Simulate the quiz
    result = rng.random() < p
    # Update the model
    model = ebisu.updateRecall(model, result, 1, tnow)

    tnows.append(tnow)
    results.append(result)
    idealProbs.append(p)
  return (tnows, results, idealProbs, model)


def likelihood(initialModel, tnows, results):
  model = initialModel
  logProbabilities = []
  for (tnow, result) in zip(tnows, results):
    logPredictedRecall = ebisu.predictRecall(model, tnow)
    # Bernoulli trial's probability mass function: p if result=True, else 1-p, except in log
    logProbabilities.append(logPredictedRecall if result else np.log(-np.expm1(logPredictedRecall)))
    model = ebisu.updateRecall(model, result, 1, tnow)

  # return joint probability assuming independent trials: prod(prob) or sum(logProb)
  return sum(logProbabilities)


if __name__ == "__main__":
  rng = Random(1234)

  init = (3., 3., 4.)
  Ntrials = 200
  minPrecall = 0.15
  maxPrecall = 0.5

  # generate review history that's exactly drawn from the model above
  tnows, results, idealProbs, finalModel = generateHistory(init, Ntrials, minPrecall, maxPrecall)

  print('| summary | value |')
  print('|---------|-------|')
  print("\n".join(
      list(f"| {k} | {v.round(3)} |" for k, v in dict(
          finalHalflife=ebisu.modelToPercentileDecay(finalModel),
          totalTime=sum(tnows),
          medianTimeBetweenQuiz=np.median(tnows)).items())))

  # Now, check the likelihood of a variety of models against that review history.
  # Higher likelihood is better.
  print("| halflife | likelihood |")
  print("|----------|------------|")
  for initHalflifeGuess in [0.1, 0.5, 1, 2, 4, 6, 10, 15, 20]:
    l = likelihood((3, 3, initHalflifeGuess), tnows, results)
    print(f"| {initHalflifeGuess} | {l.round(3)} |")

This simulates the student whose memory has an initial halflife of 4 time units and who reviews this flashcard when their memory of the flashcard drops to between 15% and 50%. After 200 reviews—

| summary | value |
|---------|-------|
| finalHalflife | 6.113 |
| totalTime | 2305.154 |
| medianTimeBetweenQuiz | 11.08 |

Because the student lets their memory get pretty weak, the halflife after 200 (!) quizzes is just 6 time units, barely budged from the initial value of 4. Their median time between quizzes is 11, but there's relatively little growth there, because their memory follows the model and they let their memory get pretty low.

But no matter, that's the history we generate. You might replace this history with the history of any of your flashcards. Suppose we wanted to know what Ebisu memory model best predicted this sequence of quiz results. The script above sweeps over a series of initial halflife guesses and computes the log-likelihoods:

| halflife | likelihood |
|----------|------------|
| 0.1 | -125.553 |
| 0.5 | -121.016 |
| 1 | -119.189 |
| 2 | -117.611 |
| 4 | -116.423 |
| 6 | -116.059 |
| 10 | -116.261 |
| 15 | -117.142 |
| 20 | -118.212 |

The bigger the number, the more accurate the halflife was. In this case, 6 is the max, while halflives between 4 (the true value) and 10 are close.

As you tweak the halflife (final element) in init, minPrecall, maxPrecall, and the random seed, and rerun the script, you'll be able to confirm that usually, when we guess the true halflife, we get the highest or close to the highest log-probability.

So this is what I meant when I said for any scheme that produces probabilities for quizzes, we can quantify how accurate it is over a series of quizzes for a flashcard. In the above script, each initial guess at halflife was a "different scheme" and we verified that the likelihood is usually maximized when our initial halflife guess is close to the true halflife that the history was generated from.

But you can imagine searching for the initial a=b and initial halflife that maximized the likelihood for each flashcard in your Anki deck, and seeing if the resulting set of initial models clustered together, suggesting that one or two initial models do accurately capture your memories. When we have real data, we won't know what the "true" value is but we can still compare different algorithms and different parameters for each algorithm to see which maximizes the likelihood.


Sorry for pasting two huge scripts and writing so much! "I didn't have time to write a short letter"… 😕 hopefully something in the above is useful to you. It was helpful for me: I didn't appreciate how much Ebisu relies on successes at super-low recall probabilities to produce the dramatic halflife increases we'd like. It drives home your initial point: if you time your reviews for when Ebisu predicts 75%, which is a good level for morale, your halflives will grow prettttty slowly.

@fasiha
Owner

fasiha commented Sep 2, 2020

Apologies for posting yet again, I think this comment is independent of the above one.

Basically what we're looking for is a way to (dramatically) increase halflife when we quiz at high predicted recall. E.g., we want to quiz around 75% and want some way of saying that a success boosts halflife by ~2x.

We can actually do that by giving Ebisu total>1 (# of total quizzes per quiz session) at each step.

total>1 is meant for when, instead of binary quizzes, you have binomial quiz sessions, like Duolingo, where the same fact can be reviewed more than once in one sitting: instead of counting these as multiple separate reviews, we say there was one review session with total trials and successes successes.

Here's a simulator: https://svelte.dev/repl/242519396f894f73bbf96d960b804c07?version=3.15.0

It lets you set the total # of quizzes per session at the top; then, as you click the buttons for each quiz, you modulate the # of successes at each sitting, between 0 and total (which, unfortunately, I call nquiz in the Svelte playground). As you increase the # of quizzes per session, even at a=b=3 and Precall @ quiz of 75%, you'll see exponential increases in halflife.

The simulator is a bit temperamental 😢 if it stops responding, check your JS Console: it's likely because Ebisu.js encountered a numerical instability and threw. This is purely a numerical issue: with a high # of quizzes per session, the algorithm suffers catastrophic cancellation in 64-bit doubles, and I haven't yet found a workaround.

If anyone wants to seriously use this technique, of course they'll have to figure out how to map binary or categorical (easy/normal/hard/fail) quiz results to success, for whatever total they settle on (and also deal with the numerical instability, I'm still looking for a fix for that).

I can imagine applying the technique described in the previous comment, about quantifying prediction accuracy via the likelihood of a flashcard's entire history, to benchmark this updater (as well as your update_growth) to see if we can find parameters for each algorithm that maximize the likelihood of the most cards.

@brownbat
Author

brownbat commented Sep 2, 2020

Super helpful and great points! You code really fast; those will be hugely useful for testing.

I have some really short code that simplifies real world test data from Anki's review log.[0] I can test on that.

You seem to have a function that takes (1) a well-formed review history and (2) a model that predicts forgetting, and outputs the accuracy of that model over the history of reviews. So long as it can accept other models, it'd be a really powerful tool for testing any proposed flashcard algorithm against real-world data and comparing their effects easily.

Normalizing all the different flashcard algorithms doesn't seem trivial. Could be hugely powerful though, and sounds fun.

N.B.: I did play around a bit with the success:attempts ratio as a way to strengthen confidence. I think that approach is hard to tune to give good results at both the high and low ends of expected recall probability, but it could work under some conditions.

[0] I know you don't use anki, but for anyone who finds this thread and wants to pull their own data from anki:

In anki, you can open a debug window with ctrl+shift+;
Then fill it with review history using:

text = mw.col.db.all('select * from revlog')
text = ',\n    '.join([str(t) for t in text])
print(text)

You can copy and paste that into a file. Clean up the top, wrap the whole list in a pair of surrounding brackets, and assign it to a variable named revlog in a new Python script (either by importing the file or, if it's small enough, by pasting it in directly).

Then call this function to convert that to a stripped down structure: [card_id, review_timestamp, pass_or_fail], sorted by card_id then timestamp.

def make_new_revlog(revlog):
    # Anki's revlog columns are: id (review timestamp in ms), cid (card id),
    # usn, ease, ivl, lastIvl, factor, time, type.
    new_table = []
    for row in revlog:
        # An ease of 1 is "Again", i.e., a failed review.
        success = row[3] > 1
        # [card_id, review_timestamp, pass_or_fail]
        new_table.append([row[1], row[0], success])
    # Sort by card_id, then by timestamp within each card:
    new_table.sort(key=lambda r: (r[0], r[1]))
    return new_table

@fasiha
Owner

fasiha commented Sep 2, 2020

Normalizing all the different flashcard algorithms doesn't seem trivial.

So for a specific flashcard's history, each algorithm's likelihood can in fact be directly numerically compared to one another. Since likelihood = math.prod(precall if result else 1-precall for result, precall in zip(results, precalls)), and where results is fixed but precalls is the list generated by different algorithms/parameters, you can rank each algorithm/parameter combination with this number. The oracle predictor would predict Precall = 1 just before you succeeded a quiz, and Precall = 0 just before you failed, so its likelihood would be 1. The anti-oracle would do the opposite, and its likelihood would be 0. All other algorithms will result somewhere in between.

(Implementation note—in the script I posted, I use log-likelihood since that's much more numerically stable. Log-likelihood of the oracle predictor is log(1)=0; of the anti-oracle is log(0)=-∞; and you'll expect negative numbers for real algorithms.)

But you're right, it's non-trivial to rank two algorithms over a set of flashcards. You might just combine the likelihoods (i.e., sum log-likelihoods) over all flashcards but that'll allow some flashcards where each algorithm did really well to dominate. You might look to see which algorithm got the higher likelihood for more flashcards (voting).
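The per-card scoring function above can be written out as a standalone sketch. The outcome and prediction lists here are made-up stand-ins for a real card's history and two competing algorithms' predicted recall probabilities:

```python
import math

def log_likelihood(results, precalls):
    # Sum of log-probabilities each algorithm assigned to what actually happened.
    return sum(math.log(p if r else 1 - p) for r, p in zip(results, precalls))

results = [True, True, False, True]   # hypothetical quiz outcomes for one card
sharp = [0.9, 0.8, 0.2, 0.85]         # algorithm A: tracks the outcomes well
vague = [0.5, 0.5, 0.5, 0.5]          # algorithm B: always predicts 50%

# A's predictions match the outcomes better, so its log-likelihood is higher
# (closer to the oracle's 0; the anti-oracle would score -infinity).
assert log_likelihood(results, sharp) > log_likelihood(results, vague)
```

Summing or voting over these per-card scores gives the deck-level comparisons discussed above.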

N.B.: I did play around with the success:attempts ratio to strengthen confidence as a solution a bit. I think that approach is hard to tune to give good results both at the high end of expected probability of recall and at the low end. But could work under some conditions.

Oof, yes, if we go this route we're kind of at the Anki level of customizing the interval increment as a function of quiz result and Precall in a pretty ad hoc way—e.g., if Precall > 80%, use total=8; if Precall>50%, use total=4; if Precall>20%, use total=2… Thanks for digging into this too.

[0] I know you don't use anki, but for anyone who finds this thread and wants to pull their own data from anki:

This webapp also lets you upload your collection.anki2 file and, among other things, download all reviews as CSV: https://fasiha.github.io/fuzzy-anki/ the author was a terrible JavaScript programmer when they wrote this, sorry 😔

cyphar added a commit to YolkiApp/migration-bench that referenced this issue Sep 21, 2020
This is a basic implementation of the import idea, however some quick
tests have found that Ebisu has some pretty major flaws in that it
doesn't actually make use of the more modern model of memory (it only
makes use of the forgetting curve but not the spacing effect) -- meaning
that its internal model is always chasing after the real recall curve.

This is something they are planning on fixing[1] but we should look into
alternative algorithms.

[1]: fasiha/ebisu#35

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
@brownbat
Author

brownbat commented Sep 30, 2020

Sorry for the gap in replies. I'm still thinking about this a lot in the background, but haven't had the resources or time to educate myself enough to take the next steps. I mentioned in another thread, I feel like I should study Bayesian analysis and multiple regressions before getting myself in too much more trouble here.

If anyone else is out there reading this and has good ideas along these lines, by all means jump in... I'm still pretty confident someone could add an extra function that mimics this memory backoff, so it's totally optional and not disruptive to the main ebisu project at all. Then people could test them both to see what works for their project. Hopefully I'll get more time to work on that at some point soon.

@cyphar

cyphar commented Mar 29, 2021

It took me a while to realise that this thread seems to be describing the issue I found and just outlined in #43, albeit with quite different framing and language. As discussed in the reddit thread I linked, my impression is that the core issue has more to do with ebisu's model of cards (treating the half-life as a fundamental property of a card rather than a second-order effect).

I don't have much more to add, just linking to #43 for some more visibility.

@fasiha
Owner

fasiha commented Aug 16, 2021

I apologize for being literally this guy:

Flash from Zootopia

when it comes to figuring out a solution to this issue (closely related to #43). I experimented with some heavyweight solutions, before stepping back and trying to see the big picture and coming up with https://github.com/fasiha/ebisu-likelihood-analysis/blob/main/demo.py

Before talking about that, here's a recap. I've given some explanation over there of the problem @cyphar saw and raised in #43, but here's how I see the fundamental issue.

Suppose we learn a fact at midnight and model our memory with Ebisu model [a, b, t], i.e., recall probability t hours after midnight is Beta(a, b). Then one hour later we do a quiz, and call ebisu.updateRecall to get a new [a2, b2, t2] model. I didn't realize this all these years until @cyphar patiently broke it down for me on Reddit a few months ago, but the new posterior model still only refers to recall t2 hours after midnight. It doesn't encode our belief as of now, after an hour has elapsed. Ebisu generates an increasingly accurate estimate of recall after midnight without ever moving to an hour after midnight, a day after, etc., which is why we saw

  • slow growth in halflife if you review when recall probability dips to 80%,
  • Ebisu predicting much less than 5% recall for cards that you were getting correct, and
  • finding that the maximum likelihood estimates for initial halflife for real cards was 10'000 hours.

So we need some way to convert a posterior for quizzes after midnight to a posterior for quizzes after 1 am, and that's what both @brownbat above and others have asked for.

Forgive me for being so dense to understand this and slow to think of a solution! My incompetence at mathematics is truly gargantuan.

I don't yet have a great solution. But in https://github.com/fasiha/ebisu-likelihood-analysis/blob/main/demo.py I have a framework to help evaluate possible ways to do this translation from midnight to after midnight of our belief on recall.

I picked a very Anki-like translation: after ebisu.updateRecall, just boost the resulting model's halflife by a fixed factor. The code is a bit fancier than that; see these lines, which show how,

  • for failed quizzes, we don't boost the output of updateRecall, (1.0)
  • for hard quizzes, we boost half-way between 1.0 and the base boost, (1.2 if the base boost is 1.4, i.e., 1.4 - (1.4-1)/2)
  • for normal quizzes, we boost the halflife by the base boost, (1.4 e.g.)
  • for easy quizzes, we boost the halflife by more than the base boost (1.6, i.e., 1.4 + (1.4-1)/2))
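The mapping above can be sketched in a few lines of Python. The names result_to_boost and apply_boost are illustrative, not demo.py's actual API, and scaling the model's time parameter is one simple way to scale its halflife:

```python
def result_to_boost(result, base_boost=1.4):
    """Map a quiz result (0=fail, 1=hard, 2=normal, 3=easy) to a halflife boost."""
    half_step = (base_boost - 1) / 2
    return {0: 1.0,                         # fail: no boost
            1: base_boost - half_step,      # hard: half-way between 1.0 and base
            2: base_boost,                  # normal: the base boost
            3: base_boost + half_step}[result]  # easy: beyond the base

def apply_boost(model, boost):
    """Scale an Ebisu (alpha, beta, t) model's halflife by `boost`."""
    alpha, beta, t = model
    return (alpha, beta, t * boost)
```

With base_boost=1.4 this reproduces the 1.0 / 1.2 / 1.4 / 1.6 schedule described above.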

Obviously this can be greatly improved. The goal of Ebisu is to not use magic numbers, to use statistical analysis to estimate these numbers, etc. But https://github.com/fasiha/ebisu-likelihood-analysis/blob/main/demo.py includes a bunch of machinery to evaluate this and other proposed ways to update Ebisu posteriors. If you can think of a better way to boost the models after quizzes, we can test it here.

This is by testing the proposed changes on real data and computing probabilistic likelihoods. In a nutshell, what demo.py does is:

  • loads an Anki database (I had my old collection.anki2, a SQLite database),
  • groups together reviews from the same card (it throws out cards with too few reviews or too few correct reviews),
  • for a given initial model ([initialAlphaBeta, initialAlphaBeta, initialHalflife]) and some baseBoost, it sums up the log-probabilities returned by ebisu.predictRecall for each quiz. This is the likelihood of that model (initialAlphaBeta, initialHalflife, and baseBoost).

Then we sweep over different values of these parameters, initialAlphaBeta, initialHalflife, and baseBoost, and we can make plots that look like this:

[example plot for card 1300038030552: log-likelihood vs. initial halflife, one curve per boost]

This plot shows, for a range of initial halflives (x axis), and a few different boosts (1.0 to 2.0, shown as different colors), the likelihood for a specific card I had with 27 quizzes (23 of them correct). (I fixed initialAlphaBeta=2 because it doesn't really matter.) Some notes:

  • The blue curve above corresponds to a boost of 1.0, i.e., the current Ebisu case. The max likelihood for that case we don't even know: it's still climbing at 1000 hours initial halflife, which is obviously wrong.
  • But for boosts greater than 1.0, we see the likelihood curves peaking. For likelihood, higher is better, so for boosts 1.4 and 1.7, we reach the same maximum likelihood (it's actually log likelihood, so -14; the value itself doesn't matter, since it's the product-of-probabilities which is equal to the sum-of-log-probabilities).
  • This tells us that mindlessly boosting the halflife after each review considerably improves the accuracy and believability of the algorithm.

https://github.com/fasiha/ebisu-likelihood-analysis/blob/main/demo.py will also generate bigger charts, like this:

[grid of example plots, one per card]

If you run it as is, demo.py will look for collection.anki2 (which is a file inside the APKG files that Anki generates—APKG is just a zip file, so if you unzip it, you'll get this collection.anki2 SQLite database plus your images/sounds), load the reviews that correspond to actual undeleted cards, generate a training vs testing set (important for accurately benchmarking competing algorithms), calculate the likelihoods for a bunch of different halflife-boost combinations, and make a few plots.

I'm planning on finding a better method to boost the models after Ebisu's update, but the way I'll know that they're better is that they'll achieve higher likelihoods on more flashcards than worse methods.

Ideas right now:

  • instead of a static baseBoost applied after all quizzes, the boost needs to be dynamic and time-sensitive. I.e., if you review a mature card five times in five minutes, you shouldn't be boosting the halflife by 1.4**5 = 5.4.
  • Take into account the most recent reviews, within a time window. If all are successes, boost halflife by some value greater than 1. If all are failures, boost halflife by some value less than 1.
  • For all these "magic numbers" including the max boost, min boost, number of recent reviews to consider, etc., offer an Ebisu function that calculates these values for a card given your quiz history. I am not at all sure what that API would look like but the hope is we can change updateRecall to do a quick coarse local update of the model, and maybe once a day or once a week, you can run a recalibrate function that takes all quizzes for this flashcard (or all flashcards) and updates these magic numbers by finding which numbers maximize likelihood.
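As one made-up, untested illustration of the first idea, the boost could be scaled by how much of the current halflife elapsed before the quiz, so five reviews in five minutes earn almost no boost while a review after a full halflife earns the full one:

```python
def time_sensitive_boost(base_boost, elapsed_hours, halflife_hours):
    """Full boost only if at least one halflife elapsed before the quiz;
    a quiz moments after the previous one earns almost no boost."""
    fraction = min(1.0, elapsed_hours / halflife_hours)
    return 1.0 + (base_boost - 1.0) * fraction
```

Whether this particular interpolation (or something smoother) wins would be decided by the likelihood benchmark described above.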

I know the script is pretty long, as is this comment, but I wanted to share some detailed thoughts and code about how I'm planning to evaluate proposed algorithms for boosting models after Ebisu's posterior update, i.e., detailing how to use likelihood to evaluate parameters and algorithms.

I really like how Ebisu right now is an automatic algorithm that just works given a couple of input numbers, and I'd like to find a way to do this boosting that retains the minimal mathematical nature of Ebisu currently, but we shall see!

I put some setup and run instructions at the top of https://github.com/fasiha/ebisu-likelihood-analysis/blob/main/demo.py, if you have time, please check it out!

@cryptocoinserver

Incredible work. Will you port this to ANKI as plugin again, too?

@cyphar

cyphar commented Feb 1, 2022

With the (still in beta) v3 Anki scheduler, you can implement custom scheduling plugins in JavaScript. This would allow you to use custom schedulers even with AnkiDroid and the iOS version of Anki.

I suspect once the work on ebisu is finished, it'd be fairly easy to port the code to the v3 scheduler (and update the existing ebisu add-on). If no-one else is planning to do it, I'd be happy to.
