
[early WIP] Fix/rationalize loss-tallying #2922

Draft · wants to merge 3 commits into develop from loss-fixes
Conversation

@gojomo (Collaborator) commented Aug 24, 2020

PR to eventually address loss-tallying issues: #2617, #2735, #2743. Early tinkering stage.

@gojomo force-pushed the loss-fixes branch 6 times, most recently from 8c61787 to 33ef202 on August 28, 2020 18:43
@gojomo (Collaborator, Author) commented Sep 3, 2020

Changes so far in Word2Vec:

  • using float64 for all loss tallying
  • resetting tally to 0.0 per epoch - but remembering history elsewhere for duration of current train() call
  • micro-tallying into a per-batch value rather than the global tally
  • then, adding to global tally rather than replacing it

Though the real goal is sensible loss-tallying across all the relevant classes, I think these small changes (sketched roughly below) already remedy #2735 (float32 precision swallows large loss values) & #2743 (workers' loss tallies clobber each other).
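
A rough Python rendering of that tallying scheme, for illustration only: the real accumulation happens inside the Cython training loops, and names like `LossTally` are not gensim's.

```python
import numpy as np

class LossTally:
    """Illustrative stand-in for the per-epoch tallying scheme described above."""

    def __init__(self):
        self.epoch_loss = np.float64(0.0)  # float64 so large tallies aren't swallowed (see #2735)
        self.epoch_history = []            # per-epoch totals, kept for the duration of train()

    def start_epoch(self):
        self.epoch_loss = np.float64(0.0)  # reset the running tally at each epoch start

    def tally_batch(self, example_losses):
        # micro-tally into a per-batch value first...
        batch_loss = np.float64(0.0)
        for loss in example_losses:
            batch_loss += loss
        # ...then ADD it to the epoch tally rather than replacing it, so one
        # worker's batch no longer clobbers another's contribution (see #2743)
        self.epoch_loss += batch_loss

    def end_epoch(self):
        self.epoch_history.append(self.epoch_loss)

# toy usage: two epochs of large, float32-unfriendly per-example losses
tally = LossTally()
for _ in range(2):
    tally.start_epoch()
    tally.tally_batch([1.5e8, 2.5e8])
    tally.end_epoch()
print(tally.epoch_history)   # two epoch totals of 4e8 each
```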

An oddity from looking at per-epoch loss across a full run: all my hs runs have shown increasing loss every epoch, which makes no sense to me. And yet, the models at the end have moved word-vectors to more useful places (thus passing our minimal sanity-tests). I don't think my small changes could have caused this oddity (but maybe); I suspect something pre-existing in HS-mode loss-tallying is the real reason. When I have a chance, I'll compare against the loss patterns for similar modes & similar data in something like the Facebook FastText code, which also reports a running loss.
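
For anyone wanting to watch the per-epoch tally themselves, here's a minimal sketch using gensim's existing callback hook. The class name and toy corpus are just illustrative, and note that in released gensim get_latest_training_loss() is a running total for the whole train() call, whereas with this PR's changes the tally restarts each epoch.

```python
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class EpochLossLogger(CallbackAny2Vec):
    """Print and record the loss tally reported at the end of each epoch."""

    def __init__(self):
        self.epoch = 0
        self.losses = []

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        self.losses.append(loss)
        print(f"epoch {self.epoch}: tallied loss {loss:.1f}")
        self.epoch += 1

# `corpus` is any iterable of token lists (placeholder, not a real dataset)
corpus = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]] * 100
model = Word2Vec(
    sentences=corpus,
    hs=1, negative=0, sg=0,   # HS CBOW, roughly matching the runs discussed here
    compute_loss=True,        # loss is only tallied at all when this is set
    callbacks=[EpochLossLogger()],
    epochs=5,
    min_count=1,
)
```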

@gojomo (Collaborator, Author) commented Sep 8, 2020

Training FB fasttext (HS, CBOW, no n-grams: ./fasttext cbow -verbose 5 -maxn 0 -bucket 0 -lr 0.025 -loss hs -thread 3 -input ~/Documents/Dev/gensim/enwik9 -output enwik9-cbow-nongrams-lr025-hs) shows decreasing loss reports over the course of training, as expected and unlike the strangely increasing per-epoch loss our code (at least in this PR) reports. But final results on a few quick most_similar ops seem very similar. So something remains odd about our loss reporting, especially in HS mode.

@gojomo (Collaborator, Author) commented Sep 8, 2020

As a point of comparison, Facebook's fasttext reports an "average loss", divided over some trial-count, like so:

(base) gojomo@Gobuntu-2020:~/Documents/Dev/fasttext/fastText-0.9.2$ time ./fasttext cbow -verbose 5 -maxn 0 -bucket 0 -lr 0.025 -loss hs -thread 3 -input ~/Documents/Dev/gensim/enwik9 -output enwik9-cbow-nongrams-lr025-hs
Read 142M words
Number of words:  847816
Number of labels: 0
Progress:  39.8% words/sec/thread:  431099 lr:  0.015052 avg.loss:  5.263475 ETA:   0h 5m31s
Progress:  45.4% words/sec/thread:  429306 lr:  0.013645 avg.loss:  4.725245 ETA:   0h 5m 1s
Progress:  58.6% words/sec/thread:  426932 lr:  0.010339 avg.loss:  3.865230 ETA:   0h 3m50s
Progress: 100.0% words/sec/thread:  422384 lr:  0.000000 avg.loss:  2.483185 ETA:   0h 0m 0s

Gensim should probably collect & report 2Vec-class training loss in a comparable way, so that numbers on algorithmically-analogous runs are broadly similar, for familiarity to users & as a cross-check of whatever it is we're doing.
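
A minimal sketch of the general shape of such reporting follows. The divisor here is assumed to be a count of individual loss computations; whether that matches what FB's code actually divides by is exactly the open question discussed below, so treat this only as the rough shape, not their algorithm.

```python
import numpy as np

class AverageLossReporter:
    """Illustrative fasttext-style 'avg.loss': cumulative loss over a cumulative count."""

    def __init__(self):
        self.loss_sum = np.float64(0.0)
        self.count = 0    # assumed: number of individual loss computations

    def add(self, loss, n=1):
        self.loss_sum += loss
        self.count += n

    @property
    def avg_loss(self):
        return float(self.loss_sum / max(self.count, 1))

# toy usage
reporter = AverageLossReporter()
for batch_losses in ([5.3, 5.2], [4.8, 4.6]):
    reporter.add(sum(batch_losses), n=len(batch_losses))
print(f"avg.loss: {reporter.avg_loss:.6f}")   # 4.975000
```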

@piskvorky (Owner) commented Sep 8, 2020

+1 on matching FB's logic. What is "trial-count"? Is the average taken over words or something else?

@gojomo (Collaborator, Author) commented Sep 8, 2020

Unsure; their C++ (with a separate class for 'loss') is different enough from our code that I couldn't tell at a glance & will need to study it a bit more.

@piskvorky (Owner) commented Feb 19, 2022

@gojomo cleaning up the loss-tallying logic is still very much welcome. Did you figure out the "increasing loss" mystery?

We're planning to make a Gensim release soon – whether this PR gets in now or later, it will be a great addition.

@gojomo (Collaborator, Author) commented Feb 21, 2022

These changes would likely apply, & help a bit in Word2Vec, with just a little adaptation to current develop. I could take a look this week & wouldn't expect any complications.

But getting consistent loss-tallying working in Doc2Vec & FastText, & ensuring a similar calculation & roughly similar loss magnitudes as other libraries (mainly Facebook FastText), would require more effort that's hard to estimate. We kind of need someone who both (1) needs it & (2) can get deep into understanding the code, to rationalize the whole thing.

Never figured out why our hs mode reports growing loss despite the model improving as expected on other checks.
