[WIP] DocBERT Colab notebook #58

Open · mrkarezina wants to merge 4 commits into master

Conversation

mrkarezina commented May 18, 2020

I was working on a Colab notebook to replicate some of the results in the DocBERT paper. The notebook still needs to be polished; I'll add it in a future commit.

One of the issues when running LogisticRegression was the default vocab size of 36308, which causes a shape mismatch in LogisticRegression when using BagOfWordsTrainer. Updating the vocab size to 36311 solves the issue.

Full traceback:
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/hedwig/models/lr/__main__.py", line 85, in <module>
    trainer.train()
  File "/content/hedwig/common/trainers/bow_trainer.py", line 69, in train
    self.train_epoch(train_dataloader)
  File "/content/hedwig/common/trainers/bow_trainer.py", line 43, in train_epoch
    logits = self.model(features)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/hedwig/models/lr/model.py", line 15, in forward
    logit = self.fc1(x)  # (batch, target_size)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1610, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [32 x 36311], m2: [36308 x 90] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:283
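
For context, a minimal sketch of how this mismatch arises (illustrative only, not the hedwig code itself): the LR model's linear layer is sized from the hard-coded VOCAB_SIZE, while the vectorizer fitted at run time produces a different number of features.

import torch
import torch.nn as nn

fc1 = nn.Linear(36308, 90)        # sized from the hard-coded VOCAB_SIZE; weight shape (90, 36308)
features = torch.rand(32, 36311)  # a batch of 32 bag-of-words vectors from the fitted vectorizer

try:
    fc1(features)
except RuntimeError as err:
    print(err)  # shape mismatch between the 36311-dim input and the 36308-dim weight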

It would also be helpful to have a command-line argument for loading the pretrained BERT model provided by Hugging Face instead of having to have locally saved weights.

Notebook-specific:
It's difficult to evaluate the rest of the models, such as BERT-large and the simple LSTM with word2vec, because Colab runs out of RAM (16 GB).

For the pretrained BERT models I used the uncased weights provided by HF. However, in 2 trials of BERT-base on Reuters the test F1 scores are 87.9 and 87.7, slightly lower than the 90.7 reported in the paper.

Split  Dev/Acc.  Dev/Pr.  Dev/Re.   Dev/F1   Dev/Loss
 TEST    0.8323   0.9264   0.8370   0.8794    15.9778
 TEST    0.8297   0.9284   0.8317   0.8774    16.5026

In terms of content for replicating the results, would LR, BERT-base on Reuters (x2), and BERT-base on IMDB be a good start? What other results would it be good to train and include?

@@ -6,7 +6,7 @@
 class ReutersProcessor(BagOfWordsProcessor):
     NAME = 'Reuters'
     NUM_CLASSES = 90
-    VOCAB_SIZE = 36308
+    VOCAB_SIZE = 36311
Member

This is likely due to a mismatch in the scikit-learn version in your environment. In any case, rather than hard-coding the vocabulary size, would it be possible to infer it at run time?

Author

The vocab size is hard-coded for all the datasets. Should I add a method to calculate the vocab size in the BagOfWordsProcessor class that's extended for each dataset?
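
For illustration, such a method could look roughly like this (a sketch only; the method name and the choice of CountVectorizer are assumptions, not the existing hedwig API):

from sklearn.feature_extraction.text import CountVectorizer

class BagOfWordsProcessor:
    def compute_vocab_size(self, train_examples):
        # Fit the same kind of vectorizer the trainer uses, so the reported
        # vocabulary size always matches the feature dimension the LR model
        # sees, regardless of the installed scikit-learn version.
        vectorizer = CountVectorizer()
        vectorizer.fit(example.text for example in train_examples)
        return len(vectorizer.vocabulary_)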

Member

Yup that would work

@@ -7,6 +7,7 @@ def get_args():
     parser = models.args.get_args()

     parser.add_argument('--model', default=None, type=str, required=True)
+    parser.add_argument('--hf-checkpoint', default=False, type=bool)
Member

Rather than adding a new flag, can't we just add the pre-trained model weights to https://git.uwaterloo.ca/jimmylin/hedwig-data? That way everyone would be able to get the models running out of the box

Author

Yes, that would definitely be better! What were the pre-trained models? Were they the checkpoints from Hugging Face or separately trained? I tried using bert-base-uncased and got results comparable to the ones in the paper.

Member

Yup, they are the checkpoints from Hugging Face. Do you have access to any of the DSG servers? We already have the exact pre-trained models we used for our experiments.

Author

No, I don't have access, unfortunately. Is there any way I could download them into the notebook, or should I just wait until they are added to the hedwig-data repo so they are included when cloning?

Member

I've added the pre-trained weights to our data repo: https://git.uwaterloo.ca/jimmylin/hedwig-data

@@ -71,8 +71,9 @@ def evaluate_split(model, processor, tokenizer, args, split='dev'):

 args.is_hierarchical = False
 processor = dataset_map[args.dataset]()
 pretrained_vocab_path = PRETRAINED_VOCAB_ARCHIVE_MAP[args.model]
 tokenizer = BertTokenizer.from_pretrained(pretrained_vocab_path)
+if not args.hf_checkpoint:
Member

I am not sure I understand why we need this condition here

Author

The --hf-checkpoint flag just allowed specifying one of the Hugging Face model checkpoints, which is loaded automatically, making it easier to start training than having to download and move the weights into the hedwig-data directory.

However, if the weights are included in hedwig-data then there's no need for the flag.
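
For reference, what the flag enabled, in the abstract (a self-contained sketch using the transformers API, not hedwig's actual loading code; the num_labels value is the 90 Reuters classes from the diff above):

from transformers import BertForSequenceClassification, BertTokenizer

# With --hf-checkpoint, the model name refers to a Hugging Face checkpoint that is
# downloaded automatically instead of being read from hedwig-data on disk.
checkpoint = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=90)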

Member

We didn't want to have the option to download a checkpoint, to protect against any changes to these checkpoints by Hugging Face, which would invalidate our results.

Author

Oh makes sense.

mrkarezina (Author) commented

Only the logistic regression model seemed to be using the VOCAB_SIZE property from the BOW processor. I rearranged the script to calculate the proper vocab size before initializing the LogisticRegression model.

I also removed the --hf-checkpoint flag, although it would be nice if the weights were already in hedwig-data. Should I rename this PR and make a separate one for the notebook?

@@ -136,5 +136,4 @@ def evaluate_split(model, processor, tokenizer, args, split='dev'):
 model = model.to(device)

 evaluate_split(model, processor, tokenizer, args, split='dev')
-evaluate_split(model, processor, tokenizer, args, split='test')
-
+evaluate_split(model, processor, tokenizer, args, split='test')
Member

Can you please remove this change? It's affecting an unrelated file. Also, as a convention, we have newlines at the end of files.

@@ -71,6 +70,12 @@ def evaluate_split(model, vectorizer, processor, args, split='dev'):
 save_path = os.path.join(args.save_path, dataset_map[args.dataset].NAME)
 os.makedirs(save_path, exist_ok=True)

+if train_examples:
+    train_features = vectorizer.fit_transform([x.text for x in train_examples])
Member

Hmm, but this means that we would run fit_transform twice, right? I am not sure if you ran this on our larger datasets (IMDB or Yelp), but it's a pretty memory- and compute-intensive operation.

Author

Yes, the same fit_transform would also be applied in trainers/bow_trainer.py. It's tricky because you need the vocab size to initialize the LR model, and the BagOfWordsTrainer requires that model.

Would it be OK to pass the train_features to the BagOfWordsTrainer? They're only used by the LR model anyway. It just feels clunky, since the vectorizer is only passed to BagOfWordsTrainer to compute the train/eval features.

Not sure what's a better way to do this.
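
For what it's worth, the underlying pattern is just "fit once, reuse the fitted vectorizer"; a self-contained scikit-learn sketch (not hedwig code, and the toy texts are made up):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["grain exports rise", "oil prices fall", "wheat futures climb"]

vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(texts)   # fit exactly once, on the training split
vocab_size = train_features.shape[1]               # use this to size the LR input layer

# Later splits reuse the fitted vocabulary; no second fit_transform is needed,
# so the expensive fit is not repeated inside the trainer.
eval_features = vectorizer.transform(["corn exports fall"])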

achyudh (Member) commented May 29, 2020

> I also removed the --hf-checkpoint flag, although it would be nice if the weights were already in hedwig-data. Should I rename this PR and make a separate one for the notebook?

Sorry, I don't have much context on the notebook, but it's always preferable for PRs to be atomic, i.e., one pull request with one commit addressing one concern.
