
Publish number of parameters for each task #2

Open
redna11 opened this issue Nov 14, 2020 · 4 comments

redna11 commented Nov 14, 2020

Hello,

you mention: "The new model should be within at best 10% larger in terms of parameters compared to the base Transformer model in the provided config file"

Do you publish what those baseline number of params are respectively for each task?

Thanks

redna11 commented Dec 1, 2020

After running the JAX code and using the structure information in the research paper, I obtained the following parameter counts for each task:

ListOps: 19.9M
Text: 3.5M
Retrieval: 1.087M
Image: 380K
Pathfinder: 315K

Could you kindly confirm that this is indeed correct, so that a fair comparison can be made with alternative models?

Thanks!
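
For reference, here is a minimal sketch (not code from the LRA repo) of how such counts can be obtained in JAX/Flax by summing the sizes of all leaf arrays in the params pytree returned by `model.init()`:

```python
import jax
import jax.numpy as jnp

def count_params(params):
    """Total number of scalar parameters across every leaf array in a pytree."""
    return sum(leaf.size for leaf in jax.tree_util.tree_leaves(params))

# Toy pytree for illustration; with the LRA code one would pass the
# FrozenDict returned by model.init(rng, dummy_input) instead.
toy_params = {
    "dense": {"kernel": jnp.zeros((32, 64)), "bias": jnp.zeros((64,))},
    "layer_norm": {"scale": jnp.ones((64,)), "bias": jnp.zeros((64,))},
}
print(count_params(toy_params))  # 32*64 + 64 + 64 + 64 = 2240
```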

shawnkx commented Apr 28, 2021

Hi, could you tell me how you counted the parameters of the model for the cifar10 task? After running it, my model has just 50k parameters. Thanks!

alexmathfb commented Aug 29, 2021

"The new model should be within at best 10% larger in terms of parameters compared to the base Transformer model in the provided config file."

This constraint is not satisfied in the current code. Using the default hyperparameters for the Image/CIFAR10 task, I found:

Transformer # params: 52,266
Performer # params: 248,458

The Performer model thus doesn't satisfy the 10% constraint: it has more than four times as many parameters as the Transformer model. I suspect this is due to wrong hyperparameters:

Transformer: emb_dim: 32, mlp_dim: 64, num_heads: 1, qkv_dim: 32
Performer: emb_dim: 128, mlp_dim: 128, num_heads: 8, qkv_dim: 64

Everything is larger, which obviously leads to more parameters. Again, I apologize to the authors if this is due to any misconception on my part.
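
As a rough sanity check of the two configs above, one can estimate the per-layer parameter count of a standard Transformer encoder block analytically (biases and LayerNorms included, embeddings and classifier head excluded). This is only an estimate under those assumptions, not the exact accounting used by the LRA code; num_heads drops out because qkv_dim here is the total projection width across heads:

```python
def encoder_layer_params(emb_dim, qkv_dim, mlp_dim):
    """Approximate parameter count of one standard Transformer encoder block."""
    attn = 3 * (emb_dim * qkv_dim + qkv_dim)   # Q, K, V projections (+ biases)
    attn += qkv_dim * emb_dim + emb_dim        # attention output projection
    ffn = emb_dim * mlp_dim + mlp_dim          # FFN dense 1
    ffn += mlp_dim * emb_dim + emb_dim         # FFN dense 2
    norms = 2 * 2 * emb_dim                    # two LayerNorms (scale + bias)
    return attn + ffn + norms

print(encoder_layer_params(32, 32, 64))    # Transformer config: 8,544 per layer
print(encoder_layer_params(128, 64, 128))  # Performer config: 66,624 per layer
```

Even per layer the larger config is several times bigger, so the gap cannot come only from embeddings or the classifier head.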

vladyorsh commented Nov 19, 2021

Hi @redna11,

can you tell me which MLP dim you used when calculating the size of the text classification model? It seems it was 512, while I see 1024 in the LRA config. The information in the paper is also misleading:

"All xformer models are parameterized by the same number of layers, heads and hidden dimensions, namely 8 heads, 512 hidden dimensions and d = 2048 for positional FFN layers."

Basically, the hyperparameters in the paper seem to be doubled.
