
Publish number of parameters for each task #2

Open
redna11 opened this issue Nov 14, 2020 · 4 comments

redna11 commented Nov 14, 2020

Hello,

you mention: "The new model should be within at best 10% larger in terms of parameters compared to the base Transformer model in the provided config file"

Do you publish what those baseline number of params are respectively for each task?

Thanks

redna11 commented Dec 1, 2020

After running the JAX code and using the structure information in the research paper, I obtained the following parameter counts for each task:

ListOps: 19.9M
Text: 3.5M
Retrieval: 1.087M
Image: 380K
Pathfinder: 315K

Could you kindly confirm that this is indeed correct, so that a fair comparison can be made with alternative models?

Thanks!
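
For reference, here is a minimal sketch (not code from the LRA repo) of how such counts can be obtained in JAX/Flax by summing the sizes of all leaf arrays in the params pytree returned by `model.init()`:

```python
import jax
import jax.numpy as jnp

def count_params(params):
    """Total number of scalar parameters across every leaf array in a pytree."""
    return sum(leaf.size for leaf in jax.tree_util.tree_leaves(params))

# Toy pytree for illustration; with the LRA code one would pass the
# FrozenDict returned by model.init(rng, dummy_input) instead.
toy_params = {
    "dense": {"kernel": jnp.zeros((32, 64)), "bias": jnp.zeros((64,))},
    "layer_norm": {"scale": jnp.ones((64,)), "bias": jnp.zeros((64,))},
}
print(count_params(toy_params))  # 32*64 + 64 + 64 + 64 = 2240
```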

shawnkx commented Apr 28, 2021

Hi, could you tell me how you counted the parameters of the model for the cifar10 task? After running it, my model has just 50k parameters. Thanks!

alexmathfb commented Aug 29, 2021

"The new model should be within at best 10% larger in terms of parameters compared to the base Transformer model in the provided config file."

This constraint is not satisfied in the current code. Using the default hyperparameters for the Image/CIFAR10 task, I found:

Transformer # params: 52,266
Performer # params: 248,458

The Performer model thus doesn't satisfy the 10% constraint: it has more than four times as many parameters as the Transformer model. I suspect this is due to wrong hyperparameters:

Transformer: emb_dim: 32, mlp_dim: 64, num_heads: 1, qkv_dim: 32
Performer: emb_dim: 128, mlp_dim: 128, num_heads: 8, qkv_dim: 64

Everything is larger, which obviously leads to more parameters. Again, I apologize to the authors if this is due to any misconception on my part.
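
As a rough sanity check of the two configs above, one can estimate the per-layer parameter count of a standard Transformer encoder block analytically (biases and LayerNorms included, embeddings and classifier head excluded). This is only an estimate under those assumptions, not the exact accounting used by the LRA code; num_heads drops out because qkv_dim here is the total projection width across heads:

```python
def encoder_layer_params(emb_dim, qkv_dim, mlp_dim):
    """Approximate parameter count of one standard Transformer encoder block."""
    attn = 3 * (emb_dim * qkv_dim + qkv_dim)   # Q, K, V projections (+ biases)
    attn += qkv_dim * emb_dim + emb_dim        # attention output projection
    ffn = emb_dim * mlp_dim + mlp_dim          # FFN dense 1
    ffn += mlp_dim * emb_dim + emb_dim         # FFN dense 2
    norms = 2 * 2 * emb_dim                    # two LayerNorms (scale + bias)
    return attn + ffn + norms

print(encoder_layer_params(32, 32, 64))    # Transformer config: 8,544 per layer
print(encoder_layer_params(128, 64, 128))  # Performer config: 66,624 per layer
```

Even per layer the larger config is several times bigger, so the gap cannot come only from embeddings or the classifier head.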

vladyorsh commented Nov 19, 2021

Hi @redna11,

can you tell me which MLP dim you used when calculating the size of the text classification model? It seems it was 512, while I see 1024 in the LRA config. The information in the paper is also misleading:

"All xformer models are parameterized by the same number of layers, heads and hidden dimensions, namely 8 heads, 512 hidden dimensions and d = 2048 for positional FFN layers."

Basically, the hyperparameters in the paper seem to be doubled.
