Research : Binary Mask vs Sparse Mask? #125

Open
Optimox opened this issue Jun 4, 2020 · 3 comments
Labels
enhancement (New feature or request), Research (Ideas to improve architecture)

Comments

@Optimox
Collaborator

Optimox commented Jun 4, 2020

Main Remark

The TabNet architecture uses the sparsemax function to perform instance-wise feature selection, and this is one of the important features of TabNet.

One of the interesting properties of sparsemax is that its outputs sum to 1, but do we really want this?
Is it the role of the mask to perform both selection (0s for unused features) and importance (a value between 0 and 1)?
I would say that the feature transformer should be used to compute importance (by summing the values of the ReLU outputs, as done in the paper), while the masks should be binary and would not sum to 1.

One problem I see with non-binary masks is that they change the values seen by the next layers: if someone is 50 years old and the attention layer thinks that age is half of the solution, then the attention for age would be 0.5, and the next layer would see age = 25. But how can the next layers differentiate between 75 / 3, 50 / 2 and a plain 25? They can't really, so it seems that some information is lost along the way because of the masks. That's why I would be interested to see how binary masks perform!
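To make the concern concrete, here is a minimal sketch (not actual TabNet code, and the mask values are made up) contrasting a soft, sparsemax-style mask with a binary mask over the same support:

```python
import torch

# toy feature vector: age, number of children, weight
features = torch.tensor([50.0, 3.0, 120.0])

# hypothetical sparsemax-style output: sparse, non-negative, sums to 1
soft_mask = torch.tensor([0.5, 0.0, 0.5])

# binary mask keeping the same support, leaving selected values untouched
binary_mask = (soft_mask > 0).float()

print(features * soft_mask)    # tensor([25.,  0., 60.])  -> age now looks like 25
print(features * binary_mask)  # tensor([50.,  0., 120.]) -> original scale preserved
```

With the soft mask, the next block receives 25 and cannot tell whether that came from age = 50 with attention 0.5 or from age = 25 with full attention.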

Proposed Solutions

I'm not quite sure if there are known solutions for this. Would thresholding a softmax work? Would you add this threshold as a parameter, or would it be learnt by the model itself? I'm not even sure that it would work.
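For reference, here is a rough, untested sketch of what a thresholded softmax could look like. The fixed threshold and the straight-through trick (so that gradients still flow through the soft probabilities) are assumptions for illustration, not a validated solution:

```python
import torch

def thresholded_softmax_mask(logits, threshold=0.1):
    """Turn softmax scores into a binary mask (hypothetical, untested idea)."""
    probs = torch.softmax(logits, dim=-1)
    hard = (probs > threshold).float()
    # straight-through estimator: the forward pass uses the hard mask,
    # the backward pass uses the gradient of the soft probabilities
    return hard + probs - probs.detach()

logits = torch.randn(2, 5, requires_grad=True)
mask = thresholded_softmax_mask(logits)
mask.sum().backward()  # gradients reach the logits despite the hard thresholding
```

The threshold could also be made learnable (e.g. an nn.Parameter squashed through a sigmoid), which would be one way to let the model pick it itself.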

If you feel like this is interesting and would like to contribute, please share your ideas in comments or open a PR!

@Optimox added the enhancement label Jun 4, 2020
@Optimox added the Research label Jun 4, 2020
@pangjac

pangjac commented Oct 11, 2021

Hi @Optimox, could you please elaborate on this example:

if someone is 50 years old and the attention layer thinks that age is half of the solution, then the attention for age would be 0.5, and the next layer would see age = 25. But how can the next layers differentiate between 75 / 3, 50 / 2 and a plain 25?

I am a bit confused about how, once the attention layer thinks age is half of the solution, "the next layer" would see age = 25 (where does the 25 come from?). Thank you!

@Optimox
Collaborator Author

Optimox commented Oct 11, 2021

Well, once the attention mask is applied, multiplying age by 0.5 gives you a totally different age. In practice it still works, but I wonder if it would work better with completely binary masks. That's the point.

I once tried to add a very sharp activation that is 0 at 0 but goes up to 1 very quickly, but I remember that it did not change much (and gradients exploded). It would be nice to perform an exhaustive comparison on multiple benchmark datasets.
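The exact function used above isn't specified, so this is only a guess at what such a sharp gate could look like (a tanh with a very steep slope, which maps 0 to 0 and pushes anything noticeably positive close to 1); slopes this large are also a plausible reason for the exploding gradients mentioned:

```python
import torch

def sharpen(mask, slope=100.0):
    # mask values are assumed to lie in [0, 1];
    # 0 stays 0, anything noticeably positive is pushed close to 1
    return torch.tanh(slope * mask)

soft_mask = torch.tensor([0.0, 0.02, 0.3, 0.68])
print(sharpen(soft_mask))  # ~tensor([0.0000, 0.9640, 1.0000, 1.0000])
```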

@haoliangjiang

A binary mask would help in that we could be confident the attention mask lets the entire feature value join the computation of the next block. However, I believe one of the implicit goals is that the network can learn how to pass information so that it can classify or regress correctly. So I do not think binary masks will significantly boost performance, as both kinds of mask are more about making the network more interpretable. But I am not sure about the optimization level, where a binary mask might do something different to the gradients.

Given a well-trained model, I can think of two situations in which the network outputs 25 instead of 50 after the attention. One is that the input data is noisy: the network corrects it by shifting the age to a reasonable range. The other is that the model somehow knows that 25 stands for 50, since the attention mask is mainly based on the input. Both of these situations help with the prediction.

Not a pro. Just sharing my thoughts.
