Research : Change Attention Transformer Inputs #126

Open
Optimox opened this issue Jun 4, 2020 · 2 comments
Labels: enhancement (New feature or request), Research (Ideas to improve architecture)

Comments

Optimox (Collaborator) commented Jun 4, 2020

Main Remark

Currently in the TabNet architecture, part of the output of each Feature Transformer is used for the predictions (n_d) and the rest (n_a) serves as input for the next Attentive Transformer.

But I see a flaw in this design: the Feature Transformer (let's call it FT_i) sees input masked by the previous Attentive Transformer (AT_{i-1}), so the input features of FT_i don't contain all of the initial information. How can this help select other useful features for the next step?
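To make the remark concrete, here is a minimal sketch of the current per-step flow as I read it. The linear layers are simplified stand-ins for the real Feature/Attentive Transformer blocks and softmax stands in for sparsemax, so this is an illustration rather than the library code:

```python
import torch
import torch.nn as nn

# Minimal sketch of the current per-step flow (simplified, not pytorch-tabnet code).
n_features, n_d, n_a, batch = 10, 8, 8, 32

feature_transformer = nn.Linear(n_features, n_d + n_a)   # stand-in for FT_i
attentive_transformer = nn.Linear(n_a, n_features)        # stand-in for AT_i

x = torch.randn(batch, n_features)      # raw input features
prior = torch.ones(batch, n_features)   # prior scales accumulated from earlier masks
mask = torch.rand(batch, n_features)    # mask M_i produced by AT_{i-1}

out = feature_transformer(mask * x)     # FT_i only ever sees the masked input
d, a = out[:, :n_d], out[:, n_d:]       # n_d part -> prediction, n_a part -> attention
next_mask = torch.softmax(prior * attentive_transformer(a), dim=-1)
# next_mask is built from `a`, which was computed from masked data only:
# AT_i never gets to look at the full raw features.
```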

Proposed Solution

I think the attentive transformer should take the raw features as input to select the next step's features; using the previous mask as a prior, to avoid always selecting the same features at each step, would still work.

So an easy way to try this idea would be to use the feature transformer only for predictions. The attentive transformer could be preceded by its own feature transformer if necessary, but the inputs of an attentive block would be the initial data + the prior from the previous masks.

This could potentially improve the attentive transformer part.
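A rough sketch of one possible reading of this proposal (my assumption about the design, not a definitive implementation; a single shared attentive layer and softmax stand in for the per-step blocks and sparsemax):

```python
import torch
import torch.nn as nn

# Sketch of the proposed variant: the attentive transformer scores the *raw*
# features, and the previous masks only enter through the prior term.
n_features, batch, gamma = 10, 32, 1.3

attentive_transformer = nn.Linear(n_features, n_features)  # hypothetical AT fed with raw x
x = torch.randn(batch, n_features)
prior = torch.ones(batch, n_features)

for _ in range(3):                                          # a few decision steps
    mask = torch.softmax(prior * attentive_transformer(x), dim=-1)
    prior = prior * (gamma - mask)                          # discourage reusing features
    # the feature transformer would then be applied to (mask * x) for prediction only
```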

If you find this interesting, don't hesitate to share your ideas in the comment section or open a PR to propose a solution!

mustaphabenhajm commented

@Optimox Hello, could you please clarify the idea a bit more? Do you mean the input of the attentive transformer will be the initial data + the previous mask, which would replace the priors of the previous step? Thanks

Optimox (Collaborator, Author) commented Jul 19, 2021

Hello @MustaphaBM

I'll try to rephrase what I meant at that time.

The attentive transformer at step 1 takes as input a vector of size n_a, which has been computed by the initial feature transformer (number 0). Up to here I'm totally fine with the idea of masking certain features from this.

Attentive transformer 2, however, gets as input the n_a output of feature transformer 1, but feature transformer 1 has never seen the full data because its input was masked by attentive transformer 1. And here I think there might be something wrong: how can you choose which features to use if you have only seen part of them?

Obviously this would be a real problem if the mask did not change at the instance level; here the mask can adapt to each instance. However, I feel it would be interesting to try creating the mask from the original data and not from the previous attentive transformer's output.

This would somewhat weaken the 'sequential' attention of TabNet, but I think keeping the previous masks as a prior for the update of the next mask could mitigate this (see the small example below).
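Here is a tiny numeric illustration (my own toy example, not library code) of how the prior keeps this sequential dependence even when the attention scores come from the raw features:

```python
import torch

# A feature strongly selected at one step gets a smaller prior,
# and hence a smaller attention weight, at the next step.
gamma = 1.3
prior = torch.ones(5)                                    # 5 features, single sample
mask_step_1 = torch.tensor([0.9, 0.1, 0.0, 0.0, 0.0])    # step 1 mostly used feature 0

prior = prior * (gamma - mask_step_1)                    # prior is now ~[0.4, 1.2, 1.3, 1.3, 1.3]
raw_scores = torch.ones(5)                               # suppose the raw-feature scores are all equal
step_2_mask = torch.softmax(prior * raw_scores, dim=-1)
print(step_2_mask)                                       # feature 0 gets the lowest attention weight
```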

Actually I think this would be quite easy to implement and try, but I'm not sure which dataset I should benchmark on to see whether there is a real improvement.

Hope this is clearer; let me know otherwise. If you run some experiments, I would be interested to hear about the results.
