
Questions about BasicPatternRecognizer #56

Open
josh-marsh opened this issue Jul 22, 2021 · 2 comments
Comments

@josh-marsh

Hi,

Firstly, I just want to say what a wonderful resource this is! I have several questions about the BasicPatternRecognizer:

  1. References - Of the references you have provided on how attention values can explain a model decision in simple terms [1, 2, 3, 4, 5], none seem to mention using attention gradients. If possible, could you either provide any references that informed your thinking on this or give some intuition for why your method works? For example, in BasicPatternRecognizer you use x = tf.reduce_sum(x, axis=[0, 1], keepdims=True) to combine the attention_scores * gradients for all heads and layers. I think I understand why this works, but I've never seen it done before (I've put a small sketch of how I currently read this and the next point just after this list).
  2. Patterns - I am unclear on what some of the code used to construct the patterns is doing. In particular, I don't understand the line w = x[0, text_mask], specifically what the 0 is doing; why do we care about the first row, and why do we use it to calculate the importance of a given pattern?
  3. Combine patterns into one - I would like to be able to create a single visualization of which tokens the model treats as important for a given aspect. I have some ideas, like scaling the weights by importance and combining them, but I really need to first understand the motivation behind the importance metric to do this. Do you have any thoughts on resources I could look at to achieve this? (I would also like to be able to extract the tokens / words which are most important for deciding the sentiment of a given aspect.)
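
For reference, here is a minimal NumPy sketch of how I currently read those two steps; the shapes, the random values, and the text_mask layout are my own assumptions, not the package's actual code:

```python
import numpy as np

# Assumed shapes: 12 layers, 12 heads, a sequence of 8 tokens.
num_layers, num_heads, seq_len = 12, 12, 8
attention_scores = np.random.rand(num_layers, num_heads, seq_len, seq_len)
gradients = np.random.randn(num_layers, num_heads, seq_len, seq_len)

# Question 1: weight each attention value by its gradient, then sum over
# layers (axis 0) and heads (axis 1), leaving one [seq_len, seq_len] map.
x = attention_scores * gradients
x = x.sum(axis=(0, 1))                       # shape: (seq_len, seq_len)

# Question 2: text_mask marks the ordinary text tokens
# (here [CLS] and the trailing special/padding tokens are excluded).
text_mask = np.array([False, True, True, True, True, True, False, False])

# Row 0 is the [CLS] token's combined attention to every other token,
# restricted to the text tokens - one weight per word.
w = x[0, text_mask]                          # shape: (text_mask.sum(),)
```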

Thank you so much!

Josh

@rolczynski
Copy link
Contributor

Please read my blog post here; maybe it'll give you more intuition about how it works - this is an open-ended question 🤭 We take w = x[0, text_mask], the first row vector, because it is related to the [CLS] token, which holds the "general" information about the whole sentence (at least we believe that's true). What I've done is nothing new (or maybe only a little bit). This is the so-called "gradient-based attribution". I really recommend reading Yonatan Belinkov's work, especially his excellent survey here.
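
Very roughly, gradient-based attribution over attention maps looks like the sketch below (TensorFlow, purely illustrative - the model interface and the names are my assumptions for this example, not exactly what the package does):

```python
import tensorflow as tf

# Assumption for this sketch: `model(inputs)` returns its logits and a list
# of per-layer attention tensors, each shaped [batch, heads, seq_len, seq_len],
# and those tensors are the ones actually used in the forward pass, so
# gradients can flow back to them.
def cls_attention_attribution(model, inputs, class_index, text_mask):
    with tf.GradientTape() as tape:
        logits, attentions = model(inputs)
        # The prediction we want to explain.
        score = logits[0, class_index]

    # Gradients of that prediction w.r.t. every attention map
    # (a list with the same shapes as `attentions`).
    grads = tape.gradient(score, attentions)

    # Stack layers: [layers, batch, heads, seq_len, seq_len].
    attentions = tf.stack(attentions)
    grads = tf.stack(grads)

    # Weight each attention value by how strongly the prediction reacts to it,
    # then collapse layers, batch and heads into one [seq_len, seq_len] map.
    x = attentions * grads
    x = tf.reduce_sum(x, axis=[0, 1, 2])

    # The [CLS] row, restricted to text tokens: one weight per word.
    return tf.boolean_mask(x[0], text_mask)
```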

@josh-marsh
Author

Thank you, this is really useful!

One remaining question I have: why do you use w = x[0, text_mask] and not w = x[text_mask, 0], since the first column is also related to the [CLS] token? They have different values, so whether the row or the column related to [CLS] is used affects w and hence the visualizations.

Note: if anyone is interested, I found the best way to combine patterns was pattern_vectors.max(axis=0) / pattern_vectors.max().
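
As a tiny illustration of what that does (the numbers are made up):

```python
import numpy as np

# Three pattern vectors over five text tokens (toy values).
pattern_vectors = np.array([
    [0.1, 0.8, 0.2, 0.0, 0.3],
    [0.4, 0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.6, 0.1],
])

# Per-token maximum across patterns, rescaled so the strongest token is 1.0.
combined = pattern_vectors.max(axis=0) / pattern_vectors.max()
print(combined)   # [0.444... 0.888... 1. 0.666... 0.333...]
```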
