
Questions about BasicPatternRecognizer #56

Open
josh-marsh opened this issue Jul 22, 2021 · 2 comments
Comments

@josh-marsh

Hi,

Firstly, I just want to say what a wonderful resource this is! I have several questions about the BasicPatternRecognizer:

  1. References - Of the references you have provided on how attention values can explain a model decision in simple terms [1, 2, 3, 4, 5], none seem to mention using attention gradients. If possible, could you either provide any references that informed your thinking on this or give some intuition for why your method works? For example, in BasicPatternRecognizer you use x = tf.reduce_sum(x, axis=[0, 1], keepdims=True) to combine the attention_scores * gradients for all heads and layers. I think I understand why this works, but I've never seen it done before (I've put a small sketch of how I currently read this and the next point just after this list).
  2. Patterns - I am unclear on what some of the code used to construct the patterns is doing. In particular, I don't understand the line w = x[0, text_mask], specifically what the 0 is doing; why do we care about the first row, and why do we use it to calculate the importance of a given pattern?
  3. Combine patterns into one - I would like to be able to create a single visualization of which tokens the model treats as important for a given aspect. I have some ideas, like scaling the weights by importance and combining them, but I really need to first understand the motivation behind the importance metric to do this. Do you have any thoughts on resources I could look at to achieve this? (I would also like to be able to extract the tokens / words which are most important for deciding the sentiment of a given aspect.)
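
For reference, here is a minimal NumPy sketch of how I currently read those two steps; the shapes, the random values, and the text_mask layout are my own assumptions, not the package's actual code:

```python
import numpy as np

# Assumed shapes: 12 layers, 12 heads, a sequence of 8 tokens.
num_layers, num_heads, seq_len = 12, 12, 8
attention_scores = np.random.rand(num_layers, num_heads, seq_len, seq_len)
gradients = np.random.randn(num_layers, num_heads, seq_len, seq_len)

# Question 1: weight each attention value by its gradient, then sum over
# layers (axis 0) and heads (axis 1), leaving one [seq_len, seq_len] map.
x = attention_scores * gradients
x = x.sum(axis=(0, 1))                       # shape: (seq_len, seq_len)

# Question 2: text_mask marks the ordinary text tokens
# (here [CLS] and the trailing special/padding tokens are excluded).
text_mask = np.array([False, True, True, True, True, True, False, False])

# Row 0 is the [CLS] token's combined attention to every other token,
# restricted to the text tokens - one weight per word.
w = x[0, text_mask]                          # shape: (text_mask.sum(),)
```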

Thank you so much!

Josh

@rolczynski
Copy link
Contributor

Please read my blog post here; maybe it'll give you more intuition about how it works - this is an open-ended question 🤭 We take w = x[0, text_mask], the first row vector, because it is related to the [CLS] token, which holds the "general" information about the whole sentence (at least we believe that's true). What I've done is nothing new (or maybe only a little bit). This is the so-called "gradient-based attribution". I really recommend reading Yonatan Belinkov's work, especially his excellent survey here.
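
Very roughly, gradient-based attribution over attention maps looks like the sketch below (TensorFlow, purely illustrative - the model interface and the names are my assumptions for this example, not exactly what the package does):

```python
import tensorflow as tf

# Assumption for this sketch: `model(inputs)` returns its logits and a list
# of per-layer attention tensors, each shaped [batch, heads, seq_len, seq_len],
# and those tensors are the ones actually used in the forward pass, so
# gradients can flow back to them.
def cls_attention_attribution(model, inputs, class_index, text_mask):
    with tf.GradientTape() as tape:
        logits, attentions = model(inputs)
        # The prediction we want to explain.
        score = logits[0, class_index]

    # Gradients of that prediction w.r.t. every attention map
    # (a list with the same shapes as `attentions`).
    grads = tape.gradient(score, attentions)

    # Stack layers: [layers, batch, heads, seq_len, seq_len].
    attentions = tf.stack(attentions)
    grads = tf.stack(grads)

    # Weight each attention value by how strongly the prediction reacts to it,
    # then collapse layers, batch and heads into one [seq_len, seq_len] map.
    x = attentions * grads
    x = tf.reduce_sum(x, axis=[0, 1, 2])

    # The [CLS] row, restricted to text tokens: one weight per word.
    return tf.boolean_mask(x[0], text_mask)
```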

@josh-marsh
Author

Thank you, this is really useful!

One remaining question I have: why do you use w = x[0, text_mask] and not w = x[text_mask, 0], since the first column is also related to the [CLS] token? They have different values, so whether the row or the column related to [CLS] is used affects w and hence the visualizations.

Note: if anyone is interested, I found the best way to combine patterns was pattern_vectors.max(axis=0) / pattern_vectors.max().
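
As a tiny illustration of what that does (the numbers are made up):

```python
import numpy as np

# Three pattern vectors over five text tokens (toy values).
pattern_vectors = np.array([
    [0.1, 0.8, 0.2, 0.0, 0.3],
    [0.4, 0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.6, 0.1],
])

# Per-token maximum across patterns, rescaled so the strongest token is 1.0.
combined = pattern_vectors.max(axis=0) / pattern_vectors.max()
print(combined)   # [0.444... 0.888... 1. 0.666... 0.333...]
```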
