Cockatiel #149

Open · wants to merge 11 commits into master
Conversation

fredericboisnard (Contributor)

First PR about the integration of Cockatiel in Xplique.

Signed-off-by: Frederic Boisnard <frederic.boisnard@irt-saintexupery.com>
This patch separates the methods dedicated to image
processing/plotting into different subclasses, so that
Cockatiel and NLP-related classes do not have to inherit
them.

Signed-off-by: Frederic Boisnard <frederic.boisnard@irt-saintexupery.com>
Currently, the fit() method of Craft uses a hardcoded
sampler by default. This patch allows Craft to use
a sampler passed as a parameter.

Signed-off-by: Frederic Boisnard <frederic.boisnard@irt-saintexupery.com>
@fredericboisnard (Contributor Author)

The conflicts must come from the first patch, "test moving the EPSILON & similar to free some RAM". I will remove it from the PR since it was mostly an attempt to improve RAM management on my side (I am not sure we want to take it, and if we do, it would be better in another PR, I suppose).

@AntoninPoche (Collaborator) left a comment

I only read part of the PR for now. We will have to discuss the architecture of the code. It seems that Cockatiel forces us to introduce NLP attributions.

@@ -16,4 +16,5 @@
from .object_detector import BoundingBoxesExplainer
from .global_sensitivity_analysis import SobolAttributionMethod, HsicAttributionMethod
from .gradient_statistics import SmoothGrad, VarGrad, SquareGrad
from .nlp_occlusion import NlpOcclusion
Collaborator:
We should discuss where to place this, but in my view, it should not be in the same folder as the other attribution methods. My suggestions:

  • xplique.attributions.nlp.occlusion
  • xplique.nlp.attributions.occlusion

@@ -40,10 +40,6 @@ class GradCAMPP(GradCAM):
If a string is provided it will look for the layer name.
"""

# Avoid zero division during the procedure. (The value is not important: if the
# denominator is zero, then the numerator will also be zero.)
EPSILON = tf.constant(1e-4)
Collaborator:
Why was it necessary to move this?


def explain(self,
sentence: str,
words: List[str],
Collaborator:
I would allow an easy way not to specify words, so that all the words are occluded, such as accepting None.

I think we should also talk about this one because it requires sentence splitting, and there are many possibilities. Do we include punctuation? Do we split possessive parts? Do we allow the user to decide?

In my view, we should find a simple library that does this and refer to its API for possible parameters. However, this leads to the question: do we include such a library in Xplique's initial requirements, or do we make an additional nlp requirements file?
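For illustration, a minimal sketch of how a None default could work, using a naive whitespace split (the splitting strategy and signature details here are assumptions, not the PR's API):

```python
from typing import List, Optional

def explain(self,
            sentence: str,
            words: Optional[List[str]] = None,
            separator: str = " "):
    # Hypothetical fallback: occlude every word when none are given.
    # A dedicated sentence-splitting library could replace this naive split.
    if words is None:
        words = sentence.split(separator)
    ...
```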

perturbated_words = NlpOcclusion._apply_masks(words, masks)

perturbated_sentences = [sentence]
perturbated_sentences.extend(
Collaborator:
Here the perturbed sentences lack words from the initial sentence: you just concatenated the perturbed words. If I give the following inputs:

sentence = "My cat is in the garden."
words = ["cat", "garden"]
separator = " "

the perturbed_sentences value will be `["My cat is in the garden.", "garden", "cat"]`.

I do not think this is what we expect.
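A minimal sketch of what I would expect instead: each perturbed sentence is the original sentence with the occluded words removed (the naive token matching here also shows why punctuation matters, see the next comment):

```python
from typing import List

def occlude(sentence: str, to_remove: List[str], separator: str = " ") -> str:
    # Keep every token of the original sentence except the occluded words.
    # strip(".,") is a naive way to match "garden." against "garden".
    kept = [tok for tok in sentence.split(separator)
            if tok.strip(".,") not in to_remove]
    return separator.join(kept)

print(occlude("My cat is in the garden.", ["cat"]))
# -> "My is in the garden."
print(occlude("My cat is in the garden.", ["garden"]))
# -> "My cat is in the"  (note the lost final period)
```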

def explain(self,
sentence: str,
words: List[str],
separator: str) -> np.ndarray:
Collaborator:
How do you handle sentences the user provides with punctuation? It may completely change the meaning of a sentence. If you just concatenate words with spaces between them, the final sentence may not be readable.

return WordExtractor()
if extract_fct == "excerpt":
return ExcerptExtractor()
raise ValueError("Extraction function can be only 'clause', \
Collaborator:
You did not include "flaire_clause" and "spacy_clause" in the error message, but you left "clause", which is not a possibility.
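For reference, a sketch of a corrected message, where I am inferring the full list of accepted values from the factory branches visible here and the two clause variants mentioned above:

```python
raise ValueError("Extraction function can only be 'flaire_clause', "
                 "'spacy_clause', 'sentence', 'word' or 'excerpt'.")
```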

Factory for extractor classes.
"""
@staticmethod
def get_extractor(extract_fct="sentence", tagger=None, pipeline=None, clause_type=None):
Collaborator:
I think that language should be an argument of this function.
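Something along these lines, where the extractor class other than the two visible above and its argument name are assumptions on my part:

```python
class ExtractorFactory:
    """Sketch only: adds a `language` argument to the factory."""

    @staticmethod
    def get_extractor(extract_fct="sentence", language="english",
                      tagger=None, pipeline=None, clause_type=None):
        # Forward the language to the extractors that need it
        # (SentenceExtractor and its `language` argument are hypothetical).
        if extract_fct == "sentence":
            return SentenceExtractor(language=language)
        if extract_fct == "word":
            return WordExtractor()
        ...
```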

@@ -7,7 +7,7 @@
import tensorflow as tf

from ..types import Callable, Optional
from ..utils_functions.object_detection import _box_iou, _format_objects, _EPSILON
Collaborator:
I agree that the epsilon should not be loaded at the same time as the operators. But is it a good idea to initialize it at each run of the tf function?
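For context, a minimal sketch of the two placements being discussed (function and variable names are illustrative):

```python
import tensorflow as tf

# Module-level: the constant is created once, at import time.
_EPSILON = tf.constant(1e-4)

@tf.function
def safe_ratio(inter, union):
    # Inside a tf.function, a tf.constant is captured at tracing time and
    # baked into the graph, so it is not re-created on every call either;
    # the trade-off is mostly import-time cost vs. tracing-time cost.
    return inter / (union + _EPSILON)
```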

Parameters
----------
tokenizer
The tokenizer function to be used for tokenization. It must be
Collaborator:
This description needs to be clarified, as no model is defined here. Does this correspond to what you define in the nlp_commons?

If it corresponds to the tokenizer-model pair, as in HuggingFace, then I recommend having a class that handles both at the same time. It is easier to understand, and the API would be more natural.

from ..types import Union, Tuple, Callable, List, Dict


class NlpPreprocessor(ABC):
Collaborator:
This class is called NLP... Should it not be in the nlp commons?

Collaborator:
You wrote NlpProcessor; is it not more natural to call it NLPProcessor, as NLP is an acronym?

@AntoninPoche (Collaborator) left a comment

Great work Fred, this is a huge PR! A complex one at that, which raises several points we should all agree on before merging. I listed several general remarks:

  • There is no description of the PR in general; it would be great to clarify your choices and what you are trying to do.
  • There is no notebook or tutorial; I cannot review the plot function or the API without examples. Furthermore, it will be necessary for the final version.
  • I cannot launch the tests (from GitHub); I think it is because of the conflicts.
  • You did not base your code on the last version of master (1.3.3). I think the memory problems you had were solved by recent pull requests. Rebasing should also resolve the conflicts.
  • You did not write the documentation; it might be better to discuss the choices made before writing it. It will save you some time.
  • You reworked a lot of indentation (automatically, I think), but it created many odd-looking code sections. The line limit is 100 characters in Xplique.
  • We will have to discuss two choices:
    • Where do we put the NLP parts? In my view, an xplique.nlp module is pertinent.
    • The structure of the code (multiple inheritance).
  • The code you added introduces a lot of dependencies on other libraries (flair, nltk, transformers (for tests)); we should discuss how to handle this. Maybe have an additional nlp_requirements.txt file to install when calling xplique.nlp (a sketch of one option follows this list)? Furthermore, we should check the licenses of these libraries: they should be MIT, otherwise we cannot use them.
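One possible shape for the optional dependencies, via a setuptools extra (the package lists below are placeholders, to be decided):

```python
from setuptools import setup, find_packages

setup(
    name="xplique",
    packages=find_packages(),
    install_requires=["tensorflow", "numpy"],   # core, unchanged
    extras_require={
        # installed only with `pip install xplique[nlp]`
        "nlp": ["flair", "nltk"],
        # heavy test-only dependencies such as transformers
        "tests": ["transformers"],
    },
)
```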

self,
input_to_latent_model: Callable,
latent_to_logit_model: Callable,
preprocessor: NlpPreprocessor,
Collaborator:
Here it seems that the user has to wrap their tokenizer using NlpPreprocessor.

Is it possible for the user to give us their tokenizer and for us to wrap it ourselves? Otherwise, it adds a step for the user. I would prefer the API to be as simple as possible.

Contributor Author:

Yes, I understand your concern; it is possible, but one of the goals of the NlpPreprocessor is to gather both the tokenizer and its parameters in one place, in order to proxy the calls to the tokenizer. We can then use this feature in Cockatiel but also in external parts such as nlp_batch_predict():

  • cockatiel._latent_predict() calls self.preprocessor.tokenize(), which passes the right arguments to the tokenizer
  • in torch_operations.nlp_batch_predict(preprocessor), we use preprocessor.preprocess(), which again calls preprocessor.tokenize() to pass the right arguments to the tokenizer

If we pass the tokenizer alone in cockatiel.init(), then we will have to provide a variable number of additional parameters (tokenizer arguments can vary from one tokenizer to another) and duplicate these parameters in nlp_batch_predict() :/

Or perhaps there is a better/easier way to implement this?

Maybe I should rename NlpPreprocessor to something like TokenizerHelper to better show that it simply encapsulates tokenizer calls (see the sketch below)?
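To make the intent concrete, here is a minimal sketch of what I mean (class and method names are illustrative, not the current code):

```python
class TokenizerHelper:
    """Bundle a tokenizer with its call arguments so every caller
    (Cockatiel, nlp_batch_predict, ...) tokenizes the same way."""

    def __init__(self, tokenizer, **tokenizer_kwargs):
        self.tokenizer = tokenizer
        self.tokenizer_kwargs = tokenizer_kwargs

    def tokenize(self, texts):
        # Proxy the call, always forwarding the stored arguments.
        return self.tokenizer(texts, **self.tokenizer_kwargs)

# e.g. with a HuggingFace tokenizer:
# helper = TokenizerHelper(hf_tokenizer, padding=True,
#                          truncation=True, return_tensors="pt")
```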


super().__init__(input_to_latent_model, latent_to_logit_model,
number_of_concepts, batch_size, device=device)
self.preprocessor = preprocessor
self.patch_extractor = ExcerptExtractor()
Collaborator:
You are forcing the patch_extractor here; should it not be an argument of the init function?
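A sketch of the suggested change, with the current behaviour kept as the default (the signature is abridged and hypothetical):

```python
# inside the class __init__ (abridged):
def __init__(self, input_to_latent_model, latent_to_logit_model,
             preprocessor, patch_extractor=None, **kwargs):
    super().__init__(input_to_latent_model, latent_to_logit_model, **kwargs)
    self.preprocessor = preprocessor
    # Default to the current behaviour when no extractor is supplied.
    self.patch_extractor = patch_extractor or ExcerptExtractor()
```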

self.preprocessor = preprocessor
self.patch_extractor = ExcerptExtractor()

def _latent_predict(self, inputs: List[str], resize=None) -> torch.Tensor:
Collaborator:
It seems that in this function you perform computations very similar to those in torch_operations.nlp_batch_predict. What is the reason for not using it, or for not extracting the common part between the two?


return self._to_np_array(activations)

def _extract_patches(self, inputs: List[str]) -> Tuple[List[str], np.ndarray]:
Collaborator:
The name is not clear enough, because the method also computes embeddings. I suggest _crop_and_embed, _split_text_and_embed, or _from_inputs_to_latent_patches.

Collaborator:
Solved in previous pull requests; you should rebase on the last version of master.

Collaborator:
Same, previously solved.


assert np.array_equal(masks, expected_mask)

def test_apply_masks():
Collaborator:
I detected a problem with the Occlusion explain function. If you change the _apply_mask method so that it also takes the sentence as input and returns the perturbed_inputs, then you will be able to test it more precisely.
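A sketch of what such a test could look like, assuming _apply_masks is changed to take the sentence and return the full perturbed inputs (the signature, mask convention, and expected strings are hypothetical):

```python
import numpy as np

def test_apply_masks_on_sentence():
    sentence = "My cat is in the garden."
    words = ["cat", "garden"]
    masks = np.array([[False, True],    # occlude "cat", keep "garden"
                      [True, False]])   # keep "cat", occlude "garden"

    perturbed = NlpOcclusion._apply_masks(sentence, words, masks)

    # Each perturbed input should be the full sentence minus the
    # occluded word, not just the list of surviving words.
    assert perturbed == ["My is in the garden.", "My cat is in the."]
```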

import torch.nn as nn

from torch.nn import MSELoss
from transformers import RobertaPreTrainedModel, RobertaModel
Collaborator:
This creates quite a big dependency for Xplique's tests. The way to manage requirements for development should also be discussed.

nb_excerpts = 2
nb_most_important_concepts = 2

best_sentences_per_concept = cockatiel_explainer_pos.get_best_excerpts_per_concept(
Collaborator:
How can you be sure which sentence will correspond to which concept?
