
Decoupling language model computes from within Surprise #187

Open · wants to merge 9 commits into master

Conversation

@TushaarGVS commented Dec 9, 2022

Description

This change decouples language model computation from the Surprise transformer, enabling flexibility in choosing among several other language models (including KenLM) and improving the speed of these computations. It introduces a LanguageModel base class and implements two classes that inherit from it: ConvoKitLanguageModel, into which the existing cross-entropy function has been refactored, and Kenlm, which uses the KenLM model. Furthermore, the computation of surprise has been thread-parallelized, and the language model evaluation functions themselves have also been thread-parallelized for runtime efficiency (although the gains are modest, since we use threads and Python's GIL limits extensive thread parallelism; any observed speed-up can be attributed to file I/O and NumPy operations, and this is to be explored further).
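For concreteness, a minimal sketch of the resulting hierarchy (the class names are from this PR, but the signatures and the unigram scoring below are simplifications for illustration, not the exact implementation):

import numpy as np

class LanguageModel:
    # Base class: evaluation functions are meant to be overridden by subclasses.
    def cross_entropy(self, target, context):
        # A placeholder rather than an abstract method: it only raises an
        # error if a user actually requests this eval_type.
        raise RuntimeError('cross_entropy is not implemented in this language model')

class ConvoKitLanguageModel(LanguageModel):
    # Loosely mirrors the cross-entropy scoring previously built into Surprise.
    def __init__(self, smooth=True, n_jobs=1):
        self._smooth = smooth
        self._n_jobs = n_jobs

    def cross_entropy(self, target, context):
        # Unigram cross-entropy of the target under counts from the context,
        # with optional add-one (Laplace) smoothing.
        counts = {tok: context.count(tok) for tok in set(context)}
        n_context, n_vocab = len(context), len(counts)
        total = 0.0
        for tok in target:
            count = counts.get(tok, 0)
            if self._smooth:
                total += -np.log((count + 1) / (n_context + n_vocab))
            elif count > 0:
                total += -np.log(count / n_context)
        return total / max(len(target), 1)

print(ConvoKitLanguageModel().cross_entropy('a b a'.split(), 'a b c a'.split()))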

The demos/surprise_demo.ipynb and demos/tennis_demo.ipynb demo notebooks have been updated to reflect these changes. To preserve backward compatibility as far as possible, only the smooth argument has been removed from the Surprise class signature; the fit and transform methods still run with the same arguments (use kwargs to pass arguments pertaining to the added functionality). Use LanguageModel.config to view the configuration of an individual LanguageModel instance, and LanguageModel.type to view the type of the language model.
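For instance (a hypothetical session; the exact contents of config and type depend on the model's settings):

lm = convokit.ConvoKitLanguageModel(smooth=True, n_jobs=8)
print(lm.type)    # e.g., 'convokit_lm'
print(lm.config)  # e.g., {'smooth': True, 'n_jobs': 8}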

Other changes include code modularization, the addition of utility modules, and other code formatting. TQDM imports have been modified to enable rendering inside notebooks, and additional (disappearing) TQDM bars have been added to track the progress of surprise computations.


Motivation and context

This change was made to enable flexibility in language modeling. Currently, we use cross-entropy as a measure of how surprising conversations are; however, cross-entropy need not be the only option. One may choose to employ perplexity, or KenLM log-probability scores, instead. In this regard, decoupling the language model computation from Surprise is important, and this change attempts to achieve exactly that.


How has this been tested?

The code has been tested by comparing this implementation against an earlier one: with the same preprocessing, the same results were observed on the tennis dataset (working at the utterance level). Furthermore, the cross-entropy refactoring has been tested using the old ConvoKit surprise demo, and (with some randomness in choosing the context and target) the outputs are about the same, i.e., almost the same ordering of most and least surprising involvements. In the tennis demo, with the use of cross-entropy, the values remained identical.


Usage

Create a new LanguageModel object as follows:

convokit_lm = convokit.ConvoKitLanguageModel(smooth=True, n_jobs=1)  # default language model
kenlm = convokit.Kenlm(
    kenlm_path='/Users/tushaar/kenlm',
    model_filename='kenlm_surprise',
    trained_model_filepath='/Users/tushaar/Desktop/kenlm_models/surprise.arpa',
    is_persistent=True,
)

Use the created LanguageModel when transforming as follows:

surprise.transform(corpus, language_model=kenlm, eval_type='cross_entropy')

Pending

Documentation and final testing (estimated: December 15, 2022).

This commit attempts to decouple the perplexity computes from the
transformer, so as to enable more flexibility in choosing among
several other language models, including KenLM. Furthermore, this
change implements two perplexity modules: the existing cross-entropy
function is refactored into the CrossEntropy class, and the
KenlmPerplexity class uses the KenLM model. Additionally, the
computation of surprise has been thread-parallelized, and the
perplexity functions themselves have also been thread-parallelized
for runtime efficiency.

The surprise_demo has been updated to reflect the changes made. To
preserve backward compatibility as far as possible, only the
'smooth' argument has been removed from the Surprise class
signature; the fit and transform methods still run with the same
arguments (use kwargs to input arguments pertaining to added
functionality).

Use Perplexity.config to view the config of individual perplexity
classes, and Perplexity.type to view the perplexity type.

Other changes include code modularization, the addition of utility
modules, and other code formatting. TQDM imports have been modified
to enable rendering inside notebooks, and additional (disappearing)
TQDM bars have been included to track the progress of surprise
computations.
This change adds type hints to all the functions in the Surprise
transformer. Further, it includes a minor correction so that the
surprise_scores dictionary varies per object type.
The change has been tested by comparing to an earlier KenLM
implementation: with the same preprocessing, the same results
were observed on the tennis dataset (working at the utterance
level). Further, the cross-entropy refactoring has been tested
using the old ConvoKit surprise demo, and (with some randomness in
choosing the context and target) the outputs are about the same,
i.e., almost the same ordering of most and least surprising
involvements. In the tennis demo, the values remained the same.
This change makes minor corrections to the types of specific
functions that were added after the first typing commit.
@jpwchang (Collaborator) commented:

Thanks for the contribution! There are a few changes needed to get everything working with our github infrastructure:

  • The KenLM code introduces a new dependency on (presumably) the kenlm Python package. This dependency needs to be added to setup.py so that it gets automatically installed alongside convokit; right now the automated tests cannot run since this dependency is missing.
  • Please automatically format the new code with the Black formatter to maintain consistency with the style guidelines. Instructions can be found in the contribution guidelines.

In addition to the above, I have a few further minor suggestions related to style and consistency:

  • In the abstract class specification for LanguageModel, the "main" function is called cross_entropy. But this is misleading since, as you note, the whole point of abstracting the language model is to allow the use of metrics other than cross-entropy. This leads to the extremely confusing situation where the KenLM subclass implements cross_entropy, even though the implementation is (presumably) not actually computing cross-entropy. We should therefore use a more generic name such as "score".
  • I'm not a big fan of the name ConvoKitLanguageModel since it's a bit too nonspecific. In general, we should try to name classes so that they are self-descriptive; i.e., a user could make an educated guess at what a class does by looking at its name. Since the purpose of the class is to implement the basic cross-entropy score that used to be built into Surprise, I might suggest renaming the class to CrossEntropyLanguageModel (after having resolved the other issue relating to the use of the term "cross-entropy" mentioned above).
  • This pull request adds two new files, file_utils.py and utils.py; I wonder if we should just merge their contents into the top-level util.py that already exists in ConvoKit. This would reduce redundancy, and having taken a look at the new files, I can easily imagine their functions being useful in places outside of Surprise.
  • I've noticed that the naming of private methods in the new code seems inconsistent: some use double underscores and others use a single underscore. We should standardize on a single naming convention; single underscore is used elsewhere in the code, so we should stick with that.

Thanks again for the contribution, hopefully these asks aren't too much. Let me know if anything is unclear!

@cristiandnm (Contributor) commented Dec 14, 2022 via email

@TushaarGVS (Author) commented:

Thanks for the extensive review, Jonathan. I will address the comments and post my questions before I push the final version of my changes for this PR.

@TushaarGVS (Author) commented Dec 16, 2022

> In the abstract class specification for LanguageModel, the "main" function is called cross_entropy. But this is misleading since, as you note, the whole point of abstracting the language model is to allow the use of metrics other than cross-entropy. This leads to the extremely confusing situation where the KenLM subclass implements cross_entropy, even though the implementation is (presumably) not actually computing cross-entropy. We should therefore use a more generic name such as "score".

Jonathan, the "main" function is evaluate(..., eval_type=''), which takes eval_type as one of its arguments; the user specifies the name of their evaluation function there, and that function is then called via eval_fn = getattr(self, eval_type) inside evaluate(...). The cross_entropy function shown in the LanguageModel class is a placeholder and need not be implemented by the user if unused (hence it only raises a RuntimeError, and only upon usage).
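In other words, the dispatch is roughly the following (simplified; the actual evaluate in this PR takes more arguments and is thread-parallelized):

class LanguageModel:
    def cross_entropy(self, target, context):
        # Errors only when a user actually requests this eval_type.
        raise RuntimeError('cross_entropy is not implemented in this language model')

    def evaluate(self, target_samples, context_samples, eval_type='cross_entropy', **kwargs):
        # Resolve the evaluation function by name on this instance: any method
        # name works, e.g., 'cross_entropy' or 'perplexity', including ones a
        # user defines on their own subclass.
        eval_fn = getattr(self, eval_type)
        return [eval_fn(target, context)
                for target, context in zip(target_samples, context_samples)]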

Also, KenLM returns a log-likelihood score, which is essentially the cross-entropy that we compute; hence, both our model and KenLM compute cross-entropy.
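Concretely, KenLM's score() returns the total log10 probability of a sentence, so a per-token cross-entropy (in nats) is just a rescaled negation of it. A sketch, reusing the model path from the usage snippet above and assuming whitespace tokenization plus an end-of-sentence token:

import math
import kenlm

model = kenlm.Model('/Users/tushaar/Desktop/kenlm_models/surprise.arpa')
sentence = 'what a great serve'
n_tokens = len(sentence.split()) + 1                    # +1 for the </s> token
log10_prob = model.score(sentence, bos=True, eos=True)  # total log10 probability
cross_entropy = -log10_prob * math.log(10) / n_tokens   # per-token, in nats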

> This pull request adds two new files, file_utils.py and utils.py; I wonder if we should just merge their contents into the top-level util.py that already exists in ConvoKit. This would reduce redundancy, and having taken a look at the new files, I can easily imagine their functions being useful in places outside of Surprise.

This is a great idea; I will merge the utilities into the main utils file.

> I'm not a big fan of the name ConvoKitLanguageModel since it's a bit too nonspecific. In general, we should try to name classes so that they are self-descriptive; i.e., a user could make an educated guess at what a class does by looking at its name. Since the purpose of the class is to implement the basic cross-entropy score that used to be built into Surprise, I might suggest renaming the class to CrossEntropyLanguageModel (after having resolved the other issue relating to the use of the term "cross-entropy" mentioned above).

It was initially called CrossEntropy, but per my answer above, CrossEntropyLanguageModel seems too restrictive a name. I think we should come up with a better one.

> I've noticed that the naming of private methods in the new code seems inconsistent: some use double underscores and others use a single underscore. We should standardize on a single naming convention; single underscore is used elsewhere in the code, so we should stick with that.

I have used _ to indicate semi-private members (those that aren't supposed to be accessed outside the class) and __ for truly private members (those that can't be accessed outside the class, even if you wanted to, at least not without _ClassName__variable_name). I would like to leave that as is, because I think it is good programming practice, but let me know if the inconsistency is a big concern and I will change it accordingly.
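For reference, the distinction in question:

class Example:
    def __init__(self):
        self._semi_private = 1  # convention only: still reachable as obj._semi_private
        self.__private = 2      # name-mangled: stored as obj._Example__private

obj = Example()
print(obj._semi_private)      # 1 (works; the underscore just signals "internal")
print(obj._Example__private)  # 2 (obj.__private would raise an AttributeError)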

This change formats the code in accordance with the existing Black
formatter, deletes the utility files created for the Surprise
transformer, and merges their contents into the main utils file.
@TushaarGVS (Author) commented:

> Thanks Jonathan, very good points. One note: I would like to not force KenLM as a requirement for convokit. I think of LM modules like forecaster modules; so the same way we don't force tensorflow for convokit, we should not force KenLM. This should be documented properly (as I believe we do for the CRAFT forecaster module).

I have addressed this, Cristian. The code installs kenlm only when ConvoKit is installed with pip install convokit[kenlm].
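Presumably via an extras_require entry in setup.py along these lines (a sketch; the core dependencies listed here are illustrative, not ConvoKit's actual list):

from setuptools import setup, find_packages

setup(
    name='convokit',
    packages=find_packages(),
    install_requires=['numpy', 'nltk'],  # illustrative core dependencies
    extras_require={
        # kenlm stays optional: pulled in only by `pip install convokit[kenlm]`.
        'kenlm': ['kenlm'],
    },
)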

This commit includes extensive documentation of all the modules
concerning the Surprise transformer, the LanguageModel, Kenlm, and
ConvoKitLanguageModel classes, and the newly added util functions.
@TushaarGVS (Author) commented:

I have finished documenting all the changes; all the classes and functions now have extensive documentation. Also, following Jonathan's comments, I have moved the utility functions into the main utils file. Following your comment, I have ensured that kenlm is not a direct dependency of convokit; instead, it can be enabled with pip install convokit[kenlm].

Additionally, I have implemented a perplexity() function for both language model classes (ConvoKitLanguageModel and Kenlm), in addition to cross_entropy(). I have also rerun both demos with the latest changes to ensure that no breaking changes were introduced.
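For example, reusing the language models from the usage snippet above (and recalling that perplexity is just the exponentiated per-token cross-entropy):

surprise.transform(corpus, language_model=convokit_lm, eval_type='perplexity')
surprise.transform(corpus, language_model=kenlm, eval_type='perplexity')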

This change adds unit tests covering the functionality of the
Surprise transformer, LanguageModel (using nltk.lm), and the added
ConvoKitLanguageModel. The newly added tests were verified to ensure
that they run as intended.