feat: add inference and evaluation script with dataset transformations #733

Open
wants to merge 12 commits into base: main
Conversation

@mattmazzola (Contributor) commented Jun 9, 2023

⚠️ This PR is not intended to merge directly, but to share work from our fork which may be useful to Metaseq ⚠️

Issue

  1. Wanted a way to evaluate models using the same methods and commands we used for training them
  2. Wanted a structured way to use different metrics or normalize differences depending on the dataset being tested
  3. Wanted to use the same evaluation metrics used by Azure Babel for comparison
  4. Wanted to be able to generate few-shot prompts for inference

Solutions

  1. Add script for model inference and evaluation

    1. Add evaluation support for HuggingFace, ParlAI, COCO, and Grindstone implementations of HELM metrics
    2. Update the Dockerfile to install evaluation dependencies
      1. This is unlikely to be the correct modification
    3. Move _flatten_config to metaseq.utils since it is used by multiple modules
  2. Add mappings between datasets and the pipeline configuration of eval libraries, metrics, and transformation functions (a sketch follows this list)

    1. Metrics Example: summarization datasets (Reddit and CNN-DM) should use evaluation libraries with ROUGE-L and BERTScore-F, while classification datasets (HellaSwag and PIQA) should use evaluation libraries with an Accuracy metric
    2. Transformation Example: a HellaSwag model output such as `(4) something something` is normalized to the bare label `4`
  3. Add necessary evaluation libraries and re-implement some metrics

  4. Add PromptGenerator to create few-shot prompts based on configuration using Jinja templates (see the sketch below)
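
The dataset-to-pipeline mapping and the transformation example above could look roughly like the sketch below. This is a minimal illustration with hypothetical names (`DATASET_EVAL_CONFIG`, `normalize_hellaswag_output`), not the actual configuration structure in this PR:

```python
import re

# Hypothetical helper: normalize a HellaSwag-style output such as
# "(4) something something" down to the bare label "4".
def normalize_hellaswag_output(text: str) -> str:
    match = re.match(r"\s*\((\d+)\)", text)
    return match.group(1) if match else text.strip()

# Hypothetical dataset -> evaluation pipeline mapping: each dataset is paired
# with the metrics it should be scored with and an optional output transform.
DATASET_EVAL_CONFIG = {
    "cnn_dm":    {"metrics": ["rouge-l", "bertscore-f"], "transform": None},
    "reddit":    {"metrics": ["rouge-l", "bertscore-f"], "transform": None},
    "hellaswag": {"metrics": ["accuracy"], "transform": normalize_hellaswag_output},
    "piqa":      {"metrics": ["accuracy"], "transform": None},
}

def prepare_prediction(dataset: str, prediction: str) -> str:
    # Apply the dataset-specific transformation (if any) before scoring.
    transform = DATASET_EVAL_CONFIG[dataset]["transform"]
    return transform(prediction) if transform else prediction
```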

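For item 4, here is a minimal sketch of few-shot prompt rendering with Jinja. The template and function names are illustrative, not the PR's actual PromptGenerator API:

```python
from jinja2 import Template

# Hypothetical few-shot template: exemplars followed by the query.
FEW_SHOT_TEMPLATE = Template(
    "{% for ex in examples %}"
    "Question: {{ ex['question'] }}\nAnswer: {{ ex['answer'] }}\n\n"
    "{% endfor %}"
    "Question: {{ query }}\nAnswer:"
)

def build_few_shot_prompt(examples, query):
    # examples: list of {"question": ..., "answer": ...} dicts
    return FEW_SHOT_TEMPLATE.render(examples=examples, query=query)

# Example:
# build_few_shot_prompt([{"question": "2 + 2?", "answer": "4"}], "3 + 3?")
```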
This PR is quite large, so it may be hard to make sense of.
Originally it was only going to be inference.py and a few other modifications, but I kept bringing in missing dependencies to avoid gaps and it grew a lot 🤔

Testing

Did not test 😔

Related to: #726

Much of this work was done by @sahajgg, @tupini07, and @anselmwang 🙏

Comment on lines +44 to +45
tokenizer_vocab_file_path="/mnt/input_data_dir/pretrained_models/OPT/dependencies/gpt2-vocab.json",
tokenizer_merges_file_path="/mnt/input_data_dir/pretrained_models/OPT/dependencies/gpt2-merges.txt",
If Metaseq has a standardized path for the vocab and merges files, then we'll need to replace these here :) If not, we might need to remove the default values.
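
As a sketch of the second option (hypothetical function and parameter handling, not the actual code in this PR), the fork-specific defaults could be dropped so callers must supply the paths explicitly:

```python
from typing import Optional

def load_tokenizer_files(
    # Hypothetical parameters mirroring the ones above: no fork-specific
    # default paths; callers supply the standardized locations themselves.
    tokenizer_vocab_file_path: Optional[str] = None,
    tokenizer_merges_file_path: Optional[str] = None,
):
    if tokenizer_vocab_file_path is None or tokenizer_merges_file_path is None:
        raise ValueError("vocab and merges file paths must be provided explicitly")
    ...
```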

@tupini07 left a comment

left some comments :)

Dockerfile Outdated
Comment on lines 45 to 54
RUN pip install \
aim==3.16.2 \
py-rouge==1.1 \
rouge_score==0.1.2 \
parlai==1.7.1 \
evaluate==0.4.0

ENV NLTK_DATA="/usr/share/nltk_data"
RUN python -c "import nltk; nltk.download('punkt', download_dir='${NLTK_DATA}')"

Contributor Author

This likely isn't the correct place to make this change.

It is only a snippet from our full Dockerfile, which adds the evaluation libraries.

from metaseq.data.datasets.types import CommonDatasetConfiguration, DatasetConfiguration, DatasetConfigurationTeacherGenerated, DatasetModelConfig, DatasetModelHooks, DatasetTeacherGeneratedDataHooks, IdentityDict

# Visual diagram of where hooks/functions are called during inference or data generation
# https://excalidraw.com/#json=zoAk_TdynBHQnP9vZufGm,ekcVg_HqiF79cAp58_HKRQ
Contributor Author

This visualization may be helpful for understanding where the hooks/functions are called.
