M2D2: A Massively Multi-domain Language Modeling Dataset

Scripts and data links for M2D2: A Massively Multi-domain Language Modeling Dataset (EMNLP 2022) by Machel Reid, Victor Zhong, Suchin Gururangan, and Luke Zettlemoyer.

Data

Update: The data is now hosted on the Hugging Face Hub as machelreid/m2d2.

To load the dataset, use the following steps:

pip install --upgrade datasets

import datasets

dataset = datasets.load_dataset("machelreid/m2d2", "cs.CL") # replace cs.CL with the domain of your choice

print(dataset['train'][0]['text'])
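Once a domain is loaded, a quick size check can be useful before training. The following is a minimal sketch and not part of this repository: `summarize_split` is a hypothetical helper that assumes only that each record exposes a `text` field, as in the snippet above. It counts whitespace-separated tokens, which is only a rough proxy for the number of model tokens.

```python
from typing import Dict, Iterable


def summarize_split(examples: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count documents and whitespace-separated tokens in one split.

    Hypothetical helper: assumes each record has a "text" field.
    """
    n_docs = 0
    n_tokens = 0
    for example in examples:
        n_docs += 1
        n_tokens += len(example["text"].split())
    return {"documents": n_docs, "whitespace_tokens": n_tokens}


# Stand-in records so the sketch runs without downloading anything.
sample = [{"text": "language models adapt to domains"}, {"text": "hello world"}]
print(summarize_split(sample))
```

With the dataset loaded as above, `summarize_split(dataset['train'])` should produce the same kind of summary for the real split.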

We're currently exploring ways to host this large amount of data online in an accessible manner, so please stay tuned! If you would like access sooner, feel free to reach out at {machelreid}-{at}-{google-dot-com}.

Evaluation Sets

The test sets for all domains can be downloaded from this Google Drive link, or via gdown:

#!/bin/bash
# install and/or upgrade gdown with pip
pip install --upgrade gdown
# Download M2D2 test sets
gdown "1U5wki_V-IFQy733HC6NO5ZuM2jaOaw8y"
tar -xvzf m2d2_test_sets.tar.gz
# File structure
# m2d2_test_sets/
# ├─ DOMAIN_AA/
# │  ├─ test.txt
# ├─ DOMAIN_AB/
# │  ├─ test.txt
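After extraction, the per-domain test files can be collected programmatically. The sketch below assumes the layout shown above (one directory per domain, each holding a `test.txt`); `find_test_files` is a hypothetical helper, not a script shipped with this repository.

```python
from pathlib import Path
from typing import Dict


def find_test_files(root: str) -> Dict[str, Path]:
    """Map each domain directory name under `root` to its test.txt path."""
    test_files = {}
    for domain_dir in sorted(Path(root).iterdir()):
        candidate = domain_dir / "test.txt"
        # Skip stray files and domain directories without a test file.
        if domain_dir.is_dir() and candidate.is_file():
            test_files[domain_dir.name] = candidate
    return test_files


if __name__ == "__main__":
    root = Path("m2d2_test_sets")
    if root.is_dir():  # only list files if the archive has been extracted
        for domain, path in find_test_files(str(root)).items():
            print(domain, path)
```

Writing the resulting paths one per line gives a plain list of evaluation files that can be reused by other tooling.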

Reproduction Scripts for Modeling

Scripts for finetuning language models are in lm_scripts/adapt.sh. We also provide a meta-script, lm_scripts/generate_multiple.sh, that generates scripts for multiple domains from an input file listing directories of domain-specific data (each directory should contain train.txt and valid.txt). Instructions and parameters are documented in each file.
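One way to prepare such a list of domain directories is sketched below. This is an illustration only: `list_domain_dirs` and `write_domain_list` are hypothetical helpers, and the one-directory-per-line output format is an assumption, so check the instructions inside lm_scripts/generate_multiple.sh for the format it actually expects.

```python
from pathlib import Path
from typing import List


def list_domain_dirs(data_root: str) -> List[str]:
    """Return subdirectories of `data_root` that hold train.txt and valid.txt."""
    complete = []
    for domain_dir in sorted(Path(data_root).iterdir()):
        if not domain_dir.is_dir():
            continue
        has_train = (domain_dir / "train.txt").is_file()
        has_valid = (domain_dir / "valid.txt").is_file()
        if has_train and has_valid:
            complete.append(str(domain_dir))
    return complete


def write_domain_list(data_root: str, out_path: str) -> int:
    """Write one directory per line (assumed format) and return the count."""
    dirs = list_domain_dirs(data_root)
    Path(out_path).write_text("".join(d + "\n" for d in dirs))
    return len(dirs)
```

Directories missing either file are skipped, so incomplete domains do not reach the generated scripts.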

For validation on multiple files, we also include lm_scripts/validate_on_multiple_files.py for calculating perplexity measures given a file containing a list of evaluation text files and a model checkpoint.
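For reference, perplexity is the exponential of the average per-token negative log-likelihood. The sketch below shows that arithmetic for several evaluation files. It is not the code in lm_scripts/validate_on_multiple_files.py: the function names, the `"ALL"` key, and the input format (summed negative log-likelihood in nats plus a token count per file) are assumptions made for illustration.

```python
import math
from typing import Dict, Tuple


def perplexity(total_nll: float, n_tokens: int) -> float:
    """Perplexity from summed negative log-likelihood (in nats) over n_tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(total_nll / n_tokens)


def aggregate(per_file: Dict[str, Tuple[float, int]]) -> Dict[str, float]:
    """Per-file perplexities plus a token-weighted overall value under "ALL"."""
    results = {name: perplexity(nll, n) for name, (nll, n) in per_file.items()}
    total_nll = sum(nll for nll, _ in per_file.values())
    total_tokens = sum(n for _, n in per_file.values())
    results["ALL"] = perplexity(total_nll, total_tokens)
    return results


# Made-up numbers purely for illustration, not results from the paper.
stats = {"DOMAIN_AA/test.txt": (300.0, 100), "DOMAIN_AB/test.txt": (200.0, 100)}
for name, ppl in aggregate(stats).items():
    print(f"{name}: {ppl:.2f}")
```

Pooling log-likelihoods and token counts before exponentiating generally gives a different number than averaging the per-file perplexities, so it helps to state which one is being reported.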

Helper Scripts for Wikipedia Data Collection

For Wikipedia data collection, we include scripts for data dump processing (data_scripts/wiki/get_data), ontology gathering (data_scripts/wiki/ontology), and generating splits (data_scripts/wiki/split_generation).

Helper Scripts for S2ORC Data Collection

To be uploaded with documentation

Scripts to reproduce analyses in the paper

To be uploaded with documentation
