C³

Overview

This repository maintains C³, the first free-form multiple-Choice Chinese machine reading Comprehension dataset.

Paper: https://arxiv.org/abs/1904.09679

@article{sun2019investigating,
  title={Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension},
  author={Sun, Kai and Yu, Dian and Yu, Dong and Cardie, Claire},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
  url={https://arxiv.org/abs/1904.09679v3}
}

Files in this repository:

license.txt: the license of C³.
data/c3-{m,d}-{train,dev,test}.json: the dataset files, where m and d represent "mixed-genre" and "dialogue", respectively. The data format is as follows.

[
  [
    [
      document 1
    ],
    [
      {
        "question": document 1 / question 1,
        "choice": [
          document 1 / question 1 / answer option 1,
          document 1 / question 1 / answer option 2,
          ...
        ],
        "answer": document 1 / question 1 / correct answer option
      },
      {
        "question": document 1 / question 2,
        "choice": [
          document 1 / question 2 / answer option 1,
          document 1 / question 2 / answer option 2,
          ...
        ],
        "answer": document 1 / question 2 / correct answer option
      },
      ...
    ],
    document 1 / id
  ],
  [
    [
      document 2
    ],
    [
      {
        "question": document 2 / question 1,
        "choice": [
          document 2 / question 1 / answer option 1,
          document 2 / question 1 / answer option 2,
          ...
        ],
        "answer": document 2 / question 1 / correct answer option
      },
      {
        "question": document 2 / question 2,
        "choice": [
          document 2 / question 2 / answer option 1,
          document 2 / question 2 / answer option 2,
          ...
        ],
        "answer": document 2 / question 2 / correct answer option
      },
      ...
    ],
    document 2 / id
  ],
  ...
]
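As a minimal sketch of how the nested layout above can be flattened into one record per question, the following helper walks the parsed JSON. The function names `load_c3` and `iter_examples`, the record keys, and the tiny sample are hypothetical and not part of this repository; the sketch also assumes that each "answer" string matches one entry of "choice" exactly and that a document is a list of strings.

```python
import json


def load_c3(path):
    """Read one dataset file, e.g. data/c3-d-dev.json (UTF-8 JSON)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_examples(data):
    """Yield one flat record per question.

    Each top-level entry is [document, questions, id], as described above.
    """
    for document, questions, doc_id in data:
        # Assumption: the document is a list of strings (e.g. dialogue turns).
        text = "\n".join(document)
        for q in questions:
            choices = q["choice"]
            # Assumption: the answer text matches one of the options exactly.
            label = choices.index(q["answer"])
            yield {
                "id": doc_id,
                "document": text,
                "question": q["question"],
                "choices": choices,
                "label": label,
            }


# Hypothetical sample in the same layout, for illustration only.
sample = [
    [
        ["A: hi", "B: hello"],
        [{"question": "Q1", "choice": ["x", "y", "z"], "answer": "y"}],
        "doc-1",
    ]
]
examples = list(iter_examples(sample))
```

With a real file, `list(iter_examples(load_c3("data/c3-d-dev.json")))` would produce the same kind of records.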

annotation/c3-{m,d}-{dev,test}.txt: question type annotations. Each file contains 150 annotated instances. We adopt the following abbreviations:

Category              Abbreviation  Question Type
Matching              m             Matching
Prior knowledge       l             Linguistic
                      s             Domain-specific
                      c-a           Arithmetic
                      c-o           Connotation
                      c-e           Cause-effect
                      c-i           Implication
                      c-p           Part-whole
                      c-d           Precondition
                      c-h           Scenario
                      c-n           Other
Supporting sentences  0             Single sentence
                      1             Multiple sentences
                      2             Independent
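When working with the annotations programmatically, it can help to expand the abbreviations listed above into readable names. The lookup below is a small sketch built only from that list; the names `QUESTION_TYPES`, `SUPPORTING_SENTENCES`, and `describe_question_type` are hypothetical, and the sketch makes no assumption about the line format of the annotation files themselves.

```python
# Abbreviation -> (category, question type), copied from the list above.
QUESTION_TYPES = {
    "m": ("Matching", "Matching"),
    "l": ("Prior knowledge", "Linguistic"),
    "s": ("Prior knowledge", "Domain-specific"),
    "c-a": ("Prior knowledge", "Arithmetic"),
    "c-o": ("Prior knowledge", "Connotation"),
    "c-e": ("Prior knowledge", "Cause-effect"),
    "c-i": ("Prior knowledge", "Implication"),
    "c-p": ("Prior knowledge", "Part-whole"),
    "c-d": ("Prior knowledge", "Precondition"),
    "c-h": ("Prior knowledge", "Scenario"),
    "c-n": ("Prior knowledge", "Other"),
}

# Supporting-sentence codes, kept as strings as they would appear in text.
SUPPORTING_SENTENCES = {
    "0": "Single sentence",
    "1": "Multiple sentences",
    "2": "Independent",
}


def describe_question_type(abbreviation):
    """Return 'Category: Question type' for a known abbreviation."""
    category, name = QUESTION_TYPES[abbreviation]
    return f"{category}: {name}"
```

An unknown abbreviation raises `KeyError`, which makes unexpected labels easy to spot.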

bert folder: code of Chinese BERT, BERT-wwm, and BERT-wwm-ext baselines. The code is derived from this repository. Below are detailed instructions on fine-tuning Chinese BERT on C³.
1. Download and unzip the pre-trained Chinese BERT from here, then set the BERT_BASE_DIR environment variable to the unzipped directory: export BERT_BASE_DIR=/PATH/TO/BERT/DIR.
2. Copy the dataset folder data to bert/.
3. In bert, execute python convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path=$BERT_BASE_DIR/bert_model.ckpt --bert_config_file=$BERT_BASE_DIR/bert_config.json --pytorch_dump_path=$BERT_BASE_DIR/pytorch_model.bin.
4. Execute python run_classifier.py --task_name c3 --do_train --do_eval --data_dir . --vocab_file $BERT_BASE_DIR/vocab.txt --bert_config_file $BERT_BASE_DIR/bert_config.json --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin --max_seq_length 512 --train_batch_size 24 --learning_rate 2e-5 --num_train_epochs 8.0 --output_dir c3_finetuned --gradient_accumulation_steps 3.
5. The resulting fine-tuned model, predictions, and evaluation results are stored in bert/c3_finetuned.

Note:

Fine-tuning Chinese BERT-wwm or BERT-wwm-ext follows the same steps except for downloading their pre-trained language models.
Model training involves randomness, so you may want to run fine-tuning several times and choose the best model based on development set performance. You can also vary the random seed (specify --seed when executing run_classifier.py).
Depending on your hardware, you may need to change gradient_accumulation_steps.
The code has been tested with Python 3.6 and PyTorch 1.0.
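To make the multiple-run suggestion concrete, the sketch below builds the step 4 command once per seed, using only the flags quoted in this README. The helper name `build_finetune_command`, the seed values, and the per-seed output directory naming (`c3_finetuned_seed<N>`) are hypothetical; the commands are only printed, and are meant to be run from the bert folder.

```python
import shlex


def build_finetune_command(bert_dir, seed, output_dir, grad_accum_steps=3):
    """Return the run_classifier.py command from step 4 as an argument list."""
    return [
        "python", "run_classifier.py",
        "--task_name", "c3",
        "--do_train", "--do_eval",
        "--data_dir", ".",
        "--vocab_file", f"{bert_dir}/vocab.txt",
        "--bert_config_file", f"{bert_dir}/bert_config.json",
        "--init_checkpoint", f"{bert_dir}/pytorch_model.bin",
        "--max_seq_length", "512",
        "--train_batch_size", "24",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "8.0",
        "--output_dir", output_dir,
        # Adjust to your hardware, as noted above.
        "--gradient_accumulation_steps", str(grad_accum_steps),
        "--seed", str(seed),
    ]


# Hypothetical seeds; each run writes to its own output directory.
commands = [
    build_finetune_command("/PATH/TO/BERT/DIR", seed, f"c3_finetuned_seed{seed}")
    for seed in (1, 2, 3)
]
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each printed line can be executed in the shell, or passed to `subprocess.run(cmd, check=True)`, after which the runs can be compared on the development set.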
