Source Data of ACL2021 paper "Syntax-Enhanced Pre-trained Model".
In this paper, we present SEPREM, which leverages syntax information to enhance pre-trained models. To inject syntactic information, we introduce a syntax-aware attention layer and a newly designed pre-training task. Experimental results show that our method achieves state-of-the-art performance on six datasets. Further analysis shows that the proposed dependency distance prediction task performs better than the dependency head prediction task.
For more details, interested readers can find our paper here.
We randomly collected 1B sentences from the publicly released Common Crawl news dataset (CCNews), which contains English news articles crawled between December 2016 and March 2019. We then used off-the-shelf Stanza to automatically generate the syntax information for each sentence.
Generating the results took a month and a half on 64 V100-32G GPUs. The average sentence length is 25.34 tokens, and the average depth of the syntax trees is 5.15.
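As a rough illustration of the annotation step above, the sketch below runs Stanza's English pipeline on raw text and collects the same five fields that appear in the released files. The processor list, the helper names `sentence_to_record` and `annotate`, and the per-sentence dictionary layout are assumptions made for illustration; this is not the exact script used to build the dataset.

```python
def sentence_to_record(words):
    """Collect the five per-token fields from a list of Stanza-style words.

    Each word only needs the attributes lemma, xpos, upos, head and deprel.
    """
    return {
        "lemma": [w.lemma for w in words],
        "xpos": [w.xpos for w in words],
        "upos": [w.upos for w in words],
        "head": [w.head for w in words],      # 1-based head index, 0 for the root
        "deprel": [w.deprel for w in words],
    }


def annotate(text, pipeline=None):
    """Parse raw text and return one record per sentence.

    If no pipeline is passed, a default English Stanza pipeline is built.
    The models must have been downloaded once with stanza.download("en").
    """
    if pipeline is None:
        import stanza  # imported lazily so the helper above works without it
        pipeline = stanza.Pipeline(
            lang="en", processors="tokenize,pos,lemma,depparse"
        )
    doc = pipeline(text)
    return [sentence_to_record(sentence.words) for sentence in doc.sentences]


if __name__ == "__main__":
    for record in annotate("You should stick with your kid."):
        print(record)
```

Building the pipeline once and passing it to `annotate` for every batch avoids reloading the models, which matters at the scale described here.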
We now release the constructed 1B sentences, together with the corresponding syntax information, to the community.
You can download the data from my OneDrive (uploaded between 2021/05/11 and 2021/05/18).
Please note that the total size of all files should be above 800 GB, but we can only provide 722 GB.
Since I am using my student account, the data on OneDrive will expire in 2023.
Due to the large amount of data, we split the raw syntax information into 11 sections instead of storing it in a single file.
Each section generally contains 10 folders, and each folder contains about 1,000 JSON files, so a section holds roughly 10,000 JSON files (see the statistics table below).
Unfortunately, the first section was deleted by mistake, so only sections 2 to 11 can be provided. Folders 9/6 and 9/8 are also missing.
If you find that some JSON files are broken, this is due to unstable network transmission. Please open an issue and I will re-upload them as soon as possible.
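To check a download for broken files before reporting them, a small script along these lines may help. It is only a sketch: it assumes the downloaded data keeps the `section/folder/file.json` layout (as in `2/1/1_1_10000.json`), and the function name `find_broken_json` and the example root path are hypothetical.

```python
import json
from pathlib import Path


def find_broken_json(root):
    """Return the JSON files under root/<section>/<folder>/ that fail to parse."""
    broken = []
    for path in sorted(Path(root).glob("*/*/*.json")):
        try:
            with path.open(encoding="utf-8") as f:
                json.load(f)  # reads the whole file; raises if it is truncated
        except (json.JSONDecodeError, UnicodeDecodeError):
            broken.append(path)
    return broken


if __name__ == "__main__":
    # "seprem_data" is a placeholder for wherever the sections were downloaded.
    for path in find_broken_json("seprem_data"):
        print(path)
```

The printed paths can be pasted directly into an issue so the affected files can be re-uploaded.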
We provide the statistics of the results as follows:
Section Number | Number of Folders | Is Provided | Total Size (GB) | Total Number of Sentences / JSON Files |
---|---|---|---|---|
1 | | ❌ | | |
2 | 10 | 😀 | 78.7 | 96988985 / 9699 |
3 | 10 | 😀 | 76.1 | 94198706 / 9420 |
4 | 10 | 😀 | 72.7 | 90297083 / 9030 |
5 | 10 | 😀 | 73.1 | 91042200 / 9105 |
6 | 9 | 😀 | 68.3 | 86357503 / 8636 |
7 | 10 | 😀 | 73.5 | 91920280 / 9193 |
8 | 9 | 😀 | 71.3 | 89769348 / 8977 |
9 | 7 | 😀 | 53.5 | 66958763 / 6696 |
10 | 9 | 😀 | 69.8 | 86494425 / 8650 |
11 | 11 | 😀 | 85.3 | 109427451 / 10943 |
Sum | | | 722.3 | 903454744 / 90349 |
The raw syntactic information is stored in the JSON files mentioned above.
Each JSON file contains the raw syntax information for about 10,000 sentences, including lemma, xpos, upos, head, and deprel.
We take one item from the file 2/1/1_1_10000.json as an example:
Field | Value |
---|---|
lemma | ['you', 'should', 'stick', 'with', 'you', 'kid', '.'] |
xpos | ['PRP', 'MD', 'VB', 'IN', 'PRP$', 'NN', '.'] |
upos | ['PRON', 'AUX', 'VERB', 'ADP', 'PRON', 'NOUN', 'PUNCT'] |
head | [3, 3, 0, 6, 6, 3, 3] |
deprel | ['nsubj', 'aux', 'root', 'case', 'nmod:poss', 'obl', 'punct'] |
How to make better use of xpos, upos and deprel information is still a challenge.
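The `head` field in the example above follows Stanza's convention: each entry is the 1-based index of the token's head, and 0 marks the root. The sketch below shows one way to derive per-token depths and a tree distance between two tokens from that array. The function names are hypothetical, depth is counted in edges from the root, and distance is counted in edges along the tree path; whether these conventions match the paper's reported tree depth or its dependency distance prediction task exactly is an assumption.

```python
def path_to_root(head, i):
    """Return [i, parent(i), ..., root] using 1-based token indices."""
    path = [i]
    while head[path[-1] - 1] != 0:
        path.append(head[path[-1] - 1])
        if len(path) > len(head):
            raise ValueError("cycle detected in head array")
    return path


def token_depths(head):
    """Depth of every token, counted in edges from the root (root = 0)."""
    return [len(path_to_root(head, i)) - 1 for i in range(1, len(head) + 1)]


def tree_distance(head, i, j):
    """Number of edges on the tree path between tokens i and j (1-based)."""
    steps_from_i = {node: k for k, node in enumerate(path_to_root(head, i))}
    for steps_from_j, node in enumerate(path_to_root(head, j)):
        if node in steps_from_i:  # first shared ancestor on j's path
            return steps_from_i[node] + steps_from_j
    raise ValueError("tokens are not in the same tree")


if __name__ == "__main__":
    head = [3, 3, 0, 6, 6, 3, 3]        # the example item shown above
    print(token_depths(head))           # [1, 1, 0, 2, 2, 1, 1]
    print(max(token_depths(head)))      # 2
    print(tree_distance(head, 4, 1))    # 3: with -> kid -> stick -> you
```

Both helpers work on the `head` array alone, so they apply regardless of how the other fields are used.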
We evaluated our proposed SEPREM model on entity typing (Open Entity, FIGER), question answering (SearchQA, Quasar-T, CosmosQA), and relation classification (TACRED). Thanks to RuiZe's help, we used the fine-tuning pipelines provided by K-Adapter. Those pipelines are available from here.