
GTWiki

GTWiki is a non-parallel dataset for Text-To-Graph (parsing) & Graph-To-Text (generation) tasks. It is used in the framework implemented in our paper: "A multi-task semi-supervised framework for Text2Graph & Graph2Text".

Non-parallel data

GTWiki can be used for unsupervised learning. The texts and graphs are collected for the same set of 176,000 entities, from Wikipedia and Wikidata respectively.

  • English text: 240,024 instances (one or more sentences each), with an average length of 459.67 characters.
  • Graphs: 271,095 instances (1 to 6 triples each).

The data is available at data/monolingual.txt and data/graphs.txt, respectively.
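
If the files follow a simple one-instance-per-line layout (an assumption; the actual serialization may differ), they can be loaded with a few lines of Python:

# Minimal loading sketch, assuming one instance per line in both files.
from pathlib import Path

def load_lines(path):
    # Read one instance per line, skipping empty lines.
    with Path(path).open(encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

texts = load_lines("data/monolingual.txt")   # English text instances
graphs = load_lines("data/graphs.txt")       # graph instances

print(len(texts), len(graphs))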

Collection

Alternatively, you can run our collection script and customize it to your needs:

python3 collect.py [WIKIDATA_ID] [WIKIPEDIA_NAME] [MAX_DEPTH]

For example:

python3 collect.py Q762 "Leonardo da Vinci" 1

This command collects both text and graphs from Leonardo da Vinci and from its child entities in the graph, up to the given depth.
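
As an illustration of what such a depth-limited collection can look like, the sketch below walks the Wikidata graph breadth-first and pulls page summaries from the Wikipedia REST API. It is not the repository's collect.py: the choice of endpoints, the use of Wikidata sitelinks to find Wikipedia titles (instead of taking the page name as an argument), and the way claims are filtered into triples are all assumptions made for illustration only; see the paper for the actual algorithm.

# Illustrative, depth-limited collection sketch (NOT the repository's collect.py).
import sys
import urllib.parse

import requests

WIKIDATA_URL = "https://www.wikidata.org/wiki/Special:EntityData/{}.json"
WIKIPEDIA_URL = "https://en.wikipedia.org/api/rest_v1/page/summary/{}"


def fetch_entity(qid):
    # Raw Wikidata record (labels, sitelinks, claims) for a Q-id.
    return requests.get(WIKIDATA_URL.format(qid), timeout=30).json()["entities"][qid]


def fetch_summary(title):
    # Lead-section summary of an English Wikipedia page.
    url = WIKIPEDIA_URL.format(urllib.parse.quote(title))
    return requests.get(url, timeout=30).json().get("extract", "")


def object_claims(entity):
    # (property, object Q-id) pairs taken from the entity's claims.
    pairs = []
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {})
            if value.get("type") == "wikibase-entityid":
                pairs.append((prop, value["value"]["id"]))
    return pairs


def collect(seed_qid, max_depth):
    # Breadth-first walk over Wikidata: one text per entity (when an English
    # Wikipedia page exists) and one triple per entity-valued claim.
    texts, triples = [], []
    frontier, seen = [(seed_qid, 0)], set()
    while frontier:
        qid, depth = frontier.pop(0)
        if qid in seen or depth > max_depth:
            continue
        seen.add(qid)
        entity = fetch_entity(qid)
        title = entity.get("sitelinks", {}).get("enwiki", {}).get("title")
        if title:
            texts.append(fetch_summary(title))
        for prop, obj in object_claims(entity):
            triples.append((qid, prop, obj))
            frontier.append((obj, depth + 1))
    return texts, triples


if __name__ == "__main__":
    seed, depth = sys.argv[1], int(sys.argv[2])
    texts, triples = collect(seed, depth)
    print(len(texts), "texts;", len(triples), "triples")

Run, for instance, as python3 sketch_collect.py Q762 1 (the file name is hypothetical) to gather Leonardo da Vinci's summary plus the triples and summaries of the entities directly connected to it.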

For more information about the collection algorithm, please see our paper.

Requirements

The previous steps require Python >= 3.6. You can install all requirements by executing:

pip3 install -r requirements.txt

Citation

If you find our work, data, or code useful, please consider citing our paper.

@misc{domingo2022multitask,
      title={A multi-task semi-supervised framework for Text2Graph & Graph2Text}, 
      author={Oriol Domingo and Marta R. Costa-jussà and Carlos Escolano},
      year={2022},
      eprint={2202.06041},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
