
GTWiki

GTWiki is a non-parallel dataset for Text-To-Graph (parsing) & Graph-To-Text (generation) tasks. It is used in the framework implemented in our paper: "A multi-task semi-supervised framework for Text2Graph & Graph2Text".

Non-parallel data

GTWiki can be used for unsupervised learning. The texts and graphs are collected for the same set of 176,000 entities, from Wikipedia and Wikidata respectively.

  • English text: 240,024 instances (one or more sentences each), with an average length of 459.67 characters.
  • Graphs: 271,095 instances (1 to 6 triples each).

The data is available at data/monolingual.txt and data/graphs.txt, respectively.
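
If the files follow a simple one-instance-per-line layout (an assumption; the actual serialization may differ), they can be loaded with a few lines of Python:

# Minimal loading sketch, assuming one instance per line in both files.
from pathlib import Path

def load_lines(path):
    # Read one instance per line, skipping empty lines.
    with Path(path).open(encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

texts = load_lines("data/monolingual.txt")   # English text instances
graphs = load_lines("data/graphs.txt")       # graph instances

print(len(texts), len(graphs))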

Collection

Alternatively, you can run our collection script and customize it to your needs:

python3 collect.py [WIKIDATA_ID] [WIKIPEDIA_NAME] [MAX_DEPTH]

For example:

python3 collect.py Q762 "Leonardo da Vinci" 1

This command collects both text and graphs from Leonardo da Vinci and from its child entities in the graph, up to the given depth.
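
As an illustration of what such a depth-limited collection can look like, the sketch below walks the Wikidata graph breadth-first and pulls page summaries from the Wikipedia REST API. It is not the repository's collect.py: the choice of endpoints, the use of Wikidata sitelinks to find Wikipedia titles (instead of taking the page name as an argument), and the way claims are filtered into triples are all assumptions made for illustration only; see the paper for the actual algorithm.

# Illustrative, depth-limited collection sketch (NOT the repository's collect.py).
import sys
import urllib.parse

import requests

WIKIDATA_URL = "https://www.wikidata.org/wiki/Special:EntityData/{}.json"
WIKIPEDIA_URL = "https://en.wikipedia.org/api/rest_v1/page/summary/{}"


def fetch_entity(qid):
    # Raw Wikidata record (labels, sitelinks, claims) for a Q-id.
    return requests.get(WIKIDATA_URL.format(qid), timeout=30).json()["entities"][qid]


def fetch_summary(title):
    # Lead-section summary of an English Wikipedia page.
    url = WIKIPEDIA_URL.format(urllib.parse.quote(title))
    return requests.get(url, timeout=30).json().get("extract", "")


def object_claims(entity):
    # (property, object Q-id) pairs taken from the entity's claims.
    pairs = []
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {})
            if value.get("type") == "wikibase-entityid":
                pairs.append((prop, value["value"]["id"]))
    return pairs


def collect(seed_qid, max_depth):
    # Breadth-first walk over Wikidata: one text per entity (when an English
    # Wikipedia page exists) and one triple per entity-valued claim.
    texts, triples = [], []
    frontier, seen = [(seed_qid, 0)], set()
    while frontier:
        qid, depth = frontier.pop(0)
        if qid in seen or depth > max_depth:
            continue
        seen.add(qid)
        entity = fetch_entity(qid)
        title = entity.get("sitelinks", {}).get("enwiki", {}).get("title")
        if title:
            texts.append(fetch_summary(title))
        for prop, obj in object_claims(entity):
            triples.append((qid, prop, obj))
            frontier.append((obj, depth + 1))
    return texts, triples


if __name__ == "__main__":
    seed, depth = sys.argv[1], int(sys.argv[2])
    texts, triples = collect(seed, depth)
    print(len(texts), "texts;", len(triples), "triples")

Run, for instance, as python3 sketch_collect.py Q762 1 (the file name is hypothetical) to gather Leonardo da Vinci's summary plus the triples and summaries of the entities directly connected to it.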

For more information about the collection algorithm, please see our paper.

Requirements

The previous steps require Python >= 3.6. You can install all requirements by executing:

pip3 install -r requirements.txt

Citation

If you find our work, data, or code useful, please consider citing our paper.

@misc{domingo2022multitask,
      title={A multi-task semi-supervised framework for Text2Graph & Graph2Text}, 
      author={Oriol Domingo and Marta R. Costa-jussà and Carlos Escolano},
      year={2022},
      eprint={2202.06041},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
