tan92hl/Dataset-for-QA-over-Multilingual-KG

MLPQ: A Dataset for Path Question Answering over Multilingual Knowledge Graphs

License: GPL v3

Knowledge Graph-based Multilingual Question Answering (KG-MLQA), one of the essential subtasks of Knowledge Graph-based Question Answering (KGQA), allows questions in the KGQA task to be expressed in different languages, with the aim of bridging the lexical gap between questions and knowledge graph(s). However, existing KG-MLQA work mainly focuses on the semantic parsing of multilingual questions and ignores questions that require integrating information from cross-lingual knowledge graphs (CLKG). This paper extends KG-MLQA to Cross-lingual KG-based multilingual Question Answering (CLKGQA) and constructs MLPQ, the first CLKGQA dataset over multilingual DBpedia, which contains 300K questions in English, Chinese, and French. We further propose a novel KG sampling algorithm based on subgraph structural features and use it to obtain the KGs for MLPQ, so that the evaluated methods are compatible with our datasets. To evaluate the dataset, we put forward a general question answering framework whose core idea is to transform CLKGQA into KG-MLQA: we first use a Cross-lingual Entity Alignment (CLEA) model to merge the CLKG into a single KG, and then obtain the answer to a question with a multi-hop QA model combined with a multilingual pre-trained model. We establish two baselines for MLPQ, one of which uses Google Translate to obtain aligned entities, while the other adopts a recent CLEA model. Experiments show that simply combining existing QA and CLEA methods fails to achieve ideal performance on CLKGQA. Moreover, the availability of our benchmark contributes to the question answering and entity alignment communities.

Table of contents

  1. Datasets
    1. Overview
    2. Dataset creation
    3. Statistics
    4. Use of the datasets
  2. Baselines
  3. Versions and future work
    1. Version 1.3 update
    2. Version 1.2 update
    3. Version 1.1 update
    4. Current version
    5. Future work
  4. License

Datasets

Overview

MLPQ contains a total of 300K questions covering three language pairs (English-Chinese, English-French, and Chinese-French). Answering each question requires a 2-hop or 3-hop cross-lingual path inference.
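
As a rough illustration of what a cross-lingual 2-hop path looks like, the sketch below follows one relation in an English KG, crosses to a French KG through an entity alignment, and follows a second relation there. The triples, the alignment pair, and the function name `follow_path` are made up for illustration and are not taken from MLPQ.

```python
# Toy KGs stored as (head, relation) -> set of tails.
# All entities and relations here are hypothetical examples.
EN_KG = {
    ("en:Inception", "director"): {"en:Christopher_Nolan"},
}
FR_KG = {
    ("fr:Christopher_Nolan", "lieuDeNaissance"): {"fr:Londres"},
}
# Hypothetical cross-lingual entity alignment (English -> French).
ALIGN_EN_FR = {"en:Christopher_Nolan": "fr:Christopher_Nolan"}


def follow_path(start, hops, alignment):
    """Follow a sequence of (kg, relation) hops starting from `start`.

    At every hop, each frontier entity is looked up both under its own
    name and under its aligned counterpart (if any), so a path can move
    from one language's KG into the other.
    """
    frontier = {start}
    for kg, relation in hops:
        next_frontier = set()
        for entity in frontier:
            for candidate in (entity, alignment.get(entity)):
                if candidate is not None:
                    next_frontier |= kg.get((candidate, relation), set())
        frontier = next_frontier
    return frontier


# "Where was the director of Inception born?" as a 2-hop path.
answers = follow_path(
    "en:Inception",
    [(EN_KG, "director"), (FR_KG, "lieuDeNaissance")],
    ALIGN_EN_FR,
)
print(answers)
```

The second hop only succeeds because the alignment links the English entity to its French counterpart, which is why entity alignment quality matters for this kind of question.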

Dataset creation

We built MLPQ through a semi-automatic process, illustrated in the "Dataset Creation" figure.

Statistics

Statistics of the generated questions are given below. Each subset contains English, Chinese, and French versions, for a total of 314,479 questions:

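| KG pair | Language | 2-hop questions | 3-hop questions | Relation pairs in questions (2-hop) | Relation pairs in questions (3-hop) | Average length (2-hop) | Average length (3-hop) |
|---|---|---|---|---|---|---|---|
| en-zh | English | 14,656 | 29,815 | 1,250 | 2,628 | 12.4 | 15.5 |
| en-zh | Chinese | 14,852 | 29,643 | 1,251 | 2,637 | 17.2 | 21.7 |
| en-zh | French | 15,169 | 30,360 | 1,251 | 2,626 | 11.3 | 16.1 |
| en-fr | English | 15,289 | 18,154 | 1,138 | 3,575 | 12.3 | 15.5 |
| en-fr | Chinese | 15,831 | 18,035 | 1,141 | 3,578 | 17.8 | 21.8 |
| en-fr | French | 15,867 | 17,993 | 1,144 | 3,580 | 11.7 | 14.7 |
| zh-fr | English | 8,373 | 17,800 | 759 | 1,674 | 11.6 | 16.0 |
| zh-fr | Chinese | 8,414 | 17,877 | 758 | 1,677 | 17.5 | 21.4 |
| zh-fr | French | 8,495 | 17,856 | 758 | 1,668 | 12.1 | 14.9 |
| Sum | - | 116,946 | 197,533 | 3,157 | 9,484 | 12.2/17.5/11.6 | 15.6/21.6/15.4 |

Average lengths in the Sum row are given as English/Chinese/French.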
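
The question totals in the Sum row can be re-derived from the per-language rows. The short check below uses only the question counts listed in the statistics above; the variable names are ours.

```python
# Question counts per (KG pair, question language): (2-hop, 3-hop),
# copied from the statistics table.
counts = {
    ("en-zh", "English"): (14_656, 29_815),
    ("en-zh", "Chinese"): (14_852, 29_643),
    ("en-zh", "French"): (15_169, 30_360),
    ("en-fr", "English"): (15_289, 18_154),
    ("en-fr", "Chinese"): (15_831, 18_035),
    ("en-fr", "French"): (15_867, 17_993),
    ("zh-fr", "English"): (8_373, 17_800),
    ("zh-fr", "Chinese"): (8_414, 17_877),
    ("zh-fr", "French"): (8_495, 17_856),
}

two_hop_total = sum(two for two, _ in counts.values())
three_hop_total = sum(three for _, three in counts.values())

print(two_hop_total)                    # 116946
print(three_hop_total)                  # 197533
print(two_hop_total + three_hop_total)  # 314479
```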

Use of the datasets

  • The datasets are available in two formats: RDF, and a custom format similar to the one used by the datasets in IRN.
  • All the datasets are in the datasets directory. See that directory for an explanation of the file naming conventions and of our custom format.

Baselines

  • We established 3 baseline models for MLPQ.
  • The latest baseline combines NMN and UHop and runs on our latest dataset, which integrates bilingual KGs. It achieves the highest scores on our datasets.
  • The other 2 older models use MTransE and were tested on version 1.0 of our datasets.
  • Baseline code is in the baselines directory. To try these baselines, see that directory for further information.
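
The general idea of integrating bilingual KGs, that is, merging two KGs into one once aligned entity pairs are known, can be sketched as follows. This is a minimal illustration with made-up triples and a hypothetical `merge_kgs` helper. It is not the code in the baselines directory, and a real alignment would come from a CLEA model or from translation rather than from a hand-written dictionary.

```python
def merge_kgs(triples_a, triples_b, alignment):
    """Merge two lists of (head, relation, tail) triples into one set.

    `alignment` maps an entity of KG B to its equivalent entity in KG A.
    Aligned entities of KG B are rewritten to the KG A name, so both KGs
    share a single node for each aligned pair.
    """
    def canonical(entity):
        return alignment.get(entity, entity)

    merged = set(triples_a)
    for head, relation, tail in triples_b:
        merged.add((canonical(head), relation, canonical(tail)))
    return merged


# Hypothetical toy data, not taken from MLPQ.
triples_en = [("en:Paris", "en:capitalOf", "en:France")]
triples_zh = [("zh:巴黎", "zh:所属国家", "zh:法国")]
alignment_zh_en = {"zh:巴黎": "en:Paris", "zh:法国": "en:France"}

merged_kg = merge_kgs(triples_en, triples_zh, alignment_zh_en)
for triple in sorted(merged_kg):
    print(triple)
```

After merging, a multi-hop QA model can treat the result as a single KG, which is the sense in which CLKGQA is reduced to KG-MLQA.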

Versions and future work

Version 1.3 update

Using the KGT model (https://github.com/bisheng/KTG4KBQG), we generated more paraphrases for the questions in MLPQ. We then replaced a randomly selected 50% of the original questions with these paraphrases, which further increases the diversity of MLPQ. Version 1.3 also provides train/dev/test splits.
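
One way to picture the replacement step is the sketch below, which swaps a randomly chosen half of the questions for their paraphrases. The function name, the fixed seed, and the sample strings are assumptions for illustration; the actual procedure used for version 1.3 may differ.

```python
import random


def replace_with_paraphrases(questions, paraphrases, ratio=0.5, seed=0):
    """Return a copy of `questions` with a random `ratio` of the entries
    replaced by the paraphrase at the same index.

    `questions` and `paraphrases` must be parallel lists of equal length.
    """
    if len(questions) != len(paraphrases):
        raise ValueError("questions and paraphrases must have the same length")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    num_replaced = int(len(questions) * ratio)
    chosen = set(rng.sample(range(len(questions)), num_replaced))
    return [
        paraphrases[i] if i in chosen else question
        for i, question in enumerate(questions)
    ]


# Placeholder strings standing in for real questions and paraphrases.
originals = ["q0", "q1", "q2", "q3"]
rewrites = ["p0", "p1", "p2", "p3"]
mixed = replace_with_paraphrases(originals, rewrites)
print(mixed)
```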

Version 1.2 update

Recreated the datasets to address their diversity and redundancy problems. As a result, the datasets now contain fewer questions. Also added a new baseline framework combining NMN and UHop with m-BERT.

Version 1.1 update

In this slightly improved version, we corrected many grammatical errors and added the RDF version of all the datasets.

Current version

  • The current MLPQ version is 1.3. We expect to continue this work and to provide datasets of higher quality and greater variety in the future.
  • Because the generation of MLPQ is semi-automatic and relies to some degree on manually crafted templates and machine translation, there might be some minor problems in the text. We have tried to improve the quality of MLPQ through post-editing, so very few problems should remain. If you do find any errors in the dataset, please contact us.

Future work

For now, MLPQ mainly contains 2-hop and 3-hop path questions. In the future, we plan to adopt paraphrase generation based on web resources to create a wider range of question expressions. Path questions are only one subset of complex questions; we also plan to add fact-oriented complex questions with property information and to explore aggregation-type complex questions.

License

This project is licensed under the GPL v3 license; see the LICENSE file for details.
