ACL-2021-TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance #362

Open
BrambleXu opened this issue Feb 13, 2023 · 0 comments
Assignees
Labels
C Code Implementation D New Dataset Finance(D) Financial Domain QA(T) Question Answering/Machine Comprehension Task

Comments

BrambleXu commented Feb 13, 2023

Summary:

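Tables and text are extracted from annual reports to build a QA dataset. The paper also proposes a new QA model that can reason over both tables and text.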

Resource:

Paper information:

  • Author:
  • Dataset: TAT-QA
  • keywords:

Notes:

[Image: Figure 1 of the paper]

The left box of Figure 1 shows a real example from some financial report, where there is a table containing row/column header and numbers inside, and also some paragraphs describing it. We call the hybrid data like this example hybrid context in QA problems, as it contains both tabular and textual content, and call the paragraphs associated paragraphs to the table.

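The so-called hybrid context is about a table together with the descriptive sentences below it. Answering a question requires reasoning over the numbers in the table with the help of those descriptions.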
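To make the idea concrete, here is a minimal sketch of how one hybrid context could be held in memory. The class name `HybridContext`, its fields, and the figures in the example are hypothetical and are not the format used by TAT-QA itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HybridContext:
    """One table plus its associated paragraphs (hypothetical layout)."""
    table: List[List[str]]                  # first row holds the column headers
    paragraphs: List[str] = field(default_factory=list)

    def cell(self, row_header: str, col_header: str) -> str:
        """Look up a cell by its row header and column header."""
        col = self.table[0].index(col_header)
        for row in self.table[1:]:
            if row[0] == row_header:
                return row[col]
        raise KeyError(row_header)


# Made-up figures, for illustration only.
ctx = HybridContext(
    table=[
        ["", "2019", "2018"],
        ["Revenue", "1,200", "1,000"],
        ["Cost", "700", "650"],
    ],
    paragraphs=[
        "All amounts in the table are in thousands of dollars.",
        "Revenue grew on stronger product sales.",
    ],
)
revenue_2019 = ctx.cell("Revenue", "2019")
```

Note that the scale ("thousands") lives only in the paragraph, which is why a question about the table cannot be answered from the table alone.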

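For data construction, 500 reports from the past two years were collected from Annual reports. The table detection model of (Li et al., 2019) was used to find the tables, and Apache PDFBox was then used to extract their contents. Only tables with 3 to 30 rows and 3 to 6 columns were kept. In the end about 20,000 tables were obtained, none of which follow a standard schema. These tables may still contain errors, such as too few rows or columns, or missing numbers. During the annotation stage such tables are picked out manually and either deleted or fixed.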
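The size filter can be sketched as below. This assumes the limits are 3 to 30 rows and 3 to 6 columns and that each table is a list of rows; the function name `keep_table` and its parameters are hypothetical, not taken from the paper's code.

```python
def keep_table(table, min_rows=3, max_rows=30, min_cols=3, max_cols=6):
    """Return True if the table size is within the assumed limits."""
    n_rows = len(table)
    if not (min_rows <= n_rows <= max_rows):
        return False
    # Every row's width must fall within the column limits.
    return all(min_cols <= len(row) <= max_cols for row in table)


candidates = [
    [["a", "b", "c"]] * 4,   # 4 rows x 3 columns: kept
    [["a", "b"]] * 4,        # only 2 columns: dropped
    [["a", "b", "c"]] * 2,   # only 2 rows: dropped
]
kept = [t for t in candidates if keep_table(t)]
```

A filter like this only removes tables by size; the extraction errors mentioned above still need the manual check during annotation.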

Annotation stage

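  • First, the associated paragraphs are attached to each table. There must be more than two paragraphs, and each is checked for relevance, i.e. whether it describes, analyzes, or complements the content of the table. Nearly two-thirds of the tables were discarded at this step.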
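  • QA pair creation. Annotators mainly write questions that do not require advanced financial knowledge. For each hybrid context they write at least 6 questions, covering both extracted and calculated questions. For an extracted question, the answer can be a single span or multiple spans from the table or the paragraphs. For a calculated question, the answer requires some numerical reasoning, such as addition, subtraction, multiplication, division, comparison, or sorting. Where needed, the right scale for the numerical answer must also be annotated.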
  • Answer Type and Derivation Annotation. There are 3 kinds of answer: a single span or multiple spans extracted from the table or text, as well as a generated answer (usually obtained through numerical reasoning). Annotators must label which type applies. For a generated answer, the derivation also has to be added, which makes it easier to extend QA models.
  • Answer Source Annotation. For each answer, annotators must mark where the answer is located in the source.
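The annotation fields listed above can be pictured as one record per question. The sketch below is only an illustration: the class `QAPair`, its field names, the label strings, and the example figures are assumptions, not the released TAT-QA schema. The helper `eval_derivation` (also hypothetical) checks that a generated answer matches its derivation.

```python
import ast
import operator
from dataclasses import dataclass

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_derivation(expr: str) -> float:
    """Evaluate a +, -, *, / arithmetic expression without calling eval()."""
    def walk(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)


@dataclass
class QAPair:
    question: str
    answer: object
    answer_type: str      # "single_span", "multi_span" or "generated" (assumed labels)
    source: str           # where the answer is located, e.g. "table" or "text" (assumed labels)
    scale: str = ""       # e.g. "thousand" or "percent", when the answer is numerical
    derivation: str = ""  # only filled in for generated answers


# Made-up example, for illustration only.
qa = QAPair(
    question="What was the change in revenue from 2018 to 2019?",
    answer=200,
    answer_type="generated",
    source="table",
    scale="thousand",
    derivation="1200 - 1000",
)
computed = eval_derivation(qa.derivation)
```

Storing the derivation as a string keeps the annotation human-readable while still letting a script verify that the answer and the derivation agree.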

2.3 Quality Control
TODO

Model Graph:

Result:

Thoughts:

Next Reading:

@BrambleXu BrambleXu self-assigned this Feb 13, 2023
@BrambleXu BrambleXu added C Code Implementation D New Dataset QA(T) Question Answering/Machine Comprehension Task Finance(D) Financial Domain labels Feb 13, 2023