Pre-Training Data

The pre-training data consists of 6.2 million table-text examples extracted from the English Wikipedia in December 2019. The text associated with a table consists of the page title and description, the table caption, and the section title and section text.

Example

This is an example in proto text format extracted from this page.

  table: {
    columns: { text: "Year" }
    columns: { text: "Film" }
    columns: { text: "Dialogue-writer(s)" }
    rows: {
      cells: { text: "2013\n(1st)" }
      cells: { text: "" }
      cells: { text: "" }
    }
    rows: {
      cells: { text: "2013\n(1st)" }
      cells: { text: "Main Hoon Shahid Afridi" }
      cells: { text: "Vasay Chaudhry" }
    }
    table_id: "http://en.wikipedia.org/wiki/ARY_Film_Award_for_Best_Dialogue_1"
  }
  questions: {
    id: "TITLE"
    original_text: "ARY Film Award for Best Dialogue"
  }
  questions: {
    id: "DESCRIPTION"
    original_text: "The ARY Film Award for Best Dialogue is the ARY Film Award for the best dialogues of the year in film. It is one of three writing awards in the Technical Awarding category."
  }
  questions: {
    id: "SEGMENT_TITLE"
    original_text: "2010s"
  }
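
The nested structure above maps onto a header row plus a list of cell rows. The following is a minimal sketch, assuming only the field names visible in the example (`columns`, `rows`, `cells`, `text`). The helper `table_to_rows` and the `SimpleNamespace` stand-in are hypothetical and only serve to make the snippet runnable without the TAPAS protos; with real data you would pass the `table` field of a parsed interaction instead.

```python
from types import SimpleNamespace as NS


def table_to_rows(table):
  """Returns (header, rows) as plain lists of strings.

  Works on any object exposing `columns[i].text` and `rows[i].cells[j].text`,
  which are the fields shown in the example above.
  """
  header = [column.text for column in table.columns]
  rows = [[cell.text for cell in row.cells] for row in table.rows]
  return header, rows


# Stand-in for a parsed table (hypothetical; values copied from the example).
example_table = NS(
    columns=[NS(text="Year"), NS(text="Film"), NS(text="Dialogue-writer(s)")],
    rows=[
        NS(cells=[NS(text="2013\n(1st)"), NS(text=""), NS(text="")]),
        NS(cells=[
            NS(text="2013\n(1st)"),
            NS(text="Main Hoon Shahid Afridi"),
            NS(text="Vasay Chaudhry"),
        ]),
    ],
)

header, rows = table_to_rows(example_table)
```

As the first row of the example shows, cells may be empty and may contain newlines, so downstream code should not assume clean single-line values.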

Data

You can find the latest version of the data here. We also provide a small snapshot of the first 100 interactions.

Conversion to TF Examples

create_pretrain_examples_main.py converts the data to TF examples. It can be run locally, which takes a long time on a single machine, or as a Dataflow job on Google Cloud. You can find command line snippets here.

Parsing Protobuffers in Text Format

In case you want to work with the data in ways we didn't anticipate, you can simply parse it into proto objects line by line.

Here is a simple example:

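from google.protobuf import text_format
from tapas.protos import interaction_pb2

# `input_file` is an open text file (or any iterable of lines)
# holding one interaction in proto text format per line.
for line in input_file:
  interaction = text_format.Parse(line, interaction_pb2.Interaction())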
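
Once an interaction is parsed, the text associated with the table can be looked up by its `id`. The following is a minimal sketch, assuming the `questions`, `id` and `original_text` fields shown in the example above. `associated_text` is a hypothetical helper, and the stand-in object replaces a real parsed `Interaction` so that the snippet runs without the TAPAS package.

```python
from types import SimpleNamespace as NS


def associated_text(interaction):
  """Maps each question id (e.g. "TITLE") to its original text."""
  return {q.id: q.original_text for q in interaction.questions}


# Stand-in for a parsed interaction (hypothetical; values from the example).
example = NS(questions=[
    NS(id="TITLE", original_text="ARY Film Award for Best Dialogue"),
    NS(id="SEGMENT_TITLE", original_text="2010s"),
])

texts = associated_text(example)
title = texts.get("TITLE")
# Use .get() with a default, since an id may be absent from an example.
missing = texts.get("NOT_PRESENT", "")
```

The example shown earlier carries only some kinds of associated text, so lookups are written to tolerate missing ids rather than index the dictionary directly.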

Licence

This data is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
See also the Wikipedia Copyrights page.

How to cite this data?

You can cite the ACL 2020 paper.