TIE: Topological Information Enhanced Structural Reading Comprehension on Web Pages

The Topological Information Enhanced (TIE) model leverages the informative topological structures of web pages to tackle the web-based Structural Reading Comprehension (SRC) task, and achieved state-of-the-art results on the WebSRC dataset at the time of writing. This repository is the full implementation of our TIE model. For more details, please refer to our paper, cited in the Reference section below.

Requirements

The required Python packages are listed in requirements.txt. You can install them with

pip install -r requirements.txt

or

conda install --file requirements.txt

Data preparing

First, please follow the data pre-processing guidelines in the official WebSRC repository. Then, so that the NPR graph can be formed efficiently later, we calculate and store the NPR relations between the valid tags of each web page in dictionary format. To do this, run

python src/data_preprocess.py --root_dir ./data --task rect_mask

The resulting dictionary for each web page is placed in the same directory as the corresponding HTML file, and its file name carries the additional suffix .relation.json.
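Before training, it can be useful to confirm that every HTML file actually received a relation file. The following is a minimal sketch, not part of this repository: the helper names `find_missing_relations` and `load_relations` are hypothetical, and it assumes only that the relation file sits next to the HTML file, that its name begins with the HTML file's stem, and that it ends with `.relation.json`. The internal layout of the dictionary is not assumed.

```python
import json
from pathlib import Path

RELATION_SUFFIX = ".relation.json"  # suffix stated in this README


def find_missing_relations(root_dir):
    """Return HTML files under root_dir that have no sibling relation file."""
    missing = []
    for html_path in sorted(Path(root_dir).rglob("*.html")):
        # Assumption: the relation file name starts with "<stem>." and ends
        # with the suffix above, e.g. page.relation.json or
        # page.html.relation.json.
        prefix = html_path.stem + "."
        has_relation = any(
            p.name.startswith(prefix) and p.name.endswith(RELATION_SUFFIX)
            for p in html_path.parent.iterdir()
        )
        if not has_relation:
            missing.append(html_path)
    return missing


def load_relations(relation_path):
    """Load one relation dictionary without assuming its internal layout."""
    with open(relation_path, encoding="utf-8") as f:
        return json.load(f)
```

Calling `find_missing_relations("./data")` after the pre-processing command should return an empty list if every page was processed.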

Training

After completing the data preparation steps, TIE can be trained by running the train.sh file in the folder script/{backbone-PLM-for-CE}. The backbone model used for the Content Encoder of TIE is determined by the directory that holds the bash files. For example, to train TIE with MarkupLM as its Content Encoder, run

bash ./script/MarkupLM/train.sh
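Because the backbone is selected by folder name, the available choices can be read off the script directory. This is a small illustrative sketch, assuming only the script/{backbone-PLM-for-CE}/train.sh layout described above; the function name `available_backbones` is hypothetical and not part of the repository.

```python
from pathlib import Path


def available_backbones(script_dir="./script"):
    """Return names of sub-folders of script_dir that contain a train.sh file."""
    root = Path(script_dir)
    if not root.is_dir():
        return []
    return sorted(d.name for d in root.iterdir() if (d / "train.sh").is_file())
```

Each returned name can be substituted for MarkupLM in the command above.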

Moreover, to reproduce the experiments in the ablation study, you can use the argument --mask to specify the GAT masks used in TIE and the argument --direction to specify the relations used in the NPR graph.

Evaluation

Similarly, the bash files for evaluation can be found in the same directory as the bash file for training. Specifically, the eval_stage_1.sh file evaluates the quality of TIE's tag predictions for all the saved checkpoints on the development set, while the eval_stage_2.sh file evaluates the final answer span predictions on the development set. For the latter, an additional token-level QA model, its model type, and a checkpoint of TIE need to be specified. For example, to evaluate the tag prediction quality of all the checkpoints saved by the previous example command, run

bash ./script/MarkupLM/eval_stage_1.sh

Then, for the answer refining stage, suppose that we use MarkupLM, stored in the folder ./token_QA, as the additional token-level QA model, and that the checkpoint we want to evaluate is located at ./result/MarkupLM/checkpoint-27000. Note that the previous example command stores the n-best answer tag predictions in a corresponding JSON file; in this case, the file is ./result/MarkupLM/nbest_predictions_27000.json. Therefore, to evaluate the final performance, run

bash ./script/MarkupLM/eval_stage_2.sh markuplm ./token_QA ./result/MarkupLM/nbest_predictions_27000.json
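The mapping from a checkpoint directory to its n-best file, and from there to the stage-2 command, can be sketched as follows. This is only an illustration: the helpers `nbest_file_for` and `stage_2_command` are hypothetical, the file name pattern `nbest_predictions_<step>.json` is taken from the paragraph above, and the argument order (model type, QA model folder, n-best file) is copied from the example command.

```python
import re
import shlex
from pathlib import Path


def nbest_file_for(checkpoint_dir):
    """Map .../checkpoint-<step> to .../nbest_predictions_<step>.json."""
    checkpoint = Path(checkpoint_dir)
    match = re.fullmatch(r"checkpoint-(\d+)", checkpoint.name)
    if match is None:
        raise ValueError(f"not a checkpoint directory: {checkpoint_dir}")
    return checkpoint.parent / f"nbest_predictions_{match.group(1)}.json"


def stage_2_command(backbone, qa_model_type, qa_model_dir, checkpoint_dir):
    """Build the eval_stage_2.sh command line as a single shell string."""
    args = [
        "bash",
        f"./script/{backbone}/eval_stage_2.sh",
        qa_model_type,
        qa_model_dir,
        nbest_file_for(checkpoint_dir).as_posix(),
    ]
    # shlex.quote protects arguments that contain spaces or shell characters.
    return " ".join(shlex.quote(a) for a in args)
```

For instance, `stage_2_command("MarkupLM", "markuplm", "./token_QA", "./result/MarkupLM/checkpoint-27000")` builds a command equivalent to the one shown above, with the n-best path written relative to the current directory.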

Reference

If you find TIE useful or inspiring in your research, please cite the corresponding paper. The BibTeX entry is listed below:

@article{zhao-etal-2022-tie,
  author    = {Zihan Zhao and
               Lu Chen and
               Ruisheng Cao and
               Hongshen Xu and
               Xingyu Chen and
               Kai Yu},
  title     = {{TIE:} Topological Information Enhanced Structural Reading Comprehension
               on Web Pages},
  journal   = {CoRR},
  volume    = {abs/2205.06435},
  year      = {2022},
  url       = {https://doi.org/10.48550/arXiv.2205.06435},
  doi       = {10.48550/arXiv.2205.06435},
  eprinttype = {arXiv},
  eprint    = {2205.06435}
}

License

This project is licensed under the license found in the LICENSE file. Portions of the source code are based on the official code of WebSRC and MarkupLM.
