dell-research-harvard/HJDataset

HJDataset

A Large Dataset of Historical Japanese Documents with Complex Layouts

HJDataset is a large dataset of historical Japanese documents with complex layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it includes the hierarchical structures and reading orders of the layout elements for advanced analysis.

Download the dataset

All the annotations are available through this link. However, because of copyright restrictions, we cannot release the images in this dataset directly. Please fill out this form to request access, and we will send you the download links.

Organization of the files

After downloading, we suggest organizing the annotations and images as follows:

data/
├── train/
├── test/
├── val/
└── annotations/
    ├── instances_train.json 
    └── .... 
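Once the files are in place, a quick sanity check is to count the annotations per layout element type. The sketch below assumes the annotation files follow the COCO JSON layout (top-level `categories` and `annotations` lists), which the `instances_*.json` naming suggests but this README does not state. The helper names are hypothetical and are not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path


def load_annotations(path):
    """Read one annotation file into a Python dict."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def count_annotations_per_category(coco):
    """Return {category name: annotation count} for a COCO-style dict.

    Assumption: `coco` has "categories" (with "id" and "name") and
    "annotations" (with "category_id"), as in the COCO format.
    """
    id_to_name = {cat["id"]: cat["name"] for cat in coco.get("categories", [])}
    counts = Counter(ann["category_id"] for ann in coco.get("annotations", []))
    # Fall back to the raw id when a category is missing from the lookup.
    return {id_to_name.get(cat_id, str(cat_id)): n for cat_id, n in counts.items()}


if __name__ == "__main__":
    annotation_file = Path("data") / "annotations" / "instances_train.json"
    if annotation_file.exists():
        summary = count_annotations_per_category(load_annotations(annotation_file))
        for name, n in sorted(summary.items()):
            print(f"{name}: {n}")
    else:
        print(f"{annotation_file} not found; download the annotations first.")
```

If the files use a different schema, only the key names in `count_annotations_per_category` should need to change.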

Environment configuration

You can use the provided conda environment file to set up your environment:

conda env create -f environment.yml

Installing Detectron2 can fail depending on your platform and CUDA setup. If it does, consult the official install instructions and Common Installation Issues in the Detectron2 documentation.

Starter code

We provide some starter code in notebooks/.

  • 1-Dataloader and visualization.ipynb illustrates how to use the dataloader class to load and visualize layout elements in HJDataset.
  • 2-Training Using Detectron2.ipynb shows how to train segmentation models on the dataset using Detectron2.
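For readers who want to see the shape of the Detectron2 setup before opening the notebooks, here is a minimal sketch of registering the data with Detectron2's `register_coco_instances`. It is not taken from the notebooks. It assumes the suggested `data/` layout and COCO-style annotation files. The dataset names `hjdataset_<split>` and both helper functions are hypothetical. Only `instances_train.json` is named in this README, so the `instances_<split>.json` pattern for other splits is an assumption.

```python
from pathlib import Path


def split_paths(root, split):
    """Return (annotation file, image folder) for one split.

    Assumption: annotations are named instances_<split>.json; this README
    only confirms instances_train.json.
    """
    root = Path(root)
    return root / "annotations" / f"instances_{split}.json", root / split


def register_hjdataset(root="data", splits=("train",)):
    """Register each split under the hypothetical name hjdataset_<split>."""
    # Imported lazily so split_paths can be used without Detectron2 installed.
    from detectron2.data.datasets import register_coco_instances

    for split in splits:
        json_file, image_root = split_paths(root, split)
        register_coco_instances(
            f"hjdataset_{split}", {}, str(json_file), str(image_root)
        )
```

After calling `register_hjdataset()`, the name `hjdataset_train` can be listed under `cfg.DATASETS.TRAIN` in a Detectron2 config. The notebook remains the reference for the actual training settings.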

Cite our work

If you find the dataset helpful for your research, please cite our work:

@article{shen2020large,
  title={A Large Dataset of Historical Japanese Documents with Complex Layouts},
  author={Shen, Zejiang and Zhang, Kaixuan and Dell, Melissa},
  journal={arXiv preprint arXiv:2004.08686},
  year={2020}
}