
This work was supported by the European Research Council (ERC) project: “Industrialisation and Urban Growth from the mid-nineteenth century Ottoman Empire to Contemporary Turkey in a Comparative Perspective, 1850-2000 under the European Union’s Horizon 2020 research and innovation program Grant Agreement No. 679097, acronym UrbanOccupationsOETR.…


ysaidcan/OttomanRegisterProcessing


OttomanRegisterProcessing

First, install the dhSegment toolbox. Installation instructions are available at: https://dhsegment.readthedocs.io/en/latest/start/install.html

We work with Ottoman population registers from the 1840s to the 1860s. These registers contain demographic information about the male population. Each page contains three types of objects: populated-place start symbols, individuals, and households. The figure below shows a sample page with the different object types:

NFS_d___01454_00030 (1)

More than 500,000 individuals were read manually and entered into Microsoft Access databases. In this project, we annotated the individuals and populated place names in these registers to train CNN models. See the example below:

NFS_d___01452_00002

The annotated datasets can be found in the following folders:

Zistovi/images Zistovi/labels

or

İznik_images İznik_labels

The training Python script needs three inputs: the path to the original images, the path to the annotated images, and a classes text file. You can select either Unet or Resnet50 as the pretrained model architecture, and you can choose whether to use a GPU. For more information, see the dhSegment toolbox:

Sofia Ares Oliveira, Benoit Seguin, and Frederic Kaplan. Dhsegment: a generic deep-learning approach for document segmentation. In Frontiers in Handwriting Recognition (ICFHR), 2018 16th International Conference on, 7–12. IEEE, 2018.

You can find the training file at this link

After you train your model, it is saved to the output path you provided. You can then apply the trained model to your test images using the demo file:

You can find the demo file at this link

The demo outputs a CSV file containing the pixel coordinates of the detected objects, and it also draws boxes around them. An example register page with detected objects is shown below:

NFS_d___02865_00003_boxes

The detected objects must be sorted in the reading order of Arabic script, which starts at the top right of the page, so objects closer to the top right come first. In addition, each page image is divided down the middle, and the sorting must take this into account.

You can find the sorting Python script at this link
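The ordering described above can be sketched in a few lines of plain Python. This is only an illustration, not the repository's sorting script: the box format `(x_min, y_min, x_max, y_max)`, the function name `sort_boxes_rtl`, the `row_tolerance` parameter, and the choice to read the whole right half before the left half are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def _center(box: Box) -> Tuple[float, float]:
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def _sort_half(boxes: List[Box], row_tolerance: int) -> List[Box]:
    """Group boxes into rows by vertical position, then read each row right to left."""
    by_height = sorted(boxes, key=lambda b: _center(b)[1])
    rows: List[List[Box]] = []
    current: List[Box] = []
    for box in by_height:
        # Start a new row when the box is clearly below the first box of the current row.
        if current and _center(box)[1] - _center(current[0])[1] > row_tolerance:
            rows.append(current)
            current = []
        current.append(box)
    if current:
        rows.append(current)

    ordered: List[Box] = []
    for row in rows:
        # Right-to-left: larger x comes first.
        ordered.extend(sorted(row, key=lambda b: -_center(b)[0]))
    return ordered


def sort_boxes_rtl(boxes: List[Box], page_width: int, row_tolerance: int = 40) -> List[Box]:
    """Assumed reading order: right half of the image first, then the left half."""
    mid_x = page_width / 2
    right = [b for b in boxes if _center(b)[0] >= mid_x]
    left = [b for b in boxes if _center(b)[0] < mid_x]
    return _sort_half(right, row_tolerance) + _sort_half(left, row_tolerance)


if __name__ == "__main__":
    detected = [(100, 12, 200, 62), (800, 200, 900, 250), (800, 10, 900, 60)]
    for box in sort_boxes_rtl(detected, page_width=1000):
        print(box)
```

The `row_tolerance` value would need tuning to the line spacing of the scanned registers; a value that is too small splits one written row into several.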

In some of the Ottoman population registers collected in 1840-1860, the age, person number, and household number are written in red. To take advantage of this, we applied a red color filter to spot the numerals.

You can find the red color mask Python script at this link

The figure below shows an example register page and its red-filtered version.
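To illustrate the idea of a red color filter, here is a minimal sketch that uses only the Python standard library. It is not the repository's mask script, which may well rely on an image library; the function names `is_red` and `red_mask`, the HSV thresholds, and the nested-list image format are assumptions chosen for the example.

```python
import colorsys
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255


def is_red(pixel: Pixel,
           hue_margin: float = 15 / 360,
           min_saturation: float = 0.4,
           min_value: float = 0.3) -> bool:
    """Return True when the pixel's hue is close to red and it is vivid enough.

    The thresholds are illustrative guesses, not values taken from the project.
    """
    r, g, b = pixel
    hue, saturation, value = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    # Red sits at both ends of the hue circle (near 0 and near 1).
    near_red = hue <= hue_margin or hue >= 1 - hue_margin
    return near_red and saturation >= min_saturation and value >= min_value


def red_mask(image: List[List[Pixel]],
             background: Pixel = (255, 255, 255)) -> List[List[Pixel]]:
    """Keep red pixels and replace everything else with the background color."""
    return [[px if is_red(px) else background for px in row] for row in image]


if __name__ == "__main__":
    tiny_page = [
        [(200, 30, 30), (20, 20, 20)],      # red ink, black ink
        [(240, 235, 220), (180, 20, 40)],   # paper, red ink
    ]
    for row in red_mask(tiny_page):
        print(row)
```

Working in HSV rather than RGB makes the test less sensitive to how dark or faded the ink is, since hue is separated from brightness.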

NFS_d___01452_00002

A numeral spotting model can be trained using the masked registers under the numerals folder. The model can be tested with a demo script, which again marks the numerals on the document images and outputs a CSV file with their locations. A sample register page with detected numerals is shown below:

You can find the numeral recognition Python script

NFS_d___02865_00003_masked_boxes

Numbers and individuals can be combined with this Python script.
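The matching rule used by that script is not described here. As one plausible approach, the sketch below assigns each detected numeral to the individual whose box contains the numeral's center point. The function name `assign_numerals` and the `(x_min, y_min, x_max, y_max)` box format are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical format: (x_min, y_min, x_max, y_max)


def assign_numerals(individuals: List[Box], numerals: List[Box]) -> Dict[int, List[Box]]:
    """Map each individual's index to the numeral boxes whose center lies inside it."""
    assignment: Dict[int, List[Box]] = {i: [] for i in range(len(individuals))}
    for numeral in numerals:
        center_x = (numeral[0] + numeral[2]) / 2
        center_y = (numeral[1] + numeral[3]) / 2
        for index, (x_min, y_min, x_max, y_max) in enumerate(individuals):
            if x_min <= center_x <= x_max and y_min <= center_y <= y_max:
                assignment[index].append(numeral)
                break  # a numeral belongs to at most one individual
    return assignment


if __name__ == "__main__":
    people = [(500, 0, 900, 100), (100, 0, 450, 100)]
    digits = [(520, 10, 560, 40), (120, 60, 150, 90), (950, 10, 980, 40)]
    print(assign_numerals(people, digits))
```

Numerals that fall outside every individual box are simply left unassigned in this sketch.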

The predictions of the system can be visualized with this Python script.

The last step is to merge the manually entered Access database with the automatically recognized objects, using the key 'PersonInPage'. This key is not a column in the Access database, so it must be derived from PersonNumberRegistered. Insert a new column named PersonInPage with the formula =IF(AND(N2=N3,O2=O3),AH2+1,1). Also fix 1/2 --> 1 in the FileNo column.

Then run mergeDataset_Access.py. Remember to adjust the necessary column names in the number_individuals_combined file first.

The results are saved as merged_access_CV.csv. Name the first column, whose header is empty, ID. To crop the numbers automatically and match them with the manually annotated ground truth, run cropimageAuto.py.
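The spreadsheet formula is a running counter: it adds 1 while the values in columns N and O match the previous row and restarts at 1 when either changes. A rough Python equivalent is sketched below. The function name `person_in_page` is hypothetical, and the meaning of columns N and O (assumed here to identify the register and the page) is a guess, since the column headers are not given.

```python
from typing import Hashable, List, Sequence, Tuple


def person_in_page(keys: Sequence[Tuple[Hashable, Hashable]]) -> List[int]:
    """Count rows within each run of identical (N, O) pairs, restarting at 1 on change.

    Mirrors =IF(AND(N2=N3,O2=O3),AH2+1,1) applied down a sorted sheet.
    """
    counters: List[int] = []
    previous = object()  # sentinel that never equals a real key
    count = 0
    for key in keys:
        count = count + 1 if key == previous else 1
        counters.append(count)
        previous = key
    return counters


if __name__ == "__main__":
    # Hypothetical (column N, column O) values for six consecutive rows.
    rows = [("1452", 2), ("1452", 2), ("1452", 2), ("1452", 3), ("1452", 3), ("1453", 3)]
    print(person_in_page(rows))
```

As with the formula, the result is only meaningful when the rows are already in register order, because the counter compares each row with the one directly above it.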
