Kaira - convert matrikels to datasets

Kaira is Python software that converts old Finnish matrikel books into CSV and JSON formats. The currently supported book series are "Suomen Rintamamiehet 1939-43", "Suomen Pienviljelijät", "Siirtokarjalaisten tie" and "Suuret maatilat". The book series were originally published in the 1970s and contain brief descriptions of people, their lives, children, spouses and so on. This data is scientifically interesting but difficult to analyze statistically in written form.

See the Pikakäyttöohje (quick-start guide) and the developer documentation in the Wiki.

#What does this tool do?

Kaira is meant to be used as a tool for extracting interesting data from old matrikel books that have been scanned and OCR'd. The extracted data can then be edited and exported to CSV or JSON for statistical analysis. The tool was originally developed at Lammi Biological Station in collaboration with John Loehr.

#How does it work?

1. First you need a digital scan of the book, preferably of as good quality as possible.
2. Run OCR on the scanned documents to get the raw text in a simple .txt or .html format. Picking good OCR software and settings takes some trial and error. We first used Adobe's product but eventually found ABBYY Finereader, which could produce really good quality text and save it to handy html files.
3. Run the "chunker" on the raw text file. It tries to isolate each person's entry into a separate XML tag for easy processing. The implementation depends on the source material: for the soldiers' books this is done with a regex that looks for patterns common at the beginning of a soldier's entry. It works most of the time but can make mistakes, which have to be fixed in the fixer tool (the GUI described below). For the other book series the contents are picked from the html document.
4. Kaira then reads the XML file, runs multiple tailored regexes and other domain-specific logic, and generates a csv file containing the data. At this point the user can use the GUI to find missing information, edit the XML file to fix extraction errors, rerun the process and so on.
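
The chunking and extraction steps described above can be illustrated with a minimal sketch. It is not Kaira's actual code: the sample text, the entry layout ("Surname, First names, s. date place"), the tag names `DATA` and `PERSON`, the regexes and the function names are all assumptions made up for illustration. The real patterns are tailored to each book series and are considerably more involved.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Invented sample of OCR'd text; real matrikel entries look different.
RAW_TEXT = (
    "Virtanen, Matti Juhani, s. 12.3.1915 Lammi. Puoliso Aino. Lapset: Liisa, Kalle.\n"
    "Korhonen, Pekka, s. 4.11.1920 Kuopio. Puoliso Helmi.\n"
)

# Assumed pattern for the beginning of one person's entry.
ENTRY_START = re.compile(r"^[A-ZÅÄÖ][a-zåäö]+, [^,\n]+, s\. ", re.MULTILINE)

# Assumed field patterns ("s." = born, "Puoliso" = spouse).
NAME = re.compile(r"^([^,]+), ([^,]+),")
BIRTH = re.compile(r"s\. (\d{1,2})\.(\d{1,2})\.(\d{4}) ([A-ZÅÄÖ][a-zåäö]+)")
SPOUSE = re.compile(r"Puoliso ([A-ZÅÄÖ][a-zåäö]+)")

FIELDS = ["surname", "first_names", "birth_day", "birth_month",
          "birth_year", "birth_place", "spouse"]


def chunk(raw_text):
    """Wrap every detected person entry in its own <PERSON> element."""
    root = ET.Element("DATA")
    starts = [m.start() for m in ENTRY_START.finditer(raw_text)]
    bounds = starts + [len(raw_text)]
    for begin, end in zip(bounds, bounds[1:]):
        person = ET.SubElement(root, "PERSON")
        person.text = raw_text[begin:end].strip()
    return root


def extract(person):
    """Run the field regexes on one entry; missing fields stay empty."""
    text = person.text or ""
    row = dict.fromkeys(FIELDS, "")
    name = NAME.search(text)
    if name:
        row["surname"], row["first_names"] = name.groups()
    birth = BIRTH.search(text)
    if birth:
        (row["birth_day"], row["birth_month"],
         row["birth_year"], row["birth_place"]) = birth.groups()
    spouse = SPOUSE.search(text)
    if spouse:
        row["spouse"] = spouse.group(1)
    return row


def to_csv(rows):
    """Serialize the extracted rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    xml_root = chunk(RAW_TEXT)
    print(ET.tostring(xml_root, encoding="unicode"))
    print(to_csv([extract(p) for p in xml_root]))
```

Keeping the chunked XML as an intermediate file is what makes the manual fixing step possible: if the entry-start pattern splits or merges entries incorrectly, the user can correct the XML and rerun only the extraction.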

#GUI

Kaira includes a simple GUI for reading, exporting and editing the OCR files and related content. See the wiki for detailed usage instructions.

#Development

See the project Wiki for documentation on how to extend the software with new book series, and for more detailed information on setting up a development environment, what you need to know and so on.

#Future

On my part, development will likely stop at the beginning of June 2015. Some critical bug fixes might still be made afterwards.

#Attribution

Please cite the following if you use this software or datasets generated by it in your research:

T. Salmi, J. Loehr. Kaira [computer software]. Lammi Biological Station, 2015. Available at https://github.com/Tumetsu/Kaira
