Improving accuracy and speeding up Document Image Classification through parallel systems

Paper: Improving accuracy and speeding up Document Image Classification through parallel systems

Datasets

SmallTobacco files can be downloaded here. The Data folder provides scripts for extracting the OCR .txt files (ocr_tobacco.py) and for creating the .hdf5 files (ST_hdf5_dataset_creation.py) that hold the images and OCR data.

BigTobacco files can be downloaded here. ./Data/BT_hdf5_dataset_creation.py creates the train, test and validation .hdf5 files, following the partition given at that link.

Repository structure

Image model

├── image_model
	├── eff_big_training.py # EfficientNet training on BigTobacco
	├── eff_small_training.py # EfficientNet training on SmallTobacco
	├── eff_utils.py # EfficientNet helper functions shared by SmallTobacco and BigTobacco training
	├── H5Dataset.py # Dataset class reading hdf5 file
	├── tensorflow
		├── distr_effnet_shear.py # EfficientNet

Text model

├── text_model
	├── main.py # BERT training on SmallTobacco
	├── bert_utils.py # BERT helpers
	├── training_modules
		├── data_utils.py # data cleaning and H5Dataset class
		├── finetuned_models.py # BERT model definition
		├── model_utils.py # train and test procedures

Ensemble

├── text_model
	├── ensemble.py # ensemble image and text predictions
	├── bert_utils.py # BERT helpers
	├── ensemble_modules
		├── data_utils2.py # data cleaning and H5Dataset_ensemble class
		├── model_utils_ensemble.py # BERT and EfficientNet predictions and ensemble
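
To illustrate what combining image and text predictions can look like, here is a minimal sketch of one common approach: a weighted average of the per-class probabilities of the two models. This is an assumption for illustration only; the combination rule actually used in ensemble.py is not described in this README, and the names `softmax`, `ensemble_predict` and `image_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(image_logits, text_logits, image_weight=0.5):
    """Blend the class probabilities of two models and pick the best class.

    image_weight is a hypothetical knob (not taken from the repository):
    1.0 trusts only the image model, 0.0 trusts only the text model.
    """
    if len(image_logits) != len(text_logits):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be in [0, 1]")

    image_probs = softmax(image_logits)
    text_probs = softmax(text_logits)

    # Weighted average of the two probability vectors, class by class.
    blended = [
        image_weight * p_img + (1.0 - image_weight) * p_txt
        for p_img, p_txt in zip(image_probs, text_probs)
    ]
    best_class = max(range(len(blended)), key=blended.__getitem__)
    return best_class, blended
```

Calling `ensemble_predict(image_logits, text_logits)` returns the index of the winning class together with the blended probabilities, so the weight can be tuned on a validation split.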

The efficientnet_pytorch library downloads its models to .cache/torch/checkpoints.

The pytorch_transformers library downloads them to .cache/torch/pytorch_transformers. If your machine has no internet access, download the models beforehand and store them in those paths.
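
Before running on an offline machine, it may help to check that both cache folders are populated. The sketch below uses only the standard library and assumes the two paths are relative to the user's home directory, which this README does not state explicitly; `missing_caches` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Cache locations quoted in this README.
# Assumption: they are relative to the user's home directory.
CACHE_DIRS = {
    "efficientnet_pytorch": Path(".cache/torch/checkpoints"),
    "pytorch_transformers": Path(".cache/torch/pytorch_transformers"),
}


def missing_caches(home=None):
    """Return the libraries whose cache folder is absent or empty."""
    home = Path(home) if home is not None else Path.home()
    missing = []
    for library, relative_dir in CACHE_DIRS.items():
        cache_dir = home / relative_dir
        # A folder counts as populated if it exists and holds at least one entry.
        has_files = cache_dir.is_dir() and any(cache_dir.iterdir())
        if not has_files:
            missing.append(library)
    return missing
```

An empty list from `missing_caches()` means both folders contain at least one file; it does not verify that the files are the specific models the training scripts expect.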
