Using OCR for extracting test questions from printed books

This mini-project extracts test questions from printed books using OCR (image to text) and a custom parser (text to semantic data, in this case test questions).

It uses Google Cloud Vision API for OCR. 👀

All code is written in TypeScript.

👉 See also my other project memorio that uses the results created in this mini-project.

Usage

Requirements

Node.js >=18.x
Yarn 1.x
optional: globally installed nodemon for rerunning scripts on source changes

Set up

Install all dependencies with Yarn (run yarn).

Running

There are 4 scripts that implement the full pipeline from a book scan in a PDF to machine-readable data (a collection of questions and categories, including all metadata such as numbering and correct answers).

The scripts were developed specifically for extracting the test questions from the book Modelové otázky z biologie k přijímacím zkouškám na 1. lékařskou fakultu Univerzity Karlovy v Praze, verze 2011. However, they can be easily adapted to other similar use cases.

Note 1: The input PDFs are NOT published in this repository. However, the example output is and can be found here.

Note 2: Instead of nodemon, you can use node directly.

Note 3: If the input PDF is a scanned book where each page contains an image of two real pages (an open book), it is better to manually split the images down the middle (e.g. using the free online service Split two-page layout scans to create separate PDF pages) before running the OCR with the run-ocr.ts script.

run-ocr.ts {bucketName} {fileName} {outputPrefix}

Calls Google Cloud Vision API asyncBatchAnnotate (see also the official guide).

The PDF (image scan) {fileName} must be stored in a GCS bucket {bucketName}. The conversion result is a set of JSON files (one file per 20 pages) that are stored under {outputPrefix} in the same bucket.

The script waits until the conversion finishes, and then it prints the output info.
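
The call described above can be sketched as a request builder. This is only an illustration, not the code from scripts/run-ocr.ts: the function name buildOcrRequest is hypothetical, and the choice of DOCUMENT_TEXT_DETECTION and of using {outputPrefix} verbatim as the destination URI are assumptions. The batchSize of 20 mirrors the "one file per 20 pages" behaviour mentioned above.

```typescript
// Hypothetical helper (not taken from scripts/run-ocr.ts): builds the request
// object for the Vision API's asynchronous PDF annotation.
function buildOcrRequest(bucketName: string, fileName: string, outputPrefix: string) {
  return {
    requests: [
      {
        // The scanned PDF must already be uploaded to the GCS bucket.
        inputConfig: {
          mimeType: 'application/pdf',
          gcsSource: { uri: `gs://${bucketName}/${fileName}` },
        },
        // Assumption: dense book pages are best served by document text detection.
        features: [{ type: 'DOCUMENT_TEXT_DETECTION' }],
        // Results are written back to the same bucket, 20 pages per JSON file.
        outputConfig: {
          gcsDestination: { uri: `gs://${bucketName}/${outputPrefix}` },
          batchSize: 20,
        },
      },
    ],
  };
}

// With the official @google-cloud/vision client, such a request would
// typically be passed along like this (shown as a comment so that the
// sketch stays free of dependencies):
//
//   const [operation] = await client.asyncBatchAnnotateFiles(request);
//   await operation.promise(); // resolves once the conversion has finished
const request = buildOcrRequest(
  'testbook-ocr',
  'test/Modelovky_Biologie_1LF_2011.pdf',
  'results/Modelovky_Biologie_1LF_2011',
);
console.log(JSON.stringify(request, null, 2));
```

Keeping the request construction in a pure function makes it easy to inspect what will be sent before any billable API call is made.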

An example:
```
nodemon -r ./register.js scripts/run-ocr.ts \
testbook-ocr \
test/Modelovky_Biologie_1LF_2011.pdf \
results/Modelovky_Biologie_1LF_2011
```
The script source code can be found in scripts/run-ocr.ts.

post-process.ts {ocrOutputDir} {pagesDir}

Takes the resulting JSON files from the first script and extracts the text. The input JSON files must be in {ocrOutputDir} (on the local filesystem). The output is placed in {pagesDir} (also on the local filesystem) as a set of page-XXXX.txt files, each containing the text of the corresponding page.
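
As a rough sketch of this step, the function below maps one parsed OCR output file to page file names and page texts. It is not the code from scripts/post-process.ts: the names OcrOutputFile, PageText and extractPages are hypothetical, and reading the text from responses[].fullTextAnnotation.text and the page number from responses[].context.pageNumber is an assumption about which fields of the Vision API output are used. Zero-padding the page number to four digits is inferred from the page-XXXX.txt pattern.

```typescript
// Only the parts of a Vision API output file that this sketch reads.
interface OcrOutputFile {
  responses: Array<{
    fullTextAnnotation?: { text: string };
    context?: { pageNumber: number };
  }>;
}

// Hypothetical result type: one entry per page of the scanned book.
interface PageText {
  fileName: string;
  text: string;
}

// Turns one parsed OCR output file into page-XXXX.txt entries.
function extractPages(ocrOutput: OcrOutputFile): PageText[] {
  const pages: PageText[] = [];
  for (const response of ocrOutput.responses) {
    const pageNumber = response.context?.pageNumber;
    if (pageNumber === undefined) {
      continue; // without a page number there is no file name to derive
    }
    // A blank page has no text annotation; keep it as an empty file.
    const text = response.fullTextAnnotation?.text ?? '';
    const padded = ('0000' + pageNumber).slice(-4);
    pages.push({ fileName: `page-${padded}.txt`, text });
  }
  return pages;
}

// Usage with a tiny hand-written sample instead of a real OCR output file.
const sample: OcrOutputFile = {
  responses: [
    { fullTextAnnotation: { text: '1. First question\n' }, context: { pageNumber: 21 } },
    { context: { pageNumber: 22 } },
  ],
};
for (const page of extractPages(sample)) {
  console.log(page.fileName, JSON.stringify(page.text));
}
```

Writing each entry to {pagesDir} is then a matter of one file write per returned item.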

An example:
```
nodemon -r ./register.js -i 'data/' scripts/post-process.ts \
data/modelovky-biologie-1lf-2011/ocr-output/ \
data/modelovky-biologie-1lf-2011/pages-original/
```
The script source code can be found in scripts/post-process.ts.

parse-questions.ts {pagesDir} {questionsDir}

This script implements a use-case-specific semantic parser that turns the raw text pages into machine-readable data (questions, categories).

It takes the output of the second script (which is in {pagesDir}) and creates a collection of JSON files: one categories.json, plus one page-XXXX.json per page that contains the questions from the corresponding page.

When the parser encounters an unexpected token, it stops and prints detailed information (page and line) about where the error occurred. This allows manual correction of the OCR text output files. The parsing can be rerun many times (after each correction) until there are no errors and all outputs are created.
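
The fail-fast behaviour can be illustrated with a minimal line-based parser. This is a sketch under loud assumptions, not the parser from scripts/parse-questions.ts: the input format (questions as "12. text", options as "a) text"), the Question type and the parsePage function are all hypothetical, and the real book layout is richer. What it shows is the principle of stopping at the first unexpected line and reporting the page and line number.

```typescript
// Hypothetical, simplified question shape (the real data has more metadata).
interface Question {
  number: number;
  text: string;
  options: string[];
}

// Parses the text of one page. Assumed format: a question starts with
// "<number>. " and each of its options starts with "a) " to "d) ".
function parsePage(pageNumber: number, text: string): Question[] {
  const questions: Question[] = [];
  text.split('\n').forEach((rawLine, index) => {
    const line = rawLine.trim();
    if (line === '') {
      return; // blank lines carry no information
    }
    const questionMatch = /^(\d+)\.\s+(.+)$/.exec(line);
    if (questionMatch) {
      questions.push({ number: Number(questionMatch[1]), text: questionMatch[2], options: [] });
      return;
    }
    const optionMatch = /^[a-d]\)\s+(.+)$/.exec(line);
    const current = questions[questions.length - 1];
    if (optionMatch && current) {
      current.options.push(optionMatch[1]);
      return;
    }
    // Anything else is treated as an OCR error: stop and say where it is.
    throw new Error(`page ${pageNumber}, line ${index + 1}: unexpected token "${line}"`);
  });
  return questions;
}

// Usage: the second line is garbled, so parsing stops with a precise location.
try {
  parsePage(12, '1. Which organelle produces ATP?\n@) ribosome\nb) mitochondrion');
} catch (error) {
  console.error((error as Error).message);
}
```

After fixing the reported line in the page text file, the same page can simply be parsed again, which matches the correct-and-rerun loop described above.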

An example:
```
nodemon -r ./register.js -i 'data/*/questions/' scripts/parse-questions.ts \
data/modelovky-biologie-1lf-2011/pages/ \
data/modelovky-biologie-1lf-2011/questions/
```
The script source code can be found in scripts/parse-questions.ts.

memorio-transform.ts {questionsDir} {memorioOutputDir}

Takes the parsed questions and categories from the third script (which are in {questionsDir}) and transforms them into a format that can be used in the memorio app.

An example:
```
nodemon -r ./register.js -i 'data/*/memorio/' scripts/memorio-transform.ts \
data/modelovky-biologie-1lf-2011/questions/ \
data/modelovky-biologie-1lf-2011/memorio/
```
The script source code can be found in scripts/memorio-transform.ts.

pokusew/testbook-ocr
