pdf2parallel

Getting Started

Extract sentences from PDFs with Apache Tika. Thai sentences are segmented with pythainlp and English sentences with nltk.

python extract_sentences.py --en_dir en_data/ --th_dir th_data/
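
The extraction step above can be sketched roughly as follows. This is a minimal illustration, not the contents of extract_sentences.py: the function names (`clean_lines`, `split_sentences`, `pdf_to_sentences`) are hypothetical, and segmenting line by line is a simplifying assumption that will split any sentence spanning a line break. The third-party imports (tika, nltk, pythainlp) are deferred so the helpers can be read and tested on their own; nltk additionally needs its punkt tokenizer data, and Tika needs a Java runtime.

```python
import re


def clean_lines(raw_text):
    """Collapse whitespace runs and drop empty lines from extracted PDF text."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in raw_text.splitlines())
    return [line for line in lines if line]


def split_sentences(text, lang):
    """Segment text into sentences with the tokenizer for the given language.

    lang is "th" (pythainlp) or "en" (nltk); anything else raises ValueError.
    """
    if lang == "th":
        from pythainlp.tokenize import sent_tokenize
    elif lang == "en":
        from nltk.tokenize import sent_tokenize  # requires nltk punkt data
    else:
        raise ValueError(f"unsupported language: {lang!r}")
    return [s.strip() for s in sent_tokenize(text) if s.strip()]


def pdf_to_sentences(pdf_path, lang):
    """Extract text from one PDF with Apache Tika, then segment it.

    Assumption: each cleaned line is segmented independently.
    """
    from tika import parser  # starts or contacts a local Tika server

    parsed = parser.from_file(pdf_path)
    content = parsed.get("content") or ""  # content is None for empty PDFs
    sentences = []
    for line in clean_lines(content):
        sentences.extend(split_sentences(line, lang))
    return sentences
```

Under these assumptions, `pdf_to_sentences("en_data/report.pdf", "en")` would return a list of English sentences for a hypothetical file `report.pdf`; the real script operates on whole directories through `--en_dir` and `--th_dir`.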

Align the extracted English and Thai sentences using Universal Sentence Encoder

python align_sentences_use.py --en_dir en_data/ --th_dir th_data/ --output_path assorted_government.csv
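
The idea behind embedding-based alignment can be sketched as below. The README does not describe how align_sentences_use.py matches sentences, so this is an assumed strategy: each English sentence is paired with its most similar Thai sentence by cosine similarity, and pairs below a threshold are dropped. The names `cosine`, `align`, and `min_score` are hypothetical, and `embed` stands in for a multilingual sentence encoder; here it is a toy lookup table so the sketch runs without a model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(en_sentences, th_sentences, embed, min_score=0.5):
    """Pair each English sentence with its nearest Thai sentence.

    embed: callable mapping a sentence to a vector (an assumption; in
    practice this would wrap a multilingual sentence encoder).
    Returns a list of (en, th, score) tuples with score >= min_score.
    """
    if not th_sentences:
        return []
    th_vecs = [embed(s) for s in th_sentences]
    pairs = []
    for en in en_sentences:
        en_vec = embed(en)
        scores = [cosine(en_vec, th_vec) for th_vec in th_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= min_score:
            pairs.append((en, th_sentences[best], scores[best]))
    return pairs


# Toy vectors standing in for real sentence embeddings (illustrative only).
toy_vectors = {
    "hello": [1.0, 0.0],
    "goodbye": [0.0, 1.0],
    "สวัสดี": [0.9, 0.1],
    "ลาก่อน": [0.1, 0.9],
}
toy_pairs = align(["hello", "goodbye"], ["ลาก่อน", "สวัสดี"], toy_vectors.__getitem__)
```

With the toy vectors, "hello" pairs with "สวัสดี" and "goodbye" with "ลาก่อน". Nearest-neighbour matching lets several English sentences map to the same Thai sentence; a stricter variant could require the match to hold in both directions.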

Authors

@attapol - Extraction and normalization of Thai texts from PDF
@pinedbean - Universal sentence encoder inference code
@cstorm125 - Sentence alignment with universal sentence encoder

Acknowledgement

@pnphannisa - Sourcing government documents as PDF files

About

Extract en-th parallel sentences from PDFs
