vistec-AI/pdf2parallel

pdf2parallel

Getting Started

  1. Extract sentences from the PDFs with Apache Tika. Thai sentences are split with pythainlp and English sentences with nltk.
python extract_sentences.py --en_dir en_data/ --th_dir th_data/
  2. Align the extracted sentences using the Universal Sentence Encoder.
python align_sentences_use.py --en_dir en_data/ --th_dir th_data/ --output_path assorted_government.csv
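
To give a rough idea of what the alignment step involves, the sketch below pairs each English sentence with its most similar Thai sentence by cosine similarity of sentence embeddings. It is not the code in align_sentences_use.py: the function names, the `threshold` parameter, and the two-dimensional toy vectors are all assumptions made for illustration. In practice the vectors would come from a multilingual sentence encoder rather than being typed by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(en_sents, th_sents, en_vecs, th_vecs, threshold=0.5):
    """Pair each English sentence with its best-scoring Thai sentence.

    Hypothetical helper: `threshold` is an assumed cutoff below which a
    candidate pair is discarded as probably not a translation.
    """
    pairs = []
    for en, en_vec in zip(en_sents, en_vecs):
        scores = [cosine(en_vec, th_vec) for th_vec in th_vecs]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((en, th_sents[best], scores[best]))
    return pairs


# Toy embeddings standing in for real encoder output (assumption).
en_sents = ["Hello.", "Thank you."]
th_sents = ["ขอบคุณ", "สวัสดี"]
en_vecs = [[1.0, 0.0], [0.0, 1.0]]
th_vecs = [[0.1, 0.9], [0.9, 0.1]]

for en, th, score in align(en_sents, th_sents, en_vecs, th_vecs):
    print(f"{en}\t{th}\t{score:.3f}")
```

A greedy best-match like this can assign the same Thai sentence to several English sentences, so a real pipeline may need an extra step to resolve such conflicts.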

Authors

  • @attapol - Extraction and normalization of Thai texts from PDF
  • @pinedbean - Universal sentence encoder inference code
  • @cstorm125 - Sentence alignment with universal sentence encoder

Acknowledgement

  • @pnphannisa - Sourcing government documents as PDF files

About

Extract en-th parallel sentences from PDFs
