TextProc

TextProc is an automated text processing tool that applies natural language processing (NLP) steps, efficiently and flexibly, to input documents stored in a relational database.

Architecture and documentation

This is a multi-module Maven project for Java 11 and later (tested up to Java 14), in which each Maven module is also one JPMS (Java Platform Module System) module. Using the module system introduced in Java 9 makes the dependencies between modules explicit, which helps keep coupling under control and can improve code quality.
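As a rough illustration of what a JPMS module boundary declares, the sketch below builds a module descriptor programmatically with the standard `java.lang.module.ModuleDescriptor` API. It mirrors what a small `module-info.java` would state. The module and package names are hypothetical and are not taken from this project.

```java
import java.lang.module.ModuleDescriptor;
import java.util.Set;

// Hypothetical sketch: module and package names are made up for illustration.
public class ModuleBoundarySketch {

    /** Builds a descriptor comparable to a small module-info.java declaration. */
    static ModuleDescriptor describeStepModule() {
        return ModuleDescriptor.newModule("example.textproc.step")
            // Dependencies on other modules must be declared explicitly.
            .requires("java.logging")
            // Only exported packages are accessible from other modules.
            .exports("example.textproc.step.api")
            // Packages that are not exported stay encapsulated in the module.
            .packages(Set.of("example.textproc.step.internal"))
            .build();
    }

    public static void main(String[] args) {
        ModuleDescriptor descriptor = describeStepModule();
        System.out.println("module:   " + descriptor.name());
        System.out.println("exports:  " + descriptor.exports());
        System.out.println("packages: " + descriptor.packages());
    }
}
```

Because only the exported package is visible to other modules, a change to the internal package cannot break its dependents, which is the kind of coupling control the module system offers.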

The application executes a preprocessing process definition, such as the one in sample_process.xml. A process definition is composed of steps that are applied in sequence to input text documents stored in a relational database, which by default is a SQLite database named corpus.db. The database can be changed in TextProcPersistence/pom.xml to any database that Hibernate supports, and an example of a database schema compatible with the application is in TextProc.sql. On Unix-like systems, the application can be launched with the launch.sh script.
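The idea of applying steps in sequence can be sketched in a few lines of Java. This is a minimal illustration, not TextProc's actual API: the class and method names are hypothetical, each step is modeled as a plain function from text to text, and an in-memory list stands in for the documents that the application reads from the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch: these names are illustrative and are not TextProc's API.
public class SequentialStepsSketch {

    /** Applies every step, in order, to each document and returns the results. */
    static List<String> runProcess(List<UnaryOperator<String>> steps, List<String> documents) {
        List<String> results = new ArrayList<>();
        for (String document : documents) {
            String current = document;
            for (UnaryOperator<String> step : steps) {
                current = step.apply(current); // the output of one step feeds the next
            }
            results.add(current);
        }
        return results;
    }

    public static void main(String[] args) {
        // Two toy steps: lowercase the text, then collapse runs of whitespace.
        List<UnaryOperator<String>> steps = List.of(
            text -> text.toLowerCase(Locale.ROOT),
            text -> text.replaceAll("\\s+", " ").trim()
        );
        // Stand-in for documents loaded from the database.
        List<String> documents = List.of("  Hello   World ", "TextProc  RUNS steps");
        System.out.println(runProcess(steps, documents));
        // prints [hello world, textproc runs steps]
    }
}
```

The order of the steps matters because each one receives the output of the previous one, which is why a process definition lists its steps as a sequence.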

For more information about the application and the steps bundled with it, please check out the documentation website. The Javadoc for the project classes is available there, under the "Project Reports" section of the sidebar.

