matheusmota/ddex

DDEx Project

Extracting data independently of file formats

DDEx (Document Data Extractor) is a framework that lets applications transparently open documents and extract their content, regardless of file format.

We are working to provide support for:

  • OLE2 file formats [.doc, .xls, .ppt]
  • OOXML file formats [.docx, .xlsx, .pptx]
  • ODF file formats [.odt, .ods, .odp]
  • CSV
  • PDF
  • Google Docs (minimal support)

Goals, Challenges, and Differentiators

DDEx is based on the Builder design pattern and can be easily extended to support other formats. It aims to decouple content extraction from content processing, handling the diversity of file formats and giving access to a document's content independently of its format.

DDEx mediates between multiple format-specific APIs (such as Apache POI and ODFDOM) by offering a common interface. It encapsulates the extraction and performs it independently of format, so applications can reuse a document's content in other contexts.
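To make the Builder idea concrete, here is a minimal Java sketch of how extraction can be decoupled from processing. It is not the DDEx API: `ContentBuilder`, `CsvDirector`, `TableBuilder`, and `CellCounter` are hypothetical names invented for this illustration, and the CSV parsing is deliberately naive (no quoting support). The point is only that a format-specific reader emits generic events, and the application decides what to build from them.

```java
import java.util.ArrayList;
import java.util.List;

// Demo entry point (hypothetical, for illustration only).
class BuilderSketchDemo {
    public static void main(String[] args) {
        String csv = "title,pages\nreport,12";

        TableBuilder table = new TableBuilder();
        new CsvDirector().extract(csv, table);
        System.out.println(table.rows()); // prints [[title, pages], [report, 12]]

        CellCounter counter = new CellCounter();
        new CsvDirector().extract(csv, counter);
        System.out.println(counter.count()); // prints 4
    }
}

// Hypothetical builder interface: the application implements it to receive
// content events without knowing which file format produced them.
interface ContentBuilder {
    void startRow();
    void cell(String value);
    void endRow();
}

// Hypothetical director for one format. A director for .xls or .ods would
// wrap a format-specific library and emit the same events.
final class CsvDirector {
    // Naive CSV reading: splits on newlines and commas, no quoting support.
    void extract(String csv, ContentBuilder builder) {
        for (String line : csv.split("\n")) {
            builder.startRow();
            for (String field : line.split(",", -1)) {
                builder.cell(field.trim());
            }
            builder.endRow();
        }
    }
}

// One possible builder: collects the rows into nested lists.
final class TableBuilder implements ContentBuilder {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> current = new ArrayList<>();

    @Override public void startRow() { current = new ArrayList<>(); }
    @Override public void cell(String value) { current.add(value); }
    @Override public void endRow() { rows.add(current); }

    List<List<String>> rows() { return rows; }
}

// Another builder over the same events: counts non-empty cells.
final class CellCounter implements ContentBuilder {
    private int count;

    @Override public void startRow() { }
    @Override public void cell(String value) { if (!value.isEmpty()) count++; }
    @Override public void endRow() { }

    int count() { return count; }
}
```

Under these assumptions, supporting another format means writing another director that emits the same events; existing builders such as `TableBuilder` would not need to change.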

Who is using DDEx?

DDEx originated in academia and ended up being used by other Ph.D. and M.Sc. students in their research. It is also used by other projects and is associated with academic publications.
