Releases: ispras/dedoc

v2.2.2

21 May 13:00

NastyBoget

Latest

Added image extraction to ArticleReader.
Added attachments and references to them in the HTML output representation (return_format="html").
Fixed the behavior of the need_content_analysis parameter.
Fixed CSVReader (the BOM character is now excluded from the output).
Added handling of files with a wrong or missing extension to DedocManager (the file type is detected from the file content).
Updated README.md.
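
Detecting a file type from its content usually means checking the first bytes of the file against known signatures. The following is a minimal sketch of that general idea, not dedoc's actual implementation: the `sniff_type` function and the `SIGNATURES` table are hypothetical names introduced only for illustration, and the table covers just a few well-known formats.

```python
from typing import Optional

# (magic bytes, label) pairs: a small illustrative subset of well-known
# file signatures. This table is an assumption of the sketch, not dedoc's.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),          # also the container of DOCX, XLSX, PPTX
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"{\\rtf", "rtf"),
]


def sniff_type(head: bytes) -> Optional[str]:
    """Return the label of the first matching signature, or None if unknown."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None


# The extension is ignored entirely; only the leading bytes are inspected.
print(sniff_type(b"%PDF-1.7\n"))
print(sniff_type(b"just some text"))
```

In practice only the first few bytes of a file need to be read, so a check like this stays cheap even for large documents.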

v2.2.1

03 May 13:23

NastyBoget

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Added the fintoc structure type for parsing financial prospectuses according to the FinTOC 2022 Shared task (FintocStructureExtractor).
Fixed small bugs in ArticleReader: colspan for tables, keywords, section numbering, etc.
Added references to nodes and fixed small bugs in the HTML output representation (return_format="html").
Removed other_fields from LineMetadata and DocumentMetadata.
Updated README.md.

v2.2

17 Apr 10:02

NastyBoget

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Improved PdfTabbyReader: bug fixes and faster partial PDF extraction (with the pages parameter).
Added benchmarks for evaluating the performance of PDF readers.
Added the ReferenceAnnotation class.
Fixed a bug in the can_read method of all readers.
Added the article structure type for parsing scientific articles using GROBID (ArticleReader, ArticleStructureExtractor).

v2.1.1

22 Mar 08:17

oksidgy

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Update README.md.
Update table and time benchmarks.
Re-label line-classifier datasets (law, diploma, paragraphs datasets).
Update tasker creators (for the labeling system).
Fix HTML table parsing.

v2.1

05 Mar 10:44

NastyBoget

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Removed custom loggers (a common logger is now used for all dedoc classes).
The document image is no longer changed if its orientation is already correct (the orientation correction function was changed).
Only PdfTabbyReader is now used to detect a textual layer in PDF files.
Refactored the code related to the labeling mode and moved it out of the library package into a separate directory.
Added BoldAnnotation for words in PdfImageReader.
Added more benchmarks: parsing of table images and postprocessing of Tesseract OCR.
Fixed several issues in the Dedoc web form.
Added a tutorial on how to add a new structure type to Dedoc.
Fixed parsing of EML and HTML files.
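
As a rough illustration of what sharing one common logger between classes can look like, the sketch below relies on the standard library guarantee that `logging.getLogger` returns the same object for the same name. The class names and the `example_app` logger name are hypothetical and are not taken from dedoc's code.

```python
import logging

# Hypothetical logger name, chosen only for this sketch.
LOGGER_NAME = "example_app"


class Reader:
    def __init__(self) -> None:
        # No per-class logger: every class asks for the same named logger.
        self.logger = logging.getLogger(LOGGER_NAME)


class Converter:
    def __init__(self) -> None:
        self.logger = logging.getLogger(LOGGER_NAME)


# Configuring the common logger once affects every class that uses it.
logging.getLogger(LOGGER_NAME).setLevel(logging.INFO)

reader, converter = Reader(), Converter()
print(reader.logger is converter.logger)
```

With this layout, handlers and levels are set in one place instead of once per class.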

v2.0

25 Dec 13:39

NastyBoget

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Fix table extraction from PDF when using an empty config (see issue).
Add more benchmarks for Tesseract.
Fix extension extraction for file names with several dots.
Change names of some methods and their parameters for all main classes (attachments extractors, converters, readers, metadata extractors, structure extractors, structure constructors).
See the Package reference section of the documentation for more details.
Add AttachAnnotation and TableAnnotation to PPTX (see discussion).
Fix bugs in DOCX handling (see issues 378, 379).
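
File names with several dots are a common source of extension bugs, because splitting on the first dot gives the wrong result. The sketch below shows one standard-library way to take only the last suffix; the `split_name` helper is a hypothetical name used for illustration and does not reproduce dedoc's own code.

```python
import os
from typing import Tuple


def split_name(file_name: str) -> Tuple[str, str]:
    """Split a file name into (base, extension), using only the last dot.

    The extension is lower-cased so that ".PDF" and ".pdf" compare equal.
    """
    base, extension = os.path.splitext(file_name)
    return base, extension.lower()


for name in ["report.v2.final.PDF", "archive.tar.gz", "README", ".bashrc"]:
    print(name, "->", split_name(name))
```

Note that `os.path.splitext` treats a leading dot as part of the base name, so a file such as `.bashrc` is reported as having no extension.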

v1.1.1

24 Nov 13:06

NastyBoget

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Use an older pydantic version to improve compatibility with other libraries.
Add support for the RTF format.
Fix a bug in handling file names with dots and spaces.
Fix a bug with non-integer text formatting values in DocxReader.
Add support for the on_gpu parameter in the config.
Add extraction of attached images for PdfTabbyReader.
Fix partial file reading for PdfTabbyReader.
Add a tutorial on how to create dedoc's basic data structures.
Fix the attachments_dir parameter for readers and attachments extractors.

v1.1.0

24 Oct 10:01

NastyBoget

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Add BBoxAnnotation to table cells for PdfTabbyReader.
Fix swagger, add API schema classes, remove the to_dict method from ParsedDocument.
Improve PDF parsing in PdfTxtlayerReader, add benchmarks.
Fix BBoxAnnotation extraction for tables in PdfImageReader when using the table_type=split_last_column parameter.
Change the base method of metadata extractors and rename it to extract_metadata.
Unify BBoxAnnotation extraction for all PDF readers: only word bounding boxes are returned.
Increase the timeout value for all converters.

v1.0

10 Oct 15:19

NastyBoget

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Remove the is_one_column_document_list parameter.
Add a tutorial on supporting a new document type to the documentation.
Improve the textual layer correctness classifier.
Improve the orientation and columns classifier.
Change the table output structure: CellWithMeta is used instead of a text string.
Add BBoxAnnotation to table cells for PdfTxtlayerReader and PdfImageReader.
Add ConfidenceAnnotation to table cells for PdfImageReader.
Remove the insert_table parameter.
Add information about table and page rotation to the table and document metadata, respectively.
Use the dedoc-utils library for document image preprocessing.
Change the web interface, fix online examples of document processing.
Add a comparison operator to LineWithMeta.

v0.11.2

06 Sep 15:25

dronperminov

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Remove plexus-utils-1.1.jar.
Update installation documentation.
Add documentation for Tesseract OCR installation.
Add documentation for annotations.
Add documentation for secure torch.
Fix examples.
