Textract.process returns empty bytes object for EPUBs from DBNL collection #455

Open
bitsgalore opened this issue Feb 1, 2023 · 0 comments


When I use Textract on EPUBs from the Dutch DBNL site, textract.process results in an empty bytes object, even though other extraction tools (including Ebooklib, which is used by Textract) are able to extract text from these files without problems.

Take as an example the file below:

https://www.dbnl.org/tekst/berk011veel01_01/ebook/berk011veel01_01.epub

Here's some minimal code for extraction:

#! /usr/bin/env python3

import textract

fileIn = "berk011veel01_01.epub"
content = textract.process(fileIn, encoding='utf-8').decode()

print(content)
print(len(content))

Result when running the script:


0

I.e. the content is an empty (zero-length) string. This happened with most of the DBNL books I tried. In some cases just a few words were extracted.
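To put a number on "most of the DBNL books", a small survey helper along these lines could record how many characters come back per file. This is only a sketch: `survey_extraction` and its `extract` parameter are hypothetical names introduced here, and the extractor is passed in as a callable so the helper makes no assumptions about Textract beyond the `textract.process(fileIn, encoding='utf-8')` call already shown above.

```python
def survey_extraction(paths, extract):
    """Return a list of (file name, extracted character count) tuples.

    `extract` is any callable that takes a file name and returns bytes,
    e.g. a thin wrapper around textract.process (hypothetical usage).
    """
    results = []
    for path in paths:
        # Decode the returned bytes and ignore surrounding whitespace,
        # so a result containing only newlines counts as empty.
        text = extract(str(path)).decode("utf-8")
        results.append((str(path), len(text.strip())))
    return results
```

It could be called as `survey_extraction(epub_files, lambda p: textract.process(p, encoding="utf-8"))`, where `epub_files` is whatever list of downloaded EPUB paths is at hand. Files with a count of zero, or only a handful of characters, are the ones affected.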

Since Textract uses Ebooklib for EPUB reading, I tried using Ebooklib directly in order to rule out an Ebooklib problem. Below is a minimal test script:

#! /usr/bin/env python3

from html.parser import HTMLParser
import ebooklib
from ebooklib import epub

class HTMLFilter(HTMLParser):
    # Source: https://stackoverflow.com/a/55825140/1209004
    text = ""
    def handle_data(self, data):
        self.text += data


fileIn = "berk011veel01_01.epub"

book = epub.read_epub(fileIn)

for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        content = item.get_body_content().decode()
        f = HTMLFilter()
        f.feed(content)
        print(f.text)

Running this script extracts all text without any problems. Text extraction with Tika-python also works as expected. The EPUB file also passes validation with EPUBCheck 4.2.6 without any errors or warnings.
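A cross-check that depends on neither Textract nor Ebooklib may help narrow things down further. An EPUB is a ZIP container holding (X)HTML content documents, so the standard library alone can dump the text per document and show which member names and extensions the file actually uses. The sketch below rests on assumptions: `extract_text_from_epub`, `TextCollector` and `CONTENT_EXTENSIONS` are names invented for this example, the extension list is a guess at what content documents are called, and the archive order is used rather than the reading order defined in the package file.

```python
import zipfile
from html.parser import HTMLParser

# Assumption: content documents carry one of these extensions.
CONTENT_EXTENSIONS = (".xhtml", ".html", ".htm")


class TextCollector(HTMLParser):
    """Collects all character data and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text_from_epub(path):
    """Return a dict mapping each content document name to its text."""
    texts = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            # Skip stylesheets, images, the package file, and so on.
            if not name.lower().endswith(CONTENT_EXTENSIONS):
                continue
            markup = archive.read(name).decode("utf-8", errors="replace")
            collector = TextCollector()
            collector.feed(markup)
            collector.close()  # flush any buffered trailing text
            texts[name] = "".join(collector.parts)
    return texts
```

Calling `extract_text_from_epub("berk011veel01_01.epub")` and printing each key with the length of its value gives a per-document overview. Comparing those member names with the ones in an EPUB that Textract handles correctly might point to where the two files differ.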

On a side note, Textract did work for me with some EPUBs I downloaded from Standard Ebooks, such as this one:

https://standardebooks.org/ebooks/robert-louis-stevenson/the-strange-case-of-dr-jekyll-and-mr-hyde/downloads/robert-louis-stevenson_the-strange-case-of-dr-jekyll-and-mr-hyde.epub

Desktop:

  • OS: Linux Mint 20.1, MATE edition
  • Textract version: 1.6.5
  • Python version: 3.8.10
  • Virtual environment: yes