Textract.process returns empty bytes object for EPUBs from DBNL collection #455

bitsgalore · 2023-02-01T16:41:20Z

When I use Textract on EPUBs from the Dutch DBNL site, textract.process returns an empty bytes object, even though other extraction tools (including Ebooklib, which Textract itself uses) extract text from these files without problems.

Take as an example the file below:

https://www.dbnl.org/tekst/berk011veel01_01/ebook/berk011veel01_01.epub

Here's some minimal code for extraction:

#! /usr/bin/env python3

import textract

fileIn = "berk011veel01_01.epub"
content = textract.process(fileIn, encoding='utf-8').decode()

print(content)
print(len(content))

Running the script prints an empty line followed by 0, i.e. the content is an empty (zero-length) string. This happened with most of the DBNL books I tried. In some cases just a few words were extracted.

Since Textract uses Ebooklib for EPUB reading, I tried using Ebooklib directly in order to rule out an Ebooklib problem. Below is a minimal test script:

#! /usr/bin/env python3

from html.parser import HTMLParser
import ebooklib
from ebooklib import epub

class HTMLFilter(HTMLParser):
    # Source: https://stackoverflow.com/a/55825140/1209004
    text = ""
    def handle_data(self, data):
        self.text += data


fileIn = "berk011veel01_01.epub"

book = epub.read_epub(fileIn)

for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        content = item.get_body_content().decode()
        f = HTMLFilter()
        f.feed(content)
        print(f.text)

Running this script extracts all text without any problems. Text extraction with Tika-python also works as expected. The EPUB file also passes validation with EPUBCheck 4.2.6 without any errors or warnings.
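For anyone who wants to cross-check without installing either library, here is a rough stdlib-only sketch. It treats the EPUB as a plain ZIP archive and reports how many characters of text each content document holds. The names TextCollector and epub_text_lengths are my own, and selecting documents by file extension is an assumption: a proper reader would follow the OPF manifest instead. It also counts text in the head element (e.g. the title), so the numbers are only a rough indication.

```python
import zipfile
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the character data of an (X)HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.parts = []  # per-instance list, so parsers never share state

    def handle_data(self, data):
        self.parts.append(data)


def epub_text_lengths(path):
    """Map each (X)HTML member of the EPUB to the length of its stripped text.

    `path` may be a file name or a binary file-like object.
    Assumption: content documents are recognised by extension only.
    """
    lengths = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                markup = archive.read(name).decode("utf-8", errors="replace")
                collector = TextCollector()
                collector.feed(markup)
                collector.close()
                lengths[name] = len("".join(collector.parts).strip())
    return lengths
```

Calling epub_text_lengths("berk011veel01_01.epub") and comparing the per-document counts against len(textract.process(...)) should show whether the text is present in the archive but lost somewhere in Textract's EPUB handling.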

On a side note, Textract did work for me with some EPUBs I downloaded from Standard Ebooks, such as this one:

https://standardebooks.org/ebooks/robert-louis-stevenson/the-strange-case-of-dr-jekyll-and-mr-hyde/downloads/robert-louis-stevenson_the-strange-case-of-dr-jekyll-and-mr-hyde.epub

Desktop:

OS: Linux Mint 20.1, MATE edition
Textract version: 1.6.5
Python version: 3.8.10
Virtual environment: yes
