compression error -2 #54

echan00 · 2020-10-20T04:32:50Z

I'm running into the error "compression error -2". It would be great if anyone could provide some pointers.

Attached the PDF with the issue:
5_EN.pdf

Error message:

Processing Pages: 1/28...mupdf: compression error -2
Traceback (most recent call last):
  File "/Users/erikchan/Downloads/convert.py", line 10, in <module>
    parse(pdf_files[i], docx_files[i])
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/main.py", line 31, in parse
    cv.make_docx(indexes, multi_processing)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 118, in make_docx
    self._make_docx(page_indexes)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 192, in _make_docx
    self.initialize(page).parse().make_page(self.doc_docx)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 172, in initialize
    images, paths = self._paths_extractor.extract_paths(page)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/shape/Path.py", line 61, in extract_paths
    image = largest.to_image(page) if largest.contains_curve else None
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/shape/Path.py", line 140, in to_image
    return ImagesExtractor.clip_page(page, bbox, zoom)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/image/Image.py", line 60, in clip_page
    return cls.to_raw_dict(image, bbox)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/image/Image.py", line 50, in to_raw_dict
    'image': image.getPNGData()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fitz/fitz.py", line 5899, in getPNGData
    barray = self._getImageData(1)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fitz/fitz.py", line 5868, in _getImageData
    return _fitz.Pixmap__getImageData(self, format)
RuntimeError: compression error -2


dothinking · 2020-10-21T02:36:39Z

Thanks for providing this case.

Your PDF contains lots of vector graphics, i.e. paths such as lines, curves and their combinations. However, this library currently ignores clipping paths because of a technical issue when extracting them from the PDF. As a result, some paths extend beyond the page without being clipped, which leads to this compression error -2.
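As a rough illustration of the kind of guard this implies, here is a minimal sketch that intersects a path's bounding box with the page rectangle before anything is rendered. It is not pdf2docx's actual code: the name `clamp_to_page` and the `(x0, y0, x1, y1)` tuple layout are assumptions made for the example, and whether clamping alone avoids the error for this PDF is untested.

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical layout


def clamp_to_page(bbox: Rect, page: Rect) -> Optional[Rect]:
    """Intersect `bbox` with the page rectangle; return None if nothing is left."""
    x0 = max(bbox[0], page[0])
    y0 = max(bbox[1], page[1])
    x1 = min(bbox[2], page[2])
    y1 = min(bbox[3], page[3])
    if x1 <= x0 or y1 <= y0:
        return None  # the path lies entirely outside the page
    return (x0, y0, x1, y1)


page = (0.0, 0.0, 595.0, 842.0)  # roughly A4 in points, for illustration only
print(clamp_to_page((-50.0, 700.0, 300.0, 900.0), page))  # partly outside
print(clamp_to_page((600.0, 10.0, 700.0, 20.0), page))    # fully outside
```

A caller would skip the rasterization step whenever the result is None, instead of asking the renderer for an empty or out-of-page region.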

Besides, there are two more issues in converting this PDF:

- The path color is incorrect. I guess the root cause is that only device color spaces (Gray/RGB/CMYK) are currently considered, while this sample PDF may use special color spaces such as Indexed or DeviceN.
- Overlapping images are removed. python-docx is used to write the converted docx, but it doesn't support floating elements yet, so floating images are removed as a compromise.
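On the color-space point, the sketch below shows what handling only the device color spaces can look like: the number of components decides between Gray, RGB and CMYK, and anything else is rejected. The function names are hypothetical and the CMYK formula is the naive one without ICC profiles. Indexed or DeviceN colors would need a palette lookup or tint transform that is not shown here.

```python
def gray_to_rgb(g):
    """Map a DeviceGray value in [0, 1] to an (r, g, b) tuple in 0..255."""
    v = round(255 * g)
    return (v, v, v)


def cmyk_to_rgb(c, m, y, k):
    """Naive DeviceCMYK to RGB conversion (no color management)."""
    return (
        round(255 * (1 - c) * (1 - k)),
        round(255 * (1 - m) * (1 - k)),
        round(255 * (1 - y) * (1 - k)),
    )


def to_rgb(components):
    """Pick the device color space from the number of components."""
    n = len(components)
    if n == 1:
        return gray_to_rgb(components[0])
    if n == 3:
        return tuple(round(255 * v) for v in components)
    if n == 4:
        return cmyk_to_rgb(*components)
    # Indexed, DeviceN, etc. cannot be resolved from raw components alone
    raise ValueError('unsupported color space with %d components' % n)


print(to_rgb([1.0]))         # gray
print(to_rgb([1, 0, 0]))     # RGB
print(to_rgb([0, 1, 1, 0]))  # CMYK
```

Raising on an unknown component count makes an unsupported color space visible instead of silently drawing the wrong color.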

So, unfortunately, pdf2docx is not able to convert your PDF for now. At least the following work is needed:

- clip paths when extracting them from the PDF
- implement more color spaces
- introduce floating images

echan00 · 2020-10-21T16:59:25Z

Thanks @dothinking for the clear explanation. I'm surprised this library isn't more popular than it is. The current version is already very good and I know a lot of people can benefit from it.

Please let me know how I can help resolve any of the issues you listed, whether by fixing bugs, testing, or otherwise (I will need some guidance).

dothinking · 2020-10-24T12:55:09Z

Thanks a lot @echan00.

Some progress on this issue:

- Floating images are now supported.
- Clipping paths and color spaces: the good news is that the upstream library PyMuPDF has published a new feature for extracting paths. I'll look into it and hopefully it can resolve this issue.

After that, any tests or suggestions are appreciated.

Comment on 2020-12-31: the latest PyMuPDF 1.18.5 solves this issue partly, but not perfectly, especially for clipping paths.

dothinking · 2020-10-24T13:49:28Z

Since python-docx supports inline images, these are the steps to explore floating images:

- create two docx files, one with an inline image and the other with a floating image (in this case, using the behind-text mode)
- compare the source XML of the two files
- implement the floating image based on the observed structure and the existing code for inline images
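The comparison step can also be done programmatically. Below is a minimal sketch using only the standard library: a docx file is a zip archive, so the script reads word/document.xml and lists what sits directly under each <w:drawing> node. The helper names `drawing_children` and `fake_docx` are made up for this example, and `fake_docx` builds a stand-in archive rather than a valid Word file, purely so the sketch runs without Word.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
WP_NS = 'http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing'


def drawing_children(docx):
    """List the local tag names found directly under each <w:drawing> node.

    `docx` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    names = []
    for drawing in root.iter('{%s}drawing' % W_NS):
        for child in drawing:
            # ElementTree tags look like '{namespace}local'
            names.append(child.tag.rsplit('}', 1)[-1])
    return names


def fake_docx(kind):
    """Build a stand-in archive holding only a minimal word/document.xml.

    This is NOT a valid Word file; it only lets the sketch run on its own.
    """
    xml = (
        '<w:document xmlns:w="%s" xmlns:wp="%s"><w:body><w:p><w:r>'
        '<w:drawing><wp:%s/></w:drawing>'
        '</w:r></w:p></w:body></w:document>' % (W_NS, WP_NS, kind)
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', xml)
    buffer.seek(0)
    return buffer


print(drawing_children(fake_docx('inline')))
print(drawing_children(fake_docx('anchor')))
```

With two real files saved from Word, passing their paths to `drawing_children` in place of the stand-ins should show which container each image uses.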

XML structure results:

- an inline image is a <wp:inline> node under <w:drawing>
- a floating image is a <wp:anchor> node under <w:drawing>
- besides all the sub-nodes of an inline image, a floating image also contains <wp:positionH> and <wp:positionV> to define the fixed position

So, the idea is to create a <wp:anchor> node, then append these sub-nodes:

- all the nodes of an inline image
- <wp:positionH> and <wp:positionV>

dothinking · 2020-10-24T14:46:06Z

It seems that floating pictures with python-docx are a common request, so I'm documenting the implementation here for sharing.

# -*- coding: utf-8 -*-

'''
Implement floating image based on python-docx.

- Text wrapping style: BEHIND TEXT <wp:anchor behindDoc="1">
- Picture position: top-left corner of PAGE `<wp:positionH relativeFrom="page">`.

Create a docx sample (Layout | Positions | More Layout Options) and explore the 
source xml (Open as a zip | word | document.xml) to implement other text wrapping
styles and position modes per `CT_Anchor._anchor_xml()`.
'''

from docx.oxml import parse_xml, register_element_cls
from docx.oxml.ns import nsdecls
from docx.oxml.shape import CT_Picture
from docx.oxml.xmlchemy import BaseOxmlElement, OneAndOnlyOne

# refer to docx.oxml.shape.CT_Inline
class CT_Anchor(BaseOxmlElement):
    """
    ``<wp:anchor>`` element, container for a floating image.
    """
    extent = OneAndOnlyOne('wp:extent')
    docPr = OneAndOnlyOne('wp:docPr')
    graphic = OneAndOnlyOne('a:graphic')

    @classmethod
    def new(cls, cx, cy, shape_id, pic, pos_x, pos_y):
        """
        Return a new ``<wp:anchor>`` element populated with the values passed
        as parameters.
        """
        anchor = parse_xml(cls._anchor_xml(pos_x, pos_y))
        anchor.extent.cx = cx
        anchor.extent.cy = cy
        anchor.docPr.id = shape_id
        anchor.docPr.name = 'Picture %d' % shape_id
        anchor.graphic.graphicData.uri = (
            'http://schemas.openxmlformats.org/drawingml/2006/picture'
        )
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    @classmethod
    def new_pic_anchor(cls, shape_id, rId, filename, cx, cy, pos_x, pos_y):
        """
        Return a new `wp:anchor` element containing the `pic:pic` element
        specified by the argument values.
        """
        pic_id = 0  # Word doesn't seem to use this, but does not omit it
        pic = CT_Picture.new(pic_id, filename, rId, cx, cy)
        # `new()` already inserts `pic` under `a:graphicData`, so it is not inserted again here
        return cls.new(cx, cy, shape_id, pic, pos_x, pos_y)

    @classmethod
    def _anchor_xml(cls, pos_x, pos_y):
        return (
            '<wp:anchor distT="0" distB="0" distL="0" distR="0" simplePos="0" relativeHeight="0" \n'
            '           behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" \n'
            '           %s>\n'
            '  <wp:simplePos x="0" y="0"/>\n'
            '  <wp:positionH relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionH>\n'
            '  <wp:positionV relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionV>\n'                    
            '  <wp:extent cx="914400" cy="914400"/>\n'
            '  <wp:wrapNone/>\n'
            '  <wp:docPr id="666" name="unnamed"/>\n'
            '  <wp:cNvGraphicFramePr>\n'
            '    <a:graphicFrameLocks noChangeAspect="1"/>\n'
            '  </wp:cNvGraphicFramePr>\n'
            '  <a:graphic>\n'
            '    <a:graphicData uri="URI not set"/>\n'
            '  </a:graphic>\n'
            '</wp:anchor>' % ( nsdecls('wp', 'a', 'pic', 'r'), int(pos_x), int(pos_y) )
        )


# refer to docx.parts.story.BaseStoryPart.new_pic_inline
def new_pic_anchor(part, image_descriptor, width, height, pos_x, pos_y):
    """Return a newly-created `wp:anchor` element.

    The element contains the image specified by *image_descriptor* and is scaled
    based on the values of *width* and *height*.
    """
    rId, image = part.get_or_add_image(image_descriptor)
    cx, cy = image.scaled_dimensions(width, height)
    shape_id, filename = part.next_id, image.filename    
    return CT_Anchor.new_pic_anchor(shape_id, rId, filename, cx, cy, pos_x, pos_y)


# refer to docx.text.run.add_picture
def add_float_picture(p, image_path_or_stream, width=None, height=None, pos_x=0, pos_y=0):
    """Add a floating picture at a fixed position; `pos_x` and `pos_y` are offsets
    from the top-left corner of the page.
    """
    run = p.add_run()
    anchor = new_pic_anchor(run.part, image_path_or_stream, width, height, pos_x, pos_y)
    run._r.add_drawing(anchor)

# refer to the register_element_cls() calls in docx/oxml/__init__.py
register_element_cls('wp:anchor', CT_Anchor)


if __name__ == '__main__':

    from docx import Document
    from docx.shared import Inches, Pt

    document = Document()

    # add a floating image
    p = document.add_paragraph()
    add_float_picture(p, 'test.png', width=Inches(5.0), pos_x=Pt(20), pos_y=Pt(30))

    # add text
    p.add_run('Hello World'*50)


    document.save('output.docx')
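
A note on units, since the XML template above hard-codes cx="914400": sizes and offsets in <wp:extent> and <wp:posOffset> are expressed in EMU (English Metric Units), where 1 inch is 914400 EMU and 1 point is 12700 EMU. python-docx's Inches and Pt are int subclasses holding EMU values, which is why Pt(20) can be passed through int() into the template. The standalone sketch below just reproduces that arithmetic; the helper names are made up for illustration.

```python
EMU_PER_INCH = 914400  # 1 inch in English Metric Units
EMU_PER_PT = 12700     # 1 point = 1/72 inch


def inches_to_emu(inches):
    """Convert inches to whole EMU."""
    return int(round(inches * EMU_PER_INCH))


def pt_to_emu(points):
    """Convert points to whole EMU."""
    return int(round(points * EMU_PER_PT))


# The values used in the example above: width=Inches(5.0), pos_x=Pt(20), pos_y=Pt(30)
print(inches_to_emu(5.0))
print(pt_to_emu(20), pt_to_emu(30))
```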

echan00 · 2020-10-26T22:19:41Z

Nice @dothinking, it looks like you know exactly what the issues are. I have a variety of PDFs I can help test with once you're ready.

tonysepia · 2020-11-24T14:06:23Z

@dothinking thank you so much for your code sample! Solves my problem perfectly!!!!

dothinking · 2020-12-31T18:43:35Z

I haven't had time for this project for a long while. The new version v0.5.0 is now available and partly solves this issue:

- Floating images are now supported.
- Path extraction is now handled by the upstream library PyMuPDF, but it doesn't work so well for complicated shapes, e.g. clipping paths.

With this latest version, the sample PDF can be converted successfully, but a lot of work is still needed to improve the quality of the converted docx file, because of its complicated and ornate style.

echan00 · 2021-01-01T01:13:55Z

Wow this is a great upgrade. Thanks very much for your hard work @dothinking

dothinking · 2022-02-21T04:09:47Z

Closing for now since this issue itself was resolved.

A lot of effort is still needed to improve the conversion quality for complicated layouts like this test file.

dothinking self-assigned this Oct 21, 2020

dothinking added the bug and enhancement labels Oct 21, 2020

This was referenced Oct 24, 2020

feature: floating image python-openxml/python-docx#159

Open

How to insert images behind the text or in front of the text python-openxml/python-docx#400

Open

dothinking added a commit that referenced this issue Oct 25, 2020

new feature: floating image #54

691a0d4

dothinking added a commit that referenced this issue Oct 26, 2020

new feature: extract path with PyMuPDF api #54

23a9d69

dothinking closed this as completed Feb 21, 2022

SystemError7 mentioned this issue Oct 26, 2023

Align floating elements python-openxml/python-docx#1279

Closed
