Ignore page due to error: 'TableBlock' object has no attribute 'lines' #70

harrylyf · 2021-02-04T01:51:11Z

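Hello, I ran into this error while parsing: Ignore page due to error: 'TableBlock' object has no attribute 'lines'. I just tested and found that 0.5.0 can parse the file, although the result is not especially good (which is no big deal), whereas 0.5.1, the version I am using now, raises the error above. I wonder whether a code change between versions introduced this problem.

I have sent the test file to your email.

Thanks!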

dothinking · 2021-02-04T03:45:48Z

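Thanks for reporting the problem and providing the test file.

Version 0.5.1 improved support for nested tables. As a result, a region that previously could contain only text blocks may now contain both text blocks and table blocks, so the processing that assumed text blocks by default no longer works for tables. The fix is to add a check for whether the block is a text block:

Locate the Shape.py file: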

>>> import pdf2docx
>>> pdf2docx.shape.Shape.__file__

Go to the semantic_type() method (around line 89) and add one line, as marked by the comment below:

for block in blocks:
    if not block.is_text_block(): continue  # add this check

    # not intersect yet
    if block.bbox.y1 < self.bbox.y0: continue

    # check it when intersected
    rect_type = self._check_semantic_type(block)
    if rect_type != RectType.UNDEFINED: break

    # no intersection any more
    if block.bbox.y0 > self.bbox.y1: break
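
To illustrate why the guard matters, here is a minimal, self-contained sketch. The TextBlock and TableBlock classes and the two counting functions below are hypothetical stand-ins written for this example, not the actual pdf2docx classes; the only assumption is that text blocks carry a `lines` attribute and table blocks do not.

```python
from dataclasses import dataclass, field


@dataclass
class TextBlock:
    """Hypothetical stand-in for a text block: it owns a list of lines."""
    lines: list = field(default_factory=list)

    def is_text_block(self):
        return True


@dataclass
class TableBlock:
    """Hypothetical stand-in for a table block: it has cells but no `lines`."""
    cells: list = field(default_factory=list)

    def is_text_block(self):
        return False


def count_lines_unguarded(blocks):
    # Assumes every block is a text block and reads `lines` unconditionally.
    return sum(len(block.lines) for block in blocks)


def count_lines_guarded(blocks):
    total = 0
    for block in blocks:
        if not block.is_text_block():
            continue  # skip table blocks before touching `lines`
        total += len(block.lines)
    return total


# A region that mixes text blocks with a table block.
blocks = [TextBlock(["a", "b"]), TableBlock([["c"]]), TextBlock(["d"])]

try:
    count_lines_unguarded(blocks)
except AttributeError as err:
    unguarded_error = str(err)  # the same kind of message as in this issue

guarded_total = count_lines_guarded(blocks)  # only text blocks are counted
```

The unguarded version fails as soon as it reaches the table block, while the guarded version skips it and processes only the text blocks.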

Thanks again for pointing this out. The fixes for these two recent issues will both be included in the next release.

dothinking · 2021-02-04T04:01:17Z

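Also, I compared the conversion results of 0.5.0 and 0.5.1 and found no significant improvement. What do you need from PDF-to-Word conversion: extracting text, preserving formatting, or making the text easy to edit? Some PDF tools (PDF-xchange, Foxit, etc.) can also edit text directly, which is relatively more convenient. So I am somewhat unsure about the direction of the pdf2docx library. Thanks.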

harrylyf · 2021-02-05T02:23:56Z

OK, thanks. I just tried it, and it solves my problem perfectly.

harrylyf · 2021-02-05T02:28:05Z

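My current need is batch conversion of a specific type of document while preserving the formatting as much as possible. Using Acrobat or other tools alone would be rather cumbersome, so I want to solve it with code. I think you could take a look at Solid Framework; in my opinion pdf2docx is already one of the closer libraries. In the future, I think you could add a way for users to make their own adjustments (for example, by editing a JSON file), so that everyone can tune parameters and conditions for their own file types and needs.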
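
As a rough sketch of what such JSON-based tuning might look like: the setting names and the `load_settings` helper below are purely hypothetical and are not pdf2docx options. The idea is only that user overrides from a JSON object are merged onto a set of defaults, with unknown keys rejected.

```python
import json

# Hypothetical defaults; these names are illustrative, not pdf2docx options.
DEFAULT_SETTINGS = {
    "line_merge_tolerance": 2.0,
    "keep_images": True,
}


def load_settings(json_text, defaults=DEFAULT_SETTINGS):
    """Merge user overrides from a JSON object into a copy of the defaults."""
    overrides = json.loads(json_text)
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Reject misspelled or unsupported keys instead of silently ignoring them.
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    return {**defaults, **overrides}


# A user tunes one parameter for their document type and keeps the rest.
settings = load_settings('{"line_merge_tolerance": 3.5}')
```

Each document type could then ship its own small JSON file without touching the code.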

dothinking · 2021-02-05T03:47:00Z

Good suggestion, thanks.

dothinking self-assigned this Feb 4, 2021

dothinking added the bug label Feb 4, 2021

harrylyf closed this as completed Feb 5, 2021

dothinking added a commit that referenced this issue Feb 7, 2021

check shape semantic type with text block only #70

61ecb89
