Ignore page due to error: 'TableBlock' object has no attribute 'lines' #70

Closed
harrylyf opened this issue Feb 4, 2021 · 5 comments
harrylyf commented Feb 4, 2021

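Hello, I hit this error while parsing: Ignore page due to error: 'TableBlock' object has no attribute 'lines'. I just tried 0.5.0 and it parses the file, although the result is not especially good (that is a minor point). With 0.5.1, the version I am using now, I get the error above. I wonder whether a code change between the two versions introduced the problem.

I have sent the test file to your email address.

Thanks!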

@dothinking dothinking self-assigned this Feb 4, 2021
@dothinking dothinking added the bug Something isn't working label Feb 4, 2021
dothinking commented Feb 4, 2021

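Thanks for reporting the problem and providing the test file.

Version 0.5.1 improved support for nested tables. As a result, a region that previously could contain only text blocks may now contain both text blocks and table blocks, so the processing that assumed text blocks only no longer works for tables. The fix is to add a check for whether a block is a text block: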

  1. Locate the Shape.py file:
>>> import pdf2docx
>>> pdf2docx.shape.Shape.__file__
  2. Go to the semantic_type() method (around line 89) and add the one line marked by the comment below:
for block in blocks:
    if not block.is_text_block(): continue  # add this check

    # not intersect yet
    if block.bbox.y1 < self.bbox.y0: continue

    # check it when intersected
    rect_type = self._check_semantic_type(block)
    if rect_type != RectType.UNDEFINED: break

    # no intersection any more
    if block.bbox.y0 > self.bbox.y1: break
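
For readers who want to see the failure and the guard in isolation, here is a minimal sketch that does not use pdf2docx at all. The TextBlock and TableBlock classes and the two counting functions are hypothetical stand-ins written for this illustration; only the is_text_block() check mirrors the line added above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextBlock:
    """Hypothetical stand-in for a text block; it carries `lines`."""
    lines: List[str] = field(default_factory=list)

    def is_text_block(self) -> bool:
        return True


@dataclass
class TableBlock:
    """Hypothetical stand-in for a table block; it has no `lines`."""
    cells: List[str] = field(default_factory=list)

    def is_text_block(self) -> bool:
        return False


def count_lines_unguarded(blocks) -> int:
    # Assumes every block is a text block, so a table block breaks it.
    return sum(len(block.lines) for block in blocks)


def count_lines_guarded(blocks) -> int:
    total = 0
    for block in blocks:
        if not block.is_text_block():
            continue  # skip tables, mirroring the one-line fix
        total += len(block.lines)
    return total


blocks = [TextBlock(["a", "b"]), TableBlock(["x"]), TextBlock(["c"])]

unguarded_error = None
try:
    count_lines_unguarded(blocks)
except AttributeError as err:
    unguarded_error = str(err)

guarded_total = count_lines_guarded(blocks)
print(unguarded_error)
print(guarded_total)
```

The unguarded version fails as soon as it reaches the table block, while the guarded version skips it and counts only the text lines.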

Thanks again for pointing out the problem. The fixes for these two recent issues will both go into the next release.

dothinking (Collaborator) commented

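Also, I compared the conversion results of 0.5.0 and 0.5.1 and found no significant improvement. What do you need from PDF-to-Word conversion: extracting text, preserving formatting, or making the text easy to edit? Some PDF tools (PDF-xchange, Foxit, etc.) can also edit text directly, which is more convenient. So I am somewhat unsure about the direction of the pdf2docx library. Thanks.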

harrylyf commented Feb 5, 2021

OK, thanks. I just tried it and it solved my problem perfectly.

harrylyf commented Feb 5, 2021

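My current need is batch conversion of a specific type of document while preserving the formatting as much as possible. Doing this with Acrobat or other tools alone would be rather tedious, so I wanted to solve it with code. I suggest taking a look at Solid Framework; I think pdf2docx is already a fairly close library. In the future, I think it would be worth adding a way for users to make their own adjustments (for example, by editing a JSON file), so that everyone can tune parameters and conditions for different file types and different needs.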
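
As a rough illustration of the batch use case described here, the following sketch walks a folder and converts each PDF, recording failures instead of stopping. The names batch_convert and convert_with_pdf2docx are hypothetical helpers made up for this example. The Converter usage follows the pattern in the pdf2docx README, but treat the exact signature as an assumption to check against the installed version.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def convert_with_pdf2docx(pdf_path: str, docx_path: str) -> None:
    # Imported lazily so batch_convert can be tried without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        cv.convert(docx_path)
    finally:
        cv.close()


def batch_convert(
    src_dir,
    dst_dir,
    convert: Callable[[str, str], None] = convert_with_pdf2docx,
) -> Dict[str, Optional[str]]:
    """Convert every *.pdf in src_dir; return {pdf name: error message or None}."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    results: Dict[str, Optional[str]] = {}
    for pdf in sorted(src.glob("*.pdf")):
        target = dst / (pdf.stem + ".docx")
        try:
            convert(str(pdf), str(target))
            results[pdf.name] = None
        except Exception as err:
            # Keep going so one bad file does not stop the whole batch.
            results[pdf.name] = str(err)
    return results
```

Passing the conversion step in as a callable keeps the loop independent of any one library, so the same helper could be pointed at a different converter or a patched build.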

@harrylyf harrylyf closed this as completed Feb 5, 2021
dothinking (Collaborator) commented

Good suggestion, thanks.
