Line break within a word leads to 'non-searchability' #1131

Open
nordleuchte opened this issue Oct 23, 2019 · 3 comments

@nordleuchte

Hi everybody,

redirected here from Asciidoctor people, because this seems to be an issue with Prawn.

I would like to point out a problem that I noticed during the rendering into a PDF document.

In my AsciiDoc document I import some CSV tables which contain very long property names like the following:

webcontroller.outboundservice.cmis.repositoryId

The problem is that these properties are too long to be displayed within a cell of the table column - the line is therefore wrapped at the end.
In the rendered PDF the table cell looks like this:

webcontroller.outboundservice.cm
is.repositoryId

If I now search for this property in the generated PDF document, it cannot be found, because the line break "cuts" the text. The PDF viewer apparently treats each line as a separate piece of text.

My first thought was that this was a limitation of the PDF format itself. But if I use the same table in Microsoft Word and save it as an "accessible" PDF, I can find the property as a "whole word".

Are there any workarounds to fix this problem in Asciidoctor/Prawn?

Thanks in advance

@mojavelinux
Contributor

According to the PDF specification (14.9.4 Replacement Text), the recommended way to deal with this situation is to use the ActualText tag. This is also used to make shy hyphens invisible when searching.
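To make the idea concrete, here is a minimal sketch of what a marked-content sequence carrying ActualText could look like for the wrapped property name. It only builds the raw content-stream operators as a string; it is not Prawn's API, and the helper names `pdf_literal` and `actual_text_span` as well as the `leading` parameter are hypothetical. It assumes the output is placed inside an existing BT ... ET text object, and whether a given viewer honors ActualText when searching is not guaranteed.

```ruby
# Escape backslashes and parentheses so the text is a valid PDF literal string.
def pdf_literal(text)
  "(" + text.gsub(/([\\()])/) { "\\#{$1}" } + ")"
end

# Hypothetical helper: wrap the text-showing operators for all fragments of a
# broken word in one /Span marked-content sequence whose /ActualText entry
# holds the unbroken word. Assumes it is emitted inside a BT ... ET block.
def actual_text_span(actual, fragments, leading: 12)
  ops = ["/Span << /ActualText #{pdf_literal(actual)} >> BDC"]
  fragments.each_with_index do |fragment, i|
    # Move down one line before every fragment except the first.
    ops << "0 -#{leading} Td" if i > 0
    ops << "#{pdf_literal(fragment)} Tj"
  end
  ops << "EMC"
  ops.join("\n")
end

puts actual_text_span(
  "webcontroller.outboundservice.cmis.repositoryId",
  ["webcontroller.outboundservice.cm", "is.repositoryId"]
)
```

The rendered glyphs stay split across two lines, while a consumer that reads ActualText would see the property name as one string.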

@McFly83 it would probably help if you provided a simple PDF document that uses this feature so the source can be studied.

@nordleuchte
Author

I'm not familiar with the PDF specification or any technical details about that. However, I created a PDF using MS Word that contains a text that is automatically broken into multiple lines but is still searchable. Find it attached. Hope this helps.
MultiLineText.pdf

@mojavelinux
Contributor

> is still searchable

This seems highly dependent on the PDF viewer. The feature you're describing does not work in Evince or the default PDF viewer on Windows. It does work in Adobe Reader, but in a very quirky way. Instead of highlighting the search term, it highlights the whole span of text in which the word is found. That tells me that it's searching the accessible text, then highlighting the rendered text with which the accessible text is associated. It seems to have something to do with /StructElem, but I can't find where the text is located in that PDF.
