Skip to content
This repository has been archived by the owner on Jan 2, 2023. It is now read-only.

Parse link hrefs assuming they are percent encoded #5273

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

daveallie
Copy link

@daveallie daveallie commented Nov 15, 2022

Changes

  • Parse href in findLinks with QUrl::fromPercentEncoding
  • When parsing QUrl or converting between QUrl and QString, use QUrl::fromEncoded() and QUrl::toEncoded() instead of QUrl(string) and QUrl::toString()

Details

This allows for both internal and external links containing percent escaped characters to be parsed correctly. Previously, http://example.com/foo%26bar was parsed as http://example.com/foo%2526bar instead of http://example.com/foo%26bar (http://example.com/foo&bar). Same issue for other escaped characters.


Fixes #4406
Fixes #5257

@daveallie daveallie changed the title Parse link hrefs assume they are percent encoded Parse link hrefs assuming they are percent encoded Nov 16, 2022
@daveallie
Copy link
Author

After some more testing, we should be using QUrl::TolerantMode instead of QUrl::StrictMode or some special characters won't be encoded on parse

@daveallie daveallie force-pushed the fix/correctly-parse-percent-encoded-links branch from 6308033 to aec68f1 Compare November 18, 2022 02:42
@daveallie
Copy link
Author

daveallie commented Nov 21, 2022

It looks like this change means that percent encoded URL reserved characters are transformed to their actual character instead of being left as percent encoded

@daveallie
Copy link
Author

daveallie commented Nov 22, 2022

More digging into this has made me realise my first pass at this was pretty naive. There's a couple of things going on here:

  • The href from the HTML is being parsed here from a QString to a QUrl with implicit TolerantMode
  • if settings.resolveRelativeLinks is set (default true), then the QUrl is converted back to a QString here using QUrl::toString()
    • In the false case, it just uses the original QString value
  • That QString is then parsed back into QUrl here (again, under the implicit TolerantMode) and passed off to the painter
  • QUrl::toEncoded() is called on the QUrl in the printer here to convert the QUrl back to text one last time before it's dumped into the PDF file.

The problem is that pdfconverter.cc is passing the printer an effectively invalid QUrl, or at least one which results in an incorrect URL when toEncoded is called on it.

Here's a quick example:

  1. Original href in HTML: "http://example.com/asdf%3Fasdf"
  2. After being converted to a QUrl and back to a QString, it is: "http://example.com/asdf%2Basdf" (still correct)
  3. Just before being passed to the painter, it's converted again to a QUrl and passed over to Qt: QUrl( "http://example.com/asdf%3Fasdf" )
  4. When Qt calls toEncoded on it: "http://example.com/asdf%253Fasdf" (bad!)

The documentation from QUrl suggests favouring fromEncoded and toEncoded when converting back and forth to strings: https://github.com/wkhtmltopdf/qt/blob/cb88757e7c9383a0504689c8373c101a44214132/src/corelib/io/qurl.cpp#L118-L120

The problem is occuring because we're passing an "unparsed" QUrl to the printer which calling toEncoded, forcing a parse of the URL string and double encoding values.


The fix: By changing the calls where the QString and QUrl gets transformed to one another (using QUrl::fromEncoded() and QUrl::toEncoded() as recommended instead of QUrl(string) and QUrl::toString()) resolves the issue.

I have tested these changes with a mix of querystrings, fragments, reserved and unreserved characters in links. Additionally I have tested using the test file from #4406 and #5257, both of which passed.

I will rebase the changes onto this PR.

This allows for both internal and external links containing percent escaped characters to be parsed correctly.
Previously, http://example.com/foo%26bar was parsed as http://example.com/foo%2526bar instead of
http://example.com/foo%26bar (http://example.com/foo&bar). Same issue for other escaped characters.

According to the docs for the QUrl constructor, in it's default parsing using TolerantMode, it'll only
transform %20 back to space, and will treat all other % signs as raw symbols to be encoded.
@daveallie daveallie force-pushed the fix/correctly-parse-percent-encoded-links branch from aec68f1 to 7b6eaad Compare November 22, 2022 10:32
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Non-ascii links getting destroyed Internal links are not normalized
1 participant