Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option for passing remote url encoded? #250

Open
mattpr opened this issue Oct 7, 2020 · 7 comments
Open

Option for passing remote url encoded? #250

mattpr opened this issue Oct 7, 2020 · 7 comments

Comments

@mattpr
Copy link

mattpr commented Oct 7, 2020

Hey, I know the docs say it isn't supported (https://github.com/willnorris/imageproxy#remote-url) and advise against using query strings in upstream image source urls.

  • We use imageproxy in order to fetch/resize/cache client images on the fly.
  • We can't generally tell our clients to change the URL scheme they use for loading images and often on e-commerce sites images are retrieved via query string as opposed to fixed/static-hosting type paths.
  • We have a CDN sitting in front of an nginx origin server which hosts imageproxy. On the CDN we turn off caching of unique query strings in order to avoid caching the same static file over and over due to random query string parameters.
    • the cdn essentially ignores these query parameters on the remote url since the remote url is passed unencoded and so the CDN serves the same image for many different query strings (incorrect in this case).

The simple/clean thing would be to URL encode the upstream/source image url before appending it to our imageproxy url. This isn't supported today unfortunately. Perhaps there is a good reason?

This could be done either with a config switch ("always decode 1x") or by having imageproxy just check the url parameter and decode up to 2 times or until substring http:// or https:// is found at index 0 of the url param. The advantage of the config switch is not changing current behaviour for existing users and potentially letting users avoid the decoding performance penalty if they don't need it. The non-config/implicit behaviour is interesting because it shouldn't decode unless it doesn't find http:// or https:// at the beginning of the remote url param...in which case I guess imageproxy would error anyway.

Not sure if there are compelling reasons why passing the upstream url encoded isn't supported (since this is documented as the way it works), but hopefully you will consider the change.

Meanwhile I'm trying to just do the url decoding on the nginx side before passing to imageproxy...however this doesn't appear trivial because decoded nginx vars (like $uri$is_args$args are "normalized" which isn't just url decoding...also things like condensing two slashes to one (so https://... becomes https:/... in the $uri var, even when passed to nginx encoded in the first place...and of course imageproxy doesn't like that either (invalid request URL: malformed URL)

@willnorris
Copy link
Owner

Yeah, I'm totally supportive of adding that. I suspect the performance hit would be negligible, and we might be able to come up with some clever shortcuts to keep the common case faster (e.g. check if raw string before decoding starts with "http", in which case try to use it as-is, etc)

@mattpr
Copy link
Author

mattpr commented Oct 9, 2020

keep the common case faster

Well if the param isn't encoded, we shouldn't decode it as that may decode encoded params within the url and break things.
So the only real performance penalty is the check to see if the url param starts with http:// or https://, not sure the performance hit for regex or substring matching in go.

Pseudocode...

if (param doesn't start with http)
    // definitely not a url, don't bother trying to decode.  error.
while (param doesn't start with 'http://' or 'https://') {
    // if we have tried more than x times to decode, error.
    param = urlDecodeOnce(param);
}
return param

My opinion is that for these special-case url decoding algorithms where we expect a parameter to contain a URL, it is best to check for http:// or https:// at the beginning of the string before attempting to decode. The :// will be encoded if the parameter is url encoded and so it gives us an easy indicator of whether this is a url and whether it needs decoding.

It is easy to support urls that have been double/triple/etc encoded as well by iterating the decode (up to a limit) and checking to see if we are done before each iteration (we don't want to over-decode because we may decode encoded parameters in the url that will then break the URL).

Checking for just http (or http: or http% or https: or https%) at the beginning of the string can also give a shortcut (to error) if the param doesn't look like a URL...so we don't bother trying to decode.

My go isn't great but I can look into creating a pull request for this at some point if that is interesting for you.

@mattpr
Copy link
Author

mattpr commented Nov 17, 2020

Howdy. I started looking at what would be involved to make this change.

I was wondering if you could clarify which direction you want to go with DefaultBaseUrl as allowing relative URLs as well as allowing options to be optional seems to add a fair bit of complexity to parsing the URI.

DefaultBaseUrl is documented, so I think it needs to stay because people might be using it.
https://github.com/willnorris/imageproxy#default-base-url

However the relative URL case doesn't appear to be covered in data_test.go.

func TestNewRequest(t *testing.T) {

The comment on NewRequest says url must be absolute

// simply contain /{remote_url}. The remote URL must be an absolute "http" or

...but the code currently implements the relative url case
if baseURL != nil {

The only reason this is a problem is because of the following:

  • options are optional so valid urls can look like:
    • http://example.com/imageproxy/OPTIONS/http...
    • http://example.com/imageproxy//http...
    • http://example.com/imageproxy/http...
  • url can be relative (ie starting with a /)
    • http://example.com/imageproxy/OPTIONS//remote/path
    • http://example.com/imageproxy///remote/path
    • http://example.com/imageproxy//remote/path
  • imageproxy parameters are slash separated instead of using some other delimiter that would make the user's intent more obvious (example of what url patterns would look like with semicolon separated params)
    • http://example.com/imageproxy/OPTIONS;http...
    • http://example.com/imageproxy/;http...
    • http://example.com/imageproxy/http...
    • http://example.com/imageproxy/OPTIONS;/remote/path
    • http://example.com/imageproxy/;/remote/path
    • http://example.com/imageproxy//remote/path
      • this one still might be a problem with slash collapse but could add a special case since this is the only time we would have sequential slashes

If options weren't optional (maybe x as no-op option to avoid the sequential slash problem?) then could just parse to the next slash and then depending on whether we get another slash (relative) or http (absolute) or something else (error) we know how to handle it.

For instance the comment on NewRequest (data.go) says both of the following are valid.

// 	http://localhost//http://example.com/image.jpg
// 	http://localhost/http://example.com/image.jpg

Both of those omit the options however the first one could in some cases be intended as a relative url...ie I might have set a DefaultBaseUrl := "http://someotherproxy.com/nexturl" and want you to request an image from: http://someotherproxy.com/nexturl/http://example.com/image.jpg. Perhaps that ambiguity is resolved by always assuming the UPSTREAM url is relative if DefaultBaseUrl is set (ie you can't do both absolute and relative UPSTREAM urls with the same imageproxy daemon config).

I could definitely imagine a case where you want to run all your upstream urls through an additional imageproxy that does some other type of transform (e.g. background replacement) so you might pass absolute URLs to imageproxy but also set DefaultBaseUrl to the additional proxy base url. Not saying this needs to be supported by imageproxy (easy enough to build a 3 URL compound on the client-side)...just saying someone might try it.

This

return nil, URLError{"must provide absolute remote URL", r.URL}

is confusing as well because the user doesn't need to provide an absolute remote url...baseURL.ResolveReference(req.URL) just needs to be absolute.

Lots of ways to fix that ambiguity (use different parameter delimiters, require options, etc)...but all of the changes would likely break someone's in-use pattern. I would argue that if imageproxy wants to accept relative url paths (e.g. /path/to/image in combination with DefaultBaseUrl) that maybe slash-delimited parameters is not the best approach due to the double-slash problem (imageproxy will often be running behind nginx or something else which may collapse double slashes).

So it would be great to get your thoughts on this. On one hand don't want to break anyone that upgrades imageproxy's currently functioning url patterns. On the other hand it looks like adding in support for encoded url parameters may add even more edge cases that don't behave as people expect. It would be nice not to have to rely on trying to decode the remains of the path as a URL to detect whether we are looking at the url parameter yet or not.

First way to solve this

Just stick with the existing logic which depends on whether or not parseUrl returns an error and/or absolute url.
Enhance parseUrl (more compute complexity) which is run twice in the case when options are provided.
Accept the performance hit because maybe it doesn't matter relative to fetching images off disk

Second way to reason this out would be the following:

  • strip leading slash from path
  • if next character is slash
    • implies no options provided
    • strip the slash from path
    • two sequential slashes case - potential problem with some webservers
    • example: http://localhost/imageproxy//UPSTREAM_URL
  • else
    • options present. parse options up to next slash
    • remove options following slash from path
    • example: http://localhost/imageproxy/OPTIONS/UPSTREAM_URL
  • if remaining path (lowercase) starts with "http"
    • implies absolute url.
    • run through url decode logic until we get http:// or https:// or give up or error
  • else if remaining path starts with /
    • implies relative path
    • combine with DefaultBaseUrl or error
    • this is also a two sequential slash case - potential problem with some webservers
      • http://localhost/imageproxy///relative/path/to/file
      • http://localhost/imageproxy/OPTIONS//relative/path/to/file
  • else if remaining path starts with something else
    • could guess this is a multi-slash problem and try using as relative path (prepend slash before appending to DefaultBaseUrl

The impact of this approach is that it requires the slashes to be there: http://localhost/imageproxy/UPSTREAM_URL would no longer work and this is currently a supported case. Also any webservers in front of imageproxy that are collapsing multiple sequential slashes before passing the request to imageproxy will cause problems.

A third way

would be to check if DefaultBaseUrl is set and if so, use the current codepath and do not support encoded upstream URLs. If DefaultBaseUrl not set, implies we shouldn't get any relative URLs, only absolute. So we can use presence of http as an indicator that we found URL (since http is not a valid start to valid options).

  • If DefaultBaseUrl not set
    • strip all leading slashes from path
    • if remaining path starts with (lowercase) http
      • implies no options
    • if options
      • parse options up to next / and remove options and / from path
    • if !options || remaining path starts with lowercase http
      • do url decode logic (see previous case) and try to get image
    • else error, no url provided.
  • Else
    • use current code path
    • relative remote urls supported
    • encoded upstream urls not supported

Fourth way...break current apis

I'm not particularly happy with either of the above options as it seems like they are maybe going in the wrong direction in terms of making behaviour more predictable and robust.

Personally I would prefer less flexibility in how the user's build imageproxy requests (e.g. excluding options) in favor of more predictable behaviour.

The "try to url decode" and if it errors indicates options are present...doesn't really give me a good feeling as the common case (options) has an extra url decode running. Granted this probably takes no time relative to fetching an image from disk (let alone remote server)...still.

One way would be to no longer allow options to be optional (e.g. always have pattern imageproxy/x/UPSTREAM and then UPSTREAM either starts with / or http).

Another would be to change the delimiter to make it more obvious where parameter boundaries start/end. Doubleclick/Google uses ; as a parameter delimiter for many URLs especially in cases where urls are passed in urls and not guaranteed to be encoded (problems with dropped query string parameters, etc). Yahoo used , for years.

Any of these options are likely to break someone who upgrades imageproxy without updating their generated URL formats.


Sorry for writing so much. What do you think about how to approach this?

mattpr pushed a commit to mattpr/imageproxy that referenced this issue Jan 21, 2021
Ability to pass remote url encoded (or not).
Option 4 implementation.

Requires `{options}` to be present in request URL.
`x` or `0x0` can be used as no-op options.
The ability to exclude options was not documented in README anyway,
but nevertheless a breaking change.
@mattpr
Copy link
Author

mattpr commented Jan 21, 2021

Hi @willnorris, happy new year!

As I didn't get any feedback from you in terms of preference on how to approach this I went ahead and did an implementation for option 4.

Happy for feedback if you want to go another way. I'm not a golang guy so I happy for feedback on that front as well if I missed something.

I didn't bother to open a pull request because I'm not sure what your feeling is on the various approaches outlined above.

mattpr pushed a commit to mattpr/imageproxy that referenced this issue Jan 21, 2021
Ability to pass remote url encoded (or not).
Option 4 implementation.

Requires `{options}` to be present in request URL.
`x` or `0x0` can be used as no-op options.
The ability to exclude options was not documented in README anyway,
but nevertheless a breaking change.
mattpr pushed a commit to mattpr/imageproxy that referenced this issue Jan 26, 2021
Ability to pass remote url encoded (or not).
Option 4 implementation.

Requires `{options}` to be present in request URL.
`x` or `0x0` can be used as no-op options.
The ability to exclude options was not documented in README anyway,
but nevertheless a breaking change.
@mattpr
Copy link
Author

mattpr commented Jan 26, 2021

Another benefit to supporting url encoding....

The README says:

In order to optimize caching, it is recommended that URLs not contain query strings.

This is due to how webservers/browsers deal with caching when query string is present.

With encoded remote URL support:

  • if the remote image url contains a query string to retrieve the correct image
  • if imageproxy has image caching enabled (or webserver in front of imageproxy or CDN in front of webserver in front of imageproxy)
  • if the remote image url (with query string) is passed to imageproxy encoded...

...then the end-user browser and webserver can optimize caching between them more easily because there will be no query string present on the imageproxy requests (encoded as part of remote url).

So if we can support URL Encoding, then we can update the README to just advise users to url encode remote urls that contain query strings to optimize caching (rather than avoiding them altogether).

@mattpr
Copy link
Author

mattpr commented Feb 13, 2021

Hey @willnorris let me know if I can do anything to make this easier for you time-wise. I know I wrote a lot of ideas for discussion earlier. As I have implemented a proposed change, you can ignore the earlier thoughts.

The TLDR of the change I made is:

  • Options parameter is no longer optional. You can pass //http... or /x/http... or /0x0/http... but not /http....
  • URL encoding of remote urls is now supported (but still optional).
  • Added more test cases.

Before this change the code would attempt to URL-parse the first param and if that failed then assumed there were options present. That meant in the common case (options and url present) both the options and url would be attempted to be URL parsed in the logic. Now we expect parseable options to always be the first parameter.

The fact options were optional was not documented in the README, only in the code...so hoping this isn't a big breaking change for most users. The fix is just to add an empty/no-op option (// or /x/ or /0x0/).

If you are open to this approach, let me know and I will open a pull request. Thanks!

@mattpr
Copy link
Author

mattpr commented Jun 6, 2021

I went ahead an opened a PR for this (maybe that is better for you than looking at my fork)...and squeaky wheel gets the grease as they say. ;)

mattpr added a commit to mattpr/imageproxy that referenced this issue Mar 8, 2022
- URL encoding of remote urls is now supported (but still optional).
- Options parameter is no longer optional.
  - You can pass empty/no-op options (`//http...` or `/x/http...` or `/0x0/http...`),
  - but not `/http...`.
- Added more test cases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants