URL not properly formed with diacritics/accents not encoded #3991

Open
wrobelda opened this issue Feb 26, 2024 · 12 comments
Labels
Bug-Report Confirmed bug report

Comments

@wrobelda
Contributor

wrobelda commented Feb 26, 2024

Describe the bug
If any of the feed query parameters contains diacritic (accented) characters, they are left as is and not encoded, which causes some clients to fail to add the RSS feed with a "URL invalid" error. See: https://stackoverflow.com/questions/33211310/convert-french-accent-to-specific-encoding-in-php

To Reproduce
Steps to reproduce the behavior:

  1. For any chosen bridge which takes a text character parameter, use a string containing diacritic/accent characters (or copy and paste this: ąśćż)
  2. Generate feed URL
  3. Copy that feed URL to an RSS client of choice (it fails here with TT-RSS at least)
  4. See error

Expected behavior
Diacritics/accents should be properly encoded

@wrobelda wrobelda added the Bug-Report Confirmed bug report label Feb 26, 2024
@dvikan
Contributor

dvikan commented Feb 26, 2024

i think this is a bug in TT-RSS or your browser. im not sure

@wrobelda
Contributor Author

wrobelda commented Feb 26, 2024

i think this is a bug in TT-RSS or your browser. im not sure

Sorry, what bug? Per RFC 3986, section 2.3, a URL should consist only of a specific character set, which does not contain non-ASCII characters, period. Any other characters need to be percent-encoded as UTF-8, per RFC 3987.

Meanwhile, RSS-Bridge allows those characters to make it into the URL. Sure, modern browsers and some clients will automatically percent-encode such a query before they send it to a web server, but RSS-Bridge should not rely on that and should instead generate a feed URL that conforms to the standards.

See also:
https://www.w3schools.com/tags/ref_urlencode.ASP
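
For illustration only (this is not RSS-Bridge code, and Python is used here just as a neutral language), a minimal sketch of what percent-encoding the sample string from the reproduction steps looks like. Each non-ASCII character becomes its UTF-8 bytes, and each byte is written as `%XX`:

```python
from urllib.parse import quote, unquote

# The sample string from the reproduction steps.
raw = "ąśćż"

# safe="" means no reserved character is exempt from encoding.
encoded = quote(raw, safe="")
print(encoded)  # %C4%85%C5%9B%C4%87%C5%BC

# The encoded form is pure ASCII and decodes back to the original.
assert encoded.isascii()
assert unquote(encoded) == raw
```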

@dvikan
Contributor

dvikan commented Feb 27, 2024

are you copy pasting url from browser?

are you talking about those urls that are produced inside <link> tags?

i was unable to reproduce. using firefox.

@Bockiii
Contributor

Bockiii commented Feb 27, 2024

Reproducible.

Search and result on Reddit with a German umlaut "ä". Similar problem to the accented French characters.
image

RSS bridge config
image

Result on dvikans public instance
image

@dvikan
Contributor

dvikan commented Feb 27, 2024

okay i get it.

it happens when parameters are used in http requests without url encoding them.

in the particular case of RedditBridge a solution is to manually url encode the user input parts.

related: #3091
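
A rough sketch of what "manually url encode the user input parts" could look like, written in Python for illustration rather than taken from RedditBridge. The function name `build_search_url`, the `example.com` endpoint, and the `q`/`sort` parameters are all hypothetical:

```python
from urllib.parse import quote, urlencode


def build_search_url(base: str, keyword: str) -> str:
    """Build a request URL, percent-encoding only the user-supplied values."""
    # quote_via=quote encodes spaces as %20 instead of '+'.
    query = urlencode({"q": keyword, "sort": "new"}, quote_via=quote)
    return f"{base}?{query}"


# Hypothetical endpoint, used only for illustration.
url = build_search_url("https://example.com/search", "bär")
print(url)  # https://example.com/search?q=b%C3%A4r&sort=new
```

The point is that the encoding happens at the moment the user input is placed into the URL, so the rest of the URL is never touched.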

@wrobelda
Contributor Author

wrobelda commented Mar 12, 2024

in the particular case of RedditBridge a solution is to manually url encode the user input parts.

That means each and every bridge has to handle encoding itself for each of its arbitrary string inputs, whereas RSS-Bridge could do this once by encoding the complete feed URL it generates. There's no harm here: any characters needing encoding will get encoded, and everything else will be left as is.
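
A minimal sketch of that "encode the complete URL once" idea, in Python for illustration only. The helper name `escape_unsafe` and the sample query string are assumptions, not RSS-Bridge code. Reserved characters and `%` are treated as safe, so existing escapes survive and running the function twice changes nothing:

```python
from urllib.parse import quote

# RFC 3986 reserved characters plus '%', so existing %XX escapes are kept.
SAFE = ":/?#[]@!$&'()*+,;=%"


def escape_unsafe(url: str) -> str:
    """Percent-encode spaces and non-ASCII characters, leave the rest as is."""
    return quote(url, safe=SAFE)


# Illustrative feed URL; the host and parameter names are made up.
feed = "https://example.com/?bridge=RedditBridge&q=ąśćż"
fixed = escape_unsafe(feed)
print(fixed)

# Idempotent: an already-encoded URL passes through unchanged.
assert escape_unsafe(fixed) == fixed
```

Two caveats with this sketch: a literal `%` that is not part of an escape is left alone, and a non-ASCII host name would need IDNA (punycode) handling rather than percent-encoding.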

Not to mention that bridge code should not be concerned with things like that: its scope is to prepare articles and their content in UTF-8, not to handle the intricacies of HTTP communication between the RSS-Bridge server and an RSS client.

No offense, but I think you downplay the seriousness of this issue for any non-ASCII languages.

@dvikan
Contributor

dvikan commented Mar 12, 2024

I like your arguments. Okay let me dwell a bit on it.

@dvikan
Contributor

dvikan commented Mar 12, 2024

@Bockiii fixed for reddit in #4010

@dvikan
Contributor

dvikan commented Mar 31, 2024

i have discovered that curl will automatically escape the url if needed.

but if curl detects an already escaped url, it will NOT escape.

so this particular error only happens if a url is already partially escaped (as was the case with RedditBridge),
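
To make the partially-escaped case concrete, here is a sketch in Python of the kind of "skip if already escaped" check described above. It is an assumption about the general shape of such a heuristic, not curl's or RSS-Bridge's actual logic, and the names `looks_escaped` and `naive_escape` are hypothetical:

```python
import re
from urllib.parse import quote

# Matches a single percent-escape such as %20 or %C3.
ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def looks_escaped(url: str) -> bool:
    """Heuristic: treat any %XX sequence as a sign the URL is already escaped."""
    return ESCAPE.search(url) is not None


def naive_escape(url: str) -> str:
    """Escape the URL unless it already seems to contain escapes."""
    if looks_escaped(url):
        return url  # left untouched, even if raw characters remain
    return quote(url, safe=":/?#[]@!$&'()*+,;=")


plain = naive_escape("https://example.com/search?q=ä")
mixed = naive_escape("https://example.com/search?q=a%20b&t=ä")
print(plain)  # fully escaped
print(mixed)  # still contains the raw 'ä'
```

With a check like this, a URL that mixes `%20` with a raw `ä` is skipped entirely, which matches the failure described for the partially escaped case.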

@wrobelda
Contributor Author

wrobelda commented Apr 1, 2024

i have discovered that curl will automatically escape the url if needed.

but if curl detects an already escaped url, it will NOT escape.

so this particular error only happens if a url is already partially escaped (as was the case with RedditBridge),

The problem here is not with how RSS-Bridge handles that internally (i.e. the curl lib that it uses), but with what happens on the outside, i.e. with the RSS clients that you pass the unescaped RSS-Bridge URL to.

In other words, we need to make sure that the URL generated and returned to the user (opened in a new browser tab) by RSS-Bridge after you click "Generate Feed" is properly formed.

@dvikan
Contributor

dvikan commented Apr 1, 2024

im confused now. can you give an example?

@dvikan
Contributor

dvikan commented Apr 4, 2024

for the record i did some changes related to this issue in 545dc96 but they are a refactor (should not be externally visible changes)
